Command-Line Interface (CLI)

If you would like to work with a continuous or discrete outcome in your *.pheno file with genoml continuous or genoml discrete.

The current iteration of documentation supports the supervised learning workflow; the list of workflows incoming is scary! After that, you can add one of the subcommands, including choice of munge, train, tune, or test, these are steps within the supervised workflow.

If you are interested in harmonizing a test dataset for external validation of a previous model, use genoml harmonize.

The general structure of GenoML commands is genoml then outcome type (discrete or continuous for now), then workflow (supervised for now), then subcommand that is usually a step within a larger workflow.

Detailed command line options relating to published GenoML workflows and their subcommands below.

genoml - the root of all commands#

usage: genoml <command> [<args>]
continuous for processing continuous datatypes (ex: age at onset)
discrete for processing discrete datatypes (ex: case vs. control status)
harmonize for harmonizing incoming test datasets to use the same SNPs and reference alleles prior to munging, training, and testing
genoml
positional arguments:
command Subcommand to run

genoml harmonize#

usage: genoml harmonize [-h] --test_geno_prefix TEST_GENO_PREFIX
[--test_prefix TEST_PREFIX]
[--ref_model_prefix REF_MODEL_PREFIX]
--training_snps_alleles TRAINING_SNPS_ALLELES [-v]
optional arguments:
-h, --help show this help message and exit
--test_geno_prefix TEST_GENO_PREFIX
Prefix of the genotypes for the test dataset in PLINK
binary format.
--test_prefix TEST_PREFIX
Prefix for the dataset you would like to test against
your reference model. Remember, the model will not
function well if it does not include the same
features, and these features should be on the same
numeric scale, you can leave off the '.dataForML.h5'
suffix.
--ref_model_prefix REF_MODEL_PREFIX
Prefix of your reference model file, you can leave off
the '.joblib' suffix.
--training_snps_alleles TRAINING_SNPS_ALLELES
File to the SNPs and alleles file generated in the
training phase that we will use to compare.
-v, --verbose Verbose output.

genoml discrete supervised munge#

usage: genoml discrete supervised munge [-h] [--prefix PREFIX]
[--impute {median,mean}] [--geno GENO]
--pheno PHENO [--addit ADDIT]
[--feature_selection FEATURE_SELECTION]
[--gwas GWAS] [--p P] [--vif VIF]
[--iter ITER]
[--ref_cols_harmonize REF_COLS_HARMONIZE]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--impute {median,mean}
Imputation: (mean, median). Governs secondary
imputation and data transformation [default: median].
--geno GENO Genotype: (string file path). Path to PLINK format
genotype file, everything before the *.bed/bim/fam
[default: None].
--pheno PHENO Phenotype: (string file path). Path to CSV phenotype
file [default: lost].
--addit ADDIT Additional: (string file path). Path to CSV format
feature file [default: None].
--feature_selection FEATURE_SELECTION
Run a quick tree-based feature selection routine prior
to anything else, here you input the integer number of
estimators needed, we suggest >= 50. The default of 0
will skip this functionality. This will also output a
reduced dataset for analyses in addition to feature
ranks. [default: 0]
--gwas GWAS GWAS summary stats: (string file path). Path to CSV
format external GWAS summary statistics containing at
least the columns SNP and P in the header [default:
nope].
--p P P threshold for GWAS: (some value between 0-1). P
value to filter your SNP data on [default: 0.001].
--vif VIF Variance Inflation Factor (VIF): (integer). This is
the VIF threshold for pruning non-genotype features.
We recommend a value of 5-10. The default of 0 means
no VIF filtering will be done. [default: 0].
--iter ITER Iterator: (integer). How many iterations of VIF
pruning of features do you want to run. To save time
VIF is run in randomly assorted chunks of 1000
features per iteration. The default of 1 means only
one pass through the data. [default: 1].
--ref_cols_harmonize REF_COLS_HARMONIZE
Are you now munging a test dataset following the
harmonize step? Here you input the path to the to the
*_refColsHarmonize_toKeep.txt file generated at that
step.
-v, --verbose Verbose output.

genoml continuous supervised munge#

usage: genoml continuous supervised munge [-h] [--prefix PREFIX]
[--impute {median,mean}]
[--geno GENO] --pheno PHENO
[--addit ADDIT]
[--feature_selection FEATURE_SELECTION]
[--gwas GWAS] [--p P] [--vif VIF]
[--iter ITER]
[--ref_cols_harmonize REF_COLS_HARMONIZE]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--impute {median,mean}
Imputation: (mean, median). Governs secondary
imputation and data transformation [default: median].
--geno GENO Genotype: (string file path). Path to PLINK format
genotype file, everything before the *.bed/bim/fam
[default: None].
--pheno PHENO Phenotype: (string file path). Path to CSV phenotype
file [default: lost].
--addit ADDIT Additional: (string file path). Path to CSV format
feature file [default: None].
--feature_selection FEATURE_SELECTION
Run a quick tree-based feature selection routine prior
to anything else, here you input the integer number of
estimators needed, we suggest >= 50. The default of 0
will skip this functionality. This will also output a
reduced dataset for analyses in addition to feature
ranks. [default: 0]
--gwas GWAS GWAS summary stats: (string file path). Path to CSV
format external GWAS summary statistics containing at
least the columns SNP and P in the header [default:
nope].
--p P P threshold for GWAS: (some value between 0-1). P
value to filter your SNP data on [default: 0.001].
--vif VIF Variance Inflation Factor (VIF): (integer). This is
the VIF threshold for pruning non-genotype features.
We recommend a value of 5-10. The default of 0 means
no VIF filtering will be done. [default: 0].
--iter ITER Iterator: (integer). How many iterations of VIF
pruning of features do you want to run. To save time
VIF is run in randomly assorted chunks of 1000
features per iteration. The default of 1 means only
one pass through the data. [default: 1].
--ref_cols_harmonize REF_COLS_HARMONIZE
Are you now munging a test dataset following the
harmonize step? Here you input the path to the to the
*_refColsHarmonize_toKeep.txt file generated at that
step.
-v, --verbose Verbose output.

genoml discrete supervised train#

usage: genoml discrete supervised train [-h] [--prefix PREFIX]
[--metric_max {AUC,Balanced_Accuracy,Specificity,Sensitivity}]
[--prob_hist PROB_HIST] [--auc AUC]
[--matching_columns MATCHING_COLUMNS]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--metric_max {AUC,Balanced_Accuracy,Specificity,Sensitivity}
How do you want to determine which algorithm performed
the best? [default: AUC].
--prob_hist PROB_HIST
--auc AUC
--matching_columns MATCHING_COLUMNS
This is the list of intersecting columns between
reference and testing datasets with the suffix
*_finalHarmonizedCols_toKeep.txt
-v, --verbose Verbose output.

genoml continuous supervised train#

usage: genoml continuous supervised train [-h] [--prefix PREFIX]
[--export_predictions EXPORT_PREDICTIONS]
[--matching_columns MATCHING_COLUMNS]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--export_predictions EXPORT_PREDICTIONS
--matching_columns MATCHING_COLUMNS
This is the list of intersecting columns between
reference and testing datasets with the suffix
*_finalHarmonizedCols_toKeep.txt
-v, --verbose Verbose output.

genoml discrete supervised tune#

usage: genoml discrete supervised tune [-h] [--prefix PREFIX]
[--metric_tune {AUC,Balanced_Accuracy}]
[--max_tune MAX_TUNE] [--n_cv N_CV]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--metric_tune {AUC,Balanced_Accuracy}
Using what metric of the best algorithm do you want to
tune on? [default: AUC].
--max_tune MAX_TUNE Max number of tuning iterations: (integer likely
greater than 10). This governs the length of tuning
process, run speed and the maximum number of possible
combinations of tuning parameters [default: 50].
--n_cv N_CV Number of cross validations: (integer likely greater
than 3). Here we set the number of cross-validation
runs for the algorithms [default: 5].
-v, --verbose Verbose output.

genoml continuous supervised tune#

usage: genoml continuous supervised tune [-h] [--prefix PREFIX]
[--max_tune MAX_TUNE] [--n_cv N_CV]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--max_tune MAX_TUNE Max number of tuning iterations: (integer likely
greater than 10). This governs the length of tuning
process, run speed and the maximum number of possible
combinations of tuning parameters [default: 50].
--n_cv N_CV Number of cross validations: (integer likely greater
than 3). Here we set the number of cross-validation
runs for the algorithms [default: 5].
-v, --verbose Verbose output.

genoml discrete supervised test#

usage: genoml discrete supervised test [-h] [--prefix PREFIX]
[--test_prefix TEST_PREFIX]
[--ref_model_prefix REF_MODEL_PREFIX]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--test_prefix TEST_PREFIX
Prefix for the dataset you would like to test against
your reference model. Remember, the model will not
function well if it does not include the same
features, and these features should be on the same
numeric scale, you can leave off the '.dataForML.h5'
suffix.
--ref_model_prefix REF_MODEL_PREFIX
Prefix of your reference model file, you can leave off
the '.joblib' suffix.
-v, --verbose Verbose output.

genoml continuous supervised test#

usage: genoml continuous supervised test [-h] [--prefix PREFIX]
[--test_prefix TEST_PREFIX]
[--ref_model_prefix REF_MODEL_PREFIX]
[-v]
optional arguments:
-h, --help show this help message and exit
--prefix PREFIX Prefix for your output build.
--test_prefix TEST_PREFIX
Prefix for the dataset you would like to test against
your reference model. Remember, the model will not
function well if it does not include the same
features, and these features should be on the same
numeric scale, you can leave off the '.dataForML.h5'
suffix.
--ref_model_prefix REF_MODEL_PREFIX
Prefix of your reference model file, you can leave off
the '.joblib' suffix.
-v, --verbose Verbose output.