# Outcome prediction

> **Note:** This section is under development.

One of the most common use cases of GenoML is outcome prediction. Outcome prediction is paramount to personalized medicine, which promises individualized disease prediction and treatment. The convergence of machine learning (ML) and available multi-modal data is key moving forward.

Here is an example of developing a model that predicts a binary outcome, showcasing how simple it is to get started with GenoML!

The binary outcome for this example is whether a person is a Parkinson's disease case or a healthy control.

The workflow we will follow has these steps:

- **Installing GenoML**
- **Munging:** where we clean, normalize, and standardize the input data
- **Training:** where we split the data 70:30 (training:testing), compete a dozen algorithms, and nominate the best model
- **Tuning:** where we use the entire dataset to tune hyperparameters and cross-validate to improve accuracy

## 0. Installing GenoML

Install the newest version of GenoML that's available on pip.
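Assuming a working Python environment, GenoML can be installed from PyPI, where it is published under the package name `genoml2`:

```shell
# Install the latest GenoML release from PyPI
pip install genoml2
```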

## 1. Munging

Munging is where we clean, normalize, and standardize the input data. Munging with GenoML will, at minimum, do the following:

- Prune your genotypes using PLINK v1.9 (if the `--geno` flag is used)
- Impute per column using median or mean (can be changed with the `--impute` flag)
- Z-scale features and remove columns with a standard deviation of 0 (no variation means that feature won't contribute)

The required arguments for GenoML munging are `--prefix` and `--pheno`.
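As a sketch, a munging run with the GenoML v2 CLI might look like the following. The input paths here are placeholders; point `--geno` at your own PLINK binary files and `--pheno` at your own phenotype file:

```shell
# Munge genotype + phenotype data for a discrete (case/control) outcome.
# The --geno and --pheno paths below are placeholders for your own data.
genoml discrete supervised munge \
  --prefix outputs/test_discrete_geno \
  --geno examples/discrete/training \
  --pheno examples/discrete/training_pheno.csv
```

The `--prefix` value determines the `outputs/test_discrete_geno.*` naming of the generated files.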

The following files are made:

`outputs/test_discrete_geno.dataForML.h5`

`outputs/test_discrete_geno.list_features.txt`

`outputs/test_discrete_geno.variants_and_alleles.tab`

## 2. Training

Training with GenoML competes several different algorithms and outputs the best one based on a specific metric, which can be changed with the `--metric_max` flag *(default is AUC; other options include Balanced_Accuracy, Sensitivity, and Specificity)*.
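Training only needs the prefix from the munging step. A sketch of the invocation, assuming the GenoML v2 CLI:

```shell
# Compete the candidate algorithms on the 70:30 split and keep the winner.
# --metric_max is optional; AUC is the default.
genoml discrete supervised train \
  --prefix outputs/test_discrete_geno \
  --metric_max AUC
```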

The following files are made:

`outputs/test_discrete_geno.best_algorithm.txt`

`outputs/test_discrete_geno.trainedModel.joblib`

`outputs/test_discrete_geno.trainedModel_trainingSample_Predictions.csv`

`outputs/test_discrete_geno.trainedModel_withheldSample_Predictions.csv`

`outputs/test_discrete_geno.trainedModel_withheldSample_ROC.png`

`outputs/test_discrete_geno.trainedModel_withheldSample_probabilities.png`

`outputs/test_discrete_geno.training_withheldSamples_performanceMetrics.csv`

## 3. Tuning

Tuning uses the entire dataset to tune hyperparameters and cross-validate. You can change the number of iterations the tuning process goes through with the `--max_tune` flag *(default is 50)*, and the number of cross-validations with the `--n_cv` flag *(default is 5)*.
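A tuning invocation following the steps above might look like this (the flag values shown are the defaults, so both flags could be omitted):

```shell
# Tune hyperparameters of the winning model with cross-validation.
genoml discrete supervised tune \
  --prefix outputs/test_discrete_geno \
  --max_tune 50 \
  --n_cv 5
```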

The following files are made:

`outputs/test_discrete_geno.tunedModel_CV_Summary.csv`

`outputs/test_discrete_geno.tunedModel_allSample_Predictions.csv`

`outputs/test_discrete_geno.tunedModel_allSample_probabilities.png`

`outputs/test_discrete_geno.tunedModel_top10Iterations_Summary.csv`

## Done!

You've now cleaned your data, competed the trained output against 12 algorithms, determined the best model, and tuned that model! A quick explanation of some of the files:

- `*training_withheldSamples_performanceMetrics.csv` has the performance metrics for each of the 12 models (e.g. AUC, balanced accuracy)
- `*.trainedModel_withheldSample_ROC.png` is a ROC curve figure for your best trained model
- `*.trainedModel_withheldSample_probabilities.png` is a visual representation of how well the model predicted the classes
- `*.trainedModel_withheldSample_Predictions.csv` has the predictions of how your model classified individuals (were they predicted to be a case or a control?)
- `*.tunedModel_CV_Summary.csv` summarizes the gains made when tuning was run and how it improved the model

### So... where is my model?

The models are saved as a `*.joblib` file and can be used for transfer learning purposes.
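The persisted model is a fitted scikit-learn estimator saved with joblib, so reloading it gives back the usual `predict`/`predict_proba` API. A minimal sketch of that round-trip, using a toy classifier in place of the real GenoML output file:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a GenoML-trained model: a tiny binary classifier
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

# GenoML persists its best model the same way, via joblib
joblib.dump(clf, "toy_model.joblib")

# Reloading returns a fitted scikit-learn estimator with the usual API;
# with a real model, new samples must carry the feature columns listed
# in the *.list_features.txt file produced during munging.
model = joblib.load("toy_model.joblib")
print(model.predict(X))               # class labels
print(model.predict_proba(X)[:, 1])   # class-1 probabilities
```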