To properly test how your model performs on a dataset it has never seen before, follow these steps:
1. Munge your first dataset and train a model on it. This will be your "reference" model.
2. Use the outputs of step 1's munge for your reference model to harmonize your incoming validation dataset.
3. Run the harmonization step on your validation dataset.
4. Run munging on your newly harmonized dataset.
5. Retrain your reference model using only the columns that match your unseen data. A model cannot be tested on data whose features are not identical to the ones it was trained on.
6. Test your newly retrained reference model on the unseen data.
If you are using an external validation dataset, Steps 1-4 are performed as part of the harmonization process. Steps 5-6 are described next.
Once you have munged and trained your reference model and have harmonized and munged your unseen test data, retrain the reference model so that it includes only the features the two datasets share. A trained model expects exactly the features it was trained on, so it cannot be tested on data with a different feature set.
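The idea of "matching features" can be sketched in a few lines of Python. This is only an illustration of the concept, not the tool's own implementation: the function name `matching_features` and the column names are hypothetical.

```python
def matching_features(reference_cols, test_cols):
    """Return the reference columns that also appear in the test data.

    The order of the reference columns is preserved so that the retrained
    model and the test data see the features in the same order.
    """
    test_set = set(test_cols)
    return [col for col in reference_cols if col in test_set]


# Hypothetical feature names, for illustration only.
reference_cols = ["AGE", "SEX", "snp_a", "snp_b", "snp_c"]
test_cols = ["SEX", "AGE", "snp_a", "snp_c", "snp_d"]

shared = matching_features(reference_cols, test_cols)

# Features the reference model was trained on but the test data lacks.
# These are the ones that force a retrain before testing.
dropped_from_reference = [c for c in reference_cols if c not in shared]

print("shared:", shared)
print("dropped from reference:", dropped_from_reference)
```

Here `snp_b` exists only in the reference data and `snp_d` only in the test data, so neither can be part of the model that is finally tested.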
To retrain your model appropriately, after munging your test data with the `--ref_cols_harmonize` flag, a final columns list will be generated at `*.finalHarmonizedCols_toKeep.txt`. This includes all the features that match between your unseen test data and your reference model. Use the `--matching_columns` flag when retraining your reference model to use the appropriate features.
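To make the role of the columns list concrete, here is a minimal Python sketch of reading such a list and restricting a table to it. It assumes the file holds one column name per line, which is a guess about the format rather than something stated here. The helper names `read_columns_to_keep` and `subset_rows`, the example file name, and the data are all hypothetical. The retraining itself is done by the tool through the `--matching_columns` flag, and this sketch only shows the effect of the column filter.

```python
import os
import tempfile


def read_columns_to_keep(path):
    """Read column names from a text file, assuming one name per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def subset_rows(rows, columns):
    """Keep only the requested columns in every row (rows are dicts)."""
    for row in rows:
        missing = [c for c in columns if c not in row]
        if missing:
            raise KeyError(f"columns not present in data: {missing}")
    return [{c: row[c] for c in columns} for row in rows]


# Write a small stand-in for the generated columns list (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.finalHarmonizedCols_toKeep.txt")
    with open(path, "w") as handle:
        handle.write("AGE\nsnp_a\n")
    keep = read_columns_to_keep(path)

# Hypothetical reference data with more features than the test data shares.
reference_rows = [
    {"AGE": 61, "SEX": 1, "snp_a": 0, "snp_b": 2},
    {"AGE": 54, "SEX": 0, "snp_a": 1, "snp_b": 0},
]

trimmed = subset_rows(reference_rows, keep)
print(trimmed)
```

Raising an error on a missing column, rather than silently skipping it, surfaces a mismatch between the list and the data before any retraining happens.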
When retraining of the reference model is complete, you are ready to test!
A step-by-step guide on how to achieve this is listed below:
When munging the test dataset on the reference model columns with the `--ref_cols_harmonize` flag, be sure not to include the `--feature_selection` flag, as you have already specified the columns to keep moving forward.