This repository contains the Python 3.5.3 framework for the practical projects offered during the Machine Learning course at ETH Zurich. It serves two main purposes:
- Convenient execution of machine learning models conforming with the scikit-learn pattern.
- Structured & reproducible experiments by integration of sumatra and miniconda.
The project description and result submission are hosted by Kaggle.
Many brilliant implementations will be created during the projects, so wouldn't it be great to learn from them?
But have you ever tried to read the code of somebody else? If you just shuddered, you know what we are talking about.
We want to take this pain away (or most of it). This framework aims to enable every student to write their code in the same way.
So when you look at someone else's work, you know what structure to expect, and you can start navigating it right away.
For this purpose, we provide a common file structure and an interface to the scikit-learn framework. It offers standardized base classes to derive your solutions from.
But to understand a great result, we need more than the code that produced it. Which data was used as input? How was it processed? What parameters were used?
For this reason, we have included sumatra in the framework. It allows you to track, organize and search your experiments.
Ok, now we understand the code and the experiment setup. So let's run their code!
ImportError: No module named fancymodule
Sounds familiar? Don't worry: miniconda is a central part of the framework and gives your code an isolated, working environment to run in.
The project framework has been tested mainly in Linux (Ubuntu) environments. If you are using Linux already, you can skip forward to Get Started for Linux.
The framework should also work on OS X, but it has not been tested extensively. OS X users may choose to skip forward to Get started for Linux and OS X.
If you are using Windows, you need to install VirtualBox and create a 64-bit Ubuntu virtual machine (VM).
Make sure you allocate sufficient RAM (>= 8GB) and disk space (>= 64GB) for the VM.
If you cannot choose 64-bit Ubuntu in VirtualBox, you might have to enable virtualization in your BIOS.
Once your VM is running, open a terminal and install git:
sudo apt-get install git
After that, please continue with Getting Started for Linux.
First, you need to install miniconda on your system. If you already have Anaconda installed, you can skip this step.
Having installed miniconda, clone the repository and run the setup script:
git clone https://gitlab.vis.ethz.ch/vwegmayr/ml-project.git
cd ml-project
python setup.py
A simple way to download the data is with the kaggle-cli tool. Make sure the environment is activated:
source activate ml_project
If you encounter problems with site-packages try:
export PYTHONNOUSERSITE=True; source activate ml_project
Then download the data:
cd data/
kg download -c ml-project-1 -u username -p password
Replace username with your Kaggle username and password with your Kaggle password.
Make sure the environment is activated:
source activate ml_project
If you encounter problems with site-packages try:
export PYTHONNOUSERSITE=True; source activate ml_project
Make sure you have downloaded the data to the data folder, either by using the kaggle-cli tool or from the kaggle homepage.
To run an example experiment, simply type
smt run --config .config.yaml -X data/X_train.npy -a fit_transform
>> =========== Config ===========
>> {'class': <class 'ml_project.models.transformers.RandomSelection'>,
>> 'params': {'n_components': 1000, 'random_state': 37}}
>> ==============================
>> Record label for this run: '20170810-131658'
>> Data keys are [20170810-131658/RandomSelection.pkl(9b028327c83a153c0824ca8701f3b78a5106071c [2017-08-10 13:17:04]),
>> 20170810-131658/X_new.npy(b8c093d7c8e13399b6fe4145f14b4dbc0f241503 [2017-08-10 13:17:04])]
The default experiment will reduce the dimensionality of the training data by selecting 1000 dimensions at random.
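As a rough illustration of what "selecting dimensions at random" means, the following NumPy sketch keeps a random subset of feature columns. It is not the framework's RandomSelection implementation: the function name select_random_columns and the toy array are made up for this example, and only the parameter names n_components and random_state are borrowed from the config output above.

```python
import numpy as np


def select_random_columns(X, n_components, random_state=None):
    """Keep n_components randomly chosen columns of X (hypothetical helper)."""
    rng = np.random.RandomState(random_state)
    # Draw distinct column indices so that no feature is selected twice.
    columns = rng.choice(X.shape[1], size=n_components, replace=False)
    return X[:, columns]


# Toy data: 5 samples with 20 features each.
X = np.arange(100, dtype=float).reshape(5, 20)
X_new = select_random_columns(X, n_components=4, random_state=37)
print(X_new.shape)  # (5, 4)
```

Fixing random_state makes the selection repeatable, which is what lets a fitted model apply the same columns to the test data later.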
Results can be found in timestamped directories data/YYYYMMDD-hhmmss. For example, the results of the experiment shown above are in data/20170810-131658.
It produced two outputs, first the fitted model RandomSelection.pkl and second the transformed training data X_new.npy.
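If you want to locate the outputs of a run from a script, the record label is all you need. The sketch below assumes that labels always follow the YYYYMMDD-hhmmss pattern described above; the helper names parse_record_label and result_dir are made up for this example and are not part of the framework.

```python
from datetime import datetime
from pathlib import PurePosixPath


def parse_record_label(label):
    """Convert a label such as '20170810-131658' into a datetime."""
    return datetime.strptime(label, "%Y%m%d-%H%M%S")


def result_dir(label, data_root="data"):
    """Return the directory that holds the outputs of the given run."""
    parse_record_label(label)  # raises ValueError for malformed labels
    return PurePosixPath(data_root) / label


label = "20170810-131658"
print(parse_record_label(label))        # 2017-08-10 13:16:58
print(result_dir(label) / "X_new.npy")  # data/20170810-131658/X_new.npy
```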
To view the experiment record, type smtweb.
This command will open a new window in your web browser, where you can explore the information stored about the example experiment.
You can choose from different examples in the example config file.
Let us consider the above command in more detail:
smt run --config .config.yaml -X data/X_train.npy -a fit_transform
- smt invokes sumatra, which is an experiment tracking tool.
- run tells sumatra to execute the experiment runner.
- --config points to the parameter file for this experiment.
- -X points to the input data.
- -a tells the runner which action to perform.
In addition to --config experiments, you can run --model experiments.
These two flags cover fit/fit_transform and transform/predict, respectively.
The reason is that for fit/fit_transform you typically require parameters, whereas for transform/predict you start from a fitted model.
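The pairing of actions and flags can be written down as a small lookup table. This is only a sketch of the rule as stated here; the names REQUIRED_SOURCE and check_run and the error message are made up, and the framework's runner may enforce the rule differently.

```python
# Which source flag each action needs, following the rule described above.
REQUIRED_SOURCE = {
    "fit": "config",
    "fit_transform": "config",
    "transform": "model",
    "predict": "model",
}


def check_run(action, config=None, model=None):
    """Return the source an action uses; raise ValueError if it is missing."""
    source = REQUIRED_SOURCE[action]
    given = {"config": config, "model": model}
    if given[source] is None:
        raise ValueError("action '%s' needs --%s" % (action, source))
    return source


print(check_run("fit_transform", config=".config.yaml"))  # config
print(check_run("transform", model="RandomSelection.pkl"))  # model
```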
Continuing the example, we can transform the test data, using the fitted model from before:
smt run --model data/20170810-131658/RandomSelection.pkl -X data/X_test.npy -a transform
>> Record label for this run: '20170810-134027'
>> Data keys are [20170810-134027/X_new.npy(b33b0e0b794b64e5d284a602f5440620a21cac1c [2017-08-10 13:40:32])]
Again, sumatra created an experiment record, which you can use to track input/output paths.
Derive your models from sklearn base classes and implement the fit/fit_transform/transform/predict functions. For this project, these functions should cover all you ever need to implement.
For instance, if you want to implement smoothing as a preprocessing step, it clearly matches the fit_transform/transform pattern.
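To make the pattern concrete, here is a minimal sketch of such a smoothing step written as a scikit-learn transformer. The class name MovingAverageSmoothing, its window parameter and the moving-average method itself are assumptions chosen for illustration; they are not part of the framework, and your own preprocessing may look quite different.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MovingAverageSmoothing(TransformerMixin, BaseEstimator):
    """Smooth each sample with a moving average (hypothetical example)."""

    def __init__(self, window=3):
        self.window = window

    def fit(self, X, y=None):
        # Smoothing has nothing to learn from the data, so fit is a no-op.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        kernel = np.ones(self.window) / self.window
        # mode="same" keeps the number of features unchanged.
        return np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, X
        )


X = np.array([[0.0, 3.0, 6.0, 9.0], [1.0, 1.0, 1.0, 1.0]])
X_smooth = MovingAverageSmoothing(window=3).fit_transform(X)
print(X_smooth.shape)  # (2, 4)
```

Only fit and transform are written by hand; fit_transform comes for free from TransformerMixin, and BaseEstimator supplies get_params/set_params so that the class can be used in pipelines and grid searches.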
We have provided several placeholder modules in models, where you can put the code. Two simple examples are already included, KernelEstimator in regression and RandomSelection in feature selection.
Please do not create any new model files or other files or folders, as we want to preserve the common structure.
To make experimenting easier, we provide an interface to the sklearn classes pipeline and gridsearch. Check out the example config to find out more about how to use them.
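For orientation, this is roughly what a pipeline with a grid search looks like in plain scikit-learn. The sketch does not show the framework's config syntax (see the example config for that); PCA, Ridge, the step names and the synthetic data are arbitrary choices made only to keep the example runnable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data, used only to make the sketch runnable.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 10))
y = X[:, 0] + 0.1 * rng.normal(size=60)

pipeline = Pipeline([
    ("reduce", PCA(n_components=5)),
    ("regress", Ridge()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "reduce__n_components": [2, 5],
    "regress__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(sorted(search.best_params_))  # ['reduce__n_components', 'regress__alpha']
```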
Make sure to read the sklearn-dev-guide, especially the sections Coding guidelines, APIs of scikit-learn objects, and Rolling your own estimator.
Furthermore, try to look at the sklearn source code - it is very instructive. You will spot many more of the sklearn utilities!
If you add new packages to your code, please include them in the .environment file, so that they are available when other people build your environment.
If you think something is missing or should be changed, please contact us via the Piazza forum or start an issue on gitlab.
If you only want to check if your code runs without invoking sumatra and without saving outputs, you can simply run
python run.py [-h] [-c CONFIG] [-m MODEL] -X X [-y Y] -a {transform,predict,fit,fit_transform}
Use this for debugging only, otherwise your experiments remain untracked and unsaved!
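If you are curious how a command line like the one above can be defined, the following argparse sketch produces a similar usage string. It is reconstructed from the usage line only and is not the actual content of run.py; the long option names, the help texts and the meaning given to -y are assumptions.

```python
import argparse


def build_parser():
    """Build a parser whose usage line resembles the one shown above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("-c", "--config", help="parameter file")
    parser.add_argument("-m", "--model", help="fitted model file")
    parser.add_argument("-X", required=True, help="input data")
    parser.add_argument("-y", help="targets (assumed meaning)")
    parser.add_argument(
        "-a",
        dest="action",
        required=True,
        choices=["transform", "predict", "fit", "fit_transform"],
        help="action to perform",
    )
    return parser


args = build_parser().parse_args(
    ["-c", ".config.yaml", "-X", "data/X_train.npy", "-a", "fit_transform"]
)
print(args.config, args.X, args.action)
# .config.yaml data/X_train.npy fit_transform
```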
You are required to publish your code no later than 24 hours after the Kaggle submission deadline.
Make sure you request access in time, so that you can create a new branch and push code.
First, you have to make sure that your code passes the flake8 tests. You can check by running
flake8
in the ml-project folder. It will return a list of coding quality errors.
Try to run it every now and then; otherwise the list of fixes you have to make before submission may get rather long.
Make sure that your Sumatra records are added:
git add .smt/
Next, create and push a new branch which is named legi-number/ml-project-1, e.g.
git checkout -b 17-123-456/ml-project-1
git push origin 17-123-456/ml-project-1
The first part has to be your Legi-Number; the number in the second part identifies the project.
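Before pushing, you can sanity-check a branch name with a short script. The pattern below is inferred from the example 17-123-456/ml-project-1 and assumes that Legi numbers are always written as NN-NNN-NNN; the repository's automatic check may apply different rules, and the function name is made up.

```python
import re

# Assumed shape: Legi number written as NN-NNN-NNN, then /ml-project-<number>.
BRANCH_PATTERN = re.compile(r"^\d{2}-\d{3}-\d{3}/ml-project-\d+$")


def is_valid_branch_name(name):
    """Return True if name looks like '17-123-456/ml-project-1'."""
    return BRANCH_PATTERN.match(name) is not None


print(is_valid_branch_name("17-123-456/ml-project-1"))  # True
print(is_valid_branch_name("ml-project-1"))             # False
```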
This repository runs an automatic quality check when you push your branch. Additionally, the timestamp of the push is checked.
Results are only accepted if the checks pass and the submission is made before the deadline.
Check under Pipelines whether your commit passed the check. The latest flag indicates which commit is the most current.
To submit a prediction (y_YYYYMMDD-hhmmss.csv), e.g. to get the validation score, you can use the kaggle-cli tool:
kg submit data/YYYYMMDD-hhmmss/y_YYYYMMDD-hhmmss.csv -c ml-project-1 -u username -p password -m "Brief description"
To view your submissions, just type
kg submissions
which will list all your previous submissions. To set a default username, password and project:
kg config -u username -p password -c competition
Please note that you have to explicitly select your final submission on Kaggle.
Otherwise, Kaggle will automatically select the submission with the best validation score.
Please post general questions about the machine learning projects to the dedicated Piazza forum.
For suggestions and problems specifically concerning the project framework, please open an issue here on gitlab.
If you want to discuss a problem in person, we will offer a weekly project office hour (tbd).