Under the folder `toy_example`, we provide a Jupyter notebook `jan22_toy_example.ipynb` that walks through the training and evaluation of our autoencoder (as well as other baseline algorithms) on the synthetic1 dataset. We highly recommend that interested readers take a look before diving deep into our code.
The source code contains four parts:
- Core
  - `model.py`
  - `utils.py`
  - `datasets.py`
  - `baselines.py`
- Code for each dataset
  - `synthetic_main.py`
  - `synthetic_powerlaw_main.py`
  - `amazon_main.py`
  - `amazon_parallel_l1.py`
  - `rcv1_main.py`
  - `rcv1_parallel_l1.py`
- Scripts for reproducing our results
  - `scripts/synthetic1.sh`
  - `scripts/synthetic2.sh`
  - `scripts/amazon.sh`
  - `scripts/rcv1.sh`
  - `scripts/synthetic_powerlaw.sh`
- Code and scripts for one of the baselines, Simple AE + l1-min
  - `synthetic_simpleAE.py`
  - `amazon_simpleAE.py`
  - `rcv1_simpleAE.py`
  - scripts are under `simpleAE_scripts/`
To reproduce our experimental results, first run `chmod +x scripts/*.sh` to make the scripts executable. After that, run the given scripts:

```sh
$ ./scripts/synthetic1.sh
$ ./scripts/synthetic2.sh
$ ./scripts/amazon.sh
$ ./scripts/rcv1.sh
$ ./scripts/synthetic_powerlaw.sh
```
Note:
- The results are stored in a Python dictionary, which is then saved under the folder `ckpts/`. They can be used to reproduce the figures shown in our paper.
- Before running `amazon.sh`, download `train.csv` from this Kaggle competition and specify its location via `--data_dir`.
- The RCV1 dataset will be fetched automatically using the `sklearn.datasets.fetch_rcv1` function.
- To reproduce results of one of the baselines, Simple AE + l1-min, run the scripts under the folder `simpleAE_scripts/`.
- For high-dimensional vectors, solving l1-min using Gurobi takes a long time on a single CPU. To speed this up, we solve l1-min in parallel on a multi-core machine. In `amazon_main.py` and `rcv1_main.py`, performance evaluation is performed on a small subset of the test samples (while training is still done on the complete training set). After training the autoencoder, we use a multi-core machine and solve l1-min in parallel on the complete test set using `amazon_parallel_l1.py` and `rcv1_parallel_l1.py`. Depending on your multi-core machine, solving l1-min in parallel on the complete test set may still take a long time, so we recommend first running `amazon_parallel_l1.py` and `rcv1_parallel_l1.py` on a small subset (by setting small values for the parameters `num_core` and `batch` in the Python files).
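To illustrate the l1-min step above, here is a minimal, self-contained sketch (not the repo's Gurobi-based implementation): each decoding problem min ||z||_1 subject to Az = y can be cast as a linear program and solved independently per test sample, which is what makes batching across cores effective. The helper names `l1_min` and `l1_min_batch` are our own, and we substitute `scipy.optimize.linprog` and a standard-library worker pool for Gurobi and the repo's `num_core`/`batch` parameters:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from scipy.optimize import linprog


def l1_min(A, y):
    """Solve min ||z||_1  s.t.  A z = y  as a linear program.

    Stack the variables as [z, t] with constraints -t <= z <= t,
    so minimizing sum(t) minimizes the l1 norm of z.
    """
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])  # objective: sum(t)
    A_eq = np.hstack([A, np.zeros((m, n))])        # A z = y
    I = np.eye(n)
    A_ub = np.vstack([np.hstack([I, -I]),          #  z - t <= 0
                      np.hstack([-I, -I])])        # -z - t <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * (2 * n), method="highs")
    return res.x[:n]


def l1_min_batch(A, Y, n_workers=4):
    """Solve one independent l1-min problem per row of Y, in parallel."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda y: l1_min(A, y), Y))
```

Because each sample's LP is independent, the speedup from parallelism is close to linear in the number of workers; the LP itself grows with the signal dimension, which is why high-dimensional vectors are slow on a single CPU.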
Here is our software environment:
- Python 2.7.12
- numpy 1.13.3
- sklearn 0.19.1
- scipy 1.0.0
- joblib 0.10.0
- Tensorflow r1.4
- Gurobi 7.5.1