Links to various kernels:
Data feature engineering: https://www.kaggle.com/pradyu99914/data-feature-engineering
Version 29 has supporting visualizations, while version 32 adds more features to the existing dataset. Please note that the code was executed over a number of commits, as each feature takes roughly 7 hours to compute. (A hedged sketch of this kind of feature engineering appears after this list.)
Linear, Ridge, and Lasso regression, plus a Random forest model with 100 estimators: https://www.kaggle.com/anushkini/nyc-taxi-fare-models
Data visualizations (plots that motivated the newly engineered columns): https://www.kaggle.com/anushkini/nyc-taxi-fare-graphs
XGBoost: https://www.kaggle.com/anushkini/taxi-xgboost
kNN regression, with a visualization for choosing the best k value (see the sketch after this list): https://www.kaggle.com/pradyu99914/nyc-taxi-fare-models
LGBM: https://www.kaggle.com/anushkini/taxi-lightgbm
Final pipeline, with links to all relevant kernels: https://www.kaggle.com/pradyu99914/
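As a flavor of the feature engineering above, here is a minimal sketch of the distance and Day/Month features mentioned in the results table below. It assumes the competition's standard column names (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, pickup_datetime); the helper is illustrative, not the exact kernel code.

```python
import numpy as np
import pandas as pd

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

def add_features(df):
    # Distance feature from the pickup/dropoff coordinates
    df["distance_km"] = haversine_km(df["pickup_longitude"], df["pickup_latitude"],
                                     df["dropoff_longitude"], df["dropoff_latitude"])
    # Day, Month (and related) features from the pickup timestamp
    ts = pd.to_datetime(df["pickup_datetime"])
    df["day"], df["month"] = ts.dt.day, ts.dt.month
    df["hour"], df["weekday"] = ts.dt.hour, ts.dt.weekday
    return df
```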
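In the same spirit, a small self-contained sketch of how the best k for the kNN regressor can be picked by sweeping k and plotting validation RMSE. Synthetic data stands in for the real features, so this illustrates the approach rather than reproducing the kernel.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the engineered features and fare amounts
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Sweep k and record the validation RMSE for each value
ks = list(range(1, 21))
rmses = [np.sqrt(mean_squared_error(
             y_val,
             KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train).predict(X_val)))
         for k in ks]

plt.plot(ks, rmses, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("validation RMSE")
plt.show()
```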
A brief description of the files and folders:
-- TeamAPP_FinalReport.pdf - The final report of our project.
-- demo.py - A demo script that shows our fare prediction system in action.
-- feature_engineering.py - A script that performs the feature engineering on the data.
-- final_pipeline.py - The final pipeline code for our project.
-- model.txt - The final LGBM model, with an RMSE of 2.93 (see the loading sketch after this list).
-- test_df.feather - A feather file containing the test dataset in compressed form.
-- visualization.py - A Python script with all the visualizations performed on the dataset.
-- Models:
---- kNN Model/knn.py - Script to train the k-Nearest Neighbours model.
---- ANN/ann.py - Script to train the neural network model.
---- LGBM/lgbm.py - Script to train the LGBM model.
---- XGBoost/XGBoost.py - Script to train the XGBoost model.
---- Lasso regression.py - Script to train the Lasso regression model.
---- LRRF.py - Script to train the Random forest model.
---- LR.py - Script to train the Linear regression model.
---- RidgreRegression.py - Script to train the Ridge regression model.
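To tie the shipped artifacts together, here is a hedged sketch of how model.txt and test_df.feather could be loaded to produce a submission. The "key" column and output path are assumptions based on the Kaggle competition format, not taken from demo.py.

```python
import lightgbm as lgb
import pandas as pd

# Load the compressed test set and the trained LGBM booster shipped in this repo
test_df = pd.read_feather("test_df.feather")
booster = lgb.Booster(model_file="model.txt")

# Predict fares using the feature columns the booster was trained on
features = booster.feature_name()
preds = booster.predict(test_df[features])

# Write a Kaggle-style submission ("key" is assumed to identify each test row)
pd.DataFrame({"key": test_df["key"], "fare_amount": preds}).to_csv(
    "submission.csv", index=False)
```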
Results:
We obtained an RMSE of about 2.93 on the Kaggle competition.
All Kaggle submissions made to date:
Model | Model details | RMSE |
---|---|---|
XGBoost | Trained on 1 million data points | 4.46939 |
XGBoost (bagging) | Trained on 8 million data points | 4.19958 |
XGBoost (bagging) | Trained on 54 million data points | 4.12760 |
XGBoost | Bayesian optimization for hyperparameter tuning | 4.18783 |
XGBoost (bagging) | Improved dataset with feature engineering, 8 million data points | 3.91798 |
XGBoost | Improved dataset and Bayesian optimization | 3.17963 |
XGBoost | Coordinates converted from decimal degrees to radians, plus Bayesian optimization | 3.17697 |
XGBoost | Engineered Day and Month columns added to the dataset | 3.11282 |
LGBM | Improved dataset | 3.13226 |
LGBM | Changed boosting hyperparameters | 3.08951 |
LGBM | Coordinates converted from decimal degrees to radians | 3.08830 |
LGBM | Engineered Day and Month columns added to the dataset | 3.02238 |
LGBM | Reworked dataset with more feature engineering | 2.99228 |
LGBM | Added new distance features | 2.99095 |
LGBM | Trained on 15 million data points | 2.93553 |
Linear regression | Baseline model | 5.39 |
Linear regression | Final model trained on engineered data | 5.18 |
Ridge regression | Ridge regression with grid search | 5.18 |
Lasso regression | Lasso regression | 9.409 |
Lasso regression | Lasso regression with grid search | 5.05 |
Random forest regressor | Random forest regressor | 4.43 |
kNN regressor with bagging | Trained on feature-engineered data | 3.54 |
Artificial neural network | ANN with data normalization | 3.39 |
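Several of the XGBoost rows above use bagging: training a handful of boosters on random subsets of the data and averaging their predictions. A minimal sketch of the idea follows; the synthetic data and hyperparameters are placeholders, not the submitted configuration.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the engineered training data and a test block
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 6))
y = X @ rng.normal(size=6) + rng.normal(size=5000)
X_test = rng.normal(size=(100, 6))

# Bagging: fit several boosters on random subsets, then average their predictions
preds = []
for i in range(5):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    model = xgb.XGBRegressor(n_estimators=200, max_depth=6,
                             learning_rate=0.1, random_state=i)
    model.fit(X[idx], y[idx])
    preds.append(model.predict(X_test))

bagged = np.mean(preds, axis=0)  # averaged ensemble prediction
```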