- Business Problem Addressed
- Environment and Dataset
- Exploration and Analysis
- Prediction and Model Performance
- Recommendations for Taxi Services
- Other Details
This project aims to solve the following real-world problem for our business partner:
Understand the scenarios in which ride-sharing services succeed, and provide recommendations for traditional taxi service stakeholders,
based on the underlying relationship between traditional taxi services and competing ride-sharing services such as Uber.
- OS: Ubuntu 20.04
- Storage: 30GB+
- Language: Python 3.9
- Packages / Libraries:
  - pandas
  - pyspark
  - sklearn
  - statsmodels
  - folium
  - os
  - bokeh
  - geopandas
  - numpy
  - seaborn
  - matplotlib.pyplot
More than 20 GB of data was used in this project.
- NYC TLC: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- NYC Traffic Data: https://data.cityofnewyork.us/Transportation/Traffic-Volume-Counts-2014-2019-/ertz-hr4r
- NYC Extreme Weather: https://www.weather.gov/okx/stormevents
- Yellow/Green taxi business performance is decreasing.
- HVFHV (ride-sharing services such as Uber) business performance is stable.
- The overall trend behaves like a time series, with a wave shape that rises and falls.
- Further analysis over smaller time frames is needed.
- More traffic during the weekend.
- Less traffic in the middle of the week (Wed, Thu, Fri).
- Saturday is the peak day of the week.
- Green/Yellow taxis are most popular on Thursday and Friday.
- Uber (HVFHV) gets most of its business during weekends and holidays.
- Traditional taxi services need to focus on business promotions for non-working days.
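A day-of-week breakdown like the one summarised above can be sketched with pandas. This is a minimal illustration, not the notebooks' actual code: the `trips` frame, its column names (`pickup_datetime`, `service`), and the helper `trips_by_weekday` are hypothetical, and the six rows are made-up sample data.

```python
import pandas as pd

# Hypothetical sample: one row per trip, with a pickup time and a service label.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2019-03-04 08:15", "2019-03-07 18:30", "2019-03-08 19:00",
        "2019-03-09 22:10", "2019-03-09 23:45", "2019-03-10 01:20",
    ]),
    "service": ["yellow", "yellow", "green", "hvfhv", "hvfhv", "hvfhv"],
})

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def trips_by_weekday(df):
    """Count trips per weekday (rows) and service (columns)."""
    weekday = df["pickup_datetime"].dt.day_name()
    counts = pd.crosstab(weekday, df["service"])
    # Put the days in calendar order and fill days with no trips with 0.
    return counts.reindex(WEEKDAYS, fill_value=0)

weekday_counts = trips_by_weekday(trips)
print(weekday_counts)
```

On the real data the same grouping would more likely be done in PySpark before collecting the small aggregated table into pandas for plotting.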
The observations above combine analyses at different time scales with the general traffic conditions over the week.
Interestingly, traditional taxis perform relatively poorly on non-working days such as weekends and holidays.
They are also less competitive under extreme weather conditions (snow, rain, etc.). Moreover, taxi service performance is decreasing over time, which is a warning sign given that ride-sharing services such as Uber are maintaining their business performance.
Next, a multi-factor linear model was built so that our taxi business partner can predict their business performance from other factors.
- These are the candidate parameters that could be included in our linear model.
- There are too many features to build a model directly, so model selection is needed.
The data has been standardized (rescaled to zero mean and unit variance) so that the coefficients of the linear model can be compared.
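Standardizing each feature before fitting can be sketched as follows. This is an illustrative z-score transform in NumPy, assuming every column has non-zero variance; the `features` array and the `standardize` helper are hypothetical and are not taken from the project notebooks.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

# Hypothetical features on very different scales:
# column 0 = daily trip count, column 1 = temperature in degrees Celsius.
features = np.array([
    [120000.0, 3.5],
    [135000.0, 7.0],
    [98000.0, -1.0],
    [150000.0, 12.5],
])

Z = standardize(features)
print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

Once all columns share the same scale, the size of a fitted coefficient reflects the effect of a one-standard-deviation change in that feature, which is what makes the coefficients comparable.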
- The model uses the default F-test rather than the likelihood ratio test, as the two were shown to be equivalent in our case.
- Several entries were removed to improve the model's goodness of fit, after examining the outlier plot based on Cook's distance.
- The model was further refined by visualising component-component graphs and calling the StepAIC11 function to remove insignificant variables.
- AIC is preferred over the likelihood ratio test because AIC does not require the compared models to be nested.
- Interactions among attributes were tested with ANOVA.
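Two of the steps above, flagging influential entries with Cook's distance and comparing non-nested models by AIC, can be sketched in plain NumPy. This is a sketch on synthetic data, not the project's code: the helper names (`fit_ols`, `cooks_distance`, `gaussian_aic`), the generated variables, and the common `4 / n` rule-of-thumb threshold are all assumptions. The AIC here counts only the regression coefficients as parameters, which is one common convention.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def cooks_distance(X, y):
    """Cook's distance of every observation for the model y ~ X."""
    n, p = X.shape
    _, resid = fit_ols(X, y)
    leverage = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

def gaussian_aic(X, y):
    """AIC = -2 * log-likelihood + 2 * p for a Gaussian linear model."""
    n, p = X.shape
    _, resid = fit_ols(X, y)
    rss = resid @ resid
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
    return -2.0 * log_lik + 2.0 * p

# Synthetic data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=n)
x1[0], y[0] = 4.0, y[0] + 15.0  # plant one influential outlier

ones = np.ones(n)
d = cooks_distance(np.column_stack([ones, x1, x2]), y)
keep = d <= 4.0 / n  # rule-of-thumb cut-off (an assumption)

# Two non-nested candidates: AIC can rank them, a likelihood ratio test cannot.
aic_x1 = gaussian_aic(np.column_stack([ones, x1])[keep], y[keep])
aic_x2 = gaussian_aic(np.column_stack([ones, x2])[keep], y[keep])
print("most influential row:", int(np.argmax(d)))
print("AIC (x1 only):", aic_x1, "AIC (x2 only):", aic_x2)
```

The model with the lower AIC is preferred; in this synthetic setup that is the one built on `x1`, the variable that actually drives `y`.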
Model Statistics | Model Parameters |
---|---|
After the series of analyses above, here are some recommendations for our taxi business partner:
- Make it easier for the public to order rides: adopt modern technology such as easy-to-use, customer-friendly mobile apps.
- Introduce algorithmic rating systems so that the service interacts more with customers, improving overall satisfaction and service quality.
- Promotion campaigns should focus more on leisure and holiday travel to increase trip volume.
- For example, promote the idea that taking a taxi during holidays is a better choice than calling an Uber or driving one's own car.
- This can be done by giving away free coupons that can only be used during holidays.
- Traditional taxis and FHVs should take advantage of their inherent safety and reliability, which the public has long recognised.
- Under close government scrutiny, taxis may be safer and more reliable than Uber-like companies.
- The sources indicate that Uber and other ride-sharing companies are recent, rapidly developed services without much history.
- Given the linear relationship found between Taxi and HVFHV (Uber), it is reasonable to conclude that the two are interconnected.
- The competition between traditional transportation services and ride-sharing services is quite possibly an example of how contemporary technologies boost traditional industries.
- The logic is that the two parties are closely correlated: since one can be used to predict the other's business performance, taxi services may be able to adjust a few of their business variables to catch up with Uber-like services.
- Competition between these large service providers will benefit local citizens, who can enjoy higher-quality, safer rides at lower cost.
- Full Report Link: https://www.overleaf.com/read/qwcptssmfnvg
- `raw_data`: Contains all the raw data files. External data can be downloaded through the "DownloadData.ipynb" and "ExternalData.ipynb" notebooks.
- `preprocessed_data`: Contains all the preprocessed data files. Running the notebook "Preprocessing.ipynb" should generate and save all the preprocessed data needed.
- `plots`: Output and saved plots.
- `code`:
  - "DownloadData.ipynb" for "Downloading Data" and "Installing Packages".
  - "ExternalData.ipynb" for "Accessing, Downloading and Preprocessing External Data".
  - "Preprocessing.ipynb" for "Data preprocessing, feature engineering and saving to local".
  - "GeoMap.ipynb" for "Data Visualization within Geolocational map".
  - "OverAllAnalysis.ipynb" for "Data Aggregation and Visualisation on a Daily or Weekly Basis".
  - "Correlation.ipynb" for "Finding initial underlying linear relations between attributes".
  - "LinearModel.ipynb" for "Statistical Modelling".
- Big thanks to https://github.com/akiratwang/ for teaching me PySpark and general knowledge.
- How to install and use PySpark? https://github.com/akiratwang/MAST30034_Python/blob/main/advanced_tutorials/