Predicts the total ride duration of taxi trips in New York City.

rutujahingankar/nyc-taxi-trip-time-prediction

nyc-taxi-trip-time-prediction

Build a model that predicts the total ride duration of taxi trips in New York City.

Exploring Solutions:

With a clearer understanding of the problem we are trying to solve, the users' needs and frustrations, and the goals for reaching the best possible solution for both the business and the user, I began by listing the possible solutions that emerged from the research.

Steps involved:

The full code for this article can be found here. It is implemented in Python and uses several machine learning algorithms. Below is a brief description of the general approach I employed:

Data Loading and general checkups: We loaded the data from the given CSV files using a function from the pandas library, then checked the general information about the data.
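The loading and checkup step could look roughly like the sketch below. The original file names are not given here, so a small in-memory sample stands in for the CSV file; the column names other than id, passenger_count and store_and_fwd_flag are assumptions, and the two rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
# Column names such as pickup_latitude and trip_duration are assumptions.
SAMPLE_CSV = io.StringIO(
    "id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration\n"
    "id001,2016-03-14 17:24:55,1,-73.982,40.768,-73.965,40.766,N,455\n"
    "id002,2016-06-12 00:43:35,2,-73.980,40.739,-73.999,40.731,N,663\n"
)

# With a real file, pass its path instead of the in-memory buffer.
df = pd.read_csv(SAMPLE_CSV, parse_dates=["pickup_datetime"])

# General checkups: size, column types, missing values, summary statistics.
print(df.shape)
df.info()
print(df.isnull().sum())
print(df.describe())
```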

Exploratory Data Analysis:

We removed the id variable, as it offers little interpretive value. We then calculated the trip distance from the pickup and drop-off latitude and longitude using the haversine formula. A box plot of this distance showed many outliers, so we segmented the variable and found that most trips are within 10 km, some are between 10 km and 50 km, and very few exceed 50 km. We therefore eliminated trips with a distance of 0 or above 50 km. We then checked the categorical variables store_and_fwd_flag and passenger_count. The store_and_fwd_flag variable is dominated by a single category, so we dropped this feature. The passenger_count variable has entries from 0 to 9. Since a trip cannot have 0 passengers, such records are either erroneous entries or cases where the driver forgot to enter the passenger count. Also, a taxi is allowed to carry at most six people, including minors. So we eliminated records with a passenger count of 0 or 7 to 9 from our dataset.

Linear Regression: Linear regression is a regression of a dependent variable on one or more independent variables. It is a linear model that assumes a linear relationship between the dependent variable (y) and the independent variables (x).
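As an illustration of the distance calculation and the cleaning rules from the exploratory analysis above, here is a minimal sketch using only the standard library. The function names haversine_km and keep_trip are hypothetical, and the Earth radius of 6371 km is an assumed mean value; the original code may use a different implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def keep_trip(distance_km, passenger_count):
    """Cleaning rules from the text: drop trips with a distance of 0 or
    above 50 km, and trips with 0 or more than 6 passengers."""
    return 0 < distance_km <= 50 and 1 <= passenger_count <= 6


# Made-up coordinates for a short trip in Manhattan.
distance = haversine_km(40.768, -73.982, 40.766, -73.965)
print(round(distance, 2), keep_trip(distance, passenger_count=1))
```

On a pandas DataFrame the same functions could be applied row by row, although a vectorized NumPy version would be faster on the full dataset.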

XGBoost: XGBoost is a boosting method; the name stands for eXtreme Gradient Boosting. A gradient boosting machine (GBM) first fits a model using X and Y and obtains its predictions. It then fits the next model on the residuals of the previous one, so each new model concentrates on the errors that the earlier models left uncorrected.
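The residual-fitting idea can be sketched with a tiny from-scratch booster. This is not XGBoost itself, which adds regularization and other refinements on top of this scheme; it is plain gradient boosting for squared error with one-split trees (stumps) on a single feature. All function names and the toy distance and duration values are made up for illustration.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that minimises squared error on the
    residuals. Requires at least two distinct values in x."""
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump_predict(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, learning_rate, stumps


def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data: trip distance in km and trip duration in seconds (made up).
x = [1, 2, 3, 5, 8, 13]
y = [300, 480, 600, 900, 1500, 2400]

model = fit_boosting(x, y)
mse_base = mse(y, [sum(y) / len(y)] * len(y))
mse_boosted = mse(y, [predict(model, xi) for xi in x])
print(mse_base, mse_boosted)
```

Each round reduces the training error a little, which is why the boosted error printed at the end is lower than the error of predicting the mean alone.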
