Taxi fare - exploration and prediction


The following is an exploration of the taxi fare data received.

The first section is an exploratory data analysis that looks to understand the relationships between the variables and to identify whether any significant relationship exists between the target variable and the other variables.

The second section is on modelling, looking at how best to predict the target variable. It includes creating a simple linear model, using cross-validation to determine model performance, and a custom-defined function to compare models against each other.

Taxi Fares

The data set contains the following variables:

1. Exploratory Data Analysis

There are no missing values in the dataset.
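A check along the following lines backs this up (a minimal sketch; the file name `taxi_fares.csv` and the DataFrame name `df` are assumptions, as the original loading cell is not shown):

```python
import pandas as pd

# Hypothetical file name; the original data source is not shown
df = pd.read_csv("taxi_fares.csv")

# Count missing values per column; all zeros means no missing data
print(df.isnull().sum())
```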

Let's define the target variable as a Python variable.

Adding two plots to get a basic understanding of the distribution of the target variable.
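A minimal sketch of both steps, assuming the target column is called `fare_amount`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Define the target variable as a Python variable
target = "fare_amount"

# Histogram for the overall shape, boxplot for spread and outliers
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df[target], bins=100, ax=axes[0])
sns.boxplot(x=df[target], ax=axes[1])
plt.show()
```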

The data is clearly skewed to the right and resembles a power-law distribution.

As there are only 9 rows with negative fare amounts, I am dropping them.
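A sketch of the filter, assuming the negative values appear in the target column:

```python
# Verify the count of negative fares before dropping them
print((df[target] < 0).sum())

# Keep only rides with a non-negative fare
df = df[df[target] >= 0]
```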

Creating column charts of the number of rides per day of the week and per hour of the day.
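A possible implementation, assuming the timestamp column is called `pickup_datetime`:

```python
# Derive day-of-week and hour-of-day features from the pickup timestamp
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["pickup_day"] = df["pickup_datetime"].dt.day_name()
df["pickup_hour"] = df["pickup_datetime"].dt.hour

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
df["pickup_day"].value_counts().reindex(day_order).plot.bar(
    ax=axes[0], title="Rides per day of week")
df["pickup_hour"].value_counts().sort_index().plot.bar(
    ax=axes[1], title="Rides per hour of day")
plt.show()
```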

From the above bar chart it is visible that the most popular days are Fridays and Saturdays, while the least popular are Sundays and Mondays; there is no significant diminishing trend apparent in the data.

The most common pickup time is in the evening, around 7 PM. There is an obvious and sudden decrease after that: the number of taxi rides is significantly lower during the night than during the day. Most taxi rides take place during the second half of the day.

Let's have a look at how the different days compare with regard to pickup hour.
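One way to build such a chart (the bucketing of hours into periods is an assumption):

```python
# Bucket pickup hours into coarse day periods
period = pd.cut(df["pickup_hour"], bins=[0, 6, 12, 18, 24], right=False,
                labels=["night", "morning", "afternoon", "evening"])

# Cross-tabulate days against periods and draw a stacked column chart
pd.crosstab(df["pickup_day"], period).reindex(day_order).plot.bar(
    stacked=True, figsize=(10, 4))
plt.show()
```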

The stacked column chart shows the days of the week, with the different colours indicating the hours of the day: whether a ride took place in the morning, during the day or at night.

Identify outliers

Is there a relationship with any of the relevant variables?

The purpose of the following boxplots is to try to identify any patterns in the data and to see whether there are any relationships between pickup day, pickup hour, passenger count and fare amount.
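A sketch of how these boxplots might be drawn; placing the red line at the conventional 1.5 × IQR threshold is an assumption, while the blue line marks the 99th percentile:

```python
# Outlier threshold (1.5 x IQR above Q3) and 99th percentile of the fare
q1, q3 = df[target].quantile([0.25, 0.75])
iqr_upper = q3 + 1.5 * (q3 - q1)
p99 = df[target].quantile(0.99)

fig, axes = plt.subplots(1, 3, figsize=(16, 4), sharey=True)
for ax, col in zip(axes, ["pickup_day", "pickup_hour", "passenger_count"]):
    sns.boxplot(x=col, y=target, data=df, ax=ax)
    ax.axhline(iqr_upper, color="red")   # outlier threshold
    ax.axhline(p99, color="blue")        # 99th percentile
plt.show()
```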
In the above boxplots, all data above the red horizontal line are outliers for fare_amount (total). Anything above the blue line is above the 99th percentile of the total fare amount.

Create pickup and dropoff locations

Drop rows where the pickup or dropoff location is (0, 0)

There are quite a few rows where the pickup or dropoff location is equal to (0, 0). However, this is unlikely to be an actual location for a taxi ride, as the point lies in the Atlantic Ocean, about 200 km from the coast of Africa.
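The filter itself is simple (the coordinate column names are assumptions):

```python
coord_cols = ["pickup_longitude", "pickup_latitude",
              "dropoff_longitude", "dropoff_latitude"]

# Keep only rows where every coordinate is non-zero
df = df[(df[coord_cols] != 0.0).all(axis=1)]
```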

Identifying location hotspots

There is still data that requires additional cleaning.

For example, longitude cannot take a value of 401.08; valid longitudes lie between -180 and 180 degrees.

Focus on New York

My suggestion is to focus on New York City, as most of the values in the above table point to that location.
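A bounding-box filter along these lines would implement the suggestion; the exact box around New York City is an assumption:

```python
# Rough bounding box around New York City (approximate values)
lon_min, lon_max = -74.3, -73.7
lat_min, lat_max = 40.5, 41.0

for prefix in ["pickup", "dropoff"]:
    df = df[df[f"{prefix}_longitude"].between(lon_min, lon_max)
            & df[f"{prefix}_latitude"].between(lat_min, lat_max)]
```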

Identifying pickup and dropoff hotspots

The top pickup location is around Times Square.

The top dropoff location is also Times Square; however, based on the above scatterplot, there is a clear hotspot around (40.69, -74.18): Newark Liberty International Airport in New Jersey.

Calculate the air distance using the great circle distance

Due to the curvature of the Earth, calculating distances on a Euclidean plane would yield incorrect results, hence the haversine formula is used to calculate the great-circle distance between two locations.
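A standard vectorised implementation of the haversine formula (the output column name `air_distance` is an assumption):

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

df["air_distance"] = haversine(df["pickup_latitude"], df["pickup_longitude"],
                               df["dropoff_latitude"], df["dropoff_longitude"])
```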

There are 557 rows where the air distance travelled is equal to zero. I'll drop these as well, as they appear counterintuitive.

There are also some rows where the passenger count is equal to zero. As we are interested in passenger transport, I'll drop these values as well.
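Both drops in one sketch:

```python
# Drop zero-distance rides and rides with no passengers
df = df[(df["air_distance"] > 0) & (df["passenger_count"] > 0)]
```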

Using a correlation matrix to identify columns that could be used for modelling
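One way to compute and visualise it:

```python
# Pairwise correlations of the numeric columns
corr = df.select_dtypes(include="number").corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Correlations with the target, strongest first
print(corr[target].sort_values(ascending=False))
```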

The above scatterplot shows the fare amount on the x-axis and the air distance on the y-axis. There are some outliers in the air distance, which are worth examining further.

Again, there are some records where the distance travelled is significantly larger than the 99th percentile; I'll drop these as well for the purposes of the model.
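The same percentile-based filter as before, applied to the distance (a sketch):

```python
# Drop rides whose air distance exceeds the 99th percentile
dist_p99 = df["air_distance"].quantile(0.99)
df = df[df["air_distance"] <= dist_p99]
```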

Let's create a histogram to identify whether there are any patterns in the distribution of air distance.

Let's see if the above pattern can be examined further by focusing on 95% of the remaining dataset.
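A sketch of the trimmed histogram (taking the bottom 95% of distances is an assumption):

```python
# Restrict to the bottom 95% of air distances and re-plot
p95 = df["air_distance"].quantile(0.95)
sns.histplot(df.loc[df["air_distance"] <= p95, "air_distance"], bins=50)
plt.show()
```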

The above histogram appears to show a distribution similar to a Poisson distribution with a low lambda.

From the above jointplot and the correlation between air distance and the fare amount, I conclude that air distance could be a good predictor for fare amount.

2. Modelling

Fit baseline model - linear regression

In the following section, I will fit and evaluate different models. I'll start by fitting a baseline linear regression on a subset of the features, then expand the model to include all the features examined. After that, I'll run a cross-validation using scikit-learn's own cross-validator, and then create a custom function that randomly splits the data and trains and evaluates a set of predefined models. In the final section I will create a simple pipeline that could be used to deploy the winning model.
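A sketch of the baseline fit; the exact feature subset used here is not shown, so the one below is an assumption:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed feature subset for the baseline
features = ["air_distance", "passenger_count", "pickup_hour"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```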

Adding additional columns to the model

Cross-validating the latter model
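A minimal sketch using scikit-learn's cross-validator (the fold count and scoring metric are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), df[features], df[target],
                         cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
print("Mean MAE:", -scores.mean())
```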

Compare multiple models based on their performance
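A possible shape for the comparison function (the model list and split fraction are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def compare_models(models, X, y, n_splits=3):
    """Randomly split the data n_splits times; train and score every model."""
    rows = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        for name, model in models.items():
            pred = model.fit(X_tr, y_tr).predict(X_te)
            rows.append({"model": name, "split": seed,
                         "r2": r2_score(y_te, pred),
                         "mae": mean_absolute_error(y_te, pred),
                         "rmse": np.sqrt(mean_squared_error(y_te, pred))})
    return pd.DataFrame(rows)

results = compare_models({"linear": LinearRegression(),
                          "random_forest": RandomForestRegressor(random_state=0)},
                         df[features], df[target])
print(results.groupby("model")[["r2", "mae", "rmse"]].mean())
```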

The above table is the result of splitting the data 3 times, then training the models and evaluating their performance. The best performer is the random forest regressor on all 3 metrics. $R^2$ measures the variance explained by the model, while mean absolute error and root mean squared error measure the prediction error relative to the actual values. However, because the root mean squared error squares the errors before averaging, it penalizes larger errors more than the mean absolute error does.

Parameter tuning

The following is a limited attempt at tuning the parameters of the winning random forest regressor, constrained by time and computational budget. For better model performance it is worth experimenting with the number of estimators, the depth of the trees and the number of samples required per leaf or per split. The number of search iterations has also been decreased significantly to improve training speed.
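A sketch with RandomizedSearchCV (the exact search space and iteration count here are assumptions):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
    "min_samples_split": randint(2, 10),
}

# Few iterations and folds to keep the runtime manageable
search = RandomizedSearchCV(RandomForestRegressor(random_state=0), param_dist,
                            n_iter=10, cv=3, scoring="neg_mean_absolute_error",
                            random_state=0, n_jobs=-1)
search.fit(df[features], df[target])
print(search.best_params_, -search.best_score_)
```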

These results can also be further improved using a grid search.

Create a simple pipeline to deliver the winning model

Uncommenting the following cell writes the final model to disk as a Python binary file.
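Such a cell might look like the following; the scaling step, the tuned hyperparameters and the file name are all assumptions:

```python
import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling step and hyperparameters are assumptions
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(**search.best_params_, random_state=0)),
])
pipeline.fit(df[features], df[target])

# joblib.dump(pipeline, "taxi_fare_model.pkl")  # hypothetical file name
```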

How to improve the current model?