You would never guess this is the most expensive neighborhood in Tokyo

Sakshi Agarwal
5 min read · Jul 27, 2021

Tokyo Station

Airbnb has made it easier for people to rent their houses and apartments to tourists and visitors. In a large, vibrant city like Tokyo, many factors can affect the price of a listing, such as its neighborhood. The goal of this project is to find which factors play a major role in determining the price.

For this, I am using the Tokyo Airbnb data found on Kaggle. The data contains 11,466 samples (rows) of different listings across Tokyo, with 14 features (columns) for each listing (e.g. neighborhood, price, etc.).

We focus on the data set in two stages:

  1. Analyze the data.
  2. Use ML models to predict the prices.

In the first stage of analyzing the data we try to answer the following questions:

  1. What is the average listing price in each Tokyo neighborhood?
  2. What is the average price of each room type?
  3. What percentage of Airbnb accommodations are of each room type?
  4. How does the average price of each room type vary across neighborhoods?

To begin with, let’s do some data preprocessing.

I removed all features that hold ID information, since they won’t be helpful for this analysis. I also removed other non-relevant features (for example, ‘name’, ‘host_name’, and ‘last_review’) and dropped all rows with missing prices.
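Here is a minimal sketch of that cleanup step with pandas. The tiny DataFrame is a made-up stand-in for the Kaggle file, and the column names `id` and `host_id` are my assumption about how the ID columns are named; adjust them to whatever the real CSV uses.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle CSV (values and some column names are assumed).
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_id": [10, 11, 12, 13],
    "name": ["Cozy flat", "Quiet room", "Big house", "Bunk bed"],
    "host_name": ["A", "B", "C", "D"],
    "last_review": ["2021-01-01", None, "2020-12-05", None],
    "neighbourhood": ["Area 1", "Area 2", "Area 3", "Area 1"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "price": [12000.0, np.nan, 30000.0, 2500.0],
})

# Drop identifier columns and other fields that are not relevant to price.
drop_cols = ["id", "host_id", "name", "host_name", "last_review"]
cleaned = listings.drop(columns=drop_cols)

# Drop listings whose price is missing.
cleaned = cleaned.dropna(subset=["price"]).reset_index(drop=True)
print(cleaned)
```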

I then plotted the price distribution to check for outliers, and the plot (shown below, first image) does suggest they exist. So I kept only the data points with prices between the 5th and 95th percentiles and dropped the rest, which removed 1,021 samples. I then replotted the distribution (shown below). The new plot suggests fewer outliers, so I went ahead with the remaining 10,445 samples.

Now, let’s start with the first part!
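A percentile filter like this can be sketched in a few lines. The prices below are toy values (1 to 100), not the real data, and the inclusive bounds are my choice; the counts will of course differ on the actual listings.

```python
import pandas as pd

# Toy prices 1..100 so the percentile cutoffs are easy to follow.
prices = pd.DataFrame({"price": [float(p) for p in range(1, 101)]})

# 5th and 95th percentile cutoffs.
low, high = prices["price"].quantile([0.05, 0.95])

# Keep only rows whose price falls inside the cutoffs (inclusive on both ends).
trimmed = prices[prices["price"].between(low, high)]
dropped = len(prices) - len(trimmed)
print(f"kept {len(trimmed)} rows, dropped {dropped}")
```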

First, a walk through Tokyo’s neighborhoods: which neighborhoods have the highest and lowest prices and why?

To answer this question, I grouped the listings by neighborhood, computed the average price in each neighborhood (the sum of its listing prices divided by its number of listings), and sorted the averages to get the plot below.

It seems that Hinohara Mura (I have been living in Tokyo for 2 years now and have never heard of this area :p) is on top when it comes to average listing price, and Higashiyamato shi is at the bottom.
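The aggregation described above is roughly a group-by followed by a sort. This is a sketch on toy data; the neighborhood labels and prices are invented, and `neighbourhood` and `price` are assumed column names.

```python
import pandas as pd

# Toy listings with made-up neighborhoods and prices.
listings = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C"],
    "price": [10000, 14000, 6000, 8000, 20000],
})

# Mean price per neighborhood, most expensive first.
avg_by_area = (
    listings.groupby("neighbourhood")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_area)
```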

Second, a walk through Tokyo’s neighborhoods: what is the average price of each room type?

To answer this, we plot a bar graph of average price by room type. Below is the result:

Shared rooms are clearly the cheapest and entire homes/apartments are the most expensive.
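The numbers behind a bar graph like this come from one more group-by, this time on room type. Again a sketch with invented prices and an assumed `room_type` column name.

```python
import pandas as pd

# Toy listings; the prices are made up for illustration.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room"],
    "price": [15000, 13000, 6000, 8000, 3000],
})

# Average price for each room type.
avg_by_room = listings.groupby("room_type")["price"].mean()
print(avg_by_room)
print("cheapest:", avg_by_room.idxmin(), "| most expensive:", avg_by_room.idxmax())
```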

Third, a walk through Tokyo’s neighborhoods: what percentage of Airbnb accommodations are of each room type?

We make a pie chart in this case.

As the chart shows, the majority of listings are of the Entire home/apt type, and only 7.2% are Shared rooms.
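The shares shown in a pie chart like this can be computed with `value_counts(normalize=True)`. The ten listings below are toy data, so the percentages are not the real ones.

```python
import pandas as pd

# Ten toy listings: 6 entire homes, 3 private rooms, 1 shared room.
room_types = pd.Series(
    ["Entire home/apt"] * 6 + ["Private room"] * 3 + ["Shared room"] * 1,
    name="room_type",
)

# Share of each room type, as a percentage of all listings.
share_pct = room_types.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```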

Fourth, a walk through Tokyo’s neighborhoods: how does the average price of each room type vary across neighborhoods?

So, let’s take a deeper look at the types of listings found in those neighborhoods. Grouping by room type and keeping the neighborhoods in the same order as in the earlier plot, we get the figure below.

In the first plot, western Tokyo areas such as Okutama and Hinohara rent the most expensive entire homes. This is likely because houses in those areas are bigger. Miyake mura tops the list for private rooms. For shared rooms, the downtown areas top the list: they are more expensive and have little space, but are always in demand.
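One way to get a neighborhood by room type breakdown like this is a pivot table. This is only a sketch on toy data with assumed column names; a missing combination simply shows up as NaN.

```python
import pandas as pd

# Toy listings; neighborhoods and prices are invented.
listings = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B", "C", "C"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Shared room",
                  "Entire home/apt", "Private room", "Private room", "Shared room"],
    "price": [10000, 12000, 4000, 30000, 9000, 5000, 2500],
})

# Rows: neighborhoods, columns: room types, cells: mean price.
table = listings.pivot_table(
    index="neighbourhood", columns="room_type", values="price", aggfunc="mean"
)

# Most expensive neighborhood for each room type (NaN cells are skipped).
top = table.idxmax()
print(table)
print(top)
```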

Let’s start with the second part!

Can we make accurate price predictions using our data?

In the dataset, we have access to other features besides the neighborhood that can help in predicting the price of a listing. Some features are categorical, like the room type, and some are quantitative, like the number of reviews.

We drop some columns that are unlikely to affect our predictions, such as longitude and latitude, as well as neighbourhood_group, since neighborhood carries the same information.

There are 2 categorical features, which I chose to map to numeric values since they are not ordinal.

There are 4 quantitative features.
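For non-ordinal categorical features, one common option is one-hot encoding. The sketch below uses `pd.get_dummies` as an illustration and is not necessarily the exact mapping used in this project; I am also assuming the two categorical columns are `neighbourhood` and `room_type`.

```python
import pandas as pd

# Toy feature table; values are invented.
features = pd.DataFrame({
    "neighbourhood": ["A", "B", "A"],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "number_of_reviews": [12, 0, 48],
})

# One 0/1 column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(features, columns=["neighbourhood", "room_type"], dtype=int)
print(encoded.columns.tolist())
```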

Following that, we split our data into training and testing sets with a 70% to 30% ratio and train a linear regression model. The average R2 score was 0.27, which is not a good value. However, with the limited number of features, this was all we could get.

We try three more regression models: Ridge, Lasso, and ElasticNet. However, the R2 score barely changed (0.267 to 0.275). Here are the metrics from the models:
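A 70/30 split followed by a linear regression can be sketched with scikit-learn as below. The data is synthetic (random features with a known linear relationship), so the R2 here will be much higher than the 0.27 reported above; the `random_state` values are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: two features and a target that is linear in them plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training set, score on the held-out set.
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"R2 on held-out data: {r2:.3f}")
```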

Linear Regression Performance Metrics
MAE: 0.430953
RMSE: 0.527597
R2: 0.273096

Ridge Regression Performance Metrics
MAE: 0.430950
RMSE: 0.526881
R2: 0.275067

Lasso Regression Performance Metrics
MAE: 0.431205
RMSE: 0.527856
R2: 0.272381

ElasticNet Regression Performance Metrics
MAE: 0.432859
RMSE: 0.529666
R2: 0.267382
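A comparison like the one above can be produced by looping over the four models and computing the same three metrics for each. This is a sketch on synthetic data, and the `alpha` values are arbitrary assumptions rather than the settings used for the numbers above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (not the Airbnb data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# alpha values are arbitrary, chosen only for illustration.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }
    print(name, {k: round(v, 4) for k, v in results[name].items()})
```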

Hope you enjoyed the walk!
Please post your feedback or comments :)

All code is available in my GitHub repo here.


Written by Sakshi Agarwal

Computer Science Engineering graduate. Specialisation-Python Programming, Javascript, React, three.js
