Can Machine Learning be a revenue booster in your AirBnB business?
In this blog we will look at some basic approaches that can provide clues and inspiration for using data science and machine learning techniques to improve an existing AirBnB business, or to start a profitable one.
The Seattle AirBnB homes data-set, which we decided to use to demonstrate research and analytical techniques, can be found at the link below.
The following three questions are not the crucial questions to answer in order to employ data science to increase our revenue. However, they can help you find your own crucial questions, and they are good examples for demonstrating some technical solutions while trying to obtain the answers.
1. Can we train a model which could predict the rating with MAE < 10?
2. Can we identify a useful set of features with meaningful impact on target?
3. How do prices of apartments vary by number of beds available?
Having a good plan means seeing the light at the end of the tunnel before we start the work.
CRISP-DM stands for Cross Industry Standard Process for Data Mining. In our research we will try to follow the steps of this methodology (at least in order to answer the first two questions).
The Business Understanding is a very important part of the process and ignoring this step can lead to a lot of wasted time or completely ruined projects.
This step usually includes talking to SMEs and business owners / business experts, and reading articles about the business; many times it can overlap with the Data Understanding phase, as the data-set fields can give us clues about what to ask.
I was a bit lucky in this case because some time ago I earned an AirBnB superhost badge, so I knew how the business runs and what to expect. However, for those who are not familiar with AirBnB hosting, I would recommend starting here:
In Data Understanding part we preview the data and focus on key points like:
- Basic data characteristics like shape and column data types
- Column names
- Missing values across the rows & columns
- Distribution and count of values in the columns
- Having a closer look at the target and some other interesting and/or suspicious columns (features)
- Visualizing the data if needed
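The preview steps above can be sketched with pandas; the tiny frame and its columns below are hypothetical stand-ins for the real listings.csv:

```python
import pandas as pd

# Tiny hypothetical frame standing in for listings.csv
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$85.00", "$120.00", None, "$60.00"],
    "review_scores_rating": [95.0, None, 88.0, 100.0],
})

print(df.shape)                          # basic shape: (rows, columns)
print(df.dtypes)                         # column data types
missing_per_column = df.isnull().sum()   # missing values per column
print(missing_per_column)
print(df["review_scores_rating"].describe())  # closer look at the target
```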
For this project I did not need to install any libraries beyond the Anaconda distribution of Python (Python version 3.6).
When downloading the data you will receive several files. In our case, we have used only `listings.csv`.
We loaded the CSV file (CSV stands for comma-separated values) into a data-frame using pandas without any specific parameters (defaults only).
There were 3818 rows and 92 columns in the raw data-set.
62 columns were of object type (text or date and numbers stored as text), 30 columns were of numeric type.
In the first preview I realized that we need to get rid of some text fields to be able to preview the data easily. However, in a real-life project I would recommend having a closer look at these text fields too, as they can turn into key features if correct preprocessing & feature engineering is applied. A couple of text & URL fields that are not useful for research & model training were identified and listed for later removal.
In the next steps we prepared a list of columns that have more than 90% missing values, as well as a list of single-value columns and high-cardinality fields.
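A minimal sketch of how these candidate lists could be built, using hypothetical columns (the 90% cutoff comes from the text; the 50%-unique rule for high cardinality is my own illustrative choice):

```python
import pandas as pd

# Hypothetical frame illustrating the three problem cases
df = pd.DataFrame({
    "license": [None] * 10,                         # (almost) entirely missing
    "country": ["US"] * 10,                         # single value
    "host_name": [f"host_{i}" for i in range(10)],  # high cardinality
    "room_type": ["Entire home", "Private room"] * 5,
})

# Columns with 90% or more missing values
missing_cols = [c for c in df.columns if df[c].isnull().mean() >= 0.9]
# Columns holding a single value (or none at all)
single_value_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
# Categorical columns where most values are unique
high_card_cols = [c for c in df.select_dtypes(include="object").columns
                  if df[c].nunique() > 0.5 * len(df)]
```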
What is high-cardinality?
In the data science world, cardinality refers to the number of unique values contained in a particular column, or field, of a database / data-frame. So if we are about to train a model on a data-set and some categorical columns contain a high number of unique values, we should consider whether to remove these fields, or whether to reduce the number of unique values by clipping (replacing unimportant values with one default value). Actions like this can have a positive impact on our results.
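One possible way to clip rare categories, sketched on a hypothetical neighbourhood column (the keep-threshold of 3 occurrences is an arbitrary illustration):

```python
import pandas as pd

# Hypothetical high-cardinality categorical column
neigh = pd.Series(["Capitol Hill"] * 5 + ["Ballard"] * 4
                  + ["Fremont", "Queen Anne"])

counts = neigh.value_counts()
keep = counts[counts >= 3].index                        # frequent categories
reduced = neigh.where(neigh.isin(keep), other="Other")  # clip the rest
```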
Before we move to the Data Preparation part, we can also try to find perfectly correlated columns (features) and add some of them to the removal list, as having groups of perfectly (or highly) correlated columns in a data-set can slow down the training process and, what's worse, lead to a biased model.
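Correlated columns can be hunted down with a correlation matrix; here is a small sketch on synthetic data (the 0.99 threshold is an illustrative choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0,          # perfectly correlated with "a"
    "b": rng.normal(size=100),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
```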
This part is a bit more fun, as all the struggling with unreadable data is over and we should already have a plan for how to make our data-set nice and tidy.
Data preparation for our project has two parts:
- Feature engineering
- Data preprocessing
In the Feature engineering section we replace the date object fields with timestamps and convert currency fields (object) into float type fields. (Feel free to try more advanced ways of feature engineering if possible.)
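A rough sketch of these two conversions; the column names and formats below are hypothetical examples (and the date is turned into days since the epoch rather than a raw timestamp):

```python
import pandas as pd

# Hypothetical raw fields: a date stored as text and a currency string
df = pd.DataFrame({
    "host_since": ["2015-03-01", "2016-07-15"],
    "price": ["$85.00", "$1,200.00"],
})

# Date text -> numeric (days since epoch), usable by a model
df["host_since"] = (pd.to_datetime(df["host_since"])
                    - pd.Timestamp("1970-01-01")).dt.days
# Currency text -> float
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
```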
In Data preprocessing the unwanted columns are dropped, missing values in numeric columns are filled with the appropriate mean value, missing values in categorical fields are replaced with an empty string, and last but not least, the rows with missing target values are removed.
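These preprocessing steps might look roughly like this; the function and the toy frame are illustrative sketches, not the project's actual code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "listing_url": ["http://x/1", "http://x/2", "http://x/3"],  # unwanted field
    "bedrooms": [1.0, np.nan, 3.0],
    "room_type": ["Entire home", None, "Private room"],
    "review_scores_rating": [95.0, 88.0, np.nan],               # target
})

def preprocess(df, drop_cols, target):
    df = df.drop(columns=drop_cols)          # drop unwanted columns
    df = df.dropna(subset=[target])          # remove rows with missing target
    num_cols = df.select_dtypes(include="number").columns.drop(target)
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())  # mean-fill numeric
    cat_cols = df.select_dtypes(include="object").columns
    df[cat_cols] = df[cat_cols].fillna("")   # empty string for categoricals
    return df

clean = preprocess(df, ["listing_url"], "review_scores_rating")
```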
What is the target? The target variable is the feature of a data-set about which you want to gain a deeper understanding. A supervised machine learning algorithm uses historical data to learn patterns and uncover relationships between the other features of your data-set and the target. In this project, the ‘rating’ mentioned in question 1 is our target. The name of this feature (column) in the AirBnB data-set is actually ‘review_scores_rating’.
After the Feature engineering and Data preprocessing functions were applied to the data-frame, we ended up with 3171 rows and 42 columns only.
Modeling & Evaluation
This might be the final step in order to answer question 1. We train the model and calculate the score which in our case is represented by mean absolute error.
However, there are some preparation steps to be done before we can fit (train) the model and evaluate it:
- Split data-frame into label & features.
- Dummy the categorical variables
- Split data into train & test
Not sure what dummy variables are? Categorical variables can be used directly in some machine learning classification algorithms, but they should be decomposed into dummy variables if possible. A dummy variable is a binary variable coded as 1 or 0 to represent the presence or absence of a given category. When it comes to a regression algorithm, categorical variables definitely need to be turned into numerical ones. There are several approaches to such numerical encoding. In our case we use one-hot encoding, implemented by the pandas get_dummies function.
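Putting the three preparation steps together with pandas get_dummies and scikit-learn's train_test_split, on a toy frame (all column names are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "bedrooms": [1, 2, 3, 1, 2, 3, 1, 2],
    "room_type": ["Entire home", "Private room"] * 4,
    "review_scores_rating": [90, 85, 95, 88, 92, 97, 80, 91],
})

y = df["review_scores_rating"]                   # label (target)
X = df.drop(columns=["review_scores_rating"])    # features
X = pd.get_dummies(X, columns=["room_type"])     # one-hot encode categoricals
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```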
Finally we can select the right algorithm, fit (train) our model and evaluate it. The decision was made to use GradientBoostingRegressor.
Why a Regressor? The main difference between regression and classification algorithms is that regression algorithms are used to predict continuous values such as price, salary or age, while classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam. Guess where ‘review_scores_rating’ belongs.
And how did it go with the mean absolute error of our predictions?
Since our MAE is just over 4, the answer is obviously “Yes, we can train a model which could predict review_scores_rating with MAE < 10.”
However, I am fairly sure that with better feature selection and more advanced feature engineering we could get even better results. I also have to mention that the “Evaluation” we use is very simple: basically we only use the test MAE score.
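For completeness, a self-contained fit-and-evaluate sketch following the same GradientBoostingRegressor + MAE recipe, run on synthetic data rather than the AirBnB data-set:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 5 features, mostly linear target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))  # evaluation score
```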
The last part of the CRISP-DM process should be “Deployment”. This is more applicable in a real-life project, but let’s consider the solution to the next question our Deployment.
Question 2: Can we identify a useful set of features with meaningful impact on target?
There are various data-science approaches for identifying important features. But whichever we choose, the best idea is to combine the approach with common sense and hands-on experience (if possible).
I decided to use the impurity-based ‘feature_importances_’ attribute of GradientBoostingRegressor.
The pros are fast calculation and easy retrieval, but the con is that it is a biased approach: it has a tendency to inflate the importance of continuous features and high-cardinality categorical variables. This is why we should use our common sense too, and not just extract the top-rated features and focus on them.
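Extracting and ranking the importances could look like this; the data is synthetic, with the target deliberately driven by one feature:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "number_of_reviews": rng.integers(0, 300, size=200),
    "price": rng.normal(100, 30, size=200),
    "noise": rng.normal(size=200),
})
# Synthetic target that depends mostly on number_of_reviews
y = 80 + 0.05 * X["number_of_reviews"] + rng.normal(scale=1.0, size=200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False)  # ranked features
```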
Getting the appropriate information from the feature importance chart is not just about selecting the top n features. Now is the time to use common sense too (and let’s support it with a heatmap):
The first feature in the chart is “number_of_reviews”. It’s obvious that this feature correlates with “host_since” and that it is important. More reviews mean more experience and also a more stable rating.
Let’s have a look at the second feature, “neighbourhood_cleansed”. This is a categorical variable, so I decided to use a boxplot for the top 20 areas. Even though it is not clear why there are such differences in review_scores_rating across the areas, the plot can help us decide whether to buy a property in a given area (if we have that possibility), or whether we should be more careful (and do some deeper research) in order to gain good ratings if we already offer a property in an area like, for example, the University District.
Now we can skip a couple of features and have a look at the prices. Let’s say we have a 2-bed apartment and we need to set a price. Someone told us that a lower price will help us gain better ratings. However, from a simple scatterplot (where x=target and y=price) we cannot really tell whether that person was right or wrong:
So a function was written which creates a scatterplot out of a dataframe grouped by target, with the median aggregate function applied to ‘price’. An upgrade (compared with the previous plot) is also the removal of outliers:
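A sketch of such a function, computing the per-rating price medians after trimming extreme prices (the 95th-percentile cutoff is an illustrative choice, and the actual plotting call is omitted):

```python
import pandas as pd

# Toy data: ratings with prices, including one extreme outlier
df = pd.DataFrame({
    "review_scores_rating": [80, 80, 90, 90, 100, 100, 100],
    "price": [70.0, 75.0, 90.0, 95.0, 110.0, 120.0, 5000.0],
})

def median_price_by_rating(df, upper_quantile=0.95):
    cap = df["price"].quantile(upper_quantile)
    trimmed = df[df["price"] <= cap]     # drop extreme prices (outliers)
    return trimmed.groupby("review_scores_rating")["price"].median()

medians = median_price_by_rating(df)     # this is what the scatterplot shows
```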
So the previous chart shows that lower prices have no impact on higher ratings. On the contrary, higher score ratings come with higher price medians.
We don’t need to review all the features from the feature importance list to be able to answer question 2:
Yes, we can identify a useful set of features with meaningful impact on the target; however, a deeper investigation of how each feature impacts the target is necessary.
Question 3: How do prices of apartments vary by number of beds available?
As we have a wide range of values (1 to 15) in the “beds” field, a simple clipping function was applied to reduce the range to only four categories.
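One simple way to do this clipping with pandas (the category labels are just an illustration):

```python
import pandas as pd

# Hypothetical "beds" values ranging from 1 to 15
beds = pd.Series([1, 2, 3, 5, 8, 15, 2, 4])

# Clip the wide range into four categories: 1, 2, 3 and "4+"
clipped = beds.clip(upper=4).astype(str).replace("4", "4+")
```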
The boxplot above is the answer to question 3. We can see that there is some price overlap (even between the 1-bed and 4+ bed apartments). On the other hand, each category has an at least slightly higher median than the previous one. Categories 3 and 4+ have nearly identical medians, but the minimum of 4+ is somewhere around 100 while the minimum of category 3 is close to 60. It is interesting that the first quartile of the 4+ category is actually lower than that of category 3.
We can train a model with the requested score, but the question is how to apply it in real life and how to benefit from it. At the moment it looks like the feature_importances_ attribute of our model is much more valuable than the prediction itself.
The feature importances can help us identify a useful set of features with meaningful impact on ‘review_scores_rating’, but we need to support these findings with additional research and common sense.
The number of beds in an apartment affects the price meaningfully, but there are many cases where even a 1-bed apartment is more expensive than a 4+ bed apartment. So it’s obvious that there are other aspects too.
Overall, data science with machine learning can be a revenue booster in our AirBnB business. It can help us make the right decisions, whether we want to support an existing hosting business or start a new one, and whether we already have a property or are planning to buy one. We just need to customize our business questions, focus on the appropriate data, do deep research and try various approaches.
All the related code and details can be found on Github: data-science blog.