Music Streaming Service Churn Predictions with PySpark
Udacity Data Scientist Nanodegree is an adventurous, interesting and industry-relevant data science program, and so are its capstone projects. I decided to take on the Spark-based project called “Sparkify”. Sparkify is a fictional music streaming service, but the provided dataset is very realistic.
Completing this project can help us learn how to use Spark MLlib to build machine learning models with large datasets, far beyond what could be done with non-distributed technologies like scikit-learn.
Udacity provides two versions of the dataset: the full 12GB dataset, meant for running a Spark cluster in the cloud on AWS or IBM Cloud, and a small 128MB sample which can be processed directly in the provided workspace.
Customer churn, in general, describes the situation when a user or customer decides to stop using a service or doing business with a company. How churn is identified can vary from service to service. Sparkify treats a user as churned when the Cancellation Confirmation page was visited, which is of course visible in the dataset. Our goal in this project will be to explore and clean the data, create a useful set of features and train a model (a classifier) able to predict a churn situation before it really happens.
The project is divided into 4 main parts:
- Load and Clean Dataset
- Exploratory data analysis
- Feature Engineering
- Modeling
Load and Clean Dataset
Before we load the data, we need to import all the necessary libraries and create a Spark session …
After we load the dataset with spark.read.json(), we should at least preview the dataframe and check the data types to get a better picture of what our next steps will look like.
When it comes to “cleaning steps”, we can get rid of some unneeded columns and of rows missing important values, like labels or user ids. For now I removed only the rows with missing userId values (labels were not defined yet) and a couple of columns which I am not planning to use as features or to build features from.
I removed columns like ‘artist’, ‘song’, ‘id_copy’, ‘firstName’ and ‘lastName’. With a bit more time and additional data sources, however, the music users listen to could also make a good feature, especially if we were able to categorize it as sad, happy, aggressive music etc …
Another batch of preprocessing is applied to the dataframe after feature engineering; that is where we handle categorical variables, replace null values and drop the rows we no longer need.
Exploratory data analysis
Let’s start this part by defining our label and creating the churn column. As suggested, the Cancellation Confirmation event is what we are after. It is very important to realize that each row represents only a single event a user generated at a certain time. Therefore, once we create the “churn” column by extracting the values from the “page” field, we should spread those values across all the sessions of each user. This is also how we will handle new features later in the feature engineering part, because a time will come when we need to get rid of the majority of rows and keep only one row per user. And that time is called Modeling. And if we don’t spread the values across all the user’s rows, there will be weeping and gnashing of teeth, and there shall be rumors of things going astray, and a great confusion as to where the features really are …
So we found out that there are 52 occurrences of churn (out of 225 users) in the small dataset. Now is the time to explore the data in depth.
A column named page is definitely going to be one of our best sources of features (by the way, this is the column we extracted our label from). Events like “Thumbs Down”, “Error” or even “Help” can all indicate churn.
One of the features I can think of would be the increase or decrease in the number of songs played daily per user. We get this number by counting “NextSong” events per day (per user). I decided to collect these numbers into lists, split each list into 4 consecutive quarters and compare the mean of songs played daily in each quarter. This way we end up with 3 values per customer, indicating whether the mean increased or decreased in the next quarter (I stands for increase and D for decrease):
Even though I was expecting more D’s at the end of the churn triplets, I will still use this feature in the modeling part, because in combination with other features it may turn out to be a helpful one.
Another interesting field is userAgent. My first idea was to extract only the operating system, but then I decided to spend a bit more time with it and extract more details. RegexTokenizer was the tool I used initially …
My plan was to create a new feature/column for each word and use 0’s and 1’s as values in each row, based on whether the word is present in the list of tokens or not. The only problem could be perfectly correlated features. In the heatmap below, created from the word matrix of the userAgent field, we can see that some of the words are perfectly correlated with others (the yellow dots beside the diagonal). We’ll need to remove some words accordingly, as we do not want them to slow down the whole feature engineering process or to bias our models.
Feature Engineering
A good set of features is the key to a successful model. There are easy-going features and there are features created by complicated processes. Some of the simple features I used in this project were:
- total sessions per user
- total error events per user
- total help events per user
- average session length
- total songs played by user
- total songs / total days ratio
- total sessions / total days ratio
- number of downgrades per user
Besides these, there are also features usable as they are in the original dataset. The only work we do here is adding a prefix (it is good practice to use the same prefix for all the features we plan to use, so they can be easily extracted later).
It is also important to mention that the classifiers (and probably other model algorithms too) from pyspark.ml require feature vectors; there is no place for categorical string features. But this is a transformation which can be applied just before modeling.
Of the more complicated features I used, I would mention the following:
- Level switch count: how many times a user switched from PAID to FREE mode or vice versa.
- State: the state abbreviation extracted from the location field.
- Tunes played daily, increase/decrease by quarters: based on the experiment from the research part, I created a function which can easily be reused not only for the songs-played increase/decrease but also for Error, Thumbs Down or any other event from page.
- Multiple userAgent features: in this part we tokenized the userAgent words, reduced them so as not to get perfectly correlated features, and for every word/token we created a new feature/column holding the total usage of that word per user. Each word represents a system, a version or even the hardware used by the customer, and it can become a good feature when it comes to predicting churn.
Modeling
This is the part every data scientist must love! With low accuracy or model scores, however, it can easily turn into a nightmare. Fortunately, that was not our case.
Let’s start with the final preprocessing of our dataframe. For this purpose I combined all the necessary steps into a single function.
As mentioned earlier, for modeling we need a dataframe containing one row per user. Extracting these rows in the first step of the function saves us a lot of time, as every other operation is then applied to the reduced dataframe. In the next steps, besides fillna, we extract the bunch of features we plan to use, create the “label” column out of “churn”, convert categorical features to numerical ones (as mentioned above, this is what pyspark.ml classifiers like RandomForest or GradientBoostedTrees expect) and finally split the data into 3 parts: Train, Test & Validation.
Finally we can start with the training! Here is the sequence of our steps:
- Create a dictionary with classifiers as the keys and appropriate paramGrids as values
- Prepare the Vectorizer and Scaler for the pipeline (I decided to use RandomForest and GradientBoostedTrees, for which scaling is not necessary, but we can keep it in the pipeline in case we decide to add LogisticRegression later. Scaling will not do any damage to the results anyway).
- Next we loop over the classifiers and pick the best model of each grid using the F1 score of the MulticlassClassificationEvaluator.
- We save the results and models to a dictionary.
- In the last step we select the best of the best models by comparing the results of each iteration.
Training score is quite high, let’s have a look at the testing one:
The best model selected by the CrossValidator, out of the classifiers provided and their paramGrid possibilities, is the GradientBoostedTrees classifier with maxDepth=7, a max number of bins of 20 and a max number of iterations of 6. Several combinations of parameters were tried before the final (published) paramGrid builder. A depth of 7 is not so large as to overfit the model, but such trees are also not so shallow that they would catch only a few details and end up with poor performance. The number of iterations in boosted trees is how many decision trees (base learners) will be built. The number of bins is the number of partitions the data are split into before the best split point is determined.
We can now validate our best model on the validation dataset. I’ve added a couple more details/metrics to get a better picture of how good our model really is.
Final results & Metrics used
Why did we use the F1 score as the key metric? As we are working with a small dataset, we need a more demanding metric than precision or recall alone. The F-score is calculated from the precision and recall of the test predictions. Precision is the number of correctly identified positive results divided by the number of all positive predictions, including those identified incorrectly; recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive. Our goal is to predict true positives as accurately as possible, while false positive predictions could lead to wasted resources. Therefore the F1 score became our metric.
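In code, the metric boils down to a few lines; the confusion-matrix counts below are hypothetical, not the project’s actual numbers:

```python
def f1_score(tp, fp, fn):
    """F1 from confusion-matrix counts: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # correct churn predictions among all churn predictions
    recall = tp / (tp + fn)     # correct churn predictions among all actual churners
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 14 churners caught, 1 false alarm, 6 churners missed.
print(round(f1_score(14, 1, 6), 2))  # → 0.8
```

Because it is a harmonic mean, F1 punishes a model that buys high recall with many false alarms (or vice versa), which is exactly the balance we want here.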
We achieved a validation F1 score of 0.78. For such a small dataset that’s not bad at all. The fact that we had only one false positive prediction is also important; however, this false positive is what causes the F-score to be lower than the recall.
The metric we decide to use should also be relevant to the methods we apply to prevent customers from churning. If these methods are not too costly, we can consider using recall as our metric, but we should then be prepared for an increased number of false positive predictions. If, on the other hand, we use dramatically expensive methods to retain customers who are about to churn, we should consider using precision, as it helps prevent the system from predicting churn incorrectly and wasting those resources.
We ended up with a model which achieved quite a good validation score (0.78), but we need to take into consideration that we had a limited playground and the results of our model could be lower on the real 12GB dataset.
Improvements and potential improvements
In the first phase of my work there was a bug in page_inc_dec_quarterly_FE and the values (increase/decrease triplets) were biased. This mistake led to significantly lower scores. I had to review the results of the Spark-based steps one by one and fix the window functionality. This confirms how important a single feature can be for the final results.
Still, there is much more we didn’t try. We could improve the results with better feature engineering and better parameter tuning, and there is always the possibility to deploy a pipeline with some sort of ensemble model which uses the predictions of our original model as a feature. This way we could achieve higher train, test and validation scores. Based on my previous experience, such an ensemble could boost our score by at least another 5%.
If we had enough time, a valuable approach would also be to look at the feature importances of our model and try to get a bit more out of the features we recognized as important in our research but which the model didn’t find very useful.
Difficulties and issues
During this project I experienced a couple of Spark-related difficulties:
- Even though we started with the 128MB dataset, processing all the feature engineering steps took quite a long time in local mode.
- Adding new features to the dataframe in a loop can cause trouble due to Spark’s lazy evaluation. Therefore I highly suggest using df.cache() after every x iterations (depending on complexity).
- After I used the randomSplit functionality twice in a row (train-test-validation split), I realized that the volumes were not as expected. df.cache() was the saviour even in this case!
Conclusion
- The goal of this project was to train a model able to successfully predict user churn before it really happens.
- We loaded the small dataset, previewed and cleaned the data (corrupted rows and unneeded columns were removed).
- We created a churn (label) column out of suggested event logs.
- Then we moved on to data analysis, exploration and experiments related to new feature ideas.
- The next step was feature engineering, to me the most brain-consuming part, as working with Spark is a bit more challenging than with pandas and I had some crazy feature ideas I wanted to implement.
- Modeling on its own was not too difficult, though it took quite a lot of time as I started with too many parameters and models. I ended up with RandomForest and GradientBoostedTrees and a couple of parameters in the paramGrid builder.
- The score achieved on validation, and the fact that we predicted only one false positive, could be considered a success, and I believe it could add value to a business similar to Sparkify. However, there is always room for improvement, especially in data science.