Music Streaming Service Churn Predictions with PySpark

Udacity Data Scientist Nanodegree Project


The Udacity Data Scientist Nanodegree is an adventurous, interesting and industry-relevant data science program, and so are its capstone projects. I decided to take on the Spark-based project called “Sparkify”. Sparkify is supposed to be a fictional music streaming service; however, the provided dataset is very realistic.

Completing this project can help us learn how to use Spark MLlib to build machine learning models with large datasets, far beyond what could be done with non-distributed technologies like scikit-learn.

Udacity provides two versions of the dataset: the full dataset (12 GB), meant for trying a Spark cluster in the cloud using AWS or IBM Cloud, and a small sample (128 MB) which can be processed directly in the provided workspace.

Customer churn, in general, represents a situation when a user or customer decides to stop using a service or doing business with a company. How churn is identified can vary from service to service. Sparkify treats a user as churned once the Cancellation Confirmation page was visited, which is, of course, visible in the dataset. Our goal in this project will be to explore and clean the data, create a useful set of features and train a model (a classifier) able to predict a churn situation before it really happens.

The project is divided into 4 main parts:

1. Load and Clean Dataset
2. Exploratory Data Analysis
3. Feature Engineering
4. Modeling

Load and Clean Dataset

Before we load the data, we need to import all the necessary libraries and create a Spark session …

In the notebook I create a cloud-style Spark session with some relevant configs. However, our workspace runs in local mode only, and some of the settings seem to be simply ignored while the Spark session is created without any errors (interesting).
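
A minimal sketch of such a session setup; the config key below is illustrative, not the exact one from the notebook:

```python
from pyspark.sql import SparkSession

# Illustrative settings; in the Udacity workspace Spark runs in local
# mode and some of these configs are silently ignored.
spark = (SparkSession.builder
         .appName("Sparkify")
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())
```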

After we load the dataset with spark.read.json(), we need to at least preview the dataframe and see the data types to get a better picture of what our next steps will look like.

Pandas can always be our little helper …
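
A sketch of the loading and preview step (the file name assumes the small workspace sample; adjust the path for your setup):

```python
# Load the event log and inspect the schema plus a few rows.
df = spark.read.json("mini_sparkify_event_data.json")
df.printSchema()           # column names and data types
df.limit(5).toPandas()     # pandas renders a small preview nicely
```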

When it comes to “cleaning steps”, we can get rid of some unneeded columns and of rows missing important values, like labels or user ids. For now I removed only the rows with missing userId values (as labels were not defined yet) and a couple of columns which I’m not planning to use as features or to create features from.


I removed columns like ‘artist’, ‘song’, ‘id_copy’, ‘firstName’ and ‘lastName’. However, given a bit more time and additional data sources, the music users listen to could also be a good feature, especially if we were able to categorize it as sad, happy, aggressive and so on …
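
A sketch of that cleaning step, assuming the column names mentioned above:

```python
from pyspark.sql import functions as F

# Keep only rows with a usable userId, then drop the columns we won't
# turn into features.
df = df.filter(F.col("userId").isNotNull() & (F.col("userId") != ""))
df = df.drop("artist", "song", "id_copy", "firstName", "lastName")
```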

There is another batch of preprocessing actions applied to the dataframe after feature engineering: that’s where we handle categorical variables, replace null values and remove the leftover rows.

Exploratory data analysis

Let’s start this part by defining our label and creating the churn column. As suggested, the Cancellation Confirmation event is what we are after. It is very important to realize that each row represents only a single event a user generated at a certain time. Therefore, once we create the “churn” column and extract its values from the “page” field, we should spread those values across all the sessions of each user. This is also the way we’ll handle new features later, in the feature engineering part. For there will come a time when we will need to get rid of the majority of rows and keep only one row per user. And that time is called Modeling. And if we don’t spread the values across all the user’s rows, there shall be weeping and gnashing of teeth. And there shall in that time be rumors of things going astray, and there shall be a great confusion as to where the features really are …

The Time of Modeling is near!
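
A sketch of that spreading step: flag the cancellation event, then propagate the flag over a per-user window:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1 on the Cancellation Confirmation row, 0 elsewhere ...
df = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0))

# ... then spread the flag across all rows of each user, so the label
# survives the later "one row per user" reduction.
user_window = Window.partitionBy("userId")
df = df.withColumn("churn", F.max("churn_event").over(user_window))
```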

So we found out that there are 52 occasions of churn (out of 225 users) in the small dataset. Now it is time to explore the data in depth.
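
The count itself is a one-liner once churn is spread per user:

```python
# One row per user, then churned vs. active counts.
df.select("userId", "churn").dropDuplicates().groupBy("churn").count().show()
```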


A column named page is definitely going to be one of our best sources of features (by the way, this is the column we extracted our label from). Events like “Thumbs Down”, “Error” or even “Help” can all indicate churn.
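
A quick way to see the distribution of page events:

```python
# Page event counts, most frequent first.
df.groupBy("page").count().orderBy("count", ascending=False).show(truncate=False)
```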


Another feature I can think of would be the increase or decrease in the number of songs played daily per user. We get this number by counting “NextSong” events per day (per user). I decided to collect these numbers into lists, split each list into 4 consecutive quarters and compare the mean number of songs played daily in each quarter. This way we end up with 3 values per customer, indicating whether the mean increased or decreased in the following quarter (I stands for Increase and D for Decrease):
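
A sketch of how such a feature could be computed; the inc_dec_triplet helper and the handling of users with very few active days are my assumptions, not necessarily the notebook’s exact logic:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Daily "NextSong" counts per user (ts is a millisecond epoch timestamp).
daily = (df.filter(F.col("page") == "NextSong")
           .withColumn("day", F.to_date((F.col("ts") / 1000).cast("timestamp")))
           .groupBy("userId", "day").count())

# Collect the counts into one date-ordered list per user.
per_user = (daily.groupBy("userId")
                 .agg(F.sort_array(F.collect_list(F.struct("day", "count")))
                       .alias("pairs"))
                 .withColumn("counts", F.col("pairs.count")))

@F.udf(StringType())
def inc_dec_triplet(counts):
    """Split the daily counts into 4 quarters, compare consecutive means."""
    if not counts or len(counts) < 4:
        return "DDD"                     # assumption: too little data
    q = len(counts) // 4
    quarters = [counts[i * q:(i + 1) * q] for i in range(3)] + [counts[3 * q:]]
    means = [sum(c) / len(c) for c in quarters]
    return "".join("I" if b > a else "D" for a, b in zip(means, means[1:]))

per_user = per_user.withColumn("songs_trend", inc_dec_triplet("counts"))
```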


Even though I was expecting more D’s at the end of the churned users’ triplets, I will still use this feature in the modeling part, because in combination with other features it can turn out to be a helpful one.

Another interesting field seems to be userAgent. My first idea was to extract only the operating system, but then I decided to spend a bit more time with it and extract more details. RegexTokenizer was the tool I used initially …
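
For reference, a minimal RegexTokenizer setup (the split pattern is my assumption):

```python
from pyspark.ml.feature import RegexTokenizer

# Split the userAgent string into lowercase word tokens.
df = df.fillna({"userAgent": ""})   # RegexTokenizer chokes on nulls
tokenizer = RegexTokenizer(inputCol="userAgent", outputCol="ua_tokens",
                           pattern="[^a-zA-Z0-9]+")
df = tokenizer.transform(df)
```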


My plan was to create a new feature/column for each word and use 0’s and 1’s as values in each row, based on whether the word is present in the list of tokens or not. The only problem could be perfectly correlated features. In a heatmap created out of the userAgent word matrix, we could see that some of the words are perfectly correlated with others (yellow dots beside the diagonal). We’ll need to remove some words accordingly, as we do not want them to slow down the whole feature engineering process, and we do not want them to bias our models.
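
A sketch of the word-matrix idea, using a pandas correlation matrix to spot the perfectly correlated words (the ua_ column names are my assumptions):

```python
import numpy as np
from pyspark.sql import functions as F

# One 0/1 column per distinct userAgent token.
words = [r["w"] for r in
         df.select(F.explode("ua_tokens").alias("w")).distinct().collect()]
for w in words:
    df = df.withColumn("ua_" + w, F.array_contains("ua_tokens", w).cast("int"))

# Correlation matrix of the word columns (small enough for pandas;
# visualize it e.g. with seaborn.heatmap).
word_cols = ["ua_" + w for w in words]
corr = df.select(word_cols).toPandas().corr()

# Drop one column from every perfectly correlated pair.
mask = np.triu(np.abs(corr.values) > 0.999, k=1)
redundant = {corr.columns[j] for i, j in zip(*mask.nonzero())}
df = df.drop(*redundant)
```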


Feature Engineering

A good set of features is the key to a successful model. There are easy-going features, and there are features created by complicated processes. Some of the simple features I used in this project were:

Besides those, there are also features usable exactly as they appear in the original dataset. The only work we do here is adding a prefix (it is good practice to use the same prefix for all the features we plan to use, so they can be easily extracted later).
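
A tiny sketch of the prefix convention (the “FE_” prefix and the column choices are my placeholders):

```python
# Prefix ready-to-use columns so all model features share one namespace.
df = df.withColumnRenamed("gender", "FE_gender")
df = df.withColumnRenamed("level", "FE_level")

feature_cols = [c for c in df.columns if c.startswith("FE_")]
```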

It is also important to mention that the classifiers (and probably other model types as well) from pyspark.ml require feature vectors: there is no place for categorical string features. But this is a transformation which can be applied just before modeling.
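
A sketch of that transformation, assuming the prefixed columns above: index the string categories, then assemble everything into the single vector column pyspark.ml expects:

```python
from pyspark.ml.feature import StringIndexer, VectorAssembler

# String category -> numeric index.
indexer = StringIndexer(inputCol="FE_gender", outputCol="FE_gender_idx")
df = indexer.fit(df).transform(df)

# Numeric columns -> one "features" vector per row.
numeric_cols = ["FE_gender_idx"]  # plus every other numeric FE_ column
assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features")
df = assembler.transform(df)
```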

From the more complicated features I used, I would mention the following: the quarterly increase/decrease triplets of daily played songs and the 0/1 word columns extracted from userAgent.

Modeling

This is the part every data scientist must love! However, with low accuracy or model scores it can easily turn into a nightmare. Fortunately, that was not our case.

Let’s start with the final preprocessing of our dataframe. For this purpose I have combined all the necessary steps into one function called preprocess_and_split.


As mentioned earlier, for modeling we need a dataframe containing one row per user. Extracting these rows in the first step of the function saves a lot of time, as every other operation is applied to the reduced dataframe. In the next steps, besides fillna, we extract the bunch of features we plan to use, create the “label” column out of “churn”, convert categorical features to numerical ones (as mentioned above, this is what pyspark.ml classifiers like RandomForest or GradientBoostedTrees expect) and finally split the data into 3 parts: train, test and validation.
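
A condensed sketch of what preprocess_and_split does, under the assumptions above (FE_ prefix, churn already spread per user):

```python
from pyspark.sql import functions as F

def preprocess_and_split(df, seed=42):
    """One row per user, nulls filled, label created, 3-way split."""
    users = df.dropDuplicates(["userId"])     # one row per user first
    users = users.fillna(0)                   # simple null handling
    feats = [c for c in users.columns if c.startswith("FE_")]
    users = users.select(*feats, F.col("churn").alias("label"))
    # (categorical indexing / vector assembly would happen here as well)
    return users.randomSplit([0.7, 0.15, 0.15], seed=seed)

train, test, validation = preprocess_and_split(df)
```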

Looks like we have an appropriate sample of “churn” occasions in each of the datasets.

Finally we can start with the training! Here is the sequence of our steps:
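
A condensed sketch of that sequence, assuming the splits from preprocess_and_split and an assembled features column:

```python
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Gradient-boosted trees on the training split, scored with F1.
gbt = GBTClassifier(labelCol="label", featuresCol="features", seed=42)
model = gbt.fit(train)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1")
print("train F1:", evaluator.evaluate(model.transform(train)))
print("test F1:",  evaluator.evaluate(model.transform(test)))
```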

We achieved a 0.94 F1 training score with GradientBoostedTrees.

The training score is quite high; let’s have a look at the testing one:

The F1 test score is 0.91. Not bad at all.

Hyperparameter tuning

The best model selected by the CrossValidator, out of the provided classifiers and their paramGrid possibilities, is the GradientBoostedTrees classifier with maxDepth=7, maxBins=20 and maxIter=6. Several combinations of parameters were tried before the final (published) paramGrid builder. A maximum depth of 7 is not so large that the model overfits, yet the trees are not so shallow that they catch only a few details and end up with poor performance. The number of iterations in boosted trees determines how many decision trees (base learners) will be built. The number of bins is the number of partitions the data is split into before the best split point is determined.
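
A sketch of a CrossValidator setup built around the winning values, reusing gbt and the evaluator from the training sketch above (the actual published grid may differ):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = (ParamGridBuilder()
        .addGrid(gbt.maxDepth, [5, 7, 9])
        .addGrid(gbt.maxBins, [20, 32])
        .addGrid(gbt.maxIter, [6, 10])
        .build())

cv = CrossValidator(estimator=gbt, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
best_model = cv.fit(train).bestModel
```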

We can now validate our best model on the validation dataset. I’ve added a couple more details/metrics to get a better picture of how good our model really is.
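
Besides the F1 score, a simple confusion-matrix style breakdown gives that bigger picture (reusing best_model and the evaluator from above):

```python
# Predictions on the held-out validation set.
preds = best_model.transform(validation)
preds.groupBy("label", "prediction").count().show()   # confusion matrix
print("validation F1:", evaluator.evaluate(preds))
```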

The validation F1 score is 0.78.

Final results & Metrics used

Why did we use the F1 score as the key metric? As we are working with a small dataset, we need a more demanding metric than precision or recall alone. The F-score is calculated from the precision and the recall of the test. Precision is the number of correctly identified positive results divided by the number of all positive results, including those identified incorrectly, while recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive. Our goal is to predict true positives as accurately as possible, while false positive predictions could lead to wasted resources. Therefore the F1 score, which balances both, became our metric.
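
For reference, with precision P and recall R, the F1 score is their harmonic mean:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$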

We achieved a validation F1 score of 0.78. For such a small dataset that’s not bad at all. It also matters that we had only one false positive prediction; however, this false positive is what causes the F-score to be lower than the recall.

The metric we decide to use should also be relevant to the methods we apply to prevent customers from churning. If these methods are not too costly, we can lean towards recall and accept an increased number of false positive predictions, since acting on them is cheap. If, on the other hand, we use dramatically expensive methods to retain customers who are about to churn, we should consider precision as our metric, as it helps prevent the system from predicting churn incorrectly and wasting resources.

Conclusions

We ended up with a model which achieved quite a good validation score (0.78), but we need to take into consideration that we had a limited playground, and the results of our model could be lower on the real 12 GB dataset.

Improvements and potential improvements

In the first phase of my work there was a bug in page_inc_dec_quarterly_FE, and the values (increase/decrease triplets) were biased. This mistake led to significantly lower scores. I had to review the results of the Spark-based steps one by one and fix the window functionality. This confirms how important individual features can be for the final results.

Still, there is much more we didn’t try. We can improve the results with better feature engineering and better parameter tuning, and there is always the possibility to deploy a pipeline with some sort of ensemble model which uses the predictions of our original model as a feature. This way we could achieve higher train, test and validation scores. Based on my previous experience, such an ensemble could boost our score by at least another 5%.

If we had enough time, a valuable approach would also be to look at the feature importances of our model and try to get a bit more out of the features we recognized as important in our research but the model didn’t find very useful.

Difficulties and issues

During this project I experienced a couple of Spark-related difficulties:

One of them involved df.cache() used in a loop.
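
A sketch of the pattern in question; the fix is my reading of the line above, namely caching each iteration so Spark doesn’t re-evaluate the ever-growing lineage (the loop body and categorical_cols are illustrative):

```python
from pyspark.sql import functions as F

# Repeatedly transforming the same dataframe in a loop makes its lineage
# grow; without caching, every action re-computes the whole chain.
for c in categorical_cols:          # hypothetical list of columns
    df = df.withColumn(c + "_filled", F.coalesce(F.col(c), F.lit("unknown")))
    df = df.cache()                 # cut the lineage each iteration
```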

E2E Summary

From raw event logs to a churn classifier: we loaded and cleaned the Sparkify events, spread the churn label across each user’s sessions, engineered per-user features, reduced the data to one row per user, and trained a tuned GradientBoostedTrees model, reaching a 0.78 validation F1 score.
