Sparkify is a (fictional) digital music service similar to Spotify or Pandora. Many users stream songs from the service every day, either on the free tier, which places advertisements between songs, or on the premium subscription, which removes the advertisements in exchange for a flat monthly fee. Users can upgrade, downgrade or cancel their service at any time, so it is important that they love the service.
Every time a user interacts with the service, for example by playing songs, logging out, liking a song or downgrading the service, it generates data. The purpose of this project is to use this data to predict which users are at risk of churning, that is, of deleting their accounts, since retaining them can potentially save the company considerable revenue.
The following libraries are needed to run the notebook:
- numpy
- pandas
- time
- datetime
- matplotlib
- seaborn
- pyspark
The notebook contains the analysis and the development of the machine learning models that predict user churn in the service.
This is the data that will be used for the analysis and training of the model:
- mini_sparkify_event_data.json.zip: dataset with the data collected in the service
After training five different models, the accuracy and F1 score obtained for each one are shown in the following table:
| Model name | Accuracy | F1 score | Training time (min:sec) |
|---|---|---|---|
| Random Forest | 0.800000 | 0.736508 | 04:34.036536 |
| Logistic Regression | 0.885714 | 0.860829 | 04:08.942884 |
| Decision Tree | 0.800000 | 0.805938 | 04:47.750021 |
| Gradient Boosted Trees | 0.685714 | 0.717108 | 04:24.014526 |
| LinearSVC | 0.828571 | 0.750893 | 04:04.784299 |
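As a reminder of what the two metric columns measure, here is a minimal plain-Python sketch of accuracy and F1. The labels are hypothetical toy data, not the project's predictions, and the README does not state whether the reported F1 is the binary or the class-weighted variant, so both are shown.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score (harmonic mean of precision and recall) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class frequency."""
    counts = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * n / total
               for c, n in counts.items())


# Hypothetical toy labels: 1 = churned, 0 = stayed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("F1 (churn class):", f1_for_class(y_true, y_pred, 1))
print("F1 (weighted):", weighted_f1(y_true, y_pred))
```

Because churned users are usually the minority class, accuracy and F1 can rank models differently, which is why the table reports both.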
A comparison of these results can be seen in the following pictures:
The model with the best result is Logistic Regression, with an accuracy of 0.886 and an F1 score of 0.861. The features with the most impact are registration_min, errors, friend, played_time_session, avg_songs_session and thumbs_down, as the picture below depicts.
For these features, the higher the value, the more likely the user is to stay in the service and not churn.
To improve the model in the future, it would be good to try a bigger dataset, so that the models have more data to learn from, and to use more parameters in Grid Search to tune the models.
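The direction of this effect can be illustrated with a small sketch of how a logistic regression turns feature values into a churn probability. It assumes the label 1 means "churned" and uses made-up weights and a made-up bias; these are not the fitted values from the notebook. Under that assumption, a feature that keeps users in the service carries a negative weight.

```python
import math


def churn_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and bias, NOT the fitted values from the notebook.
weights = {"friend": -0.8, "thumbs_down": -0.3}
bias = 0.5

# Same hypothetical user with low and with high (scaled) feature values.
low = churn_probability({"friend": 0.0, "thumbs_down": 0.0}, weights, bias)
high = churn_probability({"friend": 2.0, "thumbs_down": 1.0}, weights, bias)

print(f"churn probability: {low:.3f} -> {high:.3f}")
```

With negative weights, raising the feature values lowers the predicted churn probability, which matches the behaviour described above.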
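To give a sense of what a wider grid costs, the sketch below enumerates a parameter grid in plain Python. The parameter names mirror those of Spark's LogisticRegression, but the values and the number of folds are assumptions chosen for illustration, and the code does not call the Spark tuning API.

```python
from itertools import product

# Hypothetical grid: values are illustrative, not taken from the notebook.
param_grid = {
    "regParam": [0.0, 0.01, 0.1],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [10, 50],
}


def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


combinations = expand_grid(param_grid)
num_folds = 3  # assumed number of cross-validation folds
total_fits = len(combinations) * num_folds

print(len(combinations), "combinations,", total_fits, "model fits")
```

Each additional parameter multiplies the number of fits, so a wider grid trades training time for a more thorough search.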


