pedflotor/Sparkify-Churn-Prediction

Churn Prediction Project

Sparkify is a fictional digital music service similar to Spotify or Pandora. Many users stream songs from the service every day, either on the free tier, which places advertisements between songs, or on the premium subscription, which removes the advertisements in exchange for a monthly fee. Users can upgrade, downgrade or cancel their service at any time, so it is important that they love the service.

Every time a user interacts with the service, for example by playing a song, logging out, liking a song or downgrading, the service generates data. The purpose of this project is to use this data to predict which users are at risk of churning, that is, of deleting their accounts. Identifying these users early can potentially save the company considerable revenue.
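As an illustration of how a churn label could be derived from such interaction data, here is a minimal sketch in plain Python. The field names `userId` and `page`, the sample events, and the page value `Cancellation Confirmation` are assumptions made for this example; the notebook may define churn differently and works with pyspark DataFrames rather than lists of dictionaries.

```python
from collections import defaultdict

# Hypothetical event log: one dict per user interaction.
# Field names and page values are assumptions for illustration only.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Downgrade"},
    {"userId": "2", "page": "Cancellation Confirmation"},
    {"userId": "3", "page": "Logout"},
]

# Assumed marker of an account deletion.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 if the user ever reached the churn page, else 0}."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        is_churn_event = int(event["page"] == CHURN_PAGE)
        # Once a user is flagged as churned, the flag stays set.
        labels[user] = max(labels[user], is_churn_event)
    return dict(labels)


labels = label_churn(events)
print(labels)  # {'1': 0, '2': 1, '3': 0}
```

A per-user label like this is what the models would then be trained to predict from aggregated behaviour features.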

Table of Contents

  1. Requirements
  2. Jupyter Notebook
  3. Data
  4. Summary of Results
  5. Article of the Project
  6. Acknowledgements

Requirements

The following libraries are needed to run the notebook (time and datetime are part of the Python standard library; the others must be installed):

  • numpy
  • pandas
  • time
  • datetime
  • matplotlib
  • seaborn
  • pyspark

Jupyter Notebook

The notebook contains the analysis and the development of the machine learning models that predict user churn in the service.

Data

This is the data that will be used for the analysis and training of the model:

Summary of Results

After training five different models, the accuracy and F1 score obtained for each one are shown in the following table:

| Model name             | Accuracy | f1score  | Training time (min:sec) |
|------------------------|----------|----------|-------------------------|
| Random Forest          | 0.800000 | 0.736508 | 04:34.036536            |
| Logistic Regression    | 0.885714 | 0.860829 | 04:08.942884            |
| Decision Tree          | 0.800000 | 0.805938 | 04:47.750021            |
| Gradient Boosted Trees | 0.685714 | 0.717108 | 04:24.014526            |
| LinearSVC              | 0.828571 | 0.750893 | 04:04.784299            |
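To make the two metrics concrete, the sketch below computes accuracy and a support-weighted F1 score in plain Python. Spark's `MulticlassClassificationEvaluator` reports a weighted F1 by default, but whether the notebook used that evaluator is an assumption, and the labels below are made up for illustration (they are not the project's predictions).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n
    return total / len(y_true)


# Made-up labels for illustration: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} weighted_f1={f1:.3f}")
```

Because the weighted F1 averages per-class scores, it can end up slightly above or below accuracy, which is consistent with the mixed ordering of the two columns in the table.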

A comparison of these results can be seen in the following pictures:

[Figures: Accuracy comparison; f1score comparison]


The model with the best result is Logistic Regression, with an accuracy of 0.886 and an F1 score of 0.861. The features with the most impact are registration_min, errors, friend, played_time_session, avg_songs_session and thumbs_down, as the picture below depicts.

[Figure: coeff (model coefficients per feature)]

For these features, the higher the value, the more likely the user is to stay in the service and not churn.
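The relationship between a feature value and the predicted churn probability in a logistic regression can be sketched as follows. The coefficient and intercept values are hypothetical, chosen only to have the negative sign that the statement above implies; they are not the fitted values from the notebook, and the feature values are assumed to be on a comparable scale.

```python
import math


def churn_probability(features, coefficients, intercept):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (not the notebook's fitted values).
# A negative coefficient means a larger feature value lowers churn probability.
coefficients = {"friend": -0.8, "thumbs_down": -0.3}
intercept = 0.5

low_activity = churn_probability({"friend": 0.0, "thumbs_down": 0.0}, coefficients, intercept)
high_activity = churn_probability({"friend": 2.0, "thumbs_down": 1.0}, coefficients, intercept)

print(f"low activity:  {low_activity:.3f}")
print(f"high activity: {high_activity:.3f}")
```

With negative coefficients, the user with higher feature values gets the lower predicted churn probability.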


To improve the model in the future, it would be worth trying a bigger dataset, so that the models have more data to learn from, and using more parameters in Grid Search to tune the models.
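Widening the grid has a cost, since every combination is fitted once per cross-validation fold. The sketch below enumerates a hypothetical grid in plain Python to show how quickly the number of fits grows. The parameter names `regParam`, `elasticNetParam` and `maxIter` exist on pyspark's `LogisticRegression`, but the candidate values and the fold count are assumptions, not the notebook's settings.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
grid = {
    "regParam": [0.0, 0.01, 0.1],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [10, 50],
}


def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(grid)
num_folds = 3  # assumed number of cross-validation folds
total_fits = len(combinations) * num_folds

print(len(combinations), total_fits)  # 18 54
```

Given that each model in the table took over four minutes to train, a larger grid is easier to afford when it is widened one parameter at a time.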

Article of the Project

Acknowledgements
