Tshepo Chris

Churn Prediction using the Logistic Regression Classifier

Logistic regression predicts a categorical variable from a set of continuous or categorical variables. The independent variables can be either continuous or categorical; the dependent variable, however, is always binary. Logistic regression is useful when there are two outcomes (e.g. “yes” or “no”). It does not require that the independent variables be normally distributed, linearly related to the outcome, or have equal variances, which makes it a fairly flexible classification algorithm. A common rule of thumb, however, is that at least 50 observations are needed for reliable hypothesis testing.
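Under the hood, the model passes a linear combination of the inputs through the logistic (sigmoid) function, which squashes any real number into a probability between 0 and 1. The minimal sketch below illustrates that mapping; the coefficient values are made up for illustration and are not from the model fitted later.

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept b0 and slope b1, purely for illustration
b0, b1 = -1.0, 0.05
tenure = 30
p_churn = sigmoid(b0 + b1 * tenure)
print(p_churn)  # ~0.62: the predicted probability of the "yes" class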

Import Python libraries

Underneath we imported the key Python libraries (a chunk of code that will be used throughout the project). These libraries help with creating arrays and profile tables, manipulating data frames, plotting graphs and predicting future values.

# Arrays and data frames
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set("talk","ticks",font_scale=1,font="sans-serif",color_codes=True,palette="viridis")
plt.rcParams["figure.figsize"] = [10,10]

# Modelling and evaluation
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, learning_curve
import statsmodels.api as sm
import scipy.stats as stats

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings("ignore")

Load data

df = pd.read_csv(r"C:\Users\Tshepo\Desktop\MLAgortihms\Datasets\churn_data.csv")

Above we loaded the obtained data (churn_data.csv) into a pandas data frame.

Since the customerID feature was not of use, the column was deleted.

del df["customerID"]
df.head()

The first five rows of the data are displayed as follows.

   tenure  PhoneService  Contract        PaperlessBilling  PaymentMethod              MonthlyCharges  Churn
0  1       No            Month-to-month  Yes               Electronic check           29.85           No
1  34      Yes           One year        No                Mailed check               1889.5          No
2  2       Yes           Month-to-month  Yes               Mailed check               108.15          Yes
3  45      No            One year        No                Bank transfer (automatic)  1840.75         No
4  2       Yes           Month-to-month  Yes               Electronic check           151.65          Yes

Detect missing values

The presence of missing values can result in measurement errors and significantly impact the conclusions. Underneath we checked whether there were any missing values in the data.
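A minimal check, assuming the df loaded above, is to count the null values in each column with pandas:

# Count missing values per column; all zeros means no gaps in the data
df.isnull().sum()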

There were no missing values detected in the data.

Descriptive statistics

# Share of customers who churned vs. stayed
class_series = df.groupby("Churn").size()
class_series.name = "Churn Distribution"
class_series.plot.pie(autopct="%2.f",cmap="Oranges")
plt.show()

27% of customers left the company. Meanwhile, 73% are still making purchases with the company. Losing more than a quarter of customers is a very hard pill to swallow. New ways of retaining customers must be found to keep the business sustainable in the long run.

# Pie charts for each categorical predictor
fig, axs = plt.subplots(2,2,figsize=(16,16))
class_series1 = df.groupby("PhoneService").size()
class_series2 = df.groupby("Contract").size()
class_series3 = df.groupby("PaperlessBilling").size()
class_series4 = df.groupby("PaymentMethod").size()
class_series1.name = "PhoneService"
class_series2.name = "Contract"
class_series3.name = "PaperlessBilling"
class_series4.name = "PaymentMethod"
class_series1.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[0,0])
axs[0,0].set_ylabel("Count of PhoneService")
class_series2.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[0,1])
axs[0,1].set_ylabel("Count of Contract")
class_series3.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[1,0])
axs[1,0].set_ylabel("Count of PaperlessBilling")
class_series4.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[1,1])
axs[1,1].set_ylabel("Count of PaymentMethod")
plt.show()

  • 90.32% of customers have a phone service; 9.68% do not.
  • 55.02% of customers have month-to-month contracts, 24.07% have two-year contracts and 20.91% have one-year contracts. Month-to-month contracts account for the majority of customers.
  • 59.22% of customers use paperless billing and 40.78% use paper-based billing.
  • Among customers, 33.58% pay by electronic check, 22.89% by mailed check, 21.92% by debit order and 21.61% by credit card.

fig, axs = plt.subplots(2,2,figsize=(16,16))
sns.countplot(df["PhoneService"],ax=axs[0,0])
sns.countplot(df["Contract"],ax=axs[0,1])
sns.countplot(df["PaperlessBilling"],ax=axs[1,0])
sns.countplot(df["PaymentMethod"],ax=axs[1,1])
plt.show()

Create dummy variables

We created dummy variables in the section underneath. Dummy variables are numerical values that represent the presence (or absence) of a categorical attribute. We dropped the first column of each encoding to avoid the dummy variable trap.

df["PhoneService"] = pd.get_dummies(df["PhoneService"],drop_first=True)
df["Contract"] = pd.get_dummies(df["Contract"],drop_first=True)
df["PaperlessBilling"] = pd.get_dummies(df["PaperlessBilling"],drop_first=True)
df["PaymentMethod"] = pd.get_dummies(df["PaymentMethod"],drop_first=True)
df["Churn"] = pd.get_dummies(df["Churn"],drop_first=True)

Correlation matrix

When dealing with two or more variables, one is frequently concerned with the correlation between them. By definition, correlation coefficients always lie between -1 and 1.

dfcorr = df.corr()
sns.heatmap(dfcorr, annot=True, annot_kws={"size":12},cmap="Oranges")
plt.title("Correlation Correlation Matrix")
plt.show()

In the correlation matrix above, correlation coefficients range from -1 to +1. There are more positive correlation coefficients than negative ones. It is beyond the scope of this example to discuss every coefficient shown in the figure.

Covariance

Underneath is a visual representation of the joint variability between the variables under investigation.

dfcov = df.cov()
sns.heatmap(dfcov, annot=True, annot_kws={"size":12},cmap="Oranges")
plt.title("Covariance Matrix")
plt.show()
df.describe().transpose()

The table underneath gives a detailed description of the data.

count mean std min 25% 50% 75% max
tenure 7043.0 32.371149 24.559481 0.00 9.0 29.00 55.00 72.00
PhoneService 7043.0 0.903166 0.295752 0.00 1.0 1.00 1.00 1.00
Contract 7043.0 0.209144 0.406726 0.00 0.0 0.00 0.00 1.00
PaperlessBilling 7043.0 0.592219 0.491457 0.00 0.0 1.00 1.00 1.00
PaymentMethod 7043.0 0.216101 0.411613 0.00 0.0 0.00 0.00 1.00
MonthlyCharges 7043.0 64.761692 30.090047 18.25 35.5 70.35 89.85 118.75
Churn 7043.0 0.265370 0.441561 0.00 0.0 0.00 1.00 1.00

Create x and y arrays

Underneath we defined the objects “x” and “y”, arrays of integer and floating-point values that will be used for the analysis.

x = df[["tenure","PhoneService","Contract","PaperlessBilling","PaymentMethod","MonthlyCharges"]]
y = df.iloc[:,-1]

x represents the independent variables (tenure, PhoneService, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges).

y represents the dependent variable (Churn).

Add constant

According to the statsmodels documentation, an intercept is not included by default. Consequently, we added one using the code underneath.

x_constant = sm.add_constant(x)

Develop Logit model using MLE

Underneath we developed a logit model using the maximum likelihood estimation (MLE) method. This method estimates the parameters of a probability distribution by maximizing a likelihood function, on the assumption that all the information needed for inference is contained in that function.
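Concretely, for a binary outcome MLE maximizes the Bernoulli log-likelihood. Below is a minimal sketch of that objective; the function name and arguments are illustrative, not part of statsmodels.

import numpy as np

def log_likelihood(y, p):
    # Bernoulli log-likelihood: log(p) where y = 1, log(1 - p) where y = 0.
    # MLE chooses the coefficients whose fitted probabilities p maximize this sum.
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))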

model = sm.Logit(y,x_constant).fit()
model.predict(x_constant)
model.summary()
Dep. Variable:    Churn             No. Observations:  7043
Model:            Logit             Df Residuals:      7036
Method:           MLE               Df Model:          6
Date:             Sun, 31 May 2020  Pseudo R-squ.:     inf
Time:             15:25:26          Log-Likelihood:    -inf
converged:        True              LL-Null:           0.0000
Covariance Type:  nonrobust         LLR p-value:       1.000

                      coef  std err        z  P>|z|  [0.025  0.975]
const              -1.3051    0.115  -11.358  0.000  -1.530  -1.080
tenure             -0.0516    0.002  -28.939  0.000  -0.055  -0.048
PhoneService       -0.8910    0.115   -7.771  0.000  -1.116  -0.666
Contract           -0.5563    0.097   -5.748  0.000  -0.746  -0.367
PaperlessBilling    0.5567    0.071    7.857  0.000   0.418   0.696
PaymentMethod      -0.3711    0.088   -4.209  0.000  -0.544  -0.198
MonthlyCharges      0.0331    0.002   21.741  0.000   0.030   0.036

All p-values are below 0.05, so we rejected the null hypotheses in favor of the alternative hypotheses. Each independent variable (tenure, PhoneService, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges) has a statistically significant relationship with the dependent variable (Churn). These variables are decent predictors of customer churn.
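The coefficients are easier to read when exponentiated into odds ratios. For example, exp(0.5567) ≈ 1.74 suggests that paperless billing is associated with roughly 1.74 times the odds of churning. A quick sketch using the fitted model above:

import numpy as np

# Exponentiated coefficients are odds ratios: the multiplicative change in
# the odds of churn for a one-unit increase in each predictor
print(np.exp(model.params))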

Split training and test data

In the section underneath, we split the data into a training set and a test set by calling the train_test_split() function. The training data was used to fit the model, while the test data was used to validate it. Given test_size=0.8, 20% of the data was used for training and 80% for testing.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)

Normalize data

We used the StandardScaler() function to normalize our data. This function transforms each feature so that its mean is 0 and its standard deviation is 1. It is important to normalize data before training a model if one wants to make more accurate predictions.

scaler = StandardScaler()
# Fit the scaler on the training data, then apply the same transformation
# to the test data to avoid information leakage
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

Train model

Underneath we trained a logistic regression classifier on the training data using scikit-learn's LogisticRegression() and its fit() method.

logreg = LogisticRegression()
logreg.fit(x_train,y_train)

Predict churn

Now that we had chosen and finalized our machine learning model, the next step was to predict new data instances. First, we generated predictions on the test set using the predict() function. Thereafter, we created a data frame consisting of both actual and predicted values, which helped us compare the two side by side.

y_pred = logreg.predict(x_test)
pd.DataFrame({"Actual":y_test, "Predicted":y_pred})
      Actual  Predicted
2200       0          1
4627       0          1
3225       0          1
2828       0          1
3768       0          1
5502       1          1
3436       0          0
4256       0          0
6810       1          1
6663       0          1

5635 rows × 2 columns

Classification report

A classification report gives an idea of a model’s performance. It comprises:

  • Precision: how often the classifier is correct when it predicts a given class.
  • Accuracy: how often the classifier got its predictions right overall.
  • F1-score: the harmonic mean of precision and recall (see the quick check after this list).
  • Support: the number of samples of the true response that lie in each class.
  • Recall: the proportion of actual members of a class that are correctly classified.
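As a quick sanity check of the harmonic-mean relationship, the class-1 values from the report below can be recomputed by hand:

# Class-1 precision and recall taken from the classification report below
precision, recall = 0.358733, 0.952068
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.5211, matching the reported f1-score
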
classificationreport = pd.DataFrame(metrics.classification_report(y_test,y_pred,output_dict=True)).transpose()
classificationreport
              precision    recall  f1-score      support
0              0.954175  0.369650  0.532866  4112.000000
1              0.358733  0.952068  0.521114  1523.000000
accuracy       0.527063  0.527063  0.527063     0.527063
macro avg      0.656454  0.660859  0.526990  5635.000000
weighted avg   0.793242  0.527063  0.529690  5635.000000

Confusion matrix

We used a confusion matrix (also known as an error matrix) to determine the performance of the classifier. Underneath we constructed a confusion matrix and passed it into a data frame. Remember that the predicted classes are “YES” and “NO”: “YES” indicates that a customer has left the company and “NO” indicates that a customer has not.

cmat = pd.DataFrame(metrics.confusion_matrix(y_test,y_pred), columns=["Predicted: NO","Predicted: YES"], index=["Actual: NO","Actual: YES"])
cmat
             Predicted: NO  Predicted: YES
Actual: NO            1520            2592
Actual: YES             73            1450

The confusion matrix above shows that our logistic classifier made a total of 5,635 predictions. Out of those, the classifier predicted “YES” 4,042 times. In reality, 1,523 customers churned and 4,112 did not.

Accuracy indicates how often a classifier got predictions right. The calculation is as follows:

(TP + TN) / total

(1,450 + 1,520) / 5,635 = 0.5271

Misclassification rate indicates how often a classifier got the predictions wrong. The calculation is as follows:

(FP + FN) / total

(2,592 + 73) / 5,635 = 0.4729

True positive rate (recall) indicates how often the classifier predicts “yes” when the actual value is “yes”. The calculation is as follows:

TP / actual “YES”

1,450 / 1,523 = 0.9520

True negative rate (specificity) indicates how often the classifier predicts “no” when the actual value is “no”. The calculation is as follows:

TN / actual “NO”

1,520 / 4,112 = 0.3696

Precision refers to how often the classifier is correct when it predicts “yes”. The calculation is as follows:

TP / predicted “YES”

1,450 / 4,042 = 0.3587

Prevalence indicates how often the “yes” condition actually occurs in the sample. The calculation is as follows:

Actual “YES” / total

1,523 / 5,635 = 0.2702
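All of these quantities can be computed directly from the confusion matrix cells; a short sketch using the counts above:

# Confusion matrix cells from the table above
tn, fp = 1520, 2592   # actual "NO" row
fn, tp = 73, 1450     # actual "YES" row
total = tn + fp + fn + tp               # 5635

accuracy = (tp + tn) / total            # ~0.5271
misclassification = (fp + fn) / total   # ~0.4729
tpr = tp / (tp + fn)                    # recall / sensitivity, ~0.9520
tnr = tn / (tn + fp)                    # specificity, ~0.3696
precision = tp / (tp + fp)              # ~0.3587
prevalence = (tp + fn) / total          # ~0.2702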

Validation ROC Curve

The validation ROC curve summarizes the trade-off between the true positive rate and the false positive rate across different probability thresholds, which also makes it useful for choosing an appropriate classification threshold. ROC stands for Receiver Operating Characteristic. The false positive rate lies on the x-axis and the true positive rate on the y-axis. The closer the curve follows first the left-hand border and then the top border of the ROC space, the more accurate the classifier.

y_pred_proba = logreg.predict_proba(x_test)[:,1]
fprlogreg, tprlogreg, _ = metrics.roc_curve(y_test,y_pred_proba)
auclogreg = metrics.roc_auc_score(y_test,y_pred_proba)
plt.plot(fprlogreg, tprlogreg,label="Logistic Regression, auc " + str(auclogreg), color="navy",alpha=0.8)
plt.plot([0,1],[0,1],color="red",alpha=0.8)  # diagonal reference line (no-skill classifier)
plt.xlim([0.00,1.01])
plt.ylim([0.00,1.01])
plt.title("Validation ROC Curve - Logistic Regression")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend(loc=4)
plt.show()

The curve does not follow the left-hand border; from the start, it bends sharply to the right, so the model is not very accurate. This is also confirmed by the area under the curve (AUC) of 0.71. An accurate model should have an AUC score above 0.80.

Precision-Recall curve

A good estimate should be accurate and unbiased, with small variance. The precision-recall curve summarizes the trade-off between precision and recall across different probability thresholds.

from sklearn.metrics import precision_recall_curve, average_precision_score

# Note: the curve here is computed from the hard 0/1 predictions, matching the
# reported score; predicted probabilities (y_pred_proba) would normally be
# preferred for a smoother curve
precisionlogreg, recalllogreg, thresholdlogreg = precision_recall_curve(y_test,y_pred)
apslogreg = average_precision_score(y_test,y_pred)
fig, ax = plt.subplots(figsize=(10,10))
plt.plot(precisionlogreg, recalllogreg, label="Logistic Regression, aps " + str(apslogreg), color="navy")
plt.title("Precision-Recall Curve - Logistic Regression")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc=4)
plt.axhline(y=0.5, color="red", lw=2)
plt.show()

The average precision score (aps) is 0.35.

Learning curve

We used the learning_curve() function to generate values for plotting the learning curve (the number of samples used, and the mean scores on the training and cross-validation sets). This tool makes it possible to determine how much the model would benefit from additional training data, and whether the estimator suffers more from variance error or from bias.

trainsize, trainscore, testscore = learning_curve(LogisticRegression(),x,y,cv=10, n_jobs=-1, train_sizes=np.linspace(0.1,1.0,50))
trainscore_mean = np.mean(trainscore, axis=1)
trainscore_std = np.std(trainscore,axis=1)
testscore_mean = np.mean(testscore,axis=1)
testscore_std = np.std(testscore,axis=1)
plt.plot(trainsize,trainscore_mean,color="red",alpha=0.8, label="Training Score")
plt.plot(trainsize,testscore_mean,color="navy",alpha=0.8, label="Cross-Validation Score")
# Shade one standard deviation around each mean score
plt.fill_between(trainsize,trainscore_mean-trainscore_std,trainscore_mean+trainscore_std,color="red",alpha=0.2)
plt.fill_between(trainsize,testscore_mean-testscore_std,testscore_mean+testscore_std,color="navy",alpha=0.2)
plt.title("Learning Curve - Logistic Regression")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc=1)
plt.show()

The training score of the logistic classifier was consistently greater than the cross-validation score, which points to some variance error. Adding more training samples would likely improve generalization.