Logistic regression allows one to predict a categorical outcome from a set of continuous or categorical variables. The independent variables can be either continuous or categorical, but the dependent variable is binary, which makes logistic regression useful when there are exactly two outcomes (e.g. “yes” or “no”). Unlike linear regression, logistic regression does not require the independent variables to be normally distributed, linearly related to the outcome, or to have equal variances, so it is a fairly flexible classification algorithm. It does, however, assume a reasonably large sample (more than 50 observations) for accurate hypothesis testing.
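Concretely, logistic regression models the probability of the positive class with the logistic (sigmoid) function. For features $x_1, \dots, x_k$ and coefficients $\beta_0, \dots, \beta_k$:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$

A “yes” is predicted when this probability exceeds a chosen threshold (0.5 by default in scikit-learn).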
Import Python libraries
Underneath we imported the key Python libraries (a chunk of code that will be used throughout the project). These libraries help with creating arrays and summary tables, manipulating dataframes, plotting graphs and predicting future values.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set("talk","ticks",font_scale=1,font="sans-serif",color_codes=True,palette="viridis")
plt.rcParams["figure.figsize"] = [10,10]
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, learning_curve
import warnings
warnings.filterwarnings("ignore")
import statsmodels.api as sm
import scipy.stats as stats
Load data
df = pd.read_csv(r"C:\Users\Tshepo\Desktop\MLAgortihms\Datasets\churn_data.csv")
Above we loaded the obtained data (churn_data.csv) into a pandas DataFrame. Since the customerID column is merely an identifier and carries no predictive information, we dropped it before the analysis.
del df["customerID"]
df.head()
The first five rows of the data are displayed as follows.
| | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | Churn |
|---|---|---|---|---|---|---|---|
| 0 | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | No |
| 1 | 34 | Yes | One year | No | Mailed check | 1889.5 | No |
| 2 | 2 | Yes | Month-to-month | Yes | Mailed check | 108.15 | Yes |
| 3 | 45 | No | One year | No | Bank transfer (automatic) | 1840.75 | No |
| 4 | 2 | Yes | Month-to-month | Yes | Electronic check | 151.65 | Yes |
Detect missing values
The presence of missing values can result in measurement errors and significantly impact the conclusions. Underneath we checked whether there were any missing values in the data.
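The check itself is straightforward in pandas; a minimal sketch (the original snippet is not shown in the post):

df.isnull().sum()  # number of missing values per column; all zeros means the data is complete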

There were no missing values detected in the data.
Descriptive statistics
class_series = df.groupby("Churn").size()
class_series.name = "Churn Distribution"
class_series.plot.pie(autopct="%2.f",cmap="Oranges")
plt.show()

27% of customers left the company. Meanwhile, 73% are still making purchases with the company. Losing more than a quarter of the customer base is a hard pill to swallow. New ways of retaining customers must be found to keep the business sustainable in the long run.
fig, axs = plt.subplots(2,2,figsize=(16,16))
class_series1 = df.groupby("PhoneService").size()
class_series2 = df.groupby("Contract").size()
class_series3 = df.groupby("PaperlessBilling").size()
class_series4 = df.groupby("PaymentMethod").size()
class_series1.name = "PhoneService"
class_series2.name = "Contract"
class_series3.name = "PaperlessBilling"
class_series4.name = "PaymentMethod"
class_series1.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[0,0])
axs[0,0].set_ylabel("Count of PhoneService")
class_series2.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[0,1])
axs[0,1].set_ylabel("Count of Contract")
class_series3.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[1,0])
axs[1,0].set_ylabel("Count of PaperlessBilling")
class_series4.plot.pie(autopct="%.2f",cmap="Oranges", ax=axs[1,1])
axs[1,1].set_ylabel("Count of PaymentMethod")
plt.show()

- 90.32% of customers have a phone service, while 9.68% do not.
- 55.02% of customers have month-to-month contracts, 24.07% have two-year contracts and 20.91% have one-year contracts. Customers on month-to-month contracts therefore account for the majority.
- 59.22% of customers use paperless billing and 40.78% use paper-based billing.
- Among customers, 33.58% pay by electronic check, 22.89% by mailed check, 21.92% by automatic bank transfer and 21.61% by credit card.
fig, axs = plt.subplots(2,2,figsize=(16,16))
sns.countplot(x="PhoneService", data=df, ax=axs[0,0])
sns.countplot(x="Contract", data=df, ax=axs[0,1])
sns.countplot(x="PaperlessBilling", data=df, ax=axs[1,0])
sns.countplot(x="PaymentMethod", data=df, ax=axs[1,1])
plt.show()

Create dummy variables
We created dummy variables in the underneath section. Dummy variables are 0/1 numerical values that represent the presence or absence of a categorical level. We dropped the first level of each variable (drop_first=True) to avoid the dummy-variable trap, in which one dummy column is perfectly predictable from the others. Note that Contract and PaymentMethod have more than two levels, so assigning the result back to a single column keeps only one dummy for each; a toy illustration of drop_first=True follows the code below.
df["PhoneService"] = pd.get_dummies(df["PhoneService"],drop_first=True)
df["Contract"] = pd.get_dummies(df["Contract"],drop_first=True)
df["PaperlessBilling"] = pd.get_dummies(df["PaperlessBilling"],drop_first=True)
df["PaymentMethod"] = pd.get_dummies(df["PaymentMethod"],drop_first=True)
df["Churn"] = pd.get_dummies(df["Churn"],drop_first=True)
Correlation matrix
When dealing with two or more variables, one is frequently concerned with the correlation between those variables. Correlation coefficients always lie between -1 and +1: values near ±1 indicate a strong linear relationship, while values near 0 indicate a weak one.
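For reference, the Pearson correlation coefficient computed by df.corr() divides the covariance of two variables $X$ and $Y$ by the product of their standard deviations:

$$r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$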
dfcorr = df.corr()
sns.heatmap(dfcorr, annot=True, annot_kws={"size":12},cmap="Oranges")
plt.title("Correlation Correlation Matrix")
plt.show()

In the correlation matrix above, the correlation coefficients range from -1 to +1, and there are more positive coefficients than negative ones. It is beyond the scope of this example to discuss every coefficient in the figure.
Covariance
Underneath is a visual representation of covariance, the joint variability between pairs of the variables under investigation.
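For reference, the sample covariance computed by df.cov() measures how two variables move together around their means:

$$\mathrm{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Unlike correlation, covariance is not scale-free, so the values in the heatmap below are not confined to $[-1, 1]$.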
dfcov = df.cov()
sns.heatmap(dfcov, annot=True, annot_kws={"size":12},cmap="Oranges")
plt.title("Covariance Matrix")
plt.show()

df.describe().transpose()
The underneath table gives a detailed description of our data.
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.00 | 9.0 | 29.00 | 55.00 | 72.00 |
| PhoneService | 7043.0 | 0.903166 | 0.295752 | 0.00 | 1.0 | 1.00 | 1.00 | 1.00 |
| Contract | 7043.0 | 0.209144 | 0.406726 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| PaperlessBilling | 7043.0 | 0.592219 | 0.491457 | 0.00 | 0.0 | 1.00 | 1.00 | 1.00 |
| PaymentMethod | 7043.0 | 0.216101 | 0.411613 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| MonthlyCharges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |
| Churn | 7043.0 | 0.265370 | 0.441561 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 |
Create x and y arrays
Underneath we defined the objects “x” and “y”, arrays of integer and floating-point values that will be used for the analysis.
x = df[["tenure","PhoneService","Contract","PaperlessBilling","PaymentMethod","MonthlyCharges"]]
y = df.iloc[:,-1]
x represents the independent variables (tenure, PhoneService, Contract, PaperlessBilling, PaymentMethod and MonthlyCharges).
y represents the dependent variable (Churn).
Add constant
According to the statsmodels documentation, an intercept is not included in the model by default, so we added a constant column to x using add_constant().
x_constant = sm.add_constant(x)
Develop Logit model using MLE
Underneath we developed a logit model using the maximum likelihood estimation (MLE) method. MLE estimates the parameters of a probability distribution by maximizing the likelihood function, under the basic assumption that all the information needed for inference is contained in the likelihood.
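For a binary outcome $y_i$ and $p_i = P(y_i = 1 \mid x_i)$ given by the sigmoid function shown earlier, MLE chooses the coefficients $\beta$ that maximize the log-likelihood:

$$\ell(\beta) = \sum_{i=1}^{n} \big[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\big]$$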
model = sm.Logit(y,x_constant).fit()
model.predict(x_constant)
model.summary()
| Dep. Variable: | Churn | No. Observations: | 7043 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7036 |
| Method: | MLE | Df Model: | 6 |
| Date: | Sun, 31 May 2020 | Pseudo R-squ.: | inf |
| Time: | 15:25:26 | Log-Likelihood: | -inf |
| converged: | True | LL-Null: | 0.0000 |
| Covariance Type: | nonrobust | LLR p-value: | 1.000 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.3051 | 0.115 | -11.358 | 0.000 | -1.530 | -1.080 |
| tenure | -0.0516 | 0.002 | -28.939 | 0.000 | -0.055 | -0.048 |
| PhoneService | -0.8910 | 0.115 | -7.771 | 0.000 | -1.116 | -0.666 |
| Contract | -0.5563 | 0.097 | -5.748 | 0.000 | -0.746 | -0.367 |
| PaperlessBilling | 0.5567 | 0.071 | 7.857 | 0.000 | 0.418 | 0.696 |
| PaymentMethod | -0.3711 | 0.088 | -4.209 | 0.000 | -0.544 | -0.198 |
| MonthlyCharges | 0.0331 | 0.002 | 21.741 | 0.000 | 0.030 | 0.036 |
All p-values are below 0.05, so we rejected the null hypotheses in favor of the alternative hypotheses: each of the independent variables (tenure, PhoneService, Contract, PaperlessBilling, PaymentMethod and MonthlyCharges) has a statistically significant relationship with Churn.
Split training and test data
In the section underneath, we split the obtained data into a training set and a test set. Thereafter, the model is trained on the training set and its predictions are evaluated against the unseen test set.
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.8,random_state=0)
Normalize data
We used the StandardScaler() function to normalize our data. This function transforms each feature so that its mean value is 0 and its standard deviation is 1. It is important to normalize data prior to training any model if one wants to make more accurate predictions. Note that the scaler is fitted on the training data only and then applied to the test data, as shown below.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
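As a quick sanity check (not part of the original code), the scaled training features should now have a mean of roughly 0 and a standard deviation of roughly 1:

print(x_train.mean(axis=0).round(2))  # approximately all zeros
print(x_train.std(axis=0).round(2))   # approximately all ones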
Train data
Underneath we trained scikit-learn's LogisticRegression classifier on the scaled training data.
logreg = LogisticRegression()
logreg.fit(x_train,y_train)
Predict churn
Now that we have chosen and finalized our machine learning model, the next step is to predict new data instances. Firstly, we generated predictions for the test set using the predict() method. Thereafter, we created a data frame consisting of both actual values and predicted values, which helped us compare the two.
y_pred = logreg.predict(x_test)
pd.DataFrame({"Actual":y_test, "Predicted":y_pred})
| | Actual | Predicted |
|---|---|---|
| 2200 | 0 | 1 |
| 4627 | 0 | 1 |
| 3225 | 0 | 1 |
| 2828 | 0 | 1 |
| 3768 | 0 | 1 |
| … | … | … |
| 5502 | 1 | 1 |
| 3436 | 0 | 0 |
| 4256 | 0 | 0 |
| 6810 | 1 | 1 |
| 6663 | 0 | 1 |
5635 rows × 2 columns
Classification report
A classification report gives an idea of a model’s performance. It comprises the following metrics (their formulas, in terms of the confusion-matrix counts, are given below the list):
- Precision: how often the classifier’s “yes” predictions are correct.
- Accuracy: how often the classifier got its predictions right overall.
- F1-score: the harmonic mean of precision and recall.
- Support: the number of samples of the true response that lie in each class.
- Recall: the fraction of actual positives that are correctly classified.
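In terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$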
classificationreport = pd.DataFrame(metrics.classification_report(y_test,y_pred,output_dict=True)).transpose()
classificationreport
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.954175 | 0.369650 | 0.532866 | 4112.000000 |
| 1 | 0.358733 | 0.952068 | 0.521114 | 1523.000000 |
| accuracy | 0.527063 | 0.527063 | 0.527063 | 0.527063 |
| macro avg | 0.656454 | 0.660859 | 0.526990 | 5635.000000 |
| weighted avg | 0.793242 | 0.527063 | 0.529690 | 5635.000000 |
Confusion matrix
We used a confusion matrix (also known as an error matrix) to determine the performance of the classifier. Underneath we constructed a confusion matrix and passed it into a data frame. Remember that the two classes are “YES” and “NO”: “YES” indicates that a customer has left our company, and “NO” indicates that a customer has not.
cmat = pd.DataFrame(metrics.confusion_matrix(y_test,y_pred), columns=["Predicted: NO","Predicted: YES"], index=["Actual: NO","Actual: YES"])
cmat
| | Predicted: NO | Predicted: YES |
|---|---|---|
| Actual: NO | 1520 | 2592 |
| Actual: YES | 73 | 1450 |
The confusion matrix above shows that our logistic classifier produced a total of 5 635 predictions. Out of the 5 635 predictions, the classifier predicted “YES” 4 042 times (2 592 + 1 450). In reality, 1 523 customers churned and 4 112 did not. Accuracy indicates how often the classifier got its predictions right. The calculation is as follows: –
(TP + TN) / total
(1 450 + 1 520) / 5 635 = 0.5270
Misclassification rate indicates how often a classifier got the predictions wrong. The calculation is as follows: –
(FP + FN) / total
(2 592 + 73) / 5 635 = 0.4729
True positive rate (also called recall or sensitivity) indicates how often the classifier predicts “yes” when the actual value is “yes”. The calculation is as follows: –
TP / actual “YES”
1 450 / 1 523 = 0.9520
True negative rate (specificity) indicates how often the classifier predicts “no” when the actual value is “no”. The calculation is as follows: –
TN / actual “NO”
1 520 / 4 112 = 0.3696
Precision indicates how often a “yes” prediction is correct. The calculation is as follows: –
TP / predicted “YES”
1 450 / 4 042 = 0.3587
Prevalence indicates how often the “yes” condition actually occurs in our sample. The calculation is as follows: –
Actual “YES” / total
1 523 / 5 635 = 0.2702
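As a cross-check, all of the rates above can be computed directly from the confusion matrix. A minimal sketch (not part of the original write-up) reusing the y_test and y_pred variables from earlier:

tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
total = tn + fp + fn + tp
print("accuracy:", (tp + tn) / total)           # ≈ 0.527
print("misclassification:", (fp + fn) / total)  # ≈ 0.473
print("recall (TPR):", tp / (tp + fn))          # ≈ 0.952
print("specificity (TNR):", tn / (tn + fp))     # ≈ 0.370
print("precision:", tp / (tp + fp))             # ≈ 0.359
print("prevalence:", (tp + fn) / total)         # ≈ 0.270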
Validation ROC Curve
A validation ROC curve summarizes the trade-off between the true positive rate and the false positive rate across different probability thresholds, and it is also a useful tool when comparing hyper-parameter settings. ROC stands for Receiver Operating Characteristic. The false positive rate lies on the x-axis and the true positive rate on the y-axis. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the classifier.
y_pred_proba = logreg.predict_proba(x_test)[:,1]
fprlogreg, tprlogreg, _ = metrics.roc_curve(y_test,y_pred_proba)
auclogreg = metrics.roc_auc_score(y_test,y_pred_proba)
plt.plot(fprlogreg, tprlogreg,label="Logistic Regression , auc " +str(auclogreg), color = "navy",alpha=0.8)
plt.plot([0,1],[0,1],color="red",alpha=0.8)
plt.xlim([0.00,1.01])
plt.ylim([0.00,1.01])
plt.title("Validation ROC Curve - Logsitic Regression")
plt.xlabel("Specificity")
plt.ylabel("Sensitivity")
plt.legend(loc=4)
plt.show()

The curve does not follow the left-hand border; from the start, it bends sharply to the right, so the model is not very accurate. This is also confirmed by the area under the curve (AUC) of 0.71; an accurate model should have an AUC above 0.80.
Precision-Recall curve
A precision-recall curve summarizes the trade-off between precision and recall at different decision thresholds, which is especially informative when the classes are imbalanced, as they are here.
precisionlogreg,recalllogreg,thresholdlogreg = metrics.precision_recall_curve(y_test,y_pred)
apslogreg = metrics.average_precision_score(y_test,y_pred)
fig, ax = plt.subplots(figsize=(10,10))
plt.plot(precisionlogreg,recalllogreg, label = "Logistic Regression , aps " + str(apslogreg), color="navy")
plt.title("Precision-Recall Curve - Logistic Regression")
plt.xlabel("Precision")
plt.ylabel("Recall")
plt.legend(loc=4)
plt.axhline(y=0.5, color="red", lw=2)
plt.show()

The average precision score (aps) is 0.35, consistent with the low precision observed for the churn class in the classification report.
Learning curve
We used the learning_curve() function to generate the values for plotting the learning curve (the training-set sizes used and the mean scores on the training and cross-validation sets). This tool enables one to determine how much the model would benefit from additional training data, and whether the estimator suffers more from variance error or from bias.
trainsize, trainscore, testscore = learning_curve(LogisticRegression(),x,y,cv=10, n_jobs=-1, train_sizes=np.linspace(0.1,1.0,50))
trainscore_mean = np.mean(trainscore, axis=1)
trainscore_std = np.std(trainscore,axis=1)
testscore_mean = np.mean(testscore,axis=1)
testscore_std = np.std(testscore,axis=1)
plt.plot(trainsize,trainscore_mean,color="red",alpha=0.8, label="Training Score")
plt.plot(trainsize,testscore_mean,color="navy",alpha=0.8, label="Cross-Validation Score")
plt.title("Learning Curve - Logsitic Regresion")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc=1)
plt.show()

The training score of the logistic classifier was consistently greater than the cross-validation score. Adding more training samples would likely improve generalization.