Churn Prediction

9 minute read

3. Machine Learning for Classification

We’ll use logistic regression to predict churn

3.1 Churn prediction project

  • Dataset: https://www.kaggle.com/blastchar/telco-customer-churn
  • https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv

3.2 Data preparation

  • Download the data, read it with pandas
  • Look at the data
  • Make column names and values look uniform
  • Check if all the columns read correctly
  • Check if the churn variable needs any preparation
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(r'E:\gito\mlbookcamp-code\chapter-03-churn-prediction\WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df.head().T
0 1 2 3 4
customerid 7590-vhveg 5575-gnvde 3668-qpybk 7795-cfocw 9237-hqitu
gender female male male male female
seniorcitizen 0 0 0 0 0
partner yes no no no no
dependents no no no no no
tenure 1 34 2 45 2
phoneservice no yes yes no yes
multiplelines no_phone_service no no no_phone_service no
internetservice dsl dsl dsl dsl fiber_optic
onlinesecurity no yes yes yes no
onlinebackup yes no yes no no
deviceprotection no yes no yes no
techsupport no no no yes no
streamingtv no no no no no
streamingmovies no no no no no
contract month-to-month one_year month-to-month one_year month-to-month
paperlessbilling yes no yes no yes
paymentmethod electronic_check mailed_check mailed_check bank_transfer_(automatic) electronic_check
monthlycharges 29.85 56.95 53.85 42.3 70.7
totalcharges 29.85 1889.5 108.15 1840.75 151.65
churn no no yes no yes
tc = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.churn.head()
0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object
df.churn = (df.churn == 'yes').astype(int)

3.3 Setting up the validation framework

  • Perform the train/validation/test split with Scikit-Learn
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
len(df_train), len(df_val), len(df_test)
(4225, 1409, 1409)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

del df_train['churn']
del df_val['churn']
del df_test['churn']

3.4 EDA

  • Check missing values
  • Look at the target variable (churn)
  • Look at numerical and categorical variables
df_full_train = df_full_train.reset_index(drop=True)
df_full_train.isnull().sum()
customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64
df_full_train.churn.value_counts(normalize=True)
0    0.730032
1    0.269968
Name: churn, dtype: float64
df_full_train.churn.mean()
0.26996805111821087
numerical = ['tenure', 'monthlycharges', 'totalcharges']
categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]
df_full_train[categorical].nunique()
gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

3.5 Feature importance: Churn rate and risk ratio

Feature importance analysis (part of EDA) - identifying which features affect our target variable

  • Churn rate
  • Risk ratio
  • Mutual information - later

Churn rate

df_full_train.head()
customerid gender seniorcitizen partner dependents tenure phoneservice multiplelines internetservice onlinesecurity ... deviceprotection techsupport streamingtv streamingmovies contract paperlessbilling paymentmethod monthlycharges totalcharges churn
0 5442-pptjy male 0 yes yes 12 yes no no no_internet_service ... no_internet_service no_internet_service no_internet_service no_internet_service two_year no mailed_check 19.70 258.35 0
1 6261-rcvns female 0 no no 42 yes no dsl yes ... yes yes no yes one_year no credit_card_(automatic) 73.90 3160.55 1
2 2176-osjuv male 0 yes no 71 yes yes dsl yes ... no yes no no two_year no bank_transfer_(automatic) 65.15 4681.75 0
3 6161-erdgd male 0 yes yes 71 yes yes dsl yes ... yes yes yes yes one_year no electronic_check 85.45 6300.85 0
4 2364-ufrom male 0 no no 30 yes no dsl yes ... no yes yes no one_year no electronic_check 70.40 2044.75 0

5 rows × 21 columns

churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_female
0.27682403433476394
churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()
churn_male
0.2632135306553911
global_churn = df_full_train.churn.mean()
global_churn
0.26996805111821087
global_churn - churn_female
-0.006855983216553063
global_churn - churn_male
0.006754520462819769
df_full_train.partner.value_counts()
no     2932
yes    2702
Name: partner, dtype: int64
churn_partner = df_full_train[df_full_train.partner == 'yes'].churn.mean()
churn_partner
0.20503330866025166
global_churn - churn_partner
0.06493474245795922
churn_no_partner = df_full_train[df_full_train.partner == 'no'].churn.mean()
churn_no_partner
0.3298090040927694
global_churn - churn_no_partner
-0.05984095297455855

Risk ratio

churn_no_partner / global_churn
1.2216593879412643
churn_partner / global_churn
0.7594724924338315
SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk
FROM
    data
GROUP BY
    gender;
from IPython.display import display
for c in categorical:
    print(c)
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = df_group['mean'] - global_churn
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)
    print()
    print()
gender
mean count diff risk
gender
female 0.276824 2796 0.006856 1.025396
male 0.263214 2838 -0.006755 0.974980
seniorcitizen
mean count diff risk
seniorcitizen
0 0.242270 4722 -0.027698 0.897403
1 0.413377 912 0.143409 1.531208
partner
mean count diff risk
partner
no 0.329809 2932 0.059841 1.221659
yes 0.205033 2702 -0.064935 0.759472
dependents
mean count diff risk
dependents
no 0.313760 3968 0.043792 1.162212
yes 0.165666 1666 -0.104302 0.613651
phoneservice
mean count diff risk
phoneservice
no 0.241316 547 -0.028652 0.893870
yes 0.273049 5087 0.003081 1.011412
multiplelines
mean count diff risk
multiplelines
no 0.257407 2700 -0.012561 0.953474
no_phone_service 0.241316 547 -0.028652 0.893870
yes 0.290742 2387 0.020773 1.076948
internetservice
mean count diff risk
internetservice
dsl 0.192347 1934 -0.077621 0.712482
fiber_optic 0.425171 2479 0.155203 1.574895
no 0.077805 1221 -0.192163 0.288201
onlinesecurity
mean count diff risk
onlinesecurity
no 0.420921 2801 0.150953 1.559152
no_internet_service 0.077805 1221 -0.192163 0.288201
yes 0.153226 1612 -0.116742 0.567570
onlinebackup
mean count diff risk
onlinebackup
no 0.404323 2498 0.134355 1.497672
no_internet_service 0.077805 1221 -0.192163 0.288201
yes 0.217232 1915 -0.052736 0.804660
deviceprotection
mean count diff risk
deviceprotection
no 0.395875 2473 0.125907 1.466379
no_internet_service 0.077805 1221 -0.192163 0.288201
yes 0.230412 1940 -0.039556 0.853480
techsupport
mean count diff risk
techsupport
no 0.418914 2781 0.148946 1.551717
no_internet_service 0.077805 1221 -0.192163 0.288201
yes 0.159926 1632 -0.110042 0.592390
streamingtv
mean count diff risk
streamingtv
no 0.342832 2246 0.072864 1.269897
no_internet_service 0.077805 1221 -0.192163 0.288201
yes 0.302723 2167 0.032755 1.121328
streamingmovies
mean count diff risk
streamingmovies
no 0.338906 2213 0.068938 1.255358
no_internet_service 0.077805 1221 -0.192163 0.288201
yes 0.307273 2200 0.037305 1.138182
contract
mean count diff risk
contract
month-to-month 0.431701 3104 0.161733 1.599082
one_year 0.120573 1186 -0.149395 0.446621
two_year 0.028274 1344 -0.241694 0.104730
paperlessbilling
mean count diff risk
paperlessbilling
no 0.172071 2313 -0.097897 0.637375
yes 0.338151 3321 0.068183 1.252560
paymentmethod
mean count diff risk
paymentmethod
bank_transfer_(automatic) 0.168171 1219 -0.101797 0.622928
credit_card_(automatic) 0.164339 1217 -0.105630 0.608733
electronic_check 0.455890 1893 0.185922 1.688682
mailed_check 0.193870 1305 -0.076098 0.718121

3.6 Feature importance: Mutual information

Mutual information - concept from information theory, it tells us how much we can learn about one variable if we know the value of another

  • https://en.wikipedia.org/wiki/Mutual_information
from sklearn.metrics import mutual_info_score
mutual_info_score(df_full_train.churn, df_full_train.contract)
0.0983203874041556
mutual_info_score(df_full_train.gender, df_full_train.churn)
0.0001174846211139946
mutual_info_score(df_full_train.contract, df_full_train.churn)
0.0983203874041556
mutual_info_score(df_full_train.partner, df_full_train.churn)
0.009967689095399745
def mutual_info_churn_score(series):
    return mutual_info_score(series, df_full_train.churn)
mi = df_full_train[categorical].apply(mutual_info_churn_score)
mi.sort_values(ascending=False)
contract            0.098320
onlinesecurity      0.063085
techsupport         0.061032
internetservice     0.055868
onlinebackup        0.046923
deviceprotection    0.043453
paymentmethod       0.043210
streamingtv         0.031853
streamingmovies     0.031581
paperlessbilling    0.017589
dependents          0.012346
partner             0.009968
seniorcitizen       0.009410
multiplelines       0.000857
phoneservice        0.000229
gender              0.000117
dtype: float64

3.7 Feature importance: Correlation

How about numerical columns?

  • Correlation coefficient - https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
df_full_train.tenure.max()
72
df_full_train[numerical].corrwith(df_full_train.churn).abs()
tenure            0.351885
monthlycharges    0.196805
totalcharges      0.196353
dtype: float64
df_full_train[df_full_train.tenure <= 2].churn.mean()
0.5953420669577875
df_full_train[(df_full_train.tenure > 2) & (df_full_train.tenure <= 12)].churn.mean()
0.3994413407821229
df_full_train[df_full_train.tenure > 12].churn.mean()
0.17634908339788277
df_full_train[df_full_train.monthlycharges <= 20].churn.mean()
0.08795411089866156
df_full_train[(df_full_train.monthlycharges > 20) & (df_full_train.monthlycharges <= 50)].churn.mean()
0.18340943683409436
df_full_train[df_full_train.monthlycharges > 50].churn.mean()
0.32499341585462205

3.8 One-hot encoding

  • Use Scikit-Learn to encode categorical features
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

3.9 Logistic regression

  • Binary classification
  • Linear vs logistic regression
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
z = np.linspace(-7, 7, 51)
sigmoid(10000)
1.0
plt.plot(z, sigmoid(z))
[<matplotlib.lines.Line2D at 0x7f342d0bf080>]

png

def linear_regression(xi):
    result = w0
    
    for j in range(len(w)):
        result = result + xi[j] * w[j]
        
    return result
def logistic_regression(xi):
    score = w0
    
    for j in range(len(w)):
        score = score + xi[j] * w[j]
        
    result = sigmoid(score)
    return result

3.10 Training logistic regression with Scikit-Learn

  • Train a model with Scikit-Learn
  • Apply it to the validation dataset
  • Calculate the accuracy
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs')
# solver='lbfgs' is the default solver in newer version of sklearn
# for older versions, you need to specify it explicitly
model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
model.intercept_[0]
-0.10903395348323511
model.coef_[0].round(3)
array([ 0.475, -0.175, -0.408, -0.03 , -0.078,  0.063, -0.089, -0.081,
       -0.034, -0.073, -0.335,  0.316, -0.089,  0.004, -0.258,  0.141,
        0.009,  0.063, -0.089, -0.081,  0.266, -0.089, -0.284, -0.231,
        0.124, -0.166,  0.058, -0.087, -0.032,  0.07 , -0.059,  0.141,
       -0.249,  0.215, -0.12 , -0.089,  0.102, -0.071, -0.089,  0.052,
        0.213, -0.089, -0.232, -0.07 ,  0.   ])
y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
(y_val == churn_decision).mean()
0.8034066713981547
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = churn_decision.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual
df_pred.correct.mean()
0.8034066713981547
churn_decision.astype(int)
array([0, 0, 0, ..., 0, 1, 1])

3.11 Model interpretation

  • Look at the coefficients
  • Train a smaller model with fewer features
a = [1, 2, 3, 4]
b = 'abcd'
dict(zip(a, b))
{1: 'a', 2: 'b', 3: 'c', 4: 'd'}
dict(zip(dv.get_feature_names(), model.coef_[0].round(3)))
{'contract=month-to-month': 0.475,
 'contract=one_year': -0.175,
 'contract=two_year': -0.408,
 'dependents=no': -0.03,
 'dependents=yes': -0.078,
 'deviceprotection=no': 0.063,
 'deviceprotection=no_internet_service': -0.089,
 'deviceprotection=yes': -0.081,
 'gender=female': -0.034,
 'gender=male': -0.073,
 'internetservice=dsl': -0.335,
 'internetservice=fiber_optic': 0.316,
 'internetservice=no': -0.089,
 'monthlycharges': 0.004,
 'multiplelines=no': -0.258,
 'multiplelines=no_phone_service': 0.141,
 'multiplelines=yes': 0.009,
 'onlinebackup=no': 0.063,
 'onlinebackup=no_internet_service': -0.089,
 'onlinebackup=yes': -0.081,
 'onlinesecurity=no': 0.266,
 'onlinesecurity=no_internet_service': -0.089,
 'onlinesecurity=yes': -0.284,
 'paperlessbilling=no': -0.231,
 'paperlessbilling=yes': 0.124,
 'partner=no': -0.166,
 'partner=yes': 0.058,
 'paymentmethod=bank_transfer_(automatic)': -0.087,
 'paymentmethod=credit_card_(automatic)': -0.032,
 'paymentmethod=electronic_check': 0.07,
 'paymentmethod=mailed_check': -0.059,
 'phoneservice=no': 0.141,
 'phoneservice=yes': -0.249,
 'seniorcitizen': 0.215,
 'streamingmovies=no': -0.12,
 'streamingmovies=no_internet_service': -0.089,
 'streamingmovies=yes': 0.102,
 'streamingtv=no': -0.071,
 'streamingtv=no_internet_service': -0.089,
 'streamingtv=yes': 0.052,
 'techsupport=no': 0.213,
 'techsupport=no_internet_service': -0.089,
 'techsupport=yes': -0.232,
 'tenure': -0.07,
 'totalcharges': 0.0}
small = ['contract', 'tenure', 'monthlycharges']
df_train[small].iloc[:10].to_dict(orient='records')
[{'contract': 'two_year', 'tenure': 72, 'monthlycharges': 115.5},
 {'contract': 'month-to-month', 'tenure': 10, 'monthlycharges': 95.25},
 {'contract': 'month-to-month', 'tenure': 5, 'monthlycharges': 75.55},
 {'contract': 'month-to-month', 'tenure': 5, 'monthlycharges': 80.85},
 {'contract': 'two_year', 'tenure': 18, 'monthlycharges': 20.1},
 {'contract': 'month-to-month', 'tenure': 4, 'monthlycharges': 30.5},
 {'contract': 'month-to-month', 'tenure': 1, 'monthlycharges': 75.1},
 {'contract': 'month-to-month', 'tenure': 1, 'monthlycharges': 70.3},
 {'contract': 'two_year', 'tenure': 72, 'monthlycharges': 19.75},
 {'contract': 'month-to-month', 'tenure': 6, 'monthlycharges': 109.9}]
dicts_train_small = df_train[small].to_dict(orient='records')
dicts_val_small = df_val[small].to_dict(orient='records')
dv_small = DictVectorizer(sparse=False)
dv_small.fit(dicts_train_small)
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
               sparse=False)
dv_small.get_feature_names()
['contract=month-to-month',
 'contract=one_year',
 'contract=two_year',
 'monthlycharges',
 'tenure']
X_train_small = dv_small.transform(dicts_train_small)
model_small = LogisticRegression(solver='lbfgs')
model_small.fit(X_train_small, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
w0 = model_small.intercept_[0]
w0
-2.476775657751665
w = model_small.coef_[0]
w.round(3)
array([ 0.97 , -0.025, -0.949,  0.027, -0.036])
dict(zip(dv_small.get_feature_names(), w.round(3)))
{'contract=month-to-month': 0.97,
 'contract=one_year': -0.025,
 'contract=two_year': -0.949,
 'monthlycharges': 0.027,
 'tenure': -0.036}
-2.47 + (-0.949) + 30 * 0.027 + 24 * (-0.036)
-3.473
sigmoid(_)
0.030090303318277657

3.12 Using the model

dicts_full_train = df_full_train[categorical + numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_full_train = dv.fit_transform(dicts_full_train)
y_full_train = df_full_train.churn.values
model = LogisticRegression(solver='lbfgs')
model.fit(X_full_train, y_full_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
dicts_test = df_test[categorical + numerical].to_dict(orient='records')
X_test = dv.transform(dicts_test)
y_pred = model.predict_proba(X_test)[:, 1]
churn_decision = (y_pred >= 0.5)
(churn_decision == y_test).mean()
0.815471965933286
y_test
array([0, 0, 0, ..., 0, 0, 1])
customer = dicts_test[-1]
customer
{'gender': 'female',
 'seniorcitizen': 0,
 'partner': 'yes',
 'dependents': 'yes',
 'phoneservice': 'yes',
 'multiplelines': 'yes',
 'internetservice': 'fiber_optic',
 'onlinesecurity': 'yes',
 'onlinebackup': 'no',
 'deviceprotection': 'yes',
 'techsupport': 'no',
 'streamingtv': 'yes',
 'streamingmovies': 'yes',
 'contract': 'month-to-month',
 'paperlessbilling': 'yes',
 'paymentmethod': 'electronic_check',
 'tenure': 17,
 'monthlycharges': 104.2,
 'totalcharges': 1743.5}
X_small = dv.transform([customer])
model.predict_proba(X_small)[0, 1]
0.5968852088293909
y_test[-1]
1

Updated: