Take home data challenge

Publish Date: 2018-01-24

Author: Naixin Zhang

Word Count: 1.6k

Read Times: 10 Min

Read Count:

Background: This notebook aims to predict when employees are going to quit by understanding the main drivers of employee churn based on the dataset with info about the employees. specifically,we need to solve these problems:

create a table with 3 columns: day, employee_headcount,company_id
what are the main factors drive employee churn? do they make sense? explain
what variable can be added that could help explain emplyee churn

This notebook covers:

create table
Feature engineering(Data Visualization)
Decision tree

import  warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

%matplotlib inline

Load Dataset

data = pd.read_csv('employee_retention.csv', parse_dates=['join_date', 'quit_date'])
data.head()

	employee_id	company_id	dept	seniority	salary	join_date	quit_date
0	13021.0	7	customer_service	28	89000.0	2014-03-24	2015-10-30
1	825355.0	7	marketing	20	183000.0	2013-04-29	2014-04-04
2	927315.0	4	marketing	14	101000.0	2014-10-13	NaT
3	662910.0	7	customer_service	20	115000.0	2012-05-14	2013-06-07
4	256971.0	2	data_science	23	276000.0	2011-10-17	2014-08-22

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24702 entries, 0 to 24701
Data columns (total 7 columns):
employee_id    24702 non-null float64
company_id     24702 non-null int64
dept           24702 non-null object
seniority      24702 non-null int64
salary         24702 non-null float64
join_date      24702 non-null datetime64[ns]
quit_date      13510 non-null datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(2), object(1)
memory usage: 1.3+ MB

data.describe()

	employee_id	company_id	seniority	salary
count	24702.000000	24702.000000	24702.000000	24702.000000
mean	501604.403530	3.426969	14.127803	138183.345478
std	288909.026101	2.700011	8.089520	76058.184573
min	36.000000	1.000000	1.000000	17000.000000
25%	250133.750000	1.000000	7.000000	79000.000000
50%	500793.000000	2.000000	14.000000	123000.000000
75%	753137.250000	5.000000	21.000000	187000.000000
max	999969.000000	12.000000	99.000000	408000.000000

# Null information
data.isnull().sum()

employee_id        0
company_id         0
dept               0
seniority          0
salary             0
join_date          0
quit_date      11192
dtype: int64

Create Table for `day`, `employee_headcount`, and `company_id`

# Define useful information
unique_date = pd.date_range(start='2011-01-24', end='2015-12-13', freq='D')
unique_company = sorted(data['company_id'].unique())

day = []
company = []
headcount = []

# Loop through date and company id
for date in unique_date:
    for idx in unique_company:
        total_join = len(data[(data['join_date'] <= date) & (data['company_id'] == idx)])
        total_quit = len(data[(data['quit_date'] <= date) & (data['company_id'] == idx)])
        day.append(date)
        company.append(idx)
        headcount.append(total_join - total_quit)

# Create table for day, employee_headcount, company_id
table = pd.DataFrame({'day': day, 'company_id': company, 'employee_headcount': headcount}, 
                     columns=['day', 'company_id', 'employee_headcount'])

table.head()

	day	company_id	employee_headcount
0	2011-01-24	1	25
1	2011-01-24	2	17
2	2011-01-24	3	9
3	2011-01-24	4	12
4	2011-01-24	5	5

Employee Churn Analysis

# Separate stay and quit data
quit_data = data[~data['quit_date'].isnull()]
stay_data = data[data['quit_date'].isnull()]

Feature Engineering

# Total working days
work_days = np.array(list(map(lambda x: x.days, quit_data['quit_date'] - quit_data['join_date'])))

hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(figsize=(15, 6))
sns.distplot(work_days, bins=100, kde=False, ax=ax, hist_kws=hist_kws)
ax.set_title('Histogram of Working Days', fontsize=16)
ax.set_xlabel('Working Days', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
plt.show()

png

there are peaks around each employee year anniversary

# Week No. for quiting
quit_week = np.array(list(map(lambda x: x.month, quit_data['quit_date'])))

fig, ax = plt.subplots(figsize=(15, 6))
sns.countplot(quit_week, ax=ax)
ax.set_title('Countplot of Quit Month', fontsize=16)
ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
plt.show()

png

# Week No. for quiting
weeks  = np.array(list(map(lambda x: x.week, quit_data['quit_date'])))
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(figsize=(15, 6))
sns.distplot(weeks , bins=50, kde=False, ax=ax, hist_kws=hist_kws)
ax.set_title('Histogram of Quit Weeks', fontsize=16)
ax.set_xlabel('Quit Weeks', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
plt.show()

png

And it also peaks around the new year. Makes sense, companies have much more money to hire at the beginning of the year. So our goal becomes to prevent employees to quit within 1 year.

Now, let’s see if we find the characteristics of the people who quit early. Looking at the histogram of employment_length, it looks like we could define early quitters as those people who quit within 1 yr or so.

So, let’s create two classes of users : quit within 13 months or not (if they haven’t been in the current company for at least 13 months, we remove them).

# Choose quit data
quit_data['work_days'] = work_days
quit_data['quit_week'] = quit_week
quit_data.head()

	employee_id	company_id	dept	seniority	salary	join_date	quit_date	work_days	quit_week
0	13021.0	7	customer_service	28	89000.0	2014-03-24	2015-10-30	585	44
1	825355.0	7	marketing	20	183000.0	2013-04-29	2014-04-04	340	14
3	662910.0	7	customer_service	20	115000.0	2012-05-14	2013-06-07	389	23
4	256971.0	2	data_science	23	276000.0	2011-10-17	2014-08-22	1040	34
5	509529.0	4	data_science	14	165000.0	2012-01-30	2013-08-30	578	35

Decision Tree Model

# Choose the subset data
stop_date = pd.to_datetime('2015-12-13') - pd.DateOffset(days=365 + 31)
subset = data[data['join_date'] < stop_date]

# Binary label for early quit (less than 13 months)
quit = subset['quit_date'].isnull() | (subset['quit_date'] > subset['join_date'] + pd.DateOffset(days=396))
subset['quit'] = 1 - quit.astype(int)

subset.head()

	employee_id	company_id	dept	seniority	salary	join_date	quit_date	quit
0	13021.0	7	customer_service	28	89000.0	2014-03-24	2015-10-30	0
1	825355.0	7	marketing	20	183000.0	2013-04-29	2014-04-04	1
2	927315.0	4	marketing	14	101000.0	2014-10-13	NaT	0
3	662910.0	7	customer_service	20	115000.0	2012-05-14	2013-06-07	1
4	256971.0	2	data_science	23	276000.0	2011-10-17	2014-08-22	0

# # One-hot encoding
# subset['company_id'] = subset['company_id'].astype(str)
# dummies = pd.get_dummies(subset[['company_id', 'dept']])
# train_x = pd.concat(objs=[subset[['seniority', 'salary']], dummies], axis=1)
# train_y = subset['quit'].values

# train_x.head()

# Label encoder
le = LabelEncoder()
train_x = subset[['company_id', 'seniority', 'salary']]
train_x['dept'] = le.fit_transform(subset['dept'])
train_y = subset['quit'].values

train_x.head()

	company_id	seniority	salary	dept
0	7	28	89000.0	0
1	7	20	183000.0	4
2	4	14	101000.0	4
3	7	20	115000.0	0
4	2	23	276000.0	1

# Build decision tree
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=30, random_state=42)
clf = clf.fit(X=train_x, y=train_y)

# Visualization
features = list(train_x.columns)
targets = ['Not quit', 'Quit']
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=features, class_names=targets, 
                                filled=True, rounded=True, special_characters=True, )  
graph = graphviz.Source(dot_data)
graph

svg

# Feature importance
importance = sorted(zip(features, clf.feature_importances_), key=lambda x:x[1], reverse=True)
for feature, val in importance:
    print('{0:10s} | {1:.5f}'.format(feature, val))

salary     | 0.97439
seniority  | 0.02561
company_id | 0.00000
dept       | 0.00000

# Visualization
hist_kws={'histtype': 'bar', 'edgecolor':'black', 'alpha': 0.2}
fig, ax = plt.subplots(figsize=(15, 7))
sns.distplot(subset[subset['quit']==0]['salary'], 
             label='Not Quit', ax=ax, hist_kws=hist_kws)
sns.distplot(subset[subset['quit']==1]['salary'], 
             label='Quit', ax=ax, hist_kws=hist_kws)
ax.set_xlabel('Salary', fontsize=12)
ax.set_ylabel('PDF', fontsize=12)
ax.legend()
plt.show()

png

From this graph, people who make a lot of money and very little are not likely to quit. If salary between 80000 and 200000, the employee has higher probability of being an early quitter.

Other Factors

Given how important is salary, I would definitely love to have as a variable the salary the employee who quit was offered in the next job. Otherwise, things like: promotions or raises received during the employee tenure would be interesting.
The major findings are that employees quit at year anniversaries or at the beginning of the year. Both cases make sense. Even if you don’t like your current job, you often stay for 1 yr before quitting + you often get stocks after 1 yr so it makes sense to wait. Also, the beginning of the year is well known to be the best time to change job: companies are hiring more and you often want to stay until end of Dec to get the calendar year bonus.
Employees with low and high salaries are less likely to quit. Probably because employees with high salaries are happy there and employees with low salaries are not that marketable, so they have a hard time finding a new job.