The goal of this challenge is to build a machine learning model that predicts the probability that the first transaction of a new user is fraudulent.
import bisect
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, auc, classification_report, roc_curve

import h2o
from h2o.frame import H2OFrame
from h2o.estimators.random_forest import H2ORandomForestEstimator

plt.style.use('ggplot')
warnings.simplefilter('ignore')
%matplotlib inline
Answer question 1
For each user, determine her country based on the numeric IP address.
data = pd.read_csv('Fraud_Data.csv', parse_dates=['signup_time', 'purchase_time'])
data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151112 entries, 0 to 151111
Data columns (total 11 columns):
user_id 151112 non-null int64
signup_time 151112 non-null datetime64[ns]
purchase_time 151112 non-null datetime64[ns]
purchase_value 151112 non-null int64
device_id 151112 non-null object
source 151112 non-null object
browser 151112 non-null object
sex 151112 non-null object
age 151112 non-null int64
ip_address 151112 non-null float64
class 151112 non-null int64
dtypes: datetime64[ns](2), float64(1), int64(4), object(4)
memory usage: 12.7+ MB
data.shape[0] == data['user_id'].nunique()
True
Each of the 151112 rows belongs to a distinct user, so every record is indeed the first transaction of a new user.
address2country = pd.read_csv('IpAddress_to_Country.csv')
address2country.head()
| | lower_bound_ip_address | upper_bound_ip_address | country |
|---|---|---|---|
| 0 | 16777216.0 | 16777471 | Australia |
| 1 | 16777472.0 | 16777727 | China |
| 2 | 16777728.0 | 16778239 | China |
| 3 | 16778240.0 | 16779263 | Australia |
| 4 | 16779264.0 | 16781311 | China |
countries = []
for i in range(len(data)):
    ip_address = data.loc[i, 'ip_address']
    tmp = address2country[(address2country['lower_bound_ip_address'] <= ip_address) &
                          (address2country['upper_bound_ip_address'] >= ip_address)]
    if len(tmp) == 1:
        countries.append(tmp['country'].values[0])
    else:
        countries.append('NA')
data['country'] = countries
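The row-by-row filter above rescans the whole range table for every user, which is slow over 151112 rows. Since `bisect` is already imported, a binary-search lookup is a natural speed-up; the sketch below assumes the ranges are sorted by `lower_bound_ip_address` and non-overlapping (as the head above suggests):

# Sketch: O(log n) country lookup per IP via binary search.
lowers = address2country['lower_bound_ip_address'].tolist()
uppers = address2country['upper_bound_ip_address'].tolist()
names = address2country['country'].tolist()

def ip_to_country(ip):
    idx = bisect.bisect_right(lowers, ip) - 1  # last range starting at or below ip
    if idx >= 0 and ip <= uppers[idx]:
        return names[idx]
    return 'NA'

# Equivalent to the loop above:
# data['country'] = data['ip_address'].apply(ip_to_country)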
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States |
Feature Engineering
- Time difference between sign-up time and purchase time
- Whether the device id is unique or shared by several users (many different user ids on the same device can indicate fake accounts)
- Same for the IP address: many different users sharing one IP address can indicate fake accounts
- Week of the year and day of the week extracted from the time variables
# Time between sign-up and purchase, in seconds.
# Caveat: Timedelta.seconds keeps only the sub-day component (0-86399), so
# multi-day gaps wrap around; dt.total_seconds() would capture the full gap.
time_diff = data['purchase_time'] - data['signup_time']
time_diff = time_diff.apply(lambda x: x.seconds)
data['time_diff'] = time_diff
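To see the caveat concretely (a quick check, not part of the original notebook):

# A 2-day, 10-second gap: .seconds drops the days.
td = pd.Timedelta(days=2, seconds=10)
print(td.seconds)          # 10
print(td.total_seconds())  # 172810.0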
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | time_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan | 13882 |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States | 17944 |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States | 1 |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA | 60085 |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States | 41461 |
# Count how many user accounts share each device; merge the count back
# onto every row as 'device_num'.
device_num = data[['user_id', 'device_id']].groupby('device_id').count().reset_index()
device_num = device_num.rename(columns={'user_id': 'device_num'})
data = data.merge(device_num, how='left', on='device_id')
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | time_diff | device_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan | 13882 | 1 |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States | 17944 | 1 |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States | 1 | 12 |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA | 60085 | 1 |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States | 41461 | 1 |
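As an aside, the same count feature (and `ip_num` below) can be computed without the intermediate merge; a sketch using `groupby(...).transform`, equivalent to the merge above:

# Merge-free equivalent: count user accounts per device, aligned to each row.
data['device_num'] = data.groupby('device_id')['user_id'].transform('count')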
# Same idea for IP addresses: how many user accounts share each IP.
ip_num = data[['user_id', 'ip_address']].groupby('ip_address').count().reset_index()
ip_num = ip_num.rename(columns={'user_id': 'ip_num'})
data = data.merge(ip_num, how='left', on='ip_address')
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | time_diff | device_num | ip_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan | 13882 | 1 | 1 |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States | 17944 | 1 | 1 |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States | 1 | 12 | 12 |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA | 60085 | 1 | 1 |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States | 41461 | 1 | 1 |
# Day of week (0 = Monday) and week of year for both timestamps, via the
# vectorized .dt accessor (equivalent to the original per-row .apply calls).
data['signup_day'] = data['signup_time'].dt.dayofweek
data['signup_week'] = data['signup_time'].dt.week
data['purchase_day'] = data['purchase_time'].dt.dayofweek
data['purchase_week'] = data['purchase_time'].dt.week
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | time_diff | device_num | ip_num | signup_day | signup_week | purchase_day | purchase_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan | 13882 | 1 | 1 | 1 | 9 | 5 | 16 |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States | 17944 | 1 | 1 | 6 | 23 | 0 | 24 |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States | 1 | 12 | 12 | 3 | 1 | 3 | 1 |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA | 60085 | 1 | 1 | 1 | 18 | 0 | 19 |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States | 41461 | 1 | 1 | 1 | 30 | 2 | 37 |
columns = ['signup_day', 'signup_week', 'purchase_day', 'purchase_week', 'purchase_value', 'source',
'browser', 'sex', 'age', 'country', 'time_diff', 'device_num', 'ip_num', 'class']
data = data[columns]
data.head()
| | signup_day | signup_week | purchase_day | purchase_week | purchase_value | source | browser | sex | age | country | time_diff | device_num | ip_num | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 5 | 16 | 34 | SEO | Chrome | M | 39 | Japan | 13882 | 1 | 1 | 0 |
| 1 | 6 | 23 | 0 | 24 | 16 | Ads | Chrome | F | 53 | United States | 17944 | 1 | 1 | 0 |
| 2 | 3 | 1 | 3 | 1 | 15 | SEO | Opera | M | 53 | United States | 1 | 12 | 12 | 1 |
| 3 | 1 | 18 | 0 | 19 | 44 | SEO | Safari | M | 41 | NA | 60085 | 1 | 1 | 0 |
| 4 | 1 | 30 | 2 | 37 | 39 | Ads | Safari | M | 45 | United States | 41461 | 1 | 1 | 0 |
Fraudulent Activity Identification
h2o.init()
h2o.remove_all()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
Starting server from C:\Users\Naixin\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
Ice root: C:\Users\Naixin\AppData\Local\Temp\tmpaer_0p20
JVM stdout: C:\Users\Naixin\AppData\Local\Temp\tmpaer_0p20\h2o_Naixin_started_from_python.out
JVM stderr: C:\Users\Naixin\AppData\Local\Temp\tmpaer_0p20\h2o_Naixin_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: 03 secs
H2O cluster timezone: America/Chicago
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.6
H2O cluster version age: 5 days
H2O cluster name: H2O_from_python_Naixin_uxelgu
H2O cluster total nodes: 1
H2O cluster free memory: 1.747 Gb
H2O cluster total cores: 0
H2O cluster allowed cores: 0
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.3 final
h2o_df = H2OFrame(data)
# Encode day-of-week, the categorical strings, and the label as factors.
for name in ['signup_day', 'purchase_day', 'source', 'browser', 'sex', 'country', 'class']:
    h2o_df[name] = h2o_df[name].asfactor()
h2o_df.summary()
Parse progress: |█████████████████████████████████████████████████████████| 100%
| signup_day | signup_week | purchase_day | purchase_week | purchase_value | source | browser | sex | age | country | time_diff | device_num | ip_num | class |
type | enum | int | enum | int | int | enum | enum | enum | int | enum | int | int | int | enum |
mins | | 1.0 | | 1.0 | 9.0 | | | | 18.0 | | 1.0 | 1.0 | 1.0 | |
mean | | 16.50174043093866 | | 24.65857112605202 | 36.93537243898601 | | | | 33.14070358409675 | | 40942.58442744426 | 1.6843665625496351 | 1.6027185134205097 | |
maxs | | 34.0 | | 51.0 | 154.0 | | | | 76.0 | | 86399.0 | 20.0 | 20.0 | |
sigma | | 9.814287461798903 | | 11.651556782719481 | 18.32276214866213 | | | | 8.617733490961495 | | 26049.66190211841 | 2.616953602804173 | 2.596239527375834 | |
zeros | | 0 | | 0 | 0 | | | | 0 | | 0 | 0 | 0 | |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 9.0 | 5 | 16.0 | 34.0 | SEO | Chrome | M | 39.0 | Japan | 13882.0 | 1.0 | 1.0 | 0 |
1 | 6 | 23.0 | 0 | 24.0 | 16.0 | Ads | Chrome | F | 53.0 | United States | 17944.0 | 1.0 | 1.0 | 0 |
2 | 3 | 1.0 | 3 | 1.0 | 15.0 | SEO | Opera | M | 53.0 | United States | 1.0 | 12.0 | 12.0 | 1 |
3 | 1 | 18.0 | 0 | 19.0 | 44.0 | SEO | Safari | M | 41.0 | NA | 60085.0 | 1.0 | 1.0 | 0 |
4 | 1 | 30.0 | 2 | 37.0 | 39.0 | Ads | Safari | M | 45.0 | United States | 41461.0 | 1.0 | 1.0 | 0 |
5 | 3 | 21.0 | 3 | 28.0 | 42.0 | Ads | Chrome | M | 18.0 | Canada | 7331.0 | 1.0 | 1.0 | 0 |
6 | 5 | 31.0 | 3 | 35.0 | 11.0 | Ads | Chrome | F | 19.0 | NA | 17825.0 | 1.0 | 1.0 | 0 |
7 | 0 | 15.0 | 0 | 22.0 | 27.0 | Ads | Opera | M | 34.0 | United States | 35129.0 | 1.0 | 1.0 | 0 |
8 | 1 | 17.0 | 1 | 23.0 | 30.0 | SEO | IE | F | 43.0 | China | 51800.0 | 1.0 | 1.0 | 0 |
9 | 6 | 4.0 | 0 | 13.0 | 62.0 | Ads | IE | M | 31.0 | United States | 18953.0 | 1.0 | 1.0 | 0 |
# 70/30 split, stratified on the label so both frames keep the same fraud rate.
strat_split = h2o_df['class'].stratified_split(test_frac=0.3, seed=42)
train = h2o_df[strat_split == 'train']
test = h2o_df[strat_split == 'test']

feature = ['signup_day', 'signup_week', 'purchase_day', 'purchase_week', 'purchase_value',
           'source', 'browser', 'sex', 'age', 'country', 'time_diff', 'device_num', 'ip_num']
target = 'class'

# Random forest with class balancing and early stopping on AUC.
model = H2ORandomForestEstimator(balance_classes=True, ntrees=100, mtries=-1, stopping_rounds=5,
                                 stopping_metric='auc', score_each_iteration=True, seed=42)
model.train(x=feature, y=target, training_frame=train, validation_frame=test)
drf Model Build progress: |███████████████████████████████████████████████| 100%
importance = model.varimp(use_pandas=True)
fig, ax = plt.subplots(figsize=(10, 8))
sns.barplot(x='scaled_importance', y='variable', data=importance)
plt.show()
# Ground-truth labels and predicted fraud probabilities
# ('p1' is the model's probability of class 1).
train_true = train.as_data_frame()['class'].values
test_true = test.as_data_frame()['class'].values
train_pred = model.predict(train).as_data_frame()['p1'].values
test_pred = model.predict(test).as_data_frame()['p1'].values

train_fpr, train_tpr, _ = roc_curve(train_true, train_pred)
test_fpr, test_tpr, _ = roc_curve(test_true, test_pred)
train_auc = np.round(auc(train_fpr, train_tpr), 3)
test_auc = np.round(auc(test_fpr, test_tpr), 3)
drf prediction progress: |████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%
print(classification_report(y_true=test_true, y_pred=(test_pred > 0.5).astype(int)))
precision recall f1-score support
0 0.95 1.00 0.98 41088
1 1.00 0.53 0.69 4245
accuracy 0.96 45333
macro avg 0.98 0.76 0.83 45333
weighted avg 0.96 0.96 0.95 45333
# Anchor both ROC curves at the origin so they start from (0, 0).
train_fpr = np.insert(train_fpr, 0, 0)
train_tpr = np.insert(train_tpr, 0, 0)
test_fpr = np.insert(test_fpr, 0, 0)
test_tpr = np.insert(test_tpr, 0, 0)
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(train_fpr, train_tpr, label='Train AUC: ' + str(train_auc))
ax.plot(test_fpr, test_tpr, label='Test AUC: ' + str(test_auc))
ax.plot(train_fpr, train_fpr, 'k--', label='Chance Curve')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.grid(True)
ax.legend(fontsize=12)
plt.show()
Based on the ROC, if we care about minimizing false positives, we would choose a cut-off that gives a true positive rate of ~0.5 with a false positive rate of almost zero (essentially what the random forest's default 0.5 cut-off produces, as the classification report above shows). If we instead care about maximizing true positives, we have to lower the cut-off: more events get classified as "1", some correctly (true positives go up), but many incorrectly (false positives go up as well).
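To make the trade-off concrete, roc_curve also returns the thresholds, so we can look up the cut-off that first reaches a target true positive rate. A sketch (the 0.9 target is a hypothetical requirement, not from the original analysis):

# Recompute the test ROC to get the matching thresholds array.
fpr, tpr, thresholds = roc_curve(test_true, test_pred)
target_tpr = 0.9  # hypothetical business requirement
idx = int(np.argmax(tpr >= target_tpr))  # first index reaching the target
print(f'cut-off {thresholds[idx]:.3f} -> TPR {tpr[idx]:.3f} at FPR {fpr[idx]:.3f}')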
cols = ['device_num', 'time_diff', 'purchase_week', 'country', 'ip_num']
_ = model.partial_plot(data=train, cols=cols, nbins=200, figsize=(18, 20))
PartialDependencePlot progress: |█████████████████████████████████████████| 100%
h2o.cluster().shutdown()
Regarding how to use this from a product perspective: we now have a model that assigns each user a probability of committing fraud, so we can create different experiences based on that score. For instance (sketched in code after the list):
- If the predicted fraud probability is below X, the user gets the normal experience (the vast majority should fall here).
- If X <= predicted fraud probability < Z (the user is at risk, but not overwhelmingly so), add an extra verification step, such as confirming a phone number via an SMS code or logging in via Facebook.
- If the predicted fraud probability is >= Z (the user is very likely attempting fraud), put the session on hold and route the case to manual review, where an analyst either blocks the user or clears the session to resume.
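A minimal sketch of this tiered policy (X and Z are hypothetical thresholds that would be tuned against the ROC trade-off above):

# Hypothetical thresholds; in practice they are tuned against the cost of
# false positives versus missed fraud.
X, Z = 0.2, 0.8

def route_user(fraud_prob):
    """Map a predicted fraud probability to a product experience."""
    if fraud_prob < X:
        return 'normal experience'
    elif fraud_prob < Z:
        return 'additional verification (e.g. SMS code)'
    else:
        return 'session on hold, manual review'

print(route_user(0.05), '|', route_user(0.50), '|', route_user(0.95))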
This is just one example; there are many ways to build products around a fraud score. The broader point is that an ML model is most useful when it is paired with a product that plays to its strengths and mitigates its drawbacks (such as false positives).