This notebook illustrates techniques and machine learning algorithms for detecting fraudulent credit card transactions. A common issue with this type of problem is class imbalance, which we address with oversampling and undersampling techniques. Modeling is done with Logistic Regression and Random Forest classifiers. Let's get started.
-
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
-
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
-
Given the class imbalance ratio, we recommend measuring performance using the Area Under the Precision-Recall Curve (AUPRC); plain confusion-matrix accuracy is not meaningful for unbalanced classification.
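For reference, here is a minimal sketch of how AUPRC can be computed with scikit-learn. It assumes a fitted classifier model that supports predict_proba and a held-out X_test, y_test; it is not part of the original notebook's pipeline.

from sklearn.metrics import average_precision_score, precision_recall_curve
# Predicted probability of the positive (fraud) class
y_scores = model.predict_proba(X_test)[:, 1]
# Average precision summarizes the precision-recall curve (AUPRC)
print(f'AUPRC: {average_precision_score(y_test, y_scores):.3f}')
# Precision/recall pairs, in case the full curve needs to be plotted
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)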
-
Data is taken from kaggle.com: https://www.kaggle.com/mlg-ulb/creditcardfraud

Summary of the results obtained in this notebook:
Dataset | Model | Tuned | Precision | Recall | F1 | AUC |
---|---|---|---|---|---|---|
Upsampling | Random Forest | No | 0.86 | 0.77 | 0.81 | 97% |
Imbalanced | Logistic Regression | Yes | 0.54 | 0.82 | 0.65 | 97% |
Downsampling | Logistic Regression | No | 0.05 | 0.82 | 0.10 | 93% |
Upsampling | Logistic Regression | No | 0.01 | 0.94 | 0.02 | 97% |
Downsampling | Random Forest | No | 0.00 | 1.00 | 0.00 | 97% |
- Exploratory Data Analysis
- Preprocessing
- Modeling
- 3.1 Isolation Forest
- 3.2 Hyperparameter Tuning for Logistic Regression (Imbalanced Data)
- 3.3 Logistic Regression with Optimal Parameters (Imbalanced Data)
- 3.4 Logistic Regression on Balanced Data (Downsampling)
- 3.5 Logistic Regression on Balanced Data (Upsampling)
- 3.6 Random Forest on Balanced Data (Downsampling)
- 3.7 Random Forest on Balanced Data (Upsampling)
- Conclusion
# Data manipulation, visualization
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
import seaborn as sns
from time import time
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans
from sklearn.svm import SVC
import xgboost as xgb
# Preprocessing, dimensionality reduction
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
#from sklearn.decomposition import SparsePCA
#from sklearn.decomposition import TruncatedSVD
#from sklearn.manifold import TSNE
#from sklearn.pipeline import make_pipeline
# Evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_roc_curve
import warnings
warnings.filterwarnings('ignore')
def evaluation(model, X_train, y_train, X_test, y_test, y_pred) -> None:
    # Per-class precision, recall and F1 on the test set
    print(classification_report(y_test, y_pred))
    print('-' * 80)
    # Cross-validated recall, computed on the held-out test data
    recall_cv = cross_val_score(model, X_test, y_test, cv=5, scoring='recall')
    print(f'Cross-Validation Score (Recall): {recall_cv.mean()}')
    print('-' * 80)
    # Confusion matrix as a heatmap
    plt.rcParams['figure.dpi'] = 72
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g')
    plt.show()
    print('-' * 80)
    # ROC curve with AUC
    logreg_auc = plot_roc_curve(model, X_test, y_test)
    plt.show()
df = pd.read_csv('creditcard.csv')
print(f'Dataset size: {len(df)}')
print(f'Features: {df.shape[1]}')
Dataset size: 284807
Features: 31
df.head()
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
The dataset has no null values:
df.isna().sum().sum()
0
Looking at the data, we can tell that Amount and Time are not scaled, and most of the features have outliers.
df.describe()
Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 284807.000000 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | ... | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 2.848070e+05 | 284807.000000 | 284807.000000 |
mean | 94813.859575 | 1.168375e-15 | 3.416908e-16 | -1.379537e-15 | 2.074095e-15 | 9.604066e-16 | 1.487313e-15 | -5.556467e-16 | 1.213481e-16 | -2.406331e-15 | ... | 1.654067e-16 | -3.568593e-16 | 2.578648e-16 | 4.473266e-15 | 5.340915e-16 | 1.683437e-15 | -3.660091e-16 | -1.227390e-16 | 88.349619 | 0.001727 |
std | 47488.145955 | 1.958696e+00 | 1.651309e+00 | 1.516255e+00 | 1.415869e+00 | 1.380247e+00 | 1.332271e+00 | 1.237094e+00 | 1.194353e+00 | 1.098632e+00 | ... | 7.345240e-01 | 7.257016e-01 | 6.244603e-01 | 6.056471e-01 | 5.212781e-01 | 4.822270e-01 | 4.036325e-01 | 3.300833e-01 | 250.120109 | 0.041527 |
min | 0.000000 | -5.640751e+01 | -7.271573e+01 | -4.832559e+01 | -5.683171e+00 | -1.137433e+02 | -2.616051e+01 | -4.355724e+01 | -7.321672e+01 | -1.343407e+01 | ... | -3.483038e+01 | -1.093314e+01 | -4.480774e+01 | -2.836627e+00 | -1.029540e+01 | -2.604551e+00 | -2.256568e+01 | -1.543008e+01 | 0.000000 | 0.000000 |
25% | 54201.500000 | -9.203734e-01 | -5.985499e-01 | -8.903648e-01 | -8.486401e-01 | -6.915971e-01 | -7.682956e-01 | -5.540759e-01 | -2.086297e-01 | -6.430976e-01 | ... | -2.283949e-01 | -5.423504e-01 | -1.618463e-01 | -3.545861e-01 | -3.171451e-01 | -3.269839e-01 | -7.083953e-02 | -5.295979e-02 | 5.600000 | 0.000000 |
50% | 84692.000000 | 1.810880e-02 | 6.548556e-02 | 1.798463e-01 | -1.984653e-02 | -5.433583e-02 | -2.741871e-01 | 4.010308e-02 | 2.235804e-02 | -5.142873e-02 | ... | -2.945017e-02 | 6.781943e-03 | -1.119293e-02 | 4.097606e-02 | 1.659350e-02 | -5.213911e-02 | 1.342146e-03 | 1.124383e-02 | 22.000000 | 0.000000 |
75% | 139320.500000 | 1.315642e+00 | 8.037239e-01 | 1.027196e+00 | 7.433413e-01 | 6.119264e-01 | 3.985649e-01 | 5.704361e-01 | 3.273459e-01 | 5.971390e-01 | ... | 1.863772e-01 | 5.285536e-01 | 1.476421e-01 | 4.395266e-01 | 3.507156e-01 | 2.409522e-01 | 9.104512e-02 | 7.827995e-02 | 77.165000 | 0.000000 |
max | 172792.000000 | 2.454930e+00 | 2.205773e+01 | 9.382558e+00 | 1.687534e+01 | 3.480167e+01 | 7.330163e+01 | 1.205895e+02 | 2.000721e+01 | 1.559499e+01 | ... | 2.720284e+01 | 1.050309e+01 | 2.252841e+01 | 4.584549e+00 | 7.519589e+00 | 3.517346e+00 | 3.161220e+01 | 3.384781e+01 | 25691.160000 | 1.000000 |
8 rows × 31 columns
1. Exploratory Data Analysis
The target variable (Class) is extremely imbalanced because most of the transactions are non-fraudulent (class 0).
This presents two problems:
- The model will have trouble learning fraudulent transactions because there is very little data for them
- The model cannot be evaluated with simple accuracy, because even if it failed to identify fraudulent transactions at all, it would still be correct about 99.8% of the time

To deal with this we will most likely use the over/undersampling techniques that come with the imblearn package.
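As a quick sanity check (a one-liner sketch on the df loaded above), we can look at the class ratio directly:

# Fraction of records in each class; class 1 should be roughly 0.0017 (0.172%)
print(df.Class.value_counts(normalize=True))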
plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.dpi'] = 70
plt.bar(
df.Class.value_counts().keys(),
df.Class.value_counts().values,
tick_label = [0, 1]
)
plt.title('Target Variable', fontsize=16)
plt.xlabel('Classes', fontsize=13)
plt.ylabel('Records', fontsize=13)
plt.show()
print(f'Maximum amount for normal transaction: {df[df.Class==0]["Amount"].max():>14}')
print(f'Maximum amount for fraudulent transaction: {df[df.Class==1]["Amount"].max():>10}')
Maximum amount for normal transaction: 25691.16
Maximum amount for fraudulent transaction: 2125.87
The average transaction amount is 88, but some transactions go up to ~25,000. In fact, ~99.8% of all transactions are within the 0-2200 range. With that in mind, it makes sense to cut outliers out of the dataset while keeping all the fraudulent transactions in place.
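A quick way to sanity-check the 2200 cutoff (a small sketch on the df loaded above):

# Share of transactions at or below the cutoff (~99.8%)
print((df.Amount <= 2200).mean())
# Number of fraudulent transactions that would be dropped by the cutoff (should be 0)
print((df[df.Class == 1].Amount > 2200).sum())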
print(f'Original dataset size: {len(df)}')
df = df[df.Amount <= 2200]
print(f'After removing outliers: {len(df)}')
Original dataset size: 284807
After removing outliers: 284239
We can observe that Amount still contains some outliers, but I decided to keep these data points anyway.
plt.boxplot(df.Amount)
plt.show()
Amounts are similar for both normal and fraudulent transactions
plt.figure(figsize=(16,5))
plt.hist(df[df.Class==0].Amount, bins=70, density=True, label='Normal')
plt.hist(df[df.Class==1].Amount, bins=70, density=True, alpha=0.7, label='Fraudulent')
plt.title('Amounts', fontsize=16)
plt.xticks([i for i in range(0, 2500, 250)])
plt.legend()
plt.show()
Features V1 - V28 are scaled to some degree. We can observe that these features have outliers on both ends.
plt.rcParams['figure.dpi'] = 220
fig, axs = plt.subplots(7, 4, figsize=(15,15))
fig.suptitle('Checking for Outliers')
fig.tight_layout()
v = 1
for i in range(7):
    for j in range(4):
        axs[i, j].set_title(f'V{v}')
        axs[i, j].boxplot(df[f'V{v}'])
        v += 1
plt.figure(figsize=(16,10))
sns.set(font_scale=1.2)
sns.heatmap(df.corr(), linewidths=0.01, linecolor='#6b1e5b')
plt.show()
We are going to use Principal Component Analysis to project the data onto the two components with maximum variance.
pca = PCA(2)
pca_features = pca.fit_transform(X=df.drop(['Time','Class'], axis=1), y=df.Class)
pca_2d = pd.DataFrame(pca_features, columns=['Feature 1', 'Feature 2'])
pca_2d['Class'] = df.Class.to_numpy()  # .to_numpy() keeps labels aligned positionally, since df's index has gaps after the outlier filter
plt.rcParams['figure.dpi'] = 227
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(14,5))
plt.scatter(pca_2d[pca_2d.Class==0]['Feature 1'], pca_2d[pca_2d.Class==0]['Feature 2'])
plt.scatter(pca_2d[pca_2d.Class==1]['Feature 1'], pca_2d[pca_2d.Class==1]['Feature 2'], alpha=0.6)
plt.legend(['Non Fraudulent', 'Fraudulent'])
plt.show()
df_sample = df[df.Class == 0].sample(150000)
df_sample = pd.concat([df_sample, df[df.Class == 1]])
pca = PCA(3)
pca_features = pca.fit_transform(X=df_sample.drop(['Time', 'Class'], axis=1))
pca_df = pd.DataFrame(pca_features, columns=['Feature 1', 'Feature 2', 'Feature 3'])
pca_df['Target'] = df_sample.Class.to_numpy()
fig = px.scatter_3d(
pca_df,
x='Feature 3',
y='Feature 2',
z='Feature 1',
color='Target',
width=1000,
height=1000,
opacity=1
)
fig.show()
2. Preprocessing
X = df.drop(['Class'], axis=1)
y = df.Class
Before we decide how to deal with the class imbalance, we should split the data into train and test parts to avoid data leakage.
X_train, \
X_test, \
y_train, \
y_test = train_test_split(X, y, train_size=0.75)
Scaling is important for some ML algorithms, as it reduces training time and can improve overall model performance. Tree-based models do not require any kind of normalization and will perform the same either way.
scaling = MinMaxScaler()
X_train = scaling.fit_transform(X_train)
X_test = scaling.transform(X_test)
We are going to try both downsampling and upsampling and see which technique produces better results. Here the downsampling technique is applied to the train part of the dataset; the result is an equal number of positive and negative examples.
downsample = NearMiss(n_jobs=-1)
X_train_down, y_train_down = downsample.fit_resample(X_train, y_train)
# y_train_down now has equal amounts of observations for each class
y_train_down.value_counts()
0 352
1 352
Name: Class, dtype: int64
# Test set contains data that model has no access to
y_test.value_counts()
0 70920
1 140
Name: Class, dtype: int64
A common technique is to upsample with SMOTE (Synthetic Minority Oversampling Technique), which produces an equal number of observations for both classes by generating synthetic samples for the minority class, in our case Class 1.
upsample = SMOTE(n_jobs=-1)
X_train_up, y_train_up = upsample.fit_resample(X_train, y_train)
# y_train_up now has equal amounts of observations for each class
y_train_up.value_counts()
0 212827
1 212827
Name: Class, dtype: int64
# Test set contains data that model has no access to
y_test.value_counts()
0 70920
1 140
Name: Class, dtype: int64
3. Modeling
The modeling part is all about trying different models, tuning hyperparameters, and validating the results.
Isolation Forest is a tree-based model used for identifying outliers. Without going into the details of its inner workings, we can fit the model on Class 0 only, then predict on Class 1 and observe how many positive (fraudulent) observations are labelled as outliers. The result gives an idea of what to expect from models such as Logistic Regression.
clf = IsolationForest(n_estimators=100)
clf.fit(df[df.Class==0])
y_pred_outliers = clf.predict(df[df.Class==1])
sum(y_pred_outliers == -1) / len(y_pred_outliers)
0.8252032520325203
82.5% of the fraudulent observations were correctly identified as outliers, which means about 17.5% of the fraudulent transactions blend in with normal transactions.
First, I want to find the best parameters for the logistic regression model on the original, imbalanced data. Logistic regression has a class_weight parameter, which helps to overcome class imbalance by putting more weight on the minority class.
parameters = {
'C': (0.01, 0.5, 1),
'class_weight': [{0: w1, 1: w2} for w1 in (0.05, 0.1) for w2 in (2.5, 3, 3.5, 4)]
}
logreg = LogisticRegression()
logreg_gridsearch = GridSearchCV(logreg, param_grid=parameters, scoring='recall', n_jobs=-1, cv=5)
logreg_gridsearch.fit(X_train, y_train)
logreg_gridsearch.best_estimator_
LogisticRegression(C=1, class_weight={0: 0.05, 1: 4})
# We won't need these anymore
try:
    del parameters, logreg, logreg_gridsearch
except NameError:
    print('Parameters already got deleted')
It makes no difference whether I reuse logreg_gridsearch or start a new model with the best parameters. For my own convenience I want to start with a clean slate and fit a brand-new model with the optimal hyperparameters.
logreg = LogisticRegression(C=1, class_weight={0: 0.05, 1: 4})
logreg.fit(X_train, y_train)
LogisticRegression(C=1, class_weight={0: 0.05, 1: 4})
y_pred = logreg.predict(X_test)
evaluation(logreg, X_train, y_train, X_test, y_test, y_pred)
precision recall f1-score support
0 1.00 1.00 1.00 70934
1 0.54 0.82 0.65 126
accuracy 1.00 71060
macro avg 0.77 0.91 0.82 71060
weighted avg 1.00 1.00 1.00 71060
--------------------------------------------------------------------------------
Cross-Validation Score (Recall): 0.7449230769230769
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
'''
false_negative = pd.DataFrame(y_test)
false_negative['Prediction'] = y_pred
fn_index = false_negative[(false_negative['Class'] == 1) & (false_negative['Prediction'] == 0)].index
tp_index = false_negative[(false_negative['Class'] == 1) & (false_negative['Prediction'] == 1)].index
scaling.inverse_transform(df.loc[tp_index].Amount.to_numpy().reshape(-1, 1)).mean()
scaling.inverse_transform(df.loc[fn_index].Amount.to_numpy().reshape(-1, 1)).mean()
'''
"\nfalse_negative = pd.DataFrame(y_test)\nfalse_negative['Prediction'] = y_pred\nfn_index = false_negative[(false_negative['Class'] == 1) & (false_negative['Prediction'] == 0)].index\ntp_index = false_negative[(false_negative['Class'] == 1) & (false_negative['Prediction'] == 1)].index\nscaling.inverse_transform(df.loc[tp_index].Amount.to_numpy().reshape(-1, 1)).mean()\nscaling.inverse_transform(df.loc[fn_index].Amount.to_numpy().reshape(-1, 1)).mean()\n"
logreg = LogisticRegression()
logreg.fit(X_train_down, y_train_down)
y_pred = logreg.predict(X_test)
evaluation(logreg, X_train, y_train, X_test, y_test, y_pred)
precision recall f1-score support
0 1.00 0.97 0.99 70934
1 0.05 0.82 0.10 126
accuracy 0.97 71060
macro avg 0.53 0.90 0.54 71060
weighted avg 1.00 0.97 0.98 71060
--------------------------------------------------------------------------------
Cross-Validation Score (Recall): 0.42030769230769227
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
logreg = LogisticRegression(C=1)
logreg.fit(X_train_up, y_train_up)
y_pred = logreg.predict(X_test)
evaluation(logreg, X_train, y_train, X_test, y_test, y_pred)
precision recall f1-score support
0 1.00 0.81 0.89 70934
1 0.01 0.94 0.02 126
accuracy 0.81 71060
macro avg 0.50 0.88 0.46 71060
weighted avg 1.00 0.81 0.89 71060
--------------------------------------------------------------------------------
Cross-Validation Score (Recall): 0.42030769230769227
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
start = time()
random_forest = RandomForestClassifier(n_jobs=-1)
random_forest.fit(X_train_down, y_train_down)
y_pred = random_forest.predict(X_test)
evaluation(random_forest, X_train, y_train, X_test, y_test, y_pred)
print(f'Execution Time: {(time() - start):.2f} sec.')
precision recall f1-score support
0 1.00 0.00 0.00 70934
1 0.00 1.00 0.00 126
accuracy 0.00 71060
macro avg 0.50 0.50 0.00 71060
weighted avg 1.00 0.00 0.00 71060
--------------------------------------------------------------------------------
Cross-Validation Score (Recall): 0.42030769230769227
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Execution Time: 1.62 sec.
start = time()
random_forest = RandomForestClassifier(n_jobs=-1)
random_forest.fit(X_train_up, y_train_up)
y_pred = random_forest.predict(X_test)
evaluation(random_forest, X_train, y_train, X_test, y_test, y_pred)
print(f'Execution Time: {(time() - start):.2f} sec.')
precision recall f1-score support
0 1.00 1.00 1.00 70934
1 0.86 0.75 0.80 126
accuracy 1.00 71060
macro avg 0.93 0.88 0.90 71060
weighted avg 1.00 1.00 1.00 71060
--------------------------------------------------------------------------------
Cross-Validation Score (Recall): 0.42030769230769227
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Execution Time: 79.99 sec.