The problem of detecting fraud in e-commerce transactions using machine learning models

Aug 31, 2023 (10 min read)

The volume of online transactions grows every year, and online sales now operate at an enormous scale. As a result, the reliability of the payment process has become critically important, and it has often turned out to be a bottleneck in the growth of sales. Unfortunately, fraud has grown along with turnover. At this volume, transactions cannot be reviewed manually, which forces the use of artificial intelligence systems: automated detection is the only practical way to detect fraud effectively and reduce its scale in the e-commerce industry.

What are models and why are they effective in detecting fraud?

Mathematical models are used to generalize reality and to find the factors that increase the probability of an anomaly. Models can also indicate trends, predict future sales, or power recommendation systems. Here we will be talking about machine learning models whose primary purpose is to detect fraud. Binary (dichotomous) classification models suit this purpose best: they indicate whether a transaction is legitimate or fraudulent, so the result of the model's computation is a value of 0 or 1. There are also multiclass classification models. Such models could not only pick out fraud among a large number of ordinary transactions but also determine what type of fraud it is, with each class representing a specific type of fraud.

Analysis of the training database

First, we need to download the dataset. The data source can be found at this link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
import warnings
import joblib
import time

warnings.filterwarnings("ignore")

from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from catboost import CatBoostClassifier
from termcolor import colored as cl
from simple_colors import *
from prettytable import PrettyTable
from pandas.plotting import scatter_matrix

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from beautifultable import BeautifulTable

df = pd.read_csv('creditcard.csv')
del df['Time']
df.head(4)

Next, I assess the number of anomalies and the degree of class imbalance in the dataset.

# Class shares and absolute class counts; the first value is the majority (clean) class
a1, b1 = df.Class.value_counts(normalize=True)
a2, b2 = df.Class.value_counts()
a3, a4 = df.shape

table = BeautifulTable()
table.column_headers = ["", "quantity", "percentage"]
table.append_row(["Clean transactions:", a2, round(100 * a1, 2)])
table.append_row(["Phony transactions:", b2, round(100 * b1, 2)])
table.append_row(["Total number of operations:", a3, 100])
print(table)

The dataset contains a register of over 284 thousand transactions, of which almost 500 are fraudulent. After the Time column is dropped, it contains 29 features in numerical format. For a better understanding, let’s imagine that these are such characteristics as the customer’s gender, place of payment, time of payment, type of goods for which the fee was paid, and so on. The data in this dataset are numerical, while in normal practice the data is both discrete and continuous, and it can also be in text form. Before machine learning models can use such data effectively, all text values must be encoded in numerical form.
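The encoding step mentioned above is not needed for this dataset, but it can be sketched on a small, made-up table. The column names and values below are hypothetical and serve only as an illustration; one-hot encoding with pandas is one common option, not the only one.

```python
import pandas as pd

# Hypothetical transaction records with text columns (not taken from the dataset)
raw = pd.DataFrame({
    "amount": [12.5, 250.0, 40.0],
    "gender": ["F", "M", "F"],
    "goods_type": ["books", "electronics", "books"],
})

# One-hot encoding: every category becomes its own 0/1 column
encoded = pd.get_dummies(raw, columns=["gender", "goods_type"], dtype=int)
print(encoded)
```

After this step every column is numeric, so the table can be passed to the scaling and training code used in the rest of the article.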

df = df.sample(frac = 0.1, random_state=148)

After loading the data, we randomly reduced the set to 10% of its original size. About 28 thousand transactions are enough to train machine learning models, and the smaller set makes the calculations run quickly.

Data standardization

Below we perform a simple statistical analysis of the descriptive variables expressed in numerical values. The goal is to see how much the ranges of the individual independent variables differ.

You can immediately see that the independent variable “Amount”, which describes the transaction value, takes high values that differ significantly from those of the other independent variables. Unfortunately, the other features also differ significantly from each other. The problem is that many machine learning models, especially distance-based ones such as KNN and SVM, treat a variable with larger values as more important. In extreme cases, a variable that is of key importance to the process may be ignored by the model only because its nominal values are small. Therefore, it is good practice to normalize the data, i.e. to transform the numbers describing the features so that the minimum value of each feature is 0 and the maximum is 1.

print("shape:",df.shape)
df.hist(figsize=(18,12))
plt.show()

maxy = df.max().values
miny = df.min().values
meany = df.mean().values
stdy = df.std().values

keys=['max','min','mean','stdy']
vals=[maxy,miny,meany, stdy]

data = {key:vals[n] for n, key in enumerate(keys)}
Na = df.columns.tolist()
stat =pd.DataFrame(data, index = Na)
stat.tail(10)

Data normalization

Na = df.columns.tolist()
print(Na)

Na.remove('Class')
print(Na)

from sklearn.preprocessing import MinMaxScaler

# Scale every feature (all columns except 'Class') to the [0, 1] range;
# MinMaxScaler works column by column, so no explicit loop is needed
sc = MinMaxScaler()
df[Na] = sc.fit_transform(df[Na])

maxy = df.max().values
miny = df.min().values
meany = df.mean().values
stdy = df.std().values

keys=['max','min','mean','stdy']
vals=[maxy,miny,meany, stdy]

data = {key:vals[n] for n, key in enumerate(keys)}
Na = df.columns.tolist()
stat =pd.DataFrame(data, index = Na)
stat.tail(10)

We managed to carry out this normalization, as the table above shows.
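Min-max normalization maps each value x of a feature to (x - min) / (max - min). A model trained on scaled features expects any later data to be transformed with the same minimum and maximum. Below is a minimal sketch of that idea on made-up numbers; the arrays `train` and `new` are assumptions for illustration and are not part of the article's pipeline.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up values): column 0 plays the role of "Amount"
train = np.array([[10.0, 0.1], [50.0, 0.3], [110.0, 0.5]])
new = np.array([[60.0, 0.2]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max of each column
new_scaled = scaler.transform(new)          # reuses the learned min and max

print(train_scaled)
print(new_scaled)
```

Keeping the fitted scaler object around is what makes it possible to apply an identical transformation to data that arrives after training.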

Defining the target variable and the training and test sets

target = 'Class'

X = df.drop(target, axis=1) 
y = df[target]  

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=148,
                                                    stratify=y)

Oversampling

A very important issue when building models that detect anomalies is class imbalance. Anomalies, as the name suggests, are rare phenomena and may constitute an insignificant part of all transactions. That is also the case here: we found 492 anomalies among over 284,000 transactions, so anomalies account for about 0.17% of the register. Trained on such data, a classification model can reach very high accuracy while largely ignoring the anomalies. As we remember, the main goal of the model is to generalize reality: based on the descriptive variables, the model finds relationships that allow it to identify the desired values effectively.
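To see why such a small share of anomalies is a problem, consider a trivial "model" that labels every transaction as legitimate. The sketch below uses the dataset's row count of 284,807 (the text rounds it to 284,000) and the 492 fraud cases; the baseline itself is a thought experiment, not one of the models trained later.

```python
# Counts for the full dataset: 284,807 transactions, 492 of them fraudulent
total = 284_807
fraud = 492
legit = total - fraud

# A baseline that predicts class 0 (legitimate) for every transaction
true_negatives = legit
false_negatives = fraud
true_positives = 0

accuracy = (true_positives + true_negatives) / total
recall = true_positives / (true_positives + false_negatives)

print(f"fraud share: {fraud / total:.3%}")
print(f"accuracy:    {accuracy:.3%}")
print(f"recall:      {recall:.0%}")
```

The baseline is right in more than 99.8% of cases and still detects no fraud at all, which is why accuracy alone says little on an imbalanced register.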
In our case, it is necessary to balance the transaction register, that is, to reach a situation where the number of legitimate transactions equals the number of fraudulent ones. We could achieve this by reducing the number of legitimate transactions to the level of fraudulent transactions (undersampling), which in the full register would leave about 1,000 transactions, 500 of each type. This solution is not optimal, because one thousand transactions are too few to build an effective tool for identifying fraud. The other solution is to create copies of the fraudulent transactions until their number is roughly equal to the number of legitimate transactions. This method is called oversampling. The code below shows one way to perform this operation. I wrote it myself, but you can also use ready-made tools such as the resample function from scikit-learn or the imbalanced-learn library.

def oversampling(ytrain, Xtrain):
    """Balance the training set by appending copies of the fraud (class 1) rows."""
    n0 = int((ytrain == 0).sum())
    n1 = int((ytrain == 1).sum())

    # Class shares in percent (multiply before rounding so small shares are not lost)
    print("y = 0: ", n0, '-------', np.round(100 * n0 / (n0 + n1), decimals=2), '%')
    print("y = 1: ", n1, '-------', np.round(100 * n1 / (n0 + n1), decimals=2), '%')
    print('--------------------------------------------------------')

    # How many copies of every fraud row to append
    ratio = int(np.round(n0 / n1))

    y_copies = pd.concat([ytrain[ytrain == 1]] * ratio, axis=0)
    X_copies = pd.concat([Xtrain.loc[ytrain == 1, :]] * ratio, axis=0)

    # Original rows plus the copies, with a fresh index
    ytrain_OV = pd.concat([ytrain, y_copies], axis=0).reset_index(drop=True)
    Xtrain_OV = pd.concat([Xtrain, X_copies], axis=0).reset_index(drop=True)

    print("Before oversampling Xtrain:     ", Xtrain.shape)
    print("Before oversampling ytrain:     ", ytrain.shape)
    print('--------------------------------------------------------')
    print("After oversampling Xtrain_OV:  ", Xtrain_OV.shape)
    print("After oversampling ytrain_OV:  ", ytrain_OV.shape)
    print('--------------------------------------------------------')

    # Class shares before and after oversampling, side by side
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    ytrain.value_counts(dropna=False, normalize=True).plot(
        kind='pie', title='Before oversampling', ax=axes[0])
    ytrain_OV.value_counts(dropna=False, normalize=True).plot(
        kind='pie', title='After oversampling', ax=axes[1])
    plt.show()

    return Xtrain_OV, ytrain_OV

Xtrain_OV, ytrain_OV = oversampling(y_train, X_train)
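For comparison, here is a minimal sketch of the library-based route, using `sklearn.utils.resample` to draw fraud rows with replacement. The toy data and the names `X_toy`, `y_toy`, `X_bal` and `y_bal` are assumptions made for illustration; this is not the code used to produce the results in this article.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set (made-up values): 6 legitimate rows and 2 fraudulent rows
X_toy = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 0.8]})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="Class")

minority = X_toy[y_toy == 1].assign(Class=1)
majority = X_toy[y_toy == 0].assign(Class=0)

# Draw fraud rows with replacement until both classes have the same size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=148)

balanced = pd.concat([majority, minority_up]).reset_index(drop=True)
X_bal = balanced.drop(columns="Class")
y_bal = balanced["Class"]
print(y_bal.value_counts())
```

Unlike the hand-written function, this variant makes the two classes exactly equal in size. In both cases only the training set should be oversampled, so that the test set keeps the real class proportions.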

As you can see in the pie chart, we now have a balanced training set of transactions.

Machine learning models

There are many machine learning models for classification. I selected 9 models (including two logistic regression models that differ in the solver used) to train. There is no space in this publication to discuss each of them. The models were used with mostly default settings. They are run in a loop, and the code also measures the training time of each model in seconds. If we used the full set of transactions, the calculation time would be very long, especially for the SVM model.

SVM = SVC(probability=True)
CBC = CatBoostClassifier(verbose=0, n_estimators=100)
XGB = XGBClassifier()
LREN = LogisticRegression(solver='newton-cg')
KNN = KNeighborsClassifier(n_neighbors=1, p=2)
NBC = GaussianNB()
LRE = LogisticRegression(solver='lbfgs')
RFC = RandomForestClassifier()
GBC = GradientBoostingClassifier()

print('----classification models-----------------------------')
print()

clasifier_VAL = [SVM,CBC,XGB,LREN,KNN,NBC,LRE,RFC,GBC]
name_VAL = ['SVM','CBC','XGB','LREN','KNN','NBC','LRE','RFC','GBC']

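for n,t in zip(name_VAL,clasifier_VAL):
    # Fit each model on the oversampled training set and time the fit
    start_time = time.time()
    t.fit(Xtrain_OV, np.ravel(ytrain_OV))   # ravel: pass the labels as a 1-D array
    p = np.round((time.time() - start_time),decimals=1)
    print(blue(n),p,"seconds of training time")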

Validation of machine learning models

The models have been trained on the oversampled training set. Time to validate them. I wrote two diagnostic modules. The first uses traditional classification quality indicators such as precision and recall, and also reports the counts from the individual fields of the so-called confusion matrix.

def Type_error(six_classifiers, name, X_test, y_test):
    """Print quality metrics and confusion-matrix counts for each fitted classifier."""
    from sklearn.metrics import confusion_matrix, roc_auc_score

    # Each list starts with its row label; one value per model is appended below
    RECALL = ['RECALL']
    PRECISION = ['PRECISION']
    ACCURACY = ['ACCURACY']
    AUC = ['AUC']
    TN = ['True Negative']
    FN = ['False Negative']
    TP = ['True Positive']
    FP = ['False Positive']

    def compute_metric(model):
        # The model is already trained, so it is only used for prediction here
        cm = confusion_matrix(y_test, model.predict(X_test))
        tn, fp, fn, tp = cm.ravel()

        auc = np.round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), decimals=3)
        precision = np.round(tp / (tp + fp), decimals=3)
        recall = np.round(tp / (tp + fn), decimals=3)
        accuracy = np.round((tp + tn) / (tp + tn + fp + fn), decimals=3)

        return auc, precision, recall, accuracy, tp, tn, fn, fp

    for cls in six_classifiers:
        results = compute_metric(cls)
        AUC.append(blue(results[0], 'bold'))
        PRECISION.append(red(results[1], 'bold'))
        RECALL.append(red(results[2], 'bold'))
        ACCURACY.append(green(results[3], 'bold'))
        TP.append(green(results[4], 'bold'))
        TN.append(green(results[5], 'bold'))
        FN.append(green(results[6], 'bold'))
        FP.append(green(results[7], 'bold'))

    # One column per model, whatever the number of models passed in
    header = ['Name'] + list(name)

    t = PrettyTable(header)
    t.add_row(AUC)
    t.add_row(PRECISION)
    t.add_row(RECALL)
    t.add_row(ACCURACY)

    t2 = PrettyTable(header)
    t2.add_row(TP)
    t2.add_row(TN)
    t2.add_row(FN)
    t2.add_row(FP)

    print(t)
    print(t2)

Type_error(clasifier_VAL,name_VAL,X_test,y_test)

The confusion matrix analysis is very simple and intuitive. For example, for the XGB model, the number of correctly indicated falsified transactions is 7 (True Positive). However, 3 fraudulent transactions were not detected by the model (False Negative). The XGB model identified one honest transaction as falsified (False Positive).
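The link between these counts and the precision and recall indicators can be checked by hand. The short sketch below recomputes both for the XGB example; the helper name `precision_recall` is hypothetical and not part of the diagnostic module.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# XGB counts quoted above: 7 true positives, 1 false positive, 3 false negatives
precision, recall = precision_recall(tp=7, fp=1, fn=3)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# → precision = 0.875, recall = 0.700
```

In other words, 7 of the 8 transactions flagged by XGB were really fraudulent, and 7 of the 10 fraudulent transactions in the test set were caught.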

And here is the second part of the universal code I wrote for diagnosing classification models. This time I used the Receiver Operating Characteristic (ROC) curve with the AUC value, plotted for each model.

def BinaryClassPlot(six_classifiers,name, X_test,y_test):

    from matplotlib import rcParams     
    rcParams['axes.titlepad'] = 20 
    
   
    from plot_metric.functions import BinaryClassification
      
    plt.figure(figsize=(15,10))
    grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)
        
    for i in range(9):
        col, row = i%3,i//3
        ax = plt.subplot(grid[row,col]) 
        ax.title.set_color('blue')
            
        model = six_classifiers[i]
        bc = BinaryClassification(y_test, model.predict_proba(X_test)[:,1], labels=["Class 1", "Class 2"])
        bc.plot_roc_curve(title=type(six_classifiers[i]).__name__)
        ax.text(0.0, 1.09, 'ROC for:',color='black', fontsize=10) 
        ax.text(0.5, 1.09, name[i],fontsize=10)

BinaryClassPlot(clasifier_VAL,name_VAL,X_test,y_test)

The ROC curves are jagged because the test set is small. However, this does not prevent us from assessing the classification quality of the models. The diagnostics showed that the logistic regression models were the least effective at classifying anomalies. According to AUC, the NBC and SVM models also performed poorly.
The classification quality of the other models seems to be very similar. Let us remember that we are still working on a test set drawn at random from a sample of only 10% of the original register.

Diagnostic analysis on complete data

Now we load the entire register, a record of approximately 284,000 transactions, and extract a test set from it. We then run the trained models on this test data. It turned out that the only models that detect anomalies effectively are the boosted ones: the CatBoost Classifier (CBC) and the XGB Classifier. According to the confusion matrix, the CatBoost Classifier detected 75 fraudulent transactions, which accounted for 76% of all fraud recorded in the test set, while the XGB model detected 82. On the other hand, the XGB model flagged as many as 21,826 legitimate transactions as fraudulent, and the CatBoost Classifier flagged 7,256. In fraud detection it is better for the model to flag honest transactions as fraudulent than to let fraudulent transactions pass as legitimate. The XGB model missed 16 fraudulent transactions, while the CBC missed 23.
For this kind of anomaly detection, we should therefore rely on recall rather than on the precision indicator.
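The trade-off between missed fraud and false alarms can also be steered through the decision threshold applied to the output of predict_proba. The sketch below uses made-up probabilities and labels, not the results reported in this article, and the helper `recall_precision` is hypothetical.

```python
import numpy as np

# Made-up fraud probabilities and true labels (not the article's results)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90])

def recall_precision(threshold):
    """Flag a transaction as fraud when its probability reaches the threshold."""
    y_pred = (proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return tp / (tp + fn), tp / (tp + fp)

for threshold in (0.5, 0.25):
    recall, precision = recall_precision(threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

Lowering the threshold catches every fraud case in this toy example at the cost of more false alarms, which mirrors the preference for recall over precision described above.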

# Load the full register again (same file as at the beginning of the article)
kko = pd.read_csv('creditcard.csv')
del kko['Time']

target = 'Class'

Xf = kko.drop(target, axis=1) 
yf = kko[target]  

Xf_train, Xf_test, yf_train, yf_test = train_test_split(Xf,
                                                        yf,
                                                        test_size=0.20,
                                                        random_state=148,
                                                        stratify=yf)

Type_error(clasifier_VAL,name_VAL,Xf_test,yf_test)

BinaryClassPlot(clasifier_VAL,name_VAL,Xf_test,yf_test)