Adding AWS-powered fraud detection to Sugar Sell - Part 2

When working on fraud detection, as with many other machine learning projects, it’s crucial to understand that your insights and predictions are only as good as the data you use. Bad data can negatively impact both your results and your performance.

Data needs to be refined and cleaned to match the purpose. In this post, we will be constructing datasets for account signup and e-commerce transaction fraud detection. Each dataset will be a CSV file where each row corresponds to a record in the dataset and each column represents an attribute or variable.

We need to create datasets that are ready to be used by AWS's sophisticated machine learning algorithms to detect fraud. So let's start with the basics: gather your data and clean it.

Gathering Historical Data

You will need account and transaction data. Gathering this data can be cumbersome for many organizations because it can be stored in various locations. You will potentially need to work across multiple departments such as Finance, IT, Sales, and whichever other teams own this data.

You’ll need fraud data to be included in your datasets. Remember, in order to detect future fraud, you need past, already-detected fraud data to train your models.

If you don’t have that data captured, or don’t have enough of it, then you should focus your efforts on gathering enough fraud data. For example, your organization could compile a list of accounts known to have committed fraud and of fraudulent credit card transactions, and use that as input for your models.

How much data is needed, you might ask? For the AWS Fraud Detector service, you need a minimum of 10k records with at least 400 examples of fraud. Keep in mind that a good dataset is a large dataset: more data will give you better predictions, but you can start small and retrain your models later.
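
If you want a quick sanity check on whether what you’ve gathered clears that bar, a minimal sketch like the one below works. The file name historical_events.csv is a hypothetical placeholder, and it assumes your export already has a fraud flag column named EVENT_LABEL with "fraud"/"legit" values, as described later in this post.

import pandas as pd

# hypothetical export of your historical events, with a fraud flag column
df = pd.read_csv('./historical_events.csv')

total = len(df)
fraud = (df['EVENT_LABEL'] == 'fraud').sum()

print(f'{total} records, {fraud} labeled as fraud')
print('Meets the Fraud Detector minimum?', total >= 10_000 and fraud >= 400)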

You need a large dataset but it also needs to be relevant. For example, if you have a dataset that includes credit card transactions from 5 or 10 years ago then this might not be useful in detecting credit card fraud today. People’s buying habits, their credit card numbers, and their home addresses all change over time. Make sure your dataset reflects current organizational activity.

And, of course, you could always choose to implement only account or transaction fraud detection if you can’t collect the necessary data.

Using Public Datasets

If you don’t have enough transactions or accounts that include fraudulent data, there are industry-specific public datasets published by data scientists, leaders, companies, students, and other data enthusiasts for you to choose from. Some of them also include their own studies and results based on the data they’ve collected and published.

Kaggle.com is a popular platform for data scientists that hosts public datasets for many industries. All you need to do is create a free account, and you should be able to download any dataset for free.

If you are using Kaggle, you need to identify a dataset in their catalog that works for your organization’s context. Some datasets contain encoded data, a common practice used to hide and scramble real data before making it publicly available. For example, public datasets for credit card transactions are often PCA encoded. If that’s the case, remember that you will need to encode the data you send to AWS for predictions in the same way.
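
To make that concrete, here is a minimal sketch of PCA encoding using scikit-learn. The file name and the feature columns are hypothetical placeholders; the key point is that the same fitted scaler and PCA objects must be reused on any data you later send for predictions.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# hypothetical raw numeric features from your own transaction export
raw = pd.read_csv('./raw_transactions.csv')
features = raw[['amt', 'zip', 'unix_time']]

# scale the features, then project them onto principal components
scaler = StandardScaler()
pca = PCA(n_components=3)
encoded = pca.fit_transform(scaler.fit_transform(features))

# the encoded columns (V1, V2, V3) are what would go into your training CSV
encoded_df = pd.DataFrame(encoded, columns=[f'V{i + 1}' for i in range(encoded.shape[1])])

# at prediction time, reuse the SAME fitted objects:
# new_encoded = pca.transform(scaler.transform(new_features))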

Remember, nothing compares to your own authentic data. So use public datasets as a learning tool until you have your own data ready to go.

Cleaning your historical data

At this point, you should have a dataset in hand. You now have to clean it. You may have duplicates or information gaps if you’ve collected large amounts of data from multiple sources. 

Data cleaning is a process used by data scientists to identify and remove incorrect, incomplete, corrupted, missing or inaccurate parts of the data from a dataset. This process helps improve the quality of your data for better results.

There are different techniques that you can use to clean up your data but everything starts with a quality plan. Make sure you know your data and the outcome you expect from it. 

As a best practice, you will need to do at least the following (this is not an exhaustive list; a minimal pandas sketch follows the list):

  • Remove duplicates
  • Remove data irrelevant for a fraud scenario
    • For example, profession would have little weight in a web based credit card transaction.
  • Trim whitespace and eliminate empty data/columns
  • Format decimal fields with proper precision (e.g., show USD values with 2 decimal places)
  • Use consistent date/time formatting, including the UTC time zone
  • Normalize values
    • For example for a CreditCardType field, you may have values like Visa, VISA, visa, or Visa Credit Card. Different names but they mean the same thing!
  • Identify and remove outliers
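
Here is a minimal pandas sketch of a few of these steps. The file name and the CreditCardType/amt column names are hypothetical placeholders; adapt them to your own dataset.

import pandas as pd

df = pd.read_csv('./historical_events.csv')  # hypothetical input file

# remove exact duplicates
df = df.drop_duplicates()

# trim whitespace on string columns and drop fully empty columns
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())
df = df.dropna(axis=1, how='all')

# normalize values: map variants like 'Visa Credit Card' and 'VISA' to one spelling
df['CreditCardType'] = df['CreditCardType'].str.lower().replace({'visa credit card': 'visa'})

# show USD amounts with 2 decimal places
df['amt'] = df['amt'].round(2)

# consistent date/time formatting in UTC
df['EVENT_TIMESTAMP'] = pd.to_datetime(df['EVENT_TIMESTAMP'], utc=True)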

There is plenty of content on the web about data cleaning, so do yourself a favor and check it out so you can apply those techniques to your datasets. You will benefit enormously!

AWS Fraud Detector requirements

The AWS Fraud Detector service requires that each dataset include these two columns:

  • EVENT_TIMESTAMP (Timestamp): Indicates the date and time when the event happened, regardless of whether it was fraudulent or not.
  • EVENT_LABEL (Is Fraud): Indicates whether this record was a fraudulent event. It can be a string ("fraud"/"legit") or an integer (0 for legit and 1 for fraud).

As of the time of writing, AWS Fraud Detector can only process CSV files. The full list of dataset requirements is listed here; if those requirements are not met, AWS will not be able to train your model.
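
Since a failed import is frustrating, a quick validation pass before uploading is worthwhile. This is just a sketch, assuming the fraudTrain_2020.csv file produced later in this post; adjust the file name to whatever you end up generating.

import pandas as pd

df = pd.read_csv('./fraudTrain_2020.csv')  # or your own cleaned CSV

# the two mandatory columns must be present
missing = {'EVENT_TIMESTAMP', 'EVENT_LABEL'} - set(df.columns)
assert not missing, f'Missing required columns: {missing}'

# timestamps must be parseable as date/times
pd.to_datetime(df['EVENT_TIMESTAMP'])

# labels should use one consistent pair of values, e.g. fraud/legit or 1/0
print(df['EVENT_LABEL'].value_counts())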

Dataset Examples

For the purposes of this blog series, we are going to build our own example datasets so you can follow along.

Accounts Dataset

The smart guys at AWS have provided an interesting account dataset that is large enough to train and test our account sign-up model. Woohoo!

This synthetic dataset contains two distinct files you could use, but we will use registration_data_20k_minimum.csv because it has the basic structure required to create and train our model.

Our accounts dataset will have the following columns:

  • IP address
  • Email address
  • EVENT_TIMESTAMP - Timestamp as required by AWS
  • EVENT_LABEL - Fraud Flag as required by AWS

Here is an example of how you’d output the CSV. The order of the fields doesn’t matter, and your own variables can use whatever header names you like; only the two AWS-mandated columns must keep their exact names.

ip_address,email_address,EVENT_TIMESTAMP,EVENT_LABEL
46.41.252.160,fake_acostasusan@example.org,10/8/2019 20:44,legit
152.58.247.12,fake_christopheryoung@example.com,5/23/2020 19:44,legit
12.252.206.222,fake_jeffrey09@example.org,4/24/2020 18:26,fraud
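
If you want to reproduce that file yourself, a minimal sketch like this would do it, assuming the AWS sample file uses the same column names shown in the header above.

import pandas as pd

# load the AWS sample registration data (column names assumed to match the header above)
accounts = pd.read_csv('./registration_data_20k_minimum.csv')

# keep only the four columns needed for the sign-up model
accounts = accounts[['ip_address', 'email_address', 'EVENT_TIMESTAMP', 'EVENT_LABEL']]

# write the training CSV without the pandas index column
accounts.to_csv('./accounts_dataset.csv', index=False)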

Transactions Dataset

This dataset should contain your credit card transactions only; any other transactional data should be excluded. These types of datasets can be very large, which can impact storage and processing time.

For this synthetic dataset, we’ll use the fraudTrain.csv public dataset from Kaggle. It contains so much credit card data that you will need to trim it down for the sake of this exercise. The Python script below keeps only transactions from the year 2020 (one year’s worth of data); after running it, fraudTrain.csv shrinks from 350MB to 99.7MB.

import pandas as pd
# parse_dates=[1] assumes trans_date_trans_time is the column at index 1 (0-based)
df = pd.read_csv('./fraudTrain.csv', dtype={'cc_num': 'str'}, parse_dates=[1])
 
# choose start/end dates as filters (here is 1 year worth of data)
start = pd.to_datetime("2020-01-01")
end = pd.to_datetime("2020-12-31")
# then filter out anything not inside the specified date ranges:
df = df[(start <= df.trans_date_trans_time) & (df.trans_date_trans_time <= end)]
 
# keep only the first digits of the card number (the BIN) instead of the full credit card number
df['cc_num'] = df['cc_num'].str[:5]
# replace 0 with legit and 1 with fraud for our labels.
df['is_fraud'] = df['is_fraud'].replace({0:'legit', 1:'fraud'})
 
# rename columns to match mandatory field name for AWS service
df = df.rename(columns={"trans_date_trans_time": "EVENT_TIMESTAMP", "is_fraud": "EVENT_LABEL"})
 
# drop the 'Unnamed: 0' index column that came along when the original CSV was read
df.drop('Unnamed: 0', axis=1, inplace=True)
 
df.to_csv('./fraudTrain_2020.csv')

We will use the following columns that are relevant to our e-commerce scenario:

  • Credit Card Bank Identification Number (BIN)
  • Billing Zip Code
  • Category
  • Amount
  • Transaction Code
  • EVENT_TIMESTAMP - Timestamp as required by AWS
  • EVENT_LABEL - Fraud Flag as required by AWS

Here is an example of how you’d output the CSV. Again, the order of the fields doesn’t matter, and your own variables can use whatever header names you like, as long as the two AWS-mandated columns keep their exact names.

,EVENT_TIMESTAMP,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,EVENT_LABEL
924850,2020-01-01 00:00:57,22862,fraud_Kunde-Sanford,misc_net,9.13,Morgan,Murray,F,2788 Brittney Island,Blairstown,MO,64726,38.5319,-93.9221,467,Agricultural consultant,1950-05-27,59e089f4b553f5d7ce7a37235dd40a98,1356998457,37.928457,-94.266468,legit
924851,2020-01-01 00:01:28,45385,"fraud_Tromp, Kerluke and Glover",grocery_net,68.31,Jerry,Kelly,M,3539 Mckenzie Stream,Fairview,NJ,7022,40.817,-74.0,13835,"Programmer, multimedia",1967-05-28,ab426adf393f1d668b70d2a7d5a14093,1356998488,40.737538,-73.048997,legit

Further Reading

If you’d like to expand your knowledge and better understand how data science works, check out this Jupyter Notebook; it gives you an impressively detailed analysis of the transaction dataset we used in this blog series.

Here is a quite interesting article on Credit Card Fraud Detection that covers different approaches and machine learning models.

Principal Component Analysis (PCA) is an interesting alternative for those who don’t possess credit card transaction data and would like to use real, anonymized data. Here you can find a very good explanation of PCA and how to encode your data with it.

References

https://en.wikipedia.org/wiki/Data_set

https://www.tableau.com/learn/articles/what-is-data-cleaning

https://en.wikipedia.org/wiki/Data_cleansing

https://blog.postman.com/looping-through-a-data-file-in-the-postman-collection-runner/

https://www.nature.com/articles/d41586-018-07196-1