Bodo allows machine learning practitioners to rapidly explore data and build complex pipelines. Using Bodo, developers can seamlessly scale their code from their own laptop to Bodo's platform. In this series, we will present a few hands-on examples.
In this example, we'll use classification techniques from machine learning, with help from Bodo, to detect potential cases of credit card fraud.
A notebook version of this blog post is available here.
In this post, we will run the code using Bodo. This means that the data will be distributed in chunks across processes. Bodo's documentation provides more information about the parallel execution model.
If you want to run the example using Pandas only (without Bodo), simply comment out lines with the Bodo JIT decorator (@bodo.jit) from all the code cells. Bodo is as easy as that.
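As a minimal sketch of switching between the two modes, the snippet below falls back to plain Pandas when Bodo is not installed. The `jit` shim and the `total_amount` function are hypothetical names introduced here for illustration; they are not part of the original notebook, which simply comments out the decorator.

```python
import pandas as pd

# Assumption: if Bodo is unavailable, replace the decorator with a no-op,
# which has the same effect as commenting out @bodo.jit.
try:
    import bodo
    jit = bodo.jit
except ImportError:
    def jit(func):
        return func


@jit
def total_amount(df):
    # Ordinary Pandas code; Bodo compiles it when the decorator is active.
    return df["Amount"].sum()


transactions = pd.DataFrame({"Amount": [1.0, 2.0, 3.5]})
result = total_amount(transactions)
print(result)
```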
Note: Start an IPyParallel cluster (skip if running on Bodo's Platform)
Consult the IPyParallel documentation for more information about setting up a local cluster.
Import all packages
These are the main packages we are going to work with:
- Bodo to parallelize Python code automatically;
- IPyParallel for setting up a local cluster;
- s3fs to download data from AWS S3 buckets;
- Pandas to manipulate tabular data structures;
- Seaborn and Matplotlib for data visualization;
- Scikit-Learn to build and evaluate classification models; and
- Imbalanced-Learn to oversample the dataset.
The dataset contains transactions made by credit cards in September 2013 by European cardholders. We can fetch the data directly from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud); alternatively, we can get it from one of Bodo's AWS S3 open buckets (no AWS credentials required).
In the following cell, we load the dataset directly from AWS using the URI s3://bodo-examples-data/creditcard/creditcard.csv.
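A loading function along these lines might look as follows. This is a sketch, not the notebook's cell: `load_data` is a hypothetical name, and the example reads a tiny made-up CSV from a temporary directory so that it runs offline. Under Bodo, one would presumably decorate the function with `@bodo.jit` and pass the S3 URI given above instead.

```python
import os
import tempfile

import pandas as pd


def load_data(path):
    """Read the transactions CSV into a DataFrame."""
    return pd.read_csv(path)


# Stand-in for the real file: three rows of made-up values.
sample_csv = (
    "Time,V1,Amount,Class\n"
    "0,-1.36,149.62,0\n"
    "1,1.19,2.69,0\n"
    "2,-2.31,1.00,1\n"
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "creditcard_sample.csv")
    with open(csv_path, "w") as f:
        f.write(sample_csv)
    df = load_data(csv_path)

print(df.shape)
```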
By default, the data is distributed to all available engines; that is, the rows of the DataFrame records are split into partitions and sent to each of the cores. This allows embarrassingly parallel algorithms—like computing sums—to scale with the data and resources at hand.
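To make the partitioning idea concrete, here is a single-process simulation in plain Pandas and NumPy. It does not use Bodo; `partition_rows` is a hypothetical helper that mimics splitting rows across four engines, each of which computes a partial sum that is then combined.

```python
import numpy as np
import pandas as pd


def partition_rows(df, n_parts):
    """Split the rows of df into n_parts contiguous chunks."""
    index_chunks = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[idx] for idx in index_chunks]


df = pd.DataFrame({"Amount": np.arange(1.0, 11.0)})  # amounts 1.0 through 10.0

# Each chunk stands in for the rows held by one engine.
parts = partition_rows(df, 4)

# Every "engine" sums its own rows independently...
partial_sums = [part["Amount"].sum() for part in parts]

# ...and the partial results are combined into the global sum.
total = sum(partial_sums)
print(partial_sums, total)
```

Because each partial sum depends only on its own chunk, adding engines reduces the work per engine without changing the result.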
The dataset consists of 30 columns. To preserve user anonymity, the first 28 columns (V1, ..., V28) result from a Principal Component Analysis (PCA). The Kaggle documentation for this dataset provides more details. Our goal here is to classify records as fraudulent or legitimate using a subset of these features.
Before doing anything, it is worth looking at the raw data to understand it and see if we can extract some patterns. First, let's look at the distributions of both fraudulent and normal transactions. The following cell mixes Bodo with normal Pandas and Seaborn/Matplotlib code. This is handy when a particular feature or package is not yet supported by Bodo.
It's interesting to see that the distributions are not identical. This indicates that the fraudulent behaviours deviate from normal/legitimate behaviours. While the median/mode are close for both fraudulent and normal transactions, the fraudulent transactions have 3 modes. These transactions are also generally larger.
- The distribution has significant outliers. The distribution is very skewed; we observe extrema several standard deviations away from most reasonable measures of centrality.
- The dataset is strongly unbalanced—only 0.17% of the records are fraudulent so the signal from fraudulent transactions is effectively swamped by legitimate ones. Mitigating imbalanced data is a known challenge in solving classification problems.
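Both observations can be checked numerically with a few lines of Pandas. The sketch below uses a small made-up table rather than the real dataset, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Made-up transactions: one fraudulent record and one extreme legitimate amount.
df = pd.DataFrame({
    "Amount": [5.0, 12.5, 3.2, 8.1, 20.0, 7.4, 2500.0, 15.3, 9.9, 310.0],
    "Class": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Share of fraudulent records (the mean of a 0/1 label column).
fraud_share = df["Class"].mean()

# Per-class summary statistics of the transaction amount.
amount_summary = df.groupby("Class")["Amount"].describe()

# Positive skewness indicates a long right tail of large amounts.
amount_skew = df["Amount"].skew()

print(f"fraud share: {fraud_share:.2%}")
print(amount_summary)
print(f"skewness: {amount_skew:.2f}")
```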
Strategies to mitigate imbalanced data include sub-sampling, which helps prevent overfitting to the majority class, and choosing an appropriate evaluation metric.
The Amount column is the only one that is not scaled, which is problematic for our analysis. We need to scale this column so that its larger magnitude does not bias the model. Scikit-Learn's StandardScaler class, from the preprocessing submodule, provides a way to apply this transformation across the partitioned dataset.
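A minimal sketch of the scaling step, using plain Scikit-Learn on a made-up four-row table (the values are placeholders, and the Bodo decorator is omitted):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "V1": [-1.2, 0.3, 0.8, -0.1],
    "Amount": [1.0, 10.0, 100.0, 1000.0],
})

# StandardScaler expects a 2-D input, hence the double brackets.
scaler = StandardScaler()
df["Amount"] = scaler.fit_transform(df[["Amount"]]).ravel()

# After scaling, the column has zero mean and unit (population) variance.
print(df["Amount"].mean(), df["Amount"].std(ddof=0))
```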
Sub-sampling the dataset to get a balanced dataset—i.e., one with 50% fraudulent transactions—will prevent the model from overfitting. Otherwise, the model might learn to classify all records as non-fraudulent—that is, producing a constant-valued classifier—and it would be correct over 99% of the time.
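One way to build such a balanced sub-sample in Pandas is sketched below. The helper name `balanced_subsample`, the seed, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def balanced_subsample(df, label="Class", seed=0):
    """Keep every fraudulent row plus an equal number of random normal rows."""
    fraud = df[df[label] == 1]
    normal = df[df[label] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so that the two classes are interleaved.
    combined = pd.concat([fraud, normal]).sample(frac=1.0, random_state=seed)
    return combined.reset_index(drop=True)


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Class": [1] * 20 + [0] * 980,  # 2% fraud in this toy table
})

sub = balanced_subsample(df)
print(sub["Class"].value_counts())
```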
Correlation matrices are often used to identify important features in matrices representing data. In this case, we also want to distinguish features that can be used to classify fraudulent cases. The parameters V1, ..., V28 result from a Principal Component Analysis that maximizes variance in an abstract sense. We don't know how these numerical features relate to the original features nor how they relate to the remaining features that are not encoded.
From this analysis, we observe that some features don't correlate meaningfully with whether a record is fraudulent. A correlation of zero indicates no linear relationship, a correlation of +1 indicates that the feature is totally correlated with the output, and a correlation of -1 indicates total anti-correlation. For instance, the correlation of feature V23 with the fraud class is -0.0033, so it is not meaningfully correlated at all. Removing features like this can help the model to better discover meaningful patterns.
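The correlation of each feature with the label can be read off the `Class` column of the correlation matrix. The sketch below uses two invented features, `V_signal` and `V_noise`, constructed so that one tracks the label and the other is unrelated to it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
labels = np.repeat([0, 1], n // 2)

df = pd.DataFrame({
    # Shifted by the label, so strongly correlated with it.
    "V_signal": labels * 2.0 + rng.normal(scale=0.5, size=n),
    # Alternating +1/-1 within each class, so uncorrelated by construction.
    "V_noise": np.tile([1.0, -1.0], n // 2),
    "Class": labels,
})

# Pearson correlation of every feature with the label.
corr_with_class = df.corr()["Class"].drop("Class")
print(corr_with_class)
```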
Classifying Fraudulent Records
We can reasonably drop all features with a correlation value below 0.1 (in absolute value) from the sub-sample. Once the weakly-correlated columns are dropped, we split the data into training & testing sets. Given that there are not many rows in the sub-sample, we'll use an 80%/20% ratio (as opposed to a more typical 70%/30% split when more data is available).
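The drop-and-split step could be sketched as follows, on a toy sub-sample with invented columns `V_strong` and `V_weak`. Stratifying the split and fixing `random_state` are assumptions added here for reproducibility; the text does not say whether the notebook does either.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n = 200
labels = np.repeat([0, 1], n // 2)
sub = pd.DataFrame({
    "V_strong": labels + np.tile([0.1, -0.1], n // 2),  # tracks the label
    "V_weak": np.tile([1.0, -1.0], n // 2),             # unrelated to the label
    "Class": labels,
})

# Drop features whose absolute correlation with the label is below 0.1.
corr = sub.corr()["Class"].drop("Class")
weak_columns = corr[corr.abs() < 0.1].index.tolist()
reduced = sub.drop(columns=weak_columns)

# 80%/20% train/test split.
X = reduced.drop(columns="Class")
y = reduced["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(weak_columns, len(X_train), len(X_test))
```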
Train a Random Forest Classifier
Next, having selected suitable features and having partitioned the rebalanced data into training & testing sets, we can train a classifier and evaluate the model using the testing dataset.
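A bare-bones version of the training step, without the Bodo decorator and on synthetic, well-separated data, might look like this. The function name `train_and_predict` and the hyperparameters are illustrative choices, not those of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_and_predict(X_train, y_train, X_test):
    """Fit a random forest on the training data and predict the test labels."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)


rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
# Class 1 is shifted by 3 in every feature, so the classes barely overlap.
X = rng.normal(size=(n, 3)) + y[:, None] * 3.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
y_pred = train_and_predict(X_train, y_train, X_test)
accuracy = (y_pred == y_test).mean()
print(accuracy)
```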
We'll define a helper function to gather a distributed dataframe from all engines onto the root engine. There are a number of distributed datasets we will apply this to below.
We'll apply this function first to the vector of targets from the testing data set—that is, y_test—that was previously distributed to all the engines. This permits us to apply standard Scikit-Learn reporting functions like roc_auc_score from sklearn.metrics to summarise the model's performance.
The function roc_auc_score computes the area under the Receiver Operating Characteristic (ROC) curve from prediction scores. As we remarked earlier, our choice of metric is very important for assessing the classifier. In this instance, recall is what matters most. A perfect recall means that the model yields no false negatives, so it has identified all fraudulent records. But there is always a tradeoff with precision, which is determined by the false positives. Any false positive might prevent a legitimate transaction from going through, which would not be good for clients.
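The tradeoff can be seen on a small hand-made example. Here the hypothetical classifier catches every fraudulent record (perfect recall) at the cost of two false alarms, which lowers the precision. Hard 0/1 predictions are passed to `roc_auc_score` for simplicity; in practice, predicted probabilities are usually a better input.

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Made-up labels: four legitimate (0) and four fraudulent (1) records.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
# All frauds are caught, but two legitimate records are flagged as well.
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4 / 4
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4 / 6
auc = roc_auc_score(y_true, y_pred)

print(recall, precision, auc)
```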
Another Look at Metrics
Confusion matrices are often used to assess the quality of classifiers. We've imported the class ConfusionMatrixDisplay from sklearn.metrics for this purpose. As the documentation recommends, we use the classmethod from_predictions to instantiate the class and produce a useful visualization of counts of correctly & incorrectly classified records.
From these results, the model appears to be able to correctly classify records with a relatively low false positive rate.
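The counts behind such a display can also be obtained directly with `confusion_matrix`. The labels below are made up for illustration and are unrelated to the results discussed above.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: five legitimate (0) and three fraudulent (1) records.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# ConfusionMatrixDisplay.from_predictions(y_true, y_pred) plots the same counts.
```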
Since the data is very limited, another approach is to oversample the dataset. Again, due to the unbalanced labels, random upsampling would not work well. Instead, we apply the Synthetic Minority Over-sampling Technique (SMOTE) as implemented in Imbalanced-Learn. We oversample the training set only, to avoid contaminating the testing set. The training data must first be gathered onto the root engine because SMOTE is not currently supported by Bodo.
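To illustrate what SMOTE does, the sketch below implements the core interpolation idea with NumPy and Scikit-Learn's `NearestNeighbors`. It is a simplified stand-in, not Imbalanced-Learn's implementation; with that library the equivalent step would be `SMOTE().fit_resample(X, y)`. The function name `smote_like_oversample` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_minority) - 1)
    # Ask for k + 1 neighbours: column 0 is typically the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)
    # Pick a random base point and one of its k nearest minority neighbours.
    base = rng.integers(0, len(X_minority), size=n_new)
    neighbor_col = rng.integers(1, k + 1, size=n_new)
    chosen = neighbor_idx[base, neighbor_col]
    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
y_train = np.array([0] * 40 + [1] * 10)  # 40 legitimate, 10 fraudulent

# Generate enough synthetic minority rows to balance the training set.
X_minority = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_synthetic = smote_like_oversample(X_minority, n_new)

X_resampled = np.vstack([X_train, X_synthetic])
y_resampled = np.concatenate([y_train, np.ones(n_new, dtype=int)])
print(X_resampled.shape, y_resampled.mean())
```

Only the training rows are resampled here; the testing set is left untouched, as the text requires.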
We now need to distribute the SMOTE-upsampled DataFrames x_train & y_train from the root engine to all other engines using the bodo.rebalance function.
This allows us to use the Bodo-compiled function predicion_model to construct a classifier using the SMOTE-upsampled training dataset and to generate appropriate predictions from the testing data.
As above, we can gather the test labels back to the root node and produce a confusion matrix.
The entries in this confusion matrix are smaller in a relative sense because we used more data to build this model. As always, these metrics suggest how a model will generalize when faced with new data.
Note: Don't forget to stop your cluster when you are finished.