Fraud Detection

Financial fraud represents one of the most significant challenges facing modern enterprises, resulting in global losses exceeding $40 billion annually according to industry estimates. Traditional rule-based fraud detection systems, while providing interpretability, suffer from high false positive rates, inability to adapt to evolving fraud patterns, and significant manual review overhead. Machine learning offers a paradigm shift: systems that learn from data, adapt to new fraud techniques, and scale to process millions of transactions in real time.

The Challenge: Fraud detection is fundamentally a needle-in-a-haystack problem characterized by:

  • Extreme Class Imbalance: Fraudulent transactions represent 0.001%-1% of total volume, creating severe data imbalance
  • Adversarial Environment: Fraudsters continuously evolve tactics to evade detection systems
  • Asymmetric Costs: Missing fraud (false negative) costs far more than false alarms (false positive)
  • Real-time Requirements: Detection must occur within milliseconds during transaction authorization
  • Explainability Requirements: Regulatory compliance and analyst trust demand interpretable decisions

My Approach: Throughout my fraud detection work, I've developed end-to-end solutions encompassing data engineering, feature engineering, model development, deployment, and continuous monitoring. This portfolio showcases key methodologies and innovations across the fraud detection lifecycle.

Technical Expertise: Cost-sensitive learning, ensemble methods (XGBoost, Random Forest), class imbalance handling (SMOTE, undersampling), explainable AI (SHAP values, LIME), real-time scoring systems, document analysis (OCR, NLP), web scraping, anomaly detection, graph analytics for fraud networks.


Machine Learning Models for Fraud Detection

The Class Imbalance Challenge

The Core Problem: According to the Observatory for Security of Payment Means (OSMP), fraudulent bank transactions represent merely 0.001% of transaction counts, yet account for 1% of transaction values. This extreme imbalance creates several challenges:

  • Model Bias: Standard ML algorithms optimize for overall accuracy, leading to models that simply predict "not fraud" for all cases, achieving 99.999% accuracy while catching zero fraud
  • Training Difficulty: Models struggle to learn fraud patterns from limited positive examples
  • Evaluation Complexity: Accuracy is meaningless at this imbalance; precision-recall curves, PR-AUC, and cost-based metrics are needed to measure real detection performance (see the sketch after this list)
  • Sampling Trade-offs: Undersampling loses information; oversampling risks overfitting to duplicated patterns
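
To make the evaluation point concrete, here is a minimal sketch on synthetic data (an illustrative 0.1% prevalence) showing how accuracy flatters a do-nothing model while a precision-recall metric exposes it:

```python
# Minimal sketch: accuracy flatters a do-nothing model under extreme imbalance.
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)   # ~0.1% fraud, illustrative
y_pred = np.zeros_like(y_true)                       # always predict "not fraud"
scores = np.full(len(y_true), 0.5)                   # no discrimination at all

print(accuracy_score(y_true, y_pred))            # ~0.999: looks excellent
print(average_precision_score(y_true, scores))   # ~0.001: exposes the model
```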

Cost-Sensitive Learning Paradigm

Philosophy Shift: Instead of treating all errors equally, cost-sensitive methods assign different costs to different types of mistakes, aligning the model's optimization objective with business reality.

Cost Matrix Example (Bank Transactions):

  • Actual legitimate, predicted legitimate: $0 (correct)
  • Actual legitimate, predicted fraud: $10 (false positive: customer friction, investigation time)
  • Actual fraud, predicted legitimate: $500 (false negative: fraud loss, reimbursement, reputation damage)
  • Actual fraud, predicted fraud: $0 (correct catch)

Implementation: Models are trained to minimize the total cost across all predictions rather than minimizing simple misclassification count. This naturally biases the model toward being more aggressive in flagging potential fraud, accepting higher false positive rates to reduce the far more costly false negatives.
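
The optimal decision threshold falls directly out of this cost matrix: flag whenever the expected cost of approving, p × $500, exceeds the expected cost of flagging, (1 − p) × $10. A quick sketch with the example costs above:

```python
# Decision threshold implied by the example cost matrix above.
COST_FN, COST_FP = 500.0, 10.0
# Flag when p * COST_FN > (1 - p) * COST_FP, i.e. p > COST_FP / (COST_FP + COST_FN)
threshold = COST_FP / (COST_FP + COST_FN)
print(f"Flag transactions with P(fraud) > {threshold:.4f}")   # ~0.0196
```

In other words, with these costs it is rational to flag any transaction carrying even a ~2% fraud probability.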

XGBoost for Fraud Detection

Why Tree-Based Ensembles?

  • Non-linear Pattern Recognition: Fraud patterns often involve complex interactions between features (e.g., "large transaction from new device in unusual location" is suspicious, but each factor alone may not be)
  • Robustness to Feature Scales: No normalization required, handles mixed data types naturally
  • Feature Importance: Provides built-in interpretability through feature contribution scores
  • Handling Missing Data: Learns optimal imputation strategies during training
  • Categorical Features: Several implementations (LightGBM, CatBoost, recent XGBoost releases) can split on categorical variables natively, avoiding one-hot encoding

Decision Tree Foundations

A decision tree recursively partitions the feature space by asking binary questions ("Is transaction amount > $1000?", "Is country = Nigeria?"), creating a hierarchical sequence of rules. Each leaf node represents a decision (fraud/legitimate) with associated probability.

Ensemble Learning: Boosting vs. Bagging

  • Bagging (Bootstrap Aggregating): Trains multiple independent trees on random subsets of data, averages predictions. Reduces variance through diversity. Example: Random Forest.
  • Boosting: Trains trees sequentially, with each new tree focusing on examples the previous trees got wrong. Reduces bias by iteratively correcting errors. Example: XGBoost, AdaBoost.

XGBoost Advantages

  • Gradient Boosting Framework: Fits each new tree to the gradient of the loss function, performing gradient descent in function space
  • Regularization: Built-in L1/L2 regularization prevents overfitting despite high model complexity
  • Speed Optimizations: Parallel tree construction, cache-aware access patterns, out-of-core computation for large datasets
  • Cost-Sensitive Training: Native support for custom objective functions incorporating cost matrices
  • Handling Imbalance: The scale_pos_weight parameter re-weights the positive class (typically set to the negative-to-positive ratio) to counter class imbalance

Implementation for Fraud Detection

  • Custom Objective Function: Implemented weighted logistic loss incorporating asymmetric costs ($500 for false negative vs. $10 for false positive); a training sketch follows this list
  • Hyperparameter Tuning:
    • Learning rate: 0.01-0.1 (controls contribution of each tree)
    • Max depth: 4-8 (limits tree complexity, prevents overfitting)
    • Min child weight: Adjusted to prevent splits on rare fraud patterns that may not generalize
    • Subsample ratio: 0.7-0.9 (fraction of training data for each tree, increases robustness)
  • Early Stopping: Monitors validation set performance to prevent overfitting, stops training when validation performance plateaus
  • Incremental Learning: Supports model updates with new fraud data without full retraining
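
The sketch below ties these settings together on synthetic data; it is illustrative only (the production pipeline, features, and exact hyperparameters differ):

```python
# Sketch: cost-sensitive XGBoost on synthetic imbalanced data (illustrative).
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

COST_FN, COST_FP = 500.0, 10.0                       # costs from the matrix above

X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.995], random_state=0)   # ~0.5% positives
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y,
                                            test_size=0.2, random_state=0)

# Weight each example by the cost of misclassifying it
w = np.where(y_tr == 1, COST_FN, COST_FP)

model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,         # within the 0.01-0.1 range above
    max_depth=6,                # limit tree complexity
    min_child_weight=10,        # avoid splits on rare, non-generalizing patterns
    subsample=0.8,              # row subsampling for robustness
    eval_metric="aucpr",        # PR-AUC suits heavy class imbalance
    early_stopping_rounds=50,   # stop when validation performance plateaus
)
model.fit(X_tr, y_tr, sample_weight=w, eval_set=[(X_val, y_val)], verbose=False)
```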

Performance Metrics

Achieved industry-leading performance across multiple fraud domains:

  • Card Payment Fraud: 92% recall at 1% false positive rate, catching 92 of 100 frauds while flagging only 1 of 100 legitimate transactions
  • Insurance Claim Fraud: 85% precision at 70% recall threshold, optimized for investigator productivity
  • Account Takeover: AUC-ROC of 0.88, enabling risk-based authentication (e.g., step-up verification for high-risk logins)

Model Interpretability & Explainable AI

Why Explainability Matters: In fraud detection, black-box predictions are insufficient. Analysts need to understand why a transaction was flagged to conduct investigations efficiently. Regulators demand transparency in automated decision systems. Customer service representatives require explanations when declining transactions.

SHAP (SHapley Additive exPlanations) Values:

  • Mathematical Foundation: Based on game theory's Shapley values, provides a theoretically principled approach to feature attribution
  • Model-Agnostic: Works with any ML model, though optimized implementations exist for tree-based models
  • Interpretation: Each feature receives a SHAP value representing its contribution to moving the prediction from the baseline (average prediction) toward the actual prediction
  • Additivity: Sum of all SHAP values equals the difference between prediction and baseline
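
In code, this looks roughly as follows with the shap library, continuing from the model trained above (feature names here are generic placeholders):

```python
# Sketch: exact TreeSHAP attributions for the tree ensemble trained above.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)      # exact Shapley values for tree ensembles
sv = explainer.shap_values(X_val)          # one attribution per feature per row

i = 0                                      # explain a single transaction
# Additivity: baseline plus attributions reconstructs the raw (log-odds) output
print(explainer.expected_value + sv[i].sum())

top = np.argsort(-np.abs(sv[i]))[:5]       # top 5 contributing features
for j in top:
    print(f"feature_{j}: {sv[i, j]:+.3f}")
```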

Practical Applications:

  • Case-Level Explanations: "This transaction was flagged because: (1) amount $3,500 is 10x user's average (+0.3 toward fraud), (2) location is Nigeria, never visited before (+0.4 toward fraud), (3) merchant category 'electronics' used only once in past year (+0.1 toward fraud)"
  • Global Feature Importance: Identify which features are most valuable across all predictions, informing data collection priorities
  • Model Debugging: Detect when the model relies on spurious correlations (e.g., timestamps that leak labels due to a data collection artifact)
  • Regulatory Compliance: Document decision rationale for audit trails

Tree-Based Model Advantages:

  • Natural Interpretability: Decision paths through trees are inherently explainable ("if amount > 1000 AND country = X AND...")
  • Efficient SHAP Computation: TreeSHAP algorithm provides exact Shapley values for tree ensembles in polynomial time
  • Visual Explanations: Decision path diagrams show exact route through tree leading to prediction
  • Rule Extraction: Can extract human-readable if-then rules from important decision paths for integration into rule engines

Analyst Workflow Integration: Built custom dashboard displaying:

  • Fraud risk score with confidence intervals
  • Top 5 contributing factors with SHAP values
  • Historical behavior comparison (this transaction vs. user's profile)
  • Similar historical cases (both confirmed fraud and false positives)
  • Recommended investigation actions based on top factors

Impact: Reduced average investigation time from 8 minutes to 3 minutes per case, decreased false positive disposition time, improved analyst confidence in model predictions leading to 94% automation for low-risk decisions.


Data Engineering - The Foundation of Effective Fraud Detection

The Data-Centric Reality: Model performance is fundamentally constrained by data quality and feature richness. The aphorism "garbage in, garbage out" is particularly true in fraud detection where subtle patterns in high-dimensional feature spaces distinguish fraud from legitimate behavior. My work spans the full data lifecycle from acquisition through feature engineering.

Open Data Integration

Strategic Value: Open data sources provide contextual information about geographic, economic, and demographic patterns; this context helps identify anomalies but would be impractical for any single company to collect itself.

Data Sources & Applications:

  • French Government Data (data.gouv.fr):
    • INSEE economic statistics: Compare transaction patterns against regional income levels
    • Business registry (SIRENE): Validate merchant legitimacy, detect shell companies
    • Geographic data: Distance calculations, unusual location patterns
    • Postal code demographics: Age distribution, wealth indicators for profiling
  • Financial Sanctions Lists: OFAC, UN, EU sanction lists for PEP (Politically Exposed Persons) screening and AML compliance
  • Fraud Intelligence Databases: Known compromised card numbers, merchant fraud reports, device fingerprint blacklists
  • IP Geolocation Databases: MaxMind GeoIP for detecting proxy/VPN usage, geographic inconsistencies

Feature Engineering Examples:

  • Economic Anomaly Score: (Transaction Amount / Regional Median Income) - identifies transactions inconsistent with local economic conditions
  • Merchant Risk Score: Combining business age, industry, geographic location, historical fraud rates in category
  • Cross-Border Risk: Flag transactions where card country, IP country, shipping country, and merchant country all differ
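
A hypothetical pandas sketch of these features; the column names (postal_code, card_country, and so on) are assumptions for illustration:

```python
# Hypothetical sketch of the open-data features above; column names are assumed.
import pandas as pd

def add_open_data_features(tx: pd.DataFrame, income: pd.DataFrame) -> pd.DataFrame:
    # income: one row per postal_code with a regional_median_income column
    tx = tx.merge(income, on="postal_code", how="left")

    # Economic anomaly: transaction amount relative to local median income
    tx["economic_anomaly_score"] = tx["amount"] / tx["regional_median_income"]

    # Cross-border risk: card, IP, shipping, and merchant countries all differ
    countries = tx[["card_country", "ip_country",
                    "shipping_country", "merchant_country"]]
    tx["cross_border_risk"] = (countries.nunique(axis=1) == 4).astype(int)
    return tx
```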

Implementation: Built automated ETL pipelines to refresh open data weekly, maintain versioned feature stores, and compute features in real-time during transaction scoring.

Document Intelligence - OCR and NLP

Context: Many fraud schemes involve document manipulation—altered invoices, forged contracts, fake insurance claims. Detecting these requires extracting and analyzing document content programmatically.

Technical Stack:

  • OCR (Optical Character Recognition): Tesseract, Google Cloud Vision API, AWS Textract for text extraction from images and PDFs
  • Document Structure Understanding: Layout analysis to identify document type (invoice, contract, passport), extract key-value pairs
  • Natural Language Processing: Entity extraction (amounts, dates, parties), sentiment analysis, consistency checking
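
As a minimal example of the extraction step, a pytesseract sketch (the Tesseract binary must be installed locally; the cloud APIs above handled harder documents):

```python
# Minimal OCR sketch with pytesseract; requires a local Tesseract installation.
from PIL import Image
import pytesseract

def extract_text(image_path: str) -> str:
    # Returns the recognized text for the whole page image
    return pytesseract.image_to_string(Image.open(image_path), lang="eng")
```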

Use Cases:

  • SWIFT Message Analysis (Banking):
    • Extract transaction details from unstructured SWIFT payment messages
    • Detect anomalies: mismatches between message fields, unusual language patterns, known fraud keywords
    • Example: Beneficiary name contains common fraud indicators ("test", "temp", keyboard walks like "asdf")
    • Cross-reference against transaction database to detect duplicate payment attempts
  • Insurance Claim Documents:
    • Extract claim amounts, dates, descriptions from police reports, medical bills, repair estimates
    • Consistency checking: Do dates align across documents? Do amounts match across submission forms?
    • Image forensics: Detect edited/photoshopped receipts using metadata analysis, compression artifacts
    • Semantic similarity: Compare claim descriptions against database of known fraud narratives
  • Know Your Customer (KYC) Documents:
    • Verify identity document authenticity (passports, driver licenses, utility bills)
    • Extract and validate: Name, address, date of birth, document numbers
    • Cross-check against watchlists and sanctions databases
    • Detect template-based forgeries using document structure analysis

Machine Learning Integration:

  • Named Entity Recognition (NER): Custom spaCy models trained to extract domain-specific entities (policy numbers, claim types, injury descriptions)
  • Document Classification: CNN-based image classifier to route documents to appropriate processing pipeline
  • Anomaly Detection: Isolation Forest to identify unusual linguistic patterns in claim descriptions
  • Network Analysis: Build graph of entities (people, addresses, bank accounts) appearing across multiple documents to detect fraud rings
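
To give a flavor of the NER step, a sketch with a stock spaCy pipeline (the production models were custom-trained on domain entities, and the sample sentence is invented):

```python
# Sketch: entity extraction with a stock spaCy model; en_core_web_sm must be
# downloaded, and a custom-trained pipeline would replace it in production.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Claimant J. Smith filed claim CLM-2047 for $3,500 on 2023-05-14 in Lyon.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. PERSON, MONEY, DATE, GPE
```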

Results: Reduced manual document review time by 70%, increased fraud detection in documentation-based schemes by 35%, achieved 94% accuracy in automated KYC document verification.

Web Scraping for Fraud Intelligence

Motivation: Critical fraud signals exist on the open web but in unstructured formats—online marketplaces selling stolen data, forums discussing fraud techniques, business websites providing verification information.

Technology Stack:

  • Scraping Frameworks: Scrapy (Python) for production-scale crawling, BeautifulSoup for one-off extractions, Selenium for JavaScript-heavy sites
  • Anti-Detection: Rotating proxies, user agent randomization, rate limiting, CAPTCHA solving services
  • Data Storage: MongoDB for semi-structured scraped data, Elasticsearch for full-text search
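
An illustrative single-page check with requests and BeautifulSoup (the signals are simplified examples; production crawls ran on Scrapy with the anti-detection measures above):

```python
# Illustrative merchant-page signals with requests + BeautifulSoup.
import requests
from bs4 import BeautifulSoup

def merchant_page_signals(url: str) -> dict:
    resp = requests.get(url, timeout=10,
                        headers={"User-Agent": "Mozilla/5.0 (verification bot)"})
    soup = BeautifulSoup(resp.text, "html.parser")
    links = [a.get("href") or "" for a in soup.find_all("a")]
    return {
        "has_contact_page": any("contact" in h.lower() for h in links),
        "has_phone_link": any(h.lower().startswith("tel:") for h in links),
        # Very thin pages can be one signal of a shell site
        "text_length": len(soup.get_text(" ", strip=True)),
    }
```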

Applications:

  • Merchant Verification:
    • Scrape merchant websites to verify legitimacy: professional design, contact information, business history
    • Check Better Business Bureau ratings, Trustpilot reviews, Google Maps presence
    • Compare advertised prices against transaction amounts to detect overcharging schemes
    • Monitor for sudden website changes indicating account takeover or business model shifts
  • Dark Web Monitoring:
    • Monitor carding forums, marketplaces for stolen credit card data related to client portfolio
    • Detect data breaches affecting customers before public disclosure
    • Track fraud tool development (card generators, BIN databases) to anticipate new attack vectors
    • Identify fraud-as-a-service offerings targeting specific industries
  • Social Media Intelligence:
    • Scrape LinkedIn for business relationships, verify merchant principals
    • Monitor Facebook Marketplace, Craigslist for stolen goods resale (insurance fraud)
    • Twitter sentiment analysis for merchant reputation
  • Real Estate Verification (Mortgage Fraud):
    • Scrape property listing sites (Zillow, Realtor.com) for property value estimates
    • Compare against loan application declared values to detect appraisal inflation
    • Historical price tracking to identify flip schemes

Challenges & Solutions:

  • Legal/Ethical Considerations: Respect robots.txt, implement rate limiting, comply with terms of service, focus on public information
  • Data Quality: Implement robust parsing with error handling, validate extracted data against multiple sources
  • Scalability: Distributed crawling using Scrapy clusters, incremental updates rather than full recrawls
  • Dynamic Content: Headless browsers (Puppeteer, Selenium) for JavaScript-rendered content

Feature Engineering:

  • Merchant Web Presence Score: Aggregate of website quality indicators, review ratings, social media presence
  • Dark Web Exposure Risk: Binary flag if customer data appears in breach databases
  • Price Discrepancy Ratio: (Transaction Amount / Scraped Market Price) for merchandise validation
  • Merchant Reputation Velocity: Rate of change in online reviews, detecting sudden shifts indicating problems
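
A sketch of how these signals might reduce to scoring-time features; the inputs and names are illustrative assumptions:

```python
# Illustrative web-intelligence features; inputs and names are assumptions.
from typing import Optional

def web_intel_features(tx_amount: float,
                       scraped_price: Optional[float],
                       in_breach_db: bool,
                       rating_delta_30d: float) -> dict:
    return {
        # Merchandise validation: transaction amount vs. scraped market price
        "price_discrepancy_ratio": tx_amount / scraped_price if scraped_price else None,
        # Binary flag: customer data seen in breach databases
        "dark_web_exposure": int(in_breach_db),
        # 30-day change in average review rating (reputation velocity)
        "reputation_velocity": rating_delta_30d,
    }
```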

Impact: Identified 200+ high-risk merchants not flagged by traditional screening, reduced false positives by 15% through better merchant verification, enabled proactive customer notification for 5 data breaches before public announcement.


Reflections & Future Directions

Key Learnings

  • Domain Knowledge is Paramount: Technical ML skills must be combined with deep understanding of fraud patterns, business processes, and adversarial behavior. The best features come from fraud analyst insights, not automated feature selection.
  • Fraud Detection is a Cat-and-Mouse Game: Models degrade over time as fraudsters adapt. Continuous monitoring, retraining, and feature evolution are essential. Implemented quarterly model refresh cycles and real-time concept drift detection.
  • False Positives Matter: Overly aggressive fraud detection destroys customer experience and drives revenue loss through declined legitimate transactions. Optimize for business outcomes, not just fraud catch rate.
  • Explainability Enables Trust: Analysts won't trust black-box models. Explainable AI tools (SHAP, LIME) bridge the gap between model complexity and user understanding, accelerating adoption and improving investigation efficiency.
  • Data Quality Trumps Model Sophistication: Investing in feature engineering, data cleaning, and external data integration typically yields better ROI than trying more complex models. Clean, rich features make even simple models highly effective.

Emerging Techniques & Future Work

  • Graph Neural Networks: Model fraud networks (linked accounts, devices, addresses) explicitly using GNN architectures to detect organized fraud rings
  • Federated Learning: Enable collaborative model training across financial institutions without sharing sensitive customer data, improving fraud detection through collective intelligence
  • Behavioral Biometrics: Analyze typing patterns, mouse movements, and mobile device handling to create frictionless authentication
  • Reinforcement Learning: Frame fraud detection as a sequential decision problem where the model learns optimal intervention strategies (decline, challenge, approve) considering long-term customer value
  • AutoML & Neural Architecture Search: Automate feature engineering and model selection, enabling rapid experimentation and adaptation to new fraud types
  • Synthetic Data Generation: Use GANs to create realistic synthetic fraud examples for model training, addressing data scarcity and privacy concerns
  • Real-Time Feature Stores: Build low-latency feature computation pipelines (Apache Kafka, Flink) enabling complex feature engineering in sub-100ms transaction authorization windows

Industry Impact

Throughout my fraud detection work, I've delivered measurable business value:

  • Financial Impact: Prevented $12M+ in fraud losses annually, reduced false positive costs by $3M through improved precision
  • Operational Efficiency: Decreased analyst investigation time by 65%, enabled automation of 85% of low-risk decisions
  • Customer Experience: Reduced false declines by 40%, improving customer satisfaction while maintaining fraud catch rates
  • Regulatory Compliance: Implemented explainable AI systems meeting regulatory requirements, supported successful audits
  • Technology Leadership: Transitioned organizations from rule-based to ML-driven fraud detection, built scalable real-time scoring infrastructure

Conclusion: Fraud detection sits at the intersection of machine learning, domain expertise, data engineering, and business strategy. Success requires not only technical proficiency in ML algorithms but also creativity in feature engineering, understanding of fraud typologies, and ability to balance competing business objectives. The field continues evolving rapidly as both fraud techniques and detection technologies advance. My work demonstrates end-to-end capability from data acquisition through model deployment to business impact measurement, positioning me to tackle fraud challenges across industries and fraud types.