This is the fourth part of the fraud detection whitepaper. You can find the first, second, and third parts in my previous blog posts on this topic.
We create multiple mining models by using different algorithms, different input data sets, and different algorithm parameters. Then we evaluate the models in order to find the most appropriate candidates for the actual deployment to production.
Many different algorithms can be used for fraud detection, and it is difficult to say which one generally yields the best result. In a project, the algorithms are typically chosen based on experience and knowledge of the given domain. Because we use the Microsoft SQL Server suite, we use the Microsoft Decision Trees, Microsoft Neural Network, and Microsoft Naïve Bayes algorithms for the directed models, and Microsoft Clustering for the undirected ones. In recent years, Support Vector Machines methods have become more and more popular. SSAS does not include this algorithm out of the box. However, it can be downloaded as a free plug-in algorithm for SSAS from the Microsoft CodePlex site; see
Valkonet, J. (2008). Support Vector Machine plug-in in Analysis Services. Retrieved from Microsoft CodePlex: http://svmplugin.codeplex.com/
Of course, if time or software policy constraints prevent us from using this download, we simply skip it. We do not lose much, because, according to
Sahin Y., & Duman E. (2011). Detecting Credit Card Fraud by Decision Trees and Support Vector Machines. Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS.,
the Decision Trees algorithm usually yields better results in fraud detection analysis than Support Vector Machines. For details on specific data mining algorithms, please refer to
Han J., Kamber M., & Pei J. (2011). Data Mining: Concepts and Techniques, Third Edition. Morgan Kaufmann,
or to the SolidQ course
Sarka D. (2012). Data Mining with SQL Server 2012. SolidQ. Retrieved from http://www.solidq.com/squ/courses/Pages/Data-Mining-with-SQL-Server-2012.aspx.
We evaluate the efficiency of different supervised models by using standard techniques, namely the Lift Chart, the Classification Matrix, and Cross Validation. All of these techniques are built into the SSAS data mining feature and are described in more detail in
MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley.
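To make two of these measures concrete outside of SSAS, here is a minimal Python sketch of a classification matrix and of cumulative lift for a binary fraud flag. It is not the SSAS implementation; the function names, the sample data, and the 0.5 score cutoff are assumptions made only for illustration.

```python
def classification_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a binary fraud flag."""
    matrix = {(a, p): 0 for a in (True, False) for p in (True, False)}
    for a, p in zip(actual, predicted):
        matrix[(a, p)] += 1
    return matrix


def cumulative_lift(actual, scores, fraction):
    """Share of all frauds captured in the top `fraction` of cases
    (ranked by descending score), divided by the share of cases inspected."""
    total_frauds = sum(1 for is_fraud in actual if is_fraud)
    if total_frauds == 0:
        raise ValueError("lift is undefined when there are no frauds")
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    caught = sum(1 for _, is_fraud in ranked[:top_n] if is_fraud)
    return (caught / total_frauds) / (top_n / len(ranked))


# Hypothetical test set: actual fraud flags and model scores (assumed data).
actual = [True, False, True, False, False, False, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.2, 0.6, 0.1, 0.05]
predicted = [score >= 0.5 for score in scores]  # assumed cutoff

matrix = classification_matrix(actual, predicted)
lift_top_30_percent = cumulative_lift(actual, scores, 0.3)
```

In the matrix, the key (True, False) counts frauds that the model missed, and (False, True) counts false alarms. A lift above 1 means that inspecting the top-scored cases finds more frauds than inspecting the same number of randomly chosen cases.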
To evaluate the Clustering models, we have developed a technique of our own, based on entropy. If the individual clusters are homogeneous, the entropy in any given cluster must be low. We calculate the average entropy and the standard deviation of the entropy across all clusters. In an SSAS Clustering model that has been trained (or processed), it is possible to read the model data with DMX queries. In the cluster nodes we can identify the distribution of the input variables, and then use it to calculate the entropy.
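The idea can be sketched in Python as follows. This is only an illustration, not our exact procedure: it assumes Shannon entropy in bits, averages the entropy over the input variables within each cluster, and uses hypothetical distributions in place of the values that a DMX query against the model content would return.

```python
import math
from statistics import mean, pstdev


def shannon_entropy(probabilities):
    """Shannon entropy (in bits) of one discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def cluster_entropy(variable_distributions):
    """Average entropy over the input variables of a single cluster
    (averaging across variables is an assumption of this sketch)."""
    return mean(
        shannon_entropy(list(distribution.values()))
        for distribution in variable_distributions.values()
    )


# Hypothetical per-cluster state distributions of the input variables.
clusters = {
    "Cluster 1": {
        "Gender": {"M": 0.5, "F": 0.5},
        "Region": {"North": 0.25, "South": 0.25, "East": 0.25, "West": 0.25},
    },
    "Cluster 2": {
        "Gender": {"M": 1.0, "F": 0.0},
        "Region": {"North": 0.5, "South": 0.5, "East": 0.0, "West": 0.0},
    },
}

entropies = {name: cluster_entropy(dists) for name, dists in clusters.items()}
average_entropy = mean(list(entropies.values()))
entropy_std_dev = pstdev(list(entropies.values()))
```

Here Cluster 2 is more homogeneous than Cluster 1, so its entropy is lower. When comparing several Clustering models, a lower average entropy suggests more homogeneous clusters overall.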
From experience, we have learned that not all algorithms are equally useful for all data sets. The Microsoft Neural Network algorithm works best when the frequency of the target state (i.e., fraud) is about 50%. Microsoft Naïve Bayes can work well when the target state is represented by approximately 10% of the population or more. Microsoft Decision Trees, however, works well even if the target state frequency is only about 1%, and is thus a very suitable algorithm for small data sets and for a low frequency of the target state.
The continuous learning cycle is shown graphically in Figure 1.
Figure 1: The continuous learning cycle
We start by creating the directed models, assuming that the customer has already flagged frauds in the existing data. We evaluate the directed models and then use the best one to predict frauds in the new data. We also create the undirected models, evaluate them, and use the best one to select potential frauds. We do this over time and check the difference between the number or percentage of frauds caught with the directed and the undirected model deployed. When this difference drops, it is time to refine the directed model. In addition, we store the predictions of both models and the actual, confirmed or reported frauds in a data warehouse. When the percentage of predicted frauds in the total number of frauds drops, it is time to refine both models. We use an OLAP cube on top of the data warehouse to measure the efficiency of the models over time.
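The two refinement triggers described above can be sketched as a simple comparison between two periods. The following Python snippet is a hypothetical illustration only: the `PeriodResult` structure, the sample counts, and the 0.05 tolerance are assumptions, and in practice the numbers would come from the data warehouse or the OLAP cube.

```python
from dataclasses import dataclass


@dataclass
class PeriodResult:
    period: str
    confirmed_frauds: int      # actual frauds confirmed or reported
    caught_by_directed: int    # confirmed frauds the directed model predicted
    caught_by_undirected: int  # confirmed frauds the undirected model selected
    caught_by_either: int      # confirmed frauds predicted by at least one model


def model_gap(result):
    """Difference between the directed and undirected capture rates."""
    directed = result.caught_by_directed / result.confirmed_frauds
    undirected = result.caught_by_undirected / result.confirmed_frauds
    return directed - undirected


def predicted_share(result):
    """Share of all confirmed frauds that the models predicted."""
    return result.caught_by_either / result.confirmed_frauds


def refinement_signals(previous, current, tolerance=0.05):
    """Return refinement suggestions; `tolerance` is an assumed threshold."""
    signals = []
    if model_gap(previous) - model_gap(current) > tolerance:
        signals.append("refine directed model")
    if predicted_share(previous) - predicted_share(current) > tolerance:
        signals.append("refine both models")
    return signals


# Hypothetical monthly figures.
january = PeriodResult("2012-01", 100, 80, 50, 85)
february = PeriodResult("2012-02", 100, 70, 55, 84)
signals = refinement_signals(january, february)
```

With these sample figures, the gap between the models shrinks from about 30 to about 15 percentage points, so only the directed model is flagged for refinement. The overall share of predicted frauds stays almost unchanged.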