Abstract:
We investigate techniques to improve the cost performance of a bank’s fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibilities among the data schemata of different banks. We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques.
Our proposed methods of combining multiple learned fraud detectors under a “cost model” are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.
Challenging issues the credit card fraud-detection domain presents to data mining:
In today’s increasingly electronic society, and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. Credit cards account for a large share of Internet sales, and that share is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks’ and the cardholders’ interest to reduce illegitimate use of credit cards by early fraud detection. For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subjects of academic research, especially with respect to e-commerce. The credit card fraud-detection domain presents a number of challenging issues for data mining:
There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.
The data are highly skewed—many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all transactions are legitimate, although this is equivalent to not detecting fraud at all.
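The skew problem is easy to reproduce: under plain accuracy, a detector that never predicts fraud looks excellent. A minimal sketch, assuming an illustrative 98/2 class split:

```python
# Illustration only: on skewed data, an "always legitimate" detector
# scores high accuracy while catching zero fraud.
n_legit, n_fraud = 98_000, 2_000          # assumed 98% legitimate, 2% fraud
labels = ["legit"] * n_legit + ["fraud"] * n_fraud

predictions = ["legit"] * len(labels)     # trivial majority-class detector

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
fraud_caught = sum(p == "fraud" == y for p, y in zip(predictions, labels))

print(accuracy)      # 0.98 -- looks excellent
print(fraud_caught)  # 0 -- but no fraud is ever detected
```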
Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques. Our approach addresses the efficiency and scalability issues in several ways. We divide a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers’ behavior to generate a metaclassifier.
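A per-transaction cost model of this kind can be sketched as follows; the fixed investigation overhead and the cost rules below are illustrative assumptions, not the figures used in our experiments:

```python
# Sketch of a variable-cost evaluation: a missed fraud costs the
# transaction amount, while challenging a transaction costs a fixed
# overhead (capped by the amount). All numbers are hypothetical.
OVERHEAD = 75.0  # assumed fixed cost of investigating a flagged transaction

def total_cost(transactions):
    """transactions: list of (amount, is_fraud, flagged) tuples."""
    cost = 0.0
    for amount, is_fraud, flagged in transactions:
        if is_fraud and not flagged:
            cost += amount                 # fraud slips through: lose the amount
        elif flagged:
            cost += min(OVERHEAD, amount)  # investigation overhead
    return cost

txns = [(500.0, True, False), (20.0, False, True), (900.0, True, True)]
print(total_cost(txns))  # 500 + 20 + 75 = 595.0
```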
Our approach to credit card fraud detection:
Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.
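The parallel training step can be sketched as follows; a thread pool stands in for separate processors, and the trivial threshold "learner" is an illustrative assumption, not one of the algorithms we actually use:

```python
# Sketch of mining base models from labeled subsets in parallel.
from concurrent.futures import ThreadPoolExecutor

def train_base_classifier(subset):
    # Stand-in learner: threshold at the mean fraudulent amount.
    fraud_amounts = [amt for amt, label in subset if label == "fraud"]
    return sum(fraud_amounts) / len(fraud_amounts)

subsets = [
    [(900.0, "fraud"), (20.0, "legit"), (700.0, "fraud")],
    [(450.0, "fraud"), (35.0, "legit")],
]
with ThreadPoolExecutor() as pool:
    # Each subset is mined independently; results arrive in subset order.
    thresholds = list(pool.map(train_base_classifier, subsets))
print(thresholds)  # [800.0, 450.0]
```

In a real deployment each subset would go to its own processor; the thread pool merely keeps the sketch self-contained.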
Skewed distributions
Because the learning processes on the subsets are independent, the subsets can be distributed to different processors and each learning process can run in parallel. For massive amounts of data, our approach can substantially improve speed for learning algorithms with superlinear time complexity. The generated classifiers are combined by metalearning from their classification behavior. We have described several metalearning strategies elsewhere. To simplify our discussion, we describe only the class-combiner (or stacking) strategy. This strategy composes a metalevel training set by using the base classifiers’ predictions on a validation set as attribute values and the actual classification as the class label. This training set then serves for training a metaclassifier. For integrating subsets, the class-combiner strategy is more effective than voting-based techniques. When the learned models are used during online fraud detection, transactions feed into the learned base classifiers, and the metaclassifier then combines their predictions. Again, the base classifiers are independent and can execute in parallel on different processors. In addition, our approach can prune redundant base classifiers without affecting cost performance, making it relatively efficient in the credit card authorization process.

Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval)
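A minimal sketch of the class-combiner (stacking) strategy, with hypothetical rule-based base classifiers and a lookup-table metalearner standing in for real learned models:

```python
# Class-combiner sketch: base predictions on a validation set become
# the attributes of a metalevel training set; the true label is the
# class. The rules and the lookup "metalearner" are toy stand-ins.
from collections import Counter

def large_amount_rule(txn):       # hypothetical base classifier 1
    return "fraud" if txn["amount"] > 400 else "legit"

def foreign_rule(txn):            # hypothetical base classifier 2
    return "fraud" if txn["foreign"] else "legit"

def build_metaclassifier(validation, base_classifiers):
    table = {}
    for txn, label in validation:
        key = tuple(clf(txn) for clf in base_classifiers)
        table.setdefault(key, Counter())[label] += 1
    # Majority true label per prediction vector plays the metaclassifier.
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}

validation = [
    ({"amount": 900, "foreign": True}, "fraud"),
    ({"amount": 50, "foreign": False}, "legit"),
    ({"amount": 600, "foreign": False}, "legit"),
]
meta = build_metaclassifier(validation, [large_amount_rule, foreign_rule])

# Online use: base classifiers predict, the metaclassifier combines.
new = {"amount": 700, "foreign": True}
print(meta[(large_amount_rule(new), foreign_rule(new))])  # fraud
```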
Database compatibility
An+1 ≠ Bn+1: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems in which one database has one more attribute than the other; that is,

Schema(DBA) = {A1, A2, …, An, An+1, C}
Schema(DBB) = {B1, B2, …, Bn, C}

where we assume that attribute Bn+1 is not present in DBB. (The dual problem has DBB composed with Bn+1, but An+1 is not available to DBA.)
Bridging methods
There are two methods for handling the missing attributes.
Method I: Learn a local model for the missing attribute and exchange it. Database DBB imports, along with the remote classifier agent, a bridging agent from database DBA that is trained to compute values for the missing attribute An+1 in DBB’s data. Next, DBB applies the bridging agent to its own data to estimate the missing values. In many cases, however, DBA may be unable or unwilling to provide such an agent (for example, when the attribute is proprietary).
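Method I can be sketched as follows; the attribute names (merchant_type, risk_score) and the group-mean estimator are hypothetical stand-ins for whatever learner DBA chooses:

```python
# Sketch of a bridging agent: DBA trains a model that estimates its
# extra attribute from attributes both schemas share, then exports
# the model (not the data) to DBB.
def train_bridging_agent(db_a_rows, common_attr, extra_attr):
    sums, counts = {}, {}
    for row in db_a_rows:
        key = row[common_attr]
        sums[key] = sums.get(key, 0.0) + row[extra_attr]
        counts[key] = counts.get(key, 0) + 1
    means = {key: sums[key] / counts[key] for key in sums}
    default = sum(sums.values()) / sum(counts.values())
    # The returned closure is the exported black-box bridging agent.
    return lambda row: means.get(row[common_attr], default)

db_a = [
    {"merchant_type": "online", "risk_score": 0.75},
    {"merchant_type": "online", "risk_score": 0.25},
    {"merchant_type": "retail", "risk_score": 0.125},
]
bridge = train_bridging_agent(db_a, "merchant_type", "risk_score")

# DBB estimates the attribute it never collected:
print(bridge({"merchant_type": "online"}))  # 0.5
```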
Method II: Learn without the missing attribute. The alternative for database DBB is simply to add the missing attribute to its data set and fill it with null values. Even though the missing attribute might have high predictive value for DBA, it is of no value to DBB. After all, DBB did not include it in its schema, and presumably its other attributes (including the common ones) have predictive value.
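This null-filling alternative is nearly a one-liner; a sketch with hypothetical attribute names:

```python
# Sketch of Method II: pad DBB's rows with a null-valued column so
# both data sets share one schema before metalearning.
def pad_with_nulls(rows, missing_attr):
    return [{**row, missing_attr: None} for row in rows]

db_b = [{"amount": 120.0, "foreign": False}]
print(pad_with_nulls(db_b, "risk_score"))
# [{'amount': 120.0, 'foreign': False, 'risk_score': None}]
```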
Both approaches address the incompatible-schema problem, and metalearning over these models can proceed in a straightforward manner. Compared to other approaches, ours is more general: our techniques support both categorical and continuous attributes and are not limited to a specific syntactic case or to purely logical consistency of generated rule models. Instead, these bridging techniques can employ machine- or statistical-learning algorithms to compute arbitrary models for the missing values.
Pruning
An ensemble of classifiers can be unnecessarily complex, meaning that many classifiers might be redundant, wasting resources and reducing system throughput. (Throughput here denotes the rate at which a metaclassifier can pipe through and label a stream of data items.) We study the efficiency of metaclassifiers by investigating the effects of pruning (discarding certain base classifiers) on their performance. Determining the optimal set of classifiers for metalearning is a combinatorial problem. Hence, the objective of pruning is to use heuristic methods to search for partially grown metaclassifiers (metaclassifiers with pruned subtrees) that are more efficient and scalable and, at the same time, achieve comparable or better predictive performance than fully grown (unpruned) metaclassifiers. To this end, we introduced two stages for pruning metaclassifiers: the pretraining and posttraining pruning stages. Both stages are essential and complementary with respect to improving the system’s accuracy and efficiency. Pretraining pruning refers to filtering the classifiers before they are combined.
Instead of combining classifiers in a brute-force manner, we introduce a preliminary stage for analyzing the available classifiers and qualifying them for inclusion in a combined metaclassifier. Only those classifiers that appear (according to one or more predefined metrics) to be most promising participate in the final metaclassifier. Here, we adopt a black-box approach that evaluates the set of classifiers based only on their input-output behavior, not their internal structure. Conversely, posttraining pruning denotes the evaluation and pruning of constituent base classifiers after a complete metaclassifier has been constructed. We have implemented and experimented with three pretraining and two posttraining pruning algorithms, each with different search heuristics. The first pretraining pruning algorithm ranks and selects classifiers by evaluating each candidate independently (metric-based), and the second decides by examining the classifiers in correlation with each other (diversity-based).
The third relies on the independent performance of the classifiers and the manner in which they predict with respect to each other and with respect to the underlying data set (coverage and specialty- based).
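A sketch of the metric-based variant, using plain accuracy as the ranking metric (a cost-based metric would be the natural choice in this domain); the toy classifiers are hypothetical:

```python
# Sketch of metric-based pretraining pruning: score each candidate
# classifier independently on a validation set and keep the top k.
def prune_pretraining(classifiers, validation, k):
    def score(clf):
        return sum(clf(x) == y for x, y in validation) / len(validation)
    return sorted(classifiers, key=score, reverse=True)[:k]

def always_legit(amount):          # weak candidate: never flags fraud
    return "legit"

def flags_large(amount):           # stronger candidate: threshold rule
    return "fraud" if amount > 5 else "legit"

validation = [(9, "fraud"), (2, "legit"), (7, "fraud"), (1, "legit")]
kept = prune_pretraining([always_legit, flags_large], validation, k=1)
print(kept[0] is flags_large)  # True -- the better scorer survives
```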
The first posttraining pruning algorithm is based on a cost-complexity pruning technique (a technique the CART decision-tree learning algorithm uses, which seeks to minimize the cost and size of its tree while reducing the misclassification rate).
The second is based on the correlation between the classifiers and the metaclassifier. Compared to other approaches, ours considers the more general setting where ensembles of classifiers can be obtained by applying possibly different learning algorithms over (possibly) distinct databases. Furthermore, instead of voting over the predictions of classifiers for the final classification, we use metalearning to combine the individual classifiers’ predictions.
Conclusion:
One limitation of our approach to skewed distributions is the need to run preliminary experiments to determine the desired training distribution based on a defined cost model. This process can be automated, but it is unavoidable because the desired distribution highly depends on the cost model and the learning algorithm. Currently, for simplicity, all the base learners use the same desired distribution; using an individualized training distribution for each base learner could improve performance. Furthermore, because thieves also learn and fraud patterns evolve over time, some classifiers are more relevant than others at a particular time. Therefore, an adaptive classifier-selection method is essential. Unlike a monolithic approach of learning one classifier using incremental learning, our modular multiclassifier approach facilitates adaptation over time and removes out-of-date knowledge. By first applying feature-extraction algorithms, followed by the application of machine-learning algorithms (such as Ripper) to learn and combine multiple models for different types of intrusions, we have achieved remarkably good success in detecting intrusions. This work, as well as the results reported in this article, demonstrates convincingly that distributed data-mining techniques that combine multiple models produce effective fraud and intrusion detectors.
References
“JAM: Java Agents for Metalearning over Distributed Databases,” Proc.
P. Chan and S. Stolfo, “Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection,” Proc. Fourth Int’l Conf. Knowledge Discovery and Data Mining.