This is a submission to the insight data challenge. Challenge details can be found at https://github.com/InsightDataScience/anomaly_detection.
Detailed documentation about the code is provided in the code itself. Here, I'll provide some general descriptions about the techniques used.
There are two phases: The build phase and the stream phase.
Build phase: Build the network based on batch_log.json.
- Each user is assigned a unique User object. A Network object is used as a container for the Users.
- The User object stores the user's friends and recent purchases (up to size T). It also stores purchases in the user's social network. Purchases are stored as their own object, so any list that stores purchases is simply storing a pointer to a purchase object rather than a duplicate record.
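The submission's actual code isn't reproduced here, but the data structures described above might look roughly like the following minimal Python sketch. The class and attribute names (`Network.get_user`, `needs_rebuild`, and so on) are illustrative, not taken from the submission:

```python
from collections import deque

class Purchase:
    """A single purchase; any list that stores one holds a reference
    to this object rather than a duplicate record."""
    def __init__(self, amount, timestamp):
        self.amount = amount
        self.timestamp = timestamp

class User:
    def __init__(self, user_id, t):
        self.id = user_id
        self.friends = set()                      # ids of direct friends
        self.purchases = deque(maxlen=t)          # own recent purchases (up to T)
        self.network_purchases = deque(maxlen=t)  # recent purchases in social network
        self.needs_rebuild = True                 # flag: network list is stale

class Network:
    """Container mapping user ids to User objects."""
    def __init__(self, t):
        self.t = t
        self.users = {}

    def get_user(self, user_id):
        """Fetch a user, creating it on first sight."""
        return self.users.setdefault(user_id, User(user_id, self.t))
```

Using `deque(maxlen=t)` makes "up to size T" automatic: appending to a full deque silently evicts the oldest entry.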
- When the network is initially being built, we don't need to flag anomalous purchases, so every purchase is simply stored in the purchaser's list. During this phase, each befriend or unfriend event touches only the two affected users.
Stream phase: Begin updating the network based on stream_log.json.
- It is now important for each user to have an up-to-date list of the latest purchases within their social network whenever they personally make a new purchase. Rather than updating the list on every befriend/unfriend event that might affect the user (which would mean a lot of adding and removing), the list is rebuilt from scratch when necessary. Each user has a flag indicating that their social network purchase list needs to be rebuilt. Since the list is not built during the first phase, each user begins flagged; thereafter, a user is flagged only when a befriend/unfriend event affects their network. (The code could easily be changed to treat the build phase as a stream phase.)
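The lazy-rebuild pattern described above can be sketched in a few lines of Python (the `needs_rebuild`/`recent` attributes and the `rebuild` callable are hypothetical names, not from the submission):

```python
def network_purchases(user, rebuild):
    """Return the user's social-network purchase list, rebuilding it lazily.

    `rebuild` is a callable that crawls the user's network and returns a
    fresh list. With this pattern, a befriend/unfriend event only has to
    set `needs_rebuild = True` rather than edit the list in place; the
    actual work is deferred until the list is next read.
    """
    if user.needs_rebuild:
        user.recent = rebuild(user)
        user.needs_rebuild = False
    return user.recent
```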
- A user's social network is not stored with the user itself, but is rather crawled and built with every event directly concerning the user. This is a major design decision.
- The cost is that the network must be crawled on every purchase event rather than read from a stored list. In particular, when crawling a network to generate the list, edges may be traversed to users already on the list. Since users are flagged as they are added to the list, the extra cost is simply the number of such edges traversed per purchase. In the worst case (a clique), this means crawling O(n^2) edges rather than visiting O(n) users. In a highly connected graph where purchase events outnumber befriend/unfriend events by orders of magnitude, it might actually make sense to accept the high maintenance cost on befriend/unfriend events in exchange for a low maintenance cost on purchase events.
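As a concrete illustration of the crawl, here is a sketch in Python. For brevity it represents the friendship graph as a plain dict of id -> set of friend ids (the real code stores friends on User objects, but the traversal logic is the same):

```python
from collections import deque

def crawl_network(friends_of, user_id, d):
    """Breadth-first search out to distance d over an undirected friendship
    graph given as a dict {user_id: set of friend ids}. Returns the ids in
    the user's social network, excluding the user themselves.

    Each user is marked visited the moment it is first enqueued, so an edge
    leading to an already-listed user costs only a set lookup.
    """
    visited = {user_id}
    result = set()
    frontier = deque([(user_id, 0)])
    while frontier:
        uid, dist = frontier.popleft()
        if dist == d:
            continue  # don't expand past distance d
        for fid in friends_of[uid]:
            if fid not in visited:
                visited.add(fid)
                result.add(fid)
                frontier.append((fid, dist + 1))
    return result
```

For example, on a chain graph `0-1-2-3`, crawling from user 0 with `d=2` yields `{1, 2}`.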
- The alternatives involve either storing a graph of a user's social network with the user, or storing a list with some information about the graph with the user. These alternatives mean a higher space cost and more maintenance for each befriend/unfriend event. There are quite a few techniques we could try with a variety of tradeoffs, but ultimately, befriend/unfriend events occur often enough and the cost of crawling the social network is low enough in practice that crawling the social network with every purchase appears to be the best option.
- Purchase event: Each purchase is added to the user's personal purchase history, as well as to the social network histories of those within distance D of the user. Next, the purchase is checked to see if it is anomalous. Rather than using the social network purchase history directly, two variables corresponding to the sum and sum-of-squares are used. These two variables are updated in tandem with the list.
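Keeping a running sum and sum-of-squares lets the mean and standard deviation be computed in O(1) without re-scanning the list. A minimal sketch, assuming the challenge's criterion (anomalous if more than 3 standard deviations above the mean, with at least 2 purchases in the network history); the function name is illustrative:

```python
import math

def is_anomalous(amount, total, total_sq, n):
    """Check a purchase of `amount` against the n purchases summarized by
    running sum `total` and running sum of squares `total_sq`.

    When a purchase p enters the history and q is evicted, the two running
    values are updated in tandem: total += p - q; total_sq += p*p - q*q.
    """
    if n < 2:  # challenge spec: need at least 2 purchases in the history
        return False
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # clamp float error
    return amount > mean + 3 * math.sqrt(variance)
```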
- Befriend/Unfriend event: A befriend or unfriend event between two users first updates the friend lists of the two users accordingly. These events may affect the social networks of users within distance D-1 of either user, so the networks at these distances are traversed, and each user in them is flagged. (Note that all connections are two-way, but the code could easily be made to accommodate directional connections.)
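A sketch of that event handler, again using a plain dict for the graph and a `stale` set standing in for the per-user flags (all names are illustrative):

```python
from collections import deque

def within(friends_of, user_id, d):
    """Ids within distance d of user_id, including the user themselves."""
    seen = {user_id}
    frontier = deque([(user_id, 0)])
    while frontier:
        uid, dist = frontier.popleft()
        if dist == d:
            continue
        for fid in friends_of[uid]:
            if fid not in seen:
                seen.add(fid)
                frontier.append((fid, dist + 1))
    return seen

def handle_friend_event(friends_of, stale, u1, u2, d, befriend):
    """Apply a (two-way) befriend/unfriend event, then flag every user whose
    distance-d social network could have changed: those within distance d-1
    of either endpoint."""
    if befriend:
        friends_of[u1].add(u2)
        friends_of[u2].add(u1)
    else:
        friends_of[u1].discard(u2)
        friends_of[u2].discard(u1)
    for uid in within(friends_of, u1, d - 1) | within(friends_of, u2, d - 1):
        stale.add(uid)  # these users' network purchase lists must be rebuilt
```

Crawling from both endpoints after the update matters for unfriend events: once the edge is gone, neither side is reachable from the other, so each side is flagged from its own endpoint.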