I am looking into how to generate transactional data for testing and tuning a fraud detection system, and would like to model the transactional activity seen at a bank. I’d like the generated data to reflect complex relationships, so I’m willing to invest some time in building something reasonably sophisticated in how the different people are related, why money gets transferred, how the amounts are distributed, etc.
I was thinking of either a) coding everything by hand in SQL scripts, b) using a declarative language (either Prolog or Clojure's core.logic) to encode all the constraints I have on the data, or c) experimenting with a probabilistic programming language (most probably Anglican) and sampling from the resulting model.
My guess is that the first two options are more down-to-earth and can just work, but will be limited; whereas the last one might not be performant enough to generate data at the scale of a medium-sized bank, and I might drown in new concepts that I don’t fully grasp.
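To make the question more concrete, here is a minimal sketch of the kind of generator I have in mind, written in plain Python rather than in any of the three options above. Everything in it is a made-up assumption for illustration: the names (`Transaction`, `generate`), the two transfer reasons (monthly salary and occasional transfers between friends), and the log-normal parameters, none of which are calibrated against real bank data.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    day: int
    sender: str
    receiver: str
    amount: float
    reason: str


def generate(n_people=50, n_employers=3, n_days=60, seed=0):
    """Sample a toy transaction log from a hand-written generative model."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    people = [f"person_{i}" for i in range(n_people)]
    employers = [f"employer_{i}" for i in range(n_employers)]

    # Relationships: each person has one employer and three friends.
    employer_of = {p: rng.choice(employers) for p in people}
    friends = {p: rng.sample([q for q in people if q != p], 3) for p in people}

    # Fixed monthly salary per person, drawn once from a log-normal
    # distribution (the parameters are arbitrary placeholders).
    salary = {p: round(rng.lognormvariate(7.5, 0.4), 2) for p in people}

    transactions = []
    for day in range(n_days):
        for p in people:
            # Reason 1: the employer pays the salary every 30 days.
            if day % 30 == 0:
                transactions.append(
                    Transaction(day, employer_of[p], p, salary[p], "salary")
                )
            # Reason 2: on about 10% of days, a person sends money to a friend.
            if rng.random() < 0.1:
                friend = rng.choice(friends[p])
                amount = round(rng.lognormvariate(3.0, 1.0), 2)
                transactions.append(
                    Transaction(day, p, friend, amount, "peer_transfer")
                )
    return transactions


if __name__ == "__main__":
    for tx in generate()[:5]:
        print(tx)
```

Something like this is easy to write, but every new relationship or constraint becomes another hand-coded rule, which is why I am wondering whether a declarative or probabilistic approach would scale better in complexity.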
Has anyone approached a similar problem? Am I about to reinvent the wheel, and are there known ways to approach this? I’d be happy to read any comments / pointers / suggestions! Thanks in advance!