Another prediction: "fraudsters work on weekends" would yield the same graph if displayed as a percentage of total transactions.
The "outside of business hours" explanation is still valid, but it's definitely not the whole story. Overall, I'd caution that when building fraud detection models, understanding the stories behind the data is extremely important, or you risk having an algorithm that works for the wrong reasons.
For example, if your user base suddenly becomes more international (say, because you start allowing non-US users on your website), you'll suddenly have a lot of false positives if you're not cautious, because to your system those users will look like they're "operating outside of business hours".
As soon as it becomes advantageous enough to fit in with the normal user patterns, the attackers will modify their behavior accordingly.
The time-of-day, internationality, and many other signals mentioned in this post are easily evaded when it becomes profitable to do so.
Another way to put it: fraudsters are more likely to be night-owls than the rest of us.
As I've said earlier, you can see in the user counts that fraudulent users are indeed at their lowest at night (which makes sense, since fraudsters sleep as well). However, we are looking at the "fraud rate" (#fraud / all users in a given hour). I'm not sure what you even mean by "misrepresent" the data. The data speaks for itself.
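For what it's worth, the "fraud rate" metric being discussed is trivial to compute, and the distinction from raw fraud counts matters: absolute fraud can be lowest at night while the *rate* peaks, because legitimate traffic drops even faster. A minimal sketch (the data layout is invented for illustration):

```python
from collections import Counter

def fraud_rate_by_hour(transactions):
    """Compute #fraud / #total for each hour of the day.

    `transactions` is an iterable of (hour, is_fraud) pairs, where
    hour is 0-23 in local time and is_fraud is a bool.
    """
    total = Counter()
    fraud = Counter()
    for hour, is_fraud in transactions:
        total[hour] += 1
        if is_fraud:
            fraud[hour] += 1
    # The rate is undefined for hours with no traffic; report 0.0 there.
    return {h: fraud[h] / total[h] if total[h] else 0.0 for h in range(24)}

txns = [(3, True), (3, False), (3, True), (12, False), (12, False), (12, True)]
rates = fraud_rate_by_hour(txns)
```

In this toy data, hour 3 has fewer fraudulent transactions in absolute terms than a busy hour would, yet its rate (2/3) dwarfs the lunchtime rate (1/3), which is exactly the night-owl effect being debated.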
The data is easy to misinterpret if you don't account for time zones. If (a) I'm in the US and (b) I notice that most of my fraudulent transactions occur around 3AM, then it follows that (c) most of my fraud occurs at night. However, if the corollary is that most of my fraud originates from India/Singapore/etc., then it's simply a matter of time zones. Does most of my fraud occur at night? It does if I'm in the US, but if I'm the hacker in India/etc., then it occurs during the day. It's 3:45AM where I am in the US (CDT) but it's 2:15PM somewhere in India...
However, if I were to write a training manual for my employees who worked in the same building as me (US CDT), then I'm not misinterpreting the data. The data is the "truth" here - if it shows that most fraudulent transactions occur around 3AM local time, then that's the truth and I'd be foolish to over-complicate the issue when training new hires on how to spot fraud.
The time difference would explain the prevalence of night owls and other factors often used in fraud detection. On my direct office phone number alone, I receive 5-10 such calls a week.
Statistically speaking, I'm suspicious regarding the conclusion about lunch hours. In order to identify that feature robustly, it's necessary to very accurately know the variability of the background signal. Does it always dip at lunch hours and all at the same time? Does it always have the same shape? Recall what's being done here is separating the background "non-fraud transactions" from "fraud transactions". If the latter are much smaller in number than the former, and the determination of fraud has a finite miss rate or a finite false alarm rate, then all the implications of Bayes Rule apply, and the fraud signals being seen could be just some transfer function of the fraud mechanism applied to the background signal.
Moreover, do fraudulent signals increase in proportion to total background? Or are they a constant amount?
I think more work needs to go into deconvolving the two kinds of signals before assigning meaning to their parts and features of their parts.
No doubt the story assigned to these is plausible and compelling, but it is based upon unproven inferences. And, in my opinion, assuming that identifying and acting on these signals is simpler than it looks is getting way ahead of the art.
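The Bayes-rule point above can be made concrete: when fraud is a small fraction of all transactions, even a detector with a good hit rate and a modest false-alarm rate will mostly flag legitimate activity, so apparent "fraud features" can largely be the background signal leaking through. A sketch with purely illustrative numbers:

```python
def posterior_fraud(base_rate, hit_rate, false_alarm_rate):
    """P(fraud | flagged) via Bayes' rule.

    base_rate:        P(fraud) among all transactions
    hit_rate:         P(flagged | fraud)
    false_alarm_rate: P(flagged | not fraud)
    """
    p_flagged = hit_rate * base_rate + false_alarm_rate * (1 - base_rate)
    return hit_rate * base_rate / p_flagged

# 1% fraud base rate, 95% detection rate, 5% false alarms:
p = posterior_fraud(0.01, 0.95, 0.05)
```

With these (made-up) numbers, only about 16% of flagged transactions are actually fraudulent, so features of the "fraud" population are heavily contaminated by misclassified background.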
honestly... frack the internet, you're giving it too much weight
a) Having the tools to look at all your data to see where the patterns are
b) Having the tools to track instances of patterns once you've identified them.
But as I said, the patterns themselves are usually pretty simple, as you see in this article.
This one is a bit specific to Justworks since we use bank account numbers instead of credit card numbers, but one signal we've picked up on is the source of the bank account. If a company has bank accounts exclusively from certain banks, then it is almost certainly fraudulent. We've even seen people try to sign up with consecutive bank account numbers from the same banks!
For example, every insurance claim from any insurance company is run through fraud detection software. It turns out there are some characteristics of fraudulent insurance claims that have been identified over the years. The software can flag a claim as potentially fraudulent for further human review.
Discover Card's fraud detection worked extremely well in my case. My CC number was stolen, and after only two fraudulent transactions (totaling about $500) it tripped the software to freeze the card. I have no idea how they identified those as fraud so quickly; they didn't seem out of place to my eyes considering my transaction history.
For example, a particular merchant is breached, and then the fraudsters try to run most of the cards stolen in that breach at a few locations.
It could be obvious on your bank's end that a few dozen customers made a charge at Store A and then reported a charge at Store B to be fraudulent, so then they decided to freeze the cards of everyone with the same Store A -> Store B pattern.
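That "Store A -> Store B" correlation is straightforward to search for once fraud reports come in: look for a merchant shared by the purchase histories of many compromised cards. A toy sketch (merchant names and the data layout are invented):

```python
from collections import Counter

def common_point_of_compromise(histories, reported_fraud_cards):
    """For each merchant, count how many cards that later reported
    fraud had previously transacted there.

    histories: dict of card_id -> list of merchants visited
    reported_fraud_cards: set of card_ids with a fraud report
    """
    counts = Counter()
    for card in reported_fraud_cards:
        # Count each merchant at most once per compromised card.
        for merchant in set(histories.get(card, [])):
            counts[merchant] += 1
    return counts.most_common()

histories = {
    "c1": ["StoreA", "Cafe", "StoreB"],
    "c2": ["StoreA", "StoreB"],
    "c3": ["Gym", "StoreA"],
}
suspects = common_point_of_compromise(histories, {"c1", "c2", "c3"})
```

Here "StoreA" appears in all three compromised cards' histories, making it the likely breach point; a real system would also normalize by each merchant's total traffic so huge retailers don't dominate the count.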
If they were smart they'd buy a chocolate bar from a grocery store that has self-checkout, then buy something a little more expensive from Target or whatever, then buy the very expensive thing they really want.
I worked for an insurance company for over 5 years, processing claims. This is the first I have ever heard of this. That does not prove it to be untrue, but it does make me skeptical.
We had some claims automatically paid via computer but most claims were reviewed by human eyes. Humans were trained to look for indicators of fraud. There was no one thing that would get it sent to the fraud department for investigation. We were looking for a pattern or multiple indicators.
Oh, I'm never saying that claims get paid automatically or that humans don't look for fraud too.
I'm saying that there is also software on the back end that flags things to go to the fraud department automatically. Probably without your knowledge. I just remember reading an article about it a while ago - just about how good it has gotten over the years and it has become an important part of the insurance industry.
Just another example of extensive data mining.
When I say "every insurance claim from any insurance company" I mean that I assume any insurance company worth anything would be running such software.
Found some products:
AllState implies they do.
>Through the use of innovative technologies and network analysis tools, we are able to help detect and stop these crimes
From digging around, this seems kind of a new trend. I seriously doubt that "every claim at any insurance company" is being analyzed this way. And, sure, it's possible it was in use at the company where I worked and I simply did not know. But from reading some of what you linked to, I doubt it. (Though perhaps that has changed since I left.)
From one of the sites you linked to:
"Industry research indicates 10 percent of all claims contain an element of fraud," says Wolfe, and as of the end of 2010, CNA was seeing just 3.7 percent of its claims referred as potential fraud. "That’s considerably below the industry average, and we wanted to find out how much we were missing that wasn’t identified by our adjustors."
So, yeah, that's relatively recent and the searches I did sort of imply that this is an emerging market, they are still trying to convince insurance companies to do this, etc. Insurance is a very conservative industry. They tend to be somewhat slow to adopt new technologies. They are regulated by both federal laws that cover financial companies and also federal laws that cover medical companies (like HIPAA). So they have a huge regulatory burden and this makes change especially hard. Any new tech or new processes have to really be put through their paces to see if they still pass muster on multiple fronts. The de facto result is that insurance companies are kind of sticks in the mud.
Where I worked, a lot of the software was homegrown. It had been developed in-house. They didn't like hiring outside vendors/buying outside solutions. From what I gather, they were fairly cutting edge -- though, of course, it is possible that was just company hype. Since I worked there, I sometimes knew that some statements amounted to spin-doctoring. But I did only have an entry level job and was not really expecting it to turn into a career, so I am sure there is plenty that I missed.
Thanks for replying.
I just assumed it had already become "industry standard." Perhaps it was "becoming industry standard" instead. I think the article was specifically about car (not health) insurance too but I can't remember details. I went to go find it again before posting but of course I couldn't.
I also somewhat suspect (but I have no proof of course) that there is at least a little bit of data sharing to third parties that insurance companies can/do use to get more of the "big picture." Probably more car insurance than health insurance. I know such a thing as a "driving record" exists probably supplied by a company such as LexisNexis. My car insurance provided me a copy of my driving record.
So for example, Person A has a pattern of behavior across different insurance companies that can be flagged, while looking at just one insurance company's data might not reveal the pattern.
This is all wild speculation on my part, more of a thought experiment. Laws such as the FCRA are starting to address SOME of these issues but I'm not sure they are on top of all practices.
This type of tracking has become more common in the age of "big data."
That's interesting though that they didn't like hiring outside vendors. I would think they would want outside vendors because their products were already vetted for regulation compliance.
That was not my area and I no longer work in insurance. I don't recall the details. But, yes, that is a thing.
Do you tend to go to certain physical locations or use a limited number of Web sites to buy from?
Any idea how this can be tracked? Normal cookies, or something more in-depth?
A site I'm working on tracks page visits independently of logged-in user sessions.
I'm wondering if it's worth considering explicitly looking for multiple logged-in users sharing the same page visit session.
There are various tricks - e.g. Flash cookies, Etags etc - that make me feel a bit uneasy, despite how much I like the idea of tracking multiple accounts per device.
Some people think this practice violates users' privacy, and I'm one of them. This technology can be used to uniquely identify a user across multiple logins on the same site, or even multiple sites. It's quite widespread.
This paper is mostly a survey of prominent DF providers and sites using this technology, and it's also a good primer on device fingerprinting techniques.
Is this definitely how it's achieved though?
I would presume a highly-skilled fraudster could just spin up a new VM, for instance, and evade detection that way.
Do we know if "regular" cookies alone are good enough for 90% of the lazy fraudsters?
Regarding using "device fingerprinting," can I collect some opinions from HN?
Specifically, if every user record created stores a fingerprint alongside it (which is only used to find account registrations from the same device) is that just as offensive as using fingerprinting to track anonymous sessions?
From my experience building fraud detection systems at Eventbrite, most fraudsters are not that sophisticated -- fraudsters usually go for the lowest-hanging fruit and as such are looking for systems to defraud that have the highest payout for the lowest effort. Because there is always some level of uncertainty (getting detected, the credit card not working, etc.), fraudsters often favor techniques that allow them to try as many websites/cards as possible. This is especially true for Sift Science's customers, who tend to be small to mid-size companies; big companies for whom fraud detection is critical will tend to have their own in-house solution.
In addition this is usually only one signal -- ideally you want your algorithm to be able to detect first-time fraudsters too, so the other signals should be able to stand on their own.
One caveat though: the reason multiple accounts is a signal of fraud is that fraudsters tend to be repeat offenders, and will keep defrauding the same website if their previous attempts worked. But now that they're facing fraud detection algorithms that detect repeat offenders more easily, it's highly possible they will adapt their behavior.
This is a signal that will fade in strength over time, and one of the dangers of pooling together data from multiple websites as in this blog post (but hopefully this is taken into account in their algorithms) is that the strength of the signal may be skewed by the proportion of new users of their platform (who will have a higher proportion of unsophisticated fraudsters, by nature of their not having had a fraud detection system previously).
This is why whenever you are building a fraud detection algorithm (or any machine learning algorithm that's consumer facing) understanding the story behind the data is very important, and not just looking at the numbers.
I'm trying to log and look for varied signals, and have a few interesting ones that pick up the lazy and not-so-lazy fraudsters.
I'm going to be extra careful to ensure that we keep "understanding the story behind the data."
(that one has the added benefit of feeling obvious in hindsight, and so once again, incredibly valuable)
You can monitor characteristics the browser reports, including user agent strings.
You can use techniques like differences in how canvas drawings render to images to fingerprint browsers. In fact, I'd bet good money this is a great signal: what you're trying to do is not fingerprint, but detect when the reported user-agent has been overridden. Few people override user-agents.
Then you can go on to ways to bury identifiers in browsers. For example, etags on cached objects may be ok if you aren't using it for advertising and clear it in your privacy guidelines.
You can also fingerprint with time deltas, though this may be patented. Briefly: computers synchronize to milliseconds, I think. If you are careful, you can probably detect sub-millisecond clock skew between a client and your server. This should not be constant across devices.
etc etc etc
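The signals listed above (user agent, canvas rendering, ETags, clock skew, and so on) are typically boiled down into one stable identifier by hashing them together. A minimal server-side sketch, assuming the attribute collection has already happened client-side (all attribute names here are made up):

```python
import hashlib

def device_fingerprint(attributes):
    """Hash a dict of collected browser attributes into a short stable ID.

    `attributes` might hold the user-agent string, a canvas-render hash,
    timezone offset, font list, etc. -- whatever was gathered client-side
    and POSTed to the server.
    """
    # Sort keys so the same attribute set always hashes identically,
    # regardless of dict insertion order.
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

fp = device_fingerprint({
    "ua": "Mozilla/5.0 ...",
    "canvas": "a3f9c2",
    "tz": "-300",
})
```

Storing this alongside each account registration is how "multiple accounts created from the same device" (discussed elsewhere in this thread) gets detected: two registrations with the same fingerprint are suspicious even if the emails differ.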
In the past we took down botnets this way - most low to mid-grade fraudsters had a limited # of IP addresses (probably multiple PCs or such in a cafe or call centre environment) so it was fairly easy to look at all accounts that had been created from that block of IP addresses (or created elsewhere but had repeated logins from these IPs) and then sanity check by looking at quality of the accounts to see whether non-fraud had happened. I suppose in the case of S.S. their data is quite robust across multiple sites.
The number of NATs there are makes that sort of correlation... difficult.
For that matter, there are people that use UA "spoofing" for non-nefarious purposes. Me, for one.
Eventually I got rid of wordpress/php and just use nginx to serve static files so I felt secure enough to drop the firewall rules.
>> This IP address [220.127.116.11] is registered to Qtel. It is the IP address for many people in Qatar, if not the entire country.
While I'm sure there are valid reasons for ua spoofing, I'd bet it's a great signal for fraud.
I could be completely misreading this, but it does seem like a lot of the value is not just in each customer's data in isolation, but comparing it against the growing volume of shared data.
You could notify ISPs that they're hosting fraudulent traffic, and if they continue to host it ...
Now throw a language barrier on top, and it's even more difficult.
Hell, getting accurate abuse contact information is a project all by itself.
Maybe a third-party should seize hotmail and outlook.com in order to clean it up for them...
1. Put something that doesn't belong there in the cloud.
2. Make undue generalizations about its applicability to third party businesses of which you have limited understanding.
3. Fake growth by dubious means, such as ramping up 'customers' (even if none of them actually use your service on an ongoing basis), hiring extensively, and waylaying all business processes to cater toward visible progress at investment rounds.
4. Spend almost as much on marketing as development.
Are they using local time? Or is there a chance that they are not accounting for the fact that most of the fraudsters are foreign and in a different time zone?
> Some of the most fraudulent email domains are operated by Microsoft. Why could this be? Two possible reasons are that 1) Microsoft has been around for a lot longer and 2) email addresses were easier to create back in the day. Today, websites use challenge responses such as image verification or two-factor authentication to verify your legitimate identity.
But outlook.com is the most recent Microsoft web mail domain. Why is it already much more used than other Microsoft web mail domains (hotmail, live, etc.) ?
Fraud will be highly correlated with freshly created disposable email addresses; it would be rather unlikely for fraudsters to use a thousand accounts that have been active since 1999.
The webmail domains shown, and the numbers in account names, are simply correlated with more recent accounts.
He makes a valid point: why did @outlook.com addresses suddenly become used for scamming?
- @outlook.com is a relatively new email domain (< 2 years)
- Most people buying online are over age 18
- Most people do not change their email address
- Most people over the age of 18 have had their email address longer than two years
So, by that logic, if someone has a @outlook.com email address, there are a few possibilities:
- They had an old email address, but switched/forwarded it to @outlook.com sometime in the last 2 years (unlikely - generally people don't suddenly change their email)
- They made an @outlook.com address for ecommerce signups (unlikely - why not use your current provider e.g. Gmail?)
- This is their first email account (unlikely)
- They registered it to commit fraud (hmmmm)
Obviously this is all speculation and there are exceptions to all those assumptions, but it seems logical that the last option is more likely than the others, especially when weighted by the fact that fraudsters almost always create more than one email account.
I myself use an email address with 2 digits, and many of my friends use 4 digits or so. I personally don't think having more digits in your email address makes you more likely to be fraudulent.
False positives will always happen; no matter how many signals you throw into the mix, there will always be exceptions. Even so, the difference between running with a system like this and being wide open is like night and day.
For example, just because group X usually doesn't eat lunch doesn't mean that not eating lunch is a good trait to detect them in the general population.
Also, 6% of outlook.com is used for fraud? This is a huge percentage.
How does this company detect multiple accounts on the device?
Also it seems that Microsoft gave up on verifying whether your message is spam or not. I had government emails (USPS, for example) as well as emails from my gmail and yahoo friends landing straight in junk.
And what is the exact meaning of "fraudulent user" here?
Then at the end you throw the message away if the sum total of the points passes some threshold "X"
Counting digits in an envelope sender would just be one more metric.
I do some blacklisting, but 99% of the time blacklisting on one fuzzy metric is seen as extremely bad practice in mail.
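The points-and-threshold scheme described above (assign each metric a weight, reject only when the combined score passes "X") is simple to sketch; this is what keeps any single fuzzy metric, like digit-counting in the sender address, from acting as a blacklist on its own. Rule names and weights below are invented:

```python
def spam_score(message, rules):
    """Sum the weights of all rules that match the message.

    A message is rejected only when the combined score passes the
    threshold, so no single fuzzy metric can blacklist by itself.
    """
    return sum(weight for test, weight in rules if test(message))

rules = [
    # Many digits in the envelope sender -- just one metric among many.
    (lambda m: sum(c.isdigit() for c in m["sender"]) >= 4, 1.5),
    (lambda m: "wire transfer" in m["body"].lower(),       2.0),
    (lambda m: m["sender"].endswith("@outlook.com"),       0.5),
]

THRESHOLD = 3.0
msg = {"sender": "jsmith1234@outlook.com", "body": "Urgent wire transfer needed"}
score = spam_score(msg, rules)
is_spam = score >= THRESHOLD
```

In this toy example every rule fires (score 4.0), so the message is rejected; a sender with many digits but an otherwise clean message would only score 1.5 and pass.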
I really hate oversimplifications in these serious matters.
It happened to me: my bank was using a similar silly algorithm that consistently blocked my credit card during my world travels every time I arrived in a new country/airport, even if I told them about it in advance. A way to lose a customer for life, for sure, especially when their emergency line operates only during working days between 9am-6pm in Germany...
VPN traffic would also be an interesting metric.
Feel free to correct me if my memory is wrong because it very well could be.
The solution I considered was paying people in Africa to sign up for gmail for me, and I'd pay them per account. I figured I'd only need 50-100 per month, so the low volume might make it possible. They often have smartphones, and amounts that are too low for you to bother with might be a decent payday for them for 5 minutes work.
Now, I know what you're going to say... Youtube detects copyrighted works, won't let you upload them. That part was easy.
Just invert the video color, and flip it upside down. Then the lua script for VLC would de-invert and unflip it. And I could even bring in the audio from another site (VLC allows muxing), since Youtube uses audio signatures more than they do video signatures for that stuff.
I had a prototype going for a while. Called it "Space Potato Channel". It just played videos others had uploaded (wrote a little backend to schedule movies). If you tuned in 5 minutes late, it'd show the video 5 minutes in, etc. Then I learned about how the NSA was giving tips to law enforcement and doing the parallel reconstruction thing, and I reconsidered my scheme to become a bitcoin millionaire.
Long story short, gmail accounts were never something I thought would be much of a problem.
Hang out on any blackhat SEO forum (or more illegal carding shops, etc. I would imagine) and you'll see plenty of guys peddling this service.
Incidentally, the youtube method you're describing has been automated many times. My first real PHP project was a script that found popular videos on non-youtube sites, downloaded them, watermarked them with my blog URL, and uploaded them to youtube. That resulted in a fair amount of direct traffic.
If you trawl around youtube these days you'll see plenty of watermarked videos that are clearly not original content. But as long as nobody is claiming copyright -- which nobody is doing for cat videos -- Google doesn't give a shit. Honestly, uploading non-original videos to Youtube only helps their numbers.
I think a common misconception is that companies care about fake/"spam" user accounts on their services. But what incentive do they actually have to ban them? In the world of venture capital, user numbers are an incredibly important metric, so as long as they aren't actively diluting the service for other users, companies have an incentive to allow them to propagate and pad their stats.
Take Snapchat for example. Looking at my friend request page, I have dozens of obviously spam accounts asking to be my friends. Is Snapchat including these accounts in their user numbers? Almost definitely. In fact, they probably even count as "active users" because they are "sharing photos" so often!
One has to wonder how many popular services have been built on VC money given to them on the presumption of accurate user statistics, when in reality 20-30% of accounts could be shills. Snapchat, Twitter, Facebook... There are tons of fake users on all of them, and yet these companies make relatively little effort to exclude them from stats (except, of course, when reporting monetization per user).
Using some privacy settings and VPNs will get you more Captchas on Google services also.
I spent several years working on the Gmail abuse team. Gmail is used less for fraud than other providers because we were better at fighting abuse than our competitors: as simple as that. Yahoo had a rather hollowed out abuse team for a long time, from what I understand, they didn't invest in it at all. And I think at Microsoft the Hotmail and Passport (i.e. login system) teams were much more compartmentalised than we were inside Google. At least this is what I heard on the grapevine, though I have no clue if it's accurate.
Google does many, many things to combat abuse of Gmail accounts. There's no silver bullet, it's not as simple as "Google phone verifies every account" (it does not and never has), or "if you send more than X messages you get Y". The abuse system is a massively complex pile of interlocking systems, analyses and heuristics.
You can get a good readout of how various teams at the different companies do here:
As you can see, Outlook.com accounts currently sell for $10 per thousand. Gmail accounts are about $100 per thousand, an order of magnitude more expensive. Getting higher than that is very difficult against good opponents (and the guy who runs buyaccs.com is good, although these days he acts more as a reseller than an account creator himself). The reason is, at these prices it's feasible to simply phone verify every single account by hand using cheap SIM cards. Google does terminate accounts that have phone verified - it's just one more signal - but it's one of the best ones and so it becomes significantly more dangerous when spammers are phone verifying in bulk. In practice it's not a big deal because $100 per thousand is high enough that many business models (like simple spamming) become unprofitable.
I wonder if that's intentional. Was one of the seven habits "users who block trackers"?
Now this post feels like it's encouraging me too.
You and I have very different definitions of the word "malicious".
You might as well have just said, "I was thinking of committing light larceny, nothing malicious, and this article makes me want to do it again."
It isn't, though.
But illegal? Not unless I'm actually stealing data, making a profit, or accessing areas I shouldn't be.
You surely don't believe folks are granted access to all those computers, right?
You are, however, comparing apples and oranges. Building a botnet vs. sending out a press release are very different things. Please look into Black Hat vs. White Hat SEO.
Doing something different with your own property is OK.
Manipulating other people's property through directly bypassing the countermeasures they have in place isn't OK.
None of it is illegal.