Seven habits of highly fraudulent users (siftscience.com)
390 points by necubi on July 31, 2014 | 139 comments

The night owl thing is misinterpreting the data. My guess is that the more likely scenario is that at 3am there are simply fewer total transactions, while fraudulent transactions stay at more or less the same level. Looking at the total volume of fraudulent transactions vs. the hour of the day would be more helpful.

Another prediction: "fraudsters work on weekends" would yield the same graph if displayed as a percentage of total transactions.

Edit: http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statisti...

It's also partly due to them tending to be more international, meaning they are more likely to live in a different timezone.

The "outside of business hours" explanation is still a valid explanation but definitely not the whole story. Overall, I'd caution that when building fraud detection models understanding the stories behind the data is extremely important, or you risk having an algorithm that works for the wrong reasons.

For example, if your user base suddenly becomes more international (say, you start allowing non-US users onto your website), you'll see a lot of false positives if you're not careful, because to your system it'll look like those users are "operating outside of business hours".

It's also only a good signal as long as you aren't acting on it.

As soon as it becomes advantageous enough to fit in with the normal user patterns, the attackers will modify their behavior accordingly.

The time-of-day, internationality, and many other signals mentioned in this post are easily evaded when it becomes profitable to do so.

The graphs say local time; I interpreted that to mean ip-address local time.

It may just be that they use US proxies. I did when I was overseas.

The "night owl thing" isn't misinterpretation. It is true that there are fewer total transactions at night, but the point and observation is that the fraud "rate" is higher at night.

Another way to put it: fraudsters are more likely to be night-owls than the rest of us.

Possibly, but you could also decide that there simply isn't enough signal to draw a conclusion. We don't know if time zones are accounted for properly. We don't know if this fraction represents a significant number of users - for example, the number of fraudulent users could be the least at 3AM, but if the decrease in users overall is greater, then the percentage increases. The data is misrepresented.

Hi there, I worked on retrieving and distilling this dataset at Sift Science. To address your concerns: Yes, time zones are accounted for properly. And yes, the number of data points is significant.

As I said earlier, you can see in the user counts that fraudulent users are indeed lowest at night (which makes sense, since fraudsters sleep as well). However, we are looking at the "fraud rate" (#fraud / all users in a given hour). I'm not sure what you mean by "misrepresent" the data; the data speaks for itself.
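The metric being discussed — fraud rate per local hour rather than raw fraud counts — can be sketched in a few lines of Python. The event data here is made up purely for illustration:

```python
# Sketch of the "fraud rate by local hour" metric described above.
# The (local_hour, is_fraud) pairs are invented illustrative data.
from collections import Counter

events = [(3, True), (3, False), (3, True),
          (14, False), (14, False), (14, False), (14, True)]

total = Counter(hour for hour, _ in events)
fraud = Counter(hour for hour, is_fraud in events if is_fraud)

# fraud rate = #fraud / all users in a given hour
rate = {hour: fraud[hour] / total[hour] for hour in total}
print(rate)  # 3 AM has fewer events overall but a higher fraud rate
```

This is exactly why a raw-count plot and a rate plot can tell opposite stories: the 3 AM bucket is the smallest in absolute terms but the largest as a fraction.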

Or they simply live on the other side of the world

If you read the labels for the graph, it says "local time" of the user / fraudster. If we didn't take into account the local time of the user, the data would not be very interesting to look at (mostly uniform) since we have users from all over the globe.

It could be both misinterpreting the data and not misinterpreting the data though. Let me explain...

The argument for misinterpreting the data is easy to make once you account for time zones. If (a) I'm in the US and (b) I notice that most of my fraudulent transactions occur around 3AM, then it follows that (c) most of my fraud occurs at night. However, if most of my fraud actually originates from India/Singapore/etc., then it's simply a matter of time zones. Does most of my fraud occur at night? It does if I'm in the US, but for the fraudster in India it occurs during the day. It's 3:45AM where I am in the US (CDT) but 2:15PM in India...

However, if I was to write a training manual for my employees who worked in the same building as me (US CDT), then I'm not misinterpreting the data. The data is the "truth" here - if it shows that most fraudulent transactions occur around 3AM local time, then that's the truth and I'd be foolish to over-complicate the issue when trying to train new hires on how to spot fraud.

Probably politically incorrect and certainly not a valid sample but in my experience, most fraudulent callers at least have a strong Indian accent and try some sort of Microsoft support scam:


The time difference would explain the prevalence of night owls and other factors often used in fraud detection. On my direct office number alone, I receive 5-10 such calls a week.

If I were to try to understand these data, I'd like to know (a) How are "fraudulent transactions" determined? (b) Is the mechanism perfect or just very good? (c) What fraction of fraudulent transactions does it capture? All? Or just most? (d) And I'd like the detailed data series.

Statistically speaking, I'm suspicious regarding the conclusion about lunch hours. In order to identify that feature robustly, it's necessary to very accurately know the variability of the background signal. Does it always dip at lunch hours and all at the same time? Does it always have the same shape? Recall what's being done here is separating the background "non-fraud transactions" from "fraud transactions". If the latter are much smaller in number than the former, and the determination of fraud has a finite miss rate or a finite false alarm rate, then all the implications of Bayes Rule apply, and the fraud signals being seen could be just some transfer function of the fraud mechanism applied to the background signal.
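The Bayes Rule concern above is worth making concrete. With hypothetical but plausible numbers — 1% fraud prevalence, a detector with a 90% hit rate and a 5% false-alarm rate — most flagged transactions are still legitimate:

```python
# Worked base-rate example for the point above. All three numbers are
# hypothetical; nothing here comes from the article's dataset.
p_fraud = 0.01              # prevalence: 1% of transactions are fraudulent
p_flag_given_fraud = 0.90   # detector hit rate
p_flag_given_legit = 0.05   # detector false-alarm rate

# Total flag probability, then Bayes Rule for P(fraud | flagged)
p_flag = p_fraud * p_flag_given_fraud + (1 - p_fraud) * p_flag_given_legit
p_fraud_given_flag = p_fraud * p_flag_given_fraud / p_flag
print(round(p_fraud_given_flag, 3))  # ~0.154: most flags are false alarms
```

With rare fraud and an imperfect detector, the "fraud" population you study is dominated by mislabeled background, which is exactly why features of that population can end up being a transfer function of the detector applied to normal traffic.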

Moreover, do fraudulent signals increase in proportion to total background? Or are they a constant amount?

I think more work needs to go into deconvolving the two kinds of signals before assigning meaning to their parts and features of their parts.

No doubt the story assigned to these is plausible and compelling, but it is based upon unproven inferences. And, in my opinion, assuming that identifying and acting on these signals is simpler than it looks is getting way ahead of the art.

this is crap. this describes almost every geek out there, and Chinese/RTL users who use numbers when they can't type in their own script. Looks like more self-proclaimed snake-oil-salesman-turned-security-guru bullshit.

honestly... frack the internet, you're giving it too much weight

Yes - if you look at the first two graphs, you can more or less infer the third. The level of fraudulent orders and legitimate orders both decline at night, but the decline is not as marked in the fraudulent orders.

I love these. Building fraud detection systems is so funny because the signals themselves end up usually being really simple, it's just a matter of:

a) Having the tools to look at all your data to see where the patterns are

b) Having the tools to track instances of patterns once you've identified them.

But as I said, the patterns themselves are usually pretty simple, as you see in this article.

This one is a bit specific to Justworks since we use bank account numbers instead of credit card numbers, but one signal we've picked up is that the bank an account comes from is a great signal. If a company's bank accounts come exclusively from certain banks, it is almost certainly fraudulent. We've even seen people try to sign up with consecutive bank account numbers from the same bank!

This goes for other industries as well.

For example every insurance claim from any insurance company is run through fraud detection software. Turns out there are some characteristics of a fraudulent insurance claims that have been identified over the years. The software can flag a claim as potentially fraudulent for further human review.

Discover Card's fraud detection worked extremely well in my case. My CC number was stolen and after only two fraudulent transactions (total about $500) it tripped the software to freeze the card. I have no idea how they identified those as fraud so quickly, they didn't seem out of place to my eyes considering my transaction history.

Sometimes the pattern only exists in data writ large.

For example, a particular merchant is breached, and then the fraudsters try to run most of the cards stolen in that breach at a few locations.

It could be obvious on your bank's end that a few dozen customers made a charge at Store A and then reported a charge at Store B to be fraudulent, so then they decided to freeze the cards of everyone with the same Store A -> Store B pattern.
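The Store A -> Store B pattern described above is sometimes called finding a "common point of purchase": look for the merchant that nearly every compromised card visited before its disputed charge. A toy sketch, with invented card histories and merchant names:

```python
# Toy common-point-of-purchase sketch for the pattern described above.
# Card histories and merchant names are entirely invented.
from collections import Counter

# Per-card transaction history, oldest first; "FRAUD@..." marks a dispute.
histories = {
    "card1": ["StoreA", "Cafe", "FRAUD@StoreB"],
    "card2": ["Gas", "StoreA", "FRAUD@StoreB"],
    "card3": ["StoreA", "FRAUD@StoreB"],
    "card4": ["Cafe", "Gas"],  # no fraud reported on this card
}

common = Counter()
for history in histories.values():
    fraud_idx = next(
        (i for i, m in enumerate(history) if m.startswith("FRAUD@")), None)
    if fraud_idx is not None:
        # count merchants visited before the dispute, once per card
        common.update(set(history[:fraud_idx]))

suspect, count = common.most_common(1)[0]
print(suspect, count)  # StoreA appears in every compromised card's history
```

No single cardholder could spot this; it only falls out of the bank's aggregate view, which is the commenter's point about data writ large.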

Great point.

I've heard that good signals are things like small gas purchases at a gas station you haven't been to before; they're used to test card validity at a pay-station with no human present.

This has been such a common signature for fraud over the years that I'd be surprised if many fraudsters are still doing this.

If they were smart they'd buy a chocolate bar from a grocery store that has self-checkout, then buy something a little more expensive from Target or whatever, then buy the very expensive thing they really want.

What's the purpose of the target purchase?

I think because it masks the "tiny purchase -> giant purchase" signal that is used to detect fraud?

Ah, thanks.

a guess: probing the available balance. Presumably you only get a couple of shots at the Apple Store to buy that laptop, because a single person isn't going to be able to try 5 cards with different names on them in a row without the police being called.

Quite possibly, they saw the same item purchased 5 times within minutes, all off of other Discover cards. You'd only see the transaction that used your number.

> For example every insurance claim from any insurance company is run through fraud detection software. Turns out there are some characteristics of a fraudulent insurance claims that have been identified over the years. The software can flag a claim as potentially fraudulent for further human review.

I worked for an insurance company for over 5 years, processing claims. This is the first I have ever heard of this. Which does not prove it untrue, but it does make me skeptical.

We had some claims automatically paid via computer but most claims were reviewed by human eyes. Humans were trained to look for indicators of fraud. There was no one thing that would get it sent to the fraud department for investigation. We were looking for a pattern or multiple indicators.

>We had some claims automatically paid via computer but most claims were reviewed by human eyes. Humans were trained to look for indicators of fraud. There was no one thing that would get it sent to the fraud department for investigation. We were looking for a pattern or multiple indicators.

Oh, I'm never saying that claims get paid automatically or that humans don't look for fraud too.

I'm saying that there is also software on the back end that flags things to go to the fraud department automatically. Probably without your knowledge. I just remember reading an article about it a while ago - just about how good it has gotten over the years and it has become an important part of the insurance industry.

Just another example of extensive data mining.

When I say "every insurance claim from any insurance company" I mean that I assume any insurance company worth anything would be running such software.

Found some products:



AllState implies they do.


>Through the use of innovative technologies and network analysis tools, we are able to help detect and stop these crimes

Ok. But I think you are still misinformed --> For example every insurance claim from any insurance company is run through fraud detection software.

From digging around, this seems kind of a new trend. I seriously doubt that "every claim at any insurance company" is being analyzed this way. And, sure, it's possible it was in use at the company where I worked and I simply did not know. But from reading some of what you linked to, I doubt it. (Though perhaps that has changed since I left.)

From one of the sites you linked to:

"Industry research indicates 10 percent of all claims contain an element of fraud," says Wolfe, and as of the end of 2010, CNA was seeing just 3.7 percent of its claims referred as potential fraud. "That’s considerably below the industry average, and we wanted to find out how much we were missing that wasn’t identified by our adjustors."

So, yeah, that's relatively recent and the searches I did sort of imply that this is an emerging market, they are still trying to convince insurance companies to do this, etc. Insurance is a very conservative industry. They tend to be somewhat slow to adopt new technologies. They are regulated by both federal laws that cover financial companies and also federal laws that cover medical companies (like HIPAA). So they have a huge regulatory burden and this makes change especially hard. Any new tech or new processes have to really be put through their paces to see if they still pass muster on multiple fronts. The de facto result is that insurance companies are kind of sticks in the mud.

Where I worked, a lot of the software was homegrown. It had been developed in-house. They didn't like hiring outside vendors/buying outside solutions. From what I gather, they were fairly cutting edge -- though, of course, it is possible that was just company hype. Since I worked there, I sometimes knew that some statements amounted to spin-doctoring. But I did only have an entry level job and was not really expecting it to turn into a career, so I am sure there is plenty that I missed.

Thanks for replying.

Oh I was probably somewhat misinformed because the article I read a while ago was much more hyped up than I realized at the time. My apologies. I got the wrong impression.

I just assumed it had already become "industry standard." Perhaps it was "becoming industry standard" instead. I think the article was specifically about car (not health) insurance too but I can't remember details. I went to go find it again before posting but of course I couldn't.

I also somewhat suspect (but I have no proof of course) that there is at least a little bit of data sharing to third parties that insurance companies can/do use to get more of the "big picture." Probably more car insurance than health insurance. I know such a thing as a "driving record" exists probably supplied by a company such as LexisNexis. My car insurance provided me a copy of my driving record.

So, for example, Person A has a pattern of behavior across different insurance companies that can be flagged, while looking at just one insurance company's data might not reveal the pattern.

This is all wild speculation, more of a thought experiment on my part. Laws such as the FCRA are starting to address SOME of these issues but I'm not sure they are on top of all practices.

This type of tracking has become more common in the age of "big data."


That's interesting though that they didn't like hiring outside vendors. I would think they would want outside vendors because their products were already vetted for regulation compliance.

I don't recall who, but there is a (government) agency that gets reports. I don't recall exactly what gets reported but, yes, there is a mechanism in place for trying to identify fraudulent activity by one person/group across multiple insurance companies.

That was not my area and I no longer work in insurance. I don't recall the details. But, yes, that is a thing.

Do you tend to go shopping for a given product at the same kind of times/days? E.g. for me clothes shopping is on Saturday or Sunday.

Do you tend to go to certain physical locations or use a limited number of Web sites to buy from?

Just guessing.

Nope! My spending habits are kinda random and I travel a lot.

Disclaimer: I work here. Thanks @wdewind! We do too. We had a field day playing with the data. We actually have thousands of features and these are just some of the most simple signals. We're really careful not to disclose some of our more complex (and powerful) ones - but I can say that we statistically model patterns such as position of digits in email address, page visit sequence, etc.

Fraudsters tend to make multiple accounts on their laptop or phone to commit fraud.

Any idea how this can be tracked? Normal cookies, or something more in-depth?

A site I'm working on tracks page visits independently of logged-in user sessions.

I'm wondering if it's worth considering explicitly looking for multiple logged-in users sharing the same page visit session.

There are various tricks - e.g. Flash cookies, Etags etc - that make me feel a bit uneasy, despite how much I like the idea of tracking multiple accounts per device.

This is generally known as "device fingerprinting"--there are many ways to do it but they all involve probing for unique properties of a client via JS / Flash (listing installed fonts, drawing invisible characters and measuring via JS, etc.), then hashing them together to generate a unique ID for that user.
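The server-side half of that pipeline — combining the collected properties into a stable ID — is straightforward to sketch. The attribute names below are illustrative placeholders, not a real browser API:

```python
# Minimal sketch of the hashing step in device fingerprinting as
# described above. The attribute names are invented placeholders;
# in practice they would come from JS/Flash probing on the client.
import hashlib
import json

def device_fingerprint(attrs: dict) -> str:
    # Serialize deterministically, then hash into a compact ID.
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

fp = device_fingerprint({
    "user_agent": "Mozilla/5.0 ...",
    "screen": "1920x1080",
    "timezone_offset": -300,
    "fonts": ["Arial", "Helvetica"],
})
print(fp)  # the same attribute set always yields the same ID
```

The privacy objection follows directly from this design: the ID survives cookie clearing and login changes, because it is derived from the device rather than from anything the user can reset.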

Some people think this practice violates users' privacy, and I'm one of them. This technology can be used to uniquely identify a user across multiple logins on the same site, or even multiple sites. It's quite widespread.

This paper[0] is mostly a survey of prominent DF providers and sites using this technology, and it's also a good primer on device fingerprinting techniques.

[0] http://www.cosic.esat.kuleuven.be/publications/article-2334....

This is generally known as "device fingerprinting"

Is this definitely how it's achieved though?

I would presume a highly-skilled fraudster could just spin up a new VM, for instance, and evade detection that way.

Do we know if "regular" cookies alone are good enough for 90% of the lazy fraudsters?

Regarding using "device fingerprinting," can I collect some opinions from HN?

Specifically, if every user record created stores a fingerprint alongside it (which is only used to find account registrations from the same device) is that just as offensive as using fingerprinting to track anonymous sessions?

> I would presume a highly-skilled fraudster could just spin up a new VM, for instance, and evade detection that way.

From my experience building fraud detection systems at Eventbrite, most fraudsters are not that sophisticated -- fraudsters usually go for the lowest-hanging fruit and as such are looking for systems to defraud that have the highest payout for the lowest effort. Because there is always some level of uncertainty (getting detected, the credit card not working, etc.), fraudsters often favor techniques that allow them to try as many websites/cards as possible. This is especially true for Sift Science's customers, who tend to be small to mid-size companies; big companies for whom fraud detection is critical will tend to have their own in-house solution.

In addition this is usually only one signal -- ideally you want your algorithm to be able to detect first-time fraudsters too, so the other signals should be able to stand on their own.

One caveat though: the reason why multiple accounts is a signal of fraud is that fraudsters tend to be repeat offenders, and will keep defrauding the same website if their previous attempts worked. But now that they're facing fraud detection algorithms that detect repeat offenders more easily, it's quite possible they will adapt their behavior.

This is a signal whose strength will fade over time, and one of the dangers of pooling together data from multiple websites, as in this blog post (though hopefully their algorithms take this into account), is that the strength of the signal may be skewed by the proportion of new customers on their platform (who, not having had a fraud detection system before, will see a higher proportion of unsophisticated fraudsters).

This is why whenever you are building a fraud detection algorithm (or any machine learning algorithm that's consumer facing) understanding the story behind the data is very important, and not just looking at the numbers.

I can't thank you enough, this is some incredibly valuable advice that you've given me and everyone else on HN.

I'm trying to log and look for varied signals, and have a few interesting ones that pick up the lazy and not-so-lazy fraudsters.

I'm going to be extra careful to ensure that we keep "understanding the story behind the data."

(that one has the added benefit of feeling obvious in hindsight, and so once again, incredibly valuable)

Thanks again!

Consider that if you collect device fingerprints, you can detect users who live together, because they are likely to share devices. If somebody bad then gets access to that data, they could do creepy things to your users.

A relatively privacy-friendly way is to compare accounts to IP addresses. You obviously expect some NATing to be going on, but it should be relatively constant over time. That is, if across your entire network you've seen, say, 10 accounts from a given IP and all of a sudden 5 of those accounts interact with a given shop, it's probably fraudulent.
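That heuristic can be sketched directly. The counts, IP addresses, and the 0.3 threshold below are all invented for illustration:

```python
# Sketch of the accounts-per-IP heuristic described above: the set of
# accounts behind an IP should stay roughly constant over time, so a
# sudden cluster of them converging on one shop is suspicious.
# All data and the threshold are invented for illustration.

accounts_seen_per_ip = {"203.0.113.7": 10}    # long-run history
accounts_at_shop_per_ip = {"203.0.113.7": 5}  # this shop, recently

def suspicious(ip: str, min_fraction: float = 0.3) -> bool:
    total = accounts_seen_per_ip.get(ip, 0)
    recent = accounts_at_shop_per_ip.get(ip, 0)
    # Many distinct known accounts from one IP hitting one shop at once.
    return total >= 5 and recent / max(total, 1) >= min_fraction

print(suspicious("203.0.113.7"))  # True: 5 of 10 known accounts hit one shop
```

Using a fraction rather than an absolute count is what keeps ordinary NAT traffic (a stable office or cafe IP) from tripping the rule.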

You can monitor characteristics the browser reports, including user agent strings.

You can use techniques like differences in rendering of canvas drawings to images to fingerprint browsers. In fact, I'd bet good money this is a great signal: what you're trying to do is not fingerprint, but detect when the reported user-agent has been overridden. Few people override user-agents.

Then you can go on to ways to bury identifiers in browsers. For example, ETags on cached objects may be OK if you aren't using them for advertising and you make that clear in your privacy guidelines.

You can also fingerprint with time deltas, though this may be patented. Briefly: computers synchronize to milliseconds, I think. If you are careful, you can probably detect sub-millisecond clock skew between a client and your server. This should not be constant across devices.

etc etc etc

One quick and relatively easy win is logging IP addresses of historical logins or attempts. It's quite easy then to look for patterns that are skewed in a strange way. It's hard to automate but extremely valuable for human intervention. In the case of ecommerce it's a bit easier since you can flag "good" accounts potentially and use that to weigh the analysis. In the case of simple account creation it's hard since you can't really ever call an account "good" necessarily except perhaps with a long usage history.

In the past we took down botnets this way - most low to mid-grade fraudsters had a limited # of IP addresses (probably multiple PCs or such in a cafe or call centre environment) so it was fairly easy to look at all accounts that had been created from that block of IP addresses (or created elsewhere but had repeated logins from these IPs) and then sanity check by looking at quality of the accounts to see whether non-fraud had happened. I suppose in the case of S.S. their data is quite robust across multiple sites.

...And then you start flagging businesses, ISPs, even entire countries as fraudulent.

The number of NATs there are make that sort of correlation... difficult.

For that matter, there are people that use UA "spoofing" for non-nefarious purposes. Me, for one.

Speaking as someone who's spent a lot of time in China... flagging entire countries as fraudulent isn't the end of some ridiculous slippery slope -- it's a thriving practice today. I have to hide my IP address to interact with the internet in almost any interesting way.

yeah, for a while I blackholed china and india with ipfw. Hacking attempts against my server fell like a rock. I felt a tiny bit bad, but if these countries / isps can't police their subscribers, what do they expect.

Eventually I got rid of wordpress/php and just use nginx to serve static files so I felt secure enough to drop the firewall rules.


>> This IP address [] is registered to Qtel. It is the IP address for many people in Qatar, if not the entire country.

Vietnam is pretty bad too. I spent a month in Vietnam and I was blocked from most forums. I had to use a USA-based vpn.

um, no you don't, and yes i've done something similar before

While I'm sure there are valid reasons for ua spoofing, I'd bet it's a great signal for fraud.

You might find this paper about fingerprinting useful https://www.cosic.esat.kuleuven.be/fpdetective/

You can track it in-browser with cookies. You can track it in-browser and have it remain when the user clears cookies with offline storage. You can track it in-browser and cross browser on the same PC and have it remain when the user clears cookies with flash cookies. When a user logs out, you can store the account fingerprint used in all 3 of those places and maintain it as another account logs in. Combine them all and it's pretty effective.

Even more techniques here: http://samy.pl/evercookie/

It sounds like users of this service are essentially sharing data with each other via siftscience (e.g. bad credit cards, shipping addresses, etc). I considered exactly this business model years ago, but considered it a complete non-starter due to privacy and proprietary information reasons.

I could be completely misreading this, but it does seem like a lot of the value is not just in each customer's data in isolation, but comparing it against the growing volume of shared data.

User-Agent strings are pretty good ways of tracking devices, I believe.

Not really. You can spoof user agents...

You can spoof anything.

You can't really spoof IP address if you want any kind of two-way interaction

Oh yeah you can. It's not simple, but you very much can.

Not in the real world.

This was a great read. You know what fraudsters search for? Vulnerable PHP web sites and sites with anonymous logins. It's kind of amazing. I would imagine that if you could take all of these signals and geo-track them back to the originating IPs, you could illuminate fraud-tolerant ISPs[1], fraud schemes, and targets. Sure, it's a big data problem, but it seems eminently tractable if local vendors provide fraud detection data to a central source.

[1] You could notify ISPs that they're hosting fraudulent traffic, and if they continue to host it ...

Getting consumer ISPs to respond to obvious abuse cases (DDOS attacks) is very difficult. Getting them to respond to fraud (which they'd have to investigate with more than looking at a bandwidth graph) seems impossible.

Now throw a language barrier on top, and it's even more difficult.

Hell, getting accurate abuse contact information is a project all by itself.

So Microsoft domains account for the highest amount of fraud?

Maybe a third-party should seize hotmail and outlook.com in order to clean it up for them...

Those domains are more popular in countries where fraud is higher.

The OP was referring to how Microsoft seized domains from another company under the guise of security.

The Four Habits of Highly Silicon-Valley Startups.

1. Put something that doesn't belong there in the cloud.

2. Make undue generalizations about its applicability to third party businesses of which you have limited understanding.

3. Fake growth by dubious means, such as ramping up 'customers' (even if none of them actually use your service on an ongoing basis), hiring extensively, and waylaying all business processes to cater toward visible progress at investment rounds.

4. Spend almost as much on marketing as development.

>Habit #2: Fraudsters Are Night Owls

Are they using local time? Or is there a chance that they are not accounting for the fact that most of the fraudsters are foreign and in a different time zone?

I believe they mean the Sift Science customer's time, not the hacker's time. They are proposing that fraudsters tend to hit in the middle of the night based on where the target is hosted, not where they are.

If the targets were US-based, this could fit with the "fraudsters are international" finding.

All times are local to the user (as inferred by IP geolocation).

Or maybe fraudsters route their traffic through international tunnels to make law enforcement activity more difficult?

> outlook

> Some of the most fraudulent email domains are operated by Microsoft. Why could this be? Two possible reasons are that 1) Microsoft has been around for a lot longer and 2) email addresses were easier to create back in the day. Today, websites use challenge responses such as image verification or two-factor authentication to verify your legitimate identity.

But outlook.com is the most recent Microsoft web mail domain. Why is it already much more used than other Microsoft web mail domains (hotmail, live, etc.) ?

The email properties they describe are simply a proxy for the age of account.

Fraud will be highly correlated with freshly created disposable email addresses, it would be rather unlikely that fraudsters would use a thousand accounts that have been active since 1999.

The webmail domains shown, and the numbers in account names, are simply correlated with more recently created accounts.
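Both of those observations reduce to very simple features you could extract from a signup email. A sketch (feature names are my own, not from the article):

```python
# Sketch of turning the account-age proxies above into features:
# the email's domain and the number of digits in the account name.
# Feature names are invented for illustration.
def email_features(address: str) -> dict:
    local, _, domain = address.partition("@")
    return {
        "domain": domain,
        "digits_in_name": sum(ch.isdigit() for ch in local),
    }

print(email_features("john1985@outlook.com"))
# {'domain': 'outlook.com', 'digits_in_name': 4}
```

Neither feature means anything causally on its own; as the thread notes, they only matter insofar as they proxy for freshly created throwaway accounts.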

Go to hotmail.com and look at the login portal.

So what? The mail server name has practically nothing to do with the email address. The address is still @hotmail.com.

He makes a valid point: why did @outlook.com addresses suddenly become used for scamming?

Here's a possibility:

- @outlook.com is a relatively new email domain (< 2 years)

- Most people buying online are over age 18

- Most people do not change their email address

- Most people over the age of 18 have had their email address longer than two years

So, by that logic, if someone has an @outlook.com email address, there are a few possibilities:

- They had an old email address, but switched/forwarded it to @outlook.com sometime in the last 2 years (unlikely - generally people don't suddenly change their email)

- They made an @outlook.com address for ecommerce signups (unlikely - why not use your current provider, e.g. Gmail?)

- This is their first email account (unlikely)

- They registered it to commit fraud (hmmmm)

Obviously this is all speculation and there are exceptions to all those assumptions, but it seems logical that the last option is more likely than the others, especially when weighted by the fact that fraudsters almost always create more than one email account.

Honestly, it's even simpler than that: it's the default domain in the email address dropdown when you go to sign up for Outlook.com/Hotmail now. The fraudsters aren't bothering to change the value in the dropdown, and they tend to create new email accounts frequently, so you'll see whatever the current domain is overrepresented.

Pretty interesting insights. Though in Habit #6, "Fraudsters Are Really Boring": the digits in email addresses seem pretty benign (non-fraudster) to me. We need to remember that there are 600 million+ email accounts registered with Gmail alone, for example, and it is really difficult to find an available address without using any digits when registering.

I myself use an email address with 2 digits, and many of my friends use 4 digits or so. I personally don't think that having more digits in your email address makes you proportionally more likely to be a fraudster.

I would say that while there can be legitimate email addresses with multiple digits, and fraudulent email addresses with no digits, neither of these facts precludes the possibility of a correlation.

that's exactly my point. Looking at this "email address with multiple digits" habit alone seems like correlation rather than causation. An anti-fraud algorithm combining all or most of these habits might get at causation and help solve the problem, though.

Also, that distribution doesn't scream "relevant correlation" to me. The bins are different sizes.

Yep. I don't understand how having 1 digit in one's email ID makes someone more fraudulent than having 2 digits. :-o

Anecdotal of course, but I and a few friends/family I've seen used to have email addresses (when I was younger) with two digits representing birth year (like, something85), whereas I can't think of a reason I'd use only one...

Some people actually encode birthyear with 4 numbers, and some even like zipcodes...

Is it just me, or did they really just fit a quintic curve to their "Fraudsters are Sneaky" plot? There had better be a good reason for using such a high-degree polynomial.

It may be a non-parametric estimate. My guess is that they used loess (https://en.wikipedia.org/wiki/Local_regression).
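For what it's worth, the difference is easy to see on synthetic data: a global degree-5 polynomial is free to wiggle to chase noise, while a local smoother averages it out. The hourly numbers below are made up for illustration, not the article's data, and the moving average is only loess-like in spirit (real loess fits a weighted regression in each window):

```python
import numpy as np

# Synthetic noisy hourly fraud rates (invented, not the article's data).
rng = np.random.default_rng(0)
x = np.arange(24)
y = 0.02 + 0.01 * np.sin(x / 4) + rng.normal(0, 0.005, size=24)

# Global quintic fit: one polynomial over the whole range.
quintic = np.polyval(np.polyfit(x, y, 5), x)

# Crude local alternative: centered 5-hour moving average.
window = np.ones(5) / 5
local = np.convolve(y, window, mode="same")

print(quintic.shape, local.shape)
```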

I've built a system like this recently and most of these are supported by the analysis I made. Even so, there are many more signals possible than the ones listed here and only in the aggregate can you use any of this, a single signal is never strong enough to distinguish between fraud and friend.

False positives will always happen, no matter how many signals you throw into the mix, there will always be exceptions. Even so the difference between running with a system like this and being wide open is like day and night.

Fascinating. I can't get over why Fraudsters Go Hungry. Rampant speculation: they're used to eating in front of a computer screen, they're doing what they love so no reason to do anything differently during lunch hour, they're shut-ins and socially inept. International nerds, male, unmarried, above average IQ but didn't get into the top Indian/Russian/Chinese universities so making a living and finding meaning hacking into America.

The "mechanical turk" style fraudster setups probably rely more on getting desperate, unemployed people to work in regions so scarcely regulated that lunch breaks aren't even expected.

Or a large portion of them could just be bots.

These articles always seem to imply that you can use those traits for detection. Can you? I mean, the numbers and wording imply that you can, but some of the stats are not very clear or intuitive.

For example, just because group X usually doesn't eat lunch doesn't mean that not eating lunch is a good trait to detect them in the general population.

Also, 6% of outlook.com is used for fraud? This is a huge percentage.

How does this company detect multiple accounts on the device?

Not only can you, it is being done by thousands of companies across the world every second (not these properties directly, but building a statistical model of fraudulent users and comparing transactions to that model to flag potentially fraudulent ones - in wire transfers, online purchases, credit card use, ...).

I think they mean 6% of the email addresses used for fraudulent transactions were outlook.com addresses, at least in the data they were analyzing. 6% of fraud perpetrators use outlook.com is a much different statistic than 6% of outlook.com users commit fraud.
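The distinction is just conditional probability: P(outlook | fraud) versus P(fraud | outlook). A toy calculation with invented volumes (not the article's numbers) makes the gap obvious:

```python
# Invented volumes, chosen only to illustrate the base-rate distinction.
fraud_transactions = 10_000        # all fraud cases observed
outlook_transactions = 50_000      # all transactions using outlook.com
outlook_fraud = 600                # overlap: fraud cases on outlook.com

p_outlook_given_fraud = outlook_fraud / fraud_transactions    # 6.0%
p_fraud_given_outlook = outlook_fraud / outlook_transactions  # 1.2%

print(f"{p_outlook_given_fraud:.1%} of fraud uses outlook.com")
print(f"{p_fraud_given_outlook:.1%} of outlook.com traffic is fraud")
```

Same overlap, two very different-sounding headlines, which is exactly the confusion in the parent comment.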

Outlook is seriously broken! Ten minutes after creating an account, my spam mailbox already had 20 messages. I don't believe spammers are mailing every possible username combination; there must be a leak or some other channel.

Also it seems that Microsoft gave up on verifying whether your message is spam or not. I had government emails (USPS, for example) as well as emails from my gmail and yahoo friends landing straight in junk.

I doubt how much the data can be used as an overall generalization. They should analyse the patterns and dig a little deeper.

And what exactly does "fraudulent user" mean here?

This is interesting and well-informed, but it's important to remember that fraud is an adversarial problem. The bad guys will change their behavior to evade detection. The habits described here may exist when there is no defense in place, but if you use them to detect fraud, you'll likely see shifts in behavior to appear more "normal" and evade detection.

Seems they should be keeping these signals secret.

These traits aren't unknown to anyone who is processing credit card transactions and keeping an eye out for fraud. It's pretty clear that they don't consider them trade secrets either, or they wouldn't be sharing.


Spam handling is all about weights. Attachment? I won't block you for that, but I'll give you a point. Valid domain? I won't accept you for that, but I'll subtract a half point.

Then at the end you throw the message away if the sum total of the points passes some threshold "X"

Counting digits in an envelope sender would just be one more metric.
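That additive scheme can be sketched in a few lines. The signal names, weights, and threshold below are made up for illustration (though filters like SpamAssassin work in essentially this score-and-threshold way):

```python
# Made-up signal weights: positive = suspicious, negative = reassuring.
WEIGHTS = {
    "has_attachment": 1.0,
    "valid_sender_domain": -0.5,
    "many_digits_in_sender": 1.5,
}
THRESHOLD = 2.0  # throw the message away when the summed score reaches this

def is_spam(triggered_signals):
    """Sum the weight of every triggered signal and compare to the bar."""
    score = sum(WEIGHTS.get(s, 0.0) for s in triggered_signals)
    return score >= THRESHOLD

print(is_spam({"has_attachment", "many_digits_in_sender"}))   # True: 2.5
print(is_spam({"has_attachment", "valid_sender_domain"}))     # False: 0.5
```

The point of the design is exactly what the parent says: no single signal decides anything, only the aggregate does.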

I do some blacklisting, but 99% of the time blacklisting on one fuzzy metric is seen as extremely bad practice in mail.

That's why they said it's only one of many signals one should look for, and not to block transactions on that symptom alone.

LOL, I hit most of these for completely different reasons and have never attempted, nor plan to attempt, fraud. I guess it's another overreaching application of statistics (it must be true because the confidence intervals say so and our prediction model agrees!). It reminds me of the saying that all murderers eat bread, so bread must be dangerous!!!

I really hate oversimplifications in these serious matters.

It happened to me: my bank used a similarly silly algorithm to consistently block my credit card during my world travels every time I arrived in a new country/airport, even though I had told them about it in advance. A sure way to lose a customer for life, especially when their emergency line operates only on working days between 9am-6pm in Germany...

The article clearly says that these factors are not to be taken individually, but in addition to hundreds or thousands of others.

Not sure what your point is. Every model, be it human or automatic/statistical, is by definition a "simplification" and is going to have some false positives and negatives. Having these flaws doesn't make the entire model useless.

I'm pretty impressed with how the header and sidebar operate on this site. Enlarge/zoom the page and the sidebar becomes the header. I'd like to know how this is achieved.

The website has a responsive design [0], which adapts to the available viewport. It looks like this one was implemented with Bootstrap [1].

[0]: http://en.wikipedia.org/wiki/Responsive_web_design

[1]: http://getbootstrap.com/

I want to point out that this commenting style is very good. You provided helpful answers, looked into the original question, and then provided sources. More people should comment like this.

Thank you for the information :).

It's interesting that gmail is the least likely used for fraud, why is that? Can't anybody create multiple gmail accounts?

VPN traffic would also be an interesting metric.

IIRC you have to do text-message validation. If not, I believe the number of messages you can send is under 50. These things change over time, though, and I believe at one point (maybe now) you couldn't make a Gmail account without text verification.

Feel free to correct me if my memory is wrong because it very well could be.

At one point I was thinking about setting up a tv channel for VLC... you can write a lua script to let VLC extract video urls from a webpage. So I'd use Tor/bitcoin to get hosting somewhere, put up a simple page for that purpose, and use Youtube to host the videos. You need Google accounts though, lots of them (Google would suspend them quickly, after all).

The solution I considered was paying people in Africa to sign up for gmail for me, and I'd pay them per account. I figured I'd only need 50-100 per month, so the low volume might make it possible. They often have smartphones, and amounts that are too low for you to bother with might be a decent payday for them for 5 minutes work.

Now, I know what you're going to say... Youtube detects copyrighted works, won't let you upload them. That part was easy.

Just invert the video color, and flip it upside down. Then the lua script for VLC would de-invert and unflip it. And I could even bring in the audio from another site (VLC allows muxing), since Youtube uses audio signatures more than they do video signatures for that stuff.

I had a prototype going for awhile. Called it "Space Potato Channel". It just played videos others had uploaded (wrote a little backend to schedule movies). If you tuned in 5 minutes late, it'd show the video 5 minutes in, etc. Then I learned about how the NSA was giving tips to law enforcement and doing the parallel reconstruction thing, and I reconsidered my scheme to become a bitcoin millionaire.

Long story short, gmail accounts were never something I thought would be much of a problem.

Or you can go to "account brokers" who sell accounts for something like $20/1000. Reliability of those accounts varies per broker but some I understand to be quite good (never bought any myself).

Hang out on any blackhat SEO forum (or more illegal carding shops, etc. I would imagine) and you'll see plenty of guys peddling this service.

Incidentally, the youtube method you're describing has been automated many times. My first real PHP project was a script that found popular videos on non-youtube sites, downloaded them, watermarked them with my blog URL, and uploaded them to youtube. That resulted in a fair amount of direct traffic.

If you trawl around youtube these days you'll see plenty of watermarked videos that are clearly not original content. But as long as nobody is claiming copyright -- which nobody is doing for cat videos -- Google doesn't give a shit. Honestly, uploading non-original videos to Youtube only helps their numbers.

I think a common misconception is that companies care about fake/"spam" user accounts on their services. But what incentive do they actually have to ban them? In the world of venture capital, user numbers are an incredibly important metric, so as long as they aren't actively diluting the service for other users, companies have an incentive to allow them to propagate and pad their stats.

Take Snapchat for example. Looking at my friend request page, I have dozens of obviously spam accounts asking to be my friends. Is Snapchat including these accounts in their user numbers? Almost definitely. In fact, they probably even count as "active users" because they are "sharing photos" so often!

One has to wonder how many popular services have been built on VC money given to them on the presumption of accurate user statistics, when in reality 20-30% of accounts could be shills. Snapchat, Twitter, Facebook... There are tons of fake users on all of them, and yet these companies make relatively little effort to exclude them from stats (except, of course, when reporting monetization per user).

I was going to upload A list movies and tv shows. It was going to be a Syfy channel alternative. Just saying.

No, that can't be true. I don't have a mobile phone currently and have surely sent more than 50 messages from a Gmail address that has never been linked. And I'm sure plenty of other users, especially children and teens, use Gmail addresses without ever linking a phone number for two-step verification.

It might not be a strict requirement, but if Google suspects something is up it will do extra verification.

Using some privacy settings and VPNs will get you more Captchas on Google services also.

> It's interesting that gmail is the least likely used for fraud, why is that?

I spent several years working on the Gmail abuse team. Gmail is used less for fraud than other providers because we were better at fighting abuse than our competitors: as simple as that. Yahoo had a rather hollowed out abuse team for a long time, from what I understand, they didn't invest in it at all. And I think at Microsoft the Hotmail and Passport (i.e. login system) teams were much more compartmentalised than we were inside Google. At least this is what I heard on the grapevine, though I have no clue if it's accurate.

Google does many, many things to combat abuse of Gmail accounts. There's no silver bullet, it's not as simple as "Google phone verifies every account" (it does not and never has), or "if you send more than X messages you get Y". The abuse system is a massively complex pile of interlocking systems, analyses and heuristics.

You can get a good readout of how various teams at the different companies do here:


As you can see, Outlook.com accounts currently sell for $10 per thousand. Gmail accounts are about $100 per thousand, an order of magnitude more expensive. Getting higher than that is very difficult against good opponents (and the guy who runs buyaccs.com is good, although these days he acts more as a reseller than an account creator himself). The reason is that at these prices it's feasible to simply phone-verify every single account by hand using cheap SIM cards. Google does terminate accounts that have phone verified - it's just one more signal - but it's one of the best ones, and so it becomes significantly more dangerous when spammers are phone verifying in bulk. In practice it's not a big deal because $100 per thousand is high enough that many business models (like simple spamming) become unprofitable.

As an example of techniques Google uses: machine learning, manually written logic, real time statistics, randomly generated and obfuscated signal gathering Javascripts, offline clustering pipelines and a team of people with big screens around their office with lots of graphs on them. Those people keep an eye on the system around the clock and if they see e.g. an unexplained spike in account creations then they will manually investigate what was going on. They are very good at quickly identifying mistakes made by account creators and clustering the accounts by hand.

Given gmail's history, I wouldn't be surprised if they're proactively preventing fraudsters from signing up by a variety of means.

Ghostery blocks access to the content due to how they're doing a redirect.

I wonder if that's intentional. Was one of the seven habits "users who block trackers"?

The drawback to services like this is that they are great at hindsight (Aha! Based on the signals we should have known!) but bad at prediction. Take the example of Doral, FL that they offer; it has 8X higher fraud. But, should you avoid Doral customers? No. Should you avoid people who use forwarding services? No. But if you're scammed, you can look back and say "I should have known!"

If a scam works once, it will probably be tried again. Even if you are correct, there is probably value in simply detecting similar scams. Plus, if sift's network is big enough, a given business will be protected against scams that hit other businesses.

Argh, that's not how you use an <abbr> element! (in "Fraudsters are really boring")

This makes the ridiculous assumption that fraudsters don't use proxies, they do.

A while ago I really wanted to build a bot farm, not to do anything particular malicious (farm reddit upvotes and push content to the front page).

Now this post feels like it's encouraging me to.

> A while ago I really wanted to build a bot farm, not to do anything particular malicious (farm reddit upvotes and push content to the front page).

You and I have very different definitions of the word "malicious".

Hmmm, I wouldn't say it was malicious as long as there was no illegal activity involved.

The mere act of building a bot farm is malicious.

Also illegal.

You might as well have just said, "I was thinking of committing light larceny, nothing malicious, and this article makes me want to do it again."

Wrong. It would only be illegal if he was using computers that didn't belong to him. You're jumping the gun.

What do you think a botnet is?

He said 'bot farm', which could just be a group of EC2 servers. He didn't mention a botnet. A bot is just an automated program; a bot farm is a group of automated programs.

Google's spiders are a bot farm. But nobody considers that illegal.

It could be.

It isn't, though.

I don't see how this is illegal. Immoral, yes. Against the ToS of a website, yes. Possible grounds for a lawsuit, yes.

But illegal? Not unless I'm actually stealing data, making a profit, or accessing areas I shouldn't be.

How do you think botnets are formed?

You surely don't believe folks are granted access to all those computers, right?

He said bot farm, not botnet; two different things.

Bot farm isn't a term.

Malicious != illegal. There are many activities that might be technically legal, but definitely are malicious.

Vote manipulation to acquire attention is malicious.

Tell that to the hundreds (thousands?) of SEO and PR agencies around the world that offer this same service...

What? Tell them they're sleazeballs? Already ahead of you on that. They don't want to hear it and usually have a lame rationalization that lets them sleep at night.

It's tens of thousands. And depending on how they go about it, it is.

You are, however, comparing apples and oranges. Building a botnet vs. sending out a press release are very different things. Please look into black-hat vs. white-hat SEO.

Doing something different with your own property is OK.

Manipulating other people's property by directly bypassing the countermeasures they have in place isn't OK.

He didn't say botnet, he said bot farm, which would be akin to a Google crawler. Whose property is it? Does reddit own the links? What do they own?

None of it is illegal.

They own reddit.com which is what he'd be manipulating.
