Detecting Chrome headless, the game goes on (antoinevastel.com)
222 points by jacdp on July 19, 2019 | 129 comments



It seems he's doing something with header detection. I used Puppeteer to play around with the site and various configurations I use when scraping.

In headless Chrome, the "Accept-Language" header is not sent. In Puppeteer, one can force the header to be sent by doing:

  page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' })

However, Puppeteer sends that header as lowercase:

  accept-language: en-US,en;q=0.9

So it seems the detection is as simple as: if there is no 'Accept-Language' header (case-sensitive), then "Headless Chrome"; else, "Not Headless Chrome".

This is a completely server-side check, which is why he can say the fpCollect client-side JavaScript library isn't involved.
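
For illustration, here's a minimal sketch of how such a server-side check could be implemented in plain Node.js (this is my guess at the logic, not the author's actual code). Node's req.rawHeaders preserves the exact casing the client sent, while req.headers lowercases everything:

  const http = require('http');

  http.createServer((req, res) => {
      // rawHeaders is a flat [name, value, name, value, ...] array
      const headerNames = req.rawHeaders.filter((_, i) => i % 2 === 0);
      // Case-sensitive test: Puppeteer's lowercase 'accept-language' fails it
      const looksHeadful = headerNames.includes('Accept-Language');
      res.end(looksHeadful ? 'You are not Chrome headless' : 'You are Chrome headless');
  }).listen(3000);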

Here are some curl commands that demonstrate:

Detected: not headless

  curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36' \
  -H 'Accept-Language: en-US,en;q=0.9'

Detected: headless

  curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36'

Detected: headless

  curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36' \
  -H 'accept-language: en-US,en;q=0.9'


As a followup, if you have the ability to modify the Chrome/Chromium command line arguments, using the following option completely fools the detection:

  --lang=en-US,en;q=0.9

You can prove this with the following Puppeteer script:

  (async () => {
      const puppeteer = require('puppeteer');
      const browserOpts = {
          headless: true,
          args: [
              '--no-sandbox',
              '--disable-setuid-sandbox',
              '--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36',
              // THIS IS THE KEY BIT!
              '--lang=en-US,en;q=0.9',
          ],
      };
  
      const browser = await puppeteer.launch(browserOpts);
      const page = await browser.newPage();
      await page.goto('https://arh.antoinevastel.com/bots/areyouheadless');
      await page.screenshot({ path: 'areyouheadless.png' });
      await browser.close();
  })();


I run the division at my company that builds crawlers for websites with public records. We scrape this information on-demand when a case is requested, and we handle an enormous volume of different sites (or sources as we call them). We recently passed 700 total custom scrapers.

Recently, we have seen a spike in sites that detect and block our crawlers with some sort of JavaScript we cannot identify. We use headless Chrome and Selenium to build out most of our integrations, and I'm starting to wonder if the science of blocking scraping is getting more popular...

I don't think what I'm doing is subversive at all: we're running background checks on people, and we can reduce business costs by eliminating error-prone researchers with smart scrapers that run all day.

I don't want to seem like the bad guy here, but what if I wanted to do the opposite of this research? Where do I start? Study the chromium source? Can anyone recommend a few papers?


> Where do I start? Study the chromium source?

I'm curious why you'd jump straight to browser detection as the most likely culprit. When I was doing scraping, the far more common case was bot detection by origin and access patterns. It's just very difficult to make an automated scraper look like a residential or business user.

Where do you run your scraping operation? Is it in AWS or some other hosting provider? That alone will get you blocked quickly by a lot of sites. Do you rate limit, including adding random jitter to mimic the way a human might use a browser?

There are scraping services available that essentially use a network of browsers on residential connections with their extension installed to get around scraping detection. It's much slower, but it's much more reliable. We also had some success by signing up with a bunch of the VPN providers (PIA, NordVPN, ExpressVPN, etc.) and cycling through their servers frequently. Anything to avoid creating patterns that look automated or being tied to an IP that can be blacklisted. I'd start there before I'd worry about hacky JavaScript detection like in this story being what's tripping you up.


According to the NDA with my company I can't reveal anything about the architecture beyond the fact that it is hosted locally on a homebuilt distributed system that randomly chooses from a pool of 120 residential IPs.

We do have human emulation routines that helped avoid most detection, and that library is decoupled in such a way that we can edit behavior down to the individual site.

Some sites are just so damn good at detecting us, and I just don't get it.


They can characterise the (browsing) behaviour of all their visitors, and then further characterise those who fall outside their "normal" thresholds. The outsiders that exhibit some sort of correlation (i.e., their characteristics are not independent of each other) are banned. Any quirks or patterns your systems have would be identifiable as "artificial", and even those that are randomised or seek to emulate humans will have features that are identifiable. An NDA is ineffective against machine learning.

The countermeasure would be to have a bunch of humans use the websites in any way they want, totally undirected, then use the totality of that browsing to facilitate your scraping probabilistically. It would be less efficient, but very difficult to catch.


That's the general direction I'd like to take. When we capture the inputs for the scrapers, I'd like to persist everything. Mouse jiggles, delays, idle time. I think it would definitely help advance the software.
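
For illustration, a minimal sketch of what replaying a persisted input trace could look like in Puppeteer (the trace format and values here are made up):

  // Hypothetical recorded trace: mouse positions plus the human delays between them
  const trace = [
      { x: 120, y: 340, delayMs: 180 },
      { x: 128, y: 352, delayMs: 95 },
      { x: 410, y: 366, delayMs: 2400 },  // long pause, i.e. idle time
  ];

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  async function replayTrace(page, trace) {
      for (const step of trace) {
          await page.mouse.move(step.x, step.y, { steps: 5 });  // interpolated movement
          await sleep(step.delayMs);                            // preserve recorded timing
      }
  }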


In the grand scheme of things all of this is a wasteful process. Maybe you could direct your work life towards other challenges that are more rewarding for society and equally profitable?


I think that's unjustified and a little rude. OP is providing an automated service for publicly accessible data that isn't accessible for automation. If the sources are notified and they are operating within the confines of the law, this is no different than writing a search engine crawler.


That crosses into personal attack. Please don't do that on Hacker News. We've had to ask you this before.

https://news.ycombinator.com/newsguidelines.html


OP is being reasonably compensated for something that is perfectly legal.


A pool of 120 residential IPs is way too small; patterns emerge more easily. Go for thousands, or even better, hundreds of thousands. Outsource the residential proxy system to Luminati or Oxylabs.


This sounds, at best, ethically dubious and at worst illegal. Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Given that you run this division, there is a good chance you are personally liable.


We have an enormous legal team that communicates constantly with end points to ensure they are aware of our scraping. And as I said in another comment, we store no results other than what is already available to anyone else using the web.

We've had this division for many many years, and before my time we paid another company to do this. There's no legal issues.


Your legal team is in contact with them, but their security is actively trying to block you? That doesn't make sense.

Computer security laws are very broad. It doesn't matter if it's just a website that the public can access. If you're accessing it in a manner that they don't want AND you're aware of that, then I struggle to see how your lawyers can justify it.

> Computer hacking is broadly defined as intentionally accesses a computer without authorization or exceeds authorized access.

https://definitions.uslegal.com/c/computer-hacking/

Hiding your user agent because you know they don't want automated retrieval of information is "without authorisation".


>Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Don't think connecting a computer to a private network to suck up subscriber data is comparable to scraping publicly accessible internet content.


These fear mongering comments always ignore the notice provision in the CFAA. Web scraping publicly accessible information is not "illegal" under the CFAA. That law, at most, only makes someone who continues scraping after being asked to stop potentially culpable.

First, the accuser needs to, at least, send a cease and desist letter to the accused asking them to stop accessing the protected computer. Second, the accused needs to ignore that request and keep accessing the protected computer.

Is it possible to build a solid CFAA case when those two things do not happen? I cannot find any examples.

https://iapp.org/news/a/can-a-cease-and-desist-notice-create...


My understanding of the case is that he was charged with evading JSTOR security, not for accessing the MIT network.


Although his charges were ridiculous, they involved physically connecting to a secure network without permission, not just scraping the public part of pages from his own networks.


> rate limit, including adding random jitter to mimic the way a human might use a browser

Even if you aren't trying to disguise anything, adding some randomness helps avoid one particular bad pattern with operations on a network. I recall the pattern being called "network synchronization" but I can't get good search results for that.
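
For what it's worth, a minimal sketch of the jitter idea (my own illustration, nothing standard):

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  // Space requests out with a base delay plus random jitter so the timing
  // never settles into an obviously mechanical pattern.
  async function politeCrawl(urls, fetchPage) {
      for (const url of urls) {
          await fetchPage(url);
          const baseMs = 5000;                    // nominal 5s between requests
          const jitterMs = Math.random() * 5000;  // plus 0-5s of randomness
          await sleep(baseMs + jitterMs);
      }
  }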


Reducing your business costs by scraping a public access website is often considered an alternative to paying the business costs of the website operator.

Are you saving money at the expense of the site operator by scraping their site for public records, or are you saving money as well as the site operator?

If you're costing them money to reduce your own bottom line without their express written consent, that makes you "the bad guy". Offsetting costs onto an unwitting, non-consenting third party is an unethical approach to doing business.

I interpret your request as a similar problem to "help me with my homework problem". I could dig up papers and studies, but at the end of the day, you need to go do your homework. Reach out to each municipality and figure out a business arrangement with them that satisfies your needs. It's possible they do not wish you to perform this activity, in which case you will either need to violate their intent for your own profit using scraping or accede to their wishes and stop scraping their municipality. That's your homework as a for-profit business.


I don't empathize with your viewpoint because, whether it's a web scraper or a person, the work is exactly the same. There's no additional volume or extra steps. We just emulate a worker.

We measure the value in FTEs, and when a researcher quits, we do not replace them if the appropriate FTEs have been reached with projects.

It's a major benefit to the business not only because we don't have to pay another employee, but we can reduce training costs, and costs incurred by mistakes. We can also adjust execution of one of these agents, which normally would require rearrangement of work instructions, and retraining.

These are public records; 90% of them do not have integrations for automated systems, and those that do, we utilize. They are typically search boxes with results. We are not circumventing any type of cost that would otherwise be incurred.

We do not log any of the results, store them locally, or maintain any of the PII with each search. If a case was searched 20 minutes ago, and comes up again, we rerun the entire thing just as a human would.

Finally, to your point about 'help me with my homework', I consider posting on the HN forums homework for this type of research. There are a diverse set of talented developers on here with esoteric experience. The fact that an article related to the work I do came up on here, I thought, was an excellent opportunity to seek advice and perspective.


Don't be discouraged by the spiteful kneejerk reactions in this thread. HN is a diverse place and some commenters get triggered by an association with one of their pet peeves and launch into a rant without taking time to assess the nuance of your position. I've been the butt of this behavior a few times and it can be pretty toxic.


Sadly, you are correct to have realized that many posters on HN are so naive that they will offer you $0/hour consulting for your for-profit business. Posting on the HN forums means you "don't have to pay another employee" that's an expert in the field. I can't do much to prevent this, but I don't much respect it, either.


What you call being naïve, I'd call being a good human being. Skilled professionals willing to freely share knowledge are a great thing. BTW, it's literally the foundation of our industry and the whole point of the Open Source movement.

If it reduces the market for some consultants, well, sucks to be them; they'd better find a different way of providing value. Not every value needs to be captured and priced. A world in which all value was captured and priced would really suck.


I'm glad that sites like Wikipedia, StackOverflow, and HN exist. I don't think the world is a worse place because they exist, and I respect the people who post there.

This is the same attitude that says, "why would someone just give away Open Source software when they could build a SaaS business instead?"


I don’t think Stackoverflow for “how can I avoid paying a municipality a reasonable public records fee” should exist, but I do endorse Stackoverflow in general. You’ll have to do what you will with that; generalizing my point to “all Stackoverflow” is certainly wrong, though.


>Posting on the HN forums means you "don't have to pay another employee" that's an expert in the field. I can't do much to prevent this

Sometimes the answer tells you much more about what skills you need to be hiring. Sometimes they give you a lead.


Public records are public.

The fact that some government organizations make it hard to retrieve public records is a flaw in the system. I'd be in favor of a national law requiring all public records to be published in machine-readable form.

In the meantime, it is our civic responsibility to conspire to circumvent these misbehaving public services.


If such a national law were passed with funding guaranteed for open publication of records, I would endorse your point of view.

No such funding exists, and municipalities are regularly denied tax increases by their voters for any reason — much less public records publication that would often embarrass and humiliate those same voters.

So in essence you're asking them to cut public services and staffing in order to give hundreds of dollars of IT costs a month to for-profit businesses who can't be bothered to pay some small fraction of their revenue for the costs of delivering those records.

It is our civic responsibility to republish those records for free as citizens. Doing so for profit at the expense of citizens is unethical.

If OP republishes all records received in a freely-downloadable, unrestricted form, then I would happily help them fix their scrapers. They, of course, do not.


Often what the municipalities are doing for public records is harder and more expensive than just publishing an API. So the funding excuse doesn't really pass muster with me.


Can you name a single for-profit public records scraper who republishes the parsed data scraped without charging for data access?

The public records are public. Charging for them is, by the above arguments, immoral. Therefore, not only the municipalities but also the businesses profiting from those public records owe us their scraped data, for free, without regard for profit concerns.

Not one for-profit business does so. Why is their immoral action acceptable, when the same action by a municipality is not?


There's nothing immoral about charging for content that you've aggregated. People sell dictionaries.

The problem here is that instead of building APIs (or just posting to FTP sites), governments are building offices and funding staff to answer snail mail requests. Or building sophisticated web forms and search engines.

It's obvious how we got to this point (before the internet, you obtained public records by walking into an office) but it's long past time to change. We don't need fancy web forms to search and find data; cut all that out and just provide data in machine readable form to anyone who wants it.

Someone will build a pretty commercial interface to public records data. Chances are, they can do it for less than the 8-figure sum required for UI development in the public sector. Win-win.


It is not obvious to me that reducing the cost to consult public data is necessarily a good thing. Just because this data is accessible, it should not always also be accessible inexpensively. For example: trial records should be public, but it would probably not be nice to have your entire judicial record displayed in people's glasses.


Disagree. It's inherently in the public interest to have access to this data as easily as possible. If it's too embarrassing then that's a cultural problem.


Some "public" records are in the gray area as in; should or should they not (black and white) be published. For example salaries, the employer might forbid disclosing salaries, but anyone can just request anyone's salary from the government because its public. But if they could be downloaded from an FTP ...


In the US it is illegal for employers to forbid disclosing salaries.

Discussing salaries is a taboo created by industries to stifle wages.

https://www.monster.com/career-advice/article/truth-about-di...


What government agency allows you to see arbitrary other people's salaries?


Tax records are published in Sweden, Finland and Norway


Can you name a single for-profit public records scraper who republishes the parsed data scraped without charging for data access?

Currently? Not off the top of my head. But there was one that scraped municipal records in a large midwest city and made them public for free because they were confusing to get to otherwise.

Unfortunately, the company was bought by a larger company and that portion of what they did was shut down.


Loveland (now apparently called Landgrid). https://landgrid.com/


Loveland is such a cool organization


Public records are published based on certain demand assumptions.

If a real-world demand for, say, some GIS data is hundreds of requests per day, then a crawler that comes in with hundreds of requests PER MINUTE will obviously stress the infrastructure. Adjusting infrastructure to cope is not an instant process, nor is it a sure thing to begin with, given all the budgeting formalities. So your "civic duty" will ultimately result in the destruction of these services, because they simply don't have the means to deal with such thoughtless activism.


You've made an unfounded assumption -- that is, that the person you're responding to is scraping irresponsibly. If they are, as they say, simply replacing human researchers with the equivalent bots, then the net load change from automation is zero, or possibly even negative.


Imagine if search engines had to "reach out to each [site owner] and figure out a business arrangement with them." The world decided that opt out via robots.txt was a better approach.

If the municipality wants to get the information out, this could be a win-win, just like search engines were. Do check for robots.txt, though!
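
For anyone doing this, a minimal sketch of a robots.txt check (assuming the robots-parser npm package and some fetch implementation; the error handling is simplified):

  const robotsParser = require('robots-parser');
  const fetch = require('node-fetch');

  async function allowedToCrawl(origin, path, userAgent) {
      const robotsUrl = origin + '/robots.txt';
      const res = await fetch(robotsUrl);
      if (!res.ok) return true;  // many crawlers treat a 4xx robots.txt as "no rules"
      const robots = robotsParser(robotsUrl, await res.text());
      return robots.isAllowed(origin + path, userAgent);
  }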


We found at one job that approximately one quarter of well-known search engines blatantly use robots.txt noindex declarations as a list of URLs to index, and one openly mocked us for asking them to stop.

Voluntary honor systems don’t work, because there’s no way to compel non-compliers to stop other than standard “anti-attacker arms race” approaches, such as the obstacle described at the head of this thread.


It sounds like scraping is a big problem for you guys. What kind of outfit is it, if you don't mind me asking?


Drop me an email and I’m happy to describe further.


Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.


> Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.

Are you really arguing that the internet would be _more_ accessible if search engines had to reach out to every site they wanted to crawl?

How many companies out there complain about being scraped by Google? How many companies benefit from search-driven traffic?


The alternative would have been opt-in instead of opt-out. Everything excluded by default, except what robots.txt allows you to index.

Naturally, Google didn't want that.


I would assume that any site that was implementing JS-level blocks also has the appropriate robots.txt file in place.


That's not true on the actual web, however.

The best example is a large number of unimportant sites that send 429 errors for /robots.txt if they think it's a scraper. A 4xx result for robots.txt is considered to mean no robots.txt for most crawlers. So the website is getting the reverse of what it thought it was getting.


Why privilege traffic based on its source (whether it's from a human or Selenium)? If some resources are expensive to serve, you can rate limit them.
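
As an illustration of the rate-limiting alternative, a minimal sketch with Express and the express-rate-limit package (the route is hypothetical):

  const express = require('express');
  const rateLimit = require('express-rate-limit');

  const app = express();
  // Throttle the expensive endpoint per client, human or bot alike
  app.use('/records/search', rateLimit({
      windowMs: 60 * 1000,  // 1-minute window
      max: 30,              // at most 30 requests per window per IP
  }));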


Because some information is more valuable than the sum of its parts.


To the siblings wondering about reaching out to the sites and offering to pay for the data: I'm not parent poster, but where I work, we absolutely have reached out. We've even offered to build and maintain the systems/APIs/etc we'd need at our own expense in addition to paying for the data. None of the companies we've reached out to seem interested in providing easy access to their data.


I'm very curious to know how you are able to get precise and accurate enough identification information from public websites to be able to credibly run a "background check" on someone. I used to work in the criminal justice system, and had unlimited access to every single criminal case initiated in my state going back for almost 40 years. It's difficult enough for a trained person to do it by hand, let alone automating it. How do you provide any guarantee of accuracy?


My wife's SSN/credit history/online identities have in the past been mistakenly tied up with her sibling's. This has since been corrected with all the appropriate agencies and organizations.

However, from what I've noticed of search results over time, these background check (AND identity verification) sites crawl each other and create a kind of feedback loop, as I've been noticing that some of these pages will falsely report parts of her sibling's background among her own, and falsely flag her as having certain ugly events in her past that don't actually belong to her. This is concerning, as her career area cares a lot about employees having a clean background, and employers using these cheap automated options see cheap, inaccurate results. She has a squeaky clean background with a high credit score and impressive educational credentials, while her sibling has had run-ins with the law and bad debts. I'm concerned about how this will affect her future career prospects.

Beyond background checks, identity verification is a big concern as well. You may have noticed some services ask you to confirm certain facts about your past (street names of where you've lived, schools you attended, jobs and cars you've held). When pulling her credit bureau reports, some of these verifications required confirming facts about her sibling rather than her own in order to gain access.

Like I said, these issues have been fixed with all the "official" record-keeping organizations; however, since the fix, I've been noticing increasing issues with the original mistakes propagating to 3rd-party background-check organizations.

These services cause more problems than they solve, and should require consent, oversight, and civil or criminal penalties associated with a failure to meet high quality standards.


> These services cause more problems than they solve, and should require consent, oversight, and civil or criminal penalties associated with a failure to meet high quality standards.

Existing law does not proscribe recklessly sharing damaging false information about people?


It does not.


According to Wikipedia, it does in the US: https://en.wikipedia.org/wiki/Defamation#United_States

Also of note, "Malice would also exist if the acts were done with reckless indifference or deliberate blindness" - https://en.wikipedia.org/wiki/Malice_(law)


I have tried to pursue legal action against companies maintaining inaccurate information using defamation as grounds, to no avail. Maybe you’ll have better luck.


There are a number of ways we do this.

First, the process of automating a source is not as simple as 'grab data, send to person that creates the case'.

We have many, many layers of precaution and validation, both by humans and by other automated systems, that help guarantee accuracy.

On top of this, even public records have reporting rules in the industry. There are dates, specific charges, charge types (Misdemeanor/Felony), and a battery of other rules that the information is processed through in order to ensure we do not report information that we are not allowed to report.

We always lean to the side of throwing a case to a human. In the circumstance that anything new, unrecognized, or even slightly off happens, we toss the case to a team that processes the information by hand. At that point, we are simply a scraper for information and we cut out the step of having a human order and retrieve results.

We do not go back 40 years. Industry standards dictate that most records older than 7 years are expunged from Employment background checks. And most of our clients don't care about more than 3 years worth, with exceptions like Murder, Federal Crimes, and some obviously heinous things.

We also run a number of other tests, outside of public records to provide full background data. We have integrations with major labs to schedule drug screens, we allow those who are having a background check run on them to fill out an application to provide reasoning and information from their point of view to allow customers to empathize with an employee.

We also have a robust dispute system. The person having a background check run on them receives the report before the client requesting it in order to review the results and dispute anything they find wrong. These cases are always handled by a human, and often involve intensive research, no cost spared, to ensure the accuracy of the report.

There's a plethora of other things I'm missing, but if you have any specific questions, I'm happy to answer.

*EDIT

To clarify, there is a lot of information in public records. It isn't unclear or ambiguous at all. Motor Vehicle and Court records are extremely in-depth and spare no detail.


I would imagine a live person audits the information collected by the scrapers, thereby eliminating the hassle of collecting it from multiple different sources.

As a private person, we only have access to court documents on a state or county basis. Any central database we have access to would be made by scrapers.


I personally think that it may be ethically questionable to be making background checks easier. There is a reason why the right to be forgotten is becoming a thing in various jurisdictions, and lack of easy access to sensitive data is one countermeasure that tries to balance the need for public access to data against the right to privacy for individuals.


I don't have a perspective on the ethics of easier background checks. We run employment checks, the ultimate decision of whether to hire falls to the customer ALWAYS. I've seen plenty of former criminals get hired. It's a workplace culture 'thing'.

The right to be forgotten is alive and well most of the time; 90% of our clients don't observe information further back than a few years. I feel like that is a fair assessment of someone's behavior.


"Just" providing the data doesn't absolve you of responsibility for the decisions others make using it.

There is a point where data collection becomes unethical, and treating everything as fine as long as it isn't illegal makes for a shitty society. (i.e., legislating behavior should be a last resort, not a first judgement on right and wrong)

I don't know precisely where that point is, but automated scraping of social media is probably past it (automated scraping of judicial records? Probably OK).


I still don't agree. The whole reason this business exists is to remove the cost from all the industries that need to run background checks.

I think the extent and reason for the checks aren't apparent. So I'll give a few examples where we have high volume and I hope that will enlighten you as to the reason why there are so many players in the industry.

The highest volume checks are around the medical and teaching fields. We often run 6-month, to one year recurring checks on teachers and doctors to ensure licenses and certifications are still active. As well as necessary immunizations to work in their environments.

Do you expect a low margin industry like teaching to staff a full time employee to do nothing but run background checks? They want them done and the schools have access to the information, it's just much easier for them to pay us a few dollars an employee and get a nice report than do the legwork themselves.

Additionally, incurring the cost of access for the relevant data is a barrier for companies without a bunch of cash laying around.

We don't solicit companies with incriminating information about their employees; it's a necessary part of a safe environment.


*shrug* Not trying to imply it is all bad. The nature of the information is important. Professional certification or licensing checks are obviously harmless.

What isn't harmless is gathering information about the private lives of people (even when done in the public eye) in ways that are difficult, labor intensive, or impossible without automation.


The right thing to do would be to reach out to those sites and see if they have paid options for getting the data you need.


And what happens when they ignore you? I've reached out to tons of website operators to ask for machine-readable access to their data for academic, personal and professional projects; I have never gotten a reply and have had to resort to scraping.


I can second this for public records websites.

A previous company I worked for aggregated publicly recorded mortgage data. The mortgage data was scraped from municipal sites on a nightly basis because it was not available as a bulk download or purchasable option.

We had requested on several occasions a service we could pay for in order to get a bulk download of this data, but the municipalities did not have the know-how to provide this, as they were using systems from a private vendor for which requesting modifications was prohibitively expensive. As a result, we worked hand in glove with the municipalities to ensure we were not stressing their infrastructure when we did this scraping, and I think that's the best we were able to do in this case.


Well, when that option is available, as in the case of something like SAMBA WEB MVR, we absolutely opt for that instead, and pay our dues.


If it suits your needs, please consider using Common Crawl instead.

http://commoncrawl.org/


If you are not doing anything subversive then can you share with us some examples of the sites that are selectively blocking you? And give us an example of the public information that is sought. Perhaps disclosing any of the sites puts you at disadvantage versus potential competitors? It does not make much sense to block access to public information, assuming you are not interfering with others' access.


My understanding is that Selenium injects Javascript in the page, whether you're using a headless browser or not. The best bet would be to switch away from Selenium and write the code using something like Puppeteer.

If you do want to stick with Selenium, you're better off studying the ChromeDriver source than Chromium itself.
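
For reference, two client-side signals that are commonly cited for detecting Selenium/ChromeDriver-driven browsers (not necessarily what any particular site uses):

  // navigator.webdriver is true when the browser is under automation, and
  // ChromeDriver has historically exposed document properties prefixed with
  // $cdc_ (or $wdc_). Both are popular probes in detection scripts.
  const automationHints = {
      webdriver: navigator.webdriver === true,
      driverVars: Object.keys(document).some(function (key) {
          return key.indexOf('$cdc_') === 0 || key.indexOf('$wdc_') === 0;
      }),
  };
  console.log(automationHints);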


Have you tried running your browsers in virtual frame buffers? Do they still get detected?


Have you tried writing a chrome extension and running it in a desktop browser instance? It's super easy to set up and shouldn't appear any different than a regular user if you rate limit and add some randomness to the input events.


I run an ad delivery platform (hey, we're both popular) and I detect and block bots because they tend to inadvertently drive up engagement counts on ad campaigns, creating a situation where publishers can't be confident in their numbers. Some clients have their own tech to do the same.


If you don't inspect and respect robots.txt, you shouldn't be surprised by sites actively blocking your crawlers. Ditto for when you try and work around crawling restrictions by hiding behind real browser UAs.


Have you tried loading a full browser session? Not just headless.


Not the OP, but I did that about 12 years ago, with Firefox. My boss at the time had asked me to parse some public institution website that was quite difficult to write a parser for directly in Python, so in the end we just decided to write a quick extension for Firefox and let an instance of it run on a spare computer. That public institution website had some JS bug that would cause FF to gobble up memory pretty fast, but we also solved that by automatically restarting FF at certain intervals (or when we noticed something was off).

Not sure if people do this sort of thing nowadays.


When I'm doing personal scraping, I just write a chrome extension. You can find boilerplates that are super easy to set up, and they persist in a background thread between page loads. It's really easy to collect the data and log it in the console or send it to a local API or database. It's the lowest effort method of scraping I know, and you can monitor it while it runs to make sure it doesn't get hung up on some edge case.
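
As an illustration, the content script of such an extension can be as small as this (the selector and local endpoint are made up; the manifest would register this file under content_scripts):

  // Scrape whatever the current page shows and ship it to a local collector.
  var rows = Array.prototype.map.call(
      document.querySelectorAll('.result-row'),  // hypothetical selector
      function (el) {
          var link = el.querySelector('a');
          return {
              title: link ? link.textContent.trim() : null,
              url: link ? link.href : null,
          };
      }
  );

  fetch('http://localhost:8080/collect', {  // hypothetical local API
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(rows),
  });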


Sure we do. Through Selenium. You can either load a full browser session, or a headless one. But headless sessions are identifiable.


Selenium injects predictable Javascript in both situations.


I run a scraper on a Craigslist-style marketplace for my country. They now have one of those commercial scraping protections, which I trivially escape by basically adding a random string to the URL. Figure out how they work and then create a workaround; I think most of them use one of those commercial solutions.

I do my scraping just for myself. Maybe if I scaled it up they would detect me.


Putting my 'bad guy' hat on, I would think about automating via a Sikuli script if you had to (but only if you had to).


There are additional tests included in https://arh.antoinevastel.com/javascripts/fpCollect.min.js that do not exist in the GitHub repository over at https://github.com/antoinevastel/fp-collect.

  redPill: function() {
      for (var e = performance.now(), n = 0, t = 0, r = [], o = performance.now(); o - e < 50; o = performance.now()) r.push(Math.floor(1e6 * Math.random())), r.pop(), n++;
      e = performance.now();
      for (var a = performance.now(); a - e < 50; a = performance.now()) localStorage.setItem("0", "constant string"), localStorage.removeItem("0"), t++;
      return 1e3 * Math.round(t / n)
    },
    redPill2: function() {
      function e(n, t) {
        return n < 1e-8 ? t : n < t ? e(t - Math.floor(t / n) * n, n) : n == t ? n : e(t, n)
      }
      for (var n = performance.now() / 1e3, t = performance.now() / 1e3 - n, r = 0; r < 10; r++) t = e(t, performance.now() / 1e3 - n);
      return Math.round(1 / t)
    },
    redPill3: function() {
      var e = void 0;
      try {
        for (var n = "", t = [Math.abs, Math.acos, Math.asin, Math.atanh, Math.cbrt, Math.exp, Math.random, Math.round, Math.sqrt, isFinite, isNaN, parseFloat, parseInt, JSON.parse], r = 0; r < t.length; r++) {
          var o = [],
            a = 0,
            i = performance.now(),
            c = 0,
            u = 0;
          if (void 0 !== t[r]) {
            for (c = 0; c < 1e3 && a < .6; c++) {
              for (var d = performance.now(), s = 0; s < 4e3; s++) t[r](3.14);
              var m = performance.now();
              o.push(Math.round(1e3 * (m - d))), a = m - i
            }
            var l = o.sort();
            u = l[Math.floor(l.length / 2)] / 5
          }
          n = n + u + ","
        }
        e = n
      } catch (t) {
        e = "error"
      }
      return e
    }
  };


It doesn't seem to be using the Javascript. Looking at the page source, it has already made the determination before the Javascript runs.

If I load the page source in Chrome, it already includes the "You are not Chrome headless" message, but when I run it in a scraper I maintain, the page source loads with the "You are Chrome headless" message, even without running any Javascript.


So it's just measuring computation speed of math calculations? That doesn't sound very reliable.


https://arh.antoinevastel.com/javascripts/fpCollect.min.js contains some functions called redPill that aren't in the normal fpCollect library. redPill3 measures the time of some JS functions and sends that data to the backend. Here's a chart of redPill3's timing data on my computer: https://i.imgur.com/c8iuV6I.png

Those are averages of multiple runs on a Core i7-8550U running Chromium 75.0.3770.90 on Ubuntu 19.04.

isNaN and isFinite are much slower in headless mode, but other functions like parseFloat and parseInt aren't. My guess is that the backend is comparing the relative times that certain functions take. If isNaN and isFinite take the same time as parseFloat, then you're not in headless mode. If those functions take 6x longer than parseFloat, you're in headless mode.

I don't know if this holds true for non x86 architectures or other platforms.
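
If you want to poke at this yourself, here is a rough sketch of that comparison (my reading of redPill3, not the author's exact code):

  // Time a batch of calls to a function and return the elapsed milliseconds.
  function timeFn(fn, iterations) {
      var start = performance.now();
      for (var i = 0; i < iterations; i++) fn(3.14);
      return performance.now() - start;
  }

  // Hypothesis: the relative cost of isNaN vs. parseFloat differs between
  // headless and headful Chrome, so a backend can threshold on the ratio
  // rather than on absolute, hardware-dependent timings.
  var ratio = timeFn(isNaN, 100000) / timeFn(parseFloat, 100000);
  console.log('isNaN / parseFloat timing ratio:', ratio);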


Huh. Your chart reminded me of an experiment I did 10 years ago to test whether you could distinguish an image request triggered by an <img> tag from one triggered by a user clicking on a link (or entering its URL in the address bar). I created a test page and asked people on the Internet to visit it, and then analyzed PHP & server logs.

Unexpectedly, it turned out that Accept header was perfect for this. The final chart was this:

https://i.imgur.com/ZA8qD8t.png

("link" means clicking on an URL or entering it manually; "embedded" means <img> tag)

Makes me wonder whether the Accept header is still useful for fingerprinting in general, and for distinguishing between headless and headful(?) browsers in particular.
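
For context, a minimal sketch of that kind of server-side classification (the Accept values vary by browser and version, so treat the prefixes as an assumption):

  // Top-level navigations typically advertise text/html first, while <img>
  // subresource requests lead with image/* types.
  function classifyRequest(acceptHeader) {
      var accept = acceptHeader || '';
      if (accept.indexOf('text/html') === 0) return 'link';   // navigation
      if (accept.indexOf('image/') === 0) return 'embedded';  // <img> tag
      return 'unknown';
  }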


This would be more interesting if the author explained this technique.

People who are knowledgeable enough will take a deep dive into the webpage, but everyone else should expect disappointment.


Agreed that the article is poorly written for the Hacker News crowd; it would be nice to have a description of his technique so the merits/faults could be analyzed without everyone conducting a reverse-engineering effort.

It’s sad to have a smart guy like this dedicating his academic career to something this inconsequential. Anyone with enough incentive is going to be able to defeat any technique this guy dreams up. Anyone who might benefit on paper from detecting a headless browser isn’t going to want this because any possibility of a false positive is a missed impression, or a missed sales opportunity, or an ADA lawsuit (US), or an angry customer.


I'm not sure if diving deep into the page will yield results of how it's done. The page's javascript does a POST to a backend with the browser's fingerprint, and the server does all the "magic" where we can't see it. Unless there is new fingerprint info that is being sent to the server that wasn't around before, I'm skeptical about the javascript in the page revealing the full technique.


He claims the fingerprint library's techniques aren't used for the check, though, so surely there must be an observable difference between the POST requests from headless and non-headless.

Edit: According to other commenters there are checks in the included version of the library which are not in the release version.


The "You are/are not" message seems to be included in the page source before any Javascript runs. Is it possible there are detectable differences in the original HTTP request itself?


My guess is he's looking at XSS mitigations or similar that aren't in headless?

If it were doing something like exploiting CSS being non-blocking (I don't know that it is), that's a server-side detection... but that would seem to work even against spoofing.

But he says that if you spoof a non-Chromium browser (Safari) he can't tell. So he's looking at the UA first? That's weird.


Yep, you got it; check out the top comment on this thread.


Only way to do it these days... although the payload is not hashed or obfuscated in any way, so it would be extremely easy to fake if it's even being stored in a DB or memory somewhere; otherwise you can just copy the request exactly as is.


I don't think there's actually anything concrete to discuss. After reading the post, my feeling is that the author is probably fishing for HN geeks that would be intrigued to test his headless detection mechanism. Which would be borderline antisocial behaviour towards the HN community. This would better fit under the "Ask HN" label.

Moreover, I'm not even sure how useful would this community testing be academic-wise. Black-box testing is great stuff for CTF competitions. But any decent academic venue would dismiss systems that can't withstand white-box testing as security-through-obscurity.


Well, if I was him I guess I would prefer to have an accepted paper about this new technique before releasing everything to the public.


Out of all the zero-sum tech arms races (increasingly complex DRM, SPAM senders/blockers, software crackers vs. copy protection, code obfuscation) this one seems to me to be the stupidest. Here we have people putting data out in public for free, for anyone to access, and then agonizing over how someone accesses it. If some data is your company's secret sauce, your competitive advantage, don't put it out on the Internet. If your data is not your competitive advantage, then why bother wasting all this development effort stopping browsers from browsing it? So much waste on both sides.


I agonize about this every day, since a large part of my job is aggregating data from many sites that seem hell-bent on not letting anyone access it without going one-form-at-a-time through their crap UI.

The thing is, we would gladly pay these companies for an API or even just a periodic data-dump of what we need. We've even offered to some of them to write and maintain the API for them. They're not interested, for various industry-specific reasons.

I often wonder how much developer time and money are wasted in total between them blocking and devs working around their blocks.


Sometimes when I'm thinking about it and what 95% of developers are working on, it feels like a planet-wide charity project against unemployment.


I think it's a fairly well known thing that 'junk jobs' tend to spring up in response to supply. I think it's a bizarre cultural thing.


The travel industry is highly protective of its data. It's my understanding that they consider it proprietary and only sell it to those who they deem worthy.


Who says this is zero-sum? In the limit, it seems like a lose/lose situation to me. Or, at "best", modest rewards for [spammers|scrapers|...] at tremendous (spread out over the population) cost in lost usability, compute cycles, and development effort.


This is like saying, if you're going to give people free samples, why not give away the whole grocery store?

Wanting to give out limited free samples inevitably leads to making sure you are giving out samples to people and not bots and not too much to each person, and that leads to user tracking.

Compare with the arms race between newspapers and incognito mode:

https://www.blog.google/outreach-initiatives/google-news-ini...


I'm the other half of the cat and mouse game that Antoine is referring to, and I just wrote another rebuttal that people here might find interesting [1]. It goes into a little more detail about what his test site is actually doing, and also walks through the process of writing a Puppeteer script to bypass the tests.

[1] https://www.tenantbase.com/tech/blog/cat-and-mouse/


All this can be avoided (from a scraper's perspective) by using the Selenium IDE++ project. It adds a command line to Chrome and Firefox to run scripts. See https://ui.vision/docs#cmd and https://ui.vision/docs/selenium-ide/web-scraping

=> Using Chrome directly is slower, but undetectable.


I have been using the UI Vision extension for a few months now. It is not very fast, but it always works. It can extract text and data from images and canvas elements, too.


I work in telecom and we interface with large carriers like AT&T, Verizon, etc. We use headless browsers to automate processes using their 15-year-old admin portals, since the carriers simply refuse to provide an API, or one that works acceptably.

Thankfully they're also so technologically slow that they never change the websites or do any kind of headless detection. It works, and allows us to offer automated [process] to our customers, but it seems so fragile. Just give us a damn API.


Well, the user agent of headless Chrome contains 'HeadlessChrome' according to this site [1]. Sure enough, when I spoof my user agent to the first one in the list, it magically determines I'm using headless Chrome.

He basically says he's inspecting user agents:

> Under the hood, I only verify if browsers pretending to be Chromium-based are who they pretend to be. Thus, if your Chrome headless pretends to be Safari, I won’t catch it with my technique.

Maybe I should apply for a PhD too.

[1] https://user-agents.net/browsers/headless-chrome


If a browser claims to be Headless Chrome, you believe it. Nobody has a reason to lie about that. The interesting question is the opposite case: is somebody claiming to be a normal Chrome, but is actually Headless Chrome (or an automated member of some other browser family, or not a browser at all but e.g. a Python script).

So if you take a Headless Chrome instance but change the User-Agent to match that of a normal Chrome, does the detector think it's not headless?


The detector still thinks it's headless even if you spoof the user-agent.


The author seems to be making use of the fpscanner [1] and fp-collect [2] libraries for achieving his task, though he doesn't seem to explain how exactly the detection is done.

[1]: https://github.com/antoinevastel/fpscanner

[2]: https://github.com/antoinevastel/fp-collect


On his actual "test" page, he claims he is not using fpscanner: https://arh.antoinevastel.com/bots/areyouheadless

> It does not use detection any of techniques presented in these blog posts (post 1,post 2) or in the Fp-Scanner library


> It does not use detection any of techniques presented in these blog posts (post 1,post 2) or in the Fp-Scanner library

That's from the linked test page https://arh.antoinevastel.com/bots/areyouheadless


I think there might be a market for "human crawlers". Just like people use Mechanical Turk to get humans to beat CAPTCHAs, you could use it to get humans to visit a web page for you and return its HTML source. There are of course residential proxy services (e.g., HolaVPN), but they can still technically be detected.


Why would you do that when you can automate it?


Because of the issues the article described: detection of headless crawlers/bots/etc


You can automate a regular browser. It doesn't have to be headless.


Unfortunately, there are some sites that can even detect regular automated browser sessions.


You can still simulate mouse and keyboard.


This straight up crashes my scraper's browser, using Puppeteer, extra-stealth, etc.


Wouldn't these render the Brave browser unusable on some sites?


Why would it do that? Brave is just a Chromium fork afaik.


I’ve participated in a number of Stanford research studies, and what the author is doing here is similar to part of it.

The studies in which I've participated always start with a statement of what they are generally looking for in a participant. You then take a survey that confirms whether you are qualified. You are then given a release to sign (and keep a copy of), which states what you’ll be doing and provides an IRB contact. You then go through the study.

At the end of your participation, you are asked “What do you think the study is about?”, and then you are told the real purpose of the study. Eventually the paper(s) is/are published, with hypothesis, methodology, and results.

This seems similar: You decide if you want to participate, and are participating; the only thing that’s missing is the final paper.



