I don't think that he did it because it's his job to stop random people on the Internet from running slow queries. I think he was just surprised how creative people are and found it funny.
The reason stop words take such a long time is that millions of sites have words like "the" on them, so doing a join on all those simply takes a long time.
My method for finding a long string consisting entirely of stop words was to download the complete works of Shakespeare from Project Gutenberg, find the longest string in there consisting of just stop words, and then search for it as a literal quote.
The longest one I found was: "From what it is to a".
Let me see how long Google takes to do it now :)
2.04 seconds! Nice :) - http://i.imgur.com/IhPTpr6.png
That took 30+ seconds 'back in the day'.
Today it resolves instantly and finds at least 10 publications from before 2000.
By the way I believe I wanted to know whether it would return the Shakespeare quote at all.
If you mean that it might have cached the results of the query, I doubt anyone else queried that exact phrase, other than me.
I'm guessing you're most familiar with B-tree indexes, which are present and the default in many SQL solutions and are good for quickly answering exact-match and greater/less-than queries. There are dozens of data structures useful for indexing, some of which are built to index full-text documents. For an example, check out the GIN and GiST indexes in Postgres.
It's my understanding that database indexing and index compression were a primary differentiator that Google excelled at from the beginning. They could beat others at fractions of the typical cost because they didn't need data centers to store and query huge quantities of documents.
Seriously, there's no way even Google could intersect the sets of all crawled web documents containing those individual words in 30 seconds, much less two seconds.
I believe you're mistaken. What I've heard is that for every word, Google has a list of every web site that contains that word: they've flipped the database. So, I believe, if you search (without quotes) for neanderthal violet narwhal obsequious tandem, then... well, I just did this query, which took 0.56 seconds, but Google decided to drop some of the words so that it could get me results. When I added plus signs, making my query +neanderthal +violet +narwhal +obsequious +tandem, it said it took 0.7 seconds to determine that in the entirety of the Internet there is not a single document that has those 5 words on it.
How do you think it determines in 700 ms that none of the sites it has indexed on all of the Internet contains those 5 words together?
The answer is that it has a rather short list of sites that contain the word narwhal, which it then intersects with the somewhat larger list of sites that contain obsequious, and so on. 700 ms is plenty fast when you take that approach.
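Here's a rough sketch of that idea, assuming a toy in-memory inverted index. The `build_index` and `intersect_all` helpers and the sample documents are made up for illustration; this is not a claim about how Google actually stores or walks its lists.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def intersect_all(index, words):
    """Intersect the per-word sets, rarest word first, stopping once empty."""
    postings = sorted((index.get(w, set()) for w in words), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for other in postings[1:]:
        if not result:
            break  # nothing can match any more, so skip the bigger lists
        result &= other
    return result


# Hypothetical mini "web" of four pages.
docs = {
    1: "the narwhal is a whale",
    2: "the obsequious waiter brought the tea",
    3: "a violet narwhal on a tandem",
    4: "the the the",
}
index = build_index(docs)
print(intersect_all(index, ["narwhal", "violet", "tandem"]))  # {3}
print(intersect_all(index, ["narwhal", "obsequious"]))        # set()
```

Starting from the shortest list means the running result can never be bigger than the rarest word's list, which is why a query with one rare word is cheap and a query made only of stop words is not.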
So this explains why joining stop words (each of which appears on billions of pages) takes so very long.
Using stop words, it is easy to make queries that take one or two seconds each.
Huh? What do you mean? Google indexes HTML web page content from the entire public internet using web crawlers...
The reason my search took 30 seconds is that it started by getting a list of every site with "from" on it, every site with "what" on it, and so on, intersecting them all. That's how it ended up finding my quote. How else do you think it did it?
To find the string "from what it is to a", which occurs only hidden in the middle of Shakespeare's texts, what do you think they do?
In my opinion they combine the list of sites that have every word - starting with the least common ones. It's easier if you search for something that has a few uncommon words. Then you start with a small list, and have to combine it with other small lists.
When every word in the phrase has billions of sites (there are billions of pages that have the word "to" on them, same for "from", "what", "it", "is", "a"), you have to combine them all. Then you have to do a string search within the resulting set, since I put it in quotation marks. There is no easy strategy. Hence the long search time.
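To make the two steps concrete, here's a small sketch under my own assumptions: the index also records word positions, so the "string search" becomes a check for consecutive positions. That's a swapped-in technique for illustration, and `build_positional_index` and `phrase_search` are hypothetical names, not anything Google has documented.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions of that word in the document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Step 1: candidates must contain every word (start with the rarest).
    by_rarity = sorted(words, key=lambda w: len(index[w]))
    candidates = set(index[by_rarity[0]])
    for w in by_rarity[1:]:
        candidates &= set(index[w])
    # Step 2: keep candidates where the words sit at consecutive positions.
    hits = set()
    for doc_id in candidates:
        for start in index[words[0]][doc_id]:
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical documents: 2 has all six words but not the phrase itself.
docs = {
    1: "from what it is to a",
    2: "it is what it is from a to b",
    3: "to be or not to be",
}
index = build_positional_index(docs)
print(phrase_search(index, "from what it is to a"))  # {1}
print(sorted(phrase_search(index, "it is")))         # [1, 2]
```

With an all-stop-word phrase, step 1 barely narrows anything down, so step 2 has to grind through a huge candidate set. That's the slow case.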
What else could they be doing?
The sense you mean is a different sense of the word index, meaning to crawl. Yes, of course it does that too.
Google, no doubt, has a very sophisticated way of querying against their cache of the WWW and it has probably evolved over time. However, it is inappropriate to say Google does a join over the entire internet for one query. It is much more reasonable to say that Google checked your query string against their gigantic index of terms, and it took a while to dig that deep into the pile. The performance hit such a complex query takes is more like unzipping a large archive to get a specific megabyte's worth of info, rather than saying it smashed all the files together and then searched for the exact term like notepad.
Anyway, think about it for a while, it's clearly a cool issue in search, and programs and algorithms do not have to visually search things as humans must.
I recommend reading the Stanford paper (page 12), which spells out in a lot of detail exactly what they were doing.
In short, your pathological query would have searched for every document which contained one of your words, discarded those which didn't match all, and then sorted by word proximity. I expect for a literal phrase search, there would be a final pass to look for the exact phrase in order.
Have you accidentally searched Google for a long URL and seen it come up? It is actively caching stuff all the time, and that cache just grows and grows, and it must be a pretty beautiful megastructure that you can run queries against.
Normally it ignored those words. I am fairly certain of this detail. I must have found a list of those words - how else would I have found the string "from what it is to a"? I had a list of its stop words.
Edit: for proof, here's someone's screenshot of the same - http://farm3.static.flickr.com/2270/2201828252_45a32da7f4.jp...
As you can see, it is Google saying it is ignoring a word because it is too common. It has a list of every site that has that, but that list is huge and it doesn't usually use it.
The same query as yours took 0.3 s for me, but when I stripped out one word ("From what it is to") it took 2.2 seconds.
Shouldn't someone with a job in statistics know how to account for outliers?
This is why companies like Google have bounties for such things. "Please submit bugs and performance issues privately so we can patch them before you disclose the details publicly and hurt our services - we'll even pay you for your discretion!"
When you operate at the scale of Google, everything is expected to be airtight; outliers should not be possible. It wouldn't surprise me if their monitoring systems are built without the ability to "massage" (ie: manipulate) statistics, as it is a terrible practice. I don't think a statistician who relies on ignoring outliers would last long working for Google. They're not doing their job if the only thing they care about is silencing warnings to make pretty graphs that falsely show everything is running smoothly. Their job is to work with the truth - not manufacture little white lies to appease management.
Nobody ever asks about that 0.1%...
One of my young-adulthood colleagues went on to be an early Googler who is influential in relevant policies. We built our relationship sharing bugs and analysis techniques. Quite a few years ago some scoundrels whose trust I gained proudly showed me how they were using YouTube links to drop malware. Since my old mate worked there, I mentioned it and they were quite interested.
We hadn't shared anything in a while, both of us demonstrating loyalty to our employers and not talking about work details. I said that it would be really cool to have a one dollar check from Google for a bug report. I probably offered to send something cool from my workplace too.
They said, "We don't pay for bugs." Fifty cents? "We don't pay for bugs!"
I felt like I was simply after a piece of paper and the evildoers were a mildly useful source, but I could easily do without them and the souvenir would have been treasured.
I was unreasonably miffed that I couldn't get that piece of paper, though. So I reviewed the links I'd collected and passed some general information but withheld details that would be obviously unique to these attackers. They expressed disappointment with me the next time we spoke. It turns out that what I gave wasn't specific enough to easily identify the lame cross site exploit, despite my actual intent to lead them to the bug.
Interesting they have a bounty program now.
I got all my Wickr contacts to switch to Signal, which is much less buggy...
Managing soft-deletes on a database table requires an attention to detail, with every single query ever touching that table, that many developers lack the discipline to handle. Discipline aside, it's difficult for every developer on a team to remember which tables use soft-delete, and when checking that flag is or is not necessary. Finally, ORM abstractions often automate soft-delete in such a way that makes it exhausting for developers to validate every query. I've seen this bug over and over again at every company I've worked for. Happens so often it's impossible to keep count.
> Discipline aside, it's difficult for every developer on a team to remember which tables use soft-delete, and when checking that flag is or is not necessary.
That's the case where instead of "try harder not to make mistakes", you design a system so it is not possible to make them. One way would be to rename original table `raw_messages` and `create view messages as select * from raw_messages where not deleted`.
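A minimal sketch of that rename-plus-view idea, using SQLite from Python just so it runs anywhere. The column names and sample rows are assumptions for the demo; a real schema would differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_messages (
        id      INTEGER PRIMARY KEY,
        body    TEXT NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0  -- 0 = live, 1 = soft-deleted
    );
    -- Application code only ever reads from this view.
    CREATE VIEW messages AS
        SELECT * FROM raw_messages WHERE NOT deleted;
    INSERT INTO raw_messages (body, deleted) VALUES
        ('hello', 0), ('oops, wrong channel', 1), ('goodbye', 0);
    """
)

# Queries against the view cannot see soft-deleted rows, even in JOINs.
visible = [row[0] for row in conn.execute("SELECT body FROM messages ORDER BY id")]
# Code that genuinely needs deleted rows has to name raw_messages explicitly.
everything = conn.execute("SELECT COUNT(*) FROM raw_messages").fetchone()[0]

print(visible)     # ['hello', 'goodbye']
print(everything)  # 3
```

The point is that forgetting the filter is no longer possible by accident: seeing deleted rows takes a deliberate reference to `raw_messages`.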
As for writing: it's not magic, but for me it's not consciously applied processes either. If I had to guess how my earlier comment came about, I'd suggest something like this as a generative process:
1. Find two effects with a common cause (provided upthread).
2. State each effect, sharing words and rhythm to bring out contrast.
3. Omit needless words. (Thanks, Strunk/White!)
I've seen abstraction layers where it's impossible to add a default clause to every SELECT query for a model. I've seen other abstractions where "AND deleted = false" can be automatically added to every SELECT query. I've also seen abstractions where that clause is added to all SELECT, UPDATE, and DELETE queries.
Here's a list of problems:
a) Developers bypassing the model, executing a complex JOIN that includes the table in question, and forgetting they need the "deleted = false". Most complex queries wind up being written as raw SQL or a parsed variant, that never executes the model behavior to append the "AND deleted = false" clause. Is it "wrong" to bypass the model? Most of the time, yes! But it happens every day. We're talking about what happens in reality, not what should happen in an ideal fantasy world.
b) Developers missing the case where they should be including soft-deleted rows. When the abstraction layer enforces a "deleted = false" on every query, it can be difficult or impossible to force backtracking to include soft-deleted rows. Back in the MyISAM days (before foreign keys), I found an "account deletion" mechanism that executed a "DELETE FROM messages WHERE userid = :userid AND deleted = false" - soft-deleted rows were not deleted when required, because an abstraction layer excluded soft-deleted rows in a DELETE query and the original developer never noticed.
c) What happens with UPDATE and DELETE queries? I've seen abstractions that only append the soft-delete mechanism to SELECTs, and others that also affect UPDATEs and DELETEs. Again, should every developer on a codebase understand in which situations the abstraction kicks in? Yes. The fact is they don't, because abstractions inherently make developers not inspect the behavior of their code as deeply as they should.
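To illustrate problem (b), here's a small reproduction with SQLite. The `scoped` helper is a hypothetical stand-in for an abstraction layer that blindly appends the soft-delete filter; it is not any particular ORM's API.

```python
import sqlite3


def scoped(sql):
    """Hypothetical ORM-like helper: always appends the soft-delete filter."""
    return sql + " AND deleted = 0"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, userid INTEGER, deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO messages (userid, deleted) VALUES (?, ?)",
    [(7, 0), (7, 1), (8, 0)],  # user 7 has one live and one soft-deleted row
)

# "Account deletion" routed through the helper: the filter sneaks into DELETE.
conn.execute(scoped("DELETE FROM messages WHERE userid = :userid"), {"userid": 7})
leftover = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE userid = 7"
).fetchone()[0]
print(leftover)  # 1 -- the soft-deleted row survived the account deletion

# Bypassing the helper removes everything, as the account deletion intended.
conn.execute("DELETE FROM messages WHERE userid = :userid", {"userid": 7})
remaining = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE userid = 7"
).fetchone()[0]
print(remaining)  # 0
```

Nothing errors out in the first case, which is why this kind of bug tends to go unnoticed until someone looks at the table directly.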
I don't remember soft-deletes being an issue at all - literally non-existent - 10 years ago, when all SQL queries were typed out by hand. When you're forced to write the query yourself, you have time to think about what you are doing. When you delegate the majority of the task to an abstraction layer that magically modifies your queries on the fly, bad things happen. The most stable and maintainable code base I ever worked on had every single query in XML files. It sounds tedious and bloated, as if it's a joke about the "old days", but every query was located somewhere where it could be analyzed, and you actually had to use your brain when writing a new query. I've seen nothing but misery since the introduction of abstracted ORMs and DBALs, where the only way you ever see the queries being executed is in debug dumps and logs.
>> competence I expect from a junior web developer
Sadly, more than half of the senior developers I've met can't handle soft-deletes properly. So no, in the real world, this cannot be expected of junior developers.
> 19/02/2017 – Got a response, they implemented a short-term fix and forgot to sent my report to the VRP panel …
I hope Google forgetting to follow up on bug bounties and needing to be reminded isn't a common occurrence.
There are systems and processes to help with all of this of course, but at the end of the day it's still a pretty tricky job to get perfect all the time.
This comes from a variety of experiences: I used to manage a bug bounty for a mid-size company on Bugcrowd; in 2014 I surveyed people managing a bunch of programs across different sizes; I've participated in bug bounty programs for companies of different sizes.
The more you offer for rewards and the more recognizable your company name, the more you will be spammed by people submitting reports like (I kid you not): "You have the OPTIONS method allowed on your site this is really serious." The last time I looked at the numbers, Google had over 80,000 bug bounty reports per year, with about 10% of them being valid and maybe an order of magnitude fewer again being high severity (I'm fuzzy on the last bit). It's probably over 100,000 per year at this point. It's not uncommon for recognizable but smaller companies to receive one or more per day.
I'm aware of full-time security engineers at Facebook and Google who do almost nothing but respond to bug bounty reports. It's a lot like resumes - people who have essentially no qualifications, experience or (most importantly) a real vulnerability finding will nevertheless spam boilerplate bug reports to as many companies as they can. Take a look at the list of exclusions on a given program - you'll see that many of them explicitly call out common invalid findings that are so ridiculous it's kafkaesque.
HackerOne and Bugcrowd provide a lot of technical sophistication to prime companies for success, but there is an organizational component that is very difficult. If your program is very active, it requires dedication to tune it so you're not flushing engineer-hours away responding to nonsense. This is not to say they're bad - quite the opposite, I think they're fantastic. But I generally recommend smaller companies set up a vulnerability disclosure program through a solid third party, and do so without a monetary reward until they can commit to dealing with a reasonable deluge of reports.
One other thing that never really gets any press is the fact that a good chunk of the folks sending in reports are young people in impoverished nations. Some of them can be pretty tricky to deal with, but if you hold a hard line on professional expectations you can see them flourish in pretty short order to be some of the best reporters out there.
I only spent a short amount of time on the program I was with, but it was very rewarding. A+++, highly recommended.
Even at this low rate, it is not too bad. Let's say you receive 10 reports. You can relatively quickly identify the 8-9 noisy reports to find the 1-2 valid ones. Of course, a higher SNR is always better. It saves you time and effort.
On HackerOne, the average SNR across all programs is over 30%. The platform can automatically filter out certain reports that are duplicates or out of scope.
The platform maintains an average signal rating for each hacker (aka security researcher). Companies can limit access to their programs to hackers with a certain signal or higher. This will significantly increase SNR for the program.
Companies can also opt for a HackerOne program with triage included, in which case the SNR rises close to 100%.
So if a new user of the platform finds a valid or high-impact bug, they will be unable to report it... less noise, but a high-value bug goes unreported in that case...
I'd perhaps accept 'I was so deep in the code!' once from a very junior developer, as a learning experience.
To the person that reported a bug, that one report makes or breaks their entire opinion of your organisation. We lost customers because of poor communication, and on the other hand made some very happy repeat-customers even when we had to say 'we can't fix that yet' - but they were in the loop for the whole process and understood why.
Similar things are being applied on the "defensive" side of things already anyway (i.e. Iranian, Turkish, Chinese firewall systems using machine learning to identify and block new patterns), so why not apply this on the offensive side.
*: Not to demean the author in any way; I understand that putting the time in to explore these things is easier said than done in hindsight.
 - https://cloud.google.com/security-scanner/
Augmenting fuzzing with AI is an interesting approach.
I also did a subdomain search on Google a few weeks ago. I stumbled upon a lot of login sites.
A subdomain search led to 95 subdomains under corp.google.com.
I don't want to get sucked into it, I'm also closing the tab and going back to my terminal :).
I guess it's as much mindset as it's skill.
If, instead of just complaining, that commenter had taken the time to fill out a bug report, they could easily have gotten the bounty.
Sometimes it just takes a tiny bit of extra effort to go from noticing something's amiss to actually doing something to get it fixed.
Basically, Chrome allowed users to use the "Always open files of this type" option with executable files. So if anyone was ever foolish enough to set that option after downloading a `.exe` on Windows, any future site they visited could take over their machine just by initiating a download for a malicious executable.
>>> len("which is nothing more than a simple login page (seems to be for Google")
The bigger issue is, I think, font size. I could imagine that on certain displays this font might look rather small.
> Raneri questioned my motivation and I said that I want to give the vendor ample time to resolve the issue and then I want to publish academically. He was very threatened by this and made thinly veiled threats that the FBI or other institutions would "protect him". Then he continued with statements including "we want to hire you but you must sign this NDA first." He also recommended that I only make disclosure through FINRA, SDI, NCTFA and other private fraud threat sharing organizations for financial institutions.
I applied for a bug bounty, but alas was turned down as Go isn't a Google service and it wasn't in scope for the Patch Reward Program.
I did get into the hall of fame though!
edit: some support for that at https://bugs.chromium.org/p/chromium/issues/detail?id=548688...
It's a similar convention as some of their other names such as GRFE (Groups), bsfe (Blog Search Front-End).
There's invariably one comment like this on bug bounty stories here. One comment that isn't happy with the bug bounty result even when the researcher is and goes off the rails with a weird anti-large company bias and some conspiracy.
This has absolutely nothing to do with your Nexus RMA story, or your cloud SLA story from downthread, or whatever other agenda you have against these large companies.
Do you have any concept of what it's like to run a bug bounty at Google's size? Have you ever been involved in managing one? Have you ever participated in one? Can you qualify any of your opinion with something aside from these irrelevant grievances you're throwing out?
You're not contributing to the discussion at all, you're just hijacking the thread so you can perpetuate your soapbox. Human beings make mistakes and bug bounties are an easy place to drop the ball. No one is trying to cheat security researchers out of their rewards.
Not sure it's entirely the same as this case, but hijacking threads with "remember when this happened?" is not that unique.
Also sounds like you were dealing with humans, who (before escalating) followed company protocols.
Also seems not too far-fetched that those protocols are in place to weed out the whatever% that make bogus claims.
All in all, nothing I'd call "truly sad", "evil" and whatnot. Just big businesses being big businesses.
The fact is that the more people are involved in a process, the slower the cogs turn.
Having patience is more of a defense mechanism than it is anything else. It has served me well and made life easier!
But presumably different people look after the security side of things.
A startup’s Firebase bill suddenly increased from $25 to $1750 per month
That's not the same at all as their bug bounty program, which is generally one of the best out there.
When your report is out of scope, Google will not ignore your report. When there is a non-serious bug, you get acknowledged in the bug report they file internally. Finally, when they cannot replicate your finding, they will communicate that with you and stay patient until they can either replicate or close your report.
Edit: forgot to add that they raised the bounty with another 2k ("we updated our payouts") and they invited me to their Blackhat booth 1 year later.
Very "monkey on a typewriter". I was not even looking for security bugs, but studying usage of maia.css.