
Applying BERT models to Search - moultano
https://blog.google/products/search/search-language-understanding-bert/
======
colechristensen
Maybe they're getting better at natural language, but for me Google searches
have been getting gradually worse year after year. I want Google, not
AskJeeves.

The "frustration" is "increasing" "when" I "have" to "quote" nearly every
"word" to get Google to actually return results with what I searched for
instead of what it thinks I meant to search for.

And there's the frustration: computing today tries as hard as it can to figure
out what it thinks I actually meant. I don't know which is worse: that a person
who knows what they want can't get it when the computer disagrees, or that the
computer is mostly right and its algorithms start pushing your desires in its
own direction, toward whatever motive it has.

Facebook already does this, really: radicalizing people by engineering the most
dopamine-driving content to the top, pushing them either towards self-obsession
or an us-vs.-them bubble.

In other words, I just want a fucking regular expression instead of our new
data-science overlords ruining our minds with artificial non-intelligence (for
profit).

~~~
londons_explore
When you quote the words, do you find what you're looking for?

In my experience, the times Google seems to have totally missed the point of
what I'm looking for are usually the times when the answer I'm looking for
isn't anywhere on the web. Things like "datasheet JK45690DFS" or "Types of
asphalt available for local delivery today".

I wish Google had some way to understand your query and the results well
enough to just be able to say "The answer isn't available on the internet".

~~~
pm215
Yes, usually it's because there are maybe 4 documents which match the search
query. But if there are only 4 documents which match what I asked for, then I
want to see a page with those 4 documents! I don't want to see a page with 10
results that I have to manually scan through, only to find that more than half
of them are irrelevant rubbish because they aren't hits for what I was
searching for.

------
bratao
BERT is truly amazing. Almost all innovation in NLP uses BERT and transformers
somehow. ALBERT will be the next HUGE thing in the coming months, as it shows
better results than BERT with a small fraction of the parameters.

We did a "Semantic Similarity search" for some documents, where we represent a
document as a vector using BERT, and had to look for documents close to a
reference document.

The results were breathtaking. It really returned semantically similar
documents. You can do it now using ElasticSearch (but you really should do it
using Vespa.ai, which is much faster: [https://github.com/jobergum/dense-vector-ranking-performance](https://github.com/jobergum/dense-vector-ranking-performance)).
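
For anyone curious, here is a minimal sketch of that kind of pipeline, assuming
bert-base-uncased with mean pooling over the final hidden states (the model,
pooling strategy, and example documents are illustrative choices, not
necessarily what we used):

    # One mean-pooled BERT vector per document, ranked by cosine
    # similarity to a reference document.
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text):
        # Mean-pool the final hidden states into a single 768-dim vector.
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, tokens, 768)
        return hidden.mean(dim=1).squeeze().numpy()

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    docs = ["commercial lease agreement for office space",
            "rental contract for a downtown storefront",
            "sourdough bread recipe with overnight proofing"]
    reference = embed("lease terms for commercial property")
    ranked = sorted(docs, key=lambda d: -cosine(reference, embed(d)))
    print(ranked[0])  # most semantically similar document

At scale you would precompute the vectors and hand the nearest-neighbor search
to an engine like Vespa or ElasticSearch, as above.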

~~~
woadwarrior01
That's very interesting! If you have the time for it, you should consider
experimenting with swapping SpanBERT[1] in for BERT in your use case. It trains
on full-length segments instead of the masked half-segments used in BERT. I
suspect that this, besides the improvements SpanBERT brings over BERT, should
let you feed bigger chunks (more sentences) into the model before the averaging
step, leading to fewer vectors to average and, as a result, perhaps better
clustering.

[1]: [https://arxiv.org/abs/1907.10529](https://arxiv.org/abs/1907.10529)

~~~
bratao
Thank you, I will read and try it. Looks very interesting!

------
binarymax
I've been working on using BERT for search, for research and
training/development, with not-so-great results.

Note the quote "when it comes to ranking results, BERT will help Search better
understand one in 10 searches". This is because of the "keywordese" point they
noted earlier in the article. Most searches are 1 or 2 words - there isn't
enough to grab onto for meaningful ranking with short queries and a similarity
function for longer text documents.

Also, consider what it takes to keep the systems for search like this afloat.
BERT is not practical to use for search results by anyone without the scale of
a company like Google. You need a server farm of GPUs to translate all your
documents into tensors - and then keep them around somehow! A document of 10k
text will balloon to ~1MB when converted to a multi-token vector
representation. BERT uncased has 768 features - that's 768 floats per token you
need to keep around. If you compress it using PCA or averaging across tokens,
you lose all the juicy context that you need for the matching and ranking.
Also, there currently isn't a good way to keep this stuff around (though
there are active projects ongoing to get this into Lucene [1], [2]).
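
To make the storage point concrete, a rough back-of-envelope, assuming float32
and one vector per token (the exact figure depends on tokenizer, precision, and
document length; lower precision or pruning pulls it down toward the ~1MB
above):

    # Rough storage cost for per-token BERT-base vectors.
    HIDDEN_SIZE = 768      # BERT-base hidden dimension
    BYTES_PER_FLOAT = 4    # float32; float16 would halve this

    per_token = HIDDEN_SIZE * BYTES_PER_FLOAT     # 3,072 bytes per token
    tokens_in_doc = 2500                          # e.g. a ~10k-character document
    print(per_token * tokens_in_doc / 1e6, "MB")  # ~7.7 MB at full precision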

I think this is definitely a great achievement in NLP - but it needs
breakthroughs in other areas to be useable by product teams implementing
search, with any reasonably large content size.

[1] [https://arxiv.org/abs/1910.10208](https://arxiv.org/abs/1910.10208) & [https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md](https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md)

[2] [https://github.com/o19s/hangry](https://github.com/o19s/hangry)

~~~
pheug
Distillation is usually used today to tame BERT's resource problems at scale -
you run BERT to squeeze the maximum signal out of your training data, and then
distill the model, e.g. into a cheap CNN, for inference.
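
A minimal sketch of the core idea, assuming a classification setup and the
standard temperature-scaled soft-target loss (architectures and temperature are
illustrative):

    # Knowledge distillation: train a small student to match the
    # teacher's (e.g. BERT's) softened output distribution.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        # KL divergence between temperature-scaled distributions;
        # the T*T factor rescales gradients back to the usual magnitude.
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_student, soft_teacher,
                        reduction="batchmean") * T * T

In practice you would mix this with the usual hard-label loss on the student.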

~~~
binarymax
Distillation reduces accuracy and removes the contextual precision. For
example, reducing a whole document to some N (1k or so) dimensions has worked
very poorly in my experiments for short queries - typically making the
relevance worse than basic keyword search.

~~~
sdenton4
You might try vector quantization (instead of PCA) if you just need your 768
features to be smaller. ML features tend to be robust to some perturbation.
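
As a sketch of what that might look like, a simple k-means codebook over the
embeddings (illustrative only; production systems usually reach for product
quantization, e.g. via FAISS):

    # Vector quantization: replace each 768-dim float vector with a
    # 1-byte codebook index, reconstructing an approximation on demand.
    import numpy as np
    from sklearn.cluster import KMeans

    vecs = np.random.randn(10000, 768).astype(np.float32)  # toy embeddings
    codebook = KMeans(n_clusters=256, n_init=4).fit(vecs)
    codes = codebook.predict(vecs)             # ~3 KB/vector -> 1 byte/vector
    approx = codebook.cluster_centers_[codes]  # lossy reconstruction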

~~~
binarymax
Well, it's one problem or another. If you compress too much you lose the value,
and if you leave it too large you have the size problem.

Inverted indices are very efficient. How much of that can you give up, and at
what trade-off? If I'm only going to be better for 10% of queries, is that a
cost-effective solution? What if I spend the same amount of time tuning a
traditional engine a bit more and get better accuracy for 5% of queries?
Trade-offs rule the world of practical search implementations.
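
For contrast, the baseline being defended here is roughly this (a toy inverted
index; real engines add compression, scoring, and skip lists):

    # Toy inverted index: term -> posting set of document IDs.
    from collections import defaultdict

    docs = {0: "cheap asphalt delivery",
            1: "asphalt datasheet download",
            2: "sourdough bread recipe"}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    # Conjunctive query: documents containing every term.
    query = ["asphalt", "delivery"]
    print(set.intersection(*(index[t] for t in query)))  # {0}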

------
sargram01
Search, and a lot of AI-based systems these days, feels like talking to a
hard-of-hearing grandparent: as long as you're saying roughly what they expect,
it's fine, but if there are any nuances or homonyms it turns into a comedy
routine.

------
codingslave
I wish Google would release google2008.com

On that website, they keep using their technology from 2008 and let me use it
to search for what I want. I've had enough.

~~~
windsurfer
One issue with that is that there has been an arms race since 2008 with
"search engine optimization". If you used a crawler from 2008 on today's web,
you'd probably get a ton of spam.

~~~
codingslave
The results for half of my searches are already all spam. It would be better
to just accept that and have Google let me engineer my queries so that I avoid
the spam. More token-based searches, less word-vector/machine-learning-based
search. Let me query their index like it's an SQL database.

~~~
windsurfer
Are you sure they are spam? Google's engineers consider annoying things like
recipe pages with 20 paragraphs of stories before the actual recipe _not_ to
be spam.

~~~
buboard
I am being tricked into Pinterest link-pages multiple times a day. That is
spam, and it has very little to do with their AI work.

------
Pick-A-Hill2019
An interesting snippet from TFA - "..with this release, anyone in the world
can train their own state-of-the-art question answering system (or a variety
of other models) in about 30 minutes on a single Cloud TPU, or in a few hours
using a single GPU." [https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)

------
braindead_in
How long till NLG becomes good enough that it can answer questions factually?
I think integrating it with a knowledge graph might just make search obsolete.

~~~
ur-whale
I might be mistaken, but I believe that's already what Google is doing for a
lot of factual knowledge (as a matter of fact, they do call it a knowledge
graph).

Try, for example, typing "when was jfk born" into Google; you should see a
factual answer fished from the knowledge graph.

------
aasasd
> _In fact, that’s one of the reasons why people often use “keyword-ese,”
> typing strings of words that they think we’ll understand, but aren’t
> actually how they’d naturally ask a question._

Funny thing, I've seen people mention right here on HN that for DuckDuckGo you
need to adjust the style of your queries. This notion puzzles me, probably
because I kept the habit of ‘keywordese’ from the olden days. Most of the
time, results are about the same for me in Google as they are in DDG and even
in Yandex—with the exception that Google is better at grouping related or
similar results, and that if one or two sources contain the search phrase
almost verbatim, they're at the top in Google. Apparently I'd need to learn to
talk to the site as if it were self-aware before I could regret ditching it.

Now, if some wondertech helps me to home in on the answer to my exact software
or programming troubles instead of hundreds of vaguely related SO posts—I
could really dig that.

~~~
luckylion
Funny that you mention Google's grouping and related SO posts... I still
regularly run into the annoyance that SO hasn't figured out how to do
canonical URLs, and Google then proceeds to put identical content from SO in
two consecutive spots on the results page. SO does not want to fix it (it has
been an issue for years and has been pointed out on meta many times), but I'm
confused by Google's failure to consider the pages duplicates when their
content is nearly identical (the difference being just "hot network questions"
in the sidebar etc., which depend on the time of the page request).

------
6gvONxR4sf7o
I wonder how this affects their costs of serving a query. BERT isn't exactly
computationally cheap to evaluate.

~~~
tanilama
It is Google; this is probably their secret sauce. I wouldn't be surprised if
someday Google makes a custom ASIC chip just to run transformer models.

~~~
6gvONxR4sf7o
That's actually old news, and mentioned in the OP. They build their own chips
(TPUs) that are super cool. You can even use them as part of Google Cloud!
Still, moving to BERT ain't cheap.

~~~
tanilama
TPUs are old news. I think the actual news would be a chip
customized/optimized to run just BERT.

The Transformer architecture itself has stayed mostly unchanged in the 2 years
since it was proposed, and with BERT and its variants, most (competitive) NLP
models are now Transformer-based. It makes sense to make custom chips just to
run Transformers, the same as for CNNs.

~~~
sdenton4
Meh, I'm not sure how much more there is to do to specialize for the
transformer specifically. TPUs and GPUs are mainly just fantastic matrix
multipliers, and the transformer is partly designed with this hardware in
mind: the operations are basically the same ones you see in a CNN. In fact,
one of the nice parts of the transformer is that you can run it without an
RNN, making it even better suited to the matrix multipliers.
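
As a toy illustration of that point (dimensions are arbitrary; this is just
the scaled dot-product attention core, not a full transformer):

    # Self-attention reduces to matrix multiplies plus a softmax -
    # exactly the workload TPU/GPU matmul units are built for.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    tokens, d = 8, 64
    Q = np.random.randn(tokens, d)
    K = np.random.randn(tokens, d)
    V = np.random.randn(tokens, d)

    scores = Q @ K.T / np.sqrt(d)   # matmul 1: attention scores
    output = softmax(scores) @ V    # matmul 2: weighted values
    print(output.shape)             # (8, 64)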

Furthermore, TPUs are a moving target themselves: as ML needs change, the team
builds new operations and optimizations into the next generation of chips.

------
lexpar
Does anyone have a nice resource they recommend on what BERT does? I've
gathered it was trained by trying to predict missing words in a sentence, but
I don't have an intuition on how this is useful for downstream prediction
(like, say, learning a word embedding is).

~~~
metasyn
I appreciated this blog post: [http://jalammar.github.io/illustrated-bert/](http://jalammar.github.io/illustrated-bert/)
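
That post covers the intuition well. For a quick hands-on feel for the
masked-word pretraining objective, something like this works (assuming the
Hugging Face transformers library; the example sentence is arbitrary):

    # BERT's pretraining objective in action: predict the [MASK] token.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill("The capital of France is [MASK]."):
        print(pred["token_str"], round(pred["score"], 3))

The connection to downstream tasks: filling in blanks well forces the model to
build context-dependent representations of every word, and those
representations transfer.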

~~~
lexpar
Thanks!

------
breadandcrumbel
Is there any public information on how BERT is actually being applied to IR?

For each of the scenarios they described, it's just "here's a potentially hard
search query, and BERT adds magic language understanding which makes it all
better". It's non-obvious how BERT is actually being used, though, especially
at the scale and latency they need.

(I get that this is Google's "secret sauce" and they might not say anything
about this particular use of BERT. But I'm curious whether anyone has seen
anything related.)

------
ma2rten
This blog post is very light on details. It doesn't say at all at which stage
in the search process BERT is used.

------
vagab0nd
Totally off-topic. Is it just me, or is this "People also search for" the
worst feature ever?

For those who haven't noticed, this is the box that shows up under a search
result when you return to google from the page you've gone to. 50% of the time
I click the back button, this freaking box shows up, the whole page shifts,
and I click the wrong link.

Oh and did I mention I never ever, not even once, actually use it?

------
amiga-workbench
>By using new neural networking techniques to better understand the intentions
behind queries

Great, so the search results are going to get even worse?

~~~
skyyler
I struggle to find substance in your comment.

Could you perhaps expand with a few supporting details for your thesis?

~~~
amiga-workbench
Google works well when you are searching for things you don't know much about:
your input is imprecise and it generally points you in the right direction.

However, in the opposite case, when you are trying to find something highly
specific, even down to an exact substring match, I find the results to be very
poor.

~~~
skyyler
Aww :-(

I suppose you'll just have to make your own...

------
deepnotderp
BERT is such a great case of Clever Hans. If you scrub the shallow statistical
similarities from question-answer sets, accuracy drops significantly.

------
vuln
Now Google is trying to read my mind, my thoughts, and my intentions. I've
been running away from EVERYTHING Google.

~~~
packetslave
And yet...

    $ host -t mx vuln.ninja
    vuln.ninja mail is handled by 10 alt3.aspmx.l.google.com.
    vuln.ninja mail is handled by 1 aspmx.l.google.com.
    vuln.ninja mail is handled by 5 alt1.aspmx.l.google.com.
    vuln.ninja mail is handled by 5 alt2.aspmx.l.google.com.
    vuln.ninja mail is handled by 10 alt4.aspmx.l.google.com.

Maybe you should try running a bit faster.

~~~
Pick-A-Hill2019
Ouch. The burn.

