
Reducing search indexing latency to one second - i0exception
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second.html
======
fareesh
Indexing latency aside, when I search for something on Twitter it shows me
results as I type

When I tap on the result, there is invariably something else at the tap
location

What's the terminology for this? Flash of ephemeral search result?

Are there any good ways of avoiding this problem?

~~~
capableweb
This is something that seemingly none of the big platforms or companies get
right. Not Twitter, not Microsoft, not Firefox, not Apple, not a single one of
them.

They seem to have missed the lesson web developers learned ten years ago: you
can't change content the user is currently looking at without any interaction
on their part, unless the user expects it.

But auto-loading items into a list is somehow still difficult for these
companies to get right. Once you're done typing your query in the Windows
start menu, Apple's "quick launcher thing" or Twitter's search widget, it
takes X seconds for the query to actually finish, so while you're hovering
over the menu, the content changes underneath you and the thing you wanted to
click on has been switched to something else.

I'm starting to wonder if this is on purpose, since seemingly no one gets
this right, but I don't understand what metric they are (incorrectly)
optimizing for to believe this behavior is correct.
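One common frontend mitigation is to never replace a result list the user is
currently hovering over: buffer late-arriving results and swap them in only
once the pointer leaves. A minimal sketch of that policy (the class and method
names here are hypothetical, not any real framework's API):

```python
class SearchDropdown:
    """Toy model of a result list that never reshuffles under the cursor."""

    def __init__(self):
        self.results = []    # what the user currently sees
        self.pending = None  # late-arriving results, buffered while hovering
        self.hovered = False

    def on_results(self, results):
        if self.hovered:
            # Don't yank content out from under the pointer; buffer it.
            self.pending = results
        else:
            self.results = results

    def on_hover(self, hovered):
        self.hovered = hovered
        if not hovered and self.pending is not None:
            # Pointer left the list: now it's safe to apply buffered results.
            self.results, self.pending = self.pending, None
```

The same idea works for keyboard focus: treat "user is interacting with the
list" as the condition that freezes it.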

~~~
Viliam1234
I am so angry at this behavior on Firefox.

I type the first letter. Firefox displays 10 options. I think they are
bookmarks and recent Google searches containing the letter. The ordering kinda
corresponds to how recently or how often I used them, but it is not exactly
that; for example, the link I choose almost always somehow remains in 2nd
place. Whatever.

I type the second letter. Again, Firefox displays 10 options: bookmarks and
recent Google searches that contain the substring. At that moment, the link I
want to click is quite often displayed somewhere. So I move my mouse...

...but a fraction of a second later, Firefox loads "things it thinks I might
want to search for" (probably popular searches by other people that start
with the two letters I typed) and inserts them into positions 2-5, pushing
the other results below them...

...and just as I am pressing my mouse button, the link below my mouse cursor
is replaced by google search for something completely irrelevant.

Aaaaargh!

(I wonder whether it is an accident, or on purpose, that the most frequently
used link goes in 2nd place, not 1st, so it gets replaced by an irrelevant
Google search. I suspect it probably increases some metric somewhere, and is
probably interpreted as a good thing.)

~~~
kzrdude
Why not disable search suggestions? Having every keypress sent to google
doesn't seem great.

~~~
Viliam1234
Er... yeah, that makes sense. Thanks, I turned it off. I didn't realize this
was optional.

------
stereosteve
This is excellent.

I was recently reviewing Lucene concepts and found this video really good:
[https://www.youtube.com/watch?v=T5RmMNDR5XI](https://www.youtube.com/watch?v=T5RmMNDR5XI)

Also this site has a series of Lucene articles that are pretty nice. The one
on Term Vectors in particular:
[http://makble.com/what-is-term-vector-in-lucene](http://makble.com/what-is-term-vector-in-lucene)

Based on some quick research, it seems like Lucene already uses a sorted skip
data structure for the posting list, so I wonder why they needed a custom
implementation. Perhaps it has to do with their custom document ID scheme and
wanting a different ordering in the posting list than the default behavior
provides. It also sounds like searchers are searching on indexes as they're
being written, with some custom coordination around visibility, which might
require diverging from Lucene's default behavior.

Either way, pretty impressive!

~~~
inertiatic
Interestingly, Elasticsearch exposes something similar to this, based on a
Lucene-level feature that enables index-level sorting.

You can specify such an index-level default sort (similar to what they use
custom IDs to achieve) and it will use skip lists to make searching with that
sort faster. It imposes an indexing overhead, but I would guess that for use
cases like this it could make sense.

[https://www.elastic.co/blog/index-sorting-elasticsearch-6-0](https://www.elastic.co/blog/index-sorting-elasticsearch-6-0)
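For reference, index sorting is configured in the index settings when the
index is created. A minimal sketch using the `index.sort.field` and
`index.sort.order` settings from the linked post (the index and field names
are made-up examples, shown with a 7.x-style typeless mapping):

```json
PUT /my_index
{
  "settings": {
    "index": {
      "sort.field": "timestamp",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}
```

With this in place, queries sorted by `timestamp` descending can terminate
early instead of scoring every matching document.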

------
tpmx
Google: "Yah, we did that 20 years ago on mechanical hard drives."

Seriously though: Google built realtime indexing a very long time ago.

I co-implemented a small-scale (like 100k pages) full text search engine about
20 years ago with _a lot_ of inspiration from the 1998 paper "The Anatomy of a
Large-Scale Hypertextual Web Search Engine".

I had always assumed Google used 2-3 layers, sort of like in hierarchical
storage management (HSM): fresh data stored in RAM and older data stored on
HDDs, combined during the query step. I was itching to have a go at
implementing that, but it wasn't really required for our use case.
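That HSM-style layering can be sketched as two tiers that are merged at query
time: fresh documents land in an in-memory tier, which is periodically flushed
into an older tier, and a search consults both. All names here are
illustrative; this is a toy, not Google's actual design:

```python
class Tier:
    """One storage tier: term -> sorted list of doc IDs."""

    def __init__(self):
        self.postings = {}

    def add(self, doc_id, terms):
        for t in terms:
            self.postings.setdefault(t, []).append(doc_id)

    def lookup(self, term):
        return self.postings.get(term, [])


class TieredIndex:
    def __init__(self):
        self.memory = Tier()  # fresh documents land here first
        self.disk = Tier()    # older documents, flushed from memory

    def index(self, doc_id, terms):
        self.memory.add(doc_id, terms)

    def flush(self):
        # Merge the in-memory tier into the "disk" tier, then reset it.
        for term, ids in self.memory.postings.items():
            self.disk.postings[term] = sorted(self.disk.lookup(term) + ids)
        self.memory = Tier()

    def search(self, term):
        # Query both tiers and merge results, newest doc IDs first.
        hits = set(self.memory.lookup(term)) | set(self.disk.lookup(term))
        return sorted(hits, reverse=True)
```

The real trick in such systems is coordinating flushes with concurrent
queries so no document is missed or double-counted mid-flush.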

~~~
H8crilA
I believe that these days many big enough latency-optimized systems use a
hierarchy of storage solutions (magnetic, flash, RAM).

Here's a useful rule of thumb: the per-byte costs of RAM:flash:magnetic are
approximately 100:10:1.

------
simonw
This is a really good piece of technical writing. I particularly enjoyed the
explanation of skip lists.

~~~
diehunde
One of my favorite data structures. I was learning about the internals of
LevelDB/RocksDB and noticed they use skip lists as the default for the
in-memory store. They're an alternative to balanced trees (AVL/red-black) but
way easier to implement and understand. Invented in the late '80s and still
widely used in modern tools.
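A minimal sketch of the idea (illustrative only, not LevelDB's actual
implementation): each node carries a randomized tower of forward pointers, and
a search descends from the top level, skipping over whole runs of nodes before
dropping down:

```python
import random


class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one forward pointer per level


class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel, no key
        self.level = 1

    def _random_level(self):
        # Coin flips: each extra level with probability 1/2.
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        # Record the rightmost node before `key` at every level.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

The appeal for concurrent stores is that inserts only touch a handful of
pointers, with no tree rebalancing.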

------
FlashBlaze
Is there an engineering blog where they describe how they manage to show the
exact timeline from where I left off each time I open the app?

~~~
Nican
If I remember correctly, every time a tweet is posted, a fan-out operation
adds the tweet to each follower's timeline. I would think the app just needs
to remember your position in your own timeline?
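A toy sketch of that combination, fan-out on write plus a remembered read
position (all names here are hypothetical; Twitter's real system is far more
involved, as the link below discusses):

```python
from collections import defaultdict


class Timelines:
    def __init__(self):
        self.followers = defaultdict(set)   # author -> follower IDs
        self.timeline = defaultdict(list)   # user -> tweet IDs, oldest first
        self.last_seen = {}                 # user -> last tweet ID they read

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, tweet_id):
        # Fan-out on write: push the tweet into every follower's timeline.
        for f in self.followers[author]:
            self.timeline[f].append(tweet_id)

    def read(self, user):
        # Resume from the remembered position, then advance it.
        tl = self.timeline[user]
        start = 0
        if user in self.last_seen:
            start = tl.index(self.last_seen[user]) + 1
        unread = tl[start:]
        if tl:
            self.last_seen[user] = tl[-1]
        return unread
```

The `last_seen` cursor is the "position" part; fan-out is what makes reading
a timeline a cheap sequential fetch.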

~~~
jaysh
That's part of it, but not the entire solution.

[http://highscalability.com/blog/2013/7/8/the-architecture-tw...](http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html)

> Outliers, those with huge follower lists, are becoming a common case.
> Sending a tweet from a user with a lot of followers, that is with a large
> fanout, can be slow. Twitter tries to do it under 5 seconds, but it doesn’t
> always work, especially when celebrities tweet and tweet each other, which
> is happening more and more. One of the consequences is replies can arrive
> before the original tweet is received. Twitter is changing from doing all
> the work on writes to doing more work on reads for high value users.

------
Nican
Doing a quick Google search, there are claims of "6,000 tweets per second in
2020", or about 6 tweets per millisecond. The blog post mentions there is an
edge case for handling more than 16 tweets per millisecond.

Rather close margins assuming exponential usage growth for Twitter. I wonder
how long that invariant is going to last.

~~~
ntonozzi
We left out some detail here to keep the post readable. In reality, we can
handle 16 Tweets per millisecond per partition. We have tens of partitions,
so that makes the edge case far less likely. If we do see gigantic growth in
the Tweet rate, we would need to increase the number of partitions anyway, so
we decided this edge case was acceptable.
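To illustrate where a per-millisecond cap like 16 could come from, here is a
hypothetical sortable doc-ID scheme that packs a millisecond timestamp with a
small per-millisecond sequence counter (4 bits gives exactly 16 slots). This
is only a guess at the general shape of such a scheme, not Twitter's actual
ID format:

```python
SEQ_BITS = 4                    # 2**4 = 16 documents per millisecond
SEQ_MASK = (1 << SEQ_BITS) - 1


def make_doc_id(timestamp_ms, seq):
    # Timestamp goes in the high bits, so numeric order matches time order.
    if not 0 <= seq <= SEQ_MASK:
        raise ValueError("per-millisecond sequence overflow")
    return (timestamp_ms << SEQ_BITS) | seq


def split_doc_id(doc_id):
    return doc_id >> SEQ_BITS, doc_id & SEQ_MASK
```

Two documents in the same millisecond are ordered by the sequence bits, and
the 17th one in a millisecond is exactly the overflow edge case discussed
above.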

