
Open-Sourcing Vespa, Yahoo’s Data Processing and Serving Engine - mkagenius
https://www.oath.com/press/open-sourcing-vespa-yahoo-s-big-data-processing-and-serving-eng/
======
madmax108
Yahoo time and again releases open source software which is super super
helpful to the community at large. But it always makes me wonder why an org
with such an amazing engg culture (multiple anecdotes from friends who were at
Yahoo, plus the amazing experiences at Yahoo OpenHack each year as a testament
to this) could be run into the ground.

Really goes to show that engg != business and unless you have a firm business
model and growth, an amazing engg team can only get you so far.

I know I'm just stating the obvious, but just putting it out there!

Vespa looks super interesting (more so since I'm in a company that provides
ecommerce search APIs as a product) and I'm sure I'll play with it more.
Thanks Yahoo! :)

~~~
gyehuda
The Register did a good job describing the dichotomy between the product
business and the tech side of the business here, related to a previous open
source project Yahoo published (disclaimer, I run the open source process at
Yahoo)
[https://www.theregister.co.uk/2017/03/23/yahoo_tensorflow_on...](https://www.theregister.co.uk/2017/03/23/yahoo_tensorflow_on_spark/)

"Over the decades Yahoo! has contributed substantially to the greater good,
publishing its own code as open source.

Arguably Yahoo!’s greatest legacy once it is a division of Verizon will be big
data, after one of its engineers – Doug Cutting – wrote an open-source
implementation of Google’s MapReduce that became Hadoop. What followed was an
entire ecosystem of startups and projects crunching data at scale – Cloudera,
Hortonworks, MapR to name three in a market some calculate will be worth $50bn
by 2020."

~~~
runT1ME
What do you mean you run the Open Source process at yahoo?

Hopefully Verizon's open source legacy will become even bigger and make it
easier for these kinds of innovations to happen.

We've had some success a few cities over in Verizon Labs...
[https://verizon.github.io/](https://verizon.github.io/)

~~~
gyehuda
My job is to manage the open source process for Oath (which is essentially
Yahoo + AOL). That includes helping ensure we can publish code like this and
the hundreds of other projects we publish too. I'm the one who cares about
open source licenses, patent clauses, github permissions, etc. Many large tech
companies have someone in a comparable role and some of us work together in
the todogroup to help manage the way we do opensource. I'm beginning to meet
the people in Verizon who do the same. I hope their open source legacy grows
too. Heck I celebrate when Google, Amazon, and Comcast publish great code too.
It's good for us all. But Vespa is a real treat. It's really really special to
Yahoo and we are very hopeful that the Big Data community sees how many things
they can do with this, at scale, the way we have.

~~~
caniszczyk
Here's some info if you're interested in starting your own open source program
c/o the TODO Group:
[https://github.com/todogroup/guides](https://github.com/todogroup/guides)

------
tedd4u
At Flickr, we worked closely with the Vespa team from 2011 through 2016 on a
wide range of advancements:

    
    
       * partial document refeeding (i.e. expedite indexing a new field to 20+ billion documents without refeeding everything and staying online handling 100M+ free text queries a day)
       * visual similarity search - check out the tensor ranking features [1] [2]
       * online elasticity - add/remove replicas / shards online. A must when it could take weeks+ to re-feed from scratch. This is non-trivial to make work smoothly at scale. 
       * latency / tail-latency on complex queries. p90 reduction from 3,000 to 30 ms.
    

This is a major gift to the open-source community of a battle-tested search
engine that works reliably without babysitting with very large datasets, and
simultaneous high query / high feed volumes. Huge debt of gratitude to the
team in Trondheim and Verizon/Oath/Yahoo legal & management teams for making
this happen. :+1:

[1] [http://docs.vespa.ai/documentation/tensor-intro.html](http://docs.vespa.ai/documentation/tensor-intro.html)
[2] [http://docs.vespa.ai/documentation/tensor-user-guide.html](http://docs.vespa.ai/documentation/tensor-user-guide.html)

------
aargh_aargh

      $ cloc-git https://github.com/vespa-engine/vespa.git
      http://cloc.sourceforge.net v 1.60  T=64.10 s (224.5 files/s, 28276.3 lines/s)
      --------------------------------------------------------------------------------
      Language                      files          blank        comment           code
      --------------------------------------------------------------------------------
      Java                           6573         106215         102720         537097
      C++                            3209          76542          19178         504855
      C/C++ Header                   2985          42731          57087         158388
      XML                             389            705            550         139626
      Maven                           141            133            244          14096
      CMake                           450            254            560           8452
      Perl                             57           1124            762           7649
      Bourne Shell                    196           1257            734           6918
      Scala                            95           1685            617           6378
      Teamcenter def                  234           1474           3490           2468
      Lisp                              4            231            403           2118
      HTML                             16            211             29           1950
      C                                 7            288            198           1432
      Python                            6            132             66            556
      Ruby                              9             39              9            294
      Bourne Again Shell                3             35             12            182
      Pig Latin                         9             39             52             54
      make                              2             22              8             39
      Ant                               1              9             17             36
      YAML                              1              9              1             22
      DTD                               2              6              6             10
      --------------------------------------------------------------------------------
      SUM:                          14389         233141         186743        1392620
      --------------------------------------------------------------------------------

~~~
zimpenfish
As an ex-employee, there could not be a better description of Yahoo!
development than this.

~~~
praneshp
Hey, there is no yinst/buildyblocks stuff here. That's kind of a must-have.

~~~
zimpenfish
> yinst

[FLASHBACKS]

To be fair, yinst was the least worst part of the systems I was working on
(Yahoo!Europe backend feeds stuff.)

------
elvinyung
This is really cool. Vespa was probably first described in this 2007 paper:
[https://brage.bibsys.no/xmlui/bitstream/handle/11250/251199/...](https://brage.bibsys.no/xmlui/bitstream/handle/11250/251199/348506_FULLTEXT01.pdf)

Next up, I would really like to see Sherpa/PNUTS (their NoSQL operational
database) and Everest (their petabyte-scale Postgres data warehouse) open
sourced :)

~~~
FractalNerve
May I ask some stupid questions? :/

I don't quite get the diagram of the Vespa Architecture. Is Vespa a middleware
between database engine and query parser? This is what puzzles me.

If so, are there other such middlewares available for e.g. PostgreSQL that
allow hooking "Query Templating Models" (that is it?) generated via Machine-
Learning Models? Is it way more complicated than that, or did they
overengineer the problem into a monolith? EDIT: Looking at
[https://github.com/vespa-engine/vespa](https://github.com/vespa-engine/vespa)
it seems that it is overengineered, or maybe it consists of individual micro-
components like node.js, hmm more questions :(

Is GraphQL such middleware or lower-level?

Does Vespa replace custom Glue-Code between Backend and Frontend that
generates such query-sets for content ranking/positioning?

Or what exactly does Vespa solve? I'm sorry, I've read the article, but can't
say, yep that's what it is!

EDIT: How else could you solve what Vespa does using Rust, Go, or C/C++
libraries? A very simple or general direction would be immensely useful to
understand Vespa =) The project gives the simultaneous impression of an immense
engineering feat and at the same time a huge code debt.

~~~
FractalNerve
> How else could you solve what Vespa does using Rust, Go, or C/C++ libraries?

Let me try answering my own question myself. I hope someone hops in and tells
me where I'm wrong or how else to improve :)

    
    
         1) Get PostgreSQL extensions via "package manager" pgxnclient
         1.1) pg_bouncer - For connection pooling
         1.2) yoke - As a high-availability cluster manager with auto-failover and automated cluster recovery
         1.3) prestodb.io - Distributed SQL query engine for pgsql
         1.4) pglogical - Logical streaming replication for using a publish/subscribe model
         1.5) pg_lambda - To create your own AWS (meta) Lambda
         1.6) pg_strom - To offload tasks to the GPU
         1.7) zombodb - To utilize full-text searching via indexes backed by Elasticsearch
         2) Put it all together with pglogical and presto to separate GPU/CPU intensive tasks.
         2.1) "Build Missing Middleware" - To design/fuse a query visually that combines multiple backends
         2.1.1) Create a binary data-stream by integrating pg_lambda, pg_strom, presto and zombodb
         2.1.2) "Build Missing Middleware" - A tensor processing extension to use ML Model evaluations
         2.1.3) "Use Missing Middleware" - For data-processing via Machine-Learning models
         2.1.4) "Use Missing Middleware" - To output ML processed results into the database
         2.2) Partition these queries using "pg_lambda + middleware" to create accelerated and fused query results
    

So what's missing to create a Vespa alternative using existing technologies is
everything in Point 2) if I'm not mistaken. Torrent based replication isn't
exactly necessary, except at Twitter/Facebook scale, but if you reach that
stage you can hire a libtorrent author.

~~~
FractalNerve
I now think basing this on PostgreSQL was wrong, and believe that a meaningful
approach to creating a Vespa alternative yourself is basing it on Content-
Addressable Storage[1] and adding a DB layer on top (e.g. using AUFS).

It would have the following properties: decentralized, distributed, resilient,
highly-available, software-defined storage & retrieval system.

According to [http://vespa.ai/#featurematrix](http://vespa.ai/#featurematrix):

    
    
            FEATURE                      VESPA   ELASTIC SEARCH   RELATIONAL DATABASES
            ACID transactions                                     •••
            Optimized for analytics              •••              ••
            Optimized for serving        •••     •                ••
            Scalable                     •••     ••               •
            Easy to operate at scale     ••                       •
            Text search                  •••     ••               •
            Machine learned ranking      •••     •                2.1.2) - 2.1.4)
            Middleware logic container   •••                      1.4)
            Live reconfiguration         •••                      1.2)
    

And yet I've to admit that even if the Github repository looks quite chaotic,
making an alternative, even using existing technologies, would be a big feat.

Initially I would've chosen PostgreSQL as a base, but the "HA layer" is
something that shouldn't be decoupled or treated as an afterthought. That's why
CAS is a much better form of integration. Also, integrating the PostgreSQL
engine into, e.g., a ZFS kernel extension would be a mess. And integrating the
database engine into a distributed p2p algorithm would only add compatibility
issues and no real advantages.

[1] [https://en.wikipedia.org/wiki/Content-
addressable_storage#Op...](https://en.wikipedia.org/wiki/Content-
addressable_storage#Open-source_implementations)

PS: Clever acquisition by Docker! "Infinit.sh is a content-addressable and
decentralized (peer-to-peer) storage platform that was acquired by Docker
Inc." And in my eyes one of the best implementations and easiest targets that
allow adding a database layer on top.

------
gyehuda
"Vespa is the single greatest piece of software Yahoo ever built. It's like
ElasticSearch but a hundred times better. I am so happy." Laurie Voss Co-
founder/COO of @npmjs
[https://twitter.com/seldo/status/912876700542787585](https://twitter.com/seldo/status/912876700542787585)

------
076ae80a-3c97-4
"When machines are lost or new ones added, data is automatically redistributed
over the machines, while continuing serving and accepting writes to the data.
Changes to configuration and Java components can be made while serving by
deploying a changed application package - no down time or restarts required."
That sounds pretty impressive.
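
To make the redistribution idea in that quote concrete, here is a minimal Python sketch of rendezvous (highest-random-weight) hashing, one generic way to place documents on nodes so that adding a node only moves the documents the new node takes over. This is an assumption picked for illustration, not a description of Vespa's actual distribution algorithm; the node names, document ids, and the `owner` and `placement` helpers are all hypothetical.

```python
import hashlib


def owner(doc_id, nodes):
    """Return the node with the highest hash score for this document."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{doc_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)


def placement(doc_ids, nodes):
    """Map every document id to its owning node."""
    return {doc_id: owner(doc_id, nodes) for doc_id in doc_ids}


docs = [f"doc-{i}" for i in range(1000)]  # hypothetical document ids
before = placement(docs, ["node-a", "node-b", "node-c"])
after = placement(docs, ["node-a", "node-b", "node-c", "node-d"])

# Documents whose owner changed when node-d joined the cluster.
moved = [d for d in docs if before[d] != after[d]]
print(f"{len(moved)} of {len(docs)} documents moved")
```

In this sketch every moved document lands on `node-d`; nothing is shuffled between the three existing nodes, which is the property that lets a cluster keep serving while it rebalances.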

~~~
iamalchemist
Looks impressive!

------
cies
Wow this project is humongous!

[https://github.com/vespa-engine/vespa](https://github.com/vespa-engine/vespa)

I'm really curious how it compares to Lucene/ElasticSearch/ELK, which is
currently my tool of choice for (faceted) search and recommendation.

~~~
aidos
I think that's actually the broadest root folder I've ever seen on a project!
It's _really_ hard to know where to start looking. Is anyone familiar with the
internals?

~~~
RealJon
Sorry about that - we haven't really optimized the module structure for
newcomer comprehension. If you tell me what you are looking for I can
point you to the right place.

------
tallanvor
Wow...

I'm guessing (hoping) their lawyers made sure to go over their old agreements
with a fine-tooth comb to make sure their license for the software allowed them
to open source this.

Don't get me wrong, they've added a lot to it, but there's a lot of code in
there that could only have come from their purchase of Overture, who had
purchased AllTheWeb from FAST (which was itself purchased by Microsoft).

~~~
reednj
If they purchased it, then they would own the copyright, so they can relicense
it any way they want, can't they?

------
sanxiyn
Vespa, using its own malloc: [https://github.com/vespa-
engine/vespa/tree/master/vespamallo...](https://github.com/vespa-
engine/vespa/tree/master/vespamalloc)

~~~
andreer
The reason is simply performance: to avoid having to go to the kernel
every time we need to allocate memory for a query, and to avoid having to clear
memory on free/reuse. It is made for Vespa, but is also used for other programs.

Similar in purpose to Google's TCMalloc: [http://goog-
perftools.sourceforge.net/doc/tcmalloc.html](http://goog-
perftools.sourceforge.net/doc/tcmalloc.html)
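
The two ideas mentioned above (keep freed memory around instead of going back to the system, and skip clearing it on reuse) can be sketched with a toy buffer pool. This is only an analogy in Python, under the assumption of power-of-two size classes; it is not how vespamalloc or TCMalloc are implemented, and `BufferPool` is a hypothetical name.

```python
class BufferPool:
    """Toy size-class pool: freed buffers are kept and handed out again as-is."""

    def __init__(self):
        self._free = {}             # size class -> list of reusable buffers
        self.fresh_allocations = 0  # how often we had to allocate a new buffer

    @staticmethod
    def _size_class(n):
        # Round up to the next power of two so similar requests share buffers.
        size = 1
        while size < n:
            size *= 2
        return size

    def acquire(self, n):
        size = self._size_class(n)
        bucket = self._free.get(size)
        if bucket:
            return bucket.pop()     # reuse: no new allocation, contents not cleared
        self.fresh_allocations += 1
        return bytearray(size)

    def release(self, buf):
        # Buffers are always created at their size-class length, so len() is the class.
        self._free.setdefault(len(buf), []).append(buf)


pool = BufferPool()
first = pool.acquire(1000)   # served from a fresh 1024-byte buffer
first[0] = 42
pool.release(first)
second = pool.acquire(900)   # same size class, so the freed buffer is reused
```

Here `second` is the very same buffer as `first`, still holding the old byte, and only one fresh allocation was made. The trade-off is that callers must not assume reused memory is zeroed.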

~~~
sanxiyn
How is it better than TCMalloc? (If it isn't, it probably should be replaced
by TCMalloc.)

~~~
andreer
In our tests, vespamalloc has simply been faster. I don't know how in-depth
the analysis has been as to why, but obviously vespamalloc is written and
tuned for Vespa so that is a likely factor.

------
softwaredoug
If anyone wants this kind of machine-learning-based reranking in Elasticsearch,
we've been working with the Wikimedia Foundation on an Elasticsearch Learning
to Rank plugin:

[http://github.com/o19s/elasticsearch-learning-to-
rank](http://github.com/o19s/elasticsearch-learning-to-rank)

------
sandGorgon
I see it has tensor processing built in -
[http://docs.vespa.ai/documentation/tensor-
intro.html](http://docs.vespa.ai/documentation/tensor-intro.html)

Can this be used as a spark+tensorflow replacement ?

~~~
gyehuda
Note: TFoS is also a Yahoo open source project. The teams work together.
[https://github.com/yahoo/TensorFlowOnSpark](https://github.com/yahoo/TensorFlowOnSpark)

~~~
sandGorgon
is tfos production-ready right now ? because i thought it was still
experimental.

is it used inside Yahoo - because Vespa comes with its own tensor processing
engine. I wondered who would use one over the other.

~~~
RealJon
TensorFlow on Spark is for learning, Vespa is for serving. Where Vespa excels
is in evaluating a learned model very quickly over lots of documents. We're
working on providing support for running models learned with TensorFlow
directly. For now people make the translation on their own.
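
As a rough illustration of what "evaluating a learned model over lots of documents" means on the serving side, here is a minimal Python sketch that scores each document with a fixed linear model and keeps the top results. The feature names, weights, and documents are made up for the example, and a linear model is an assumption; this does not use Vespa's ranking expressions or tensor API.

```python
import heapq

# Hypothetical weights, as if produced by an offline training job.
weights = {"text_match": 2.0, "freshness": 0.5, "popularity": 1.0}

# Hypothetical per-document features available at query time.
documents = {
    "doc-1": {"text_match": 0.9, "freshness": 0.1, "popularity": 0.3},
    "doc-2": {"text_match": 0.4, "freshness": 0.9, "popularity": 0.8},
    "doc-3": {"text_match": 0.7, "freshness": 0.5, "popularity": 0.1},
}


def score(features):
    """Weighted sum of features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def top_k(docs, k):
    """Return the ids of the k highest-scoring documents, best first."""
    return heapq.nlargest(k, docs, key=lambda doc_id: score(docs[doc_id]))


best = top_k(documents, 2)
print(best)
```

Training (the TensorFlow on Spark side) would produce the weights offline; serving only has to run the cheap `score` step per document per query, which is why the two systems can be separate.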

~~~
KGIII
I don't want to trigger any bot detection by voting all your comments up in a
short amount of time. So, I will say thank you and mention that it is
contributions from people like you that keep me coming back.

------
stereosteve
Here is a snippet to decodeURIComponent for all the yql examples in the
documentation. Makes it a bit easier to see the yql syntax.

    
    
      // Requires jQuery on the page; ':contains' is a jQuery-only selector.
      $('pre:contains("yql=")').each((i, el) => { el.innerText = el.innerText.replace(/\+/g, ' ').replace(/yql=(.+%3B)/, (m, p1) => 'yql=' + decodeURIComponent(p1)) })
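
Outside the browser, the same decoding can be sketched in Python with the standard library. The example URL below is hypothetical (host, port, and query are made up), and unlike the snippet above this version decodes only the value of the `yql=` parameter and leaves the rest of the URL alone.

```python
from urllib.parse import unquote_plus


def decode_yql(url):
    """Return the URL with the value of its yql= parameter percent-decoded."""
    prefix, sep, rest = url.partition("yql=")
    if not sep:
        return url  # no yql parameter, leave the URL untouched
    value, amp, tail = rest.partition("&")
    # unquote_plus turns '+' into spaces and decodes %XX escapes.
    return prefix + sep + unquote_plus(value) + amp + tail


# Hypothetical example URL in the style of an encoded query.
example = (
    "http://localhost:8080/search/?yql=select+*+from+sources+*"
    "+where+title+contains+%22vespa%22%3B&hits=5"
)
decoded = decode_yql(example)
print(decoded)
```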

------
wiradikusuma
The quickstart doesn't seem to work, at least on macOS. I raised a ticket:

[https://github.com/vespa-engine/vespa/issues/3560](https://github.com/vespa-
engine/vespa/issues/3560)

~~~
qw
I just checked the issue, and it seems it was caused by human error. It is
closed now.

------
bucketman
How well does vespa handle time series data, compared to, say, elasticsearch?

------
benth
Say I'm using ELK for log aggregation. Would Vespa be a good replacement? One
pain point is ingest rate. How many "average" log lines per second can Vespa
do per node?

~~~
andreer
It could be a replacement for the 'E', but the APIs are different enough that
there's no drop-in replacement for the 'L' and 'K' and creating or making
those compatible would be a significant effort. Would be great if someone did
though :-)

~~~
benth
Gotcha. On the ingest front, do you have any numbers around that? I see some
benchmarks that focus on other (important) aspects like QPS but didn't catch
anything on ingest.

~~~
RealJon
Write speed (add or update) is typically between a few thousand and a few tens
of thousands of operations per second per node sustained, depending on the size
of the data etc.

Sustaining throughput over a long time is important and often overlooked
in benchmarks.
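
For a back-of-the-envelope sizing from those numbers, here is a small Python sketch. The target of 100,000 log lines per second, the 50% headroom, and the two per-node rates (3,000 and 20,000 writes per second, picked from within the range quoted above) are all assumptions for illustration, not measurements, and redundancy is ignored.

```python
import math


def nodes_needed(target_ops_per_sec, per_node_ops_per_sec, headroom=0.5):
    """Nodes required if each node is only loaded to (1 - headroom) of its rate."""
    usable = per_node_ops_per_sec * (1.0 - headroom)
    return math.ceil(target_ops_per_sec / usable)


target = 100_000                           # assumed log lines per second
optimistic = nodes_needed(target, 20_000)  # "a few tens of thousands" per node
pessimistic = nodes_needed(target, 3_000)  # "a few thousand" per node
print(optimistic, pessimistic)
```

Under these assumptions the estimate spans from 10 to 67 nodes, which shows why the per-node write rate for your own document size matters more than any headline figure.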

------
crudbug
Big open source news in recent times.

How is the storage layer designed ? disk format ? Can you extend the layer to
support different models such as Property graph ?

------
elfchief
Now, if Yahoo would just open source the yinst/opsdb/rolesdb/igor/etc
ecosystem. I miss that elegant tooling so fricking much.

------
whage
I don't understand how such a move is profitable for a business. Can someone
please point me to some articles that discuss this?

~~~
C4stor
[https://www.joelonsoftware.com/2002/06/12/strategy-
letter-v/](https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/)

The point of Joel's article is that "Smart companies try to commoditize their
products’ complements."

From what I understand, Yahoo is a media company, and as such it may try to
commoditize a natural complement of today's media companies, which is data
search.

~~~
jogjayr
That article is a classic but I don't really understand how it applies in this
case.

Smart companies commoditize their products' complements (something that needs
to be bought with the product) so that whoever wants to buy their product has a
large variety of offerings to select from. For instance, MS-DOS's ability to
run on any standard PC architecture machine commoditized the PC.

It's not clear how commoditizing data search makes selling media to consumers
easier, because consumers of media don't buy data search. In fact they expect
it to come for free from the media company.

Might it not have the opposite effect instead? That is, it makes starting up a
media company cheaper and allows competitors to spend more money on acquiring
media?

~~~
KGIII
Yahoo! gets code improvements back and, by being the originals and having the
most experts familiar with the entire code base - and it is huge, they retain
their competitive advantage. They also do this while fostering good will and,
potentially, reducing the number of bugs, improving security, and increasing
efficiency.

It's brilliant, in a way. There are risks, but they are small and mitigated.
They may even end up selling support and customizations, or enabling that
market. The upside potentials are many and the downsides are few and only risk
small impacts.

Hell, you can get RedHat for the low cost of nothing, just by signing up for
it. On top of that, they will give you every single last line of code you
want. They'll give you all of the code, and do it for free.

Yet, they are a successful for-profit company. They don't even accept
financial donations, as far as I know. They aren't the wealthiest company, but
they are doing quite well and not suffering financially.

Open source doesn't mean no profit. It just means additional rights for the
source code and/or user. (Different licenses prioritize different liberties
and have different goals.)

------
abiox
i wonder if this is able to squeeze something useful out of Yahoo Answers.

~~~
KGIII
I am not trying to be snarky when I say this, so I'll make it longer than a
single word.

No.

That is, Y! Answers was/is a great idea with terrible results. I'm not sure if
it should have been moderated better, or if it should have been marketed
better. Hell, maybe it should have had a basic literacy test prior to being
allowed to post and answer?

I could actually come up with a few hundred ways to have made it better. We
know the question and answer format works. It does on many, many sites. It
failed there and, largely, that's because of the users. I'm not sure which was
worse, the questions or the answers.

I'd love to be given that project and tasked with improving it. It's a great
idea, but horribly implemented. I'd like to fix it because I'm fond of trying
the impossible and like fixing broken things.

Improving search for it is absolutely not going to help it. No, that's not
going to help in the slightest.

~~~
gyehuda
FWIW, there was a research team at Yahoo who would actually get insights about
the way people used language (relevant for contextual search) from Yahoo
Answers. They were more active during the earlier years when the content
quality was much higher. I don't know if they used Vespa in their mining
process, but it would not surprise me if they did. Vespa is used in many
projects because it's just that good. So if you are thinking of using it --
for Yahoo Answers, for Quora, whatever, go for it. I know a cancer
researcher/engineer who wants to use Vespa for serving clinical reports and
trial outcomes.

As for the current Answers site, well, we'll see what happens. I know the PM,
a delightful person. I don't know the plan. But if there is a plan to make
something useful from it, apply. It apparently makes money (otherwise it would
have been killed long ago), and that means there's something to work with.
When I started at Yahoo I was hoping to get onto the Groups team for the same
reason -- a huge challenge to fix something that could be made cool again.

------
martinp
Maybe this can be merged with the discussion from yesterday:
[https://news.ycombinator.com/item?id=15339851](https://news.ycombinator.com/item?id=15339851)

------
gfredtech
I observe that it was written in Java: [https://github.com/vespa-
engine/vespa](https://github.com/vespa-engine/vespa)

~~~
martinp
The serving and admin/config layers are Java, while the content cluster is
C++. Please see the architecture diagram on
[http://vespa.ai/](http://vespa.ai/).

~~~
gfredtech
yeah, seen :)

