Open-Sourcing Vespa, Yahoo’s Data Processing and Serving Engine (oath.com)
512 points by mkagenius on Sept 27, 2017 | 121 comments

Yahoo time and again releases open source software which is super super helpful to the community at large. But it always makes me wonder why an org with such an amazing engg culture (multiple anecdotes from friends who were at Yahoo, plus the amazing experiences at Yahoo OpenHack each year as a testament to this) could be run into the ground.

Really goes to show that engg != business and unless you have a firm business model and growth, an amazing engg team can only get you so far.

I know I'm just stating the obvious, but just putting it out there!

Vespa looks super interesting (more so since I'm in a company that provides ecommerce search APIs as a product) and I'm sure I'll play with it more. Thanks Yahoo! :)

The Register did a good job describing the dichotomy between the product business and the tech side of the business here, related to a previous open source project Yahoo published (disclaimer, I run the open source process at Yahoo) https://www.theregister.co.uk/2017/03/23/yahoo_tensorflow_on...

"Over the decades Yahoo! has contributed substantially to the greater good, publishing its own code as open source.

Arguably Yahoo!’s greatest legacy once it is a division of Verizon will be big data, after one of its engineers – Doug Cutting – wrote an open-source implementation of Google’s MapReduce that became Hadoop. What followed was an entire ecosystem of startups and projects crunching data at scale – Cloudera, Hortonworks, MapR to name three in a market some calculate will be worth $50bn by 2020."

Congrats on the Vespa release, it looks phenomenal!

Generally when you disclose something (like the fact that you run the open source process at Yahoo) that's a disclosure, rather than a disclaimer.

A disclaimer might be considered the opposite "I don't run the open source process at Yahoo, but...", or, more commonly, "IANAL".

oops, thanks. I meant to disclose my affiliation, not disclaim it.

> Cloudera, Hortonworks, MapR to name three in a market some calculate will be worth $50bn by 2020

Very optimistic calculation...

There are a lot of players in this space.

What do you mean you run the Open Source process at yahoo?

Hopefully Verizon's open source legacy will become even bigger and make it easier for these kinds of innovations to happen.

We've had some success a few cities over in Verizon Labs... https://verizon.github.io/

My job is to manage the open source process for Oath (which is essentially Yahoo + AOL). That includes helping ensure we can publish code like this and the hundreds of other projects we publish too. I'm the one who cares about open source licenses, patent clauses, github permissions, etc. Many large tech companies have someone in a comparable role and some of us work together in the todogroup to help manage the way we do opensource. I'm beginning to meet the people in Verizon who do the same. I hope their open source legacy grows too. Heck I celebrate when Google, Amazon, and Comcast publish great code too. It's good for us all. But Vespa is a real treat. It's really really special to Yahoo and we are very hopeful that the Big Data community sees how many things they can do with this, at scale, the way we have.

Here's some info if you're interested in starting your own open source program c/o the TODO Group: https://github.com/todogroup/guides

Here's some more information if you're interested in running your own open source program: https://www.linuxfoundation.org/resources/open-source-guides...

When I was there (2003-2005), the big problem already seemed to be that a lot of engineering saw Yahoo as a tech company, while a lot of the business side saw Yahoo as a media company.

Especially so after Yahoo conceded Search.

So one side wants to pour money into getting technical leverage. The other side largely just wanted more effective ways of publishing and monetizing content.

It doesn't really matter which side was right, only that they were at times pulling in wildly different directions in terms of what they believed it was important to invest in.

Yahoo's largest problem while I was there (2004-2011) was that nobody could articulate what y! is and what it does. Actually, at employee orientation in 2004, they had a clear mission: to be a part of everything you do online: either by doing it, or cobranding it, with a goal of being in the top 3 of every internet vertical.

Somehow, over time, this message was lost, and it was no longer ok to be #2 or #3 in some verticals (like search), and having very healthy profit margins wasn't enough either. The Microsoft search deal was a bad deal, poorly executed, as well: Microsoft couldn't meet the monetization and performance requirements from the start (Microsoft paid out of pocket for a while for this), y! lost control of an important product, and the staffing reductions to search that were supposed to be enabled never came.

Interesting "to be a part of everything you do online: either by doing it, or cobranding it, with a goal of being in the top 3 of every internet vertical." Could you say that part of the reason they lost was a mission/purpose that from the above seems to be purely focused on internal value creation? No word on how they would benefit their customers? It does not say anything about the value it creates.

Put another way should your purpose be something customer focused?

The customer focused version of this mission is 'to help users do everything they want to do online'. Part of doing everything is providing some form of discoverability, which takes the form of linking to other products -- big traffic drivers are links on the front page to major products, and integration with search, but also relevant cross linking within the more specific verticals -- at Y! Travel, we would use flickr photos, events from upcoming, share restaurant data with local, we would have done things with yahoo calendar if it was more usable, etc.

That sounds like an incredibly unclear mission to me. It also sounds like the mission of a media company, not a tech company, but without making that distinction clear...

What's unclear about it?

And/or what do you propose that fits what Yahoo! did (or should have been doing) at any point in its life? Alternatively, provide a clear mission statement that encompasses what any large company with lots of products does -- GE, IBM, Google, etc.

Well for starters it is clearly hyperbole. There is no way Yahoo particularly wanted to be part of everyone's porn surfing for example - Tumblr notwithstanding. And you'd find plenty of other niches where Yahoo made no attempt whatsoever to enter.

And outside of investment in Alibaba, their involvement in online shopping was marginal. In fact, Yahoo's premium services were always marginal across the board - I spent years trying to push product teams to find services they could potentially charge for in Europe, and from what I can tell my replacements had no luck in that respect either.

Instead Yahoo actually divested a number of services in that respect. Yahoo! Personals was sold to Match.com, and while it may have been co-branded for a while, that's long since lapsed in most markets.

So either Yahoo dramatically failed to follow this up in any reasonable way, or the actual intent was a lot narrower. Or both.

But what does "be a part of" even mean? Have an ad on the page? Be recognised by the user as providing the service? Providing content? Provide the tech? It really says nothing. It doesn't say anything about the purpose either. Is it to sell more ads? To grow the brand? To drive people to premium services?

It's a non-statement.

To me it's a statement that's basically carte blanche for whatever management happens to want right now, but it's certainly not providing focus.

And while I would interpret it as directed towards media, I suspect it did nothing to e.g. clarify to the company whether Yahoo was a media company or a tech company. Maybe intentionally, because there were a lot of people in engineering when I was there who did not want to accept that what mattered to Yahoo was connecting eyeballs to content, and that.

> Well for starters it is clearly hyperbole. There is no way Yahoo particularly wanted to be part of everyone's porn surfing for example - Tumblr notwithstanding. And you'd find plenty of other niches where Yahoo made no attempt whatsoever to enter.

It's a little hyperbolic. A realistic refinement would add family friendly and having a decent amount of marketing spend and/or user time spent.

> And outside of investment in Alibaba, their involvement in online shopping was marginal.

Yahoo! Shopping was a decent price comparison tool (and Kelkoo was an acquisition of the same in Europe), and Yahoo! Stores was enabling a lot of small businesses to sell online. Auctions was closed in 2007 (except in Japan), because it's hard to be the number 2 for auctions. Yahoo! Wallet was supposed to make it easier to buy things all over the web by storing your payment credentials in one (trusted) place. There was a person to person payment thing that I can't recall the name of.

> Yahoo! Personals was sold to Match.com

In 2010 -- way into the period where nobody knew what Yahoo! was trying to do.

> To me it's a statement that's basically carte blanche for whatever management happens to want right now, but it's certainly not providing focus.

Yahoo may have been focused on the directory initially, but at least by 1997 it was not a goal to be focused -- there was never one thing Yahoo did to focus around, other than the brand. Even the first archive.org capture from October 1996 has links to random things.

> And while I would interpret it as directed towards media, I suspect it did nothing to e.g. clarify to the company whether Yahoo was a media company or a tech company.

Does it matter if it is a media company or a tech company? The real answer is Yahoo is an eyeballs company. Premium services I guess were technology revenue, but services revenue was never expected to be a large portion of revenue for the company (although certainly it was for some verticals). Yahoo's tech stack enabled small teams to build vertical sites that competed with much larger teams. Yahoo's business stack enabled those small teams to sign large advertisers and content deals. Yahoo's network of sites enabled the small teams to start with lots of eyeballs, without huge marketing spends. Building compelling verticals gets the eyeballs to come back. Having a great internet search product that the eyeballs saw as a great internet search product would have really helped with keeping eyeballs.


Steve Jobs responding to a question about OpenDoc in 1997:

> You’ve got to start with the customer experience and work backwards to the technology.


Because often there's no correlation between engineering culture and business.

I haven't had the longest career, but the little I have seen empirically confirmed to me that there's actually a strong connection between engineering culture and business success. Of course, a great engineering culture alone is not enough, but it can act as a multiplicative factor.

In an engineering centric product. Why would a mom and pop email / news / casual gaming portal succeed directly because of good engineering? Time and time again we massively over-play our role in many business stories.

This is so true. For a website so heavily focussed on content, tech is only a small part of the game. Case in point is youtube, though the tech challenge with youtube is more complex. Nevertheless, youtube now has to survive on business deals and cannot depend on tech alone, unlike google search.

Yahoo mail was garbage. News was not useful and mostly ads. Search was not great.

If you build crappy tools don't be surprised when people don't find a lot of value in them.

I would think they aren't all that correlated. Look at Apple: Steve Jobs treated some of his engineers like crap and they were successful. It's nice to have a good place to work, but I don't think it ensures success.

This doesn't necessarily mean a 'bad culture' -

a) the concept of 'tough love' is applicable

b) were engineers allowed and challenged to innovate? or did their work effort primarily consist of navigating red tape and bureaucracy to the point that all creativity was crushed?


see also: dilbert.

To me a good engineering culture isn't a "nice" engineering culture, it is one where the company consistently delivers outstanding work that is on point with business requirements.

See: DEC, Sun Microsystems. Engineering != Ability to make money. While none of us may like it, you almost always need a solid marketing and sales force to have a successful business.

Sun was making money, their problem was cash flow. Their primary customers were on Wall Street. They borrowed a bunch of money prior to the GFC, and right when they needed their Wall Street customers to be placing orders to cover payments on the debt, their customers were tightening their belts.

Business is (mostly) about people and relationships. That's probably where Yahoo was lacking - not the engineering part.

And they used to be a BSD shop as well, then Marissa changed it to Linux I think.

Juicero had an amazing engineering culture. Look where that went. You need a proper management team to wrangle it in. Clearly Yahoo has made mistakes in that aspect.

Not sure that Juicero is a good example of good engineering, "Do You Need a $400 Juicer?"


I think you and the parent comment are both partially right - from having watched a number of teardowns, the Juicero was (if anything) massively over-engineered in terms of number of parts, cost of those parts, how they were machined, etc.

It was also kind of clear that it went too far, that a more experienced hardware design and engineering team would have found (not a cheaper), but a more effective way to handle those same challenges.

>from having watched a number of teardowns, the Juicero was (if anything) massively over-engineered in terms of number of parts, cost of those parts, how they were machined, etc.

What you mean to say is that the Juicero was overbuilt and underengineered. As the saying goes, "anyone can build a bridge, it takes an engineer to barely build it".

This reminds me all too much of Coffee Equipment Company which back around 2005 built the world's greatest coffee machine, Clover. Unfortunately, at $11k per machine even Starbucks, after buying the company, wasn't able to find a profitable niche for it. Clover machines were, are, beautiful machines that are able to repeatedly and dependably turn out genuinely great cup after great caffeinated cup; the problem is that it was a solution in search of a problem, much like Juicero.

What does "engg" stand for? A particular field of engineering?

ok if I ask what company, I have some interest in the area.

IMHO, yahoo has bad investors. One proof of this is the firing of Marissa Mayer. As soon as she was hired, the reputation of yahoo started to improve. All the fights of investors against her, until they made her leave, had the opposite effect. It takes years to build a reputation and it takes more years for this reputation to bring dividends. I have a similar feeling for Microsoft. The reputation has (slowly) improved since the arrival of Satya Nadella. I wish him success in making it a big player again.

Mayer basically destroyed a healthy, but not growing, company in the span of 4 years through short-term number-massaging actions at the cost of their competent engineers and core product development.

Nadella isn't making a particularly strong case for MS either with his seeming distaste for any product of theirs that isn't either mobile or cloud-based (like the ones where they actually have a monopoly to build off and no competition worth speaking of), which seems to arise from that 'chasing growth numbers' mindset.

Microsoft products? "Windows is a service" is the word now, as firmly stated by a recent message demanding a PC reboot.

(The first sentence was "Windows is a service and updates are a normal part of keeping it running smoothly.")

to their "credit", this doesn't actually mean anything given the msft ability to misuse language consistently to make it sound more magical like

>Mayer basically destroyed a healthy, but not growing, company in the span of 4 years

So she did exactly what Google paid her to do. Eliminate a competitor.

How dumb do you have to be to hire a CEO from high in the ranks of your biggest rival?

Yeah she was terrible. Their homepage still sucks, freezes up my browser all the time.

I am not in America, but when she arrived, I was very curious to see how she could revive such a dying zombie. Clients and engineers were leaving like hell. IMHO, she has done a lot to reduce the hemorrhage and make it attractive again.

> IMHO, she has done a lot to reduce the hemorrhage and make it attractive again.

What are you talking about? She's no longer employed and Yahoo has been sold off to Verizon...

I replied to the rewrite of the history of yahoo, in particular the claim that yahoo was healthy before Marissa Mayer arrived. IMHO, she had a positive effect on yahoo during her presence.

Boardroom drama and IR conference calls turning into 4chan-style discussions are characteristic of big-company culture in the US. The few successful big companies that evaded that predicament were the ones with either poison pill provisions or other tricks that made them "taste bad" to activist investors and other nasty guys from the financial industry.

Loss of direction is what stalls big tech companies 9 times out of 10. If you have 10 decision making actors on the board of directors, you effectively have to do the 10 different things that each of those guys wants, and all of them with compromises made to accommodate the 9 other things that have to be done simultaneously.

My father frequently says that "a public company is like a woman with 10 husbands, all trying to make love to her at the same time"

At Flickr, we worked closely with the Vespa team from 2011 through 2016 on a wide range of advancements:

   * partial document refeeding (i.e. expedite indexing a new field to 20+ billion documents without refeeding everything and staying online handling 100M+ free text queries a day)
   * visual similarity search - check out the tensor ranking features [1] [2]
   * online elasticity - add/remove replicas / shards online. A must when it could take weeks+ to re-feed from scratch. This is non-trivial to make work smoothly at scale. 
   * latency / tail-latency on complex queries. p90 reduction from 3,000 to 30 ms.
This is a major gift to the open-source community of a battle-tested search engine that works reliably without babysitting with very large datasets, and simultaneous high query / high feed volumes. Huge debt of gratitude to the team in Trondheim and Verizon/Oath/Yahoo legal & management teams for making this happen. :+1:

[1] http://docs.vespa.ai/documentation/tensor-intro.html [2] http://docs.vespa.ai/documentation/tensor-user-guide.html

  $ cloc-git https://github.com/vespa-engine/vespa.git
  http://cloc.sourceforge.net v 1.60  T=64.10 s (224.5 files/s, 28276.3 lines/s)
  Language                      files          blank        comment           code
  Java                           6573         106215         102720         537097
  C++                            3209          76542          19178         504855
  C/C++ Header                   2985          42731          57087         158388
  XML                             389            705            550         139626
  Maven                           141            133            244          14096
  CMake                           450            254            560           8452
  Perl                             57           1124            762           7649
  Bourne Shell                    196           1257            734           6918
  Scala                            95           1685            617           6378
  Teamcenter def                  234           1474           3490           2468
  Lisp                              4            231            403           2118
  HTML                             16            211             29           1950
  C                                 7            288            198           1432
  Python                            6            132             66            556
  Ruby                              9             39              9            294
  Bourne Again Shell                3             35             12            182
  Pig Latin                         9             39             52             54
  make                              2             22              8             39
  Ant                               1              9             17             36
  YAML                              1              9              1             22
  DTD                               2              6              6             10
  SUM:                          14389         233141         186743        1392620

As an ex-employee, there could not be a better description of Yahoo! development than this.

Hey, there is no yinst/buildyblocks stuff here. That's kind of a must-have.

> yinst


To be fair, yinst was the least worst part of the systems I was working on (Yahoo!Europe backend feeds stuff.)

Can you elaborate?

Even split between Java / C++, with at least 4 different build systems: 2 of which are effectively the next generation of the other 2. 2 different UNIX shells and 5 other dynamic programming languages. Not the end of the world, and there are probably good reasons for a lot of it (maybe there are bindings for various languages, maybe some of it is misidentified, maybe they're harnessing 2 large bodies of existing work, 1 in Java and 1 in C++), but it may suggest a lot of people just doing things their own way trying to carve out their own niche, without a cohesive philosophy across the system, which mirrors my experience with Yahoo's engineering org.
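For what it's worth, the "even split" can be checked against the cloc table above. A quick sketch using only the code-line counts from that output; lumping "C/C++ Header" in with C++ is my assumption (only 7 plain C files are listed, so most headers are probably C++):

```python
# Code-line counts copied from the cloc output earlier in the thread.
code_lines = {
    "Java": 537097,
    "C++": 504855,
    "C/C++ Header": 158388,
}
total = 1392620  # the SUM row of the cloc output


def share(lines: int) -> float:
    """Percentage of all counted code lines, rounded to one decimal."""
    return round(100 * lines / total, 1)


java = share(code_lines["Java"])
cpp_only = share(code_lines["C++"])
# Assumption: treat the shared C/C++ headers as C++ code.
cpp_with_headers = share(code_lines["C++"] + code_lines["C/C++ Header"])

print(f"Java: {java}%  C++: {cpp_only}%  C++ incl. headers: {cpp_with_headers}%")
```

So Java and the C++ source files are close to even, and C++ pulls ahead once headers are counted. The remainder is mostly the XML bucket.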

No, this is run by a single, very cohesive, very remote team.

There is some misidentification in your list (our .def files are nothing to do with Teamcenter). We use 2 languages because Java and C++ have different strengths which make each suitable at different layers of the architecture. The rest is a combination of "for good reasons" and "leftover scraps" :-)

Yeah I ran cloc on a few projects I'm very familiar with, and shell scripts that are interpreted with bash but not with a shebang because they're libraries sourced by others are called Bourne too. Just explaining what OP was getting at :)

Pig Latin?

That would be Apache Pig, a scripting language for Hadoop. Some scripts are included to feed data to and query Vespa from Hadoop.


This is really cool. Vespa was probably first described in this 2007 paper: https://brage.bibsys.no/xmlui/bitstream/handle/11250/251199/...

Next up, I would really like to see Sherpa/PNUTS (their NoSQL operational database) and Everest (their petabyte-scale Postgres data warehouse) open sourced :)

May I ask some stupid questions? :/

I don't quite get the diagram of the Vespa Architecture. Is Vespa a middleware between database engine and query parser? This is what puzzles me.

If so, are there other such middlewares available for ie. PostgresSQL that allow hooking "Query Templating Models" (that is it?) generated via Machine-Learning Models? Is it way more complicated than that, or did they overengineer the problem into a monolith? EDIT: Looking at https://github.com/vespa-engine/vespa it seems that it is overengineered, or maybe it consists of individual micro-components like node.js, hmm more questions :(

Is GraphQL such middleware or lower-level?

Does Vespa replace custom Glue-Code between Backend and Frontend that generates such query-sets for content ranking/positioning?

Or what exactly does Vespa solve? I'm sorry, I've read the article, but can't say, yep that's what it is!

EDIT: How else could you solve what Vespa does using Rust, Go, or C/C++ libraries? A very simple or general direction would be immensely useful to understand Vespa =) The project makes the simultaneous impression of an immense engineering feat and at the same time a huge code debt.

> How else could you solve what Vespa does using Rust, Go, or C/C++ libraries?

Let me try myself answering my own question, I hope someone hops in and tells me where I'm wrong or how else to improve :)

     1) Get PostgreSQL extensions via "package manager" pgxnclient
     1.1) pg_bouncer - For connection pooling
     1.2) yoke - As a high-availability cluster manager with auto-failover and automated cluster recovery
     1.3) prestodb.io - Distributed SQL query engine for pgsql
     1.4) pglogical - Logical streaming replication for using a publish/subscribe model
     1.5) pg_lambda - To create your own AWS (meta) Lambda
     1.6) pg_strom - To offload tasks to the GPU
     1.7) zombodb - To utilize full-text searching via indexes backed by Elasticsearch
     2) Put all together with pglogical and presto to separate GPU/CPU intensive tasks.
     2.1) "Build Missing Middleware" - To design/fuse a query visually that combines multiple backends
     2.1.1) Create a binary data-stream by integrating pg_lambda, pg_strom, presto and zombodb
     2.1.2) "Build Missing Middleware" - A tensor processing extension to use ML Model evaluations
     2.1.3) "Use Missing Middleware" - For data-processing via Machine-Learning models
     2.1.4) "Use Missing Middleware"- To output ML processed results into the database
     2.2) Partition these queries using "pg_lambda + middleware" to create accelerated and fused query results
So what's missing to create a Vespa alternative using existing technologies is everything in Point 2) if I'm not mistaken. Torrent based replication isn't exactly necessary, except at Twitter/Facebook scale, but if you reach that stage you can hire a libtorrent author.

I now think basing this on PostgreSQL was wrong, and believe that a meaningful approach to creating a Vespa alternative yourself is basing it on Content-Addressable Storage[1] and adding a DB layer on top (ie. using AUFS).

It would have following properties: decentralized, distributed, resilient, highly-available, software-defined storage & retrieval system.

According to http://vespa.ai/#featurematrix:

        ACID transactions			                •••
        Optimized for analytics		        •••	        ••
        Optimized for serving	    •••	        •	        ••
        Scalable	            •••	        ••	        •
        Easy to operate at scale    ••	                        •
        Text search	            •••	        ••	        •
        Machine learned ranking	    •••	        •               2.1.2) - 2.1.4)	
        Middleware logic container  •••		                1.4)
        Live reconfiguration	    •••	                        1.2)
And yet I've to admit that even if the Github repository looks quite chaotic, making an alternative, even using existing technologies, would be a big feat.

Initially I would've chosen PostgreSQL as a base, but the "HA-Layer" is something that shouldn't be decoupled or left as an afterthought. That's why CAS is a much better form of integration. Also, integrating the PostgreSQL engine into ie. a zfs kernel extension would be a mess. And integrating the database engine into a distributed p2p algorithm would only add compatibility issues and no real advantages.

[1] https://en.wikipedia.org/wiki/Content-addressable_storage#Op...

PS: Clever acquisition by Docker! "Infinit.sh is a content-addressable and decentralized (peer-to-peer) storage platform that was acquired by Docker Inc." And in my eyes one of the best implementations and easiest targets that allow adding a database-layer on top.

I think at a glance, it's basically a much more scalable version of something like Elasticsearch, optimized for very quick wide fanout to a large number of leaf nodes.

It's a datastore in its own right (just like ES), but I imagine that e.g. you wouldn't use it to handle transactions.
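To make "wide fanout to leaf nodes" concrete, here is a toy scatter-gather sketch. It is a generic illustration of the pattern, not Vespa's actual dispatch code; the shard contents and the names `leaf_top_k` and `fan_out` are made up for the example:

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (relevance score, document id)


def leaf_top_k(shard: Dict[str, float], k: int) -> List[Hit]:
    """Each leaf node ranks only its own documents and returns its best k."""
    return heapq.nlargest(k, ((score, doc) for doc, score in shard.items()))


def fan_out(shards: List[Dict[str, float]], k: int) -> List[Hit]:
    """The dispatcher asks every leaf, then merges the partial top-k lists."""
    # In a real system these calls would go out to all leaves in parallel.
    partials = [leaf_top_k(shard, k) for shard in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# Hypothetical data: three leaf nodes, each holding a slice of the corpus.
shards = [
    {"doc1": 0.91, "doc2": 0.15},
    {"doc3": 0.42, "doc4": 0.88},
    {"doc5": 0.67},
]
top = fan_out(shards, k=2)
print(top)
```

The point is that each leaf only ships back k hits, so the merge step stays cheap no matter how many documents sit behind each leaf.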

So the upsides of Vespa over Elasticsearch are speeding up the rate at which it scales? Ah, that seems reasonable for a company this size, but is there something in there that's of use for Startups?

This blog post shows how Elasticsearch was used to reindex a 136TB dataset with 36B documents[1], so I'm unsure exactly where Vespa is of use, except for Google/Yahoo-scale companies. I would like to understand how to utilize it though, without adding unmanageable complexity.

EDIT: Maybe a Vespa Cloud startup, that abstracts the management and makes "Scalability as a Service" by utilizing other Cloud providers.


[1] https://thoughts.t37.net/how-we-reindexed-36-billions-docume...

[2] http://docs.vespa.ai/documentation/vespa-quick-start.html

In my experience running machines with Vespa (ended in 2011) and elastic search (which ended earlier this year), Vespa was a lot more stable, even though my elastic search had many times more hardware and fewer documents. At least once a month, elastic search would take a several minute break to do who knows what, even though there was not even any indexing or anything other than searching going on. In case it matters, I was running elastic as a single node cluster (actually several single node clusters), my production Vespa was multinode, but I think we had a single node (or fewer node anyway) cluster for dev/testing.

Anyway, I'm happy that we have more options in this space now.

"Vespa is the single greatest piece of software Yahoo ever built. It's like ElasticSearch but a hundred times better. I am so happy." Laurie Voss Co-founder/COO of @npmjs https://twitter.com/seldo/status/912876700542787585

"When machines are lost or new ones added, data is automatically redistributed over the machines, while continuing serving and accepting writes to the data. Changes to configuration and Java components can be made while serving by deploying a changed application package - no down time or restarts required." That sounds pretty impressive.
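For anyone wondering how "automatic redistribution" can avoid reshuffling everything, here is a small sketch using rendezvous (highest-random-weight) hashing. To be clear, this is a generic technique I'm swapping in for illustration and not a description of Vespa's own distribution algorithm; the node names and document ids are hypothetical:

```python
import hashlib
from typing import Dict, List


def owner(doc_id: str, nodes: List[str]) -> str:
    """Rendezvous hashing: the node with the highest hash(node, doc) owns the doc."""
    def weight(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{doc_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=weight)


def placement(doc_ids: List[str], nodes: List[str]) -> Dict[str, str]:
    """Map every document to the node that currently owns it."""
    return {doc: owner(doc, nodes) for doc in doc_ids}


docs = [f"doc{i}" for i in range(1000)]
before = placement(docs, ["node-a", "node-b", "node-c", "node-d"])
after = placement(docs, ["node-a", "node-b", "node-c"])  # node-d is lost

moved = [d for d in docs if before[d] != after[d]]
# Only documents that lived on the lost node need a new home.
assert all(before[d] == "node-d" for d in moved)
print(f"{len(moved)} of {len(docs)} documents moved")
```

Documents on the surviving nodes stay put, which is what lets a cluster keep serving while only the lost node's share gets re-placed.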

Looks impressive!

Wow this project is humongous!


I'm really curious how it compares to Lucene/ElasticSearch/ELK, which is currently my tool of choice for (faceted) search and recommendation.

I think that's actually the broadest root folder I've ever seen on a project! It's really hard to know where to start looking. Is anyone familiar with the internals?

Sorry about that - we haven't really optimized the module structure for newcomer comprehension. If you tell me what you are looking for I can point you to the right place.

Found some comparison (on the root page of their project website, my bad).


Next thing I'd really like to know is what existing software it builds on top of.

Outside of general purpose libraries that are found in most software projects it's not based on any existing software. It's built from the ground up by Oath, and the companies that preceded it: Yahoo, FAST, Overture since early-mid 2000s.

This article has some more details about the history: https://www.cnbc.com/2017/09/26/yahoo-open-sources-vespa-for...

Disclaimer: I work on the Vespa team in Trondheim, Norway.

Is Vespa relevant if you're not into writing Java? I.e., can it be used as a black box similar to Elasticsearch?

From the repo, it looks like an absolutely huge, monolithic codebase. (It even bundles its own memory allocator!) Do you know if there are plans to break it up into smaller, more manageable pieces?

While I haven't looked at what's required to deploy this beast, operationally speaking, it sounds like it might be daunting to run, and for non-"big data" applications might very well be overkill as an alternative to Elasticsearch.

You don't need to plug in any Java code, you can use it with HTTP calls to read and write.
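As a rough sketch of what "HTTP calls to read and write" could look like from a client: the snippet below only builds request descriptions and sends nothing. The host, port, namespace, document type, field names, and the `/document/v1/...` and `/search/` path layouts are assumptions for illustration; check the Vespa documentation for the exact endpoints in your version.

```javascript
// Hypothetical endpoint and names; nothing here is sent over the network.
const BASE = 'http://localhost:8080';

// Build a write request for one document (assumed document API path layout).
function buildPut(namespace, docType, id, fields) {
  const path = ['document', 'v1', namespace, docType, 'docid', id]
    .map((part) => encodeURIComponent(part))
    .join('/');
  return {
    method: 'POST',
    url: `${BASE}/${path}`,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ fields }),
  };
}

// Build a read (query) request; the YQL string is URL-encoded as a parameter.
function buildQuery(yql) {
  return { method: 'GET', url: `${BASE}/search/?yql=${encodeURIComponent(yql)}` };
}

const yql = 'select * from sources * where title contains "hello";';
const put = buildPut('mynamespace', 'music', 'doc-1', { title: 'Hello' });
const query = buildQuery(yql);
console.log(put.method, put.url);
console.log(query.method, query.url);
```

Any HTTP client (curl, fetch, a load balancer health check) could send these, which is what makes the black-box usage possible.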

No plans to break it up into pieces (apart from already consisting of modules). It does one thing, it just happens to be a big thing :-)

If you have a Mac or Linux box you can have it up and running in 10 minutes. Multi-node production deployments are no different because Vespa manages the nodes, not you directly.

From what I can tell from the documentation, schemas ("search definitions" in Vespa parlance) are stored as files that are part of the "application package" (which I don't know what is yet); it doesn't seem like you can make dynamic schema changes through a REST API, and some schema changes actually require search nodes to be restarted?(!) Doesn't Vespa have anything similar to the Elasticsearch mappings API for dynamically updating a schema?

This may be a difference in mindset, but we generally let apps define the schema, so it's in the developer's realm of responsibility, not an operational concern. We also have apps (currently using Elasticsearch) that manage the schema automatically, derived from a high-level application definition. With ES, the app creates a new index with new mappings, shovels data into the new index, and then activates the new index. But it uses the ES APIs to do this, and restarting anything is not on the table.
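For readers who haven't seen the Elasticsearch flow described above, here is a minimal sketch of it as three REST calls. It only describes the requests and doesn't send them. The index names, alias, and mapping body are made up, and using `_reindex` plus an alias swap is one assumed way to do the "shovel" and "activate" steps; the app in question may copy data itself or switch indexes differently.

```javascript
// Describe the three requests of a "new index, copy, switch alias" rollout.
// Index names, the alias, and the mapping body are hypothetical.
function planReindex(alias, oldIndex, newIndex, mappings) {
  return [
    // 1. Create the new index with the new mappings.
    { method: 'PUT', path: `/${newIndex}`, body: { mappings } },
    // 2. Copy ("shovel") documents from the old index into the new one.
    {
      method: 'POST',
      path: '/_reindex',
      body: { source: { index: oldIndex }, dest: { index: newIndex } },
    },
    // 3. Activate the new index by moving the alias in a single call.
    {
      method: 'POST',
      path: '/_aliases',
      body: {
        actions: [
          { remove: { index: oldIndex, alias } },
          { add: { index: newIndex, alias } },
        ],
      },
    },
  ];
}

const steps = planReindex('products', 'products_v1', 'products_v2', {
  properties: { title: { type: 'text' } },
});
for (const step of steps) {
  console.log(step.method, step.path, JSON.stringify(step.body));
}
```

Clients keep querying the alias the whole time, which is why no restart is involved on the Elasticsearch side.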

Doug Cutting, the creator of Lucene, was an employee of Yahoo. So does Vespa share any technologies with Lucene?



I'm guessing (hoping) their lawyers made sure to go over their old agreements with a fine-tooth comb to make sure their license for the software allowed them to open source this.

Don't get me wrong, they've added a lot to it, but there's a lot of code in there that could only have come from their purchase of Overture, who had purchased AllTheWeb from FAST (which was itself purchased by Microsoft).

If they purchased it, then they would own the copyright, so they can relicense it any way they want can't they?

I was with FAST. Were you there, by any chance?

The reason is simply, for performance. To avoid having to go to the kernel every time we need to allocate memory for a query, and avoid having to clear memory on free/reuse. It is made for Vespa, but also used for other programs.
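To illustrate the two ideas mentioned here (reuse memory instead of requesting fresh memory each time, and skip clearing on free/reuse), here is a toy buffer pool. It is only a conceptual sketch: the class, its size-class rounding, and the counter are invented for the example and say nothing about how vespamalloc is actually implemented.

```javascript
// Toy pool that recycles buffers by size class instead of asking the
// runtime for fresh memory each time. Illustration only, not vespamalloc.
class BufferPool {
  constructor() {
    this.free = new Map(); // size class (bytes) -> array of idle buffers
    this.freshAllocations = 0;
  }

  // Round a request up to the next power of two so buffers are interchangeable.
  static sizeClass(bytes) {
    let size = 1;
    while (size < bytes) size *= 2;
    return size;
  }

  acquire(bytes) {
    const size = BufferPool.sizeClass(bytes);
    const idle = this.free.get(size);
    if (idle && idle.length > 0) return idle.pop(); // reuse, contents NOT cleared
    this.freshAllocations += 1;
    return new Uint8Array(size);
  }

  release(buffer) {
    const size = buffer.length;
    if (!this.free.has(size)) this.free.set(size, []);
    this.free.get(size).push(buffer); // no zero-fill on release either
  }
}

const pool = new BufferPool();
const first = pool.acquire(100); // rounded up to 128 bytes
first[0] = 42;
pool.release(first);
const second = pool.acquire(120); // same size class, so the buffer is reused
console.log(second === first, second[0], pool.freshAllocations);
```

The stale `42` in the reused buffer is the price of skipping the clear: the caller has to treat acquired memory as uninitialized.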

Similar in purpose to Google's TCMalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html

How is it better than TCMalloc? (If it isn't, it probably should be replaced by TCMalloc.)

In our tests, vespamalloc has simply been faster. I don't know how in-depth the analysis has been as to why, but obviously vespamalloc is written and tuned for Vespa so that is a likely factor.

If anyone wants this kind of machine-learning-based reranking in Elasticsearch, we've been working with the Wikimedia Foundation on an Elasticsearch Learning to Rank plugin:


I see it has tensor processing built in - http://docs.vespa.ai/documentation/tensor-intro.html

Can this be used as a spark+tensorflow replacement ?

Note: TFoS is also a Yahoo open source project. The teams work together. https://github.com/yahoo/TensorFlowOnSpark

is tfos production-ready right now ? because i thought it was still experimental.

is it used inside Yahoo - because Vespa comes with its own tensor processing engine. I wondered who would use one over the other.

TensorFlow on Spark is for learning, Vespa is for serving. Where Vespa excels is in evaluating a learned model very quickly over lots of documents. We're working on providing support for running models learned with TensorFlow directly. For now people make the translation on their own.

I don't want to trigger any bot detection by voting all your comments up in a short amount of time. So, I will say thank you and mention that it is contributions from people like you that keep me coming back.

It is more like an alternative to TensorFlow Serving - it does not train models but is good at evaluating them and using the evaluation for ranking.
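To make "evaluating a model for ranking" concrete, here is a deliberately tiny sketch: a linear model whose weights are assumed to have been trained elsewhere is applied to every candidate document, and the top k are kept. The weights, feature names, and documents are all made up, and this is plain JavaScript, not Vespa's ranking expression language or tensor API.

```javascript
// Score every document with an already-trained linear model and keep the
// top k. Weights, feature names, and documents are made up for the example.
const weights = { freshness: 0.5, clicks: 2.0, titleMatch: 3.0 };

function score(features) {
  let total = 0;
  for (const [name, weight] of Object.entries(weights)) {
    total += weight * (features[name] || 0); // missing features count as 0
  }
  return total;
}

function rank(docs, k) {
  return docs
    .map((doc) => ({ id: doc.id, score: score(doc.features) }))
    .sort((a, b) => b.score - a.score) // highest score first
    .slice(0, k);
}

const docs = [
  { id: 'a', features: { freshness: 1, clicks: 0, titleMatch: 0 } },
  { id: 'b', features: { freshness: 0, clicks: 1, titleMatch: 1 } },
  { id: 'c', features: { freshness: 0, clicks: 1 } },
];
const top = rank(docs, 2);
console.log(top);
```

No training happens anywhere in this code; the serving side only needs the weights and a fast way to run the scoring function over many documents per query.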

Here is a snippet to decodeURIComponent for all the yql examples in the documentation. Makes it a bit easier to see the yql syntax.

  $('pre:contains("yql=")').each((i, el) => { el.innerText = el.innerText.replace(/\+/g, ' ').replace(/yql=(.+%3B)/, (m, p1) => 'yql=' + decodeURIComponent(p1)) })
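If you want to try the same transformation outside the browser, here is a sketch of it as a plain function with no jQuery or DOM. The sample string is invented for the example rather than copied from the docs.

```javascript
// Same idea as the snippet above, as a plain function with no jQuery or DOM.
function decodeYql(text) {
  return text
    .replace(/\+/g, ' ') // '+' stands for a space in query strings
    .replace(/yql=(.+%3B)/, (match, encoded) => 'yql=' + decodeURIComponent(encoded));
}

// Hypothetical sample in the style of an encoded query URL.
const sample = 'search/?yql=select+%2A+from+sources+%2A+where+title+contains+%22bad%22%3B';
const decoded = decodeYql(sample);
console.log(decoded);
```

Like the original, it only decodes up to the last `%3B` (the encoded semicolon that ends the YQL statement) and leaves text without a `yql=` parameter untouched apart from the `+` replacement.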

The quickstart doesn't seem to work, at least on macOS. I raised a ticket:


I just checked the issue, and it seems it was caused by human error. It is closed now.

How well does vespa handle time series data, compared to, say, elasticsearch?

Say I'm using ELK for log aggregation. Would Vespa be a good replacement? One pain point is ingest rate. How many "average" log lines per second can Vespa do per node?

It could be a replacement for the 'E', but the APIs are different enough that there's no drop-in replacement for the 'L' and 'K' and creating or making those compatible would be a significant effort. Would be great if someone did though :-)

Gotcha. On the ingest front, do you have any numbers around that? I see some benchmarks that focus on other (important) aspects like QPS but didn't catch anything on ingest.

Write speed (add or update) is typically between a few thousand and a few tens of thousands of operations per second per node sustained, depending on the size of the data etc.

Sustaining throughput over a long time is important and often overlooked in benchmarks.

Big open source news in recent times.

How is the storage layer designed? What is the disk format? Can you extend the layer to support different models such as a property graph?

Now, if Yahoo would just open source the yinst/opsdb/rolesdb/igor/etc ecosystem. I miss that elegant tooling so fricking much.

I don't understand how such a move is profitable for a business. Can someone please point me to some articles that discuss this?


The point of Joel's article is that "Smart companies try to commoditize their products’ complements."

From what I understand, Yahoo is a media company, and as such it may try to commoditize a natural complement of today's media companies, which is data search.

That article is a classic but I don't really understand how it applies in this case.

Smart companies commoditize their products' complements (something that needs to be bought with the product) so that whoever wants to buy their product has a large variety of offerings to select from. For instance, MS-DOS's ability to run on any standard PC architecture machine commoditized the PC.

It's not clear how commoditizing data search makes selling media to consumers easier, because consumers of media don't buy data search. In fact they expect it to come for free from the media company.

Might it not have the opposite effect instead? That is, it makes starting up a media company cheaper and allows competitors to spend more money on acquiring media?

Yahoo! gets code improvements back and, by being the originals and having the most experts familiar with the entire code base - and it is huge, they retain their competitive advantage. They also do this while fostering good will and, potentially, reducing the number of bugs, improving security, and increasing efficiency.

It's brilliant, in a way. There are risks, but they are small and mitigated. They may even end up selling support and customizations, or enabling that market. The upside potentials are many and the downsides are few and only risk small impacts.

Hell, you can get RedHat for the low cost of nothing, just by signing up for it. On top of that, they will give you every single last line of code you want. They'll give you all of the code, and do it for free.

Yet, they are a successful for-profit company. They don't even accept financial donations, as far as I know. They aren't the wealthiest company, but they are doing quite well and not suffering financially.

Open source doesn't mean no profit. It just means additional rights for the source code and/or user. (Different licenses prioritize different liberties and have different goals.)

Good point, thanks for the excellent article!

My idea is that:

1) It attracts developers who will provide fixes, bug reports, documentation, etc. for free (or even draw potential full-time developers for the project).

2) It makes the project look more "trendy" and appeals to developers who will try not to use proprietary software; this is the same move that .NET did and it seems to have worked very well.

2nd point seems valid. 1st one however suggests that people will try to use it to their own benefit - some of them to compete with Yahoo. Why else would they contribute to the project?

Proprietary code is expensive to maintain. Even though there's a dedicated team of a few dozen people that has been working on Vespa over the years, there are thousands of developers who have been contributing to ElasticSearch, Lucene, and other projects in the open source world that are in a similar space. Getting contributions to Vespa will help it grow and evolve to make it better -- much like Yahoo did when it evolved Hadoop and scaled it out, much like Yahoo did when it helped caffe, druid, hive, oozie, openstack, pig, storm, shark, spark, and tensorflow and tons of other projects (that it either created, co-created, or took from others and contributed improvements back to help make better for all).

Sharing code is fine since we use lots of shared code too. We don't sell code. So if someone wants to use Vespa to make an amazing product and make tons of money, we hope they do. We know that sharing Hadoop helped our competitors, but we also know that the revenue stream comes from ads. So we're glad to share code that makes tech better for everyone. As it turns out, many tech companies feel the same way and openly share code with the industry to help us all get to better tech platforms. In the tech space, it's not about grabbing more of the pie, it's about making a bigger pie. The internet revolution is young and the more we build it in the open, the better it will be for all of us.

> We know that sharing Hadoop helped our competitors, but we also know that the revenue stream comes from ads.

This is not clear to me, can you please explain? I'm still stuck at thinking "if you help your competitors then you give up some of your market share".

EDIT: your pie analogy is really nice, I guess it means you grow the whole market by sharing tools like Vespa so you get a smaller slice of the bigger pie. I still don't get how the "revenue stream comes from ads" part relates to everything else.

Some companies make money by licensing software. They are less willing to publish code since, to them, code is revenue and they don't want to give it away. Internet-media companies view code more like a required expense. Giving code away 1. helps reduce carrying costs, 2. attracts developers, 3. forces us out of dependency debt, 4. encourages developers to make their code better, 5. builds skills that transfer to other companies, 6. makes it easier for us to find people who already know the tech we use, etc. There are TONS of upsides. Case in point: we can hire a Hadoop developer. Had we never open sourced the code, we'd have to keep an army of developers in house for a decade. Instead, they can leave to form startup companies (if that's what they want to do) and we still get the benefit of their creative effort. We also invested in one of those companies, so when they make money, we do too. Developers want to work at Yahoo (really) since it will help them build skills they can take elsewhere, or they can stick around and use those skills internally. Either way, why would a developer work on proprietary code when they can work on open source code which will give them more options?

There's a good give and take in the tech world. Sure, our code has helped Facebook and Google eat our lunch. But we don't blame the tech sharing for that -- since they've contributed quite a bit too. We work rather closely with them on a bunch of projects -- which help us all.

Sure, there are some projects that we'd consider the "secret sauce" that really differentiates us from others. We won't open source those. But a lot of code is there 'cuz we need to move bits around quickly. Sharing that code is not going to make or break a multibillion$ enterprise. It's actually going to help make it better in the long run.

As for making money: our sales people do that, not the tech people. The sales people are given amazing products and huge audiences to sell to the advertisers. Whereas a podcast might have a million subscribers, a popular radio program 10 million listeners, a TV show 50 million viewers, or a wireless company 150 million subscribers, we have over a billion users -- and the advertisers LOVE that. So we all sell ads, and whoever does that better wins. But as tech folks, we're collaborative. It's not really a new thing, it's very much part of the fabric that has helped the internet evolve.

I wonder if this is able to squeeze something useful out of Yahoo Answers.

I am not trying to be snarky when I say this, so I'll make it longer than a single word.


That is, Y! Answers was/is a great idea with terrible results. I'm not sure if it should have been moderated better, or if it should have been marketed better. Hell, maybe it should have had a basic literacy test prior to being allowed to post and answer?

I could actually come up with a few hundred ways to have made it better. We know the question and answer format works. It does on many, many sites. It failed there and, largely, that's because of the users. I'm not sure which was worse, the questions or the answers.

I'd love to be given that project and tasked with improving it. It's a great idea, but horribly implemented. I'd like to fix it because I'm fond of trying the impossible and like fixing broken things.

Improving search for it is absolutely not going to help it. No, that's not going to help in the slightest.

FWIW, there was a research team at Yahoo who would actually get insights about the way people used language (relevant for contextual search) from Yahoo Answers. They were more active during the earlier years when the content quality was much higher. I don't know if they used Vespa in their mining process, but it would not surprise me if they did. Vespa is used in many projects because it's just that good. So if you are thinking of using it -- for Yahoo Answers, for Quora, whatever, go for it. I know a cancer researcher/engineer who wants to use Vespa for serving clinical reports and trial outcomes.

As for the current Answers site, well, we'll see what happens. I know the PM, a delightful person. I don't know the plan. But if there is a plan to make something useful from it, apply. It apparently makes money (otherwise it would have been killed long ago), and that means there's something to work with. When I started at Yahoo I was hoping to get onto the Groups team for the same reason -- a huge challenge to fix something that could be made cool again.

Yep, the old maxim "garbage in, garbage out". Yahoo! was the biggest internet company out there (maybe AOL was bigger). You had the whole population in all its warty glory using Yahoo as the default search engine, landing page, email, and who knows what else.

Plus Answers was done at a time when we were extremely naive about the wisdom of crowds and open source knowledge. Later efforts learned from those early 2000s failures. I'm pretty sure Joel Spolsky specifically called out Yahoo Answers when he was talking about what Stack Overflow was going to do right in his early announcements.

Maybe this can be merged with the discussion from yesterday: https://news.ycombinator.com/item?id=15339851

I observe that it was written in Java: https://github.com/vespa-engine/vespa

The serving and admin/config layers are Java, while the content cluster is C++. Please see the architecture diagram on http://vespa.ai/.

yeah, seen :)
