
This is really cool. Vespa was probably first described in this 2007 paper: https://brage.bibsys.no/xmlui/bitstream/handle/11250/251199/...

Next up, I would really like to see Sherpa/PNUTS (their NoSQL operational database) and Everest (their petabyte-scale Postgres data warehouse) open sourced :)




May I ask some stupid questions? :/

I don't quite get the diagram of the Vespa architecture. Is Vespa a middleware between the database engine and the query parser? This is what puzzles me.

If so, are there other such middlewares available for e.g. PostgreSQL that allow hooking in "Query Templating Models" (is that it?) generated via machine-learning models? Is it way more complicated than that, or did they overengineer the problem into a monolith? EDIT: Looking at https://github.com/vespa-engine/vespa it seems that it is overengineered, or maybe it consists of individual micro-components like Node.js, hmm, more questions :(

Is GraphQL such middleware or lower-level?

Does Vespa replace custom Glue-Code between Backend and Frontend that generates such query-sets for content ranking/positioning?

Or what exactly does Vespa solve? I'm sorry, I've read the article, but can't say, yep that's what it is!

EDIT: How else could you solve what Vespa does using Rust, Go, or C/C++ libraries? A very simple or general direction would be immensely useful to understand Vespa =) The project simultaneously gives the impression of an immense engineering feat and of a huge code debt.


> How else could you solve what Vespa does using Rust, Go, or C/C++ libraries?

Let me try answering my own question. I hope someone hops in and tells me where I'm wrong or how to improve this :)

     1) Get PostgreSQL extensions via the "package manager" pgxnclient
     1.1) pg_bouncer - For connection pooling
     1.2) yoke - As a high-availability cluster manager with auto-failover and automated cluster recovery
     1.3) prestodb.io - Distributed SQL query engine for pgsql
     1.4) pglogical - Logical streaming replication for using a publish/subscribe model
     1.5) pg_lambda - To create your own AWS (meta) Lambda
     1.6) pg_strom - To offload tasks to the GPU
     1.7) zombodb - To utilize full-text searching via indexes backed by Elasticsearch
     2) Put it all together with pglogical and presto to separate GPU/CPU intensive tasks.
     2.1) "Build Missing Middleware" - To design/fuse a query visually that combines multiple backends
     2.1.1) Create a binary data-stream by integrating pg_lambda, pg_strom, presto and zombodb
     2.1.2) "Build Missing Middleware" - A tensor processing extension to use ML Model evaluations
     2.1.3) "Use Missing Middleware" - For data-processing via Machine-Learning models
     2.1.4) "Use Missing Middleware" - To output ML processed results into the database
     2.2) Partition these queries using "pg_lambda + middleware" to create accelerated and fused query results
So what's missing to create a Vespa alternative using existing technologies is everything in point 2), if I'm not mistaken. Torrent-based replication isn't exactly necessary, except at Twitter/Facebook scale, but if you reach that stage you can hire a libtorrent author.
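
To make steps 2.1.2) to 2.1.4) a bit more concrete, here is a rough sketch of the "missing middleware" loop: read rows from the database, score them with a model, and write the scores back. Everything here is an assumption for illustration only: the `docs` table, its columns, the `WEIGHTS` values, and the use of Python's built-in `sqlite3` as a stand-in for PostgreSQL. A real learned ranker would be far more involved than this linear formula.

```python
import sqlite3

# Hypothetical "learned" weights; a real setup would load a trained model instead.
WEIGHTS = {"text_match": 0.7, "popularity": 0.3}


def make_demo_db():
    """Create an in-memory database with a made-up docs table (SQLite stands in for PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER PRIMARY KEY, text_match REAL, popularity REAL, ml_score REAL)"
    )
    conn.executemany(
        "INSERT INTO docs (id, text_match, popularity) VALUES (?, ?, ?)",
        [(1, 0.9, 0.1), (2, 0.2, 0.9), (3, 0.8, 0.8)],
    )
    return conn


def score(row):
    """Evaluate the toy linear model on one row's features."""
    return sum(weight * row[name] for name, weight in WEIGHTS.items())


def rank_and_store(conn):
    """Score every document, write the score back, and return ids ordered best first."""
    conn.row_factory = sqlite3.Row  # lets us access columns by name
    rows = conn.execute("SELECT id, text_match, popularity FROM docs").fetchall()
    scored = [(score(row), row["id"]) for row in rows]
    conn.executemany("UPDATE docs SET ml_score = ? WHERE id = ?", scored)
    conn.commit()
    ordered = conn.execute("SELECT id FROM docs ORDER BY ml_score DESC").fetchall()
    return [row["id"] for row in ordered]


if __name__ == "__main__":
    print(rank_and_store(make_demo_db()))
```

The point of the sketch is only that the scoring step sits outside the database engine here, which is exactly the glue that point 2) says is missing.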


I now think basing this on PostgreSQL was wrong, and believe that a meaningful approach to creating a Vespa alternative yourself is to base it on content-addressable storage[1] and add a DB layer on top (e.g. using AUFS).

It would have the following properties: a decentralized, distributed, resilient, highly available, software-defined storage & retrieval system.
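
For anyone unfamiliar with the term, a minimal sketch of content-addressable storage with a thin "DB layer" on top might look like the following. It is an in-memory toy under stated assumptions: SHA-256 as the content hash, and the class names `ContentAddressableStore` and `KeyValueLayer` are made up for this example. It says nothing about how Infinit, AUFS, or Vespa actually work.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory CAS: the key of a blob is the SHA-256 hex digest of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data  # identical content always lands under the same key
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Re-hash on read so corruption is detected instead of silently returned.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("blob does not match its address")
        return data


class KeyValueLayer:
    """Hypothetical DB layer: mutable names that point at immutable blobs."""

    def __init__(self, store: ContentAddressableStore):
        self._store = store
        self._index = {}

    def set(self, name: str, value: bytes) -> None:
        self._index[name] = self._store.put(value)

    def get(self, name: str) -> bytes:
        return self._store.get(self._index[name])


if __name__ == "__main__":
    db = KeyValueLayer(ContentAddressableStore())
    db.set("doc:1", b"first version")
    db.set("doc:1", b"second version")
    print(db.get("doc:1"))
```

Because a blob's address is derived from its content, any node can verify a blob it fetched from any other node, which is what makes this a natural base for the decentralized and resilient properties listed above.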

According to http://vespa.ai/#featurematrix:

        FEATURE                     VESPA   ELASTIC SEARCH   RELATIONAL DATABASES
        ACID transactions                                    •••
        Optimized for analytics             •••              ••
        Optimized for serving       •••     •                ••
        Scalable                    •••     ••               •
        Easy to operate at scale    ••                       •
        Text search                 •••     ••               •
        Machine learned ranking     •••     •                2.1.2) - 2.1.4)
        Middleware logic container  •••                      1.4)
        Live reconfiguration        •••                      1.2)
And yet I have to admit that even if the GitHub repository looks quite chaotic, building an alternative, even using existing technologies, would be a big feat.

Initially I would've chosen PostgreSQL as a base, but the "HA-Layer" is something that shouldn't be decoupled or treated as an afterthought. That's why CAS is a much better form of integration. Also, integrating the PostgreSQL engine into e.g. a ZFS kernel extension would be a mess. And integrating the database engine into a distributed p2p algorithm would only add compatibility issues and no real advantages.

[1] https://en.wikipedia.org/wiki/Content-addressable_storage#Op...

PS: Clever acquisition by Docker! "Infinit.sh is a content-addressable and decentralized (peer-to-peer) storage platform that was acquired by Docker Inc." In my eyes it's one of the best implementations and one of the easiest targets for adding a database layer on top.


I think at a glance, it's basically a much more scalable version of something like Elasticsearch, optimized for very quick wide fanout to a large number of leaf nodes.

It's a datastore in its own right (just like ES), but I imagine that e.g. you wouldn't use it to handle transactions.
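
To illustrate what "wide fanout to leaf nodes" means, here is a rough scatter-gather sketch: the query goes to every shard in parallel, each shard returns its local top-k, and the results are merged into a global top-k. This is a toy under stated assumptions, not Vespa's or Elasticsearch's actual implementation. The shards are plain dicts, the scoring is a naive term count, and the names `search_leaf` and `fanout_search` are made up for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_leaf(shard, query_terms, k):
    """Score every document on one leaf node and return that node's local top-k."""
    scored = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)  # toy term-frequency score
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def fanout_search(shards, query, k=3):
    """Scatter the query to all shards in parallel, then merge the partial results."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_leaf(shard, terms, k), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


if __name__ == "__main__":
    shards = [
        {"a": "vespa search engine", "b": "postgres database"},
        {"c": "search search ranking", "d": "elastic search cluster"},
    ]
    print(fanout_search(shards, "search", k=2))
```

Each leaf only ever returns k hits, so the merge step stays cheap no matter how many documents a shard holds. That is the property that makes this pattern scale out to many leaf nodes.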


So the upside of Vespa over Elasticsearch is how quickly it scales? Ah, that seems reasonable for a company of this size, but is there something in there that's of use for startups?

This blog post shows how Elasticsearch was used to reindex a 136TB dataset with 36B documents[1], so I'm unsure where exactly Vespa is of use, except for companies at Google/Yahoo scale. I would like to understand how to utilize it, though, without adding unmanageable complexity.

EDIT: Maybe a Vespa Cloud startup that abstracts away the management and offers "Scalability as a Service" by utilizing other cloud providers.

--

[1] https://thoughts.t37.net/how-we-reindexed-36-billions-docume...

[2] http://docs.vespa.ai/documentation/vespa-quick-start.html


In my experience running machines with Vespa (ended in 2011) and Elasticsearch (which ended earlier this year), Vespa was a lot more stable, even though my Elasticsearch setup had many times more hardware and fewer documents. At least once a month, Elasticsearch would take a several-minute break to do who knows what, even though nothing other than searching was going on, not even indexing. In case it matters, I was running Elasticsearch as a single-node cluster (actually several single-node clusters); my production Vespa was multi-node, but I think we had a single-node (or at least fewer-node) cluster for dev/testing.

Anyway, I'm happy that we have more options in this space now.



