Next up, I would really like to see Sherpa/PNUTS (their NoSQL operational database) and Everest (their petabyte-scale Postgres data warehouse) open sourced :)
I don't quite get the diagram of the Vespa Architecture.
Is Vespa a middleware between database engine and query parser? This is what puzzles me.
If so, are there other such middlewares available for e.g. PostgreSQL that allow hooking in "Query Templating Models" (is that what it is?) generated by machine-learning models? Is it far more complicated than that, or did they overengineer the problem into a monolith? EDIT: Looking at https://github.com/vespa-engine/vespa, it seems overengineered, or maybe it consists of individual micro-components like node.js. Hmm, more questions :(
Is GraphQL such a middleware, or is it lower-level?
Does Vespa replace custom Glue-Code between Backend and Frontend that generates such query-sets for content ranking/positioning?
Or what exactly does Vespa solve? I'm sorry, I've read the article, but I still can't say "yep, that's what it is!"
EDIT: How else could you solve what Vespa does using Rust, Go, or C/C++ libraries? A very simple or general direction would be immensely useful for understanding Vespa =) The project gives the impression of being an immense engineering feat and, at the same time, a huge code debt.
Let me try answering my own question; I hope someone hops in and tells me where I'm wrong or how to improve this :)
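One general direction, as a minimal sketch and not Vespa's actual API: a serving engine of this kind retrieves candidate documents by text match and ranks them with a model in the same request. The toy corpus, the `popularity` signal, and the fixed `WEIGHTS` below are all made up for illustration; a real system would learn the weights and shard the index across nodes.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus: doc id -> (text, popularity signal in [0, 1]).
DOCS = {
    "a": ("open source search engine", 0.2),
    "b": ("big data serving engine with machine learned ranking", 0.9),
    "c": ("relational database with ACID transactions", 0.5),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Step 1: inverted index, term -> set of doc ids containing that term.
INDEX = defaultdict(set)
for doc_id, (text, _) in DOCS.items():
    for term in tokenize(text):
        INDEX[term].add(doc_id)

# Step 2: stand-in for a learned model: fixed weights of a linear scorer.
WEIGHTS = {"term_overlap": 1.0, "popularity": 2.0}


def features(query_terms, doc_id):
    """Compute the ranking features for one (query, document) pair."""
    text, popularity = DOCS[doc_id]
    doc_terms = set(tokenize(text))
    overlap = len(doc_terms & set(query_terms)) / len(query_terms)
    return {"term_overlap": overlap, "popularity": popularity}


def search(query, k=10):
    """Match candidates via the index, then rank them with the model."""
    terms = tokenize(query)
    if not terms:
        return []
    candidates = set().union(*(INDEX.get(t, set()) for t in terms))
    scored = []
    for doc_id in candidates:
        f = features(terms, doc_id)
        score = sum(WEIGHTS[name] * value for name, value in f.items())
        scored.append((score, doc_id))
    # Highest score first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]
```

With these made-up weights, `search("search engine")` ranks "b" above "a" even though "a" matches both query terms, because the popularity signal outweighs the term overlap. That trade-off between text match and other signals is the part a ranking model is supposed to decide.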
1) Get PostgreSQL extensions via the "package manager" pgxnclient
1.1) pg_bouncer - For connection pooling
1.2) yoke - As a high-availability cluster manager with auto-failover and automated cluster recovery
1.3) prestodb.io - Distributed SQL query engine for pgsql
1.4) pglogical - Logical streaming replication for using a publish/subscribe model
1.5) pg_lambda - To create your own AWS (meta) Lambda
1.6) pg_strom - To offload tasks to the GPU
1.7) zombodb - To utilize full-text searching via indexes backed by Elasticsearch
2) Put it all together with pglogical and presto to separate GPU/CPU-intensive tasks.
2.1) "Build Missing Middleware" - To design/fuse a query visually that combines multiple backends
2.1.1) Create a binary data-stream by integrating pg_lambda, pg_strom, presto and zombodb
2.1.2) "Build Missing Middleware" - A tensor processing extension to use ML Model evaluations
2.1.3) "Use Missing Middleware" - For data-processing via Machine-Learning models
2.1.4) "Use Missing Middleware" - To output ML-processed results into the database
2.2) Partition these queries using "pg_lambda + middleware" to create accelerated and fused query results
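Steps 2.1.2) to 2.1.4) can be sketched roughly like this. It is only an illustration under loud assumptions: `sqlite3` stands in for PostgreSQL so the example is self-contained, the `docs` table and its columns are hypothetical, and the "model" is a fixed linear function rather than anything trained.

```python
import sqlite3

# Hypothetical stand-in model: fixed linear weights instead of a trained one.
WEIGHTS = {"clicks": 0.7, "freshness": 0.3}


def score(clicks, freshness):
    """Evaluate the stand-in model on one row's features (step 2.1.2)."""
    return WEIGHTS["clicks"] * clicks + WEIGHTS["freshness"] * freshness


def rescore(conn):
    """Read features, evaluate the model, write scores back (2.1.3, 2.1.4)."""
    rows = conn.execute("SELECT id, clicks, freshness FROM docs").fetchall()
    updates = [(score(clicks, fresh), doc_id) for doc_id, clicks, fresh in rows]
    with conn:  # commits on success, rolls back on error
        conn.executemany("UPDATE docs SET score = ? WHERE id = ?", updates)


def top(conn, k):
    """Serve the k best documents using the stored scores."""
    cur = conn.execute(
        "SELECT id FROM docs ORDER BY score DESC, id LIMIT ?", (k,)
    )
    return [row[0] for row in cur]


# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id TEXT PRIMARY KEY, clicks REAL, freshness REAL, score REAL)"
)
conn.executemany(
    "INSERT INTO docs (id, clicks, freshness) VALUES (?, ?, ?)",
    [("a", 0.1, 0.9), ("b", 0.8, 0.2), ("c", 0.5, 0.5)],
)
rescore(conn)
```

After `rescore(conn)`, `top(conn, 2)` returns the two highest-scoring ids. Note that this is batch scoring with write-back; evaluating the model at query time would need it to run inside or right next to the query engine, which is the "missing middleware" part.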
The resulting system would have the following properties: a decentralized, distributed, resilient, highly available, software-defined storage & retrieval system.
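The "partition and fuse" idea from 2.2) is essentially scatter-gather: every partition computes its own top hits, and a coordinator merges them. A minimal sketch, assuming each partition already holds locally scored hits (the `PARTITIONS` data and function names are hypothetical, and threads stand in for network calls to separate nodes):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for one backend node
# holding (doc_id, score) pairs that were already scored locally.
PARTITIONS = [
    [("p0-a", 0.91), ("p0-b", 0.40), ("p0-c", 0.13)],
    [("p1-a", 0.77), ("p1-b", 0.75)],
    [("p2-a", 0.99), ("p2-b", 0.05)],
]


def local_top_k(partition, k):
    """Each node returns only its own k best hits."""
    return heapq.nlargest(k, partition, key=lambda hit: hit[1])


def fused_top_k(partitions, k):
    """Scatter the query to all partitions, then gather and merge."""
    if not partitions:
        return []
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: local_top_k(p, k), partitions))
    merged = (hit for partial in partials for hit in partial)
    return heapq.nlargest(k, merged, key=lambda hit: hit[1])
```

Because every partition sends back at most k hits, the coordinator merges at most k times the number of partitions, no matter how large each partition is. The resilience and high-availability parts (replicas, failover, redistribution) are not covered by this sketch at all.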
According to http://vespa.ai/#featurematrix:
FEATURE VESPA ELASTIC SEARCH RELATIONAL DATABASES
ACID transactions •••
Optimized for analytics ••• ••
Optimized for serving ••• • ••
Scalable ••• •• •
Easy to operate at scale •• •
Text search ••• •• •
Machine learned ranking ••• • 2.1.2) - 2.1.4)
Middleware logic container ••• 1.4)
Live reconfiguration ••• 1.2)
Initially I would've chosen PostgreSQL as a base, but the "HA-Layer" is something that shouldn't be decoupled or treated as an afterthought. That's why CAS is a much better form of integration. Also, integrating the PostgreSQL engine into e.g. a zfs kernel extension would be a mess. And integrating the database engine into a distributed p2p algorithm would only add compatibility issues and no real advantages.
PS: Clever acquisition by Docker! "Infinit.sh is a content-addressable and decentralized (peer-to-peer) storage platform that was acquired by Docker Inc." In my eyes it is one of the best implementations and one of the easiest targets for adding a database layer on top.
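To make "a database layer on top of content-addressable storage" a bit more concrete, here is a minimal sketch. It assumes CAS means content-addressable storage, uses an in-memory dict instead of a peer-to-peer network, and has nothing to do with Infinit's real API; `ContentStore` and `Table` are hypothetical names.

```python
import hashlib
import json


class ContentStore:
    """In-memory content-addressable store: the key is the SHA-256 of the value."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content lands on the same key
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Integrity check: the address must match the hash of the content.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("corrupt blob")
        return data


class Table:
    """Hypothetical 'database layer': maps primary keys to blob addresses."""

    def __init__(self, store):
        self._store = store
        self._rows = {}

    def upsert(self, key, row: dict) -> str:
        # Canonical JSON so that equal rows hash to the same address.
        blob = json.dumps(row, sort_keys=True).encode("utf-8")
        address = self._store.put(blob)
        self._rows[key] = address
        return address

    def select(self, key) -> dict:
        return json.loads(self._store.get(self._rows[key]))
```

Equal rows deduplicate automatically, and any node can verify a blob it receives from an untrusted peer by rehashing it. The mutable key-to-address mapping in `Table` is the hard part in a decentralized setting, and this sketch simply keeps it in local memory.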
It's a datastore in its own right (just like ES), but I imagine that e.g. you wouldn't use it to handle transactions.
This blog post shows how Elasticsearch was used to reindex a 136TB dataset with 36B documents, so I'm unsure where exactly Vespa is of use, except for Google/Yahoo-scale companies. I would like to understand how to utilize it, though, without adding unmanageable complexity.
EDIT: Maybe a Vespa Cloud startup that abstracts away the management and offers "Scalability as a Service" by utilizing other cloud providers.
Anyway, I'm happy that we have more options in this space now.