TiSpark sits Spark SQL on top of a storage engine to answer complex OLAP queries

elvinyung · on Aug 4, 2017

Yes! I've been thinking of something like this for a while.

For the sake of simple data integration, I think this sort of architecture is optimal. As it stands, Spark is basically already a distributed database without its own storage engine; tighter integration with a transactional storage engine means that you could get the full power of OLTP and OLAP (HTAP) under the same interface.

Imagine that you could process transactions in Spark (pushing them down to the distributed storage engine), and then Spark could automatically use the changes to update a materialized view, and you could serve the updated materialized view directly from Spark for real-time decision support, using SQL plus richer analytics like machine learning, graph processing, etc. It's not quite a one-size-fits-all [1] database, but it's close.

Put a PostgreSQL or MySQL wire protocol server in front of it, and application developers won't even have to know that they're using Spark.

(I'm glossing over the fact that Spark currently isn't very good at transaction processing in the sense that it literally doesn't have much of a write API right now -- i.e. support for equivalents of SQL `begin`, `commit`, and rollback`, and updates/upserts in general -- but I think that's reasonably easy to add by pushing down this functionality to capable storage engines.)

[1] https://cs.brown.edu/~ugur/fits_all.pdf

ilovesoup · on Aug 4, 2017

I'm the main dev of TiSpark. I totally agree. For now we allow trx only in TiDB. I believe one day those big-data stuff will be unified onto one platform. With a full-featured distributed db storage layer underneath, there might be tons of tricks to play comparing to data on hdfs. Ultimately, we plan to put a mysql layer on top of Spark SQL (maybe or something else as mpp engine), as you said, to make user not aware of existence of Spark SQL underneath.

elvinyung · on Aug 4, 2017

Just curious -- what other MPP engines are you currently considering, and what are the criteria?

jamesblonde · on Aug 4, 2017

We have made a step towards unifying HDFS with a distributed SQL layer by doing something different- we put HDFS' metadata in MySQL Cluster (NDB) - an in-memory distributed DB that scales to 10s of nodes. NDB is not your father's MySQL. The result is HopsFS that scales to at least 16X HDFS on throughput (Usenix FAST 2017). See here: www.hops.io

We have since adapted Hive SQL to move its metadata to the same DB as HopsFS (MySQL Cluster). We added foreign keys from Hive tables to the backing directories in HDFS. The result is strongly consistent metadata for Hive. That is, you can remove a backing directorty for a DB or table in HopsFS, and Hive's DB or table is automatically deleted. If you have ideas for next steps, we're all ears. It's all open-source: https://github.com/hopshadoop

See 22'40 in here: https://www.youtube.com/watch?v=mTRsrjH5WLI

elvinyung · on Aug 4, 2017

Very nice! I actually read the FAST '17 paper a while back.

I'm curious if HopsFS has any support for partition-pruned secondary index scans? (If not, is it on the roadmap?) I feel like the lack of it in many current big-data systems is a major bottleneck to adoption for many kinds of workloads.

jamesblonde · on Aug 4, 2017

Do you mean Hive on Hops, I guess? HopsFS is a filesystem (drop-in replacement for HDFS with distributed metadata).

Partition-pruned index scans are key for high performance in distributed databases. Index scans and full-table scans hit all shards and kill performance. Hive supports partition-pruned index scans and push-down filtering for projections.

ngaut · on Aug 4, 2017

Nice reference, thanks for sharing.