Show HN: EventQL – Open-source distributed SQL analytics database in C++11 (github.com)
90 points by paulasmuth on July 26, 2016 | 18 comments

Is this similar to Apache Drill[0]/Google Dremel[1] (BigQuery[2])? One difference is that it seems to be able to mutate data, not just append.

[0] https://drill.apache.org/ [1] http://research.google.com/pubs/pub36632.html [2] https://cloud.google.com/bigquery/

Yes, it is similar to BigQuery with a couple of differences, as you pointed out. The big ones are that it's fully open source and self-hostable.

It's less similar to Apache Drill - Drill is "only" a query engine and doesn't handle the actual data storage. EventQL combines a bigtable-like storage engine (optimized for the analytics use case) with a dremel/bigquery-like query engine.

How does it compare to Clickhouse, recently released from Yandex [0]?

[0] https://clickhouse.yandex/

Awesome! Thanks! I'll have to try it out.

Looks great! I've just gotten a single instance up and running. Really simple to set up. It seems almost exactly what we've been looking for! Some background: we have been evaluating different time series databases for use with sensor readings (we probably need to ingest something on the order of 100,000-200,000 samples/s per cluster). The data is from physical sensors like temperature, level, pressure etc., and all of it is in the form of tag(string)|datetime(ms)|value(decimal).

Have you done any comparisons to other similar software like influxdb, Cassandra, etc? Especially ingestion rate and disk usage.

What kind of pricing can we expect on Managed Hosting?

We are currently leaning towards Influxdb but the cluster licensing stuff they are doing really made us think twice.

If you're looking for recommendations for time series data, this thread still holds up - https://news.ycombinator.com/item?id=8368509

I'd consider taking a look at KDB+

We threw kdb+ out of consideration pretty early because it's extremely expensive and we prefer open software.

At my work we use OSI PI, which is an Enterprise Historian for storing this kind of physical sensor data - it has very good support for time series logging and integrates well with control systems (Citect, ABB etc.) as well as our LIMS.

It's not free software so that might be a deal breaker for you.

InfluxDB doesn't currently handle high-cardinality data sets well -- it needs a lot of RAM.

Yeah, that is one of the biggest issues besides the licensing stuff. But RAM is pretty cheap and it seems 200,000 different series/tags fit easily in 32 GB.

If you're willing to try something more bleeding edge: http://btrdb.io/

Why EventQL over other distributed columnar-storage based databases like Redshift, Vertica, Citus, or Hadoop-based ones like Impala and Presto? What does EventQL do better?

Good question!

May be useful to add to Show HN guidelines that it is recommended to add a "Why X" section. Or "How X compares to others in this space".

Without that, it is confusing. Are the authors unaware of other solutions in the space? If so, they surely did not build a competitive product.

Or maybe authors are aware of other solutions, but don't want to get compared?

Either way, not a good sign.

looks good [0], thank you

can anyone comment on data disk usage?

[0] docs - http://eventql.io/documentation/

Why not build a bigtable-cassandra fusion row-store too, since updates are async and all nodes are the same (cassandra), while the data model is sorted by primary key (bigtable), the schema is fixed (low cost in storing tuples) and SQL is available (easier for devs)?

How can we import a CSV dataset? The HTTP API for insert returns this error message: expected JSON_OBJECT_BEGIN, got: JSON_OBJECT_BEGIN. The GROUP BY query on the homepage scans 1.8B rows and only takes 1.5 seconds, which is great, but how many nodes were used in that setup?

We do have a csv import util (the API expects JSON) but it's not in the current distribution/release build. I'll add it and update this comment once it's live.
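Until the import util ships, one possible workaround is to turn each CSV row into a JSON object yourself. A minimal sketch, assuming a CSV with a header line; the column names and the helper name `csv_to_json_rows` are made up for illustration, and how the objects need to be wrapped for the insert endpoint is not shown here (check the docs for that):

```python
import csv
import io
import json

def csv_to_json_rows(csv_text):
    """Turn CSV text with a header line into a list of JSON object strings.

    All values stay strings; cast them yourself if the table schema
    expects numbers or timestamps.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row) for row in reader]

# Hypothetical sample data, not a real EventQL schema.
sample = "tag,time,value\nsensor1,1469491200000,21.5\n"
rows = csv_to_json_rows(sample)
print(rows[0])
```

Each element of `rows` is one JSON object per CSV line, keyed by the header names.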

Queries are mostly limited by IO if running on regular hard disks. The number of rows/second mainly depends on the number and types of columns that are accessed. For example, if we scan 1.8B rows and only load a single integer column from disk (and the integers are small), we'll only have to load about 1 byte per row from disk (using an idealized model excluding some overheads for illustration purposes). If we want to complete the query in 1.5 seconds, that would be a total IO load of 1144 MB/s. So (depending on disk speed) around 15 machines would suffice.
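The back-of-the-envelope math above can be reproduced in a few lines. This is only a sketch of the idealized model: the 1144 figure comes out when 1 MB is counted as 1024*1024 bytes, and the 80 MB/s per-machine disk throughput is an assumed number picked for illustration, not something measured on the actual setup.

```python
import math

def required_throughput_mb_s(rows, bytes_per_row, seconds):
    """Total disk read rate needed, in MB/s (1 MB = 1024 * 1024 bytes)."""
    total_bytes = rows * bytes_per_row
    return total_bytes / seconds / (1024 * 1024)

def machines_needed(throughput_mb_s, per_machine_mb_s):
    """Machines required if each one sustains per_machine_mb_s from disk."""
    return math.ceil(throughput_mb_s / per_machine_mb_s)

# 1.8B rows, about 1 byte per row, 1.5 second target.
throughput = required_throughput_mb_s(1_800_000_000, 1, 1.5)
print(round(throughput))                # 1144

# Assumption: roughly 80 MB/s sequential read per machine.
print(machines_needed(throughput, 80))  # 15
```

With faster disks (or SSDs) the per-machine number goes up and the machine count drops accordingly.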

As a maintainer of the project, it's cool to see the Fluentd plugin in the repo: https://github.com/eventql/fluent-plugin-eventql
