Hacker News new | past | comments | ask | show | jobs | submit login
InfluxDB is betting on Rust and Apache Arrow for next-gen data store (influxdata.com)
205 points by mhall119 on Nov 10, 2020 | hide | past | favorite | 113 comments

InfluxDB creator here. I've actually been working on this one myself and am excited to answer any questions. The project is InfluxDB IOx (short for iron oxide, pronounced eye-ox). Among other things, it's an in-memory columnar database with object storage as the persistence layer.

You are the CTO of a significantly-sized tech company. How do you find the time to work on this, and how do you justify your use of your own time, versus hiring and grooming a team, like more traditional CTOs might?

I hope this doesn’t come across as confrontational; I find this inspiring.

Not confrontational at all. In short, it's taken me years to be able to do this. Last year we finally got our engineering organization to a point where I no longer have any direct reports. Instead, everyone rolls up to our VP of engineering. In most mid-sized startups (like ours), the VP of engineering is the person that manages most of the process and generally makes sure the trains run on time.

My view of the CTO role is that it's focused on more long term efforts or the product generally (in more technical companies). So that's what I've been doing for much of this year. With the caveat that I'm also the founder and on the board so there is still about 10 hours a week of executive meetings/management, reviewing of other people's writing & product efforts and working with our biggest customers and prospects.

All that being said, I'm not just working on this because it's what I want to do and it's what excites me (which it does). I also think it's my highest point of leverage within the company. Very few people have the same view and in depth experience in this problem domain. Most of the people I know of that do are either founding and running competitive offerings or they're working on similar projects at Google, AWS, or Azure and getting paid significantly more than we can afford to pay a single engineer.

I think this is one of our most important efforts right now and the best way I know to make it successful is to be working on it in depth. I'm not the best programmer on the team or the smartest, but I have some depth of experience that gives me a clear vision on what it should be able to do and what tradeoffs we should be making.

My role on this effort is basically as the product person and the tech lead. In this case the tech lead isn't the manager of people on the team. We try to have a multiple paths for advancement in engineering and one of them tech focused individual contributor.

Of course, I view this project as necessary, but not sufficient for our overall success. That's why most of our engineering team is working on our legacy products and the continued forward development of the overall platform.

The architecture you've proposed here has many overlaps with q/KDB (or Jd, Shakti), which does not (AFAIK) have a decent open source alternative. Looking forward to see what you build!

Re: no open source alternative to kdb

I built this for Morgan Stanley (the same place where kdb originally came from) and put it to use in some major high-volume/low-latency trading systems where a significant fraction of US equity trades execute every day:


How is Hobbes support for relational workloads? My favourite feature of KDB is mixing SQL-like queries with list manipulations. I see support for tables in the Hobbes docs [0], but an array of tuples is row-oriented whereas KDB is column-oriented (flipped dict of lists). Am I missing something?

Nice job btw.

[0] https://hobbes.readthedocs.io/en/latest/language/types.html#...

Yes similar things are done in hobbes, although with array comprehension syntax instead of SQL. Grouping and joins are done with regular functions. I think that this approach scales better than fitting it all into one query syntax (though I don’t feel strongly about the syntax).

Yes you could say it’s row oriented and not column oriented. It’s possible to arrange data in column vectors, and that does get better performance in some situations. I have some ideas to unify the two approaches (so either layout can be decided, ad-hoc), but need a better approach to dependent types.

Thanks, appreciate the response.

Depending on your need for KDB, old school InfluxDB, OpenTSBD, SciDB and Promethius are open source alternatives.

There is also VictoriaMetrics open source alternative with similar architecture and comparable performance - see https://github.com/VictoriaMetrics/VictoriaMetrics/

What's the secret of the performance? Especially interested to see that it is written in Go.

I am bullish on better time-series systems, but why would I use InfluxDB IOx over Kafka?

I can choose whatever memory I have available for each topic/partition to do a query - or KSQL to transform topics.

edit: I should say that I can recognize a few benefits from columnar memory and lower level management with Rust, but it would be a huge infrastructure/tooling shift for a regular CRUD API shop or one that has already invested in any eventing system IMO.

We are talking 10x+ less storage/servers when working on columnar storage.

Time series databases and Kafka are intended for very different workloads.

Kafka is used as distributed message queue. I.e. a set of apps produce opaque messages and send them to Kafka queues, while another set of apps consume these messages from Kafka queues.

Time series databases are used for storing and querying time series data. There are no Kafka-like queues in time series databases. Every time series has a key, which uniquely identifies this time series, and a set of (timestamp, value) tuples usually ordered by timestamp. Time series key is usually composed of a name plus a set of ("key"=>"value") labels. These labels are used for filtering and grouping of time series data during queries. There are various time series databases optimized for various workloads. For example, VictoriaMetrics [1] is optimized for monitoring and observability of IT infrastructure, app-level metrics, industrial telemetry and other IoT cases.

As you can see, there is little sense in comparing Kafka with time series databases :)

[1] https://victoriametrics.com/

Last time I checked out VictoriaMetrics it had something akin to a memory leak when writing evenly to large numbers of distinct keys -- attempting to cache all the data in RAM and not freeing it even under high memory pressure (it's been awhile, but iirc it had a hard-coded 1hr expiry). Does that sound like a behavior that would still exist in the current design?

You seem to be making an assumption that I'm not going to (or shouldn't) use Kafka as a database for my time series data - when in fact that is how we use it.

I store and query messages in certain Kafka Topics - it's not all our data, but it is a non-trivial amount of it.

In that sense, InfluxDB will start to have a decent amount of overlap with something like Snowflake or Redshift.

Yes, this will likely be the case, but it's not a specific goal. We're focusing our efforts right now on the InfluxDB time series use cases we see most. More general data warehousing work isn't our focus, but we expect to pick up quite a bit of that along the way as we develop and as the underlying Arrow tools develop.

Snowflake was actually specifically mentioned as an example of who else is doing this in the blog post

It looks like InfluxDB is going to target general-purpose analytical workloads. It would be interesting to look at how InfluxDB will compete with ClickHouse in this space.

What are the advantages of Influx-iOX compared to Apache Druid? There seems to be much overlap in feature set and target use cases: seperate storage from compute, using parquet for columnar storage, good at analytical workloads on timeseries.

It looks like Influx-IOx and Apache Druid have many common base building blocks. The main difference is that Influx-IOx is optimized for time series workloads, while Apache Druid is optimized for analytical workloads. While time series workloads can be treated as analytical workloads in most cases, time series databases usually provide specialized query languages such as Flux, InfluxQL, PromQL [1] or MetricsQL [2] that simplify typical queries over time series data.

P.S. Apache Druid should be compared to ClickHouse [3] or similar analytical databases.

[1] https://valyala.medium.com/promql-tutorial-for-beginners-9ab...

[2] https://victoriametrics.github.io/MetricsQL.html

[3] https://clickhouse.tech/

Would this imply that you will be targeting in-memory analytics only?

Offtopic: I have been following your company and team closely during many Gophercons. Good luck to your team ... although a bit sad to see you go the Rust route :P

For now the execution is in-memory only. Over time we want to be able to execute against Parquet files on disk.

However, for large scale analytics on huge data sets where you're scanning all of it, we'll likely push you to EMR or something like that. The nice part is that those big data systems can execute directly against the Parquet files in object storage that form the basis of our durability.

This is part of the bet we're making compatibility with a larger data processing ecosystem.

Almost all of our use cases fall within what you can compute in memory on a single system (which is pretty much up to 1TB).

Do you have any plans for supporting versioned time series? As in "show me the time series as it was on 2020-11-11-08T:00:00:00Z"? Most of our time series are forecasts that change over time and that would be super helpful.

Anyway, thanks for your work!

Hi, one of the creators of IOx here. Unlike InfluxDB where you can only do efficient range based queries on one time value, in IOx you will be able to apply predicates efficiently to any column. That means you can have several columns for different times (e.g., time created, forecast time, expire time..).

You will be able to do efficient range queries on any of these. Most use-cases will work best when you partition your data into time chunks, which commonly would be the time you inserted the samples into the database (created time in your case I guess).

I don't think the time that the forecast is for should be the main index. Rather the time that the forecast is created should be the main index. In general I think the time index is about what happened at a particular point in time.

Interesting idea. How would you model the forecast time in this case? I'm not an expert but to my knowledge, you can only have one time index in InfluxDB, right?

I'm not sure about InfluxDB at all, sorry if I gave that impression. I have however designed and built data architecture for various forecasting software. "all data you perform analytics on is time series data" (from the article) rings so true with me.

However you need to consider that in this sense, it's about what is happening at any given moment in time. In your case, the forecasts are created at a point in time, and that "should" be their index speaking purely from a time-is-the-ultimate-index perspective.

Just some random questions that might help you answer your own questions: 1. Are there multiple forecasts made at the same time, for the same period in the future, but by different systems / algorithms / hyperparameters? Would you not want to keep multiple, and in that regard, what would be the "latest" forecast? 2. If your latest forecast is necessarily the most accurate, why would you need to keep previous versions in the same database?

None of this might be relevant to you, and maybe you'll only get the benefits of InfluxDB by using the forecast time as the index. But I thought I'd give you my thoughts just in case it helps you.

I think you're on the edge of describing bitemporal databases. You have a range which represents "time over which this fact is asserted to be true" and a second range for "time during which the assertion was recorded as valid". These typically get called "valid time" and "transaction time".

You can express forecasts this way. The valid time range is the range for which you assert a given forecast is going to be true. The transaction time range is the time during which you held that forecast to be the current forecast.

Using transaction time you can then reconstruct your state of belief from any point in time. Using valid time you can make assertions about any fact in ranges over the past, present or future.

I think Snodgrass's Developing Time-Oriented Database Applications in SQL[0] is still the best overall introduction, though the SQL used is fairly dated (ha). The relevant Wikipedia entry is OK too[1].

[0] https://www2.cs.arizona.edu/~rts/tdbbook.pdf

[1] https://en.wikipedia.org/wiki/Temporal_database

Can't you just have another time key?

Can I? That would be great. I thought InfluxDB did not support that but I might be wrong, I'm not really an expert.

What's it best suited for?

Right now it's just a project in development, so nothing yet. But it's supposed to be for time series data. This could be metrics, events, or any kind of semi-structured data that fits into tables where you want to ask questions about it over time.

It also defines rules for replication (push and pull), subscriptions to subsets of the data, and processing data as it arrives via a scripting engine.

Some of these features will arrive before others. Right now we want to make it work well for the data InfluxDB is currently good at (metrics, sensor data) and also work well for high cardinality data like events and tracing data.

do you expect any of your work on this to end up as rust libraries that can be used on their own?

Yes, absolutely. There are the Arrow libraries in Rust that we're already contributing to. DataFusion, for example is an SQL execution engine.

We'll be publishing crates for parsing InfluxDB Line Protocol, Reading InfluxDB TSM files (for conversion into Parquet and other formats), and client libraries as well.

The entire project itself will also be published as a crate. So you can use any part of the server in your own project.

cool. interested to see the TSM library.

if it's any use to you, here's an unpublished rust influx client crate I've been using for a few years. the main point of interest is an ergonomic macro to construct measurements, `measure!`. https://github.com/jonathanstrong/influx-writer

InfluxData is arguably playing catch-up with Thanos, Cortex, and other scale-out Prometheus backends for the metrics use case. Given that, I wonder why they decided to write a new storage backend from scratch instead of building on the work Thano and Cortex have done. Those two competing projects are successfully sharing a lot of code that allows all data to be stored in object storage like S3.


Those two systems are designed to work with Prometheus style metrics, which are very specific. You have metrics, labels, float values, and millisecond epochs.

I'm not totally sure how they index things, but I would guess that it's by time series with an inverted index style mapping for the metric and label data to underlying time series. This means they'll have the same problems with working with high cardinality data that I outlined in the blog post.

InfluxDB aims to hit a broader audience over just metrics. We think the table model is great, particularly for event time series, which we want to be best in class for. A columnar database is better suited for analytics queries, and given the right structure is every bit as good for metrics queries.

I agree, they appear to be playing catch up on many fronts. Notably, with Cortex, Tempo, and Loki, Grafana Labs seem to have pulled way ahead in advancing a successful open-source cloud observability strategy.

InfluxData have a long history of writing (and rewriting) their own storage engines, so choosing to do it again is unsurprising. I guess this sort of hints that the current TSM/TSI have probably reached their performance and scalability limits and will be EOL before too long.

What I find interesting is that this project is already almost a year old and only has six contributors (two of whom look like external contractors). It seems more like a fun side project than the future core of the database that is supposed to be deployed into production next year.

I think the best new projects are created by a small focused team. Adding too many people too early actually slows things down. But, of course, I'm biased.

The thing about this getting to production next year is that we're doing it in our cloud, which is a services based system where we can bring all sorts of operational tooling to bear. Out of band backups, usage of cloud native services, shadow serving, red/green deploys, and all sorts of things. Basically, it's easier to deploy a service to production once you've built a suite of operational tools to make it possible to do reliably while testing under production workloads that don't actually face the customer.

As for us rewriting the core of the database, that's true. But I think you're unrealistic about what the data systems look like in closed source SaaS providers as they advance through orders of magnitude of scale. Hint: they rewrite and redo their systems.

As for Grafna, MetricsTank was their first, Cortex wasn't developed there, and Loki and Tempo look like interesting projects.

None of those things has the exact same goal as InfluxDB. And InfluxDB isn't meant to be open source DataDog. That's not our thing. We want to be a platform for building time series applications across many use cases, some of which might be modern observability. It also doesn't preclude you from pairing InfluxDB with those other tools.

To be clear, InfluxDB IOx will be using Apache Arrow, DataFusion and Parquet, and contributing to those upstream projects, not creating our own thing.

>What I find interesting is that this project is already almost a year old and only has six contributors

I don't think this is a fair criticism. Postgres only has 7 core contributors - and influxDB is far less complex targeting a much simpler use case.

PostgreSQL might have only seven people on the "core team", but the active contributors list is much longer: https://wiki.postgresql.org/wiki/Committers

Furthermore, hundreds of people have been responsible for its development over the years: https://www.postgresql.org/community/contributors/

As an additional, possibly more relevant example - the Cortex project has 158 contributors: https://github.com/cortexproject/cortex

My larger point is that, in contrast with the new project, InfluxDB currently has ~400 contributors. I'm certain that many dozens of those were involved in getting the current storage engine to a stable place. And now that hard work is on a path to being deprecated by moving to a completely new language and set of underlying technologies.

Taking the project from a handful of contributors to a production-ready technology within an existing ecosystem is a non-trivial task. I'm sure it will come together eventually, but the commitment to ship it "early next year" seems unlikely to me.

We'll be producing builds early next year. Those won't be anything we're recommending for production. Our goal is to have an early alpha in our own cloud environment by the end of Q1. I stress alpha. But we'll also have a bunch of tooling around it (which we've already built for the other parts of our cloud product) to backup data out of band, monitor it, shadow test it against production workloads, etc.

We're also building on top of the work of a bunch of others that have built Arrow and libraries within the Rust ecosystem.

When the open source GAs, I don't really know. But we're doing this out in the open so people can see, comment, and maybe even contribute. Who knows, maybe after a few years you'll be a convert ;)

Also: https://github.com/VictoriaMetrics/VictoriaMetrics from the guy that did FastHTTP ( the fastest HTTP lib for Go ).

What a ride. You're close to releasing Influx 2.0 without a clear migration strategy for your customers, and then you think it's a good idea to announce yet another storage rewrite? Why should customers stick with you guys when you have a track record for shipping half baked software, rewriting it, and leaving people out in the cold?

Users of InfluxDB 1.x can upgrade to 2.0 today. It's an in-place upgrade and just requires a quick command to make it work. Further, InfluxDB 2.0 also has the API for InfluxDB 1.x. We've been putting out InfluxDB 1.x releases while we've been developing 2.0. One of the reasons we waited so long to finalize GA for 2.0 was because we had to make sure there was a clean migration path and there is.

For our cloud 1 customers, they'll be able to upgrade to our cloud 2 offering, but in the meantime, their existing installations get the same 24x7 coverage and service we've been providing for years.

As for how this will be deployed, it will be a seamless transition for our cloud customers when we do so. Data, monitoring and analytics companies replace their backend data planes multiple times over the course of their lifetime.

For our Enterprise customers, we'll provide an upgrade path, but not until this product is mature enough to ship an on-premise binary that won't get a chance to get upgraded but for a few times a year.

The only difference here is that we're doing it as open source. They always do theirs behind closed doors. I'm sure most of our users and many of our customers prefer our open source approach.

That is a rude comment. Why have you sticked with windows or mac os when they released 20 versions of their software? NT and 98, windows server, etc.? Because it does it's job, who cares that there are other versions that also works! I run influx stack in production for two years and had (almost) zero problems with it. Everything works, is being maintained and on the side the new cool stuff is being rolled out, what's the problem with that? Paul, as a happy user of influx, let me say - thank you guys for what you are creating!

Thanks :)

InfluxDB 1.x user can upgrade to InfluxDB 2.0 pretty easily, there is an `upgrade` command that will convert your metadata from 1.x to 2.0.

Your time series data doesn't even need to be touched, it'll "just work" after the upgrade.

This is different than the previous `influxd migrate` command that never seemed to work, right?

For a while the 2.0 development branch was using a different storage file format than 1.x, which required migrating your time series data.

But by the 2.0 Release Candidate that was reverted so that it will use the same file format as 1.x, and the 2.0 functionality was backported onto that, so the upgrade path for 1.x to 2.0 is much simpler now than it was going to be.

Migration is something we feel is very important, which is why we've delivered a single-CLI-command upgrade from 1.x to 2.0. Here's the documentation: https://docs.influxdata.com/influxdb/v2.0/reference/upgradin...

(InfluxDB employee here. Sorry I should have mentioned that in my original post.)

If you're an InfluxDB user looking to migrate to a more reliable and solid database, here's a tool [1] that allows you to do that in a single command:

[1]: https://www.outfluxdata.com

It would've been better if you explicitly disclosed your employment at TimescaleDB.

Apologies, I did so elsewhere in the thread. I work at Timescale!

I'm using Apache Arrow to store CSV-like data and it works amazingly well.

The datasets I work with contains a few billion records with 5-10 fields each, almost all 64-bit longs; I originally started with CSV and soon switched to a binary format with fixed-sized records (mmap'd) which gave great performance improvements, but the flexibility, size gains due to columnar compression and the much greater performance of Arrow for queries that span a single column or a small number of them won me over.

For anyone who has to process even a few million records locally, I would highly recommend it.

Arrow + Parquet is brilliant!

Right now I'm writing tools in Python (Python!) to analyse several 100TB datasets in S3. Each dataset is made up of 1000+ 6GB parquet files (tables UNLOADed from AWS Redshift db). Parquet's columnar compression gives a 15x reduction in on-disk size. Parquet also stores chunk metadata at the end of each file, allowing reads to skip over most data that isn't relevant.

And once in memory, the Arrow format gives zero-copy compatibility with Numpy and Pandas.

If you try this with Python, make sure you use the latest 2.0.0 version of PyArrow [1]. Two other interesting libraries for manipulating PyArrow Tables and ChunkedArrays are fletcher [2] and graphique[3].

[1] I use: conda install -c conda-forge pyarrow python-snappy

[2] https://github.com/xhochy/fletcher

[3] https://github.com/coady/graphique

Yeah, Parquet is awesome. One of the things we really want to do here is to push DataFusion (the Rust based SQL execution engine) to work on Parquet files, but to push down predicates and other things and operate on the data while it's compressed.

You pay such a high overhead marshalling that data into an Arrow RecordBatch. Best thing ever is to work with the Parquet file and not even decompress the chunks that you don't need. Of course, this assumes that you're writing summary statistics as part of the metadata, which we plan to do.

There's rudimentary statistics support, but I've found that it accounts for a lot of the write time (I wrote some tests last weekend, I can ping your team when I put them on GH).

Improving our stats writing could yield a lot of benefits. I'll open JIRAs for this in the next few days.

Parquet + Arrow reminds me of a fast SQL engine on data lake called Dremio.


They also have an OSS version in GitHub.

They're heavily into Arrow. A few years ago they contributed Gandiva, an LLVM expression compiler for super fast processing. https://arrow.apache.org/blog/2018/12/05/gandiva-donation/

It's one of the reasons I like being all in on Arrow. Why do everything ourselves when a ton of other smart people are working on this too?

Talking about Gandiva, something that's open for taking: https://issues.apache.org/jira/browse/ARROW-5315 (creating Gandiva bindings for Rust).

I think DataFusion is mature enough that we could plug in Gandiva into it.

Disclaimer: I work on the Arrow Rust implementation

Ah yes, of course, hi Nevi :). Thank you again for all your work on the Rust implementation. We're obviously big fans.

Gandiva bindings is definitely something we should look into, but I'm guessing there's much lower hanging fruit within DataFusion in terms of optimizing, particularly for our use case.

Thanks Paul :)

I think with compute functions/kernels, we're sitting under a grapevine, so we'll be able to add a lot to Arrow without yet needing Gandiva bindings. The Rust stdsimd [0] work will also enable us to better use SIMD in ~ a year from now (I hope)

[0] https://github.com/rust-lang/stdsimd

I am confused about Parquet wrt to the rest of your stack. Is it just that Parquet happens to be the Redshift export format? Or are you actually using Arrow and Parquet at the same time in some manner?

Parquet is the persistent storage format for when the data is written to disk or object storage. Arrow is the in-memory format and a set of tools built around working with it.

So InfluxDB IOx will use both, Arrow in-memory for fast access, and Parquet on storage for persistence.

Ok, I didn't realize that Arrow was in-memory and not also on-disk. As serializing between the two didn't make sense to me. I would have thought that Arrow would also be an on-disk format (mmap'd) so that there would be little to no conversion losses.

Why convert to an on-disk format (Parquet) and not save the in-memory representation to storage directly?

Parquet is awesome, but it's sad their Rust parquet crate only works on nightly rustc, which is a no-no in many cases. Not sure why they chose to go that way (well, because of trait specialization, technically speaking - which is not landing and not going to land anytime soon).

There are contributors and committers in the Arrow project working to resolve this. We recently removed specialization from the core Arrow crate and we plan on doing the same for the Parquet crate.

Yes, thanks a million for doing that! I've read the 'despecialization' of arrow-rust PR which actually seemed to end up being quite simple (but it's a breaking change, so a 3.0 candidate?)

Hope that something similar can be done in parquet so that arrow-parquet-df can all compile on stable in 3.0, that would unlock it for many potential users.

Thanks for sharing graphique, that is very neat.

Did you have to implement your-format to Arrow conversion /abstraction layer manually or was it already available? Could you give out some pointers on how to roll out own binary-format-to-arrow query engine? Why didn't you use parquet for storage?

Sorry, I didn't check back on this comment after posting it, I hope you'll see this.

It's manual. What you get with Arrow is an efficient way to store structured data in a way that values for the same column (same dimension) are together on disk rather than having each record with all its fields together. So if you're storing say a dataset of users with a 64-bit user ID, an IP address, a timestamp, and a country code you'd define a Schema object as having these 4 columns with the size of each one (here 64/32/32/16 bits for example) and then you'd start writing your records block by block. A block is just a set of records and Arrow will mark the start and end of each block. Up to you to decide when to start and end a block, I use 100k entries per block but haven't played much with different values.

In pseudo-code it'd be something like this when reading just the user IDs:

    VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
    BigIntVector userIdVector = (BigIntVector) root.getVector("user_id"); // gets this 64-bit dimension from the schema
    // ... more vectors defined, one for each dimension
    List<ArrowBlock> blocks = arrowReader.getRecordBlocks();
    for (ArrowBlock block : blocks) { // go over all blocks
        arrowReader.loadRecordBatch(block); // *actually* reads the block
        for (int i = 0; i < block.getRowCount(); i++) {
            long userId = userIdVector.get(i);  // offset is within the current block
The code in this example will only go over the user IDs, and will read them very quickly. So yes, you have to implement any sort of querying capabilities yourself. In my case it was simple set of queries like "get distribution of dimension X" where X can be a parameter, or "filter records where X < minX || X > maxX", also with parameters, etc. Just a handful in all.

For a limited set of queries and not something like full SQL, this was perfect. I found this article very useful to get started: https://github.com/animeshtrivedi/blog/blob/master/post/2017...

InfluxDB IOx will use Parquet files for persistent storage

"Columnar databases aren’t new, so why are we building yet another one? We weren’t able to find one in open source that was optimized for time series."

This is the direction that QuestDB (www.questdb.io) has taken: columnar database, partitions by time and open source (apache 2.0). It is written is zero-GC java and c++, leveraging SIMD instructions. The live demo has been shown to HN recently, with sub second queries for 1.6 billion rows: https://news.ycombinator.com/item?id=23616878

NB: I am a co-founder of questdb.

There is more mature ClickHouse database [1]. This is column-based OLAP database, which provides outstanding query performance (it can scan billions of rows per second per CPU core) and outstanding on-disk data compression (up to 100x if proper table configs are used). ClickHouse also scales horizontally to multiple nodes. Comparing to QuestDB, ClickHouse consistently shows high performance on a wide range of query types from production. It also provides many SQL extensions optimized for analytical workloads.

BTW, VictoriaMetrics [2] is a time series database built on top of ClickHouse architecture ideas [3], so it inherits high performance and scalability from ClickHouse, while providing simpler configuration and operation for typical production time series workloads.

[1] https://clickhouse.tech/

[2] https://victoriametrics.com/

[3] https://valyala.medium.com/how-victoriametrics-makes-instant...

Please stop shilling "victoria metrics" in every one of your posts.

Another SQL engine on data lake that heavily uses arrow is Dremio.



If you have parquet on S3, using an engine like Dremio (or any engine based on arrow) can give you some impressive performance. Key innovations in OSS on data analytics/data lake:

Arrow - Columnar in memory format; Gandiva - LLVM based execution kernel; Arrow flight - Wire protocol based on arrow; Project Nessie - A git like workflow for data lakes

https://arrow.apache.org/. https://arrow.apache.org/docs/format/Flight.html. https://arrow.apache.org/blog/2018/12/05/gandiva-donation/ https://github.com/projectnessie/nessie

What service could I replace Athena / PrestoDB that uses Apache Arrow?

Looking at Dremio’s website they seem to be a good competitor to presto/Athena for some use cases.

Alternative solutions depends on your use case. If it’s about querying S3 data then Dremio/Athena/Presto/spark are good.

I didn't see it in the post, but a huge amount of this is due to Andy Grove's work on Rust implementations of Apache Arrow and DataFusion.

I imagine he's happy where he is, but I hope there's some opportunity for InfluxDB to give credit and support for his great work.

It absolutely is and we've been contributing back. A even bigger amount of this is based on Wes McKinney's work on Arrow. Andy is great and he's been helpful as we've been working with DataFusion.

How does InfluxDB compare to TimescaleDB? My understanding is that the use case is pretty similar (time series/metrics), are they good at different things?

For perhaps the most comprehensive comparison between InfluxDB and TimescaleDB see this blog post [1] which compares the two DBs on data model, query language, reliability, performance, ecosystem, operational management, and company/community support.

[1]: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

This comparison is vs. InfluxDB. This thread is about a new project called InfluxDB IOx. It's under development and we're not producing builds, so any kind of operational comparison would be very premature.

Its architecture is dramatically different than InfluxDB. You can do a comparison of the design goals. Read the post this thread refers to. I think you'll find it has very different goals than Postgres, an OLTP database, and Timescale, which is built on top of it.

Thank you for clarifying, OP's original comment refers to TimescaleDB vs InfluxDB hence the comparison blog between the two.

As you note, a comparison between InfluxDB IOx and TimescaleDB is not possible at this time, due InfluxDB IOx being still under development and unavailable for comparison, so that blog is the next best thing for developers looking for answers today.

It would be great to see similar comparison for TimescaleDB vs VictoriaMetrics :) There are some benchmarks ([1], [2], [3]) that compare performance and resource usage between TimescaleDB, InfluxDB and VictoriaMetrics, but these benchmarks may be outdated.

[1] https://medium.com/@valyala/when-size-matters-benchmarking-v...

[2] https://medium.com/@valyala/high-cardinality-tsdb-benchmarks...

[3] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

Hi valyala (cofounder of VictoriaMetrics) — I know we’ve talked about this several times before already, so you should know — those benchmarks you linked are from 2018. They pre-date many key features in TimescaleDB, including features like native compression, which invalidate many of those findings. So they aren’t really relevant or valid anymore. Just don’t want people to draw wrong conclusions.

(Disclaimer: I work at Timescale)

Timescale is built on top of Postgres, which is a row oriented database. They've built a kind of columnar layer on top of it, which is quite interesting. Because it's Postgres you get their full SQL support.

Meanwhile, InfluxDB IOx has a very different set of goals than Postgres. It's not an OLTP (transactional) DB and never will be. It's firmly targeted at OLAP and real-time OLAP workloads.

That means we can do things like optimize for running on ephemeral storage with object storage as the persistence layer. It'll have fine grained control over replication, how data is partitioned in a cluster, and where data is indexed, queried, queued for writes and more. Push and pull replication, bulk transfer, and persistence with Parquet. This last bit means you get integration with other data processing and data warehousing tools with minimal effort.

It'll also support Arrow Flight which will give it great integration into the data science ecosystems in Python and R.

Right now, InfluxDB IOx is really too early to do any real comparison on actual operation. We're putting this out now so that people can see what we're doing, comment on it, and maybe even contribute. We think it's an interesting approach where no single item is completely novel, but the composition of everything together makes it an entirely unique offering in open source.

Edit: one other thing I forgot to mention. InfluxDB IOx is open source, Timescale isn't. For some that matters, for many it doesn't. Depends on your use case.

Can you elaborate on what you mean by TimescaleDB not being open source?


It's under a community license, which has restrictions. The limitations on derivative works and value added products or services are the ones that will create the most problem for people trying to build a business on it: https://www.timescale.com/legal/licenses

For users within large organizations, they're likely not able to use the software without approval from their legal department because it doesn't fall under any open source license.

Like I said, whether you care is really case dependent.

This is misinformation. Most of TimescaleDB is open source under Apache 2. The difference is that the advanced features of TimescaleDB - Eg clustering - are under a source available license and are free, while advanced InfluxDB features like clustering are under a paid enterprise license. In fact TimescaleDB recently made all of our enterprise features available for free. So one could argue that TimescaleDB is more open than Influx.

For more information on TimescaleDB licensing, check out this blog post: https://blog.timescale.com/blog/building-open-source-busines...

(Disclaimer/ fyi: I work at Timescale)

My post is about InfluxDB IOx, which is the project this thread is about. You're correct about InfluxDB having HA and clustering under a closed source enterprise license. If you read the post, I even mention this as a shortcoming of the project. One which we're hoping to rectify with InfluxDB IOx.

So some parts of Timescale are under actual Apache 2 and some parts are under a proprietary source available license. I'm not sure what the LOC of which is which, or how it's actually organized in your repo. I'll leave it up to your potential users to try to figure out which and disentangle what parts are actually open.

As I recall, AWS very publicly forked Elastic because of this very same type of confusion. The difference is that if AWS were going to fork your project, they'd just fork Postgres, which is the real open source software that you're benefitting from.

If I were building an developer focused analytics, monitoring, or data analysis product, I wouldn't do it on top of Timescale because some parts of your codebase most definitely prevent that through your license. But that's me.

> If I were building an developer focused analytics, monitoring, or data analysis product, I wouldn't do it on top of Timescale because some parts of your codebase most definitely prevent that through your license. But that's me.

That's also FUD, two ways.

First, what the Timescale License prevents is somebody offering our Community Edition as a standalone "TimescaleDB-as-a-Service", a la AWS bundling it as part of RDS, or Microsoft as part of Azure Postgres. There is a clean technical test for "DDL access to the database" by users in the license. It's not tricky. You can absolutely develop/sell/distribute/provide analytics, monitoring, or data analysis products on top of TimescaleDB Community Edition. Many companies do.

As to "hopelessly-entangled source", if you know what a directory is, you can tell the difference. There's a "/tsl" subdirectory with Timescale Licensed code. Everything else is Apache2. You can compile pure Apache-2 versions with a single compile flag, and we distribute Apache2 binaries. In fact, the Postgres community itself distributes Apache-2 binaries, and Microsoft, Digital Ocean, Rackspace, and other clouds make the Apache2 version available as part of the managed database offerings.

So if my users can't have DDL access, that means that they can't define the schemas for the analytics that they want to do? It only works if as the developer of an application I have a fixed schema that my users interact with?

TimescaleDB behaves more like a regular relational database, while Influx is fairly different and has some interesting nuances that you'll need to understand if you want to have a table with a bunch of columns. For example having a lot of metadata columns has different performance implications than having a high cardinality of actual measurements (although it's been long enough since I've used Influx I don't remember what those differences actually are)

If you're used to writing SQL, TimescaleDB is much easier to write queries with although if you get over the learning curve both of the query languages in Influx seem very powerful

One notable advantage of Influx is its integrations with other tools for ingest and visualization, and it seems like 2.0 is doubling down on that

There is already a general purpose working-in-progress OLAP project written in Rust.


1. TensorBase is highly hackable. If you know the Rust and C, then you can control all of the world. This is obvious not for Apache Arrow and DataFusion (on the top of Arrow).

2. TensorBase uses the whole-stage JIT optimization which is (in complex cases possibly hugely) faster than that done in Gandiva. Expression based computing kernel is far from provoding the top performance for OLAP like bigdata system.

3. TensorBase keeps some kinds of OLTP in mind (although in the early stage its still in OLAP). There is no truely OLTP or OLAP viewpoints in users. Users just want all their queries being fastest.

4. TensorBase is now APL v2 based. Enjoy to hack it yourself!

ps: One recent writting about TensorBase (and those compared with some query engine and project in Rust works included) could be seen in this presentation: https://tensorbase.io/2020/11/08/rustfest2020.html

Disclaimer: I am the author of TensorBase.

It looks like TensorBase has many common things with ClickHouse. Do you have benchmarks that compare performance of TensorBase with ClickHouse similar to benchmarks published by ClickHouse [1]?

[1] https://clickhouse.tech/benchmark/dbms/

The new direction is really promising. However, we at ScyllaDB (/me am a co-founder) already meet most of the requirements. We use C++20 with an advanced shard per core, we have an open format for the files, you can easily import/export them. One can use Scylla as a general DB and also run KairosDB for timeseries specific if needed.

Recently we have a MSc project to add Parquet which is a very good direction, couldn't agree more.

How does datafusion based engine like this compare to timely/differential dataflow(naiad)?

PS: Im a rookie in this whole domain.. so any pointers would be really helpful.

I'm not too familiar with their stuff, but I think in terms of approach, they're very different. This project doesn't aim to product materialized views, which I think is more of what naiad is for? I'm not sure about differences in terms of what problems they're trying to sovle.

Very cool! - I'm curious how far Influx will then move into being a general purpose columnar database system outside of typical timeseries workloads - moving more into being a general purpose OLAP DB for analytical "data science" workload?

Will there be any type of transactional guarantees (ACID) using MVCC or similar?

Is the execution engine vectorised?

Execution is vectorized, but that's Arrow really. We'd like this to be useful for general OLAP workloads, but the focus for the next year is definitely going to be on our bread and butter time series stuff.

That being said, Arrow Flight will be a first class RPC mechanism, which makes it quite nice for data science stuff as you can get data in to a dataframe in Pandas or R in a few lines of code and with almost zero serialization/deserialization overhead.

This isn't meant to be a transactional system. More like a data processing system for data in object storage. I'm curious what your need is there for OLAP workloads, can you tell me more?

key takeaway:

> As an added bonus, within the Rust set of Apache Arrow tools is DataFusion, a Rust native SQL query engine for Apache Arrow. Given that we’re building with DataFusion as the core, this means that InfluxDB IOx will support a subset of SQL out of the box

What is the difference between Apache spark and Apache arrow?

Apache Arrow is a specification for in-memory columnar data, IPC format + Flight protocol, with implementations in a number of languages. Some of the implementations contain code to perform computations on the in-memory data. Some of the implementations contain some form of query engine. All of these are single process / libraries, rather than distributed systems.

Apache Spark is a distributed compute platform, which does have some support for Arrow for interop purposes.

One of the things that led me to get involved in Arrow originally was to explore the idea of building something like Apache Spark based on Arrow (and Rust) and my latest prototype of that concept is in the Ballista project [1].

[1] https://github.com/ballista-compute/ballista

Layoffs coming, then?

(If you don't know, don't guess.)

Just built InfluxDB IOx from sources [1] and compared data ingestion performance to VictoriaMetrics by using Billy tool [2] on a laptop with Intel i5-8265U CPU (it contains 4 CPU cores) and 32GB RAM. Results for 1M measurements are the following:

- InfluxDB IOx: 600K rows/sec

- VictoriaMetrics: 4M rows/sec

I.e. VictoriaMetrics outperforms InfluxDB IOx by more than 6x in this benchmark. I hope InfluxDB IOx performance will be improved over time, since it is written in Rust.

[1] https://github.com/influxdata/influxdb_iox

[2] https://github.com/VictoriaMetrics/billy

There is absolutely no point benchmarking it at this stage. One of the many reasons we're not bothering to produce builds right now. We'll let you know when it's time to even have a look as I'm sure you'll want to continue to compare it with Victoria.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact