Often this means building systems to analyze terabytes of logs in [semi-]realtime. All I have to say is - thank god! This is going to make my job a lot easier, and likely let us retire our current infrastructure setup.
I know at one point we actually considered building our own time series database. Instead, we ended up using a Kafka queue in front of an SQL-based backend after we parsed and pared down the data, because it was the only setup quick enough for the queries.
Should make a lot of the modeling I've worked on a bit easier.
They make insane promises, but the promises don't live up to expectations.
Kinesis Analytics, for example, can aggregate data across a time window (sliding window) from a stream (Kinesis). A huge issue that isn't documented or stated anywhere is that when Kinesis Analytics restarts because the process dies (being migrated, bin-packed, etc.), the ENTIRE time window has to be re-aggregated. So your count drops to 0.
Really unacceptable if you're using it to generate KPIs which you alert on. We ended up switching to a system which pushes the stream data into InfluxDB and runs the aggregations there via queries.
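To make the failure mode concrete, here's a toy Python sketch (my own illustration, not Kinesis Analytics' actual internals) of why a purely in-memory windowed aggregate drops to zero on restart:

    import time

    class InMemoryWindowCounter:
        """Counts events over a sliding window; state held in RAM only."""

        def __init__(self, window_seconds):
            self.window_seconds = window_seconds
            self.events = []  # timestamps of events in the current window

        def record(self, ts):
            self.events.append(ts)

        def count(self, now):
            # Drop events that have fallen out of the window.
            self.events = [t for t in self.events if t > now - self.window_seconds]
            return len(self.events)

    counter = InMemoryWindowCounter(window_seconds=300)
    for _ in range(1000):
        counter.record(time.time())
    print(counter.count(time.time()))   # 1000

    # Process gets migrated/bin-packed -> fresh process, empty state:
    counter = InMemoryWindowCounter(window_seconds=300)
    print(counter.count(time.time()))   # 0 -- the count your alert sees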
Dealing w/ AWS during this entire process was a huge pain.
I've loaded 100 billion rows into a 5-shard database and can do full queries across the whole dataset in under 10 seconds. It also natively consumes multiple Kafka topics.
ClickHouse may be used as a timeseries backend, but currently it has a few drawbacks compared to specialized solutions:
- It has no efficient inverted index for fast metrics lookup by a set of label matchers (see the toy sketch after this list).
- It doesn't support delta coding yet - https://github.com/yandex/ClickHouse/issues/838 .
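For anyone unfamiliar with what such an index buys you, here's a toy Python sketch of the posting-list intersection a specialized TSDB does for label matchers (the label names and series IDs are made up for illustration):

    from collections import defaultdict

    # Inverted index from (label name, label value) -> set of series IDs.
    index = defaultdict(set)

    series = {
        1: {"__name__": "http_requests", "job": "api", "instance": "a"},
        2: {"__name__": "http_requests", "job": "api", "instance": "b"},
        3: {"__name__": "cpu_usage", "job": "api", "instance": "a"},
    }
    for sid, labels in series.items():
        for name, value in labels.items():
            index[(name, value)].add(sid)

    def lookup(matchers):
        """Intersect posting lists for a set of exact label matchers."""
        sets = [index[m] for m in matchers]
        return set.intersection(*sets) if sets else set()

    print(lookup([("__name__", "http_requests"), ("job", "api")]))  # {1, 2}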
Learn how we created a startup - VictoriaMetrics - that builds on performance ideas from ClickHouse and solves the issues mentioned above - https://medium.com/devopslinks/victoriametrics-creating-the-... . Currently it has the highest performance/cost ratio compared to its competitors.
Have you done any load tests that more closely mirror a production environment, such as running queries while ClickHouse is handling a heavy insert load?
But I've spent a long time looking at the various solutions out there, and while ClickHouse is not perfect, I think it's the best multi-purpose database out there for large volumes of data. TimescaleDB is another one, but until they get sharding it's dead on arrival.
The "basically" is what intrigues me :D
    CREATE TABLE points (
        Timestamp DateTime,
        Client String,
        Path String,
        Value Float32,
        Tags Nested(Key String, Value String)
    ) ENGINE = MergeTree()
    ORDER BY (Client, Timestamp, Path)
    PARTITION BY toStartOfDay(Timestamp)
    SELECT
        (intDiv(toUInt32(Timestamp), 15) * 15) * 1000 AS t,
        Path,
        Value AS c
    FROM points_dist
    WHERE Path LIKE 'tst_val1'
      AND Tags.Value[indexOf(Tags.Key, 'server')] = 'node'
      AND Timestamp >= toDateTime(1543421708)
    GROUP BY t, Path, Value
    ORDER BY t, Path
Imagine you have 1000 servers submitting data to 100 timeseries each, every minute. That's 100,000 writes a minute (unless they support batch writes across series). At $0.50 per million writes, that's $72 a day, or ~$26k a year.
Now imagine you want to alert on that data. Say you have 100 monitors that each evaluate 1GB of data once a minute. At $10 per TB of data scanned, that's $1,440 a day or $525k a year!
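A quick back-of-the-envelope in Python, using the prices quoted above (the workload is hypothetical, obviously):

    # Write side: 1000 servers * 100 series, one write per series per minute.
    servers, series_per_server = 1000, 100
    writes_per_day = servers * series_per_server * 60 * 24     # 144,000,000
    write_cost_per_day = writes_per_day / 1e6 * 0.50
    print(write_cost_per_day, write_cost_per_day * 365)        # 72.0 26280.0 (~$26k/yr)

    # Query side: 100 monitors each scanning 1 GB once a minute.
    monitors, gb_per_eval = 100, 1.0
    tb_scanned_per_day = monitors * gb_per_eval * 60 * 24 / 1000   # 144 TB/day
    query_cost_per_day = tb_scanned_per_day * 10.0
    print(query_cost_per_day, query_cost_per_day * 365)        # 1440.0 525600.0 (~$525k/yr)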
The only way I can have trust in Amazon's proprietary products is if RDS continues to get less expensive every year, since that is effectively the BATNA to these new products. It's been a while now since the last RDS cost reductions, and unless we continue to see more of those it's hard to have confidence that Amazon will continue to treat their customers of these new proprietary services fairly over the long term.
Let's say the read is 1MB instead of 1GB, that's now $1.44 a day and $525 a year. Query pricing becomes not so bad.
From my own experience, the 100 metrics per server estimate I gave above is pretty low, though. Once you factor in different combinations of tags, closer to 1000 is more realistic. That potentially brings the write pricing up quite a bit.
Self-hosting isn't the only option though. For example, that hypothetical 1000 server scenario would cost $180k a year at list pricing on Datadog or SignalFX.
That's a weird comparison. 20x is only true if you write 8KB with every entry, and you haven't included the storage and instance savings.
It's not hard to come up with suboptimal scenarios where this is more expensive, but that's missing the point. It's optimized for a specific kind of usage pattern.
Curious to see what the query language is for this; I wonder if they're just exposing the backing store for CloudWatch as a service now.
"With Timestream, you can easily store and analyze log data for DevOps, sensor data for IoT applications, and industrial telemetry data for equipment maintenance."
I've seen a lot of people complain about pricing, so I thought I'd share a little why we are excited about this:
We have approximately 280 devices deployed, monitoring production lines and sending aggregated data every 5 seconds via MQTT to AWS IoT. We see around ~2 million messages published a day (equipment is often turned off when not producing). Each packet is very small and highly compressible, below 1KB, but let's just call it 1KB.
We currently funnel this data into Lambda, which processes it, puts it into DynamoDB, and handles rollups. The cost of that whole thing is approximately $20 a day (IoT, DynamoDB, Lambda and X-Ray), with Lambda+DynamoDB making up $17 of that.
Finally, our users look at this data live on dashboards, usually at the last 8 hours of data for a specific device. Let's assume there will be 10,000 queries each day, each looking at that day's data (2GB/day / 280 devices = 0.007142857 GB/device/day).
Now, running the same numbers on the AWS Timestream pricing (daily cost):
- Writes: 2 million * $0.50/million = $1
- Memory store: 2 GB * $0.036/GB = $0.072
- SSD store: (2GB * 7days) * $0.01 (GB/day) * 7days = $0.98
- Magnetic store: (2 GB * 30 days) * $0.03 (GB/month) = $1.80
- Query: 10,000 queries * 0.007142857 GB/device/day --> ~71 GB/day = free until day 14, after which it'll cost $10, so ~$20 a month
Giving us: $1 + $0.072 + $0.98 + $1.8 + ($20/30) = $4.5/day.
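Sanity-checking that total in Python, reproducing the line items exactly as written above (including the SSD formula as given):

    writes         = 2_000_000 / 1e6 * 0.50   # $1.00/day
    memory_store   = 2 * 0.036                # $0.072/day
    ssd_store      = (2 * 7) * 0.01 * 7       # $0.98, formula as written above
    magnetic_store = (2 * 30) * 0.03          # $1.80
    queries        = 20 / 30                  # ~$0.67/day ($20/month spread out)
    total = writes + memory_store + ssd_store + magnetic_store + queries
    print(round(total, 2))                    # 4.52 -> ~$4.5/day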
From these (very) quick calculations, it looks like we could lower our cost from ~$20/day to ~$4.5/day. And that's not even taking into account that it removes our need to create/maintain our own custom solution.
I am probably missing some details, but it does look bright!
Our goal remains the same: build the best possible product that optimizes for developer productivity and happiness. And open source as much as we possibly can while maintaining a healthy business.
Redis and MongoDB at least seem to have woken up
There's a lot to unravel in there.
I prefer 'free software' to 'open source' as it has a clearer meaning, especially in this context. Even so, no one can steal free / open source software (or, as you say, product -- though that term strongly implies a commercial offering).
By definition you can't really stop anyone from using your free software, unless perhaps you start naming companies explicitly, but I can't imagine it'd be an easy process, or have a happy outcome, if you started targeting 'major cloud providers' for special conditions.
Note that I am not an apologist for AWS, Google, Microsoft, etc - but it feels like the fundamental problem here is not massive corporations charging other people to access free software.
Open source is software that through the license enforces the source code to remain open.
I'm not a fan of RMS or his attitudes on most things, but am a strong OSS fan as it is the best way to develop and maintain software.
Entirely agree, hence I drew the distinction. I eschew 'open source' as it's highly ambiguous, and mostly misses the point.
> Free Software is software that through the license enforces a philosophy.
I would disagree. Free software ensures the user has certain freedoms.
> Open source is software that through the license enforces the source code to remain open.
This is a very circular definition -- open source is open.
> I'm not a fan of RMS or his attitudes on most things, but am a strong OSS fan as it is the best way to develop and maintain software.
As it happens, rms is no fan of OSS.
I eschew free software because I'm not about forcing my views on others (which is literally the mission of GNU). I'm about developing software to be the best it can be, and maybe meeting some friends along the way. Open source, being the best software development model overall, allows me to meet that goal. You could almost say some of RMS's more extreme quirks border on authoritarian (see the abortion joke in the libc manual that he FORBADE the removal of, and demanded be re-added when a dev simply overruled him). He's not acting in a manner that encourages "freedom"; he acts as a simple and obvious dictator of all things GNU or claiming to be GNU. He's frequently tried to shape the path of GNOME (I'm a former foundation member and was on the sysadmin team) in areas he literally has no business weighing in on. Then there are the grosser personality problems, like his sexism, his tendency to actually eat his toenail gunk, or his refusal to ever be wrong about anything, even when an entire community disagrees with him.
Dr Stallman has done a great deal of good for the world with Free Software; however, like the VAX and PDP-11, his time has passed. Open source won, just like Linux won over GNU/Hurd. It is ok that he won, by losing.
In the context of GP's (beginningguava) comment about 'open source projects' needing to change their licensing to prevent corporations making money by SaaSing various tools, my point was twofold - first, by definition you can't have free software with restrictions like that, and second you'd be merely fighting the symptoms (with little chance of success).
Aside - I'm curious what you mean by the 'open source software development model', as I don't think that's actually a thing.
It goes back to ESR's The Cathedral and the Bazaar and what he deems "the bazaar model" or "bazaar style", before he, Larry Augustin, and Bruce Perens (if memory serves) went on to coin the phrase "open source". Even if you don't necessarily agree with ESR (I see him in a similar vein as RMS, fwiw), his thoughts on software development models have, generally speaking, been proven true.
This is incorrect; licenses like MIT or BSD are also free software licenses because they afford the 4 freedoms to the user (even though they don't enforce them on derivative works).
Licenses like the Redis one are open-source but not free software, because they place limitations on the user (can't sell a hosted service, IIRC).
Having said that, I can definitely see this being an interesting product for people doing less than 10k inserts per second.
Competing with AWS on just cost sounds worrying to me.
From the little that was said, going to guess this uses something like Beringei (https://code.fb.com/core-data/beringei-a-high-performance-ti...) under the hood
I'll have to go look into this, because if AWS's historic pricing holds for any large-volume stream, this quickly becomes untenable.
It's very easy to have gobs and gobs of time series points... AWS might make this way too expensive for anything at relative scale for a small startup?
https://kx.com/discover/in-memory-computing/ seems to indicate that it takes up ~600 Kb (I'm not sure if this is bits or bytes, but even if it's bits, that turns into 75KB)
L1 cache is per core. Skylake Xeons have a 64KB cache per core, 32KB for data and 32KB for instructions. Even with an even split there, you're not fitting 75KB (or 600KB) into the L1 cache.
Bits would be a weird measurement to use when talking about memory utilization, so I'm pretty sure that it's 600 kilobytes. You're not anywhere close to fitting that into the L1 cache. L2 cache, sure. But you get the relatively spacious 1 megabyte for L2.
I'm also not sure that the "core" fitting into the CPU cache is particularly meaningful for performance anyway. It doesn't say anything about how much outside of the core gets used, how big the working set is for your workload, or how much meaningful work is done on that working set. If you're frequently using parts of the software that don't fit in the cache, or getting evicted by other code, or your working set doesn't fit in the cache so you're constantly going to main memory for the data you're working on, then the "core" fitting in L1 cache (or L2 cache, which looks more realistic) is basically meaningless.
Kusto is still the name of the query language and the desktop application (Kusto.Explorer), the service was just renamed to Azure Data Explorer.
Back then (internally) we actually had a lot of issues with ingesting and querying time series data at scale.
Neither tech works on its own, but together (substituting Cassandra columns for HDFS) it was magic for the specific data configuration & use case.
IMO, it makes way more sense to decide the aggregations you want ahead of time (e.g. "SELECT customer, sum(value) FROM purchases GROUP BY customer"). That way, you deal with substantially less data and everything becomes a whole lot simpler.
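A toy streaming version of that idea in Python (the names are made up for illustration): keep only the running aggregates you decided on up front, instead of retaining every raw point for ad-hoc querying.

    from collections import defaultdict

    totals = defaultdict(float)  # customer -> sum(value)

    def ingest(event):
        # Streaming equivalent of:
        #   SELECT customer, sum(value) FROM purchases GROUP BY customer
        totals[event["customer"]] += event["value"]

    for e in [{"customer": "a", "value": 3.0},
              {"customer": "b", "value": 1.5},
              {"customer": "a", "value": 2.0}]:
        ingest(e)

    print(dict(totals))  # {'a': 5.0, 'b': 1.5}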
What some would do is record in blocks where every point after the earliest is stored as a delta. Then each block is more compressible as it contains a lot of 0s.
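A minimal Python sketch of that idea (not any particular database's implementation): the first point is stored absolutely, every later point as a difference, so a slowly-changing series becomes mostly zeros and small integers that compress well.

    def delta_encode(values):
        deltas = [values[0]]
        for prev, cur in zip(values, values[1:]):
            deltas.append(cur - prev)
        return deltas

    def delta_decode(deltas):
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    block = [1543421700, 1543421760, 1543421820, 1543421880]
    print(delta_encode(block))   # [1543421700, 60, 60, 60]
    assert delta_decode(delta_encode(block)) == block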
Some tasks actually require absolute granularity - up to 6 decimal places of timestamp precision, and beyond that, reliance on atomic order of arrival - for deterministic results on high-frequency trading data.
Without absolute knowledge of the order, or with aggregation in the way, the best you can do is approximate, which is often considered suboptimal when the real solution is available.
What I'm doing with one program could almost be called a data lake, because it's just a bunch of JSONL files with really varied data in them. But it's organized by date and hour, as well as by predefined keys, since I know I'll need to query it that way.
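Roughly this kind of layout, sketched in Python (the path scheme here is illustrative, not necessarily what the parent runs): partitioning by the axes you know you'll query means a date/hour lookup only has to open the matching files.

    # Layout: data/<key>/<YYYY-MM-DD>/<HH>.jsonl
    import json
    from pathlib import Path

    def read_hour(root, key, day, hour):
        path = root / key / day / f"{hour:02d}.jsonl"
        if not path.exists():
            return
        with path.open() as f:
            for line in f:
                yield json.loads(line)

    for rec in read_hour(Path("data"), "sensor_a", "2018-11-28", 14):
        print(rec)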
CloudWatch metrics are also very expensive for what you get, so that's another similarity to Timestream ;)
1. Amazon Timestream (amazon.com)
3. Amazon Quantum Ledger Database (amazon.com)
8. Amazon FSx for Lustre (amazon.com)
13. AWS DynamoDB On-Demand (amazon.com)
14. Amazon's homegrown Graviton processor was very nearly an AMD Arm CPU (theregister.co.uk)
21. Building an Alexa-Powered Electric Blanket (shkspr.mobi)
30. Amazon FSx for Windows File Server (amazon.com)
It happens every year during Google, Apple, Amazon, and Facebook events.
(Written from keynote floor)
EDIT: Is mobot a dead project?
What do you think Hacker News is supposed to be? Things that are mundane to most people are often exactly what's interesting to the target audience here. Amazon is having its annual AWS conference and thus has a lot of announcements; of course there are a lot of new niche products.