
Amazon Timestream – Fast, scalable, fully managed time series database - irs
https://aws.amazon.com/timestream/
======
citilife
At my day job, I build a lot of machine learning systems that require data to
be fed in a time series manner[1].

Often this means building systems to analyze terabytes of logs in
[semi-]realtime. All I have to say is - thank god! This is going to make my
job a lot easier, and will likely let us retire our current infrastructure
setup.

I know at one point we actually considered building our own time series
database. Instead, we ended up using a Kafka queue with an SQL-based backend
after we parsed and pared down the data, because that was the only setup
quick enough to do the queries.

Should make a lot of the modeling I've worked on a bit easier[1].

[1] [https://medium.com/capital-one-tech/batch-and-streaming-in-t...](https://medium.com/capital-one-tech/batch-and-streaming-in-the-world-of-data-science-and-data-engineering-2cc029cdf554)

~~~
lykr0n
Have you looked at ClickHouse for time series data? It's the one database
I've found that can both scale and query in near-realtime.

I've loaded 100 billion rows into a 5-shard cluster and can do full queries
across the whole dataset in under 10 seconds. It also natively consumes
multiple Kafka topics.
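
For context, ClickHouse's native Kafka support is a table engine paired with
a materialized view; a minimal sketch of the pattern (broker, topics, and
column names here are hypothetical):

    -- Kafka-engine table: ClickHouse consumes the listed topics directly.
    CREATE TABLE metrics_queue (ts DateTime, name String, value Float64)
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'broker:9092',
             kafka_topic_list = 'metrics_a,metrics_b',  -- multiple topics
             kafka_group_name = 'clickhouse_ingest',
             kafka_format = 'JSONEachRow';
    
    -- Durable storage table...
    CREATE TABLE metrics (ts DateTime, name String, value Float64)
    ENGINE = MergeTree() ORDER BY (name, ts);
    
    -- ...and a materialized view that streams consumed rows into it.
    CREATE MATERIALIZED VIEW metrics_mv TO metrics AS
    SELECT ts, name, value FROM metrics_queue;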

~~~
haggy
> I've loaded 100 billion rows

Have you done any load tests that would more closely mirror a production
environment, such as performing queries while ClickHouse is handling a heavy
insert load?

~~~
lykr0n
I'm working on benchmarking tools for internal testing, but both Yandex and
Cloudflare use ClickHouse for realtime querying. I'm still in the development
phase with my product, but I'll make sure to post information & results here
when we launch.

[https://blog.cloudflare.com/http-analytics-for-6m-requests-p...](https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/)

But I've spent a long time looking at the various solutions out there, and
while ClickHouse is not perfect, I think it's the best multi-purpose database
available for large volumes of data. TimescaleDB is another one, but until
they get sharding it's dead on arrival.

~~~
haggy
Very cool, I'll check this out!

~~~
lykr0n
It's a quirky piece of software and has limitations that need to be
considered when standing up a production cluster - such as the fact that you
currently cannot reshard it. If you have a 3-node cluster, adding another
node is messy and requires downtime.

~~~
bsg75
Still messy, but the clickhouse-copier utility helps a bit:
[https://github.com/yandex/ClickHouse/issues/2579](https://github.com/yandex/ClickHouse/issues/2579)

~~~
lykr0n
I have open issues about it on GitHub. It does not work correctly at this
time. If you dig through the issues, there are statements by the devs saying
the tool has been neglected.

------
sciurus
This is not cheap for the "DevOps" use case.

Imagine you have 1000 servers each submitting data to 100 timeseries every
minute. That's 100,000 writes a minute (unless they support batch writes
across series). At $0.50 per million writes, that's $72 a day, or $26k a year.

Now imagine you want to alert on that data. Say you have 100 monitors that
each evaluate 1GB of data once a minute. At $10 per TB of data scanned, that's
$1,440 a day or $525k a year!
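
A quick sanity check of that arithmetic (fleet size, series count, and scan
volume are the hypothetical figures above):

    # 1000 servers x 100 series, one write per series per minute,
    # plus 100 monitors each scanning 1 GB of data every minute.
    writes_per_day = 1000 * 100 * 60 * 24          # 144,000,000 writes
    write_cost = writes_per_day / 1e6 * 0.50       # $0.50 per million writes
    tb_scanned_per_day = 100 * 1 * 60 * 24 / 1000  # 144 TB
    query_cost = tb_scanned_per_day * 10           # $10 per TB scanned
    print(f"writes:  ${write_cost:,.0f}/day, ${write_cost * 365:,.0f}/yr")
    print(f"queries: ${query_cost:,.0f}/day, ${query_cost * 365:,.0f}/yr")
    # writes:  $72/day, $26,280/yr
    # queries: $1,440/day, $525,600/yr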

~~~
jaxxstorm
Well, that depends on what you consider cheap. Hiring someone to manage a
time series system like Graphite or Prometheus is going to cost you a whole
lot more than $26k a year.

~~~
sciurus
You're ignoring half of my example scenario. $26k writing the data, $525k for
querying it just for alerting, plus whatever it costs to store and to query
ad-hoc. That's over half a million dollars. Even if you hire someone for
$250k, you can self-host your time series system for cheaper than that.

Self-hosting isn't the only option though. For example, that hypothetical 1000
server scenario would cost $180k a year at list pricing on Datadog or
SignalFX.

------
willlll
I'm actually impressed at how incredibly expensive they made this: $0.50 per
million 1KB writes, which is _20x_ what Aurora charges, since Aurora allows
8KB writes. And Aurora is already expensive if you actually read/write to it.
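
Working that multiple out, assuming Aurora's standard $0.20 per million I/O
requests (the Aurora price is my figure, not the parent's):

    # Rough price per GB ingested, treating 1 GB as 1M KB for simplicity.
    # Aurora's $0.20 per million 8KB I/Os is an assumption, not from the thread.
    timestream_per_gb = 0.50 / (1_000_000 * 1) * 1_000_000  # 1KB writes -> $0.50/GB
    aurora_per_gb     = 0.20 / (1_000_000 * 8) * 1_000_000  # 8KB writes -> $0.025/GB
    print(timestream_per_gb / aurora_per_gb)                # 20.0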

~~~
ravedave5
I get the feeling this is for important data (banking, etc.), and that it's
200x cheaper than whatever else is available for that.

~~~
tonyedgecombe
That's not what it says in the release:

"With Timestream, you can easily store and analyze log data for DevOps, sensor
data for IoT applications, and industrial telemetry data for equipment
maintenance."

------
Tehnix
Quite excited for this! We are currently experimenting with DynamoDB,
managing our own rollups of incoming data (previously on RDS, which is not a
good choice for this kind of data).

---

I've seen a lot of people complain about pricing, so I thought I'd share a
little why we are excited about this:

We have approximately 280 devices out monitoring production lines, sending
aggregated data every 5 seconds via MQTT to AWS IoT. We see around ~2 million
messages published a day (equipment is often turned off when not producing).
The packets are very small and highly compressible, each below 1KB, but let's
just call it 1KB.

We then funnel this data into Lambda, which processes it, puts it into
DynamoDB, and handles rollups. The cost of that whole pipeline is
approximately $20 a day (IoT, DynamoDB, Lambda and X-Ray), with
Lambda+DynamoDB making up $17 of that.
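
A minimal sketch of what that Lambda stage might look like (table name, key
scheme, and event shape here are hypothetical, not our actual pipeline):

    # Hypothetical AWS IoT -> Lambda -> DynamoDB rollup stage; an IoT rule
    # action invokes this with the MQTT payload as the event.
    import boto3
    from decimal import Decimal
    
    table = boto3.resource("dynamodb").Table("device_metrics")  # made-up table
    
    def handler(event, context):
        ts = int(event["timestamp"])      # epoch seconds from the device
        bucket = ts - ts % 300            # roll up into 5-minute buckets
        table.update_item(
            Key={"device_id": event["device_id"], "bucket": bucket},
            UpdateExpression="ADD value_sum :v, sample_count :one",
            ExpressionAttributeValues={
                ":v": Decimal(str(event["value"])),  # DynamoDB needs Decimal
                ":one": 1,
            },
        )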

Finally, our users look at this data live on dashboards, usually at the last
8 hours for a specific device. Let's ballpark 10,000 queries each day, each
looking at that day's data for one device (2GB/day / 280 devices =
0.007142857 GB/device/day).

---

Now, running the same numbers against the AWS Timestream pricing[0] (daily cost):

- Writes: 2 million * $0.50/million = $1

- Memory store: 2 GB * $0.036 = $0.072

- SSD store: (2 GB * 7 days) * $0.01 (per GB-day) * 7 days = $0.98

- Magnetic store: (2 GB * 30 days) * $0.03 (per GB-month) = $1.8

- Query: 10,000 queries * 0.007142857 GB/device/day --> 71 GB = free until day
14, when it'll cost $10, so $20 a month.

Giving us: $1 + $0.072 + $0.98 + $1.8 + ($20/30) = $4.5/day.
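
The same arithmetic in one place (this just reproduces the line items as
written above; the actual pricing units on the page may differ):

    # Daily cost, reproducing the figures above as written.
    writes   = 2_000_000 / 1_000_000 * 0.50   # $1.00
    memory   = 2 * 0.036                      # $0.072
    ssd      = (2 * 7) * 0.01 * 7             # $0.98
    magnetic = (2 * 30) * 0.03                # $1.80
    query    = 20 / 30                        # ~$0.67 ($20/month amortized)
    print(round(writes + memory + ssd + magnetic + query, 2))  # 4.52 -> ~$4.5/day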

From these (very) quick calculations, we could lower our cost from ~$20/day
to ~$4.5/day. And that's not even taking into account that it removes the
need to create and maintain our own custom solution.

I am probably missing some details, but it does look bright!

[0]
[https://aws.amazon.com/timestream/pricing/](https://aws.amazon.com/timestream/pricing/)

------
sciurus
It's got to be a rough day for the team at
[https://www.influxdata.com/](https://www.influxdata.com/) . This could become
serious competition for their InfluxCloud hosted offering.

~~~
beginningguava
At what point are open source projects going to change their licensing to
prevent the major cloud providers from just stealing their products? I highly
doubt AWS built this from scratch. Amazon, Google, and Microsoft are going to
choke the life out of these projects.

Redis and MongoDB at least seem to have woken up:

[https://www.geekwire.com/2018/open-source-companies-consider...](https://www.geekwire.com/2018/open-source-companies-considering-closed-approach/)

~~~
Jedd
> At what point are open source projects going to change their licensing to
> prevent the major cloud providers from just stealing their products?

There's a lot to unravel in there.

I prefer 'free software' to 'open source' as it has a clearer meaning,
especially in this context. Even so, no one can _steal_ free / open source
software (or, as you say, product -- though that term strongly implies a
commercial offering).

By definition you can't really stop _anyone_ from using your free software,
unless perhaps you start naming companies explicitly, but I can't imagine it'd
be an easy process, or have a happy outcome, if you started targeting 'major
cloud providers' for special conditions.

Note that I am _not_ an apologist for AWS, Google, Microsoft, etc - but it
feels like the fundamental problem here is not massive corporations charging
other people to access free software.

~~~
SEJeff
Free software and open source aren't always the same thing though. Free
Software is software that through the license enforces a philosophy.

Open source is software that through the license enforces the source code to
remain open.

I'm not a fan of RMS or his attitudes on most things, but am a strong OSS fan
as it is the best way to develop and maintain software.

~~~
Jedd
> Free software and open source aren't always the same thing though.

Entirely agree, hence I drew the distinction. I eschew 'open source' as it's
highly ambiguous, and mostly misses the point.

> Free Software is software that through the license enforces a philosophy.

I would disagree. Free software ensures the user has certain freedoms.

> Open source is software that through the license enforces the source code to
> remain open.

This is a very circular definition -- open source is open.

> I'm not a fan of RMS or his attitudes on most things, but am a strong OSS
> fan as it is the best way to develop and maintain software.

As it happens, rms is no fan of OSS.

~~~
SEJeff
RMS has said publicly that proprietary software should be illegal, and that
free software guarantees user freedoms. This is a philosophy, and one that
assumes every user wants to become a developer. For things like emacs, and
much of GNU, this is true; for most of the rest of the world, it is not.

I eschew free software because I'm not about forcing my views on others
(which is literally the mission of GNU). I'm about developing software to be
the best it can be, and maybe meeting some friends along the way. Open
source, being the best software development model overall, allows me to meet
that goal. You could almost say some of RMS's more extreme quirks border on
authoritarian (see the abortion joke in the libc manual, whose removal he
forbade and whose re-addition he demanded when a dev simply overruled him).
He's not acting in a manner that encourages "freedom", but as a simple and
obvious dictator of all things GNU or claiming to be GNU. He's frequently
tried to shape the path of GNOME (of which I am a former foundation member,
and was on the sysadmin team) in areas he has no business weighing in on.
Then there are some grosser personality problems, like his sexism, his
tendency to actually eat his toenail gunk[1], or his refusal to ever be wrong
about anything, even when an entire community disagrees with him.

Dr Stallman has done a great deal of good for the world with Free Software,
however like the VAX and PDP-11, his time has passed. Open source won just
like Linux won over GNU/Hurd. It is ok that he won, by losing.

[1]
[https://www.youtube.com/watch?v=I25UeVXrEHQ](https://www.youtube.com/watch?v=I25UeVXrEHQ)

~~~
Jedd
rms's toenails aside, I'd suggest that _every_ licence is about enforcing a
set of views on others. There are plenty of licences to choose from, so it's
fairly easy to find one compatible with your own views.

In the context of GP's (beginningguava) comment about 'open source projects'
needing to change their licensing to prevent corporations making money by
SaaSing various tools, my point was twofold: first, by definition you can't
have free software with restrictions like that, and second, you'd merely be
fighting the symptoms (with little chance of success).

Aside - I'm curious what you mean by the 'open source software development
model', as I don't think that's actually a thing.

~~~
SEJeff
Fair enough. The open source software development model is, generally
speaking, no different in reality from the free software development model.

It goes back to ESR's The Cathedral and the Bazaar and what he deems "the
bazaar model" or "bazaar style", before he, Larry Augustin, and Bruce Perens
(if memory serves) went on to coin the phrase "open source". Even if you
don't necessarily agree with ESR (I see him in a similar vein as RMS, fwiw),
his thoughts on software development models have, generally speaking, been
proven true.

------
addisonj
Nice to see; this has felt like a gap in cloud offerings for a while... and
the open source options have difficulties.

From the little that was said, I'm going to guess this uses something like
Beringei ([https://code.fb.com/core-data/beringei-a-high-performance-ti...](https://code.fb.com/core-data/beringei-a-high-performance-time-series-storage-engine/))
under the hood.

------
plasma
The read cost of this database makes it practically unusable for
customer-facing dashboards. Disappointing.

------
axus
A place to put the timestamped data they download from yesterday's Amazon
Ground Station.

------
brootstrap
Been searching for years for a good alternative to Postgres for storing gobs
of weather timeseries data. We have been running a Postgres system in
production for many years and have hired multiple contractors to implement a
'real timeseries solution', all of which have been utter shit and complete
failures. The AWS services are expensive as all hell. With a little bit of
imagination we created a unique schema for timeseries data that doesn't
require terabytes of space, processes billions of data points a day, and has
blazing fast queries into said data.

~~~
jws
I moved a decade's worth of weather time series data from well-indexed SQLite
to InfluxDB and was nothing but pleased. It ended up taking an order of
magnitude less storage and was so much faster to query that I didn't even
bother to benchmark it. You can probably write a simple query against your
Postgres database to cough out a text file to load into InfluxDB and see how
it works for you. Then it comes down to how easy it is to replace your query
and insert functions… So it's all easy except for the hard part.
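
A sketch of that export step, assuming a hypothetical
observations(station, ts, temperature) table; InfluxDB's line protocol is
just measurement,tags fields timestamp, one point per line:

    # Hypothetical sketch: dump a Postgres table as InfluxDB line protocol.
    # Table, column, and database names are made up; adjust to your schema.
    import psycopg2
    
    conn = psycopg2.connect("dbname=weather")
    cur = conn.cursor()
    cur.execute("SELECT station, extract(epoch FROM ts), temperature "
                "FROM observations ORDER BY ts")
    
    with open("weather.lp", "w") as f:
        for station, epoch, temp in cur:
            # line protocol: measurement,tag=... field=... timestamp(ns)
            f.write(f"weather,station={station} temperature={temp} "
                    f"{int(epoch * 1e9)}\n")

From there, something like curl -XPOST 'http://localhost:8086/write?db=weather'
--data-binary @weather.lp loads it, and you can compare storage and query
times against Postgres.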

------
samstave
So how will this compare with Boundary, SignalFx, Stackdriver, and other
previous services of that type...

I'll have to go look into this, because AWS's historic pricing for any
large-volume stream quickly becomes untenable.

It's very easy to have gobs and gobs of time series points... AWS might make
using this way too expensive for anything at relative scale for a small
startup?

~~~
spullara
It appears that ingestion alone is more expensive than the commercial metric
services. Might not matter for small scale.

------
brian_herman__
I wonder how this compares to KDB

~~~
SEJeff
The core of Amazon's timeseries db likely doesn't fit into a CPU's L1 cache.
It does with KDB :)

~~~
cthalupa
>The core of Amazon's timeseries db likely doesn't fit into a CPU's L1 cache.
It does with KDB :)

Does it?

[https://kx.com/discover/in-memory-computing/](https://kx.com/discover/in-memory-computing/)
seems to indicate that it takes up ~600 Kb (I'm not sure if this is bits or
bytes, but even if it's bits, that turns into 75KB).

L1 cache is per core. Skylake Xeons have 64KB of L1 cache per core, 32KB for
data and 32KB for instructions. Even with that split, you're not fitting 75KB
(or 600KB) into the L1 cache.

Bits would be a weird measurement to use when talking about memory
utilization, so I'm pretty sure that it's 600 kilobytes. You're not anywhere
close to fitting that into the L1 cache. L2 cache, sure. But you get the
relatively spacious 1 megabyte for L2.

I'm also not sure that the "core" fitting into the CPU cache is particularly
meaningful for performance anyway. It doesn't say anything about how much
outside of the core gets used, how big the working set size is for your
workload, how much meaningful work is done on that working set, etc. If
you're frequently using parts of the software that don't fit in the cache, or
getting evicted from it by other code, or your working set of data doesn't
fit in the cache and you're constantly going to main memory for the data
you're working on, the "core" fitting in L1 cache (or L2 cache, which looks
more realistic) is going to be basically meaningless.

~~~
SEJeff
Gah, I meant L2 cache, but was being entirely too smug. I remember a
presentation a KX rep gave at our office a few jobs ago where this was one of
their bullet points. I found it amusing, and a bit odd.

~~~
cthalupa
Gotcha! Definitely an interesting marketing point. I probably would have had
the same reaction :)

------
probdist
Seems positioned to compete with Azure Data Explorer (MSFT's log/time series
optimized service). I know Azure runs a lot of services on top of Data
Explorer (previously called Kusto); I wonder if this is a true internal
battle-tested product or a me-too offering.

~~~
nkassis
I might be mistaken, but wouldn't Data Explorer be more similar to AWS
CloudWatch, which has been around for a long time?

~~~
probdist
Azure Data Explorer / Kusto is more of a database optimized for the log use
case than a service. There is a front-end tool, and a lot of the use cases
are around log management, but it is a database you can do general SQL or KQL
things with. Time series is one of its core use cases too, but it has less
marketing around it.

~~~
spullara
Sounds more like AWS CloudWatch Logs Insights, launched yesterday.

------
erikcw
Seems like this could be a great remote storage backend for Prometheus.

~~~
nabadu87
Oh yeah! Would that need a custom storage peovider in Prometheus?

~~~
erikcw
Yes [0]. I haven't had time to fully look into the details, but from looking
at the existing integrations, it would be fantastically convenient to have
something cloud native [1].

[0]
[https://prometheus.io/docs/prometheus/latest/storage/#remote...](https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations)

[1] [https://prometheus.io/docs/operating/integrations/#remote-en...](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage)
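
For reference, pointing Prometheus at any remote backend is a small stanza in
prometheus.yml; the adapter URL below is purely hypothetical, since a
Timestream adapter would have to exist (or be written) first:

    # prometheus.yml -- remote_write/remote_read hand samples to an adapter
    # that would translate them into Timestream API calls. Hypothetical URL.
    remote_write:
      - url: "http://timestream-adapter.example:9201/write"
    remote_read:
      - url: "http://timestream-adapter.example:9201/read"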

------
temuze
Honest question: when dealing with time-series data, do you actually need
every data point? Is that level of granularity really necessary?

IMO, it makes way more sense to decide the aggregations you want ahead of time
(e.g. "SELECT customer, sum(value) FROM purchases GROUP BY customer"). That
way, you deal with substantially less data and everything becomes a whole lot
simpler.

~~~
bruth
It really depends on the use case. Working in healthcare, vital signs can be
modeled as time series points, but at a lower frequency than, say, metrics
from servers. However, we want to store every point so a spike is not missed.
One could argue an unsustained spike is noise, but in the healthcare domain
there may be a correlation with some external event (the patient is surprised
and their heart rate spikes).

~~~
have_faith
The clever thing to do in this scenario would be to keep every spike but
delete all the data between similar data points after storing. So you get low
granularity for identical/nearly-identical data points and high granularity
when something interesting happens. I don't have any experience with
time series data, so maybe this is commonplace.
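
That idea is essentially a dead-band filter; a minimal sketch, with the
threshold as an arbitrary illustrative value:

    # Dead-band compression sketch: keep a point only if it differs from the
    # last *kept* point by more than a threshold, so flat stretches collapse
    # while spikes survive.
    def deadband(points, threshold=0.5):
        kept = []
        for ts, value in points:
            if not kept or abs(value - kept[-1][1]) > threshold:
                kept.append((ts, value))
        return kept
    
    series = [(0, 10.0), (1, 10.1), (2, 10.0), (3, 14.0), (4, 10.0)]
    print(deadband(series))  # [(0, 10.0), (3, 14.0), (4, 10.0)]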

~~~
lozenge
That would make it impossible to run any new analyses.

What some systems do instead is record in blocks, where every point after the
earliest is stored as a delta. Each block is then more compressible, as it
contains a lot of 0s.
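
A minimal sketch of that block/delta layout (real compressors refine this
further with delta-of-deltas and bit packing):

    # Block-based delta encoding: store each block as its first value plus
    # successive differences. Flat data yields runs of zeros, which
    # downstream compression shrinks dramatically.
    def encode_block(values):
        deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
        return values[0], deltas
    
    def decode_block(first, deltas):
        out = [first]
        for d in deltas:
            out.append(out[-1] + d)
        return out
    
    block = [100, 100, 100, 101, 101, 100]
    first, deltas = encode_block(block)
    print(deltas)                              # [0, 0, 1, 0, -1] -- mostly zeros
    assert decode_block(first, deltas) == block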

------
MagicPropmaker
We had applications where we were tracking guests in a venue through various
means. We tried a number of queuing systems to manage the flood of events,
but they'd all fall over. I'd love to run my old "venue simulator" through
this and see if it can stand up to actual guest load as people walk around,
ride, and purchase things.

------
coredog64
I'm wondering if this shares any technology with the CloudWatch metrics
backend. They've been making improvements there all year, and most of them
generally align with what's announced here.

CloudWatch metrics are also very expensive for what you get, so that's another
similarity to Timestream ;)

------
taf2
I couldn't tell from the page: is this SQL-based, similar to Timescale, or
more similar to InfluxDB?

------
mharroun
This is looking like a managed Druid... that would be very nice to have.

------
tjholowaychuk
Anyone know if this is what CloudWatch Insights uses? If so, it doesn't even
come close to competing with Elasticsearch performance (with a tiny cluster);
it seemed quite slow.

------
inoiox
There have been a lot of Amazon links this week.

~~~
philwelch
AWS re:Invent is happening. Kind of like how there are a lot of Google links
during I/O, or a lot of Apple links during WWDC.

------
jopsen
Where are the docs?

------
booleandilemma
There are at least 7 Amazon-related stories on the HN front page right now,
what’s going on?

~~~
dang
[https://hn.algolia.com/?query=by:dang%20big%20tech%20confere...](https://hn.algolia.com/?query=by:dang%20big%20tech%20conference&sort=byDate&dateRange=all&type=comment&storyText=false&prefix=false&page=0)

------
superkuh
A quick look at the Hacker News frontpage shows a bit of a problem:

        1.  Amazon Timestream (amazon.com)
        3.  Amazon Quantum Ledger Database (amazon.com)
        8.  Amazon FSx for Lustre (amazon.com)
        13. AWS DynamoDB On-Demand (amazon.com)
        14. Amazon's homegrown Graviton processor was very nearly an AMD Arm CPU (theregister.co.uk)
        21. Building an Alexa-Powered Electric Blanket (shkspr.mobi)
        30. Amazon FSx for Windows File Server (amazon.com)

~~~
dang
[https://hn.algolia.com/?query=by:dang%20big%20tech%20confere...](https://hn.algolia.com/?query=by:dang%20big%20tech%20conference&sort=byDate&dateRange=all&type=comment&storyText=false&prefix=false&page=0)

~~~
superkuh
Ah, sorry. I should've known you'd already be on it. Thanks.

------
nimbius
Jesus Christ, six Amazon articles in a day? AWS is undeniably the body of
Christ for HN, but am I missing something? FSx, blockchain, Timestream,
Graviton, Ground Station, and CloudWatch... all of these articles are
advertisements for mundane shit.

~~~
danso
It's AWS re:Invent day/week. I've never been, but I get the impression that
it's like Apple's keynote, or Google I/O, in which big product announcements
are made. On those days, you'll see multiple submissions about the respective
conferences too.

~~~
cheeze
This is _exactly_ what re:Invent is. Most teams dream of launching a new AWS
product at re:Invent (and not missing their date and having to launch at a
later time).

~~~
danso
It'd be an interesting blog post to look at how AWS re:Invent (or I/O, or F8,
etc.) product threads on HN compare to actual product impact. I remember
Rekognition getting decent discussion two years ago [0], but not along the
angles or at the magnitude of how Rekognition is usually discussed in recent
months. OTOH, other things I've been interested in as a data geek, I've
barely heard of since reading about them on HN -- e.g. Athena [1]

[0]
[https://news.ycombinator.com/item?id=13072956](https://news.ycombinator.com/item?id=13072956)

[1]
[https://news.ycombinator.com/item?id=13072245](https://news.ycombinator.com/item?id=13072245)

------
mLuby
I count 7 separate Amazon posts on the front of HN. Is this some conspiracy?
#NotAmused #ShouldBeBundled

~~~
twistedpair
The conspiracy is called re:Invent

