
Rockset – Serverless search and analytics engine - headalgorithm
https://www.rockset.com
======
bastawhiz
I have a medium-data database storing events (timestamps + some metadata,
~200gb). It currently lives on an Aurora postgres cluster. If this were rock-
solid, my application would be the perfect fit, I think. I have two major
concerns:

1\. I'd originally started on a managed solution
([https://getconnect.io/](https://getconnect.io/)). They promised scalability
and reliability, but we were forced to move away when the queries would take
upwards of thirty seconds, inserts would start failing, and we'd receive 503s.
I estimate that my database (which, at the time, was almost two orders of
magnitude smaller) was among their largest. Why should I trust Rockset with my
data?

2\. I can't find anything about performance, benchmarks, or any other
information about how the service will behave in production. InfluxDB is an
example of something that does a great job of this: their docs outline what
Influx is good at, what it's not good at, what will make your queries slow,
etc. Instead, Rockset has a how-to guide for building a FB messenger chat bot
with CSVs. From their docs:

> Rockset has a cloud-scale architecture. It can scale both compute and
> storage independent of one another. One one hand, you can have a small data
> set served by zillions of compute in parallel to make queries faster. On the
> other hand, you can have petabyte size data sets served by a small number of
> compute nodes. And, of course, you can have the entire spectrum in-between
> these two scenarios.

What does that even _mean_?

Sorry, Rockset, I'm going to need more than zillions of compute to convince me
to move my business to you.

~~~
foxish
Hi bastawhiz,

Performance numbers are coming; please look out for them in a future blog
post. Concrete details about the architecture are included in
[https://rockset.com/Rockset_Concepts_Design_Architecture.pdf](https://rockset.com/Rockset_Concepts_Design_Architecture.pdf).
If you think we could add value beyond what you get with Aurora PG, I'd
welcome the chance for an evaluation on a 100% free trial (no credit card
required), where you could test the performance for yourself with your own
data. Totally respect the skepticism. Please reach me directly at anirudh at
rockset.com if you'd like to chat further.

------
manigandham
I see that the pricing model has changed from flat-rate to per-GB. Much more
interesting to work with now, although still on the high end (but nice that
there's no further charge for queries). The previous comparison to a Postgres
instance using JSONB still stands.

How do query times scale with data size? Also is there any full-text search
(other than regex)? That would make it more compelling.

~~~
agnokapathetic
and 100KB/s is the max ingest rate?

~~~
foxish
That's just the default limit for streaming input. We do work with users and
can increase it if the use case demands it. For bulk ingest from sources like
S3, that limit does not apply, and it typically runs at many MB/s.

(I work on the product team at Rockset.)
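
To put those two rates in perspective, here is a quick back-of-envelope
sketch (the ~200 GB dataset size comes from upthread; the 10 MB/s bulk
figure is just illustrative):

```python
# Back-of-envelope: how long ingesting a dataset takes at a sustained rate.
def ingest_days(dataset_gb: float, rate_kb_per_s: float) -> float:
    """Days needed to push dataset_gb at rate_kb_per_s."""
    total_kb = dataset_gb * 1024 * 1024
    return total_kb / rate_kb_per_s / 86_400  # 86,400 seconds per day

# ~200 GB at the default 100 KB/s streaming limit:
print(f"{ingest_days(200, 100):.0f} days")        # roughly 24 days

# The same dataset via bulk ingest at an assumed 10 MB/s:
print(f"{ingest_days(200, 10 * 1024):.1f} days")  # a fraction of a day
```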

------
yingw787
I think this is really cool! It's really nice to use a standards-compliant
persistent file format; I think a lot of companies have their own persistence
implementations that leave the data visible only at the SQL or REST layer.

I'm wondering:

\- Would it be possible to add certain guarantees about performance
characteristics for different file formats? Parquet and column-oriented stores
operate a good deal differently from CSV and row-oriented stores. Would you
have to scan the binary?

\- Can you combine different persist types together? How do the performance
characteristics change?

\- What do you do about unclean data and disjoint data sets? Does somebody
else have to clean them? What happens if somebody "corrupts" data (say,
replaces a CSV delimiter type in-place while Rockset is running)?

\- Is there an extensions API available (e.g. SQL through Google Spreadsheets
and CSV on AWS S3, both through Zapier)? That could deliver a big value-add,
since once your data can be colocated, more efficient means and alternatives
can be applied.

This is neat!

~~~
foxish
Hi yingw787, I work on the product team at Rockset. Thanks for your thoughts!
I'll try and answer your questions below.

\- The different file formats get indexed and turned into a Rockset-specific
format, which ensures that irrespective of the file type you get excellent
performance for your SQL queries. This also means you can JOIN data from
different sources (containing files in different formats) using SQL,
irrespective of the source formats.
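
To make the cross-format JOIN concrete, here is a toy sketch of the idea
using sqlite3 as a stand-in for Rockset's internal format (the sample data
and column names are invented):

```python
import csv
import io
import json
import sqlite3

# Two "sources" in different formats (invented sample data).
csv_events = "user_id,event\n1,login\n2,purchase\n"
json_users = '[{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]'

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INT, event TEXT)")
db.execute("CREATE TABLE users (id INT, name TEXT)")

# Once both formats are normalized into the engine, the JOIN is plain SQL.
for row in csv.DictReader(io.StringIO(csv_events)):
    db.execute("INSERT INTO events VALUES (?, ?)", (row["user_id"], row["event"]))
for u in json.loads(json_users):
    db.execute("INSERT INTO users VALUES (?, ?)", (u["id"], u["name"]))

rows = db.execute(
    "SELECT u.name, e.event FROM events e"
    " JOIN users u ON u.id = e.user_id ORDER BY u.name"
).fetchall()
print(rows)  # [('ada', 'login'), ('grace', 'purchase')]
```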

\- Depending on the complexity of the SQL queries, the latency can range from
low tens of milliseconds to a few seconds. Since we index ALL the fields in
several ways, if we're able to use our indices to accelerate the query (which
is almost always the case), it will likely be in the 10-200 milliseconds range
for a wide range of analytical queries. Look out for some numbers in the
future.

\- Data cleaning is something we facilitate through the use of our
delete/update records API, which lets you mutate the index and remove or
update the records that you consider to contain bad data. Since Rockset
supports schemaless ingest ([https://rockset.com/blog/from-schemaless-ingest-
to-smart-sch...](https://rockset.com/blog/from-schemaless-ingest-to-smart-
schema-enabling-sql-on-raw-data/)), error documents don't really break
anything, and you can work around them by writing a query that ignores them.
We are interested in providing visibility into the data so that you can
quickly detect issues with the data and fix them.
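
To sketch the "query that ignores them" workaround in miniature: with
schemaless ingest, a corrupted row still lands as a document; it just lacks
the fields a well-formed record carries, so filtering on the fields you
actually need skips it (the field names here are invented):

```python
# Three ingested documents; the middle one is a garbled row.
docs = [
    {"ts": 1700000000, "event": "login"},
    {"_raw": "1700000001;;login;;???"},  # malformed source line, kept as-is
    {"ts": 1700000002, "event": "purchase"},
]

# Equivalent of: SELECT * WHERE ts IS NOT NULL AND event IS NOT NULL
clean = [d for d in docs if "ts" in d and "event" in d]
print(len(clean))  # 2
```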

\- Rockset has a REST API, clients in different programming languages
([https://docs.rockset.com/rest-api/](https://docs.rockset.com/rest-api/)) and
some visualization tools like Tableau
([https://docs.rockset.com/tableau/](https://docs.rockset.com/tableau/)). Can
you elaborate on what you mean by colocating data and the extension API?

~~~
yingw787
My impression of most databases is that locating the data physically close
together (i.e. an internal network connection ties the database nodes
together) provides assumptions for performance optimization (e.g. based on
internal testing, we think the tail latency at this percentile is X
milliseconds between requests on database nodes, or the network will only
fail X% of requests, so we can optimize for that factor in source). If you
have disparate data located elsewhere, it may be more difficult to bake in
such assumptions (e.g. requests across the public Internet may fail more
often) and more difficult to achieve performance, so the value-add from a
product like Rockset would be tying together disparate data sources. But I
just read your comment that the data is transformed into a Rockset-specific
format, so it might matter less in that case, because you do have a
persistent filesystem.
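
That percentile-based reasoning can be sketched directly: a nearest-rank
percentile over measured node-to-node latencies gives the bound an optimizer
would bake in (the sample numbers are invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Invented round-trip latencies (ms) between two database nodes.
latencies_ms = [2, 3, 3, 4, 4, 5, 5, 6, 9, 40]

print(percentile(latencies_ms, 50))  # 4
print(percentile(latencies_ms, 99))  # 40, the tail you have to plan for
```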

For the extensions API, I was imagining something like postgresql-contrib:
[https://www.postgresql.org/docs/current/contrib.html](https://www.postgresql.org/docs/current/contrib.html)

In Rockset's case, I thought that if the data came from multiple locations,
extension requests might take that as a top-level assumption; hence the idea
of a Rockset extension for something like Zapier, where multiple Internet
services are tied together into automation pipelines (or, in Rockset's case,
read/write query pipelines).

I just thought of this now, but the client interface for a database like
PostgreSQL is useful enough that other databases like CockroachDB implement
it too: [https://www.cockroachlabs.com/blog/why-
postgres/](https://www.cockroachlabs.com/blog/why-postgres/)

Hope this helps :)

------
wearhere
Calling a hosted database "serverless" is the most brazen branding I have seen
in a long time. For extra hilarity, their pricing page says "pricing is
inclusive of cloud hardware".

~~~
imveeve
hi, this is Venkat from Rockset.

Good feedback. We thought about the different ways to frame the value prop,
and "serverless" is what resonated the most with us because: 1/ you can load
data, process queries, and build apps/dashboards without ever thinking about
servers -- so, no provisioning or capacity planning required. 2/ you only pay
for the amount of data actually loaded and indexed -- so, no idle servers
costing you $$$s.

If you have a better suggestion that feels more accurate, please share it and
we will definitely consider it.

Touche on the "cloud hardware" bit. We will fix that soon.

~~~
wearhere
Hey Venkat! Thanks for replying in good humor.

Now that you explain your reasoning a bit, and upon re-reading
[https://en.wikipedia.org/wiki/Serverless_computing](https://en.wikipedia.org/wiki/Serverless_computing),
I think using "serverless" in this context makes sense. I see "serverless"
used so much more often to describe compute runtimes like AWS Lambda than
databases that, I confess, I thought you might be trying to ride that wave's
popularity, and/or that you might be using "serverless" _just_ because the
servers were managed by you rather than the users, whereas you actually
allocate capacity at a more granular level than the server.

I do still recommend you take out the "cloud hardware" bit ;D

Thanks for the explanation, and best of luck! Cool model.

~~~
imveeve
thanks.

'cloud hardware' was definitely LOL worthy. ... brb after i go fix it :)

------
Aeolun
Honestly, this pricing scheme confuses the hell out of me.

Questions that initially pop up:

\- What do I do if I need more QPS?

\- What do I do if I need to ingest more data?

Why are these two values coupled to the cost of the data stored somehow?

~~~
imveeve
[this is Venkat from Rockset]

Our goal is to make the default experience simple so that you get enough
compute to build most real-world apps and dashboards. We still give you
flexibility in case you want to purchase additional compute for ingest or
queries. We will make this clear in our pricing page -- thanks for the
feedback.

> \- What do I do if I need more QPS?

Barring extreme workloads (say, 1 million QPS on 1 GB of data), for which we
are not a good fit anyway, we auto-scale enough compute to handle the QPS
needs of most real-world applications. As I mentioned earlier, if you want to
break out of the standard compute allocation, we do offer the ability to
purchase additional compute, but from our experience, this is seldom required.

> \- What do I do if I need to ingest more data?

Yes, you can purchase additional ingest bandwidth if you need a higher
steady-state ingest capacity. Please note that the bandwidth limit only
applies to real-time streaming ingest; for bulk ingest (for example, the
first time a collection is created in Rockset sourced from Amazon S3) we try
to build the indexes at much higher speeds, and we will keep working on
making that really, really fast without any additional fees.

~~~
Aeolun
Thank you for the explanation.

It would probably be good if you made all that clear on the pricing page
though (along with the extra cost I would actually incur).

------
netvarun
IIRC this is by one of the creators of RocksDB.

------
thinkingkong
Cool! Pricing seems a tad high but I really like these types of products. When
you take into account the act of running a database, finding the data
transform or ingestion tools, and the reporting layer (I like metabase) plus
the maintenance it starts to level out.

~~~
imveeve
hi, this is Venkat from Rockset.

Yeah, we are fans of Metabase too and we will soon add support for connecting
Metabase with Rockset. We do have Redash [1] and Superset [2], which are also
pretty good and open source.

[1] [https://docs.rockset.com/redash/](https://docs.rockset.com/redash/) [2]
[https://docs.rockset.com/apache-superset/](https://docs.rockset.com/apache-
superset/)

------
scribu
I'm curious how scalable this is. If it can't go beyond what you can load into
memory, it doesn't seem that useful.

If you already know SQL, you could just as well use Pandas (the Python
library) to load data from various sources and query it.

Also: AWS Athena
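
For data that fits on one machine, that Pandas route is only a few lines; a
sketch with invented in-memory "files" standing in for CSV and JSON sources:

```python
import io

import pandas as pd

# Two small sources in different formats (invented sample data).
events = pd.read_csv(io.StringIO("user_id,event\n1,login\n2,purchase\n1,logout\n"))
users = pd.read_json(io.StringIO(
    '[{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "grace"}]'
))

# The SQL-ish part: JOIN, GROUP BY, COUNT.
per_user = (
    events.merge(users, on="user_id")
          .groupby("name")["event"]
          .count()
)
print(per_user.to_dict())  # {'ada': 2, 'grace': 1}
```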

~~~
foxish
Hi scribu, I'm Anirudh from the product team at Rockset.

The data is indexed onto SSDs in the cloud. The sweet spot is 10s of terabytes
of data that you want to build a live application on top of.

From an architectural standpoint, we can scale even further.
[https://rockset.com/Rockset_Concepts_Design_Architecture.pdf](https://rockset.com/Rockset_Concepts_Design_Architecture.pdf)

~~~
manigandham
1TB costs $6000/month at your pricing, so 10s of TB is rather expensive. It's
great if you can get those rates, but I'm having trouble seeing how these
numbers work out on large datasets.

~~~
Aeolun
If you are live querying 1TB of data in memory, $6000/month does not seem that
crazy at all.

~~~
manigandham
It's in memory now? The prior comment says SSDs.

------
Scarbutt
Was curious about using PDFs as a source, but the docs have no info on it:
[https://docs.rockset.com/](https://docs.rockset.com/)

~~~
dhruba_b
[https://rockset.com/blog/how-to-run-sql-on-pdf-
files/](https://rockset.com/blog/how-to-run-sql-on-pdf-files/)

