
Terark (YC W17) is a profitable database compression company based in Beijing - rockeetterark
https://www.techinasia.com/real-pied-piper-silicon-valley
======
rockeetterark
Sorry, I know the article is kind of sensational but it also has some good
information and we're here to discuss the real substance in the thread.

Terark built a new storage engine for databases and data systems based on the
Succinct Nested Trie data structure. Our technology enables direct search on
highly compressed data without decompressing it. As a result, we get more than
200X faster performance and more than 15X storage savings (better than
Google's LevelDB or Facebook's RocksDB). We are a Y Combinator company (W17).

~~~
CaveTech
That's 200x performance in relation to what?

~~~
rockeetterark
A 200x improvement in random read performance, compared to RocksDB or
WiredTiger (MongoDB's storage engine). You can find benchmarks on our website:
[https://terark.com/en/index](https://terark.com/en/index) And here:
[https://github.com/Terark/terarkdb/wiki/Benchmark](https://github.com/Terark/terarkdb/wiki/Benchmark)

We also provide a free license of TerarkDB and you can download the exact
scripts we used and run your own benchmarks with the configuration you want.
We know the claim may sound outlandish, so we try to be as transparent as
possible.

~~~
desdiv
Can you please do some benchmarks against MySQL and PostgreSQL? The vast
majority of your prospective customers will be using these two instead of
RocksDB or MongoDB.

~~~
rockeetterark
Sure! Our benchmark against MySQL is here:
[https://github.com/Terark/mysql-on-terarkdb/wiki/YCSB-on-9.1...](https://github.com/Terark/mysql-on-terarkdb/wiki/YCSB-on-9.1G-Movie-Data)
We used YCSB on 9.1 GB of movie data. This benchmark compares MySQL with our
product "MySQL on Terark". "MySQL on Terark" is basically MySQL configured
with TerarkDB instead of InnoDB -- that way you can migrate your MySQL
applications to Terark with virtually no modification to your code.

We do not have any benchmark against PostgreSQL, though. We have no plans to
adapt our storage engine to PostgreSQL, so we have not compared against it,
but the gains are just as significant.

I hope that answers your questions, and feel free to reach us at
business@terark.com

------
continuations
So there's TerarkDB:
[https://github.com/Terark/terarkdb](https://github.com/Terark/terarkdb)

And there's TerichDB:
[https://github.com/Terark/terichdb](https://github.com/Terark/terichdb)

How are they related to each other?

Also TerichDB calls itself open source but then includes this: "TerichDB is
open source but our core data structures and algorithms(dfadb) are not yet."

If the core algorithms of TerichDB are not open source then is TerichDB even
usable? Are you going to open source the core algorithms?

All this is rather confusing.

~~~
rockeetterark
TerichDB is an experimental repo. We'll take it private to avoid confusion.
Thanks for pointing this out.

Regarding the license of our products: the core of TerarkDB is a plug-in for
RocksDB. It is loaded as a dynamic library for librocksdb.so and compliant
with RocksDB’s license. All the code related to MongoDB/MySQL is open source
(We use MongoRocks and MyRocks).

Making the core algorithms open source is a dilemma for us. At this stage, as
a young startup, keeping the core algorithms proprietary gives us leverage on
the valuation and insulates us from potential competitors. But this is
something we may reconsider in the future in order to facilitate a wider
adoption of our products.

It's a big debate, even for us internally. If anyone else here has been facing
the same dilemma, we'd love to hear your opinion, what you chose in the
end and how things turned out.

~~~
couchand
The old quote from Howard Aiken may be relevant here: "Don't worry about
people stealing an idea. If it's original, you will have to ram it down their
throats."

~~~
rockeetterark
In our situation, making it fully open source for wide adoption would oblige
us to become a consulting company with revenue based on support. This would
require a large headcount that we cannot support as a mostly bootstrapped
startup. So, for now, we decided to focus on the tech and deliver the best
tech possible to the (largest) clients who need it the most. Think of it as
Quality vs. Quantity. For instance, as reported in the article, our largest
client at the moment is Alibaba Cloud (the Chinese Amazon AWS). We're able to
cater to their custom needs and even send our CTO and some engineers to their
office to accompany them when needed. We solve a pain point for them on a huge
scale, and we're able to make a decent revenue that makes us profitable and
allows us to grow our business independently.

But we're open to the idea. Our storage engine is also compatible with
MongoDB and MySQL, so if we could partner with a large company providing
support for MongoDB/MySQL on Terark (think something like Percona, for
instance), and open sourcing all our code were a requirement, we would
consider it.

~~~
ableton
You guys could keep it proprietary then try to get bought out by one of the
big guys who would open source it.

------
polskibus
I really hope someone with vast knowledge of database internals will come here
and comment on Terark claims. The blog entry mentioned in the comments is a
better source of information than that article.

~~~
jandrewrogers
Some of the basic assertions, such as the relative inefficiency of block
compression in database engines, are true. I've seen material gains from using
context/content-aware compression and some commercial OLAP databases exploit
this extensively. They appear to be using many of the same kinds of
techniques.
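
To make the block-compression point concrete, here is a minimal Python sketch.
It is an illustration only, not how any of the engines discussed here actually
work: the record layout, the block size of 100, and the helper names
`read_from_block` and `read_record` are all made up for the example. It
contrasts a point read from block-compressed data with a point read from
records compressed individually against a shared zlib preset dictionary.

```python
import zlib

# Hypothetical records; the layout is invented for this illustration.
records = [
    f"movie:{i:05d}|title=Example Movie {i}|genre=drama|year={1950 + i % 70}".encode()
    for i in range(1000)
]

# Block compression: groups of BLOCK records are compressed together.
BLOCK = 100
blocks = [
    zlib.compress(b"\n".join(records[i:i + BLOCK]))
    for i in range(0, len(records), BLOCK)
]

def read_from_block(idx):
    """Return (record, bytes inflated). The whole block must be inflated."""
    raw = zlib.decompress(blocks[idx // BLOCK])
    return raw.split(b"\n")[idx % BLOCK], len(raw)

# Per-record compression: each record is compressed on its own, but against
# a shared preset dictionary so cross-record redundancy is still exploited.
shared_dict = b"".join(records[:20])  # assumption: a small sample is representative

def compress_record(rec):
    c = zlib.compressobj(zdict=shared_dict)
    return c.compress(rec) + c.flush()

packed = [compress_record(r) for r in records]

def read_record(idx):
    """Return (record, bytes inflated). Only the requested record is inflated."""
    d = zlib.decompressobj(zdict=shared_dict)
    raw = d.decompress(packed[idx]) + d.flush()
    return raw, len(raw)
```

With this setup, `read_from_block(537)` inflates a full block of 100 records
to return one of them, while `read_record(537)` inflates only the record that
was asked for. How the total compressed sizes compare depends heavily on the
data and on how the dictionary is chosen.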

However, the assertions made around caching behavior, such as wasting memory
due to double caching, are not generally true. While you will see this in
simple/naive database engines, a sophisticated high-performance database
implementation won't be designed this way.

~~~
rockeetterark
Thanks. These assertions are there to give a basic background and overview of
database performance in general. The real game changer with Terark is our
novel compression algorithm. It's more space efficient, that's one thing, but
above all else we can search directly in the compressed data without
decompressing it. That's the real breakthrough.

We do that by using a data structure called Succinct Nested Trie, and we've
introduced concepts such as CO-Index (Compressed Ordered Index) and PA-Zip
(Point Accessible Zip).
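
For readers unfamiliar with succinct tries, here is a minimal Python sketch of
the general idea, using the textbook LOUDS encoding. It is not Terark's Succinct
Nested Trie, CO-Index, or PA-Zip, whose details are proprietary; the class
name `LoudsTrie` and its layout are assumptions made for this example, and
there is no nesting or label compression. The tree shape is stored as a bit
sequence (2n + 1 bits for n nodes) instead of pointers, and lookups walk that
bit sequence directly with rank/select operations.

```python
from collections import deque

class LoudsTrie:
    """Toy LOUDS trie: tree shape as bits, navigated with rank/select."""

    def __init__(self, keys):
        # Step 1: build an ordinary pointer-based trie (dict of dicts).
        END = ""  # marker entry; never collides with a one-character label
        root = {}
        for key in keys:
            node = root
            for ch in key:
                node = node.setdefault(ch, {})
            node[END] = True

        # Step 2: walk it breadth-first, emitting one 1 bit per child and a
        # 0 bit to close each node. "10" is the conventional super-root.
        self.bits = [1, 0]
        self.labels = []    # edge label of node k (k >= 2) is labels[k - 2]
        self.terminal = []  # terminal[k - 1] is True if node k ends a key
        queue = deque([root])
        while queue:
            node = queue.popleft()
            self.terminal.append(END in node)
            for ch in sorted(k for k in node if k != END):
                self.bits.append(1)
                self.labels.append(ch)
                queue.append(node[ch])
            self.bits.append(0)

        # Plain lookup tables for rank1 and select0. A real succinct
        # structure would use compact o(n)-bit indexes here instead.
        self.rank1 = []
        ones = 0
        for b in self.bits:
            ones += b
            self.rank1.append(ones)
        self.zero_pos = [i for i, b in enumerate(self.bits) if b == 0]

    def contains(self, key):
        node = 1  # nodes are numbered from 1 in breadth-first order
        for ch in key:
            # The child list of node k starts right after the k-th 0 bit.
            pos = self.zero_pos[node - 1] + 1
            while self.bits[pos] == 1:
                child = self.rank1[pos]  # node number of this child
                if self.labels[child - 2] == ch:
                    node = child
                    break
                pos += 1
            else:
                return False  # no child carries this label
        return self.terminal[node - 1]
```

For example, after `trie = LoudsTrie(["tea", "ten", "to", "inn"])`, the call
`trie.contains("ten")` walks three edges using only the bit sequence and the
label array, and `trie.contains("te")` reports that the prefix is not a
stored key.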

We started as a compression company and then turned to storage engines and
databases as an application domain for our algorithms, hence the analogy with
Pied Piper :)

~~~
polskibus
How does your technique compare to a typical column store?

------
scott00
Is the compression geared towards any particular type of data? Seems like
compression that would work well on, say, blog posts, may not work as well on,
say, tick-level data from a stock exchange.

~~~
rockeetterark
It works, but it's not the best scenario for us. Financial workloads are
mostly sequential reads (give me all the data for the last 50 trading days)
and write heavy (tick-level writes). We blow away the rest of the pack when
you have a huge haystack and you're looking for a needle in it: that's where
you get the 200x boost in random read performance using Terark.

------
based2
[https://terark.com/en/blog/detail/14](https://terark.com/en/blog/detail/14)

~~~
rockeetterark
Yep, this is an article we published on our blog to give a bit of background
information on database performance. It's very general.

For benchmarks on TerarkDB's performance in particular, you can have a look
here:
[https://github.com/Terark/terarkdb/wiki/Benchmark](https://github.com/Terark/terarkdb/wiki/Benchmark)

------
est
Since Terark is a Chinese startup, its founders answered some more questions
here:

[https://www.zhihu.com/question/46787984](https://www.zhihu.com/question/46787984)

~~~
rockeetterark
Thanks for mentioning it! Yes, for Chinese speakers, our CTO Lei Peng has
answered a lot of questions on Zhihu (the Chinese Quora).

------
nerdwaller
Maybe I'm alone in my privacy concerns, but I'm a bit scared to trust
something from behind the "great firewall".

~~~
richardw
I don't think you send them your data. They send you their technology.

~~~
rockeetterark
That is correct. We simply built a storage engine technology. You run it
yourself on your own servers. Everything is open source except for the core
compression algorithms that are loaded as a proprietary dynamic library. Being
a young startup, we made the choice not to open source this part for now (see
my other comment higher in this thread).

