

How Netflix loaded 1 billion rows into SimpleDB - petewarden
http://practicalcloudcomputing.com/post/313922691/5-steps-simpledb-performance

======
pierrefar
I was running a website that was doing millions of writes a day to SDB for
real-time analytics. The biggest PITA with SDB is that it throttles writes in
a horribly stringent way. You can barely tickle a domain and it will throttle
you. I never got consistently good batch puts - they all eventually failed.
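
To make the write path concrete, here's a minimal sketch of the kind of batch
put I mean, assuming boto's SimpleDB bindings; the domain name, batch layout
and backoff parameters are purely illustrative, and in my experience even a
retry loop like this would eventually give up under heavy load.

    # Minimal sketch, assuming boto's SimpleDB bindings; the domain name and
    # backoff parameters are illustrative only.
    import time

    import boto
    from boto.exception import SDBResponseError

    conn = boto.connect_sdb()                     # credentials from the environment
    domain = conn.get_domain('analytics_events')  # hypothetical domain name

    def batch_put_with_backoff(items, max_retries=6):
        """Retry a batch put when SimpleDB throttles the domain."""
        for attempt in range(max_retries):
            try:
                # items is {item_name: {attribute: value, ...}, ...}, 25 items max per call
                domain.batch_put_attributes(items)
                return True
            except SDBResponseError as e:
                if e.error_code != 'ServiceUnavailable':
                    raise
                time.sleep(2 ** attempt)          # exponential backoff before retrying
        return False                              # gave up, as they so often did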

After I talked to the SDB folks, they recommended that I shard my data,
because each domain maps to a different cluster of machines. I'm glad it's the
first recommendation in the OP's list, because it seriously is the best thing
you can do.
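
As a rough illustration of the sharding idea (again assuming boto; the shard
count and naming scheme are invented for the example), writes get spread over
several domains by hashing the item name:

    # Sketch of hash-sharding writes across SimpleDB domains (assumed boto API).
    import hashlib

    import boto

    NUM_SHARDS = 16                               # assumption: 16 shard domains
    conn = boto.connect_sdb()
    shards = [conn.create_domain('events_%02d' % i) for i in range(NUM_SHARDS)]

    def shard_for(item_name):
        """Hash the item name so writes spread evenly over the shard domains."""
        digest = hashlib.md5(item_name.encode('utf-8')).hexdigest()
        return shards[int(digest, 16) % NUM_SHARDS]

    def put(item_name, attributes):
        shard_for(item_name).put_attributes(item_name, attributes)

The trade-off is that any query over the whole dataset has to fan out across
all the shard domains and be merged client-side.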

Another trick I experimented with: using multiple EC2 instances to write to
the same domain. I managed to convince myself that the throttling is per EC2
instance per domain, not global per domain. However, cost ruled this solution
out.

Reading was much more consistent but was also throttled, especially at high
write loads. The solution was two-fold:

1. Cache everything "indefinitely" and break the cache when you know its
contents will change. For the real-time stuff, you can't cache. I used
memcached, and looked at other solutions like Tokyo Tyrant, memcachedb and
redis. Use what you feel comfortable using, really.

2. Read as little as possible. Doing a "select * from domain where..." is
horrible compared to doing "select attribute1, attribute2 from domain
where...". Once you read, cache; see the sketch below.
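
A minimal sketch of both points together, assuming boto for SimpleDB and the
python-memcached client; the domain, attribute and cache-key names are all
made up:

    # Cache-first reads that select only the attributes they need (assumed
    # boto + python-memcached APIs; names are hypothetical).
    import json

    import boto
    import memcache

    conn = boto.connect_sdb()
    mc = memcache.Client(['127.0.0.1:11211'])

    def get_user_counters(user_id):
        key = 'counters:%s' % user_id
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)

        # Read only the attributes we actually need, never "select *".
        query = ("select clicks, views from `analytics` "
                 "where itemName() = '%s'" % user_id)
        rows = list(conn.select('analytics', query))
        result = dict(rows[0]) if rows else {}

        mc.set(key, json.dumps(result))           # cache "indefinitely"
        return result

    def update_user_counters(user_id, attributes):
        conn.get_domain('analytics').put_attributes(user_id, attributes)
        mc.delete('counters:%s' % user_id)        # break the cache when the data changes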

------
jordyhoyt
The linked blog post gives more details on how he did it and the throughput he
got.

http://practicalcloudcomputing.com/post/284222088/forklift-1b-records

Very interesting that Oracle became the bottleneck.

~~~
nettdata
As an admittedly biased Oracle architect/DBA, I'm inclined to believe that the
"Oracle" bottleneck was due to the design/implementation of the Oracle DB,
rather than the technology itself.

If, for instance, you were pulling this data out of an Oracle instance
designed to be a cheap, historical archive, on minimal hardware and cheap
disk, then yeah, it can be very limiting in performance.

If you're pulling it out of a production database, and have the priority
levels set such that the data suck won't affect production users/usage, then
again, that bottleneck is by design.

I'm taking that statement as a comment on the system as built, rather than a
slam on Oracle's technology and capabilities.

~~~
r39132
Yep, my post was not meant to slam Oracle in any way. It was only meant to
point out that our Oracle instance was being taxed enough that the data pull
could not be as aggressive as I would have liked.

~~~
jordyhoyt
Right, and I would have found it impressive with any major DB as the source.
Getting to the point where the source DB is taxed as much as possible seems to
be the goal of all your optimization. Nice that you could get there.

------
pvg
The post is notable for the absence of any hint as to why this was done, or
with what kind of data. Was the driver cost? Performance? A pleasant cloudy
feeling? It seems a given that you can, if you try, get a billion rows into
SimpleDB. You could probably get a billion rows mechanical-turked onto clay
tablets. The interesting thing to learn would be why doing so is advantageous.

~~~
r39132
Good point. I did mention it in some other posts. We are moving to SimpleDB
for greater scalability and availability.

~~~
pvg
It would be really interesting to hear more about it, within the scope of what
you can reasonably disclose.

------
lsb
You can buy a machine with 128GB of memory and 2 TB of disk space for under
$10k at Dell. A billion rows could be an in-memory dataset now.
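
Back-of-envelope, with an assumed ~100 bytes per row (my guess, not a figure
from the article):

    # Rough sizing check; bytes_per_row is an assumption.
    rows = 10 ** 9
    bytes_per_row = 100
    print('%.0f GB' % (rows * bytes_per_row / 1e9))  # ~100 GB, well under 128 GB of RAM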

~~~
r39132
But you need to plan for data growth.

~~~
nwatson
If your goal is write-once archival and subsequent querying of large volumes
of data, e.g. log/event data, you can't do much better than some of the
column-oriented databases. Two examples are SenSage and Vertica.

See the customer profiles for SenSage at
<http://sensage.com/customers/customer-profiles.php>. Various customers
collect and store up to 200GB of raw text log files (400 million records at a
generous 500 bytes/line estimate), broken into records ... PER DAY ... with 2
to 5 or even more years of data online and available for analysis ... fast
analysis ... one example is an 8-billion-record query in 2 minutes.

------
elq
Bah, that's nothing! My team put over 3 billion rows into Amazon's cloud in a
matter of hours without having to deal with the vagaries of SDB :)

To the best of my knowledge, Oracle was the bottleneck because the Oracle
instance is an actual high-volume production database and the IR process was
restrained to minimize the impact on production users.

~~~
r39132
Correct.

~~~
temporalMan
SenSage has customers whose databases reach petabyte scale. Some SenSage
customers load multiple terabytes in a day.

~~~
nwatson
What is the customer's industry sector, and what kind of data do they store
there? Has SenSage had much success pushing their solution to other similar
companies?

And also, how is SenSage doing overall? I don't hear much in the press these
days and assume either (a) it's not going so well, or, more hopefully, (b)
they and their customers find each other in more private circles and there's
no need to grab all the headlines.

I care because I worked on the product for almost 4 years starting from when
we were in a living room/den/kitchen in Berkeley. Good times.

