
BlazingDB Origins, oh and we just raised $2.9M from Nvidia and Samsung - meremagee
https://blog.blazingdb.com/blazingdb-origins-oh-and-we-just-raised-2-9m-from-nvidia-and-samsung-99cd581e66c7
======
polskibus
Great to see more competition in GPU+DW space! Some questions:

1\. How does BlazingDB compare to MapD?

2\. How do you skip ingest - are you using Apache Arrow like Dremio as an
efficient data representation format?

3\. Do you have any benchmarks, or maybe where would you see BlazingDB on this
list?
[http://tech.marksblogg.com/benchmarks.html](http://tech.marksblogg.com/benchmarks.html)
using the same hardware as MapD?

4\. Can you run your solution in a cluster? :)

~~~
felipe_aramburu
1) We are focused on the data lake. We love MapD; they are doing kick-ass
stuff. We are focused more on operating on information from disk and from
cloud storage services like S3 or HDFS implementations.

2) We read Parquet files into our own caching system. We often use Arrow
APIs, though we do not rely on Arrow for our data representation.

3) We have done client-side benchmarks but have not yet published a
standardized, replicable benchmark for people to validate. We have been a
VERY small team to date and will make that available as soon as we can. You
CAN launch BlazingDB instances from the AWS Marketplace to see how it
performs.

4) You sure can. A large part of BlazingDB's focus is on distribution. You
can add nodes during runtime.

~~~
polskibus
Could you elaborate on the GPU + data lake part? Memory transfer lag to and
from the GPU is significant in comparison to GPU computing power. A data lake
may mean multiple heterogeneous data sources, with or without schema. How
does a GPU help cope with that?

~~~
felipe_aramburu
So depending on your sources, e.g. whether they are compressed or not, the
GPU can greatly speed up I/O by compressing and decompressing directly on the
GPU. This is particularly meaningful when you can keep the data compressed as
you transfer it to the GPU and decompress it for use once it's there. It
doesn't solve most of the problems of working with heterogeneous sources.
That being said, GPUs definitely allowed us to speed up how we get data out
of Apache Parquet, and we anticipate we will be able to benefit from their
speed when adding more compressed file formats.
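The compressed-transfer win can be sketched with back-of-the-envelope arithmetic. The bandwidth and compression figures below are illustrative assumptions, not BlazingDB numbers:

```python
# Rough effect of moving compressed data over PCIe and decompressing on
# the GPU. All figures are illustrative assumptions.

PCIE_GBPS = 12.0          # assumed effective PCIe 3.0 x16 host-to-device bandwidth
COMPRESSION_RATIO = 3.0   # assumed ratio for, e.g., RLE/dictionary-coded data

def transfer_seconds(uncompressed_gb: float, compressed: bool) -> float:
    """Time to move a dataset over the bus to the GPU."""
    wire_gb = uncompressed_gb / COMPRESSION_RATIO if compressed else uncompressed_gb
    return wire_gb / PCIE_GBPS

raw = transfer_seconds(60.0, compressed=False)    # 5.00 s on the wire
packed = transfer_seconds(60.0, compressed=True)  # ~1.67 s on the wire
print(f"raw: {raw:.2f}s, compressed: {packed:.2f}s")
```

Under these assumed numbers the bus time drops by the compression ratio, which is why on-GPU decompression can pay for itself even though it adds compute.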

~~~
kwillets
What types of compression do you use? I did some work on streamvbyte, and it
seems relevant.

~~~
felipe_aramburu
Right now we support RLE, RLE Delta RLE, Delta RLE, Dictionary, and
Bitpacking. Many of these are combined together. streamvbyte does look
interesting; I am checking it out.
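For illustration, here are minimal pure-Python sketches of two of the listed schemes, run-length and delta encoding. Real engines bit-pack and combine these; the helper names are hypothetical, not BlazingDB code:

```python
# Minimal run-length and delta encoding sketches.

def rle_encode(values):
    """Encode a list as [(value, run_length), ...] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    """Expand [(value, run_length), ...] back into a flat list."""
    return [v for v, n in runs for _ in range(n)]

def delta_encode(values):
    """First value, then successive differences; small deltas then RLE well."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

col = [7, 7, 7, 7, 9, 9, 1]
assert rle_decode(rle_encode(col)) == col

ts = [100, 101, 102, 103, 104]
assert delta_encode(ts) == [100, 1, 1, 1, 1]  # the deltas collapse to one run
```

Chaining delta then RLE is why sorted or timestamp-like columns compress so well, which is the "combined together" point above.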

------
agibsonccc
This actually looks pretty useful. I'm adjacent to you (DL on Hadoop/Spark,
competing more with the just-launched Amazon SageMaker, among other things)
and was curious about your go-to-market strategy.

Like us, you're kind of straddling the line between the Hadoop ecosystem and
GPUs. Mesos does this as well.

It seems like most folks in the GPU space still don't get Hadoop or S3 as a
data source yet, though (despite it being the dominant source of data
warehousing, S3 or otherwise).

How are you guys coping with this? Have you found a different experience than
me?

It seems like a lot of the big data companies are at least adding some sort
of GPU management as a checkbox now, so the spaces will likely converge, with
or without the MPI crowd.

The other trend I'm seeing is you have folks like MapD and Kinetica trying to
run the whole stack themselves. I'm not sure how well that will go overall
(especially given how you can't really have _everything_ in 1 warehouse
typically). Could you comment on this? Are you going to try this as well after
integrating?

I hope to see more companies in this space actually exploring this
intersection.

Many make the mistake of boiling the ocean and doing nothing well, though.
Due to that it ends up being consulting. How will you guys overcome that?
Reworded: what is the initial focus vs. the long-term strategy?

~~~
felipe_aramburu
People in the GPU space and slow storage systems like Hadoop and S3) To
caveat, I say slow compared to something like storing your queryable data in
system memory. The answer is that your clients are not normally GPU people.
They are aware of GPUs and know they can be used for certain things, but real
clients normally have lots of data, and they don't fire up a million
instances to let it sit in a $10/GB storage solution. So we are able to help
people that are already leveraging these kinds of storage technologies out of
necessity.

Other players taking the whole stack) When you say other players are trying
to run the whole stack themselves, it depends on what you mean. From my
understanding, Kinetica is an in-memory solution, for example. How do you
query a 100TB dataset? Or a 1PB dataset? They are instead going after a
specific kind of problem, so I wouldn't consider any of us to be trying to
run the whole stack. That is usually left up to behemoths, and none of us are
that in the GPGPU database space yet.

As far as where our focus is: we want you to be able to use Blazing to
accelerate query workloads more and more on data where it lies, how it lies.
You can ingest into Blazing for maximum performance, but the big difference
between us and some of the other GPU players is that our focus is on working
with datasets that leverage multiple underlying stores (e.g. HDFS, S3, Azure
file share) and use multiple file formats. Right now the only two file
formats we operate on natively are Simpatico (our own file format) and
Parquet. Others have to be ingested, but we are working on adding more file
formats that we can interact with efficiently in place, without the need for
ingestion.

As far as trying to take over the whole stack: Blazing wants to be a lean but
powerful development shop. We want to focus on our core competencies and
leverage other brilliant technologies when possible to avoid having to manage
everything. We don't want to build new file systems. And we think that there
is a really big value proposition in helping people analyze data where it
already lies.

As per your last question.

This is not our first pajama party when it comes to startups, and something
we have seen is that early on it feels very similar to consulting while you
are establishing product-market fit. You adjust your product to the immediate
needs of the people you are interacting with. That road can be a troublesome
one if you end up spending all of your time configuring other tools and
products and not developing your own. Keeping a tight handle on the funnel
early on will help ensure that you have enough resources to work on your main
value proposition, and also increase the likelihood that the few engagements
you do take lead to long-term revenue and licenses (if that is how you make
your money).

In the long run my main strategy is to try to do as little as possible, and
to do it really well.

~~~
agibsonccc
Great response, thank you! I won't lie: we definitely have similar issues (I
think any infra company runs into this).

Re: Kinetica. They are doing visualization as well as things like machine
learning.

I like your guys' approach a lot better. I see a clear bridge from the mass
market commodity storage to something useful actually leveraging GPUs. That's
why I commented. Best of luck to you!

~~~
felipe_aramburu
Ok, I see your point there. It does feel like that is quite a bit to manage.
Our approach would be to perhaps show you how it could be used in these
different kinds of workloads, but we would never try to undertake them
ourselves. I'd probably sooner tear out my eyes than make data visualization
software.

------
georgewfraser
I would love to see performance numbers for a GPU data warehouse on a
benchmark like TPC-DS. My concern is that GPUs are very fast for a few
operations, but that these operations aren't the bottleneck for realistic
queries, so the overall performance will not be good. I did a TPC-DS-based
benchmark of Redshift vs. Snowflake vs. BigQuery a few weeks ago [1]; it
could probably be run by BlazingDB without too much trouble since the data is
already in GCS in Parquet format.

[1]
[https://news.ycombinator.com/item?id=15434272](https://news.ycombinator.com/item?id=15434272)

~~~
arnon
TPC-DS has some incredibly complex queries that might be too much for a
database that doesn't have very extensive syntax coverage.

~~~
felipe_aramburu
I just glanced at about 15 of them and did not notice anything we don't
support. There is a lot of nesting and whatnot, but nothing here really
jumps out at me. I have not looked at the full set yet.

------
tmostak
Congrats Rodrigo and team! We at MapD look forward to continued collaboration
with you guys on the GOAI project and elsewhere!

~~~
roaramburu
Thanks Todd! We always appreciate the support and also look forward to
collaborating on GOAI and anything else that might come up!

------
malloryerik
How fantastic to hear about a cool project that starts in Peru and is held
together by remote contributors. It's exciting.

------
pat_space
Congratulations on your raise! I can see many applications for this in
healthcare and government. Would love to hear more about how others are
implementing BlazingDB and the problems they are solving.

------
manigandham
Congrats on the raise. Nice to see all the progress on fast distributed SQL
systems. We use MemSQL and BigQuery and the detachment of compute from storage
is compelling when used correctly, especially with the cheap and fast cloud
storage available now.

------
dpflan
I don't know much about how this would work or make sense, but is there a
benefit to using a GPU based database with something computationally demanding
like machine training / learning that uses GPUs too for computation?

~~~
felipe_aramburu
So if there are already GPU nodes being used for a machine learning workload,
then having Blazing running on the same cluster would allow you to share
those resources. You could run BlazingDB on the same cluster as a machine
learning workload and have it either feed that machine learning workload or
accelerate some other SQL analytics workload. This would allow you to keep
your hardware at greater utilization.

------
bogomipz
What a neat and inspiring origin story. Congrats on the financing. I look
forward to hearing more news from you folks.

------
kwillets
How is overall throughput on GPU systems? It looks a lot faster than the last
time I looked at GPU bus interfaces; how well does this pull in data from
external sources?

Viva Peru!

~~~
felipe_aramburu
Depends on what you mean by throughput. RAM bandwidth on the GPU is very
high. PCIe is a different matter; that is still the big limiter on many
boxes. Ensuring you have enough CPU PCIe lanes to drive all of the graphics
cards you want to connect is important. If you are on OpenPOWER you can use
NVLink, which is much faster. As far as pulling from external sources, we do
so using the Parquet file format directly, or you can ingest CSV, JSON, and
XML into our own native file format.
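The lane-budget point can be sketched with rough arithmetic. The lane counts and the reserved-lane figure below are illustrative assumptions, not a hardware survey:

```python
# Rough check: how many GPUs can a host feed at a full x16 link each?
# Lane counts are illustrative per-socket assumptions.

def gpus_at_x16(cpu_lanes: int, lanes_reserved: int = 8) -> int:
    """GPUs that get a full x16 link after reserving lanes for NICs/NVMe."""
    return (cpu_lanes - lanes_reserved) // 16

print(gpus_at_x16(128))  # an EPYC-class lane count feeds many GPUs
print(gpus_at_x16(40))   # a smaller lane budget runs out fast
```

Once the lane budget runs out, boards fall back to x8 links and host-to-device bandwidth per GPU halves, which is the bottleneck being described.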

~~~
ben-schaaf
AMD's EPYC offering (or Ryzen) could help fix some of the PCIe
bottlenecking. If I remember correctly, EPYC can have 128 lanes, while Xeon
is limited to 32.

~~~
felipe_aramburu
Good point. I tried to convince someone very recently to go this route so I
could piggyback and test on it as well :).

------
ruw1090
What is the benefit of shipping data to the GPU for execution if the data is
on S3 or HDFS? Won't most of the cost of the query be I/O?

~~~
felipe_aramburu
Sure, the very first time you run a query. But with multi-tiered caching, the
data you frequently access sits closer and closer to the GPUs, which
alleviates that bottleneck over time, to an extent. Also, what is a fantastic
way of improving I/O? Compression and decompression. Our own file format
compresses and decompresses using the GPU, and we are working on doing the
same for some of the Parquet decompression steps. I/O is almost always your
main concern here, but you can improve upon it greatly by leveraging
processes that might not have been computationally feasible before.
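A toy sketch of the multi-tiered idea. The tier names and the one-step promotion policy are assumptions for illustration, not BlazingDB's actual policy:

```python
# Toy multi-tier cache: frequently accessed partitions migrate toward the
# GPU, so repeat queries stop paying the remote-read cost.

TIERS = ["gpu", "host_ram", "local_disk", "remote"]  # fast -> slow

class TieredCache:
    def __init__(self):
        self.location = {}  # partition id -> index into TIERS

    def read(self, part: str) -> str:
        """Return the tier this read is served from, then promote one tier."""
        idx = self.location.get(part, len(TIERS) - 1)  # cold data starts remote
        served = TIERS[idx]
        self.location[part] = max(idx - 1, 0)          # move toward the GPU
        return served

cache = TieredCache()
assert cache.read("p1") == "remote"      # first query pays the full I/O cost
assert cache.read("p1") == "local_disk"  # repeat queries hit warmer tiers
assert cache.read("p1") == "host_ram"
assert cache.read("p1") == "gpu"
```

Real policies promote by access frequency and evict under memory pressure, but the effect is the same: hot data stops traversing the slow path.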

------
DevKoala
Congrats guys. How do you compare in performance against a time series
database like Druid?

------
sidi
Congrats Rodrigo and team! Great to see the progress!!

------
hmm_really
You do know there is a BlazegraphDB also?

------
RA_Fisher
Congrats Rodrigo! Go BlazingDB!

------
focusandship
Congrats Rodrigo and the BlazingDB team! You guys have come so far since
TechCrunch. Great job! I really felt the chemistry, the culture, and the
talent in your team when I met you at TechCrunch Battlefield!

------
ris
Secret source. Sorry, you don't get my data. The world has moved on.

