BlazingDB Origins, oh and we just raised $2.9M from Nvidia and Samsung (blazingdb.com)
136 points by meremagee on Nov 30, 2017 | 37 comments



Great to see more competition in GPU+DW space! Some questions:

1. How does BlazingDB compare to MapD?

2. How do you skip ingest - are you using Apache Arrow like Dremio as an efficient data representation format?

3. Do you have any benchmarks? Or maybe, where would you see BlazingDB on this list (http://tech.marksblogg.com/benchmarks.html), using the same hardware as MapD?

4. Can you run your solution in a cluster? :)


1) We are focused on the data lake. We love MapD, and they are doing kick-ass stuff; we are focused more on operating on information from disk and from cloud storage services like S3 or HDFS implementations.

2) We read Parquet files into our own caching system. We often use Arrow APIs, though we do not rely on Arrow for our data representation.

3) We have done client-side benchmarks but have not yet performed a standardized, replicable benchmark for people to validate. We have been a VERY small team to date and are going to make that available as soon as we can. You CAN launch BlazingDB instances from the AWS Marketplace to see how it performs.

4) You sure can. A large part of BlazingDB's focus is on distribution. You can add nodes at runtime.


Could you elaborate on the GPU + data lake part? Memory transfer lag to and from the GPU is significant compared to GPU computing power, and a data lake may mean multiple heterogeneous data sources, with or without schemas. How does a GPU help with that?


So depending on your sources, e.g. whether they are compressed or not, the GPU can greatly speed up I/O by compressing and decompressing directly on the GPU. This is particularly meaningful when you can keep data compressed as you transfer it to the GPU and decompress it for use once it's there. It doesn't solve most of the problems of working with heterogeneous sources. That being said, GPUs definitely allowed us to speed up how we get data out of Apache Parquet, and we anticipate we will be able to benefit from their speed when adding more compressed file formats.
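
To make the compressed-transfer idea concrete, here is a minimal CUDA sketch of the pattern being described: only the small compressed representation crosses the PCIe bus, and a kernel expands it on the device. It uses a toy run-length encoding with a precomputed offsets array; every name in it is illustrative, not BlazingDB's actual code.

    // Toy RLE decode on the device. One thread per run; offsets[] is the
    // exclusive prefix sum of run lengths, computed on the host here.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    __global__ void rle_decode(const int32_t* values, const int32_t* lengths,
                               const int32_t* offsets, int32_t* out, int num_runs) {
        int run = blockIdx.x * blockDim.x + threadIdx.x;
        if (run >= num_runs) return;
        for (int32_t j = 0; j < lengths[run]; ++j)
            out[offsets[run] + j] = values[run];
    }

    int main() {
        // 3 runs: 5x7, 3x9, 4x2  ->  7 7 7 7 7 9 9 9 2 2 2 2
        std::vector<int32_t> values  = {7, 9, 2};
        std::vector<int32_t> lengths = {5, 3, 4};
        std::vector<int32_t> offsets = {0, 5, 8};
        const int runs = 3, n = 12;

        int32_t *d_vals, *d_lens, *d_offs, *d_out;
        cudaMalloc(&d_vals, runs * sizeof(int32_t));
        cudaMalloc(&d_lens, runs * sizeof(int32_t));
        cudaMalloc(&d_offs, runs * sizeof(int32_t));
        cudaMalloc(&d_out,  n    * sizeof(int32_t));

        // Only the compact compressed form crosses the bus.
        cudaMemcpy(d_vals, values.data(),  runs * sizeof(int32_t), cudaMemcpyHostToDevice);
        cudaMemcpy(d_lens, lengths.data(), runs * sizeof(int32_t), cudaMemcpyHostToDevice);
        cudaMemcpy(d_offs, offsets.data(), runs * sizeof(int32_t), cudaMemcpyHostToDevice);

        rle_decode<<<1, 256>>>(d_vals, d_lens, d_offs, d_out, runs);

        std::vector<int32_t> out(n);
        cudaMemcpy(out.data(), d_out, n * sizeof(int32_t), cudaMemcpyDeviceToHost);
        for (int32_t v : out) printf("%d ", v);
        printf("\n");
        return 0;
    }

The same shape generalizes to the dictionary and bitpacking schemes mentioned below: ship the compact form, expand it in device memory where bandwidth is plentiful.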


What types of compression do you use? I did some work on streamvbyte, and it seems relevant.


Right now we support RLE, RLE Delta RLE, Delta RLE, Dictionary, and Bitpacking. Many of these are combined together. streamvbyte does look interesting; I am checking it out.
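
For a concrete flavor of why these schemes suit GPUs: delta decoding is just an inclusive prefix sum over the deltas, which parallelizes well and comes off the shelf in Thrust. A hedged sketch, illustrative only and not BlazingDB's kernels:

    // Delta decode = inclusive prefix sum: [100, +2, +2, +1, +5]
    // reconstructs the column [100, 102, 104, 105, 110] on the device.
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main() {
        int h[] = {100, 2, 2, 1, 5};                 // base value, then deltas
        thrust::device_vector<int> deltas(h, h + 5);
        thrust::device_vector<int> decoded(5);

        thrust::inclusive_scan(deltas.begin(), deltas.end(), decoded.begin());

        for (int i = 0; i < 5; ++i)
            printf("%d ", (int)decoded[i]);
        printf("\n");
        return 0;
    }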


This actually looks pretty useful. I'm adjacent to you (DL on Hadoop/Spark, competing more with the just-launched Amazon SageMaker, among other things) and was curious about your go-to-market strategy.

Like us you're kind of strattling the line between the hadoop ecosystem and GPUs. Mesos does this as well.

It seems like most folks in the GPU space still don't get Hadoop or S3 as a data source yet, though (despite those being the dominant sources of data for warehousing, S3 or otherwise).

How are you guys coping with this? Has your experience been different from mine?

It seems like a lot of the big data companies are at least adding some sort of GPU management as a checkbox now, so the spaces will likely converge, with or without the MPI crowd.

The other trend I'm seeing is folks like MapD and Kinetica trying to run the whole stack themselves. I'm not sure how well that will go overall (especially given that you typically can't have everything in one warehouse). Could you comment on this? Are you going to try this as well after integrating?

I hope to see more companies in this space actually exploring this intersection.

Many make the mistake of boiling the ocean and doing nothing well, though. Because of that, it ends up being consulting. How will you guys overcome that? Put another way: what is the initial focus vs. the long-term strategy?


People in the GPU space and slow storage systems like Hadoop and S3) To caveat: I say slow compared to something like storing your queryable data in system memory. The answer is that your clients are not normally GPU people. They are aware of GPUs and know they can be used for certain things, but real clients normally have lots of data, and they don't fire up a million instances to let it sit in a $10/GB storage solution. So we are able to help people who are already leveraging these kinds of storage technologies out of necessity.

Other players taking the whole stack) When you say other players are trying to run the whole stack themselves, it depends on what you mean. From my understanding, Kinetica is an in-memory solution, for example. How do you query a 100TB dataset? Or a 1PB dataset? They are instead going after a specific kind of problem, so I wouldn't consider any of us to be trying to run the whole stack. That is usually left to behemoths, and none of us in the GPGPU database space are that yet.

As far as where our focus is: we want you to be able to use Blazing to accelerate query workloads, more and more, on data where it lies, how it lies. You can ingest into Blazing for maximum performance, but the big difference between us and some of the other GPU players is that our focus is on working on datasets that span multiple underlying stores (e.g. HDFS, S3, Azure File Share) and use multiple file formats. Right now the only two file formats we operate on natively are Simpatico (our own file format) and Parquet. Others have to be ingested, but we are working on adding more file formats that we can interact with efficiently in place, without the need for ingestion.

As far as trying to take over the whole stack: Blazing wants to be a lean but powerful development shop. We want to focus on our core competencies and leverage other brilliant technologies when possible, to avoid having to manage everything. We don't want to build new file systems. And we think there is a really big value proposition in helping people analyze data where it already lies.

As for your last question:

This is not our first pajama party when it comes to startups, and something we have seen is that early on it feels very similar to consulting while you are establishing product-market fit. You adjust your product to the immediate needs of the people you are interacting with. That road can be a troublesome one if you end up spending all of your time configuring other tools and products and not developing your own. Keeping a tight handle on the funnel early on will help ensure that you have enough resources to work on your main value proposition, and will also increase the likelihood that the few engagements you do take lead to long-term revenue and licenses (if that is how you make your money).

In the long run, my main strategy is to try to do as little as possible, and to do it really well.


Great response, thank you! I won't lie: we definitely have similar issues (I think any infra company runs into this).

Re: Kinetica. They are doing visualization as well as things like machine learning.

I like your guys' approach a lot better. I see a clear bridge from the mass market commodity storage to something useful actually leveraging GPUs. That's why I commented. Best of luck to you!


OK, I see your point there. It does feel like that is quite a bit to manage. Our approach would be to show how BlazingDB could be used in these different kinds of workloads, but we would never try to undertake them ourselves. I'd probably sooner tear out my eyes than make data visualization software.


Well said


straddling


Indeed typo thanks!


I would love to see performance numbers for a GPU data warehouse on a benchmark like TPC-DS. My concern is that GPUs are very fast for a few operations, but that these operations aren't the bottleneck for realistic queries, so the overall performance will not be good. I did a TPC-DS-based benchmark of Redshift vs. Snowflake vs. BigQuery a few weeks ago [1]; it could probably be run against BlazingDB without too much trouble since the data is already in GCS in Parquet format.

[1] https://news.ycombinator.com/item?id=15434272


TPC-DS has some incredibly complex queries that might be too much for a database that doesn't have very extensive syntax coverage.


I just glanced at about 15 of them and did not notice anything we don't support. There seems to be a lot of nesting and whatnot, but nothing there really jumps out at me. I have not looked at the full set yet.


Congrats Rodrigo and team! We at MapD look forward to continued collaboration with you guys on the GOAI project and elsewhere!


Thanks Todd! We always appreciate the support, and we also look forward to collaborating on GOAI and anything else that might come up!


Thanks bud. Looks like the GOAI people are making some moves! We are also happy to be sharing in that experience with you all.


How fantastic to hear about a cool project that starts in Peru and is held together by remote contributors. It's exciting.


Congratulations on your raise! I can see many applications for this in healthcare and government. Would love to hear more about how others are implementing BlazingDB and the problems they are solving.


Congrats on the raise. Nice to see all the progress on fast distributed SQL systems. We use MemSQL and BigQuery and the detachment of compute from storage is compelling when used correctly, especially with the cheap and fast cloud storage available now.


I don't know much about how this would work or whether it makes sense, but is there a benefit to pairing a GPU-based database with something computationally demanding, like machine learning training, that also uses GPUs for computation?


So if there are already GPU nodes being used for a machine learning workload, then having Blazing running on the same cluster would allow you to share those resources. You could run BlazingDB on the same cluster as a machine learning workload and have it either feed that machine learning workload or accelerate some other SQL analytics workload. This would allow you to keep your hardware at a greater level of utilization.
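
A toy CUDA sketch of that handoff, with hypothetical stand-ins for both sides: the "database" kernel writes a result column into device memory and the "ML" kernel consumes the same device pointer, with no copy back through host RAM in between.

    // Both stages operate on d_col in place; nothing crosses PCIe between them.
    #include <cuda_runtime.h>
    #include <cstdio>

    // Stand-in for "the database materialized this column on the GPU".
    __global__ void fake_sql_result(float* col, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) col[i] = 0.5f * i;
    }

    // Stand-in for an ML consumer: reduce the column into a single value.
    __global__ void ml_consumer(const float* col, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, col[i]);
    }

    int main() {
        const int n = 1024;
        float *d_col, *d_sum, h_sum = 0.0f;
        cudaMalloc(&d_col, n * sizeof(float));
        cudaMalloc(&d_sum, sizeof(float));
        cudaMemset(d_sum, 0, sizeof(float));

        fake_sql_result<<<(n + 255) / 256, 256>>>(d_col, n);
        ml_consumer<<<(n + 255) / 256, 256>>>(d_col, d_sum, n);

        cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", h_sum);
        return 0;
    }

This kind of zero-copy handoff of GPU-resident data between tools is roughly what the GOAI collaboration mentioned elsewhere in the thread is about.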


What a neat and inspiring origin story. Congrats on the financing. I look forward to hearing more news from you folks.


How is overall throughput on GPU systems? It looks a lot faster than the last time I looked at GPU bus interfaces; how well does this pull in data from external sources?

Viva Peru!


Depends on what you mean by throughput. RAM bandwidth on the GPU is very high. PCIe is a different matter; that is still the big limiter on many boxes. Ensuring you have enough CPU PCIe lanes to drive all of the graphics cards you want to connect is important. If you are on OpenPOWER you can use NVLink, which is much faster. As far as pulling from external sources, we do so using the Parquet file format directly, or you can ingest CSV, JSON, and XML into our own native file format.
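
For anyone curious where their own box sits, a quick way to see that PCIe ceiling is to time a large host-to-device copy with CUDA events. A rough sketch, not a tuned benchmark; numbers vary with lane count, PCIe generation, and pinned vs. pageable memory.

    // Measure host-to-device copy bandwidth for 1 GiB of pinned memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 1ull << 30;   // 1 GiB
        void *h, *d;
        cudaMallocHost(&h, bytes);         // pinned, so DMA runs at full speed
        cudaMalloc(&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("H2D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }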


AMD's EPYC offering (or Ryzen) could help fix some of the PCIe bottlenecking. If I remember correctly, EPYC can have 128 lanes, while Xeon is limited to 32.


Good point. I tried to convince someone very recently to go this route so I could piggyback and test on it as well :).


What is the benefit of shipping data to the GPU for execution if the data is on S3 or HDFS? Won't most of the cost of the query be I/O?


Sure, the very first time you run a query. But with multi-tiered caching, the data you frequently access sits closer and closer to the GPUs, so that alleviates the bottleneck over time, to an extent. Also, what is a fantastic way of improving I/O? Compression and decompression. Our own file format compresses and decompresses using the GPU, and we are working on doing the same for some of the Parquet decompression steps. I/O is almost always your main concern here, but you can improve upon it greatly by leveraging processes that might not have been computationally feasible before.
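
A hedged sketch of the multi-tiered idea, with entirely hypothetical names: look for a column in device memory first, then host RAM, then fall back to remote storage, promoting it on the way in. A real system adds eviction, pinning, and async prefetch; this just shows the lookup order that makes repeated queries cheap.

    // Tier 0: GPU memory. Tier 1: host RAM. Tier 2: "remote" (S3/HDFS stand-in).
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    struct TieredCache {
        std::map<std::string, float*>             gpu;   // device-resident columns
        std::map<std::string, std::vector<float>> host;  // host-resident columns

        // Stand-in for an S3/HDFS read (and, in a real system, decompression).
        std::vector<float> fetch_remote(const std::string& key, size_t n) {
            printf("cold read of %s from remote storage\n", key.c_str());
            return std::vector<float>(n, 1.0f);
        }

        float* get(const std::string& key, size_t n) {
            auto it = gpu.find(key);
            if (it != gpu.end()) return it->second;      // hot: already on the GPU
            if (host.find(key) == host.end())
                host[key] = fetch_remote(key, n);        // cold: pull from remote
            float* d = nullptr;                          // warm: promote host -> GPU
            cudaMalloc(&d, n * sizeof(float));
            cudaMemcpy(d, host[key].data(), n * sizeof(float), cudaMemcpyHostToDevice);
            gpu[key] = d;
            return d;
        }
    };

    int main() {
        TieredCache cache;
        cache.get("orders.price", 1024);  // first query pays the remote read
        cache.get("orders.price", 1024);  // repeat query hits device memory
        return 0;
    }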


Congrats guys. How do you compare in performance against a time series database like Druid?


Congrats Rodrigo and team! Great to see the progress!!


You do know there is a BlazegraphDB also?


Congrats Rodrigo! Go BlazingDB!


Congrats Rodrigo and the BlazingDB team! You guys have come so far since TechCrunch. Great job! I really felt the chemistry, the culture, and the talent in your team when I met you guys at TechCrunch Battlefield!


Secret source. Sorry, you don't get my data. The world has moved on.



