
Can You Solve This? 1B Records per Second Data Streaming Challenge - bamborde_zaiku
https://nanosai.substack.com/p/the-one-billion-records-per-second
======
varrakesh
This isn’t well-specified enough to be a real challenge. They say it can be a
server - but not too beefy of a server. What does that even mean? If I put
eight NVMe SSDs in a 128-core server, is that too beefy? What about 64 or 16
cores?

Can I know (or bound) the number of orders or products in advance to
preallocate? Can I design the dataset myself with certain assumptions (e.g.
sorted with respect to time)? Can I bound certain aspects of the dataset (e.g.
orders must not contain more than 255 products, orders always contain the
prices of everything, etc.)?

Latency apparently isn’t a factor - so if I’m processing 1B records, do we
care how quickly it gets done? If not, I’ll just stream the data off to a GPU
and get the results later?

~~~
jjenkov
Hi Varrakesh, the reason it is not "well specified" is that all of your
suggestions are interesting to try out and benchmark. Rather than saying "it
has to be exactly like this", we have left it more open ended by asking "what
would it take to get to 1 billion records per second?".

The answer might be different on different types of hardware, and with
different types of data sets, and with different types of data set sculpting.
Yes, it is okay to have one benchmark where there are no more than e.g. 255
products, or 255 customers, but then we should probably also benchmark with
e.g. up to 65,536 products and 65,536 customers, and up. Part of achieving
high performance data streaming is the ability to make your data small.
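
As a rough sketch of what "making your data small" could mean in practice (the record layout and field widths below are illustrative assumptions, not Nanosai's actual format):

```java
import java.nio.ByteBuffer;

// Sketch only: the (customer, product, price) layout is an assumption.
public class CompactRecord {
    public static void main(String[] args) {
        // Naive layout: three 4-byte ints = 12 bytes per record.
        ByteBuffer naive = ByteBuffer.allocate(12);
        naive.putInt(200).putInt(60_000).putInt(499);

        // With <= 255 customers and <= 65,536 products, the same record
        // fits in 1 + 2 + 2 = 5 bytes (price as unsigned-short cents).
        ByteBuffer compact = ByteBuffer.allocate(5);
        compact.put((byte) 200).putShort((short) 60_000).putShort((short) 499);

        System.out.println(naive.capacity() + " vs " + compact.capacity());
    }
}
```

Shrinking each record from 12 bytes to 5 cuts the bandwidth needed for a given record rate by more than half, which is the point being made about small ID spaces.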

It would also be okay to use a GPU - although we do not (yet) have plans to
do that ourselves. Still, it would be very interesting to see what kind of
results you could get with that design.

We just have one requirement: the data streaming engine must not be designed
exclusively for this challenge. It must be a reasonably functional,
general purpose data streaming engine.

By the way, we hope to reach the 1 BRS milestone on a single server with an
i7-6700 quad-core Skylake CPU and 2 NVMe SSDs mounted in RAID 1. 1 GB of memory to
run the benchmark app should be enough, but the server will probably have 64
GB by default.
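
For a sense of scale, here is a back-of-envelope cycle budget for that box (the ~3.4 GHz clock is an assumed base frequency for the i7-6700, not a figure from the post):

```java
// Back-of-envelope only; 3.4 GHz base clock for the i7-6700 is an assumption.
public class CycleBudget {
    public static void main(String[] args) {
        double clockHz = 3.4e9;       // assumed base clock per core
        int cores = 4;                // quad-core
        double recordsPerSec = 1e9;   // the 1 BRS target
        // Total CPU cycles available per record, summed across all cores.
        double cyclesPerRecord = clockHz * cores / recordsPerSec;
        System.out.println(cyclesPerRecord); // ~13.6 cycles per record
    }
}
```

About 13-14 cycles per record across all four cores is a very tight budget, which is consistent with the remark that the records would have to be small.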

------
anonymoushn
It doesn't matter much whether I can solve it since the first rule of the
contest explicitly bans me from writing code to solve it.

~~~
jjenkov
What we mean by that rule (general purpose data streaming engine) is just
that the product must be usable for use cases other than this challenge.
You can write your own data streaming engine, but it should be able to handle
a wide variety of use cases, not just the challenge use case.

For instance, simply writing a program that loads 1 billion bytes into memory,
iterates them and sums them, would not count as a "general purpose data
streaming engine". But you don't have to use Spark, Kafka or something like
that. You can write your own.
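
For illustration, the disqualified trivial program might look like this (scaled down here from 1 billion to 100 million bytes to keep the heap modest):

```java
import java.util.Arrays;

// The trivial load-iterate-sum program that would NOT count as a
// general purpose engine. Array scaled down from 1e9 to 1e8 bytes.
public class NaiveSum {
    public static void main(String[] args) {
        byte[] data = new byte[100_000_000];
        Arrays.fill(data, (byte) 1);
        long sum = 0;
        for (byte b : data) sum += b;
        System.out.println(sum); // 100000000
    }
}
```

It hits a huge records-per-second number precisely because it does nothing an actual streaming engine must do: no record framing, no backpressure, no I/O, no pluggable processing.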

------
dmitrygr
Ok, so I solve this very monetizable problem for you, and what do I get?
Bragging rights?

~~~
jjenkov
I think you misunderstood. WE will implement this data streaming engine and
give it away as open source! ... so WE get the bragging rights, you get the
data streaming engine.

------
CoolGuySteve
Yes I can, but why would I spend the time adapting my trading engine to
process someone else's data? Seems like I'd be providing free R&D.

Also, kind of funny that C/C++ are not listed in the application form.

~~~
ear7h
Their current engine is written in Java so you could just wait until their
engine gets to 100m and port it to C++ to win

~~~
jjenkov
Exactly ;-)

------
hammerton
Kind of seems like they want someone to do a tonne of work for free...

 _Contest closes_ ---- _Nanosai advertises that their platform now supports
1B requests per sec_

~~~
VStack
Hi - we think you misunderstood because you probably haven't read the post
properly. WE will implement this data streaming engine, and give it away as
open source! The reason we are asking people if they can solve this
challenge is to find out whether there are other easy-to-use data streaming
engines that could help solve it under our relaxed/open specifications.

~~~
hammerton
Regardless, it still seems like your parent company Kahler AI will leverage
and monetize the engine, all under the guise that it's open source.

I also love that you include NO INCENTIVE with this 'challenge'. Why would
anyone submit code to you guys for free?

All smells fishy to me.

~~~
bamborde_zaiku
Kahler is not the parent company of Nanosai - though both share Zaiku Group
as a parent, with Nanosai being an open source JV project with Jenkov Aps.

Also, yes, the streaming engine will be useful for some of Kahler's use cases.
However, it is open source and so anyone will be able to access the same
underlying engine.

By the way, we already have several people signed up for the challenge,
including people working for notable tech companies! :)

------
bamborde_zaiku
Hi everyone, many thanks for all your comments - though there may have been
some misunderstandings, which we address below:

WE (at Nanosai.com) will attempt to build a data streaming engine that can
process 1 billion records per second, and release it as open source. If YOU
want to try the challenge too, that's fine (e.g. someone already working on
data streaming engine tech). We did not mean for YOU to solve this problem for
US.

Our initial measurements and calculations show that it should be possible to
reach 1 BRS, although the records would have to be small. Still, a data
streaming engine will always have some per-record iteration overhead, so it
would take some tuning to get that overhead small enough to reach 1 BRS even
with 1-byte records.

------
Tempest1981
A quick check to see if there are any hardware bottlenecks.

Ignoring use of a GPU, how many IPS is a quad-core i7 (mentioned by jjenkov)?

And how many instructions might it take to do something useful to a record?
Say read 16 bytes and do some compares.

Or would the SSDs be the bottleneck? Also, RAID1, not RAID0, so effectively
just 1 SSD.

Or NVMe? Wow, a google search says vendors are pushing for 32 GBps. (My SATA
setup is obsolete.)

Maybe it's purely a software problem.
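
A hedged version of that bandwidth check (the 16-byte record size comes from the comment above; ~3.5 GB/s per drive is my assumed sequential-read figure for a Gen3 NVMe SSD, not a measured number):

```java
// Rough storage-bandwidth check; 3.5 GB/s per drive is an assumed figure.
public class BandwidthCheck {
    public static void main(String[] args) {
        double recordsPerSec = 1e9;
        double recordBytes = 16;     // "read 16 bytes" per record, as above
        double requiredGBps = recordsPerSec * recordBytes / 1e9;
        double driveGBps = 3.5;      // assumed Gen3 NVMe sequential read
        System.out.println(requiredGBps + " GB/s needed vs "
                + driveGBps + " GB/s per drive");
        // Even if RAID 1 serves reads from both mirrored drives (~7 GB/s),
        // that is well short of 16 GB/s, so the records must be much
        // smaller than 16 bytes or the data must already be in memory.
    }
}
```

One caveat on the RAID 1 point: mirroring halves write bandwidth to one drive's worth, but reads can often be spread across both copies, so reads are not necessarily limited to a single SSD.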

------
jandrewrogers
The description of this challenge is confusing and poorly specified. It is
unclear why it is supposed to be technically difficult. Giving it my best
interpretation, this is essentially a solved problem and people that know how
to solve it are unlikely to find it interesting.

Also, it would probably help if it was written in a programming language
appropriate for the purpose, such as C or C++.

------
selfup
Is there some kind of test harness for this? What does generic streaming
engine mean here?

