
Apache Arrow Flight: A Framework for Fast Data Transport - stablemap
http://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
======
fulafel
It's interesting how much faster your laptop SSD is compared to these high-end,
performance-oriented systems. Keep in mind that the localhost/TLS-disabled
number is an upper bound. (Not singling out Arrow by any means; most others are
slower.)

I wonder which came first: the petering out of wired network hardware
performance improvements, or the software bottlenecks that become obvious if we
try to use today's faster networks. 100 Mb Ethernet came in 1995, GigE in 1999,
10 GigE in 2002, and each gained adoption within a few years. On that track we
should have had 100 GigE in 2006 and seen it in servers in 2008 and
workstations in 2010, and switches/routers should have reached terabit Ethernet
by 2010. Today's servers(X) seem to be at about 25 GbE, and with multicore
that's just 1-2 gigabits per core.

(X) according to
[https://www.supermicro.com/products/system/1U/](https://www.supermicro.com/products/system/1U/)

~~~
frumiousirc
Network speed is now outpacing single-core performance.

The same 25 Gbps claimed by the article can be achieved with a single-threaded
ZeroMQ socket, but that thread will be CPU-bound. To break 25 Gbps, multiple
I/O threads need to be engaged.

There are already greater than 100 Gbps network links while single-core speed
has stagnated for many years. Multi-threaded or multi-streamed (like in the
article) solutions are needed to saturate modern network links.
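
One way to get a feel for the per-stream ceiling is to time loopback throughput with one thread versus several. A minimal stdlib Python sketch (plain sockets rather than ZeroMQ or gRPC, and the numbers depend heavily on the OS and, in Python's case, the GIL; the names here are illustrative, not from the article):

```python
import socket
import threading
import time

CHUNK = 64 * 1024               # send/recv buffer size
PER_STREAM = 16 * 1024 * 1024   # bytes pushed through each stream

def pump(nbytes):
    """Push nbytes through one loopback socket pair (one 'stream')."""
    a, b = socket.socketpair()

    def drain():
        remaining = nbytes
        while remaining > 0:
            remaining -= len(b.recv(CHUNK))

    reader = threading.Thread(target=drain)
    reader.start()
    buf = bytes(CHUNK)
    sent = 0
    while sent < nbytes:
        sent += a.send(buf)     # send() may accept a partial chunk
    reader.join()
    a.close(); b.close()
    return sent

def run(nstreams):
    """Run nstreams pumps in parallel and report aggregate throughput."""
    start = time.perf_counter()
    threads = [threading.Thread(target=pump, args=(PER_STREAM,))
               for _ in range(nstreams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    gbit = nstreams * PER_STREAM * 8 / elapsed / 1e9
    print(f"{nstreams} stream(s): {gbit:.1f} Gbit/s")

run(1)
run(4)
```

Since `send`/`recv` release the GIL during the kernel copy, multiple streams can actually engage multiple cores even from Python.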

~~~
fulafel
Yep, you can't do anything fast in software without parallelism in the modern
world, as single-thread performance improvements have stalled even harder than
networking speed improvements. I guess my observation might be a symptom of
parts of the software world staying behind at the single-core performance
level, with only specialized applications bothering to do the major software
surgery that 10x or 100x parallelism requires.

------
jumpingmice
More people should try high-performance services with non-traditional protobuf
implementations. The fact that every language has a generated parser in no way
precludes you from parsing them yourself. Hand-rolled serialization of your
outbound messages can also be really fast, and the C++ gRPC stack will happily
accept preformatted messages and put them on the wire. Finally, the existence
of gRPC itself should not make you feel constrained from implementing the
entire protocol yourself. It’s just HTTP/2 with conventional headers.
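
The wire format is simple enough that hand parsing is quite practical. A sketch in Python with no generated code, decoding the canonical "field 1 = 150" example from the protobuf encoding docs (only varint and length-delimited wire types handled here):

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:        # high bit clear: last byte of the varint
            return result, pos
        shift += 7

def parse_message(data):
    """Parse the top-level fields of a protobuf message by hand."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                       # varint
            fields[field_number], pos = read_varint(data, pos)
        elif wire_type == 2:                     # length-delimited (bytes, string, submessage)
            length, pos = read_varint(data, pos)
            fields[field_number] = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unhandled wire type {wire_type}")
    return fields

# b'\x08\x96\x01' encodes "field 1 (varint) = 150"
print(parse_message(b"\x08\x96\x01"))  # → {1: 150}
```

The same few dozen lines in C++ over a buffer you own are the basis of the zero-copy tricks discussed below in the thread.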

~~~
wesm
To be clear for anyone reading, we're parsing and generating the data-related
protobufs ourselves, and retaining ownership of the memory returned by gRPC to
obtain zero copy.

The C++ details are found in

[https://github.com/apache/arrow/blob/master/cpp/src/arrow/fl...](https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/serialization_internal.cc)

~~~
jumpingmice
Have you considered/tried using the new ParseContext instead of
CodedInputStream? It is performance-oriented.

Edit: Apparently it's also the default in protobuf 3.10

~~~
wesm
I wasn't aware of it but will take a look. Thanks!

------
wodenokoto
A bit off topic, but since this is implemented using gRPC, I’d like to ask:
what is RPC, and how does one make a (g)RPC call?

My understanding is that it’s a binary alternative to a JSON/REST API and that
all Google Cloud Platform services use it. However, since I have not managed to
figure out how to do a single interaction with RPC against GCP (or any other
service), I am wondering if my understanding is completely wrong here.

~~~
Matthias247
RPC is a general term and stands for remote procedure call. You do a function
call that might look like a normal function call, but the actual function is
executed on another host.

gRPC is one implementation of RPC, where HTTP/2 is used as the transport layer
and protocol buffers are used for data serialization. You typically use it via
the gRPC framework: generate code for a specific API, then use the generated
code and the client library to perform the call. There can also be other ways,
e.g. proxies to HTTP systems and server introspection mechanisms that allow
calls to be performed without requiring the API specification.
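
The idea predates gRPC and is easy to see in miniature with Python's stdlib XML-RPC (XML over HTTP/1.1 rather than protobuf over HTTP/2, so purely illustrative of the RPC shape, not of how gRPC works):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server: expose an ordinary function as a remotely callable procedure.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client: the call reads like a local function call, but the body
# executes in the server process, with arguments and the result
# serialized over HTTP.
client = ServerProxy(f"http://127.0.0.1:{port}")
print(client.add(2, 3))  # → 5
```

gRPC adds a schema (the `.proto` file), generated client/server stubs, streaming, and a binary encoding on top of this same request/response shape.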

------
riboflavin
Also, for more on Arrow and Flight: [https://www.dremio.com/understanding-apache-arrow-flight/](https://www.dremio.com/understanding-apache-arrow-flight/)

------
algorithmsRcool
Are there any thoughts about where compression fits into this model?

I know networks are getting very fast, but at this size of data I wonder if
there are realizable gains left with modern algorithms like Snappy.

~~~
dikei
Columnar formats generally compress very well due to the similarity between
values in the same column.
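
You can see the effect with nothing but the stdlib (toy JSON data, not Arrow's actual encoding, and zlib rather than Snappy): compress the same table laid out row-wise and column-wise.

```python
import json
import random
import zlib

rng = random.Random(0)
rows = [{"country": rng.choice(["US", "DE", "JP"]),
         "qty": rng.randrange(1000)} for _ in range(1000)]

# Row-oriented: each record interleaves values from different columns.
row_bytes = json.dumps(rows).encode()

# Column-oriented: all values of one column sit next to each other,
# so the compressor sees long runs of similar data.
cols = {"country": [r["country"] for r in rows],
        "qty": [r["qty"] for r in rows]}
col_bytes = json.dumps(cols).encode()

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
print(f"row-wise: {row_size} B compressed, column-wise: {col_size} B compressed")
```

The comparison is rough (the row layout also repeats the keys), but grouping a low-cardinality column like `country` together is exactly what makes columnar formats compress so well.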

------
RocketSyntax
We are struggling with reliability when using mounting solutions for big data
on S3. Would this help?

~~~
khc
Define big data? Have you tried
[https://github.com/kahing/goofys/](https://github.com/kahing/goofys/) ?

Disclaimer: I am the author

~~~
RocketSyntax
Yes. 1PB. Although I don't remember the specifics about reliability; something
about having to remount the entire fs if it wasn't 100% there.

~~~
khc
What do you mean by not 100% there?

------
maximente
meta: the https version works fine:

[https://arrow.apache.org/blog/2019/10/13/introducing-arrow-f...](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/)

