
Building a Real-Time Bike-Share Data Pipeline with StreamSets, Kafka and MapD - josep2
https://www.jowanza.com/blog/2018/9/8/real-time-station-tracking-ford-gobike-and-mapd
======
bengotow
This is awesome! Would love to hear more about the dataviz the author is
putting together. I have to admit the stack seems like it might be a /bit/
overkill for this use case, but it seems like it was a great learning
experience!

~~~
randyzwitch
Thanks for reading! I worked with OP (I'm a MapD employee), and yes, for this
smallish example this is overkill. However, we're planning on adding a lot
more feeds[1] to this pipeline, which will make the tools we used a lot more
necessary.

[1] At the time of writing, there were 213 different bike share feeds
[https://github.com/NABSA/gbfs/blob/master/systems.csv](https://github.com/NABSA/gbfs/blob/master/systems.csv)

~~~
amrrs
Hey @randyzwitch big fan of your Adobe analytics R package and blog.

Regarding this, is it on AWS or on-prem? Is there any basic information you
could share about the hardware?

~~~
randyzwitch
For this example, I set up a hosted Kafka cluster on Azure using HDInsight, as
I didn't want to mess around with setting up Kafka. It's probably an expensive
way to solve this problem long-term, as we're not really using anything exotic
that a stock Kafka install wouldn't do out of the box.

[https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction](https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction)

------
pugworthy
It would be a bit of a nightmare of device maintenance (and loss), but
real-time bike tracking at Burning Man would be pretty nice.

People lose bikes, people get bikes stolen, and people also want to know where
the Yellow Bikes (free-to-use bikes) are located.

------
joeblau
This looks really cool. I would love to see the output of what is
visualized by MapD.

------
CyberDildonics
How is this anything other than a normal database app? How many bikes are
being shared that you need a 'data pipeline' and 'StreamSets' and Kafka and
MapD to make it 'real time'?

~~~
free652
Well don't you need to get the data into a database? How would you design a
system that collects data from a million bikes?

Kafka acts like a buffer, because downstream systems may not be fast enough to
transform and persist the data during spikes.
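
To make the buffering point concrete, here is a toy sketch in plain Python. It
is not Kafka: an in-memory deque stands in for a topic, and the arrival counts
and consumer rate are made-up numbers chosen only to show a spike being
absorbed and then drained.

```python
from collections import deque


def simulate(arrivals, consumer_rate):
    """Return the backlog size at the end of each tick.

    arrivals: messages produced in each tick (hypothetical numbers)
    consumer_rate: most messages the downstream system can persist per tick
    """
    buffer = deque()  # stand-in for a Kafka topic, not the real thing
    backlog = []
    for tick, count in enumerate(arrivals):
        # The producer appends without waiting for the consumer.
        buffer.extend((tick, i) for i in range(count))
        # The consumer drains at its own fixed pace.
        for _ in range(min(consumer_rate, len(buffer))):
            buffer.popleft()
        backlog.append(len(buffer))
    return backlog


# Steady traffic of 2 messages per tick, with one spike of 10.
arrivals = [2, 2, 10, 2, 2, 2, 2]
backlog = simulate(arrivals, consumer_rate=4)
print(backlog)
```

The backlog grows during the spike and returns to zero a few ticks later, so
nothing is dropped even though the consumer never handles more than 4 messages
per tick.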

~~~
randyzwitch
Good point. While this demonstration is for a single bike-sharing program,
there are over 200 feeds that conform to the GBFS specification, most updating
roughly every 10 seconds.
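
As a rough back-of-the-envelope sketch of what that means for polling volume:
the 213 figure is the feed count mentioned earlier in the thread, and treating
every feed as updating exactly every 10 seconds is a simplifying assumption.

```python
FEEDS = 213              # feed count quoted earlier in the thread
UPDATE_INTERVAL_S = 10   # assumed uniform; the real interval varies by feed
SECONDS_PER_DAY = 24 * 60 * 60

# How often a single feed would be polled in one day.
polls_per_feed_per_day = SECONDS_PER_DAY // UPDATE_INTERVAL_S

# Total polls across all feeds, per day and per second.
polls_per_day = FEEDS * polls_per_feed_per_day
polls_per_second = FEEDS / UPDATE_INTERVAL_S

print(polls_per_feed_per_day, polls_per_day, polls_per_second)
```

Each poll returns the status of every station in that system, so the number of
rows reaching the database is a multiple of the poll count.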

------
gcbw2
Do they make that data public through the same loophole that PayPal tries to
use with Venmo? That is, make sharing the data "a feature" so that they can
sell it, as there is no expectation of privacy.

~~~
randyzwitch
The data for this example are aggregate: the number of bikes currently at a
station, along with other selected information about the bike share location.
The lowest-level information is at the bike_id level; no information is
transmitted about which customer is on which bike/bike_id.

[https://github.com/NABSA/gbfs/blob/master/gbfs.md](https://github.com/NABSA/gbfs/blob/master/gbfs.md)
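
For a sense of what the station-level data looks like, here is a minimal
sketch that reads a payload shaped like a GBFS station_status.json document.
The field names follow the spec linked above, but the sample values and the
station_rows helper are made up for illustration.

```python
import json

# Hypothetical payload shaped like GBFS station_status.json.
SAMPLE = """
{
  "last_updated": 1536422400,
  "ttl": 10,
  "data": {
    "stations": [
      {"station_id": "1", "num_bikes_available": 5,
       "num_docks_available": 10, "last_reported": 1536422390},
      {"station_id": "2", "num_bikes_available": 0,
       "num_docks_available": 15, "last_reported": 1536422395}
    ]
  }
}
"""


def station_rows(payload):
    """Flatten a station_status document into (id, bikes, docks) tuples."""
    doc = json.loads(payload)
    return [
        (s["station_id"], s["num_bikes_available"], s["num_docks_available"])
        for s in doc["data"]["stations"]
    ]


rows = station_rows(SAMPLE)
for row in rows:
    print(row)
```

Each row is a count per station; nothing in it identifies a rider.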

------
neuro
Do you work for the company that developed MapD?

~~~
randyzwitch
I do; Jowanza does not.

