Building a Real-Time Bike-Share Data Pipeline with StreamSets, Kafka and MapD (jowanza.com)
62 points by josep2 on Sept 10, 2018 | 16 comments


This is awesome! Would love to hear more about the dataviz the author is putting together. I have to admit the stack seems like it might be a /bit/ overkill for this use case, but it seems like it was a great learning experience!


Thanks for reading! I worked with OP (I'm a MapD employee), and yes, for this smallish example this is overkill. However, we're planning on adding many more feeds[1] to this pipeline, which will make the tools we used much more necessary.

[1] At the time of writing, there were 213 different bike share feeds https://github.com/NABSA/gbfs/blob/master/systems.csv
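
As an illustration, here's a minimal sketch of enumerating those feeds from systems.csv. The raw-file URL and the "Auto-Discovery URL" column name are assumptions; check the CSV header in the repo before relying on them:

    import csv
    import io

    import requests

    # Registry of GBFS systems (assumed raw GitHub URL for the file linked above)
    SYSTEMS_CSV = "https://raw.githubusercontent.com/NABSA/gbfs/master/systems.csv"

    def list_feed_urls():
        """Return the auto-discovery URL of every registered system."""
        resp = requests.get(SYSTEMS_CSV, timeout=10)
        resp.raise_for_status()
        reader = csv.DictReader(io.StringIO(resp.text))
        return [row["Auto-Discovery URL"] for row in reader]

    if __name__ == "__main__":
        print(len(list_feed_urls()), "GBFS feeds registered")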


Hey @randyzwitch, big fan of your Adobe Analytics R package and blog.

Regarding this, is it on AWS or on-prem? Any basic information you could share about the hardware?


For this example, I set up a hosted Kafka cluster on Azure using HDInsight, as I didn't want to mess around with setting up Kafka. It's probably an expensive way to solve this problem long-term, as we're not really using anything exotic that a stock Kafka install wouldn't do out of the box.

https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apach...


Could you support the citybik.es API which supports more than 400 cities (no GBFS requirement)? https://citybik.es/


The example will obviously work with any API, so we could. But I'm not sure using citybik.es adds anything here for my purposes.


It would be a bit of a nightmare of device maintenance (and loss), but real time bike tracking at Burning Man would be pretty nice.

People lose bikes, people get bikes stolen, and people also want to know where the Yellow Bikes (free to use bikes) are located.


This looks really cool. I would love to see the output of what is visualized by MapD.


How is this anything other than a normal database app? How many bikes are being shared that you need a 'data pipeline' and 'StreamSets' and Kafka and MapD to make it 'real time'?


Well, don't you need to get the data into a database? How would you design a system that collects data from a million bikes?

Kafka acts like a buffer, because downstream systems may not be fast enough to transform and persist the data during spikes.
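
As a rough sketch of that buffering role, assuming the kafka-python client and a hypothetical "bike_status" topic, the producer can absorb bursts client-side via its linger and batch settings:

    import json

    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],  # placeholder broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=50,          # wait up to 50 ms so records batch together
        batch_size=64 * 1024,  # 64 KB client-side batches
        acks="all",            # don't lose records during broker hiccups
    )

    def publish(record):
        # send() is asynchronous: records queue in the producer's buffer,
        # so a slow downstream sink never blocks the upstream poller.
        producer.send("bike_status", value=record)

acks="all" trades a little latency for durability, which matters when the buffer briefly holds the only copy of a spike's data.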


Good point. While this demonstration is for a single bike-sharing program, there are over 200 feeds that conform to the GBFS specification, most updating roughly every 10 seconds.
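
To make that concrete, here's a minimal polling sketch. It assumes Citi Bike's public station_status endpoint (the URL is an assumption to verify) and the publish() helper from the Kafka sketch above:

    import time

    import requests

    # One GBFS station_status feed, used here as an example.
    STATION_STATUS = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"

    while True:
        payload = requests.get(STATION_STATUS, timeout=10).json()
        for station in payload["data"]["stations"]:
            publish(station)  # hand each station record to Kafka
        # GBFS responses carry a "ttl": seconds until the data refreshes.
        time.sleep(payload.get("ttl", 10))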


It is a simple demonstration of how different tools work together. Feel free to do something else.


Do they make that data public via the same loophole that PayPal tries to use with Venmo? I.e., making sharing the data "a feature" so that they can sell it, as there is no expectation of privacy.


The data for this example are aggregates: the number of bikes currently at a station, along with other selected information about the bike-share location. The lowest-level information is at the bike_id level; no information is transmitted about which customer is on which bike/bike_id.

https://github.com/NABSA/gbfs/blob/master/gbfs.md
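
To make the privacy point concrete, a station_status record looks roughly like this (field names per the GBFS spec linked above; the values are made up):

    record = {
        "station_id": "72",
        "num_bikes_available": 11,
        "num_docks_available": 28,
        "is_renting": 1,
        "last_reported": 1536590896,
    }
    # No rider identifier appears anywhere in the feed; even free_bike_status,
    # the most granular feed, exposes only a bike_id and coordinates.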


Do you work for the company that developed MapD?


I do; Jowanza does not.



