Building a Real-Time Bike-Share Data Pipeline with StreamSets, Kafka and MapD (jowanza.com)
62 points by josep2 on Sept 10, 2018 | 16 comments


This is awesome! Would love to hear more about the dataviz the author is putting together. I have to admit the stack seems like it might be a /bit/ overkill for this use case, but it seems like it was a great learning experience!


Thanks for reading! I worked with OP (I'm a MapD employee), and yes, for this smallish example this is overkill. However, we're planning on adding many more feeds[1] to this pipeline, which will make the tools we used much more necessary.

[1] At the time of writing, there were 213 different bike share feeds https://github.com/NABSA/gbfs/blob/master/systems.csv
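
As an illustration, here's a minimal sketch of enumerating those feeds from systems.csv. The raw-file URL and the "Auto-Discovery URL" column name are assumptions; check the CSV header in the repo before relying on them:

    import csv
    import io

    import requests

    # Registry of GBFS systems (assumed raw GitHub URL for the file linked above)
    SYSTEMS_CSV = "https://raw.githubusercontent.com/NABSA/gbfs/master/systems.csv"

    def list_feed_urls():
        """Return the auto-discovery URL of every registered system."""
        resp = requests.get(SYSTEMS_CSV, timeout=10)
        resp.raise_for_status()
        reader = csv.DictReader(io.StringIO(resp.text))
        return [row["Auto-Discovery URL"] for row in reader]

    if __name__ == "__main__":
        print(len(list_feed_urls()), "GBFS feeds registered")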


Hey @randyzwitch, big fan of your Adobe Analytics R package and blog.

Regarding this, is it on AWS or on-prem? Any basic information you could share about the hardware?


For this example, I set up a hosted Kafka cluster on Azure using HDInsight, as I didn't want to mess around with setting up Kafka. It's probably an expensive way to solve this problem long-term, as we're not really using anything exotic that a stock Kafka install wouldn't do out of the box.

https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apach...


Could you support the citybik.es API which supports more than 400 cities (no GBFS requirement)? https://citybik.es/


The example will obviously work with any API, so we could. But I'm not sure using citybik.es adds anything here for my purposes.


It would be a bit of a nightmare of device maintenance (and loss), but real time bike tracking at Burning Man would be pretty nice.

People lose bikes, people get bikes stolen, and people also want to know where the Yellow Bikes (free to use bikes) are located.


This looks really cool. I would love to see the output of what is visualized by MapD.


How is this anything other than a normal database app? How many bikes are being shared that you need a 'data pipeline' and 'StreamSets' and Kafka and MapD to make it 'real time'?


Well, don't you need to get the data into a database? How would you design a system that collects data from a million bikes?

Kafka acts like a buffer, because downstream systems may not be fast enough to transform and persist the data during spikes.
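
As a rough sketch of that buffering role, assuming the kafka-python client and a hypothetical "bike_status" topic, the producer can absorb bursts client-side via its linger and batch settings:

    import json

    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],  # placeholder broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=50,          # wait up to 50 ms so records batch together
        batch_size=64 * 1024,  # 64 KB client-side batches
        acks="all",            # don't lose records during broker hiccups
    )

    def publish(record):
        # send() is asynchronous: records queue in the producer's buffer,
        # so a slow downstream sink never blocks the upstream poller.
        producer.send("bike_status", value=record)

acks="all" trades a little latency for durability, which matters when the buffer briefly holds the only copy of a spike's data.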


Good point. While this demonstration is for a single bike-sharing program, there are over 200 feeds that conform to the GBFS specification, most updating roughly every 10 seconds.
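
To make that concrete, here's a minimal polling sketch. It assumes Citi Bike's public station_status endpoint (the URL is an assumption to verify) and the publish() helper from the Kafka sketch above:

    import time

    import requests

    # One GBFS station_status feed, used here as an example.
    STATION_STATUS = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"

    while True:
        payload = requests.get(STATION_STATUS, timeout=10).json()
        for station in payload["data"]["stations"]:
            publish(station)  # hand each station record to Kafka
        # GBFS responses carry a "ttl": seconds until the data refreshes.
        time.sleep(payload.get("ttl", 10))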


It is a simple demonstration of how different tools work together. Feel free to do something else.


Do they make that data public via the same loophole that PayPal tries to use with Venmo? I.e., making sharing the data "a feature" so that they can sell it, as there is no expectation of privacy.


The data for this example are aggregates: the number of bikes currently at a station, along with other selected information about the bike-share location. The lowest-level information is at the bike_id level; no information is transmitted about which customer is on which bike/bike_id.

https://github.com/NABSA/gbfs/blob/master/gbfs.md
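
To make the privacy point concrete, a station_status record looks roughly like this (field names per the GBFS spec linked above; the values are made up):

    record = {
        "station_id": "72",
        "num_bikes_available": 11,
        "num_docks_available": 28,
        "is_renting": 1,
        "last_reported": 1536590896,
    }
    # No rider identifier appears anywhere in the feed; even free_bike_status,
    # the most granular feed, exposes only a bike_id and coordinates.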


Do you work for the company that developed MapD?


I do; Jowanza does not.



