This is awesome! Would love to hear more about the dataviz the author is putting together. I have to admit the stack seems like it might be a /bit/ overkill for this use case, but it seems like it was a great learning experience!
Thanks for reading! I worked with OP (I'm a MapD employee), and yes, for this smallish example it is overkill. However, we're planning to add a lot more feeds[1] to this pipeline, which will make the tools we used much more necessary.
For this example, I set up a hosted Kafka cluster on Azure using HDInsight, as I didn't want to mess around with setting up Kafka. It's probably an expensive way to solve this problem long-term, as we're not really using anything exotic that a stock Kafka install wouldn't do out of the box.
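To be concrete about "nothing exotic": the produce side is basically just "poll the feed, write JSON to a topic." Here's a minimal sketch of that, assuming the kafka-python client and a stock single-broker install; the feed URL, topic name, and broker address are illustrative placeholders, not what we actually deployed:

```python
# Minimal sketch: poll a GBFS station_status feed and produce each station
# record to a Kafka topic. Assumes the kafka-python client; the feed URL,
# topic, and broker address below are examples, not the article's setup.
import json
import time

import requests
from kafka import KafkaProducer

FEED_URL = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"  # example GBFS feed
TOPIC = "gbfs-station-status"  # hypothetical topic name

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # stock single-broker install
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    stations = requests.get(FEED_URL).json()["data"]["stations"]
    for station in stations:
        # Key by station_id so all updates for a station land in one partition
        producer.send(TOPIC, key=station["station_id"].encode("utf-8"), value=station)
    producer.flush()
    time.sleep(10)  # GBFS feeds typically update about every 10 seconds
```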
How is this anything other than a normal database app? How many bikes are being shared that you need a 'data pipeline' and 'StreamSets' and Kafka and MapD to make it 'real time'?
Good point. While this demonstration is for a single bike-sharing program, there are over 200 feeds that conform to the GBFS specification, most updating roughly every 10 seconds.
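To put rough numbers on that, here's a quick back-of-envelope estimate; the average station count per feed is purely an assumption for illustration, not a measured figure:

```python
# Rough volume estimate. The 200 feeds and ~10-second update interval come
# from the comment above; 300 stations per feed is an assumed average.
feeds = 200
stations_per_feed = 300
update_interval_s = 10

records_per_second = feeds * stations_per_feed / update_interval_s
print(f"~{records_per_second:,.0f} station records/second")  # ~6,000
```

At that rate the sustained insert volume starts to add up, which is where the streaming pieces earn their keep.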
Do they make that data public under the same loophole that PayPal tries to use with Venmo? That is, framing data sharing as "a feature" so that they can sell it, since there is no expectation of privacy.
The data for this example are aggregated: the number of bikes currently at a station, along with other selected information about the bike-share location. The lowest-level information is at the bike_id level; no information is transmitted about which customer is on which bike/bike_id.
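For reference, here is roughly what the two kinds of GBFS records look like; the field names follow the public GBFS spec, and the values are made up:

```python
# Illustrative GBFS records (field names per the GBFS spec; values invented).
# Note that neither record carries any customer identifier.
station_status_record = {
    "station_id": "72",
    "num_bikes_available": 11,
    "num_docks_available": 28,
    "is_renting": 1,
    "last_reported": 1529968323,
}

# Lowest-granularity record: the bike itself, still with no rider info.
free_bike_status_record = {
    "bike_id": "bike_4385",
    "lat": 40.7423,
    "lon": -73.9891,
    "is_reserved": 0,
    "is_disabled": 0,
}
```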