
Cut out Kafka by writing directly to S3 and bulk loading from an S3 directory (optimal for Redshift). The article never details what "near real-time" means, which is bothersome.



According to the article, Yelp had 7 different data sources and a similar number of targets.

If they wrote a loader for each source/target combination, they'd end up with 49 loaders - not to mention 7 more to write every time they add an app.

With Kafka - they just need to connect each thing to Kafka - 14 connectors instead of 49.

This is pretty much the scenario Kafka was invented for, and you get stream processing for free: https://engineering.linkedin.com/distributed-systems/log-wha...
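To make the arithmetic concrete, here's a toy sketch (in Python, plugging in the 7/7 figures above as an assumption) of how pairwise loaders scale versus a central log:

    # Toy illustration: pairwise loaders grow multiplicatively,
    # while a central log (Kafka) only needs one connector per endpoint.

    def pairwise_loaders(sources, targets):
        # one hand-written loader per (source, target) pair
        return sources * targets

    def hub_connectors(sources, targets):
        # one producer per source plus one consumer per target
        return sources + targets

    print(pairwise_loaders(7, 7))  # 49
    print(hub_connectors(7, 7))    # 14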


You only need one loader for directory-loading JSON. Yelp already has an ETL (and more transforms) for normalizing the combinations of formats (from log files to events pretty much covers the spectrum).

Redshift will map JSON into columns (with some restrictions around nested arrays, which generally have to be avoided), however you get the data into Redshift. Kafka is a process- and time-wasteful step in almost every Redshift loading scenario, given the current state of AWS services. Test for yourself over a few billion messages at various message sizes from 1k to 1M, if you get the chance.
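For what it's worth, a single "directory loader" can be as small as issuing one COPY over an S3 prefix - roughly something like this minimal sketch using psycopg2 (the cluster, bucket, table, and role names are placeholders, not anything real):

    # Minimal sketch of a single directory loader: bulk-load JSON files
    # under an S3 prefix into Redshift with one COPY statement.
    # All names below (cluster, table, bucket, role) are made up.
    import psycopg2

    COPY_SQL = """
        COPY analytics.events
        FROM 's3://example-bucket/events/2016/10/20/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto';
    """

    def bulk_load():
        # Redshift speaks the Postgres wire protocol, so psycopg2 works.
        conn = psycopg2.connect(
            host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
            port=5439, dbname="warehouse", user="loader", password="...",
        )
        try:
            with conn, conn.cursor() as cur:
                cur.execute(COPY_SQL)
        finally:
            conn.close()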

Kafka is great as a message queue if you can't write to S3 directly, or as a buffer to deal gracefully with S3 hiccups when you need high-frequency throughput to Redshift.
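If you do end up needing that buffer, the shape of it is pretty simple - something like this rough sketch (assuming kafka-python and boto3; the topic, bucket, and batch thresholds are invented): drain the topic, flush micro-batches to S3 as JSON lines, then bulk COPY them as above.

    # Rough sketch: use Kafka as a buffer in front of S3, flushing
    # micro-batches as JSON-lines objects that a bulk COPY can pick up.
    # Topic, bucket, and thresholds below are invented for illustration.
    import json
    import time
    import uuid

    import boto3
    from kafka import KafkaConsumer

    BATCH_SIZE = 10000       # messages per flush
    BATCH_SECONDS = 30       # max seconds between flushes

    def drain_to_s3():
        consumer = KafkaConsumer("events", bootstrap_servers=["kafka:9092"],
                                 enable_auto_commit=False)
        s3 = boto3.client("s3")
        batch, started = [], time.time()
        for message in consumer:
            batch.append(json.loads(message.value.decode("utf-8")))
            if len(batch) >= BATCH_SIZE or time.time() - started > BATCH_SECONDS:
                body = "\n".join(json.dumps(row) for row in batch)
                key = "events/%s.json" % uuid.uuid4()
                s3.put_object(Bucket="example-bucket", Key=key,
                              Body=body.encode("utf-8"))
                # bulk COPY the new key into Redshift here, then commit
                # offsets so the batch isn't re-consumed on restart
                consumer.commit()
                batch, started = [], time.time()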


(disclaimer: I work on the Data Pipeline project @ Yelp - but these opinions are my own and not necessarily representative of Yelp)

@jack9 - If I am understanding your point correctly, it is that you could just use S3 directly as the 'unified buffer' and the Kafka part is unnecessary. I'll try to shed some light on why we made Kafka part of this infrastructure.

- Regarding "near real-time": this is our cheeky way of saying we internally haven't set an SLA or a formal definition of the maximum latency the system can have and still be declared 'real time'. Hard-numbers-wise, the time from a MySQL[0] event happening to a transformed version of it being in Redshift is roughly in the realm of 10-30 seconds (unofficial and not a guarantee of course, just what I'm anecdotally seeing at this point, and we're certain we can bring that down).

- As ora600 points out, we not only have multiple data sources, but also multiple data targets. So it's important to us to reduce the IN*OUT down to IN+OUT. It's important to note that nearly all of these connection points more naturally 'speak' in streaming than batch processing.

- Kafka has been fantastic tech for this use case for us: aside from this S3/Redshift OUT connector, we're generally dealing with connectors which want to be streaming. In fact, if it were performant to do so, we would love to stream message-by-message into Redshift as well (it's not).

- We already have a lot of infrastructure for batch processing - specifically Mycroft[1] (open source[2]), a service for bulk loading from S3 to Redshift - and many systems (especially older ones which wrote to S3 in order to do Map-Reduce processing with MRJob[3], also open source[4]) have been able to use it for this purpose. In many ways we had previously used S3 as our 'unified buffer', but it's not a great solution for stream processing.

- Finally regarding the ETLs:

- - Yep, we have them, and we hate them. The Data Pipeline project, in a sense, almost originated as a way for us to get rid of them - that's certainly why my team has been building the Data Pipeline components (we are 'the data warehouse team' at Yelp). They are costly: each ETL is 'handwritten' (adding a new table means writing a new ETL; granted these are very small classes as we have a decent framework, but it still takes manual effort), this requires a 'Yelp-Main' push (Yelp has talked openly about our efforts to break apart our monolith[5], but it's still a thing and the push process isn't as painless as with our microservices), and inputs from other data sources were challenging as our ETL framework was designed around the 'Yelp-Main' codebase (it's old, cut us slack ;)).

- - A component of the Data Pipeline is what we call the 'Application Specific Transformers'[6] (we call em ASTs for short, much to the confusion of any compiler writers out there), which allow us to apply the types of transformations we typically did in ETLs (outputting both a 'raw id' and the 'encoded id' which is served to the front end, splitting bit-flag ints into separate booleans, and stringifying int enumerations - as well as extra stuff ETLs didn't do, like extracting documentation for Watson[6]).
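To give a flavor of what those transformations look like, here's an invented sketch (not our actual AST code - the field names, bit positions, and enum values are made up):

    # Invented sketch of AST-style per-row transforms: split a bit-flag
    # int into separate booleans and stringify an int enumeration.
    FLAG_BITS = {            # hypothetical bit positions
        "is_elite": 0,
        "is_admin": 1,
        "email_verified": 2,
    }

    STATE_NAMES = {0: "draft", 1: "published", 2: "removed"}  # hypothetical enum

    def transform(row):
        out = dict(row)
        flags = out.pop("user_flags", 0)
        for name, bit in FLAG_BITS.items():
            out[name] = bool(flags & (1 << bit))
        if "state" in out:
            out["state"] = STATE_NAMES.get(out["state"], "unknown")
        return out

    # transform({"user_flags": 0b101, "state": 1})
    # -> {"state": "published", "is_elite": True,
    #     "is_admin": False, "email_verified": True}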

It's worth noting as well that the Data Pipeline lives in PaaSTA[7], which is a big improvement over how the legacy components it replaces were deployed.

All this being said: we thought carefully about meeting all the needs of our systems, but our needs are of course our own. This certainly would be the wrong tool for many other use-cases.

Sorry for slow response - I've been out on PTO and just got back to work today. :)

[0] - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tab...

[1] - https://engineeringblog.yelp.com/2015/04/mycroft-load-data-i...

[2] - https://github.com/Yelp/mycroft

[3] - https://engineeringblog.yelp.com/2010/10/mrjob-distributed-c...

[4] - https://github.com/Yelp/mrjob

[5] - https://engineeringblog.yelp.com/2015/03/using-services-to-b...

[6] - https://engineeringblog.yelp.com/2016/08/more-than-just-a-sc... (Includes some info about the ASTs and Watson)

[7] - https://engineeringblog.yelp.com/2015/11/introducing-paasta-...



