Very well-written article. I'll add that there are tricks to manage that behavior.

The dataflow is a diamond:

    transactions ----> credits 
      \                   \
       ----> debits -----JOIN----> balance ---> total
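
For context, here is a minimal Flink SQL sketch that would produce this shape (the table and column names are my guesses for illustration, not necessarily the article's exact query):

    // Assumes a Flink job with the usual Table API setup:
    // import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    // import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // 'transactions' is assumed to be an already-registered, Kafka-backed table.
    tEnv.executeSql("CREATE TEMPORARY VIEW credits AS "
        + "SELECT to_account AS account, SUM(amount) AS credits "
        + "FROM transactions GROUP BY to_account");
    tEnv.executeSql("CREATE TEMPORARY VIEW debits AS "
        + "SELECT from_account AS account, SUM(amount) AS debits "
        + "FROM transactions GROUP BY from_account");
    tEnv.executeSql("CREATE TEMPORARY VIEW balance AS "
        + "SELECT c.account, c.credits - d.debits AS balance "
        + "FROM credits c JOIN debits d ON c.account = d.account");
    tEnv.executeSql("CREATE TEMPORARY VIEW total AS "
        + "SELECT SUM(balance) AS total FROM balance");

Both credits and debits scan transactions, which is where the problem below comes from.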
In Flink, the SQL query planner may instantiate two different consumer groups on transactions, so that topic might be read twice [1]. The two scans can progress at different rates, which is part of why inconsistencies can appear.

Now, the trick: converting transactions to a DataStream and then back to a table introduces a materializing barrier, and you benefit from high temporal locality. Some inconsistencies may remain, though: the JOIN requires shuffling, and that can introduce processing skew. For example, the Netflix account will be the target of many individual accounts; the instance responsible for Netflix might become a hot spot and process things differently (probably by backpressuring a little, making its data arrive in larger micro-batches).
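
A rough sketch of that round trip using the Table/DataStream interop (assuming Flink 1.14+, with tEnv and the transactions table as in the sketch above):

    // Bounce 'transactions' through the DataStream API so credits and debits
    // are defined over one shared stream, rather than leaving the planner
    // free to instantiate two independent source scans.
    // import org.apache.flink.streaming.api.datastream.DataStream;
    // import org.apache.flink.table.api.Table;
    // import org.apache.flink.types.Row;
    Table transactions = tEnv.from("transactions");
    DataStream<Row> pinned = tEnv.toChangelogStream(transactions);
    tEnv.createTemporaryView("transactions_once", tEnv.fromChangelogStream(pinned));
    // ...then define credits and debits against 'transactions_once' instead.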

Anyway, when processing financial data you might want to make the transaction IDs tag along through the processing, and perhaps pair them back together in tumbling event-time windows at the total; a sketch of that follows. Like the author said: 'One thing to note is that this query does not window any of its inputs, putting it firmly on the low temporal locality side of the map where consistency is more difficult to maintain.' Also, windowing would have introduced some end-to-end latency.
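
One way that pairing could look, as a rough sketch (tx_id, ts, the tagged_legs view, and the one-minute width are all invented for illustration):

    // Hypothetical reconciliation: carry tx_id through the pipeline, then
    // re-pair the credit and debit legs of each transaction inside a
    // tumbling event-time window; a non-zero imbalance flags a mismatch.
    tEnv.executeSql("SELECT "
        + "  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start, "
        + "  tx_id, "
        + "  SUM(credit) - SUM(debit) AS imbalance "
        + "FROM tagged_legs "
        + "GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), tx_id").print();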

This makes me think: do streaming systems introduce a new letter into the CAP trade-off?

I humbly propose: The CLAP trade-off, with L for Latency.

[1] https://issues.apache.org/jira/browse/FLINK-15775?filter=-2
