Teeing, a hidden gem in the Java API (frankel.ch)
242 points by sidcool on May 10, 2021 | 110 comments



I prefer imperative code over collector like APIs for two reasons, both visible in this example:

1. The intention of the code is obscured by more layers of abstraction, increasing complexity.

2. Small changes to what the code is conceptually doing tend to lead to larger changes to the actual code than with imperative code.

For the first point, reading through this post and the previous post linked, the code is doing the following:

Taking the entries of a map of product to count, and turning those entries into a different class: a row of each product and count. On one hand, it collects these rows into a list. On the other hand, it sums the total, based on the per-product cost and the count of the product in the cart. Then it combines the row list and the total cost into one object and returns it.

Deconstructed, half of this code is just unnecessary complexity. A map of products to count, and a list of unique product/count tuples, are theoretically identical. You can iterate over them, find specific products, etc. There /might/ be a reason to specifically desire a list; however, why would that code be coupled with the code that sums the cost of the cart?

All in all, why is the code not just:

  public BigDecimal sumPrice(Cart cart) {
    BigDecimal sum = BigDecimal.ZERO;
    for (Map.Entry<Product, Integer> entry : cart.getProducts().entrySet()) {
      sum = sum.add(entry.getKey().getPrice().multiply(new BigDecimal(entry.getValue())));
    }
    return sum;
  }
For the second point, briefly: Consider how the code would have to change to calculate a deal, such as Buy One Get One Free. Such a calculation would add another layer to the collector, with another function defined somewhere (or hidden in some other existing abstraction, such as OP's CartRow::getRowPrice), instead of being visible in the function that calculates the cart subtotal. If the deal relied on concepts not limited to one row at a time, e.g. buy any two flavors of chips for 25% off, the proposed solution would have to be completely rewritten.
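To make the argument concrete, here is a hedged sketch of how the imperative loop might absorb a Buy One Get One Free rule. Product, sumPriceWithBogo, and the per-entry application of the deal are all my stand-ins, not the article's classes:

```java
import java.math.BigDecimal;
import java.util.Map;

public class BogoSum {
    // Stand-in for the article's Product class.
    record Product(String name, BigDecimal price) {}

    static BigDecimal sumPriceWithBogo(Map<Product, Integer> products) {
        BigDecimal sum = BigDecimal.ZERO;
        for (Map.Entry<Product, Integer> entry : products.entrySet()) {
            int count = entry.getValue();
            int charged = count - count / 2; // every second unit is free
            sum = sum.add(entry.getKey().price()
                    .multiply(new BigDecimal(charged)));
        }
        return sum;
    }

    public static void main(String[] args) {
        Product chips = new Product("chips", new BigDecimal("2.00"));
        // 3 units at 2.00, one of them free
        System.out.println(sumPriceWithBogo(Map.of(chips, 3))); // 4.00
    }
}
```

The deal logic sits directly in the loop body, which is the commenter's point; a cross-row deal (two different flavors) would need a second pass or extra state in either style.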


What you write could be written as:

   ...stream().
     map(e -> e.getKey().getPrice().multiply(new BigDecimal(e.getValue()))).
     reduce(BigDecimal.ZERO, BigDecimal::add)
The problem here is not the use of streams but that the author goes at the problem in a confusing and roundabout way.


That's very pretty to look at and easy to read. But it's a huge pain in the ass to debug when getPrice starts failing. You get a 42000-line stack trace and the only information is that it failed somewhere in that single long line.

The first thing you do is unravel all that crap to a normal loop so you can debug it properly.

And when you're done, you really don't want to go back to the stream way of doing it, just in case it breaks again and you need to start debugging once more.


   >> ...stream().
   >>   map(e -> e.getKey().getPrice().multiply(new BigDecimal(e.getValue()))).
   >>   reduce(BigDecimal.ZERO, BigDecimal::add)
> That's very pretty to look at and easy to read. But it's a huge pain in the ass to debug when getPrice starts failing.

The trick with Java and chained calls is that you have to break the chain across lines, so the stack trace can pinpoint the line it failed at.

so instead of the above, write the code as:

    ...stream().
                map(e -> e.getKey()
                          .getPrice()
                          .multiply(new BigDecimal(e.getValue())))
               .reduce(BigDecimal.ZERO, BigDecimal::add)
And the stack trace will tell you if the getKey() or getPrice() failed or the multiply(...) failed.


If someone would kindly explain this to google-java-format, life would be so much better for everyone involved.


IntelliJ IDEA does this formatting very easily, and you can also customize it pretty easily to fit your aesthetics.


Plus: you can put your formatting rules into your `.editorconfig` and put it under source control.


One downside of this approach is that it only works when you can require everyone who touches the code to use IntelliJ.


Friends don't let friends use Eclipse.

But less flippantly, what other tool is workable for Java?


IntelliJ is my favorite Java IDE, too, but I'm not in the business of telling people what editor to use.

I can see a company deciding to do that for internal projects. But if it's open source, I'm certainly not interested in creating a situation where it's difficult to comply with a project's coding standards without buying a $500 piece of software.


The community edition is free, and usable for most projects that aren't enterprise level.


Looks like it shouldn't when the fluent chain wouldn't fit on one line.

https://github.com/google/google-java-format/issues/341


> you have to do the chain on each line, so the stack trace can pinpoint the line it failed at

...this, to me, speaks to a huge tooling failure. The stack trace should have precise information about the region of the source file that the call was embedded in (line+column start+end), as opposed to merely the line.


I don't think you should be downvoted for this view, but my experience has been pretty good with streams; they took a little getting used to, but now I find they're generally a LOT more robust than the handcrafted stuff. Off-by-ones and NPEs are fairly hard to achieve, and if well done the semantics can be a lot clearer. A lot of enterprise code IS basically plumbing after all!

Where it goes wrong IMO is where a mix of styles and too many inline lambdas splatter a mess of logic into some god-method.

As other comments note, good formatting will help with intelligibility of failures - and IMO sensible logging (at least of error paths) makes it all rather nice to debug.


Why would getPrice start failing? If you have side effects or anything other than a trivial implementation inside getPrice, you have more fundamental problems with what you are building than whether to use a for loop or streams.


Why does any code start failing?

That’s not the point they were making. The point is that when you start chaining together so many calls, it can become difficult to debug. Prettier to write, but more difficult to debug. But that’s okay, you need to find the balance that’s appropriate for the particular project.


> but more difficult to debug.

I've found the stream debugger in IntelliJ to be very useful when looking at debugging streams.

https://www.jetbrains.com/help/idea/analyze-java-stream-oper...

The old plugin version of it has the same functionality (and better pictures) - https://plugins.jetbrains.com/plugin/9696-java-stream-debugg... (click the 'more' link)

Selecting a piece of data within the stream shows you how it moved through the stream.

The other part of streams, for me, is that with the discipline of "each line does one, and only one, thing" - and not putting too much complexity in a single map - I feel they force you to write simpler code that doesn't need much debugging. The question of "how did that data get into the stream?" is where most of the debugging comes from.


First.

If getPrice fails, the stack trace will start there. If it returns null (which it should not) and triggers an NPE, then the line:

   map(e -> e.getKey().getPrice().multiply(new BigDecimal(e.getValue()))).
Is just as dense as the original:

   sum = sum.add(entry.getKey().getPrice().multiply(new BigDecimal(entry.getValue())));
And the stack trace would be just as confusing.

Second.

The key thing with streams is to borrow from the functional programming paradigm: Split data and functions, avoid or isolate side-effects.

Do this correctly and there is a quite real plus in productivity.


Could I trouble you for an example or a reference on what this would look like refactored with "Split data and functions, avoid or isolate side-effects"? I would like to understand better but I don't know enough about functional programming to grasp your meaning just from the comment.


You might read up on pure functions [1] which are unable to produce side-effects when run.

Imagine you have two methods in Java:

  int add(int a, int b) {
    log.info("Adding two numbers {} {}", a, b);
    return a + b;
  }

  void doStuff() {
    add(1, 2);
    add(2, 3);
    add(3, 4);
  }
The result of add in doStuff is unused. However, add has a log statement which someone might be relying on elsewhere. The log line makes it much harder to judge the usefulness of this code, i.e., can you delete this call? It's impossible to know without understanding everything that might consume the log line. The log line is a side-effect of these methods.

In languages that understand "pure functions" there are optimizations that can be done by the toolchain (think automatic memoization, deferred computation, and much more) when only pure functions are called.

[1] https://en.wikipedia.org/wiki/Pure_function


I don't know any good resource. And not all functional programmers are good programmers; but try to understand what they are trying to do.

There is a bit of zen to it, less is more, in the sense that a language gets more powerful if it is more constrained. For example, if you know (by the type system or just coding conventions) that p.getPrice() never returns null, it is easier to reason about (prove, test, read) the code.

Likewise if you know that p1 == p2 implies p1.getPrice() == p2.getPrice() (that would be no side effects).

If, as someone suggested, you need to support some crazy localization, then don't put it into p.getPrice(). If you must, change the name to something telling and make its input explicit: p.calculateLocalizedPrice(locale). Or better, make it an explicit function (a static method, or maybe something sitting in a service) calculateLocalizedPrice(product, locale), and again have it be free of side effects.


I think definitions vary slightly, but the quality you mentioned (a == b => f(a) == f(b)) would be called a deterministic function [1] - very useful, often applicable or even type check enforced in functional programming languages.

Having no side effects is a different very useful quality - function doing exactly as specified and no more (I/O, setting variables ..). I'm not sure if it means not accessing global state, though it is usually better if both inputs and outputs are explicit.

All in, it is usually easier to reason about functions that are explicit, deterministic and side-effect free, yet I find it profoundly more valuable if it can actually be relied on (a known subset of) functions having those qualities.

[1] https://maksimivanov.com/posts/pure-functions-and-side-effec...


Because stuff fails?

Maybe getPrice used to be a static lookup from a map that couldn't fail, but then the Sales Team wanted to go multi-national and now it's a database lookup with multiple dependencies, that can fail.

"But why wasn't it caught in a code review" etc...

Have you actually worked with a big team ever? Why would (or how could) anyone (outside of Google) go through every dependency of the getPrice function and check that every use case is handling errors/exceptions properly?

Stuff breaks, code is read and debugged more often than it's written. Optimising stuff to be easy and fast to write is the wrong way to make maintainable code. Unroll your loops, add toggleable debug logging and add comments why stuff is done the way it is.


The method getPrice is not really correctly named if it did the things you describe. Naming things carefully will give you far more productivity than "unrolled loops, toggleable debug logging and comments on why stuff is done the way it is".


Functions aren't always refactored to their perfect names, for multiple reasons.

getPrice might start out as something that just gets the price; after 5-10 years it might be a complex process accessing some ERP systems.


> Stuff breaks, code is read and debugged more often than it's written. Optimising stuff to be easy and fast to write is the wrong way to make maintainable code.

This is exactly why people like collection functions. For loops can do anything, you have to spend more time reading and understanding the loop to build your mental model of what is happening. Mapping does one thing, transforms a collection into another collection. Same with filter, etc. If you are optimizing for readability, collection functions give way more information to the reader. Your approach is to optimize for debugging, which I'm not saying is wrong, but it's not optimizing for readability.
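A tiny illustration of the claim that each collection function names exactly what it does, so the reader gets one fact per line. The class name, method name, and data here are mine, purely for demonstration:

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapFilter {
    // Each step states its intent: filter keeps, map transforms.
    static List<Integer> longNameLengths(List<String> names) {
        return names.stream()
                .filter(s -> s.length() > 4)  // keep only long names
                .map(String::length)          // transform to lengths
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(longNameLengths(List.of("cart", "price", "total"))); // [5, 5]
    }
}
```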


What map returns depends greatly on the lambda inside. So your collection of beans is changing into a collection of god knows what, and you have to keep that in mind while chaining - because it is nowhere visible.


Whatever exists inside the map lambda would have to exist inside the for loop as well. So if you're dealing with a confusing transformation, a loop doesn't offer you any extra tools for making that transformation more apparent to the reader. Loops have plenty of advantages (computer execution is more obvious, stack traces can be cleaner, easier concept for beginners to grasp), but I have never seen a loop be more readable than a well written functional composition.


Yes, and it tends to be more apparent what its type is. It also tends to have a name that helps understanding a lot. It is not even primarily a loop-vs-stream difference; this particular frustration is the fluent-API-vs-procedural difference.

It is the chaining that obfuscates in this case. Though, I really don't find functional style more readable in general.


If it were Python, I could probably write something like:

    sum(key.price * val for (key, val) in cart.items())
But the Java example makes me think "Wait, what?"

Personally, the lesson I'm drawing is that your library shouldn't actively fight the native syntax of the language you are using - just use what your language provides, that's the cleanest and easiest way to do it.

But I guess people's tastes are different.


Python's `sum()` works only on numbers and assumes an empty list sums to 0. Java's `reduce()` is quite a bit more flexible.

I use Joda Money and frequently find myself doing:

    final Money bucks = stream.map(Thing::getPrice).reduce(Dollars.ZERO, Money::plus);
If you don't pass in a baseline, you get optional:

    final Optional<Money> bucks = stream.map(Thing::getPrice).reduce(Money::plus);
Also, StreamEx improves the ergonomics of Java streams. The way I would implement the original example:

    EntryStream.of(cart.getProducts())
        .mapKey(Product::getPrice)
        .mapValue(BigDecimal::new)
        .mapKeyValue(BigDecimal::multiply)
        .reduce(BigDecimal.ZERO, BigDecimal::add);
Unfortunately BigDecimal is not especially ergonomic (it could use a `multiply(long)` method, which would eliminate the annoying `mapValue()` above). But unlike the python version of this, it will preserve scale. And streams work as-is on Money types.


What. Python's `sum()` works on anything that defines `__add__`, which can be numbers, strings, lists, or your own custom classes.

  sum([[1, 2], [3, 4]], []) == [1, 2, 3, 4]


Actually it seems sum is special-cased to fail for strings, but otherwise you are correct.


Half the problem is that the BigDecimal class is kind of horrid in Java, and so is Map.Entry. The language lacks features that would make them more workable, like operator overloading, implicit constructors and destructuring/pattern matching.

Now, the first two are very fair not to want in a language because they allow programmers to bring in an unlimited amount of user-defined complexity.

Some destructuring however, would just make it easier to work with what's already in the language and libraries.


I don't find 'reduce' to be more readable than + inside a loop.


I think your point is valid with a trivial example as shown in the article.

In a more complex example, with multiple steps and potentially multiple streams, using the Streams API allows the algorithmic and business logic parts of the whole calculation to be kept separate, more visible and more maintainable/adaptable as a result. Imperative code in such cases usually involves factoring out chunks of the for-loop into separate functions or classes which IME goes in the opposite direction.


Your advice is OK for when you are doing a one-off, simple thing.

In general, though, streaming abstractions leave that paradigm in the dust once things become just a tad more complicated.

Using streaming abstractions, you can reason about the flow of your data at a much higher level and create abstractions which would require 100s or 1000s of lines of code to duplicate.

For example, if he decides he needs to buffer the stream, then perform processing on it, then send it out into the world, he can do all of this super easily by using a few streaming combinators while everything is super clear and type-safe.


> create abstractions which would require 100s or 1000s of lines of code to duplicate.

I would like to see an example of this, I just don't think streaming libraries, or abstractions in general, can get rid of that much complexity. In most cases, abstractions can't really get rid of complexity, they just move it around.

The issue I have with streaming abstractions is that the implementation tends to be overly complex. For instance, Rx libraries tend to be thousands of lines of code, and they're often written in ways that make it impossible to trace code or understand a call-stack. It makes some things easier, but when you run into a problem it becomes a nightmare to debug.

I'm just not sure it's warranted a majority of the time. Those libraries tend to be huge because they have to cover every single use-case, and they work at a very high level of abstraction. When you're working with stream-like domains, it's often possible to implement the subset of what you would need of that streaming abstraction with normal code, and in a way which is much easier to understand and debug.


> In most cases, abstractions can't really get rid of complexity, they just move it around.

People say this pithy truism all the time, and it's just not true. You're probably just talking about indirection and encapsulation.

Real abstractions by their very nature reduce complexity. You could say "No true Scotsman" but I deal with actual abstractions in Haskell all the time and it only simplifies code.

So maybe Java is just deficient in its abstraction capability, but parametric polymorphism and pure streams sound like a good start to me. If a stream makes you uncomfy because it's doing O(n) things "under the hood" and you "like to know what the computer is doing", that's probably just a personal comprehension issue.


Who said I was talking about Java?

There are some abstractions which objectively reduce complexity: i.e. the C programming language abstracts over assembly, and essentially formalizes a subset of assembly in a way which reduces cognitive load for the programmer. This is a good abstraction, but this is not what streaming libraries do.

> If a stream makes you uncomfy because it's doing O(n) things "under the hood" and you "like to know what the computer is doing" that's probably just a personal comprehension issue.

If you have to reach for "you can only not like this because you don't understand this", it means you don't have a real argument. I have implemented event-stream-like systems before, and I am perfectly aware what they do "under the hood".

I can give you an example of how these libraries go wrong. I was working a few years ago on a project which made heavy use of event streams, and one of my colleagues opened a PR which passed all unit tests locally, but failed when the CI system ran the tests. After about half a day of trying to figure out the problem, we determined that it was because a different scheduler was being used when the tests were run during the CI pipeline, which resulted in some concurrency related issues. What's worse, it didn't fail reliably, but only intermittently, so this made it extremely difficult to track down. This is what happens when you hand over your flow of control to another system: it makes it very difficult to reason about.

Abstractions are good when they reduce boilerplate and provide a shorthand for monotonous work. When abstractions try to be smart or magical and do work for you, it's almost always a bad thing.


> Abstractions are good when they reduce boilerplate and provide a shorthand for monotonous work. When abstractions try to be smart or magical and do work for you, it's almost always a bad thing.

Depends on your environment ... and the abstractions.

Haskell has a lot of guarantees that C will never have. These guarantees often mean that building abstractions is a lot less error-prone and potentially a lot more useful. In general, stronger language semantics give the implementation of the language more leeway to optimize. All of the undefined behavior in C is a pretty good reason to step extremely carefully while doing anything, especially creating abstractions.

> the C programming language abstracts over assembly, and essentially formalizes a subset of assembly in a way which reduces cognitive load for the programmer.

Languages like Haskell attempt to abstract over computation in general; it makes C and Java seem a lot closer together on the spectrum of languages. I'm by no means an expert Haskell'er, but just dabbling with it has been illuminating, and I've written code in various languages since the 90's.

I hope the GP comment "personal comprehension issue" hasn't thrown you too far off. I think there might be a more tasteful way to express what they were thinking but I can't speak for them.


> you run into a problem it becomes a nightmare to debug.

this happens sometimes due partly to Java being a PITA to do functional programming with, and partly to the initial programmer of the streaming logic not breaking it down into composable functions.

And with most people being more familiar with imperative style, it makes the functional style harder to maintain _for them_.


I'm not just talking about Java - I've also seen this in mobile projects in Swift for example.

> And with most people being more familiar with imperative style, it makes the functional style harder to maintain for them.

It has nothing to do with functional style. It's perfectly feasible to write functional code with minimal abstraction which is easy to trace and understand. The problem is in handing over your flow of control to an overly complex black box of an abstraction layer.


All of that is good when you need it. None of that is good when you don't actually need higher level reasoning about data.


Why do you need to program at a higher level than assembly?


Too much abstraction harms as much as too little abstraction. And Java is exactly the environment that produced "too much abstraction to the point of hurting maintainability" in the past.

If you need higher level abstraction, you should use it. If you don't need it, you absolutely should not use it.


This x1000. We are not writing code for ourselves, we write code for our employers, and it behooves us all to keep it as simple and maintainable as possible. Will you be there in 10 years to explain it? Will you want to be, if you are? If this is your open source project or whatever, do as you please; otherwise follow the principle of least abstraction.


The first point I won't refute. Some people like imperative style and that's fine. I personally don't and find the stream based approaches much simpler.

As for your second point, you could easily add such functionality to the code with a stateful mapping operation that sets the price to 0 for every other item encountered. In your code, you'd have to add another pass of the loop or you'd have to stick the logic for computing the price inside your for loop.

Personally, I've found that decomposing problems into stream based pipelines makes it much easier to decorate additional functionality than imperative code but that's just my personal experience.

  public PriceAndRows getPriceAndRows(Cart cart) {
    DiscountApplier discounts = new DiscountApplier();
    return cart.getProducts()
        .entrySet()
        .stream()
        .map(CartRow::new)
        .map(cartRow -> discounts.apply(cartRow)) // Stateful discount application logic
        .collect(Collectors.teeing(
            Collectors.reducing(BigDecimal.ZERO, CartRow::getRowPrice, BigDecimal::add),
            Collectors.toList(),
            PriceAndRows::new
        ));
  }


The "teeing" combinator is giving you separation of concerns. There are two separate calculations, potentially defined in separate functions, that are composed into a single operation that is performed in a single pass of the stream. Sure, you could write a monolithic imperative for-loop that does it all, but such an approach will not scale.
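A minimal, self-contained sketch of what Collectors.teeing composes: two independent collectors, each its own concern, merged after a single pass. The class name, method name, and data are illustrative, not from the article:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TeeingDemo {
    static double average(List<Integer> xs) {
        return xs.stream().collect(Collectors.teeing(
                Collectors.counting(),                    // first concern: how many
                Collectors.summingInt(Integer::intValue), // second concern: total
                (count, sum) -> (double) sum / count));   // merge the two results
    }

    public static void main(String[] args) {
        System.out.println(average(List.of(2, 4, 6))); // 4.0
    }
}
```

Each downstream collector could be defined and tested on its own; teeing only wires them to the same element flow.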


You don't want the "buy one get one free" to apply directly to the cart subtotal. You want to show the customer which entry is free (or discounted), so they can see that the rebate is applied correctly.

You probably want to check all 'rebate rules' attached to all products in the cart whenever one is added/removed. An active rebate will then modify the price of the cart item, and the sum method can be stupidly simple.


> Consider how code would have to change to calculate a deal, such as Buy One Get One Free.

There's nothing wrong with passing in a function other than `CartRow::getRowPrice`. That's one of the beautiful things about functional programming, you can alter behavior by passing different functions as parameters.

However, as someone who writes ecommerce code that deals with exactly this situation all the time, I can tell you that the question doesn't really make sense. Discounts are usually represented by data in the cart - either a special field, or a negative line item that gets summed along with the others. You don't just mysteriously have a second cheaper item in the cart.


The original is almost impossible to debug when something inevitably goes wrong too.

I have encountered dozens of places where streaming API calls have been reworked into imperative code by whoever ends up maintaining it, just so they can figure out why the hell it is breaking in some unforeseen edge case.


Side-effects do not mix with lazy on-demand streams! This is unfortunately a problem with bringing functional programming constructs to a language with idiomatic pervasive mutation.
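The laziness half of that can be shown in a few lines: an intermediate operation like map() runs nothing until a terminal operation pulls elements, so a side effect buried in it fires later than a reader might expect (or never). This demo, including the run() helper, is my own illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyStreams {
    // Returns {side-effect count before terminal op, count after}.
    static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> doubled = Stream.of(1, 2, 3)
                .map(x -> { calls.incrementAndGet(); return x * 2; });
        int before = calls.get();             // still 0: map has not run yet
        doubled.collect(Collectors.toList()); // terminal op pulls the elements
        int after = calls.get();              // now 3
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println(counts[0] + " then " + counts[1]); // 0 then 3
    }
}
```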


BTW, this is a non-issue in C# since 2.0, due to the `yield return` syntax.


Sorry, I'm skeptical -- how exactly does a 'yield return' resolve the divergences between functional programming & mutable data structures?

My layman's impression was that 'yield return' was largely syntactic sugar around an iterator, rather similar to Java.


The comment you responded to was talking about the debugging experience.


>All in all, why is the code not just:

Because it's hard to review; too many things are happening on one line.

What if you have to negate the multiplication? You're messing that line up even more; with streams, you're adding one line that's easy to read, easier to review, easier to track in changes.


> What if you have to negate multiplication?

Add sum *= -1; on the next line if you think the line is already too complex.

Or just do return -sum;


The example is using the BigDecimal class, where arithmetic operators don't work; you need to call .negate() on the object, or multiply by new BigDecimal("-1"). Negating inside the loop will give a different result than multiplying at the end in the return line. Nevertheless, you're adding ANOTHER operation to a line which is already complex and hard to read.


The exact same thing can be said about the stream version. There is no meaningful difference between the two when it comes to this change - either you do it in the middle of the loop or add a line for it.

Except that the stream version is harder to debug and read, but more effective on a potentially unlimited stream.


You add a new line with .map or .peek before .collect


Which is oh so different than adding the new line to the imperative code? Adding a new line to imperative code tends to be simple.


A new .map line has limited scope of access, and thus highly restricted possibilities.

The same line in a for-loop can access anything, including things from other lines.

The primary benefit of these kinds of functional chains is that each step is highly constrained in their capabilities, mainly by their function name and access scope, so you can better verify that it really does do what it says on the tin.

A for loop's content ultimately has to be read thoroughly, because anything goes - including modifying lists unrelated to the list in question.


The price for that limited scope is that it is much harder to figure out what the actual parameter of the lambda going into the map function is. It is also not even all that limited: in a lambda, everything from the surrounding scope is visible too.

My personal issue with these is that it is all harder to reason about, harder to read, harder to debug. And each time there is a real issue, I have to unpack these into procedural code, fix it, and then encrypt it again.

I don't see fewer bugs in the code since we started using these. Bug counts did not drop.


> The price for that limited scope is that it is much harder to figure out what is actual parameter of lambda that goes into map function

I'm not sure why that would be the case... it's whatever came out of the previous action? You're doing list operations... always.

Nested list operations get convoluted fast, so I generally avoid them (break them up into separate iterations), but otherwise it's fine. The only thing I can remember complicating things is issues with type conversion, but the IDE usually tells you what's up.

The loss of print debugging and useful debugger support is a pain but usually resolved by commenting out half the manipulations and printing immediately.

Otherwise IME it's a lot simpler to see a series of independent operations on the total list than it is to follow a for loop interacting with one element at a time.

But personally I switch between the two styles freely, depending on which one is “cleaner” so I likely naturally avoid the more convoluted/difficult cases.


Fluent interfaces spread the complexity of reading complex operations across multiple lines; you're squeezing everything into one line, where tracking a few characters of change will be difficult to review, difficult to compare across changes, difficult to bisect for bugs (a change in a complex line vs. the addition of a new simple line).

Adding a new line is a simple change with no increased reading complexity. Modifying an already complex line to add a new operation will only increase complexity and the change log of that line, making it all harder. That's all I'm saying: I'd rather read 11 simple lines than 4 complex lines.


I can add one new line into the procedural code that does the exact same thing.

It won't increase reading complexity either. It won't even force me to think about hidden map parameters. It will all be visible directly.


While I understand the reason for Streams and collectors, the Java implementation is really ugly and obscure. It is a language in itself. I want to like them, but I really can't.

On the other hand, in Clojure they are totally natural.


beauty is in the eye of the beholder


streams are best used against data of unknown size (potentially infinite) that isn't necessarily all held in memory or cannot fit within memory limits.


> All in all, why is the code not just:

In the other code, you could substitute parallelStream for stream and have it execute in parallel for "free".
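The "free" parallelism the comment refers to really is a one-word substitution; a minimal sketch (with an artificial sum, not the article's cart code):

```java
import java.util.List;

public class ParallelDemo {
    public static void main(String[] args) {
        List<Integer> nums = List.of(1, 2, 3, 4, 5);
        // Sequential version
        int seq = nums.stream().mapToInt(n -> n * n).sum();
        // Parallel version: only the stream-creation method changes
        int par = nums.parallelStream().mapToInt(n -> n * n).sum();
        System.out.println(seq == par); // true
    }
}
```

This only works "for free" when the collector/accumulator is associative and side-effect free, which is exactly what the imperative for-loop version does not guarantee.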


Another point to consider that is strictly superior is how streams offer ~effortless parallelism.


Not Java, but an example of streams enabling parallelism in another language - https://developers.redhat.com/blog/2021/04/30/how-rust-makes.... Change 4 characters and it goes from saturating one core to saturating all.


True, sometimes imperative loops are easier than streams. But your code is not a suitable substitute in this case where the constraints are clear. He does want the total sum, and he does want the list, and he does only want to iterate once.


In Haskell terms, Collectors have an Applicative instance, and teeing corresponds to the liftA2 function:

> liftA2 :: (x -> y -> z) -> Fold a x -> Fold a y -> Fold a z

The "teeing" functionality is actually one of the "selling points" of the Collector-like library, not a hidden gem:

> This module provides efficient and streaming left folds that you can combine using Applicative style.

It's curious how the same abstraction can have different emphases across languages.
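On the Java side, the liftA2 correspondence looks like this: Collectors.teeing takes two downstream collectors (the two Folds) and a merge function (the x -> y -> z). A small self-contained sketch with an artificial average calculation:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TeeingDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);
        // teeing(foldX, foldY, f) corresponds to liftA2 f foldX foldY
        double avg = xs.stream().collect(Collectors.teeing(
                Collectors.summingInt(Integer::intValue), // Fold a x
                Collectors.counting(),                    // Fold a y
                (sum, count) -> (double) sum / count));   // x -> y -> z
        System.out.println(avg); // 2.5
    }
}
```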

Collectors are also Comonads:

- You can always extract a value of the type that parameterizes the Collector, just by "closing" it.

> extract :: Fold a x -> x

- You could, in theory, "duplicate" a Collector<X> and get a Collector<Collector<X>>. This seems like a dumb function, but it would allow you, for example, to feed different Streams to the "same" collector: duplicate it before consuming a Stream, then take the resulting Collector, duplicate it again, pass it to another Stream...

> duplicate :: Fold a x -> Fold a (Fold a x)

http://hackage.haskell.org/package/foldl-1.4.11/docs/Control...


I imagine part of the reason is because Java has the idiomatic usage of

list.stream().functionalOperations.collect(Collectors.toList())

whereas in Haskell you can just do whatever on your lists and it should fuse... or so I thought. Clearly I don't know too much about which way is the "right way" in Haskell since I haven't used Control.Foldl before and I just sort of assumed fusion would happen at least for most list operations.


You should look into iteratees. It's the same insight: if you have the concept of something that receives values and eventually yields a value of a given type, that's a structure that has some very nice algebraic properties (e.g. they're monads).


I'm not sure you could implement a useful flatMap() for Collector-like types, at least without forcing the collector to hold all the received values in memory, which would defeat the purpose.

It would be like a function that takes

- a Collector that produces an X

- a function that takes an X and returns a Collector that produces an Y

and returns a Collector that produces an Y.

The thing is: while being fed, the result Collector should first feed the initial Collector and, at some time, "switch" to the Collector produced by the function. But when to perform the switch?


Iteratees use a slightly different interface: when you feed one a value you get the next iteratee state (which is either "done" or "in-progress", roughly - you can feed an EOF if you want an iteratee to finish, and you don't have a "current" value until then) and any unconsumed values (possibly all of them). It's counterintuitive to start with, but it makes for a really nice representation.


Here's a possible implementation of "duplicate": https://stackoverflow.com/a/67475265/1364288


Just in case it's not obvious to everyone, the name is a reference to the tee(1) [1] shell command, part of "coreutils" [2]. The manual page begins:

tee - read from standard input and write to standard output and files

It's a proper vintage tool, listing RMS as an author.

Edit: less repetitive repetition, added coreutils link.

[1]: https://man7.org/linux/man-pages/man1/tee.1.html

[2]: https://www.gnu.org/software/coreutils/


The name "tee" is also a reference to a physical piece of plumbing which looks like the letter T, and allows connecting one "input" pipe to two "output" pipes.


Of course, the command evokes an association with the Unix tee command which fulfills a similar purpose: you can use it to pipe the output of some other command into a file while having it be printed to stdout at the same time. So, for instance:

    grep "error\|warning" log.txt | tee /tmp/issues.txt
would find mentions of the terms "error" and "warning" in a file and both print them to the terminal window as well as write them to the file /tmp/issues.txt. This can be quite handy at times.

According to Wikipedia, the name "tee" is a reference to a T-splitter used in plumbing, which makes sense.


You can write either imperative for-loops or a set of connected stream-processors. An arbitrary set of connected "streamers" can always be converted to imperative code and I assume that is what is happening under the covers.

But the reverse is not true, an arbitrary set of for-loops can not be translated into a set of streamers. Right?

That means that the structure of your program is much more constrained when you compose it out of streams, than if you compose it out of arbitrary for-loops.

And if you know that your program obeys a set of constraints imposed on it by the connected streams, the program becomes easier to understand, because you can RE-use your knowledge of how those streams always work to understand every component of the system, meaning every stream-component of it.


Multiple leaps of faith. "More constraints mean easier to understand" is a little too reductive.

Replace streams with "operators". All your program code can be written with operators like dot or plus, and since those operators have constraints, the whole program will be simpler, right?

If that were true, we'd still be writing code in assembly, managing registers by hand. Because hey, just 256 registers means you can reuse your knowledge of registers everywhere, which means programs will be easier to understand, right?

There are examples in this thread showing real cases where plain code is better.


All streams work very similarly. Much more similarly than if you mix and match all kinds of operators together.

Think Unix pipes: they are easy to understand because they all behave similarly.

Yes, sometimes plain code is better, but in cases where streams fit the job, they are better.

I guess the main point is that streams operate on multiple elements and they operate the same on every element. Therefore you don't need to reason what happens to every element that goes through the pipe.

It is a bit like adding and multiplying matrices, you can understand the calculation without having to mentally follow how each matrix element is processed.


There is a talk from Venkat Subramaniam if you want to explore Collectors to death.

Here is the part about teeing https://youtu.be/pGroX3gmeP8?t=5499


I don't think I've watched this one! Thanks!


At first, I thought "this article just made me realize I didn't use Python's itertools.tee to its full potential".

But then I tried to think of code where I would rather use this than a list comprehension, or yield and more manual control flow.

And I couldn't.

Those streams are elegant when the business logic flows perfectly like a river. Unfortunately, reality is messy, and production code will have matching, conditions, casting, extractions and transformations all over the place, leading to:

- very long chains of calls

- hard to change code when feedback pushes for it

- limiting your tooling (especially debuggers) to the stuff that has exceptional support for chaining, and that's rare


This article made me realise that I don't know how to perform step-through debugging on chained of operations on Streams in Java, so I looked it up. I found out that there's Stream Trace Dialog in IntelliJ IDEA [0]. I guess it proves your point that working with Stream API requires additional tooling in cases when the code doesn't 'just work'.

[0]: https://www.jetbrains.com/help/idea/analyze-java-stream-oper...


The ideal scenario with functional programming like this is to reason about the code, and maybe algebraically model it mentally so that you "know" it works.

But I find a lot of programmers don't do that; instead they write a first version which they don't truly (or completely) understand, and then use step debugging to tweak the program until they get to a version that produces their desired outcome.


Ideal scenarios rarely exist IRL. You may be in a rush. Inexperienced. Tired. On a problem you don't completely understand yet. With incomplete information. Exploring data or the problem space. Experimenting with an API. Trying to debug the code your colleague wrote, or a bug in the underlying lib.

That's why practicality beats purity in the vast majority of situations.

There is a place for purity, but you need a hell of a setup.


The way out of this is called unit testing. Code like this is just over engineered crap without it. Most of the effort is writing good tests. Show me the test that tells me in a concise way what this is actually supposed to do.

Mostly purity in this context boils down to weird combinations of premature optimization or complete disregard for that. Mostly it doesn't matter of course since code like this runs on trivial amounts of data so giving the garbage collector a little more work with silly stream objects, boxing/unboxing, etc. does not matter. Code like this does not matter, at all. Unless it's wrong. Hence the need for tests. Without tests it's just more likely to end up in tears. With tests, it doesn't really matter what the code looks like as long as the tests pass. If it's convoluted without tests, it's a problem waiting to happen.


Being tired won't make you write a proper test, and tests won't make you less of a beginner or make you explore data more efficiently, etc.

In fact, it's very hard to write a test first when you are exploring.


Completely agree. Streams are good for when you have a (you guessed it) stream of things where you need to do a limited and well defined number of things.


I think this is made more complex and confusing than it should be.

Notice that the author uses the word price (or row price) for two different things: the price of a product and the total (price * quantity) in a shopping cart line (cart row).

The set of CartRow can be calculated as a straightforward entrySet().map(...) of the shopping cart in its products-map form (Map<Product, Integer>).

The PriceAndRows object is really just the total for all the cart rows plus the union of all the cart rows. Both things can be calculated as a straightforward map / reduce.
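To make the "straightforward map / reduce" concrete, here is a self-contained sketch. The Product and CartRow shapes are hypothetical, reconstructed from the thread's description of the article, not the article's actual classes:

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CartDemo {
    record Product(String name, BigDecimal price) {}
    record CartRow(Product product, int quantity) {
        // total for this line: unit price * quantity
        BigDecimal rowPrice() {
            return product.price().multiply(BigDecimal.valueOf(quantity));
        }
    }

    public static void main(String[] args) {
        Map<Product, Integer> products = Map.of(
                new Product("apple", new BigDecimal("0.50")), 4,
                new Product("bread", new BigDecimal("2.00")), 1);
        // map: each entry becomes a CartRow
        List<CartRow> rows = products.entrySet().stream()
                .map(e -> new CartRow(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
        // reduce: sum the row prices into a total
        BigDecimal total = rows.stream()
                .map(CartRow::rowPrice)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        System.out.println(total); // 4.00
    }
}
```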


In theory, the streaming APIs are meant to make it possible to operate on streaming database queries, where the entire result set is simply too large to fit into memory. A single, unified API over both in-memory data structures and streaming queries is supposed to make it easy to do both with the same set of APIs, without needing to reason about the details or materialize the entire intermediate result in memory [1].

But I find the opposite to be true; it's hard to reason about where/when the entire result might need to be collected, and the streaming APIs cannot really match the query language underneath (SQL, e.g.). Instead, streaming APIs are frankly confusing, and less efficient than just doing the straightforward for loops, IMHO. Especially when things get complicated, with multiple joins and map/reduce.

[1] Another way to achieve the incremental streaming result effect is to write everything in terms of generators. It is sooo much clearer to see a loop over a data structure and a yield to know how much the computation is actually incrementalized, IMHO.


As an aside, I’ve recently been onboarded into a heavy Java ecosystem (backend, Java 8). What are the best resources to follow for growing in this language - and not just mapping patterns from one language to another?


C# Aggregate(..) extension method (linq)?


That’s just reduce. Teeing is basically creating multiple streams from a single one and after doing something with those “branches”, aggregating them.


Aggregate does the job just fine in this case. Multiple streams are not needed.

    .Aggregate(new PriceAndRows(), 
    //here result is the PriceAndRows instance and next is a cart row.
    (result,next) => 
    {
       result.Price+=next.Price; 
       result.Rows.Add(next);
       return result;
    })
I'm trying to think of a better use case where multiple collectors would be cleaner than LINQ but LINQ has a lot of tools in the toolbox, SelectMany, Aggregate, temporary anonymous types, etc.

However, even in the Java side, the example could be done with reduce alone, I think.


> However, even in the Java side, the example could be done with reduce alone, I think.

That’s my point: teeing itself is not aggregate/reduce. If after the branching the streams differ in size, reduce no longer applies, for example.


But my point is that Aggregate and reduce can handle that in this case because they both can simply add to the PriceAndRows instance incrementally. You can sum the total and build the list in the body of the Aggregate/reduce method. Teeing is pointless here. There's no need to use another collector.
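The Java counterpart of the C# Aggregate snippet above is a mutable reduction via the three-argument collect(). A self-contained sketch, with hypothetical Row/PriceAndRows shapes standing in for the article's classes:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class ReduceDemo {
    record Row(String name, BigDecimal price) {}

    static final class PriceAndRows {
        BigDecimal price = BigDecimal.ZERO;
        final List<Row> rows = new ArrayList<>();
    }

    public static void main(String[] args) {
        List<Row> input = List.of(
                new Row("a", new BigDecimal("1.50")),
                new Row("b", new BigDecimal("2.25")));
        // Single pass, no teeing: accumulate the sum and the list together
        PriceAndRows result = input.stream().collect(
                PriceAndRows::new,
                (acc, row) -> { acc.price = acc.price.add(row.price()); acc.rows.add(row); },
                (a, b) -> { a.price = a.price.add(b.price); a.rows.addAll(b.rows); });
        System.out.println(result.price); // 3.75
    }
}
```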


Would the Python3 version be:

map(merge_func, zip(X,Y))?


It looks to be the inverse of that, i.e. splitting a single stream of stuff so two different things can consume it. In Python you'd do something like that using itertools.tee, though I think the Java API goes about it quite differently.

https://docs.python.org/3/library/itertools.html#itertools.t...


No. In idiomatic python, you would not use a functional approach for something like this, but if you really wanted to, you could do:

     def teeing(reducer1, reducer2):
         return lambda acc,elem: (
             reducer1(acc[0],elem),
             reducer2(acc[1],elem)
         )


    functools.reduce(
        teeing(
            (lambda l,elem: [elem] + l),
            (lambda l,elem: elem.price + l)
        ),
        cart.getProducts(),
        ([], 0)
    )


It is quaint how there is supposed to be an "idiomatic python" just as mypy, "pattern statements", walrus operators, etc. are being added.

"Idomatic python" is dead, and lives now only as a pretty dumb ideology which states: everything must be phrased as a naively-typed naively-imperative program. Python is being made, retroactively, a bad imperative programming language: largely because its creator always thought that's what he'd made (he is wrong).

It's a great tragedy "pythonic" has become this: a back-reaction against the times (of increasing adoption of functional programming driven by increasing data-transformation needs).


That seems a very broad statement. Are there any such articles on this topic? I don’t see how Pythonic as a concept has to be mutually exclusive with new language features.

To me, the features you cite seem to fit the Pythonic approach intuitively.


Ah I didn't read their description well and didn't realize it was doing reductions.


Can somebody provide a link what this is all about? What's the objective?



