Personally, I'd like to run purely functional jobs, and get progress information...

sp332 · on Feb 11, 2015

You want a purely functional job with a side effect of updating progress info?

scott_s · on Feb 11, 2015

It's a reasonable request. The job is purely functional in the computation on its incoming data. That means you don't need to worry about persistent state management, which is a problem if you want to easily move jobs around a distributed system. The job (or part of the job) can still send metrics about what it's doing to some other service, and still remain functional with respect to the data it's processing.

sp332 · on Feb 11, 2015

Functional doesn't just mean repeatable. If you modify global state that is not represented in the "input" and "output" of the job, then it's not purely functional.

scott_s · on Feb 11, 2015

Yes, I am aware of what "functional" means. But I think you are being too literal. If you view the computation through the lens of the actual data involved in the computation, then it is functional. It also has some internal state that gets updated and reported as part of monitoring the computation, but that state is not part of the computation itself, so the job is still functional with respect to the actual data in the job.

The key distinction is that this internal metrics data that is just used for monitoring is incidental, and can be thrown away without impacting the computation's result. It's just convenient information that humans might want to have so they can reason about the internal state of the overall system.
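A minimal Python sketch of that distinction. The function name `count_words` and the `report` hook are made up for illustration; the point is only that the result is the same whether or not anyone listens to the progress reports.

```python
from typing import Callable, Iterable, Optional

def count_words(lines: Iterable[str],
                report: Optional[Callable[[int], None]] = None) -> int:
    """Return the total word count. `report` is a hypothetical monitoring hook."""
    total = 0
    for done, line in enumerate(lines, start=1):
        total += len(line.split())
        if report is not None:
            # Monitoring only: the computation never reads this back.
            report(done)
    return total

lines = ["a b c", "d e", "f"]
seen = []  # stand-in for whatever collects the metrics

with_metrics = count_words(lines, report=seen.append)
without_metrics = count_words(lines)

# Discarding the metrics does not change the result.
assert with_metrics == without_metrics
```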

amelius · on Feb 11, 2015

You are right. It is also possible to view the process as a lazy functional computation, which produces a list of progress-items, with the last element of the list being the result of the computation. Such a view is useful when job A invokes another job B, and job A needs to compute new progress information based on the progress information from job B.
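One way to sketch that view is with Python generators, where each job lazily yields progress items and yields its result last. The event tuples, the job names, and the assumption that job B accounts for 80% of job A's work are all invented for the example.

```python
from typing import Iterator, List, Tuple

Event = Tuple[str, float]  # ("progress", fraction) or ("result", value)

def job_b(items: List[int]) -> Iterator[Event]:
    """Sum of squares, yielding a progress fraction after each item."""
    total = 0
    for i, x in enumerate(items, start=1):
        total += x * x
        yield ("progress", i / len(items))
    yield ("result", total)  # last element of the stream is the result

def job_a(items: List[int]) -> Iterator[Event]:
    """Runs job B, then does a little more work of its own."""
    b_result: float = 0
    for kind, value in job_b(items):
        if kind == "progress":
            # Assumption: job B is the first 80% of job A's work,
            # so B's progress is rescaled into A's own scale.
            yield ("progress", 0.8 * value)
        else:
            b_result = value
    yield ("progress", 1.0)
    yield ("result", b_result + 1)

events = list(job_a([1, 2, 3, 4]))
final_kind, final_value = events[-1]
```

A consumer that only cares about the result can skip to the last element, while a UI can render every progress item as it arrives.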

Anyway, I think such a tool would be highly useful. Also, "progress" is an often-overlooked feature, but users actually appreciate it a lot, and often even require it. And you simply can't easily add progress computation to a system as an afterthought. You have to "weave" it through the design as you build it.

scott_s · on Feb 11, 2015

You generally don't want things such as metrics to be on the critical path for the computation itself. What you're proposing is clean by functional programming standards, but it would probably affect performance, because now computing the metrics is tied to the computation. You often want such things to be disjoint for both performance and software engineering reasons.

There's also the fact that in such distributed systems, the entity that consumes the result of the computation is generally not the same entity that consumes the metrics.
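A rough Python sketch of keeping metrics off the critical path, under some assumptions: a bounded in-process queue stands in for the link to a monitoring service, a background thread stands in for the separate consumer, and `emit`, `metrics_consumer`, and `sum_of_squares` are hypothetical names. The job drops samples rather than wait on the metrics side.

```python
import queue
import threading

metrics = queue.Queue(maxsize=100)  # bounded, so a slow consumer cannot grow memory
received = []                       # what the monitoring side ends up seeing
_STOP = object()                    # sentinel used to shut the consumer down

def emit(sample) -> None:
    """Best-effort send: never block the computation; drop if the queue is full."""
    try:
        metrics.put_nowait(sample)
    except queue.Full:
        pass  # losing a metric is acceptable; stalling the job is not

def metrics_consumer() -> None:
    """Stand-in for a separate entity that consumes metrics, not results."""
    while True:
        sample = metrics.get()
        if sample is _STOP:
            return
        received.append(sample)

def sum_of_squares(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
        if i % 1000 == 0:
            emit(("rows_done", i))
    return total

consumer = threading.Thread(target=metrics_consumer, daemon=True)
consumer.start()
result = sum_of_squares(10_000)  # the result goes to the caller, not the consumer
metrics.put(_STOP)
consumer.join()
```

The caller gets `result` regardless of how many samples the consumer actually received, which is the decoupling described above.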