Can you elaborate more on the data-parallel computing part?

tgma · 2024-12-20T19:08:23 1734721703

Pretty much the entire Data Science/Machine Learning landscape, from numpy to tensorflow and the like, consists of thin wrappers over C code. If you want performance, you had better structure your operations as batches beforehand and minimize the back and forth with Python.
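
As a rough illustration of the batching point, here is a minimal sketch using only the standard library as a stand-in for a numpy-style vectorized call. The function names are made up for this example. The first version runs Python bytecode for every element; the second hands the whole sequence to `map`, `operator.mul` and `sum`, which all loop in C. With numpy the same idea would be a single call such as `np.dot(arr, arr)`.

```python
import operator


def sum_of_squares_loop(values):
    """Per-element work in Python: the interpreter executes every step."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_batched(values):
    """One batched expression: map, operator.mul and sum all iterate in C."""
    return sum(map(operator.mul, values, values))


if __name__ == "__main__":
    data = [float(i) for i in range(1_000_000)]
    # Both calls compute the same quantity; only where the loop runs differs.
    print(sum_of_squares_loop(data))
    print(sum_of_squares_batched(data))
```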

dec0dedab0de · 2024-12-23T14:22:38 1734963758

The GIL is only a problem if you’re trying to access the same memory.

If you take the time to split up your data into chunks, you can avoid the GIL entirely with multiprocessing, or by handing the work off to a library that does it for you, as long as the workers are not using the same Python objects.
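
A minimal sketch of that chunking approach, assuming a CPU-bound function that needs nothing but its own chunk. The helper names (`chunked`, `chunk_sum_of_squares`, `parallel_sum_of_squares`) and the worker count are hypothetical, chosen for illustration only.

```python
from multiprocessing import Pool


def chunked(seq, n_chunks):
    """Split seq into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(seq) // n_chunks))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def chunk_sum_of_squares(chunk):
    """Work done inside one worker process; it only touches its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Each chunk is pickled, sent to a worker, and the partial sums combined."""
    chunks = chunked(data, workers)
    with Pool(processes=workers) as pool:
        return sum(pool.map(chunk_sum_of_squares, chunks))


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1_000_000))))
```

Each worker has its own interpreter and its own GIL, so the chunks run in parallel. The chunks and the results still have to be pickled to cross the process boundary.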

ptx · 2024-12-23T15:48:49 1734968929

Using multiprocessing adds some overhead because of serialization. It is slower than handing the Python objects directly to the workers (as you can with threads), because the process doing the parsing also has to spend time serializing them again. So you can avoid the GIL, but it has a cost.

For example, if I parse an XML document with ElementTree, as a quick experiment, parsing the document takes ~1 second and serializing all the element names and attributes to JSON takes an additional ~0.5 seconds. Serializing the whole ElementTree object using pickle takes ~4 seconds. Serializing it as XML takes roughly as long as parsing it.
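
A sketch of how such an experiment could be set up. This is not the original test: the synthetic document, the helper names, and the choice to pickle the root element are all assumptions, and the timings will depend on the document and the machine.

```python
import json
import pickle
import time
import xml.etree.ElementTree as ET


def build_document(n_items=1000):
    """Build a synthetic, flat XML document (a stand-in for a real input)."""
    root = ET.Element("items")
    for i in range(n_items):
        ET.SubElement(root, "item", {"id": str(i), "kind": "demo"})
    return ET.tostring(root)


def names_and_attributes(root):
    """Extract only element names and attributes, as plain Python data."""
    return [(el.tag, dict(el.attrib)) for el in root.iter()]


def timed(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def measure(xml_bytes):
    """Time parsing and three ways of serializing the parsed tree."""
    root, t_parse = timed(lambda: ET.fromstring(xml_bytes))
    _, t_json = timed(lambda: json.dumps(names_and_attributes(root)))
    _, t_pickle = timed(lambda: pickle.dumps(root))
    _, t_xml = timed(lambda: ET.tostring(root))
    return {"parse": t_parse, "json": t_json, "pickle": t_pickle, "xml": t_xml}


if __name__ == "__main__":
    for step, seconds in measure(build_document(200_000)).items():
        print(f"{step:>6}: {seconds:.3f} s")
```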

tgma · 2024-12-25T07:47:01 1735112821

Plus, [de]serialization often needs to be done under the GIL (although each process has its own GIL rather than a shared one, holding it still interferes with any other async work being done in that process).

It also screws up C extensions that don't support fork, and it does not solve the problem of multiplying TLS handshakes and connection fanout for outbound requests by the number of cores.

The whole thing sucks in production.

dec0dedab0de · 2024-12-23T17:47:32 1734976052

multiprocessing has more overhead even without serialization. I just brought it up to expand on why the GIL would force someone to think about being data-parallel.

ptx · 2024-12-23T18:11:14 1734977474

What I was trying to say, I guess, is that the additional serialization overhead can't be parallelized, which means that parallelization with multiprocessing doesn't help much in some cases where GIL-free threading would.
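
A back-of-the-envelope sketch of that argument, in the spirit of Amdahl's law. It assumes the work splits cleanly into a part that scales with the number of workers and a serialization part that does not. The function name and the numbers are illustrative, not measurements.

```python
def estimated_wall_time(parallel_seconds, serial_seconds, workers):
    """Toy model: the parallel part divides across workers, the serial part does not."""
    return parallel_seconds / workers + serial_seconds


if __name__ == "__main__":
    workers = 4
    # Hypothetical job: 8 s of parallelizable work.
    threads = estimated_wall_time(8.0, 0.0, workers)    # objects shared directly
    processes = estimated_wall_time(8.0, 2.0, workers)  # plus 2 s of serialization
    print(f"GIL-free threads: {threads:.1f} s")
    print(f"multiprocessing:  {processes:.1f} s")
```

In this model, adding more workers only shrinks the first term, so the serialization cost sets a floor on the total time.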

francocalvo · 2024-12-20T12:16:01 1734696961

He's probably talking about libraries like PySpark or PyFlink, which are used a lot.

whoiscroberts · 2024-12-22T00:18:12 1734826692

PyFlink seems promising. I love vanilla Flink, but as soon as you need to debug your PyFlink job, PyFlink becomes a hurdle. That translation layer between Python and Java can be opaque.