
I always struggle with this. Even though I understand the difference between IO-bound and CPU-bound tasks, a real application is rarely just one or the other. For example, to train a machine learning model I first need to query a database or read some files, which is IO-bound, but then training a huge model becomes a CPU-bound task. How do you go about this?

Would you then separate the logic in such a way that you can use both multiprocessing and multithreading/coroutines? Is that even possible with async? I have limited experience, but it feels like the moment you introduce async, everything in the code has to be async.
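For what it's worth, you can mix the two: keep the IO on the event loop and hand the CPU-bound part to a process pool via run_in_executor. A minimal sketch (the train_step/load_batch functions are made-up stand-ins, not a real training loop):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def train_step(batch):
    # CPU-bound work; runs in a separate process, so it doesn't block the loop
    return sum(x * x for x in batch)

async def load_batch(i):
    # IO-bound work (e.g. a DB query or file read) stays on the event loop
    await asyncio.sleep(0.01)  # stand-in for real IO
    return list(range(i, i + 4))

async def main():
    loop = asyncio.get_running_loop()
    results = []
    with ProcessPoolExecutor() as pool:
        for i in range(3):
            batch = await load_batch(i)                            # async IO
            result = await loop.run_in_executor(pool, train_step, batch)  # CPU
            results.append(result)
    return results

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Only the orchestration code has to be async here; train_step itself is plain synchronous Python.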



Queues.

If you want to do concurrent programming and pass data between different tasks, then thread-safe queues are a simple and effective way to orchestrate the complexity, in my experience.
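In Python this is basically queue.Queue plus a sentinel value. A toy producer/consumer sketch (the doubling step is just a stand-in for real per-item work):

```python
import queue
import threading

def producer(q, items):
    for item in items:
        q.put(item)          # queue.Queue is thread-safe; no explicit locks needed
    q.put(None)              # sentinel: tell the consumer we're done

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)  # stand-in for the real work

q = queue.Queue(maxsize=8)   # a bounded queue gives you backpressure for free
results = []
t1 = threading.Thread(target=producer, args=(q, range(5)))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4, 6, 8]
```

The bounded maxsize is the useful part: a slow consumer makes the producer block instead of letting memory grow without limit.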

After all that's how parallel processing is managed in the Post Office.


To be honest, your best cheap win for making pipelines fast is to get all your data into RAM and then run your models. Ingestion I/O has lots of surprising bottlenecks, from small-file I/O to NFS to decoding of e.g. image/video frames.

If you can afford it, create a standardized representation for your data and keep it in memory as much as possible. If that's not feasible, write the parsed representation into uncompressed tar files and load those at the start of each batch job.
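The tar trick needs nothing beyond the stdlib: one sequential read pulls a whole shard into RAM, avoiding the per-file overhead. A sketch (the shard path and sample payloads are made up for illustration):

```python
import io
import tarfile

def write_shard(path, samples):
    # Pack already-parsed samples into an uncompressed tar "shard"
    with tarfile.open(path, "w") as tar:      # mode "w" = no compression
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def load_shard(path):
    # One big sequential read instead of thousands of small-file opens
    data = {}
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            data[member.name] = tar.extractfile(member).read()
    return data

samples = [("a.bin", b"\x00\x01"), ("b.bin", b"\x02\x03")]
write_shard("shard.tar", samples)
loaded = load_shard("shard.tar")
print(sorted(loaded))  # ['a.bin', 'b.bin']
```

Keeping the tar uncompressed matters: the payloads are already parsed, so compression would only add CPU cost on the load path.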



