I can't speak for the parent commenter, but there is often code that processes the inputs and outputs of machine learning models and benefits from a high-performance implementation. To give two examples:
1. We recently implemented an edit tree lemmatizer for spaCy. The machine learning model predicts labels that map to edit trees. However, in order to lemmatize tokens, the trees need to be applied. I implemented all the tree wrangling in Cython to speed up processing and save memory (trees are encoded as compact C unions):
2. I am working on a biaffine parser for spaCy. Most biaffine parsers do their MST decoding in Python, which is unfortunately quite slow. Some people have reported that decoding dominates parsing time (rather than applying an expensive transformer + biaffine layer). I have implemented MST decoding in Cython and it barely shows up in profiles:
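
To make the first example a bit more concrete, here is a rough plain-Python sketch of what "applying an edit tree" means. This is not the spaCy Cython code; the `SubstNode`, `MatchNode` and `apply_tree` names are made up for illustration, and I'm assuming the usual formulation where a match node keeps the middle of the form and recurses on a prefix and a suffix, while a substitution node replaces one exact string with another.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class SubstNode:
    """Leaf: replace exactly `orig` with `subst` (hypothetical name)."""
    orig: str
    subst: str


@dataclass(frozen=True)
class MatchNode:
    """Interior node: keep the middle of the form, recurse on both ends."""
    prefix_len: int  # number of leading characters handled by `left`
    suffix_len: int  # number of trailing characters handled by `right`
    left: "Tree"
    right: "Tree"


Tree = Union[SubstNode, MatchNode]


def apply_tree(tree: Tree, form: str) -> Optional[str]:
    """Return the lemma, or None if the tree is not applicable to `form`."""
    if isinstance(tree, SubstNode):
        return tree.subst if form == tree.orig else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    end = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[:tree.prefix_len])
    right = apply_tree(tree.right, form[end:])
    if left is None or right is None:
        return None
    # The matched middle part is copied over unchanged.
    return left + form[tree.prefix_len:end] + right


# A tree that strips a trailing "ed", as would be learned from walked -> walk.
strip_ed = MatchNode(0, 2, SubstNode("", ""), SubstNode("ed", ""))
print(apply_tree(strip_ed, "talked"))
print(apply_tree(strip_ed, "running"))
```

The second call returns `None` because the tree does not fit that form, so whatever calls this needs some fallback for predicted trees that turn out not to be applicable. In pure Python every node is a heap-allocated object, which is roughly the overhead that a compact C encoding avoids.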
I'm curious, since most of the big libraries are already just CUDA calls anyway, but I'm always interested in anything that speeds up the full process.