The serialisation cost of translating data representations between Python and C (or whatever compiled language you're using) is notable. Instead of having the compiled code sit in the centre of a hot loop, it's significantly better to have the loop in the compiled code and call it once.
You don't have to serialize data or translate data representations between CPython and C. That article is wrong. What's slow in their example is storing data (such as integers) the way CPython likes to store it, not translating that form to a form easily manipulated in C, such as a native integer in a register. That's just a single MOV instruction, once you get past all the type checking and reference counting.
You can avoid that problem to some extent by implementing your own data container as part of your C extension (the article's solution #1); frobbing that from a Python loop can still be significantly faster than allocating and deallocating boxed integers all the time, with dynamic dispatch and reference counting. But, yes, to really get reasonable performance you want to not be running bytecodes in the Python interpreter loop at all (the article's solution #2).
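You can see both effects with nothing but the stdlib: the `array` module stands in for a C-extension container of native ints (roughly solution #1), and the builtin `sum()` is a loop that runs in C instead of in the bytecode interpreter (roughly solution #2). This is only a rough sketch of the idea, not the article's actual benchmark.

```python
# Compare a bytecode loop over boxed ints with a loop that runs in C.
from array import array
import timeit

boxed = list(range(1_000_000))           # list of boxed PyLong objects
packed = array('q', range(1_000_000))    # contiguous native 64-bit ints

def python_loop(xs):
    # Bytecode loop: every iteration dispatches opcodes, touches
    # refcounts, and chases pointers to boxed integers.
    total = 0
    for x in xs:
        total += x
    return total

# sum() iterates in C. With the array it still has to box each element
# on the way out, but the interpreter-loop overhead is gone.
assert python_loop(boxed) == sum(boxed) == sum(packed) == 499999500000

t_loop = timeit.timeit(lambda: python_loop(boxed), number=5)
t_sum = timeit.timeit(lambda: sum(boxed), number=5)
print(f"bytecode loop: {t_loop:.3f}s   C loop via sum(): {t_sum:.3f}s")
```

Note that `sum(packed)` still pays to box each native int back into a PyLong as it iterates, which is exactly the per-element cost being discussed: cheap per item, but it adds up if you cross the boundary a million times.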
But that's not because of serialization or other kinds of data format translation.
The overhead of copying and moving data around in Python is frustrating. When you are CPU-bound on a task, you can't use threads (which do have shared memory) because of the GIL, so you end up using whole processes and then waste a bunch of cycles communicating data back and forth. And yes, you can create shared memory buffers between Python processes, but that is nowhere near as smooth as, say, two Java threads working off a shared data structure that's got synchronized sprinkled on it.
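For reference, the shared-buffer route looks something like this (stdlib `multiprocessing.shared_memory`, Python 3.8+). For brevity this sketch attaches a second handle in the same process; in real use the second handle would live in another process, attached by name.

```python
from multiprocessing import shared_memory

# "Producer": create a 16-byte shared block and write into it.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# "Consumer": attach to the same block by name and read it back.
peer = shared_memory.SharedMemory(name=shm.name)
data = bytes(peer.buf[:5])
print(data)  # b'hello'

# Unlike a Java object with synchronized methods, nothing here
# coordinates access: you bring your own locking (e.g. a
# multiprocessing.Lock) and your own layout for anything richer
# than a flat byte buffer.
peer.close()
shm.close()
shm.unlink()
```

That last point is the "nowhere near as smooth" part: you get a raw byte buffer, not a shared object graph, so structured data still has to be packed and unpacked at the edges.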
https://pythonspeed.com/articles/python-extension-performanc...