
Data-Oriented Programming in Python - jbredeche
https://www.moderndescartes.com/essays/data_oriented_python/
======
mjburgess
I went in thinking this would be data-oriented in the ECS sense, i.e., the
way that modern games are designed to optimize class structure for cache
lines.

However, it's about the limitations of Python's computational model (i.e.,
how it imposes large constant factors on most operations due to pointer
indirection), and how numpy solves that in one case (homogeneous arrays).

I was hoping it would include a discussion of the Julia solution: a sort of
partial static typing, where type information is used to generate optimised
code.

It should be possible for Python to follow a similar route: an (optional?)
compilation pass which uses type information to eliminate pointer
indirection, etc.

However I'm not sure how much resistance the actual object model of the VM
poses: would you, in effect, just be compiling to a different VM?

The present CPython approach to optimization is hacky and specific; e.g.,

    sum(a * a for a in range(10))

is faster than any equivalent loop/etc. approach, because `sum()` has a
manually coded fast-path within its implementation.
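
A rough way to see this (a sketch; exact ratios vary by CPython version and
machine):

    import timeit

    # Generator expression consumed by the C-implemented sum()
    gen = timeit.timeit("sum(a * a for a in range(1000))", number=10_000)

    # Equivalent explicit bytecode loop
    loop = timeit.timeit(
        "t = 0\n"
        "for a in range(1000):\n"
        "    t += a * a",
        number=10_000,
    )

    print(f"genexpr+sum: {gen:.3f}s, explicit loop: {loop:.3f}s")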

~~~
mlthoughts2018
Cython offers the kind of targeted compilation you suggest. Numba offers it
in a more opinionated way that requires less overhead than Cython.

These are both long-standard tools in scientific Python.
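
For instance, a minimal numba sketch (assuming numba and numpy are
installed; the JIT specializes the function to the argument types on first
call):

    import numpy as np
    from numba import njit

    @njit
    def total(xs):
        # Compiled to native code on first call, specialized to the
        # int64 array type: no per-element pointer chasing.
        s = 0
        for x in xs:
            s += x
        return s

    print(total(np.arange(10)))  # -> 45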

~~~
mjburgess
My hope is that there are some easy, non-trivial improvements available from
just basic type awareness, e.g.,

    xs: List[int] = list(range(10))
    total: int = 0

should permit an optimised version,

    for x in xs:
        total += x

(almost?) as fast as the equivalent sum.

Python hasn't had typing for long enough for this thought to have percolated
deeply; i.e., I believe at the moment it is simply unconsidered rather than
infeasible.

I suspect there are a few cases that could be hard-coded into the VM that
would make a big difference, e.g., namedtuple/dataclass, List[str],
Dict[str, str], etc.

~~~
mlthoughts2018
Unfortunately what you suggest just isn’t conceptually possible (by design) in
Python, and would require removing many of the language features that make
Python successful for a wide set of use cases.

Type hinting (e.g. List[int]) has no control over the actual byte layout;
it's purely metadata. Nothing stops someone from changing xs[4] to be some
complex object in your example, regardless of the type hint, so at a pure
CPython level it's fundamentally impossible to benefit from static typing
and optimization purely on the basis of type hints.
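
A quick illustration (CPython happily runs this; the hint is never enforced
at runtime):

    from typing import List

    xs: List[int] = list(range(10))
    xs[4] = "not an int"  # no error: hints are ignored at runtime
    print(xs)             # [0, 1, 2, 3, 'not an int', 5, ...]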

This is why Cython and numba play that role. They change the location of the
type metadata: it's not mere runtime metadata, but rather compile-time
static type annotation in the real sense, used for either ahead-of-time
compilation (Cython) or just-in-time compilation (numba).

~~~
mjburgess
I'm not convinced. The compiler can emit different bytecode. The proposal
includes adding a compiler-aware phase that processes type "hints" (which
thereby become actual static types).

If you write "xs: List[int]" and try to change xs[4], the _compiler_ (phase)
would have to reject your code.

This isn't "removing features", as typing is optional. And as long as lists
in a successfully compiled Python program expose the same API, it doesn't
matter that their implementations differ.

~~~
mlthoughts2018
> "If you write "xs: List[int]" and try to change xs[4], the compiler
> (phase) would have to reject your code."

This just can't work, because lists are mutable, so code that mutates them
must be valid and permitted by any such compiler. Similarly, values are
dynamically typed, so the value being assigned might be an int or might not,
and that cannot be decided until runtime (and this inability to prove things
at compile time can't cause the program to fail to compile; it's Python).

Additionally, Python only has arbitrary-precision ints, so it would be quite
hard to get anything useful out of this unless you wholesale switch to using
ctypes or something; but at that point just use Cython and it will be way
easier.
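
For example (nothing in the language bounds an int to a machine word, so a
compiler can't assume a fixed-width representation):

    x: int = 2 ** 64  # wider than any 64-bit machine word
    print(x * x)      # still exact: 340282366920938463463374607431768211456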

I’m not convinced this could ever be anything more than a super limited toy.

------
ForHackernews
This is a niche where [https://julialang.org/](https://julialang.org/) is
really poised to shine: (nearly) as fast as C; (almost) as approachable as
Python.

------
wodenokoto
I thought that data-oriented programming was a paradigm that sits next to
object-oriented and functional programming, and I had a lot of trouble
understanding the DOP-related articles that have been trending on HN
recently.

But apparently it is just what we in R and Python call vectorised functions.

To boil it down:

non-vectorised:

    square(2)      -> 4
    square([2, 4]) -> error

vectorised:

    square(2)      -> 4
    square([2, 4]) -> [4, 16]
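
In numpy terms, for instance:

    import numpy as np

    print(np.square(2))                 # 4
    print(np.square(np.array([2, 4])))  # [ 4 16]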

Vectorisation doesn't care whether you are passing messages between objects,
avoiding side effects through pure functions and monads, or imperatively
mutating your data left and right.

Or am I completely off?

------
throwaway33339
The two-language problem is well-known. People wanted performance, which is
reserved for languages like C, C++ or Java, but they didn't want to use
those languages, since they are objectively ugly and a pain to write. Thus,
languages like Python were born, but we were warned that they were going to
be slow because something something dynamic typing something something the
compiler can't optimize blah blah blah. And so we were told to avoid doing
too many loops, or loading too many objects into memory, or indeed even
attempting to push the language to match one's actual use cases, because
Python wasn't well-built for it.

But in the meantime, languages like R or Matlab had figured out a solution:
write all the heavy-lifting, ultra-optimized algorithms in C or Fortran or
some equally ugly language that no one but really smart nerds wants to
touch, and wrap them in semantics that make loops and loading many objects
unnecessary, called 'vectorized operations'. In R, for instance, you think
you're manipulating mere strings or logicals, but you're in fact
manipulating _vectors of length 1_ of type 'string', 'logical', etc. Doing
operations on vectors or arrays became as seamless as doing them with mere
scalars, with hardly any loss in performance. And so the R world thrived,
although we were still cautioned to use weird lapply/sapply/rapply magic
instead of writing proper loops, because something something compiler
something something slow blah blah blah.

And so the Python world saw that the R and Matlab worlds thrived, and
wondered if it could do the same. A bunch of really smart nerds sat down
with their laptops and wrote a bunch of ultra-optimized algorithms in one of
those ugly languages no one else wants to touch, and lo, in the mid-2010s
Python had finally achieved feature parity with the R and Matlab of twenty
years earlier. Yet the trend showed no sign of slowing, as Python was useful
not only for scientific computing but for many other use cases as well
(have you ever tried to write an interface or webserver in R?), and
sometimes researchers have the audacity to want to do several things at once
with the computer. And so Python achieved its present ubiquity in data
science.

There's trouble in paradise, however. As with R, we were cautioned to avoid
doing too many loops because something something you know what I mean, and
to use vectorized operations instead. And little by little, we had to learn
a little more of numpy's arcane API every day: the right magical formulas to
invoke in order to avoid losing performance. We had to learn which
operations are in-place and which ones create a new array (knowing this
could change over multiple versions), which slicing and indexing approaches
to use, which specific functions to call. And the more our use cases
deviated from the documentation, the more magic we had to learn. At some
point we had to learn obscure methods beginning with an underscore, or even
(the horror!) mind whether arrays were ordered C-style or Fortran-style, or
we were even told to use Cython (!), never mind our desire to absolutely
avoid touching those languages in any way. May Allah be with you should you
ever want to manipulate sparse data.
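
A taste of the arcana in question (a rough sketch):

    import numpy as np

    a = np.arange(6).reshape(2, 3)  # C-ordered (row-major) by default
    f = np.asfortranarray(a)        # same values, Fortran (column-major) layout
    print(a.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True

    b = a + 1   # allocates a new array
    a += 1      # in-place: mutates a's existing buffer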

Aware that the community was having to learn magic whose complexity was on
par with the ugly languages they'd sworn off, really smart nerds took it
upon themselves to... write more magic in order to avoid writing the older
magic. And so we got dask, which is as powerful as it is painful to use. We
got numba, which seems to work automagically in the official demo snippets
and does zilch in your own. 'That's because you're using them wrong', the
smart people tell you on Stack Overflow. 'Teach me how to use them right',
you beg. And so your mental spellbook thickens with no end in sight...

Enter Julia. Julia doesn't have any of the above dilemmas, because Julia is
_fast_. Julia doesn't care whether you vectorize or write loops; you can do
either. Julia doesn't force you to declare types, but you can if you really
want to. Julia doesn't require you to write advanced magic to do JIT
compilation. Julia doesn't see itself as an R or Python competitor: why,
Julia _loves_ Python and R, and in fact you can just call one from the other
if you feel like it! Go on, just RCall ggplot on an array created with
PyCall("numpy"); it just works! Julia was _built_ with parallel computing
and HPC in mind, so there's no need to fiddle with dask boilerplate when it
just works with @macros. Julia knows programmers are afraid of change, so
its syntax is really, really close to Python's. Julia has a built-in package
manager. Julia lets you use the GPU without having to sacrifice a rooster to
Baal every time you want to install CUDA bindings.

Of course Python isn't going anywhere, just as R is still going strong even
after Python 'displaced' it. And of course, Julia's ecosystem is smaller
(but growing), its documentation is lacking, and it doesn't have millions of
already-answered questions on Stack Overflow... but if you know where the
wind blows, you know where the future is headed, and its name rhymes with
Java.

~~~
kgwgk
> if you know where the wind blows, you know where the future is headed, and
> its name rhymes with Java.

[https://www.rhymes.net/rhyme/java](https://www.rhymes.net/rhyme/java)

~~~
pbowyer
I have no clue where the rhyming future is headed. Anyone solved this riddle?

------
zurn
It's interesting that this execution-speed optimization technique has
adopted a name traditionally used with altogether different meanings (like
the quote about a program's data structures being more important than its
code, or how FP and programming with immutable values is actually
data-first programming).

------
andreareina
What I'd like (I don't know how possible it is) is to be able to define
ufunc(-likes) without resorting to C extensions. Obviously they won't be as
fast, but some benefit should be possible?

~~~
pfheatwole
I'm not sure if Numba is in the spirit of your question, but it does make it
easy to write ufuncs without explicitly dropping into C.

[https://numba.pydata.org/numba-doc/latest/user/vectorize.html](https://numba.pydata.org/numba-doc/latest/user/vectorize.html)
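
Following that page's pattern, roughly (a sketch; assumes numba and numpy
are installed, and rel_diff is just an illustrative function):

    import numpy as np
    from numba import vectorize, float64

    @vectorize([float64(float64, float64)])
    def rel_diff(x, y):
        # Compiled into a genuine numpy ufunc, so broadcasting
        # and reductions come for free.
        return 2 * (x - y) / (x + y)

    a = np.arange(1.0, 4.0)
    print(rel_diff(a, 2 * a))  # elementwise over both arrays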

~~~
jojo2000
numba is a true horror to use, because the supported subset of the Python
language is an ever-moving target and small overall. Cython is good but
takes time to get to grips with. For loop-intensive, simple functions I'd
advise FFI.

~~~
qfwfq_
I disagree that it's a "true horror to use." The set of supported built-in
classes grows significantly by the day. It's not as good when used on wholly
unstructured streams of data (e.g. tuples of mixed type, dicts with complex
objects inside them), but if you can spend the design time to arrange things
in a structured manner, it's super easy to use and can seriously boost
performance on simple algos.

I've had a ton of success using it in statistical and computational geometry
applications.

~~~
jojo2000
Let me explain my line of reasoning here (I've been there at least three
times, with three different outcomes):

- Case 1: needed to make calculations with a specialized C library
(precision arithmetic). Built a bare FFI, later replaced with CFFI (see the
sketch below).

- Case 2: loop-intensive on very simple calculations, time-constrained
development. Identified horrible performance in Python: tried numba, which
didn't work; ended up using Cython, which worked really well.

- Case 3: optimize a numpy-intensive routine for performance. Tried numba;
it didn't work. Too expensive to recode in Cython. Looked at numpy's C/C++
interfaces, and at numpy-friendly C++ alternatives. Also tried to trick
numpy into functioning better (do not ever try this; it'll give worse
results). Ended up doing nothing, as development time was the main
constraint here.
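
For case 1, the CFFI route looks roughly like this (ABI mode; the library
and function names are hypothetical stand-ins):

    from cffi import FFI

    ffi = FFI()
    # Declare the C signature we need (hypothetical precision-arithmetic API)
    ffi.cdef("double precise_sum(const double *xs, int n);")
    lib = ffi.dlopen("./libprecision.so")  # hypothetical shared library

    data = [0.1] * 10
    buf = ffi.new("double[]", data)  # marshal the Python list into a C array
    print(lib.precise_sum(buf, len(data)))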

If you have a lot of time and are working on a small program, maybe you can
spend the time to optimize. For a team producing lots of complicated
algorithms with no way to re-develop everything: stick to Python, identify
the performance losses, and choose wisely what you'll optimize.

Numba has progressed but is still not the "drop-in decorator" it's
advertised as. It can even give worse performance in some cases.
Nevertheless, the idea is good and I praise the effort; when it's done,
it'll be massive!

~~~
mlthoughts2018
Numba failing to work in case 2 is extremely bizarre. Can you post that code
or a similar example?

Not only should it work for case 2 - it should require only a one-line
decorator and no user-specified static typing.

I’ve used numba in production for around 6 years now, and never encountered
the problems you describe.

