
Why can’t you pickle generators in Python? A pattern for saving training state - bravura
http://blog.metaoptimize.com/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/
======
raffi
Pickling generators is one step away from being able to pickle a continuation or
save the state of a running function. This is actually a useful thing to do. I
wrote a script a while back that spawned several processes, each of which
required posting multiple jobs to Mechanical Turk in a certain sequence.
My language Sleep supports continuations (and generators) and allows me to
serialize (pickle) my continuations. I just spawned off a thread for each
"process" and periodically saved that executing function to disk. Later when I
wanted to restart the program, it'd look on disk for the saved functions, load
them, and execute them. Very fun.

I wrote an article about how to do this kind of stuff a while back:

<http://today.java.net/pub/a/today/2008/07/24/fun-with-continuations.html>

------
kevingadd
Seems like a misguided way of solving a basic refactoring problem. What he
actually wants to pickle is a data structure that the generator depends on.
The use of some ALL CAPS magic variable name and global state rings all sorts
of warning bells in my head when I look at it.

However, it's hard to tell why he actually needs to pickle generators from
looking at his examples. I can think of some good hypothetical reasons, but
his examples don't justify it - it looks equivalent to 'I want to pickle a
suspended thread' to me.

~~~
bravura
I am the original author.

The example holds without the ALL CAPS magic variable names,
"HYPERPARAMETERS". However, I include HYPERPARAMETERS because I am including
the actual code I am using. Hyperparameters are global, read-only variables
that specify the particular experimental condition being tested. I can't say
that I have the best solution to this particular aspect of experimental
control (hyperparameters). I might write a blog post about it in the future,
to solicit feedback on improved methods. However, I have refined my current
approach over several years, and I currently use the assumption: One
experiment per process. Hence, one set of hyperparameters---specified at
invocation---per process. This assumption has saved me a lot of pain. As I
said, I am interested in discussing alternatives.

The training state, however, is not global. You can pickle the training state
objects individually. I was considering an ugly global way to refactor, but
instead I used this pattern, which is the reason I wrote the article.

I do come to the conclusion that you must pickle the data structure that the
generator depends on. This is why I refactor it into a class object with
__getstate__ and __setstate__ methods.
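
A minimal sketch of what such a class might look like, assuming the generator
yields the same sequence every time it is restarted. The names
`ResumableStream`, `make_gen`, and `squares` are hypothetical and not taken
from the article; the sketch stores only a count and replays the generator on
unpickling:

```python
import pickle


def squares():
    # Stand-in for a streaming input generator (hypothetical example data).
    for i in range(10):
        yield i * i


class ResumableStream:
    """Wraps a generator function and records how many items were consumed."""

    def __init__(self, make_gen):
        # make_gen should be a module-level function so pickle can store it by name.
        self.make_gen = make_gen
        self.count = 0
        self._gen = make_gen()

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._gen)
        self.count += 1
        return item

    def __getstate__(self):
        # The live generator is left out: pickle cannot serialize it.
        return {"make_gen": self.make_gen, "count": self.count}

    def __setstate__(self, state):
        self.make_gen = state["make_gen"]
        self.count = state["count"]
        # Rebuild the generator and replay it up to the saved position.
        self._gen = self.make_gen()
        for _ in range(self.count):
            next(self._gen)


if __name__ == "__main__":
    stream = ResumableStream(squares)
    print(next(stream), next(stream), next(stream))  # consumes 0, 1, 4
    restored = pickle.loads(pickle.dumps(stream))
    print(next(restored))  # continues with 9
```

The replay in `__setstate__` is what makes the determinism assumption
necessary: if `make_gen()` produced a different sequence on the second run,
the restored stream would resume at the wrong item.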

The reason I want to pickle generators is as follows: Generators are the
easiest way to write methods that stream input. When you stop and restart,
you want to resume streaming from where you left off, so refactoring these
generators is on the critical path to persisting your experimental state.

------
bcl
Nice! I haven't needed to pickle generators, but that should come in handy
when I do.

~~~
jcl
If I'm reading it correctly, he's not really pickling the generator itself.
Instead, he's counting the number of times he called the generator and
pickling _that_ instead. Then, to "unpickle", he retrieves the count and calls
the generator that many times.
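
For context, here is a small check of why the count gets pickled rather than
the generator: on CPython, handing a generator object to `pickle.dumps` fails.
The helper name `try_pickle` and the generator `count_up` are made up for
illustration:

```python
import pickle


def count_up():
    # An endless generator used only to have something to pickle.
    n = 0
    while True:
        yield n
        n += 1


def try_pickle(obj):
    """Return the pickled bytes, or the exception raised while pickling."""
    try:
        return pickle.dumps(obj)
    except Exception as exc:  # CPython raises TypeError for generator objects
        return exc


gen = count_up()
next(gen)
result = try_pickle(gen)
print(type(result).__name__)  # TypeError on CPython
```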

Two major limitations of this approach: (1) You can only pickle generators
that generate the same sequence every time they are restarted. (2) All the
work the generator did prior to pickling must be performed again on
unpickling.

~~~
bravura
Good criticisms.

 _(1) You can only pickle generators that generate the same sequence every
time they are restarted._

I don't know how you can persist state if you do not make this assumption.

 _(2) All the work the generator did prior to pickling must be performed again
on unpickling._

Something faster would be to use file.tell() to get the state and file.seek()
to set the state. Since the "unpickling" is not a bottleneck, I didn't
optimize this.
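
A rough sketch of the tell()/seek() variant, assuming the input is a plain
file read line by line. The class name `LineStream` and the temporary file are
hypothetical. It calls `readline()` rather than iterating over the file,
because in Python 3 iterating a text-mode file with `next()` disables
`tell()`:

```python
import os
import pickle
import tempfile


class LineStream:
    """Streams lines from a file and can resume from a saved byte offset."""

    def __init__(self, path, offset=0):
        self.path = path
        self._f = open(path, "rb")
        self._f.seek(offset)

    def __iter__(self):
        return self

    def __next__(self):
        line = self._f.readline()
        if not line:
            raise StopIteration
        return line.decode("utf-8").rstrip("\n")

    def close(self):
        self._f.close()

    def __getstate__(self):
        # Save the byte offset instead of the open file handle.
        return {"path": self.path, "offset": self._f.tell()}

    def __setstate__(self, state):
        # Reopen and jump straight to the saved offset; nothing is replayed.
        self.__init__(state["path"], state["offset"])


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("a\nb\nc\nd\n")
    stream = LineStream(tmp.name)
    next(stream), next(stream)  # consume "a" and "b"
    restored = pickle.loads(pickle.dumps(stream))
    print(next(restored))  # continues with "c"
    stream.close()
    restored.close()
    os.remove(tmp.name)
```

This avoids redoing the earlier work on unpickling, but it still assumes the
file has not changed between saving and restoring.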

