
Incrementally migrating over one million lines of code from Python 2 to Python 3 - el_duderino
https://blogs.dropbox.com/tech/2019/02/incrementally-migrating-over-one-million-lines-of-code-from-python-2-to-python-3/
======
lincolnq
It's worth noting that Guido van Rossum, creator of Python, is working at
Dropbox on this project, and that Mypy is funded primarily by Dropbox (I
think).

I'm glad that they are doing it, because it's contributed hugely to the Python
ecosystem -- especially making it easier for other companies to make the
switch to Python 3.

I highly recommend mypy for new and current python3 users. It's one of the
biggest and best reasons to be using python3 (not that you should need any
more :) )

~~~
xvector
I've never used mypy and I used Python a lot until I recently got my first
job.

It kind of looks like a band-aid. Could you elaborate on why I would use, say,
mypy as opposed to Golang if I don't need the benefits of a scripting
language?

edit: I can understand why mypy would be useful to refactor an existing
(large) project to bring forth type safety guarantees as you go forwards, but
surely for new projects there are other languages of choice if type safety is
what you want?

~~~
setr
I believe the general idea with gradual typing is that

1) It's faster to write the initial proof of concept with a dynamic language

2) It's common that the proof of concept becomes sufficiently big/stable that
it turns into the final codebase

3) Gradual typing makes the path from PoC to real codebase smoother, instead
of re-implementing in another language.

Point 1 is contentious of course, but that argument isn't relevant here. It is
believed, and that belief becomes practice.

An example of point 2 might be OSH of the oil shell, which doesn't use mypy I
believe, but was originally written in python with the intent to rewrite in
C++, and eventually decided it was too costly to rewrite (somehow he decided
that ripping apart the python VM would be the more fruitful path... and
apparently he's making good progress there).

So basically, the use case for mypy is when you're deciding between
continuing with Python and rewriting. Sometimes rewriting was part of the
strategy from the start, other times a script just grew into a full-fledged
program over time, but mypy's primary purpose isn't to be introduced from the
start.

Another approach I've personally used is having the typing available to me from
the start, but not actually making use of it until I reached a particularly
nasty section of code. That particular chunk then becomes (reasonably)
statically typed, while the rest of the program remains standard Python.

Although in that case I used a library that enforced typechecking at runtime
rather than mypy (the library was typeguard iirc), which I only enabled on
test runs (it slowed down the program significantly).
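
A minimal sketch of that partial-typing approach, with hypothetical function names: only the tricky functions carry annotations and the rest stays unannotated. By default mypy skips the bodies of unannotated functions, so the checking is confined to the annotated part.

```python
from typing import Dict, List


def parse_ports(raw: str) -> List[int]:
    """Annotated: mypy checks this body and calls to it from annotated code."""
    return [int(part) for part in raw.split(",") if part.strip()]


def summarize(ports: List[int]) -> Dict[str, int]:
    """Annotated: returning, say, a list here would be flagged by mypy."""
    return {"count": len(ports), "highest": max(ports, default=0)}


def main(raw):
    # Unannotated: mypy skips this body by default, so it stays plain Python.
    return summarize(parse_ports(raw))
```

A runtime checker such as typeguard would instead raise when a wrongly typed value actually reaches an annotated function, which is why it costs time on every call.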

~~~
xiaodai
If you want gradual typing for a smoother ride and performance then just
choose Julia. Seriously.

~~~
happy_man
Let's wait a couple of years until Julia stabilizes before suggesting
it is ready for production.

~~~
setr
Stabilization and proliferation of libraries; afaik its ecosystem is currently
only really fleshed out for numpy/scipy/ml workloads.

~~~
happy_man
It seems that Julia developers' and evangelists' main objective is to "kill
Python", or am I wrong?

~~~
setr
They can have whatever goals they wish, but regardless, having an ecosystem
like .net or python requires a critical mass of users; and once you have that,
the language goals don’t even matter; they’ll just have libraries for
everything, whether they like it or not.

------
saltcured
At work, we have much smaller codebases that have fractional FTEs allocated to
their ongoing maintenance. In spite of the difference in scale, my experience
is similar to what was described in the article. Because we had focused on
getting the unicode to work right in those old code bases, we had good test
coverage for those features as a result.

The other common legacy problems to address were:

- Other implicit encode/decode behaviors in py2 that need to be explicit in
py3

- Old 'print' and 'except' statements not valid in py3, easily rewritten

- Implicitly relative 'import' statements not valid in py3, rewritten with a
little care

- Arithmetic needing a change to the '//' operator for integer division

- Waiting for py3 support in all third-party dependencies

- Dealing with restructuring in standard lib packages and in upgraded
third-party libs w/ py3 support
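
As a rough illustration of the mechanical items in that list, here is what the py3 forms look like. The values are made up for the example, and the relative import is only shown in a comment because it needs a package to run in.

```python
import sys

# print is a function in py3, not a statement.
print("migrated", file=sys.stderr)

# py2's 'except ValueError, exc' becomes 'except ValueError as exc'.
try:
    int("not a number")
except ValueError as exc:
    message = str(exc)

# '/' is true division in py3; '//' keeps the old integer behaviour.
half = 7 / 2
floor_half = 7 // 2

# bytes <-> str conversions must be spelled out with an encoding.
raw = "caf\u00e9".encode("utf-8")
text = raw.decode("utf-8")

# Implicit relative imports ('import sibling') become explicit inside a
# package: 'from . import sibling' (hypothetical module name).
```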

As I reviewed the techniques for straddling py2 and py3, I was displeased with
how many seemed to involve a third dialect which was not really idiomatic py2
nor py3, particularly for the unicode/bytes handling. Many third-party
libraries and frameworks also made different choices for how they handled
this. Trying to integrate those approaches looked to produce even uglier code.

Also, some of our code had evolved since the Python 2.2 days and had
accumulated cruft to import and wrap multiple generations of older standard
lib and add-on packages which we have not cared about for 5+ years. The
additional package restructuring in py3 would have made this even more
bizarre. I wanted to see the code reset to use standard libs where possible
and cull these legacy third-party dependencies. I also wanted the code to
become more idiomatically py3, so whoever visited it for future maintenance
would not need to work so hard to understand it.

So, we chose a clean break where we finally clean up and modernize the code to
py3-only without the added burden of supporting py2 deployments from the same
code. The declared 2020 deadline helped this decision. We branched our repos
and worked on py3 ports and integration testing in parallel while continuing
to run py2 in production. We declared a feature freeze on the py2 code, so we
would not have a merging nightmare later, and so that we could use that as
pressure to prevent procrastination on scheduling the flag day where we merge
PRs and convert all our repos to a py3-only worldview.

------
jonathanpoulter
Garret Walker, from Bank of America, spoke at PyGotham last year and touched
on this topic at the end of his talk:
[https://2018.pygotham.org/talks/seventeen-million-lines-of-p...](https://2018.pygotham.org/talks/seventeen-million-lines-of-python-later-launching-a-startup-in-an-investment-bank/)

Disclosure: I work for Bank of America.

------
jsilence
Why does the desktop client have one million lines of code?

~~~
danpalmer
Not to be facetious, but why wouldn't it?

It has to find and diff files, coordinate a very reliable file upload of many
files, as fast as possible. It has to understand enough about the content of
those files to be able to do useful things with them. It has to reliably
update itself again and again, from possibly very old versions, it has to
communicate with an ever changing API. It has to have enough analytics in it
to support product development, error reporting, understanding how users use
it, how the product needs to evolve over time. It needs to integrate deeply
with all major OSes.

...I'm sure I've missed some things. 1 million lines of code sounds like the
right ballpark to me though.

~~~
sametmax
Plus the UI, the file manager integration, the daemon, the authentication,
the cache... All that cross-platform.

~~~
saagarjha
None of those should be written in Python, except for perhaps the cache
handling code. The rest should be platform-specific.

~~~
barkingcat
Python is platform specific.

~~~
saagarjha
…no?

~~~
chrisseaton
You can write platform specific code in Python. You can get the current
operating system and you can write an if statement comparing against operating
systems you want to write specific code for.
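
A small sketch of that kind of branch, using `sys.platform` from the standard library. The function and the directory names it returns are hypothetical, chosen only to show the shape of the check.

```python
import sys


def config_dir_name(platform: str = sys.platform) -> str:
    """Pick a per-OS directory name (the names are hypothetical)."""
    if platform.startswith("win"):
        return "AppData"
    if platform == "darwin":
        return "Library/Application Support"
    # Everything else is treated as a Unix-like system in this sketch.
    return ".config"
```

Passing the platform as a parameter keeps the branch testable on a single machine.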

~~~
saagarjha
> You can write platform specific code in Python.

It's clearly not designed for doing anything advanced, though: it's not like
you can get access to platform-specific APIs outside of the basics without
doing a lot of work.

~~~
sametmax
Quite the contrary, it is easy to call system APIs from Python. That's why we
have so many Python wrappers for C code.

You have the stdlib ctypes of course, but also pypy's cffi for an even better
story.

We even have specialized ready-to-use tools for those, such as pywin32 to
manipulate the Windows API and registry, PyQt to leverage the C++ Qt lib and
all its tooling, or watchdog to watch for file changes, using the native API
when possible.

In fact, macOS and Linux distributions implement a lot of their OS tools in
Python for this reason.

There is also a reason it's one of the few languages allowed in prod at
google. It's the best at nothing, but it's damn good at most things.
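
For reference, a minimal sketch of the ctypes route mentioned above, assuming a POSIX system where the C library can be loaded. It calls the C function `strlen` directly; the input string is arbitrary.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the current process.
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"dropbox")
```

Declaring `argtypes` and `restype` is the manual part; whether that counts as easy or as a lot of work is exactly what this subthread disagrees about.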

~~~
saagarjha
> You have the stdlib ctypes of course, but also pypy's cffi for an even
> better story.

Not something I would call seamless.

> pyqt to leverage the c++ qt lib and all its tooling

This isn't native.

> mac and linuxes implement natively a lot of tools in their os for this
> reason

Generally command line scripts, which is something that Python is good at.

> There is also a reason it's one of the few languages allowed in prod at
> google.

What Google allows in production has only an oblique relevance to whether it
should be used for a specific task.

------
nvr219
That "learn python the hard way" guy says don't use python 3. As someone who
is at the "Hello World" stage, should I use 2 or 3?

~~~
mserdarsanli
They are quite different languages with similar names, like java and
javascript

~~~
ben509
I saw an interesting article about how a million line production codebase was
able to run simultaneously in both python 2 and 3, so I don't see how you
could call them "different languages with similar names."

------
danpalmer
Nice write up.

At Thread we did a similar thing. Admittedly our codebase is ~10% as big, but
incrementally adding linters for incompatible code, and keeping the CI green,
helped loads. We only had about 2-3 days of engineering time to ship the final
version, the rest was done in 10% time.

------
rhacker
Is python finally getting past the 2/3 jump?

~~~
sametmax
It has for 2 years now. 3.6 was the tipping point, such a fantastic release.

See

[https://www.jetbrains.com/research/devecosystem-2018/python/](https://www.jetbrains.com/research/devecosystem-2018/python/)

For 2018 stats

~~~
fredthomsen
Agreed. I have been very pleased with the cleaner asyncio API in 3.7 though.
Much simpler for people writing async code for the first time.

~~~
sametmax
I suggested including something like asyncio.run() much earlier than
that, but python-ideas is a dead end for actually getting things into Python.
Eventually Yury saved us all because he could demonstrate the usefulness of
things on uvloop first.
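
For anyone who hasn't seen the difference, a short sketch comparing `asyncio.run()` (added in 3.7) with the loop handling that had to be written by hand before it. The coroutine is a made-up stand-in for real I/O.

```python
import asyncio


async def fetch_answer() -> int:
    await asyncio.sleep(0)  # stand-in for real I/O
    return 42


# Python 3.7+: one call creates the loop, runs the coroutine, closes the loop.
result = asyncio.run(fetch_answer())

# Roughly what had to be written by hand before 3.7:
loop = asyncio.new_event_loop()
try:
    older_result = loop.run_until_complete(fetch_answer())
finally:
    loop.close()
```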

------
dekhn
While adding Python3 support for numpy at a time when I worked for a large Ad-
serving company, we managed to uncover a bug in python 2/3's import code that
had been latent (no crash) in Python 2 for 15+ years. The problem was unique
to people who had made it possible to import the same library from two
filesystem locations.

In python2, it was silent, in python3 it was a segfault. There was really only
one person in the company truly qualified to understand and fix the bug.

I've finally moved over to Python3 but boy, was that an unwanted transition.

------
omarforgotpwd
What a disaster of lost productivity

------
raverbashing
Python 2 to 3 migration is straightforward for the most part, but in the end it
has some nooks and crannies.

The trickiest one we tripped over: strings in Python 3 have the __iter__ method
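
One plausible reading of how this bites (an assumption, since the comment gives no details): in Python 2, `str` had no `__iter__` attribute, so `hasattr(obj, "__iter__")` was a common way to tell containers from strings. In Python 3 that test is true for strings too. A sketch with a hypothetical `flatten` helper:

```python
def flatten(items):
    """Flatten nested lists/tuples, treating strings as single values."""
    flat = []
    for item in items:
        # hasattr(item, "__iter__") alone is True for str in Python 3,
        # so strings have to be excluded explicitly.
        if hasattr(item, "__iter__") and not isinstance(item, (str, bytes)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat
```

Without the `isinstance` check, a one-character string would be iterated into itself forever and end in a `RecursionError`.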

------
painful
mypy is a joke of a tool. I have actually used it, and 80% of its messages are
useless junk or just plain wrong. Granted, the other 20% can be on point. All
things considered, it's better to be with a tool like it, than without.

------
petters
> and removed Python 2 from our application binary. This marked the end of our
> Python version migration!

No. :-) Now you need to remove all compatibility code, modernize your syntax
and gradually start using new features.

------
ilovecaching
I've noticed a trend that migrating from Python 2 to Python 3 includes adding
mypy annotations and going async.

At that point, why not Go? You're trying to correct for a language that was
designed for scripting, not application software. Go already has a type
system, goroutines are a simpler model of concurrency than async (and Go can
actually use multithreading), you don't have to choose between writing an
async or non-async library, there's a built-in formatter (whereas Black is
still experimental), and the code will run 10x faster.

~~~
sammnaser
I see your point, but I think when migrating over a million loc, it's a
question of practicality. Incrementally migrating a project from Python 2 to
3, between which at least large-scale architectural patterns are more-or-less
identical, is a completely different story from migrating to Golang, which
uses drastically different paradigms to structure code and think about data
flow.

Reminds me of some of the points made here:
[https://www.joelonsoftware.com/2000/04/06/things-you-should-...](https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/).

~~~
ilovecaching
Are most people migrating over a million loc? And if they are, is it a single
million-loc Python service, or a bunch of small services that could all be
rewritten independently? Chances are the worst-case scenario is very rare.

Golang really isn't that different from Python. They are both incredibly
imperative languages with a one way mentality, and most people who learned CS
in college will easily recognize both. The major difference is that Go has
much less to learn, and very few patterns to speak of.

