
Ask HN: How do you handle large Python projects? - nthompson
In C++, I follow a playbook for keeping all hell from breaking loose:

1) Write a googletest
2) Write a googlebenchmark
3) Run all unit tests under AddressSanitizer, ThreadSanitizer, and g++ UB sanitizer
4) Tidy up with clang-format
5) Run cppcheck

So I feel pretty confident I'm not doing something braindead if I can get this stuff through CI.

But for Python, I don't really have a good idea when I'm doing something that'll cause me agonizing pain in the future. The only tool I use is flake8, which is awesome, but I can't see memory leaks or performance profiles.

What strategies do you adopt (and what tools do you use) to keep all hell from breaking loose in large Python projects?
======
justinsaccount
Personally I focus on not creating large projects. I make as many small
individually tested libraries as I can and assemble them in smaller projects.

Not exactly the 'microservices' approach, but similar ideas.

One of the most useful things related to this is to focus on interface design.
It's easy to scrap a bad implementation and re-implement the same interface
later, it's harder to fix a bad interface that you're using all over the
place. Making some implementations pluggable up-front will also make it easier
to swap things out later.
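
A minimal sketch of the "stable interface, replaceable implementation" idea, assuming Python 3.8+ for `typing.Protocol`. The names `Storage`, `InMemoryStorage`, and `save_report` are hypothetical and only serve the illustration.

```python
from typing import Dict, Protocol


class Storage(Protocol):
    """The interface that calling code depends on (hypothetical example)."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStorage:
    """One implementation; it can be scrapped and rewritten later."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


def save_report(storage: Storage, name: str, body: str) -> None:
    """Caller code only knows about the Storage interface."""
    storage.put(f"reports/{name}", body.encode("utf-8"))


store = InMemoryStorage()
save_report(store, "weekly", "all green")
print(store.get("reports/weekly"))  # b'all green'
```

Swapping in a file-backed or networked implementation later would only require a class with the same two methods; `save_report` would not change.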

Another thing that ended up causing the most pain in the long run was building
in too much functionality directly instead of leaving things up to plugins.
Plugins can more easily be enabled or disabled to compose specific
functionality. The alternative is tons of code and tons of configuration
options to handle every little corner case.
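
One possible shape for the plugin approach is a small registry that features add themselves to, so that a deployment enables a list of names instead of toggling configuration flags. This is only a sketch; `PLUGINS`, `plugin`, and `run_pipeline` are hypothetical names.

```python
from typing import Callable, Dict, Iterable

# Registry of available plugins, keyed by name (all names are hypothetical).
PLUGINS: Dict[str, Callable[[str], str]] = {}


def plugin(name: str):
    """Decorator that registers a text-processing function under `name`."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[name] = func
        return func
    return register


@plugin("strip")
def strip_whitespace(text: str) -> str:
    return text.strip()


@plugin("upper")
def uppercase(text: str) -> str:
    return text.upper()


def run_pipeline(text: str, enabled: Iterable[str]) -> str:
    """Apply only the plugins that were enabled, in the given order."""
    for name in enabled:
        text = PLUGINS[name](text)
    return text


print(run_pipeline("  hello ", ["strip", "upper"]))  # HELLO
print(run_pipeline("  hello ", ["strip"]))           # hello
```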

As far as lower level tools, the 'coverage' tool integrated with your test
suite is a must have.

~~~
StavrosK
> Not exactly the 'microservices' approach, but similar ideas.

Tangentially, I don't like how people conflate writing small, self-contained
libraries with "microservices". It's like people forgot that you can write
libraries without sticking a huge mess of networking and glue code in between
them.

([https://www.stavros.io/posts/microservices-cargo-
cult/](https://www.stavros.io/posts/microservices-cargo-cult/) for a related
rant)

------
YZF
I worked on a million+ line Python project, here are some observations:

* No duck typing

* Document all input/output parameters to functions.

* Avoid fancy meta-programming.

* Try to break it up into smaller pieces with well defined interfaces that are ideally not pythonic. Think about the interfaces as something you'd potentially want other programming languages to be able to talk to.

* Don't use eval/exec or more generally don't pass Python code around.

In general as the scope of a module/unit gets larger you want to stick with
simpler stuff. If the module/unit is small, the interface is small, and there
are good tests around it, you can do anything you want inside that smaller
bit.

[EDIT: These comments are mostly Python specific. On top of that you'd apply
what you would in any other language, organize your code properly, consistent
style throughout which in Python includes following the relevant standards
PEP8, doing code reviews, tests etc. etc.]

~~~
StavrosK
This is pretty accurate. I'm also very excited about the new (optional) static
typing, which should help a whole lot with maintainability. If possible (it
probably is possible nowadays), use it, and lint your project with mypy. It
will show you lots of potential problems with your code, and your interfaces
will be much cleaner.

Trying to make your parts as self-contained as possible is always a good idea.
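
As a rough illustration of documented, typed interfaces, here is a small sketch using the standard `typing` module. The function names are hypothetical; the point is that a checker such as mypy can verify how the return value is used.

```python
from typing import List, Optional


def find_position(names: List[str], wanted: str) -> Optional[int]:
    """Return the index of `wanted` in `names`, or None if it is absent."""
    for index, name in enumerate(names):
        if name == wanted:
            return index
    return None


def describe(names: List[str], wanted: str) -> str:
    position = find_position(names, wanted)
    # Without this None check, a type checker such as mypy would reject
    # `position + 1`, because the declared return type is Optional[int].
    if position is None:
        return f"{wanted}: not found"
    return f"{wanted}: position {position + 1}"


print(describe(["ann", "bob"], "bob"))  # bob: position 2
print(describe(["ann", "bob"], "eve"))  # eve: not found
```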

~~~
agumonkey
Lots of people are going into typescript (es6 + types). I wonder how long
before python gets the same. And when all languages merge into one. With a new
'(universal . syntax).

------
dagss
I work on a 100 KLOC Python project and have written a fair bit of C++ too.

We must have different definitions of "all hell from breaking loose", since
none of those tools would help me avoid that in C++ at all. Hell has broken
loose when you have an unmaintainable ball of mud, which IMO has little to do
with what those tools help with.

But for those tools:

Many of them you just don't need in Python due to language differences. You
need to make sure all branches of your code are somehow tested, but memory
management is easier with (mandatory) reference counting, you don't have
pointers that are not valid, and so on.

As for performance profiles, well, since you're thinking of using Python, you
already decided IO is your bottleneck and not CPU. The moment you find
yourself thinking "I should profile this code and speed it up" is the moment
you consider using another language for the job.
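
For the cases where a profile is still wanted, the standard library ships `cProfile` and `pstats`. The sketch below profiles a deliberately slow, made-up workload (`slow_sum` and `workload` are hypothetical) and prints the most expensive calls by cumulative time.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload() -> int:
    return sum(slow_sum(2000) for _ in range(50))


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string so it can be printed or logged.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```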

~~~
twotwotwo
Working on a SaaS product with ~150K lines of Python in it (wc lines; there are
fewer sloccount lines).

Very much agree about maintainability. When the business situation allows it,
you need to get around to fixing the annoyances that pop up and bug you.

You eventually wind up asking yourself "is it safe to change this behavior?"
pretty often. If you do something weird you know needs to be preserved (or
know can be fixed after some other fix), comment why. If a piece of code is
depended on by some other thing in a potentially unexpected/important way,
note that. Of course, you can't always predict what info will be most useful,
but at least think of it as you write.

You're probably on a team on this size of project. You should make as much
outside-the-code info as you can accessible and searchable by as many folks as
possible. It can be reasonable to make bugs tracking your
work/plans/intentions even if you're the only one working on a change. Write
internal docs on how large new chunks of code work. Sometimes writing forces
you to sort things out that you didn't even realize were messy when they were
just in your head.

You can wind up with accidental "ownership" of old and more-difficult-to-
safely-update code. Fight this. Try to get everyone working on everything as
much as feasible. If you worked on the old code, be clear you don't think it's
perfect and encourage improvements to it. If you're looking at other folks'
code, if you need to ask silly-sounding questions to help figure it out, do.

Relatedly, new folks will need help, especially if they are actually junior,
not just new to the product. Do help, and when you help them with a specific
problem also help them fish (like, show where they'd go to look for the info
you're pointing them towards, and if there's no such place, maybe there should
be one). When they think things are strange, pay attention; they aren't inured
to your project's weirdness like you inevitably are.

Release often, have real QA, and be prepared to be responsive as bugs come up.
Releasing often tends to find bugs while the change that introduced them is
fresh on the mind. Splitting up large features can help achieve that (we're
still figuring this out ourselves, though).

It's probably worth running your unit tests in parallel (I've heard of a very
large Ruby SaaS product using a cluster). Depending on what kind of stuff you
test, you may have to deal with flaky tests. They're one of those annoyances
worth working on; you want a test failure to be meaningful.

Look for ways to shrink the project. Maybe something can be done by an outside
product or library, and let you focus on what only you do.

Getting things basically right across a wide range of areas is more important
(and harder) than zooming in on a single area like how the code looks. In our
world, that means getting operational processes, monitoring, support, feature
prioritization, how well folks work together, etc. right, and minimizing walls
within the org that would impede that.

------
sametmax
First, large Python projects are much more manageable than C++ projects, even
without any tools. It's way easier to debug, much less verbose, you have 100x
fewer possible errors, and stuff is easy to refactor.

Starting from here, unit tests will take you a long way. Tox + pytest +
coverage.py is the de facto standard for tests now, and will give you peace of
mind when editing your code. Tox can run flake8 as well so it's often done.

After that everything is a luxury. You can use mypy to get static typing, you
can make sure to have a very good editor checking stuff for you such as
PyCharm or Sublime Text + anaconda. You can use CI with something like Travis
or buildbot.

I usually make sure to have a .editorconfig file and a clear style convention
to ease teamwork. And I like to use sphinx to write the doc of the project,
which you really, really need to do. This includes docstrings for modules,
classes and functions (with Google style for me), comments, but also some
manual rst files.
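
For readers unfamiliar with the Google docstring style mentioned above, a small sketch follows. The function itself is a hypothetical example; the `Args`, `Returns`, and `Raises` sections are the layout that Sphinx can render through its `napoleon` extension.

```python
def moving_average(values, window):
    """Compute the simple moving average of a sequence.

    Args:
        values: Sequence of numbers to average.
        window: Number of consecutive items per average; must be between
            1 and ``len(values)``.

    Returns:
        A list of floats with ``len(values) - window + 1`` entries.

    Raises:
        ValueError: If ``window`` is outside the range ``1..len(values)``.
    """
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]
```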

Last, but not least, experience matters a lot. You learn how to organize stuff
in your dir tree. I like to split any file bigger than 500 lines in Python
because it's such an expressive language. Having one module for exceptions.
Having proper unicode handling from the start. Etc.
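
A sketch of what "one module for exceptions" might look like. The module name `exceptions.py` and every class name here are assumptions for illustration; the idea is a single base class that callers can catch, with specific subclasses underneath.

```python
# Contents of a hypothetical exceptions.py, plus a short usage demo.


class AppError(Exception):
    """Base class for every error this project raises on purpose."""


class ConfigError(AppError):
    """Raised when configuration is missing or invalid."""


class NotFoundError(AppError):
    """Raised when a requested record does not exist."""

    def __init__(self, kind, key):
        super().__init__(f"{kind} {key!r} not found")
        self.kind = kind
        self.key = key


def lookup(table, key):
    """Translate a low-level KeyError into a project-level error."""
    try:
        return table[key]
    except KeyError:
        raise NotFoundError("user", key) from None


handled_message = None
try:
    lookup({"ann": 1}, "bob")
except AppError as exc:
    handled_message = str(exc)

print(handled_message)  # user 'bob' not found
```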

~~~
StavrosK
> large Python projects are much more manageable than C++ projects, even
> without any tools

I'd probably agree, but I would _love_ to have more mature static typing, and
especially an accurate "find all usages" function. Refactoring has been major
hell because you pretty much have to grep for names if you want to find all
usages of a variable (or, more specifically and problematically, a model
property in Django).

~~~
afarrell
> have to grep

Have you tried out the Silver Searcher? [http://geoff.greer.fm/2011/12/27/the-
silver-searcher-better-...](http://geoff.greer.fm/2011/12/27/the-silver-
searcher-better-than-ack/)

~~~
StavrosK
Yes, it's all I use, but still, we have foreign keys on various models to the
user, predictably called "user", so removing or renaming one of them is hell,
and the grep tool can't tell you which lines are the ones you care about with
certainty...

Vim-jedi is pretty good, but not perfect either, unfortunately...

------
econner
The Google Python style guide might be a good start:
[https://google.github.io/styleguide/pyguide.html](https://google.github.io/styleguide/pyguide.html)

I've also heard that at Google they require assert isinstance for every
parameter of every function.

New Relic has very good tools for profiling python code if you're running a
service.

~~~
asuffield
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an
SRE at Google.)

> I've also heard that at Google they require assert isinstance for every
> parameter of every function.

This is not true.

~~~
asdfologist
How is that an opinion?

~~~
asuffield
I might be wrong, and somebody who can speak for the company might contradict
me in the future. I can only say anything with this proviso.

~~~
asdfologist
By definition an opinion can't be wrong. I think you meant speculation
instead.

------
metakermit
I guess it's worth taking a look at some more complex open source Python
projects. I think pandas [1] is a pretty good example, with a relatively large
amount of Python and Cython code.

They start off with a pretty decent amount of unit tests (84% coverage) and
make sure it's visible to developers using:

- Travis [2] (has to pass on pull requests too before contributions are
accepted)

- Coverage [3]

There's also Code Climate [4] for some more introspection.

[1]: [https://github.com/pydata/pandas](https://github.com/pydata/pandas)

[2]: [https://travis-ci.org](https://travis-ci.org)

[3]: [https://codecov.io/](https://codecov.io/)

[4]: [https://codeclimate.com/](https://codeclimate.com/)

------
ludwigvan
In the words of Robert Love, "Man, I cannot imagine writing let alone
maintaining a large software stack in Python."[0]

Unfortunately, it is very hard and brittle.

You definitely need rigorous testing to keep it all in one place, but I would
steer away from Python for a large project.

Also take a look at mypy or some other sort of typing to see if it can aid in
a large project. [1]

[0] [https://www.quora.com/Why-does-Google-prefer-the-Java-
stack-...](https://www.quora.com/Why-does-Google-prefer-the-Java-stack-for-
its-products-instead-of-Python)

[1] [http://code.tutsplus.com/tutorials/python-3-type-hints-
and-s...](http://code.tutsplus.com/tutorials/python-3-type-hints-and-static-
analysis--cms-25731)

------
sitkack
1) don't use objects, use bare methods and don't mutate input data

2) `from collections import namedtuple` reinforces 1

3) nice lightweight tests, py.test or nose

4) integration tests so you can actually refactor w/o having to recode 100
unit tests

5) write as little code as possible
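
Points 1 and 2 can be sketched as follows: immutable records from `collections.namedtuple` plus plain functions that return new values instead of mutating their input. `Order`, `with_discount`, and `total` are hypothetical names.

```python
from collections import namedtuple

# An immutable record; fields cannot be reassigned after creation.
Order = namedtuple("Order", ["item", "quantity", "unit_price"])


def with_discount(order, percent):
    """Return a new Order with a reduced unit price; the input is untouched."""
    new_price = order.unit_price * (100 - percent) / 100
    return order._replace(unit_price=new_price)


def total(order):
    return order.quantity * order.unit_price


original = Order("widget", 4, 10.0)
discounted = with_discount(original, 25)

print(total(original))    # 40.0
print(total(discounted))  # 30.0
```

Because `original` is never modified, the two totals can be compared side by side without defensive copies.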

~~~
mrfusion
Any advice on writing integration tests?

~~~
olau
Write them in terms of user requirements. When someone does X, they expect Y
to happen. If a test breaks, it should mean that a user somewhere would think
there's a problem. See for instance this:

[http://rbcs-us.com/documents/Why-Most-Unit-Testing-is-Waste....](http://rbcs-
us.com/documents/Why-Most-Unit-Testing-is-Waste.pdf)

You can write integration tests with the same tools as unit tests.

Personally, I try to make sure all major paths are covered, but I won't test
every little UI detail. It's faster to test those manually. But YMMV.
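
A sketch of a test phrased as "when someone does X, they expect Y", written with the standard `unittest` module. The `Outbox` and `SignupService` classes are hypothetical stand-ins for real application components.

```python
import unittest


class Outbox:
    """Hypothetical stand-in for the component that sends email."""

    def __init__(self):
        self.sent = []

    def send(self, to, subject):
        self.sent.append((to, subject))


class SignupService:
    """Hypothetical application code that wires users and email together."""

    def __init__(self, outbox):
        self.outbox = outbox
        self.users = set()

    def sign_up(self, email):
        if email in self.users:
            raise ValueError("already registered")
        self.users.add(email)
        self.outbox.send(email, "Please confirm your account")


class SignupFlowTest(unittest.TestCase):
    def setUp(self):
        self.outbox = Outbox()
        self.service = SignupService(self.outbox)

    def test_new_user_gets_confirmation_email(self):
        # When someone signs up, they expect a confirmation email.
        self.service.sign_up("ann@example.com")
        self.assertEqual(
            self.outbox.sent,
            [("ann@example.com", "Please confirm your account")],
        )

    def test_signing_up_twice_is_rejected(self):
        # When someone signs up twice, they expect an error and no second email.
        self.service.sign_up("ann@example.com")
        with self.assertRaises(ValueError):
            self.service.sign_up("ann@example.com")
        self.assertEqual(len(self.outbox.sent), 1)


# Run the suite explicitly so the example works when pasted into a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SignupFlowTest)
outcome = unittest.TextTestRunner(verbosity=1).run(suite)
```

Neither test cares how `SignupService` stores its users, so that part can be refactored without touching the tests.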

------
INTPenis
I've only ever taken over a large Python project, never written it from
scratch, and I don't have many years under my belt as a python coder.

But my $.02: Documentation, python code itself is easy to read but if you have
a large project, broken into many small code bases for libraries, services and
front ends then you need solid documentation. Not only text but also diagrams
showing how all the parts work together. Not sure if they're called diagrams
in english but it's the stuff you make in MS Visio or Draw.io.

~~~
syngrog66
yes, diagrams is the correct term. and I agree with you that they can help,
esp if a project is very large or complex, distributed, etc.

------
hacker_9
I've recently written a Blender addon using Python that came to just over 1000
LOC. Not big at all and yet already I was running into problems that are just
non existent in static languages. I couldn't refactor variable names and had
to manually replace the text which was very error prone, I had no help from
the language about whether I had referenced a variable by the right name as it
would just create a new one, no 'goto definition' and instead I was reduced to
scrolling or ctrl+F, no braces to tell me where scope starts and ends (which
you wouldn't think mattered, but only relying on indentation actually gets
quite messy), and no contextual knowledge of the blender APIs unless I knew
what to Google and it often came down to someone asking the same question on
stackexchange.

The only way I could manage it was to write tiny functions, so I could
literally eyeball the scope and keep all the details in my short term memory.
I would not recommend using this language for larger projects.

~~~
ves
if you spend 15 or so minutes configuring your editor (I use vim, so python-
mode and ctags for me), you can get around all of these issues you mentioned.

~~~
arjie
Is `python-mode` and `ctags` all that you have for Python specifically or is
there more?

~~~
kirang1989
You should try elpy-mode or anaconda-mode. You get better completion,
integration with eldoc for documentation, go-to definition and much more.

------
kmike84
* Write tests and run them on CI. Try to figure out how to write tests with the least amount of pain - you'll need lots of them. Use py.test or a similar framework. Doctests are great, but it takes some time to learn how to use them efficiently and when they're not appropriate.

* Measure test coverage to be aware of what is not tested, but don't just pursue an exact coverage % number - doing that leads to many integration tests and few unit tests. Both kinds of tests are important.

* Extract libraries from the main code, to make the main project smaller; write docs and tests for these libraries. Docs are important for these libraries. Try hard to maintain boundaries - a library should have a single purpose, and it shouldn't be tied to the rest of the code. If you find writing docs complicated then maybe the library does too much, or maybe its API is too hard to use. Fix that.

* Don't write all code yourselves, consider using open-source libraries. But don't use open-source libraries if you're not comfortable with contributing to them - there will be issues (like in any code). If the library you're going to use is not an industry standard read its source; if it is "ah, yeah, this is almost how I'd written that" use it, try to find another library or write your own otherwise.

I'd say the trick to handle large Python projects is to resist making them
large. Don't be sloppy in code organization, be pedantic about which part
"knows" about which part, extract non-specific utilities to libraries. Often
projects can be kept under 20-50K lines of code after a few years of
development by a small team if a team tries to maintain code quality and moves
non-specific features to external libraries.

flake8 and similar linters may help with consistency; that is important, but not
the main problem by far. The main problem to fight is non-locality: if one
can reason about a piece of code just by looking at it, without checking lots
of other components, the overall project size doesn't matter much.

------
devnonymous
* write unit tests (I find nosetests or pytest as the test runners most useful and mock incredibly helpful).

* run unit tests and integration tests in an automated manner for every commit. (a.k.a use tox, jenkins or somesuch...).

* Depending on the software you are creating deploy a chaos monkey[1] kinda approach for disaster/HA testing.

* Read up on good Python practices :

[http://python-guide.readthedocs.io/en/latest/](http://python-
guide.readthedocs.io/en/latest/)

[http://python.net/~goodger/projects/pycon/2007/idiomatic/han...](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html)
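
As a small illustration of the mock library mentioned in the first point, here is a sketch using `unittest.mock` from the standard library. `fetch_status` and the URL are hypothetical; the HTTP helper is passed in so the test never touches the network.

```python
from unittest import mock


def fetch_status(http_get, url):
    """Return 'up' or 'down' based on a (hypothetical) HTTP helper."""
    try:
        response = http_get(url)
    except ConnectionError:
        return "down"
    return "up" if response.status_code == 200 else "down"


# A mock that behaves like a successful HTTP call.
fake_get = mock.Mock()
fake_get.return_value.status_code = 200
print(fetch_status(fake_get, "https://example.invalid/health"))  # up
fake_get.assert_called_once_with("https://example.invalid/health")

# A mock that raises, to exercise the error path.
failing_get = mock.Mock(side_effect=ConnectionError("no route"))
print(fetch_status(failing_get, "https://example.invalid/health"))  # down
```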

------
shuzchen
It seems you're looking for tools to manage code quality. In that case I
highly recommend prospector
([http://prospector.landscape.io/en/master/](http://prospector.landscape.io/en/master/)).
It wraps a whole bunch of tools in one interface. I suggest paying close
attention to any code that scores high on McCabe code complexity.

~~~
StavrosK
Prospector looks great, thanks for the tip! One problem I always have with
these tools is that even one false positive is enough to stop me from using
it. On a tool that's used for pass/fail, even one false positive means it's
useless because I can never make it pass.

For example, it's throwing a "dodgy" warning on my Django project because of
the SECRET_KEY variable in settings.py.

~~~
shuzchen
By no means should you just use prospector as is. The defaults are pretty
good, but if you have a huge project it'll make more sense to do a little
configuration, and most of the tools that prospector wraps are configurable. I
for one don't use the default lint configuration.

I like dodgy and I actually pull out my secret key from settings.py (I pull it
from the system environment). Otherwise you can just disable dodgy by using a
custom profile
([http://prospector.landscape.io/en/master/profiles.html#enabl...](http://prospector.landscape.io/en/master/profiles.html#enabling-
and-disabling-tools))
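
A sketch of pulling the secret key from the system environment. The variable name `APP_SECRET_KEY`, the `load_secret_key` helper, and the development fallback are all assumptions for illustration, not part of Django or prospector.

```python
import os

# Obviously fake value, meant only for local development (hypothetical).
DEV_SECRET_KEY = "dev-only-not-a-real-secret"


def load_secret_key(environ=os.environ, debug=False):
    """Read the key from the environment; fall back only in debug mode."""
    key = environ.get("APP_SECRET_KEY")
    if key:
        return key
    if debug:
        return DEV_SECRET_KEY
    raise RuntimeError("APP_SECRET_KEY must be set outside development")


print(load_secret_key({"APP_SECRET_KEY": "s3cr3t"}))  # s3cr3t
print(load_secret_key({}, debug=True))                # dev-only-not-a-real-secret
```

In a settings module this might be used as `SECRET_KEY = load_secret_key(debug=DEBUG)`, so that a production process fails fast when the variable is missing.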

~~~
StavrosK
I'll look into it, but my settings file always has a fake secret key in it
(that will be overridden on production) to cut down on the amount of
configuration needed for new dev setups.

I'll look into configuring it, but having to # noqa a bunch of lines isn't
ideal :-/ Hopefully it won't get to that.

------
sciurus
You can find an overview of some tools in
[https://www.slideshare.net/mobile/jamdatadude/python-
static-...](https://www.slideshare.net/mobile/jamdatadude/python-static-
analysis-tools)

------
syngrog66
doing mental analysis of the code you write goes a long way. know the impact
of each change. know your language. have a correct model in your head of how
software behaves at runtime, especially on a given OS/hw combination. this has
the advantage of being language-agnostic and doesn't require tools. if you ALSO
have tools, great. but in many cases you don't need them, at least if the
right person is doing the right kind of thinking, at design/code/test time.

------
nice_byte
My advice would be to avoid developing large projects in Python, or any other
language without static typing.

------
dochtman
Lots (and I mean, lots!) of automated tests, until you have 95%+ coverage.

------
dukoid
Simple: Don't use Python (or any other untyped language) :)

------
carapace
Twenty or more years of experience.

-----

Am I wrong?

