
Norris Numbers - ingve
http://www.teamten.com/lawrence/writings/norris-numbers.html
======
munificent
Those order of magnitude milestones feel about right to me. They sound a lot
like memory access patterns on a computer (L1 cache, L2 cache, L3 cache, RAM,
disc, network), and I think it's for about the same reason: controlling the
time to load something into the CPU. Except, in this case the CPU is your
brain.

2k is the cache size of your brain. You can fit an entire program in your
head. It doesn't matter how it's organized because you'll just load the whole
thing.

At 20k, you can only fit part of the program in your head. But, it's small
enough that you can be _familiar_ with the whole thing. You need modularity so
that making a change only requires loading a single 2k-sized chunk of the
program, but you don't need much else to help you _find_ the right chunk.

At 200k, you probably have multiple people working on it and you may often
have to deal with "cold" code that you've never seen in your life. You need
additional architecture and documentation to help you find _where_ to make a
change before you can even start learning the part of the code that needs to
change.

You need the codebase to be organized defensively to prevent you from adding
redundant features, or doing things that break the architecture. In other
words, you need to be able to get work done with only a partial knowledge of
the code.

At 2M, you have lots of people working on it, and the team has changed over
time. The team is large enough that tribal knowledge is constantly being shed
through turnover or forgetfulness. There are parts of the program that _no
one_ understands.

The code is likely old enough that it reflects multiple different
architectural and process visions. It is no longer feasible for it to be
entirely internally consistent. The idea of a global clean-up is off the table
because it's too risky. At this point, it is like owning a castle. You work
mostly as a caretaker of it. Instead of _adding_ value, your job is to
_preserve_ the value it has accumulated over time. Additions are often at the
edges: interfacing with new systems, etc.

Personally, I find ~20k programs the most fun. Big enough to do something
interesting but small enough to be clean and consistent.

~~~
btilly
I've personally worked on code bases at these thresholds. Here is my resulting
opinion.

A junior programmer can do < 2k lines. Think a stand-alone command-line tool.
This is indeed something that a single person can understand. Learning skills
around modularity and consistency gets you beyond this.

A mid-level developer can do < 20k lines. You may have one or several people
on the team, but it is still small enough that you can pretty much know the
whole thing. An example of something at this size is a typical Ruby website.
To get beyond this you need to have a pattern to your organization, and a good
sense of how to create and maintain abstraction layers between different parts
of the system.

A 200k system is small enough for a senior developer to navigate and
understand without significant documentation, and can be created by a small
team. The architecture has to be clear, but you don't need specialized
documentation. When it comes time to add or find something, the overall
architecture and patterns will tell you where to look. You may land in
unfamiliar code, but you will know that it has to be there, and roughly what
it has to be. As for size, a small to medium company can run on this much
code. For example I was at Rent.com when we were sold to eBay for over $400
million in 2004. This was about how much code we had.

At 2M, there are a lot of teams. You may have specific tooling just to help
you maintain sanity. You definitely have documentation. There are so many
people doing so many things that you cannot rely on people following key
conventions; instead, you are likely to try to enforce them. Examples of
projects at this general size would be a browser like Chrome, a compiler like
gcc, and so on.

What about 20 million lines of code? These are large projects carried out by
large organizations over many years. Examples that I have seen include the
current Linux kernel, Windows NT 4.0, and eBay circa 2006. The specialized
tooling that was being considered for a 2 million line project is now
required, and there is a lot of it. Documentation is extensive. Figuring out
who to talk to to find out about something can be a struggle. And so on.

What about larger than that? There are few examples that have turned out well.
The only person who I personally believe has done it well is
[http://research.google.com/people/jeff/](http://research.google.com/people/jeff/),
and I'm firmly of the belief that without him Google could not have become
what they did.

As for what is fun, I personally like the 200k project size best. It isn't fun
until you have the skills to contribute well. But once you do, you have the
complexity while still having a team small enough that you can personally know
everyone who is involved. But YMMV.

------
DougMerritt
That's a very unappealing title for a very interesting subject, which is that
there are complexity barriers that get in the way of creating larger programs.

For the sake of illustration, he says a novice may hit a brick wall at 2,000
lines of code, and be unable to add features after that without breaking
things.

The next level for a more experienced programmer might be 20,000 lines of
code, and he describes some things that helped him get there.

Then there's his personal breakthrough to 200,000 lines of code. etc.

(I add this gloss to spur people to read the piece, which is interesting, and
add their own ideas, not because I am claiming the above is some kind of
absolute truth.)

~~~
AnimalMuppet
And that's _exactly_ what's wrong with all the syntactic-sugar-based language
marketing. "Write your code using X, and you go from 10 ugly lines of C++ down
to 3 beautiful (if syntactically weird) lines of X!"

Great. Now tell me how X scales on a 200,000 line project.

One of the places people make this mistake is with Go. Go isn't designed to
make your 2,000-line project shorter or easier. It's designed to make Google's
20,000,000-line projects maintainable for a couple of decades.

~~~
DougMerritt
The traditional wisdom (backed by multiple studies starting decades ago) is
that the language used doesn't statistically change the number of debugged
lines of code per day, but may often change the number of machine instructions
executed per line of high level code written.

We are digressing here, yes? I read the article quickly, admittedly, but I
didn't notice him doing language advocacy.

~~~
AnimalMuppet
Yes, I'm digressing. Guilty as charged. Nevertheless, I think it's a
somewhat-on-topic digression.

If there's a wall at 2,000 lines, almost all language-advocacy examples are
below that wall. That's the _first_ wall. But language choice doesn't get
interesting until you ask what the language does at the 20,000-line wall or
the 200,000-line wall. Nobody talks about this when they advocate a language
(except, as I said, Go). The closest are Haskell and Lisp, and their claim is
that you can write the same program in fewer lines (so that you don't hit any
of the walls as quickly).

~~~
tokenrove
It's worth noting that a big part of the Common Lisp spiel was that it was
suitable for very large applications which needed to be maintained for a long
time. So the two (terseness and large-scale development) need not be mutually
exclusive.

(Along the lines of your point, though, maybe it's a shame that relatively-verbose
languages like Ada and Modula-3 became social pariahs because their virtues
are hard to demonstrate in the small.)

------
Xcelerate
As someone who has never really programmed in a group, what are some examples
of common scaling issues? I typically try to adhere to the principles of loose
coupling, pure functions & immutability, clean & readable syntax, "don't
reinvent the wheel", algorithmic optimization before code optimization,
version control, unit testing, etc.

But I'm curious: what useful principles have other HN users acquired after
decades of programming or working on giant projects that a "solo programmer"
like me might not be aware of?

~~~
wpietri
Great question.

For me one of the biggest shifts was test-driven development. That is, I start
with a test, write a few lines of test code, make it pass, perhaps refactor,
and write a few more lines of test. It took me a year or so to get from
test-last to test-first, but I love it now; it forces me to look at code from the
external perspective. One way to put it is that it shifts my focus from
internal mechanism to real-world meaning.
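
A minimal sketch of that rhythm (the slugify function and its behaviour are
made-up examples for illustration, not anything from the thread):

    import re
    import unittest

    def slugify(title):
        # Written only after the test below existed and failed.
        return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

    class TestSlugify(unittest.TestCase):
        # Step one: write this test first and watch it fail.
        def test_replaces_spaces_and_punctuation(self):
            self.assertEqual(slugify("Hello, World!"), "hello-world")

    if __name__ == "__main__":
        unittest.main()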

Another breakthrough was pair programming. The larger a code base gets, the
more important readability and easy comprehension get. But at least for me
there's a real limit on how comprehensible I can make a piece of code on my
own. I just know the internals too well to usefully model the reaction of
somebody who neither knows the internals nor wants to. But pairing gives me
(and allows me to give) continuous feedback on what makes sense and what
doesn't.

A third favorite is known as Domain-Driven Design [1], where one organizes the
code around the actual concepts of a business domain. The larger a code base
gets, the more possible places one might look for a particular piece of code.
Organizing the code by the real-world notions is a great check on entropy.

[1] [https://en.wikipedia.org/wiki/Domain-driven_design](https://en.wikipedia.org/wiki/Domain-driven_design)
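
A rough sketch of what "organizing around the domain" can look like, with a
hypothetical billing concept (names are illustrative, not from the article):

    from dataclasses import dataclass
    from decimal import Decimal

    @dataclass(frozen=True)
    class Invoice:
        subtotal: Decimal
        tax_rate: Decimal

        def total(self) -> Decimal:
            # The business rule lives with the concept it describes,
            # not in a generic "utils" or "helpers" module.
            return self.subtotal * (1 + self.tax_rate)

    print(Invoice(Decimal("100.00"), Decimal("0.10")).total())  # 110.0000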

------
wbillingsley
Curious. I teach a second year undergraduate class in Australia, in which I
put the whole class onto a single codebase, typically growing it from 1,500
lines at the start to usually around 60,000 lines by the end of term.
(Possibly slightly less this term -- I've moved universities from UQ to UNE
and it's a smaller class this time around)

I haven't noticed students hitting a wall in that process -- the code isn't
always very good (they're students) but the groups generally get there with
the code, and struggle instead with large merges, group dynamics, writing
tests, etc.

This could be because I've already put the general architecture and build
system in place before they start, but I wonder if there might be something
else at play too.

(Well, or maybe they are hitting the wall, but as they need to do this to get
through the unit, they scrabble frantically over the top of it...)

~~~
DougMerritt
I believe the article is primarily talking about single programmer codebases,
although team codebases come up at the end.

I think everyone would agree that a second year undergraduate (who had not
programmed at all before university) is not generally going to be able to
write 60,000 lines of code single-handedly. And certainly not in one term.

When comparing experiences, I think it's important to be careful to compare
apples to apples.

------
michaelfeathers
I think the most important takeaway from this article might get lost in the
other points it makes: you are better off when you have fewer features and
have fewer features that interact with each other.

20KSLOC programs just don't appear out of thin air. They start as small
programs that, if you apply good programming practices, can scale beyond
2KSLOC. And, if you apply good program design practices, can scale up to
20KSLOC. But all along the way you'd better be thinking about whether you
really need a particular feature and how coupled it will be with the other
ones. That should happen in every program. The problem is that we are not used
to challenging the features that are selected for our systems and recognizing
their price as the system grows.

------
JesperRavn
I think 20000 lines of code is the tipping point where code bases go from
applications/libraries to frameworks. The difference is that the objects of an
application/library are expressed in the raw language, while in a framework
they are expressed in the framework's meta-language.

E.g. one might start out writing a machine learning library where a
transformation of data is simply a function. But in scikit-learn,
transformations of data are objects that implement a transform method. This,
together with the implicit/explicit constraints on the semantics of the
transform method, help create uniformity so that understanding of the codebase
scales better than understanding an arbitrary collection of functions.
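
A rough sketch of the contrast (not actual scikit-learn code; the
fit/transform names just mimic its convention):

    def center(data):
        # "Raw language" version: a plain function does the job, but nothing
        # tells callers how it composes with other transformations.
        mean = sum(data) / len(data)
        return [x - mean for x in data]

    class Center:
        # Framework-style version: the shared fit/transform contract is the
        # meta-language that lets pipelines treat all transformers alike.
        def fit(self, data):
            self.mean_ = sum(data) / len(data)
            return self

        def transform(self, data):
            return [x - self.mean_ for x in data]

    print(center([1, 2, 3]))                       # [-1.0, 0.0, 1.0]
    print(Center().fit([1, 2, 3]).transform([4]))  # [2.0]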

------
AstroJetson
Wait, nobody has gone "Chuck Norris's Norris Number is 2.2 million!"? (I
clicked on the title just because of Norris, I was thinking it was another
Chuck reference.)

I can see the steps. I have and can crank out 2,000 lines of code. We see this
all the time in hack-a-thons.

20K lines means a team and tools and some level of software control. Maybe a
nod to architecture.

200K lines is a good-sized project with a starting level of architecture first
(maybe preceded by throwaway prototypes) and then some serious software
development methodology.

While the author writing 200K lines of code is cool, in today's business
environment that's not really going to happen. The cycle of prototype, code,
build, test (x2), then pivot and repeat everything isn't a job for a single
programmer.

I've done 3GL / 4GL work for a long time (Burroughs LINC!!) with tools that
promised that. And while it did turn 50 lines of LINC into 1000 lines of
COBOL, there was a lot of thought in those 50 lines. So when I look at the
current "10 lines of code in XYZ" claims, I think "and I need 100,000 lines of
libraries too".

Large code bases are not for the faint of heart or for cowperson coders. You
may be able to write 200K lines of code, but if it can't be checked in without
breaking my build, I can't use you.

------
danbmil99
Above a certain size, don't most big projects end up as individual modules
lightly coupled through well-defined APIs/comm channels?

For example, I can write a quick Django site in < 5K lines of (new) code by
relying on Django|Python|Apache|Linux, meaning I am leveraging 20M+ LOC of
tested, stable, well-documented code to support my little program.

I would assume that any big project that really requires millions of new LOC
would in fact be structured as 10 or more sub-projects of < 200K LOC each, and
the interaction of all this code would work for the same reason my Django app
works.

I guess the rub is how quickly each subproject can iterate while maintaining
stability for the other teams. I guess the OP is focusing on big projects that
are expected to continue to increase in feature count, complexity etc. However
I think my model still holds -- but you need to be seriously disciplined and
invested at that stage wrt testing & QA, and have a strong culture of practice
that supports the weight of all that legacy code but remains at least somewhat
"agile".

------
0xdeadbeefbabe
More than 200,000 lines of kernel code support some little program, but I
pretend they don't exist. For that reason, I think having a good abstraction
or architecture is the key to breaking the 200K barrier.

~~~
AnimalMuppet
To rephrase: Maybe the key is how many of the 200K lines I can completely
ignore. Even more, how many can I ignore while still depending on them.

~~~
hmbg
This is the key. I'd argue that you can keep a 20k codebase in your head. If
the codebase is larger than that, you need to be confident that you can safely
ignore the rest of it while working on your piece.

If you need to manually check with the rest of the codebase before making
changes to the chunk at hand, you're lost.
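
A toy illustration of that point, with hypothetical names: the caller depends
only on a narrow interface and can safely ignore everything behind it.

    from typing import Protocol

    class Storage(Protocol):
        def get(self, key: str) -> str: ...

    class InMemoryStorage:
        # Could just as well be 200K lines of database code;
        # render_banner would neither know nor care.
        def __init__(self):
            self._data = {"greeting": "hello"}

        def get(self, key: str) -> str:
            return self._data[key]

    def render_banner(storage: Storage) -> str:
        # Written and reviewed while ignoring every line behind Storage.
        return storage.get("greeting").upper()

    print(render_banner(InMemoryStorage()))  # HELLO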

------
tilt_error
At 2,000+ lines of code, you need abstractions. Picking good abstractions is
hard. Experience teaches you which are good and which are not.

Above that, you need architecture. Picking a good architecture is hard...

