2k is the cache size of your brain. You can fit an entire program in your head. It doesn't matter how it's organized because you'll just load the whole thing.
At 20k, you can only fit part of the program in your head. But, it's small enough that you can be familiar with the whole thing. You need modularity so that making a change only requires loading a single 2k-sized chunk of the program, but you don't need much else to help you find the right chunk.
At 200k, you probably have multiple people working on it and you may often have to deal with "cold" code that you've never seen in your life. You need additional architecture and documentation to help you find where to make a change before you can even start learning the part of the code that needs to change.
You need the codebase to be organized defensively to prevent you from adding redundant features, or doing things that break the architecture. In other words, you need to be able to get work done with only a partial knowledge of the code.
At 2M, you have lots of people working on it, and the team has changed over time. The team is large enough that tribal knowledge is constantly being shed through turnover or forgetfulness. There are parts of the program that no one understands.
The code is likely old enough that it reflects multiple different architectural and process visions. It is no longer feasible for it to be entirely internally consistent. The idea of a global clean-up is off the table because it's too risky. At this point, it is like owning a castle. You work mostly as a caretaker of it. Instead of adding value, your job is to preserve the value it has accumulated over time. Additions are often at the edges: interfacing with new systems, etc.
Personally, I find ~20k programs the most fun. Big enough to do something interesting but small enough to be clean and consistent.
A junior programmer can do < 2k lines. Think a standalone command-line tool. This is indeed something that a single person can understand. Learning skills around modularity and consistency gets you beyond this.
A mid-level developer can do < 20k lines. You may have one or several people on the team, but it is still small enough that you can pretty much know the whole thing. An example of something at this size is a typical Ruby website. To get beyond this you need to have a pattern to your organization, and a good sense of how to create and maintain abstraction layers between different parts of the system.
A 200k system is small enough for a senior developer to navigate and understand without significant documentation, and can be created by a small team. The architecture has to be clear, but you don't need specialized documentation. When it comes time to add or find something, the overall architecture and patterns will tell you where to look. You may land in unfamiliar code, but you will know that it has to be there, and roughly what it has to be. As for size, a small to medium company can run on this much code. For example I was at Rent.com when we were sold to eBay for over $400 million in 2004. This was about how much code we had.
At 2M, there are a lot of teams. You may have specific tooling just to help you maintain sanity. You definitely have documentation. There are so many people doing so many things that you cannot rely on everyone following key conventions; instead, you likely try to enforce them mechanically. Examples of projects at this general size would be a browser like Chrome, a compiler like gcc, and so on.
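To give a concrete flavor of that enforcement, here is a toy convention gate of the kind that gets wired into CI. This is an invented sketch, not anything from the article: the wildcard-import rule and the src/ layout are assumptions.

    # check_conventions.py -- hypothetical CI gate: fail the build on
    # wildcard imports instead of trusting every committer to know the rule.
    import pathlib
    import re
    import sys

    WILDCARD = re.compile(r"^\s*from\s+\S+\s+import\s+\*")

    violations = [
        f"{path}:{lineno}: wildcard import is forbidden"
        for path in pathlib.Path("src").rglob("*.py")
        for lineno, line in enumerate(path.read_text().splitlines(), start=1)
        if WILDCARD.match(line)
    ]

    print("\n".join(violations))
    sys.exit(1 if violations else 0)  # nonzero exit breaks the build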
What about 20 million lines of code? These are large projects carried out by large organizations over many years. Examples that I have seen include the current Linux kernel, Windows NT 4.0, and eBay circa 2006. The specialized tooling that was being considered for a 2 million line project is now required, and there is a lot of it. Documentation is extensive. Figuring out who to talk to to find out about something can be a struggle. And so on.
What about larger than that? There are few examples that have turned out well. The only person I personally believe has done it well is http://research.google.com/people/jeff/, and I'm firmly of the belief that without him Google could not have become what it did.
As for what is fun, I personally like the 200k project size best. It isn't fun until you have the skills to contribute well. But once you do, you have the complexity while still having a team small enough that you can personally know everyone who is involved. But YMMV.
For the sake of illustration, he says a novice may hit a brick wall at 2,000 lines of code, and be unable to add features after that without breaking things.
The next level for a more experienced programmer might be 20,000 lines of code, and he describes some things that helped him get there.
Then there's his personal breakthrough to 200,000 lines of code, etc.
(I add this gloss to spur people to read the piece, which is interesting, and add their own ideas, not because I am claiming the above is some kind of absolute truth.)
Great. Now tell me how X scales on a 200,000 line project.
One of the places people make this mistake is with Go. Go isn't designed to make your 2,000-line project shorter or easier. It's designed to make Google's 20,000,000-line projects maintainable for a couple of decades.
We are digressing here, yes? I read the article quickly, admittedly, but I didn't notice him doing language advocacy.
If there's a wall at 2,000 lines, almost all language-advocacy examples are below that wall. That's the first wall. But language choice doesn't get interesting until you ask what the language does at the 20,000-line wall or the 200,000-line wall. Nobody talks about this when they advocate a language (except, as I said, Go). The closest are Haskell and Lisp, and their claim is that you can write the same program in fewer lines (so that you don't hit any of the walls as quickly).
(On the line of your point, though, maybe it's a shame that relatively-verbose languages like Ada and Modula-3 became social pariahs because their virtues are hard to demonstrate in the small.)
To hit one of the complexity walls in this article you need to be consistently doing certain things right until your project becomes large enough that you hit new types of scaling problems. That isn't a day to day productivity issue. That's a project lifecycle issue.
That said, there are claims about how language features and project scale interact. People have strong opinions, but I do not think that anyone has studied this rigorously. Still, I know plenty of successful 2+ million line projects written in C, and I've seen data suggesting that projects written in scripting languages fall apart at that scale.
So the traditional wisdom suggests that a 100k-line project written in Python will cost about the same as 100k lines of C, and do more. However, there are projects that you don't want to write in Python. Really.
This was not due to a lack of talent among the Python programmers. This was back when Guido worked there, and projects with people like him on them did not turn out dramatically better.
Here is an example. It is no fun to discover, months after a change, that a particular combination of code paths causes a run-time exception because developer 1 plugged objects into code written by developer 2, developer 2 expected a specific method to be available, and developer 1 never provided it. And this wasn't visible until you hit an error path, because service 3 had a momentary outage that, in theory, the system was written to recover from.
A compile-time-checked type system can prevent this entire class of error; scripting languages let it through. In small projects this kind of implicit dependency is OK because such errors happen rarely and are reasonably easy to fix. In large projects they stop being easy to fix.
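To make that concrete, here is a hypothetical Python sketch of the failure mode (all names are invented), followed by the explicit interface a static checker such as mypy would use to flag it before it ever ran:

    from typing import Protocol

    # Developer 2's recovery code assumes every handler has a rollback()
    # method. Developer 1's handler doesn't, and nothing complains until
    # the rarely-exercised error path actually runs.

    class MetricsHandler:                  # developer 1's object
        def handle(self, event: str) -> None:
            print("recorded", event)
        # note: no rollback() defined

    def recover(handler) -> None:          # developer 2's code
        try:
            raise ConnectionError("service 3 momentary outage")
        except ConnectionError:
            handler.rollback()             # AttributeError, months later

    # With a declared interface, the mismatch is caught at analysis time
    # instead of in production:
    class Recoverable(Protocol):
        def handle(self, event: str) -> None: ...
        def rollback(self) -> None: ...

    def recover_typed(handler: Recoverable) -> None:
        recover(handler)  # mypy rejects recover_typed(MetricsHandler())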
But I'm curious: what useful principles have other HN users acquired after decades of programming or working on giant projects that a "solo programmer" like me might not be aware of?
For me one of the biggest shifts was test-driven development. That is, I start with a test: write a few lines of test code, make them pass, perhaps refactor, and then write a few more lines of test. It took me a year or so to get from test-last to test-first, but I love it now; it forces me to look at code from the external perspective. One way to put it is that it shifts my focus from internal mechanism to real-world meaning.
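For a flavor of the rhythm, here is a minimal invented example in pytest style; the test is written first and fails, then just enough code is written to pass it:

    # Step 1 (red): pin down the external behavior with a failing test.
    def test_slugify_lowercases_and_hyphenates():
        assert slugify("Hello World") == "hello-world"

    # Step 2 (green): write just enough code to make the test pass.
    def slugify(title: str) -> str:
        return title.strip().lower().replace(" ", "-")

    # Step 3: refactor with the test as a safety net, then write the next test.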
Another breakthrough was pair programming. The larger a code base gets, the more important readability and easy comprehension get. But at least for me there's a real limit on how comprehensible I can make a piece of code on my own. I just know the internals too well to usefully model the reaction of somebody who neither knows the internals nor wants to. But pairing gives me (and allows me to give) continuous feedback on what makes sense and what doesn't.
A third favorite is Domain-Driven Design, where one organizes the code around the actual concepts of the business domain. The larger a code base gets, the more possible places one might look for a particular piece of code. Organizing the code by real-world notions is a great check on entropy.
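A tiny invented sketch of what that looks like in practice: modules named for business concepts rather than technical layers, with each entity carrying its own domain rules.

    # Layout by concept, so "where would this live?" has one obvious answer:
    #   billing/invoice.py    catalog/product.py    shipping/quote.py
    # rather than models.py / helpers.py / utils.py grab-bags.

    from dataclasses import dataclass

    @dataclass
    class Invoice:
        total_cents: int
        paid_cents: int = 0

        def record_payment(self, amount_cents: int) -> None:
            # The domain rule lives with the concept it governs.
            if amount_cents <= 0:
                raise ValueError("payments must be positive")
            self.paid_cents += amount_cents

        @property
        def settled(self) -> bool:
            return self.paid_cents >= self.total_cents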
You may find pieces of beautiful code written by enthusiastic new employees next to files written by jaded, checked-out veterans ready to move on. Coffee-fueled bug fixes checked in right after the tests passed, to be documented later (never). There may not be a person in the organization who knows how to fully build and deploy an entire codebase from scratch.
You'd think this only happens at "bad" companies, but any company that grew fast and became hugely successful will have lingering technical debt, some of which never needs to be paid down.
If you want to experience this, pick a relatively unpopular, large codebase open-sourced by some company and try to understand it. Imagine someone had just asked you to add a new feature with an arbitrary deadline. Now realize that this code is probably much better than a comparable closed-source system; after all, the company wasn't embarrassed to share it with the world.
I haven't noticed students hitting a wall in that process -- the code isn't always very good (they're students) but the groups generally get there with the code, and struggle instead with large merges, group dynamics, writing tests, etc.
This could be because I've already put the general architecture and build system in place before they start, but I wonder if there might be something else at play too.
(Well, or maybe they are hitting the wall, but as they need to do this to get through the unit, they scrabble frantically over the top of it...)
I think everyone would agree that a second year undergraduate (who had not programmed at all before university) is not generally going to be able to write 60,000 lines of code single-handedly. And certainly not in one term.
When comparing experiences, I think it's important to be careful to compare apples to apples.
20KSLOC programs just don't appear out of thin air. They start as small programs that, with good programming practices, can scale beyond 2KSLOC, and, with good program design practices, can scale up to 20KSLOC. But all along the way you'd better be thinking about whether you really need a particular feature and how coupled it will be with the other ones. That should happen in every program. The problem is that we are not used to challenging the features selected for our systems and recognizing their price as the system grows.
E.g. one might start out writing a machine learning library where a transformation of data is simply a function. But in scikit-learn, transformations of data are objects that implement a transform method. This convention, together with the implicit/explicit constraints on the semantics of transform, helps create uniformity, so that understanding the codebase scales better than understanding an arbitrary collection of functions.
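A minimal sketch of that convention (a homegrown transformer following the scikit-learn pattern, not actual scikit-learn source):

    import numpy as np

    class Standardizer:
        """Follows the scikit-learn transformer convention: fit() learns
        parameters from the data, transform() applies them. Because every
        transformer exposes the same pair of methods, a reader knows what
        to expect before opening the file."""

        def fit(self, X, y=None):
            X = np.asarray(X, dtype=float)
            self.mean_ = X.mean(axis=0)
            self.scale_ = X.std(axis=0)
            return self  # returning self enables fit(train).transform(test)

        def transform(self, X):
            return (np.asarray(X, dtype=float) - self.mean_) / self.scale_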
I can see the steps. I have cranked out 2,000 lines of code, and still can. We see this all the time in hackathons.
20K lines means a team, tools, and some level of source control. Maybe a nod to architecture.
200K lines is a good-sized project with a starting level of architecture first (maybe preceded by throwaway prototypes) and then some serious software development methodology.
While a single author writing 200K lines of code is cool, in today's business environment that's not really going to happen. The cycle of prototype, code, build, test (x2), then pivot and repeat everything isn't a single-programmer job.
I've done 3GL / 4GL work for a long time (Burroughs LINC!!) with tools that promised exactly that. And while LINC did turn 50 lines into 1,000 lines of COBOL, there was a lot of thought in those 50 lines. So when I see the current "10 lines of code in XYZ" claims, I think "and I need 100,000 lines of libraries too".
Large code bases are not for the faint of heart or for cowperson coders. You may be able to write 200K lines of code, but if it can't be checked in without breaking my build, I can't use you.
For example, I can write a quick Django site in < 5K lines of (new) code by relying on Django|Python|Apache|Linux, meaning I am leveraging 20M+ LOC of tested, stable, well-documented code to support my little program.
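As an invented illustration of how little new code that can mean (the model and view names are made up, and a real project would put them in a proper Django app):

    # listings/models.py and listings/views.py, condensed for illustration
    from django.db import models
    from django.http import JsonResponse

    class Listing(models.Model):
        title = models.CharField(max_length=200)
        rent = models.IntegerField()

    def listings(request):
        # The ORM, request parsing, and JSON serialization are all
        # borrowed from the millions of lines underneath.
        data = list(Listing.objects.values("title", "rent"))
        return JsonResponse({"listings": data})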
I would assume that any big project that really requires millions of new LOC would in fact be structured as 10 or more sub-projects of < 200K LOC each, and the interaction of all this code would work for the same reason my Django app works.
I guess the rub is how quickly each subproject can iterate while maintaining stability for the other teams. The OP is focusing on big projects that are expected to keep growing in feature count, complexity, etc. I think my model still holds even then, but at that stage you need to be seriously disciplined and invested in testing & QA, and have a strong culture of practice that supports the weight of all that legacy code while remaining at least somewhat "agile".
If you need to manually check with the rest of the codebase before making changes to the chunk at hand, you're lost.
Above that, you need architecture. Picking a good architecture is hard...