Where's All the Code? (tedunangst.com)
65 points by nalgeon 32 days ago | 34 comments

An alternative C version (Ted's is in Go) is described in the last link of the article [1]. It is a rather interesting tutorial on how to write high-performing tools of this kind. It deals with things like string interning/buffering, pre-allocating "objects" to avoid having to implement dynamically growing data structures, and so on.

It's pretty cool how the author uses LLVM as an example of a big codebase, then uses "10X that" as the requirement for how many directories and so on to support.

[1]: https://nullprogram.com/blog/2022/05/22/

Is this only for C code? I ran it on a Java codebase and got 0LOC 0B on all dirs.

OMG that is so ... I don't even know. Is it over-engineering to use a static 16-entry perfect hash of file name extensions to filter out the five supported languages?

It's at least somewhat obscure, in the sense that adding support for another language becomes quite involved.

It would be interesting to micro-bench this code against something more naïve that just strcmp()s the extension against a list of known strings, or something.
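For comparison, here is a rough sketch of the naive approach in Python rather than C: a linear scan over a list of known extensions. The extension list is hypothetical, since the thread does not name the five supported languages, and `is_source_file` is a made-up helper name.

```python
# Hypothetical extension list; the thread does not say which five
# languages the C tool actually supports.
SOURCE_EXTENSIONS = (".c", ".h", ".go", ".py", ".java")


def is_source_file(path: str) -> bool:
    """Return True if path ends with one of the known extensions.

    str.endswith accepts a tuple and tries each suffix in turn,
    which is the moral equivalent of a chain of strcmp() calls.
    """
    return path.endswith(SOURCE_EXTENSIONS)
```

Adding a language here is a one-string change, which is the maintainability trade-off being discussed. Whether it is measurably slower than a perfect hash would need an actual benchmark.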

> Is it over-engineering to use a static 16-entry perfect hash of file name extensions to filter out the five supported languages?

It's patently absurd. I also like the ATOMIC_RADD for populating the line count:


This entire program is too clever by half, while still not doing a "correct" source line count (it just counts new lines).

Is there a different, more robust way to describe code size than lines or perhaps even characters/size?

Could we describe “ideas” or “logical units” in some way? Or maybe just count the number of keywords and operators?

I find that “lines” is so popular but almost always misrepresents simple code as being more complex, and complex code as being simpler.

Of course, lines, like in this post, can be a perfectly helpful answer to a question if the question isn’t “where’s all the complexity?”

I guess technically Kolmogorov complexity is an answer to their question since they said "describe" but I expect Waterluvian was thinking "measure" and you can't really measure Kolmogorov complexity for real programs. It is mostly useful for proofs or thought experiments.

You can try to measure Cyclomatic complexity, so that's more useful in practice.
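As a rough illustration of what such a measurement looks like, here is a sketch that approximates McCabe's number for Python source as 1 plus the number of decision points. The set of node types counted is my own simplification, and `approx_cyclomatic` is a hypothetical name; real tools handle more cases.

```python
import ast

# Node types treated as one decision point each (a simplification).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands, so two extra branches.
            complexity += len(node.values) - 1
    return complexity
```

Straight-line code scores 1, and each `if`, loop, or short-circuit operator adds one, so two files with the same line count can score very differently.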

If only someone posts tomorrow with something similar to cyclomatic complexity.

Kolmogorov complexity is borderline useless if the idea is to describe code size or code complexity for humans, though. It only describes programs in terms of "what is the smallest (fully qualified) program that would generate something that ultimately leads to the original result", completely removing both the human, as well as the original code in question, from the equation.

And to make matters worse, we don't actually know how to calculate the true Kolmogorov complexity of anything because humans are notoriously bad at figuring out what "the smallest program" actually is, so it's a great device for reasoning about complexity, but it's a near useless device for determining actual complexity.
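The closest practical stand-in I know of is compressed size: the output of a general-purpose compressor gives an upper bound (up to a constant for the decompressor), not the true value. A minimal sketch, where `compressed_size` is just an illustrative helper:

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the input after zlib compression at level 9.

    This is only an upper-bound proxy for Kolmogorov complexity;
    a smarter encoding could always be shorter.
    """
    return len(zlib.compress(data, 9))
```

Highly repetitive input such as `b"a" * 10000` shrinks to a few dozen bytes, while input with little redundancy barely shrinks at all. That says something about redundancy, not about how hard the code is for a human to follow.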

I agree with you and tialaramex above, and I was hesitating including it for this reason, but I think it is very useful as a thought experiment.

I find it a very elegant concept about complexity (even if not very practical), and wanted to share it regardless to be honest ;)

The book "Your Code as a Crime Scene" has a number of interesting techniques and tools for this.


Agree with this, but the OpenBSD codebase tends to be pretty consistent in style conventions and layout.

Fortunately, the word complexity was not used there, and it's nice to see if this subsystem, or protocol, or driver, etc takes two orders of magnitude more than the others.

Use a programming language that is decoupled from stateful manipulations and control flow?

Programming does not need to involve itself with the hardware, that’s just a tool for evaluation. A program is an idea expressed in a language. It’s some formal statement of intent. Sometimes we can evaluate it to produce an output. Sometimes these statements are muddied by underlying representations.

The length of a declarative program seems a pretty good rough measure of complexity in the way you are talking about. Operational complexity, meaning the number of steps needed to actually evaluate it, is of course a bit different and depends on what you are using.

So fully half the code of OpenBSD (3.3M loc) is in /dev/pci/drm/amd. What does that code do?

The vast bulk of it is a dump of MMIO register addresses that have been cleared for public disclosure. This is driven by some kind of legal/IP requirement. The vast majority are unused, and they are expressed in a rather inefficient way, e.g. XXX_0 through XXX_31 when anywhere else it would have been parameterized as XXX(N). This was all copied from the Linux kernel, btw.

Rather disappointing, but what are you gonna do? Make your own high-performance graphics hardware? No. AMD is the highest performance graphics hardware with open source driver on the planet by a wide margin. Compromises have to be made. Just accept that the hardware division gives you this crazy register address file and don't touch it.

Picking the longest file as an example, ./include/asic_reg/nbio/nbio_7_2_0_sh_mask.h: I think it uses #define to give addresses and hex values human-readable, easy-to-understand names like BIFPLR1_PCIE_ADV_ERR_CAP_CNTL__COMPLETION_TIMEOUT_LOG_CAPABLE_MASK, which is an alias for 0x00001000L. This file contains 134349 defines.

From what I can see at a glance, it's mostly loads upon loads of graphics code / graphics card stuff. It's possible they have upstreamed entire drivers into their code base.

I spotted some stuff that hints at low-level protocols as well (i2c, for example), so it's probably just a whole lot of work to support AMD graphics cards. Potentially they upstreamed a driver or something.

The AMD driver code in Linux is dual licenced, and the same goes for Nouveau and the Intel drivers. They have been copied into all the BSD variants.

DRM = direct rendering manager

(not digital rights management)

That's the amdgpu kernel driver. It's ported over from Linux.

All the code in /dev/pci/drm is modern graphics drivers. (DRM as in Direct Rendering Manager)

Those are probably AMD gpu drivers.

My enterprise-grade solution:

find dev arch kern uvm -type d | while read dn; do printf "$(find $dn -type f | xargs wc -l | awk 'END {print $1}')\t$(du -h $dn | awk 'END {print $1}')\t${dn}\n"; done

Just kidding, but for smaller repos it gets the job done! And it's easy to run anything on each individual dir and add it to the output on the fly.

Yeah, it seems odd that a golang solution was used instead of just shell.

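  # Print every ancestor directory of a file, stopping at '.' or '/'.
  paths() {
    dir=`dirname $1`
    while test "$dir" != '/' && test "$dir" != '.' ; do
      echo $dir
      dir=`dirname $dir`
    done
  }

  # Emit "<lines>\t<dir>" for each file/ancestor pair, then sum per directory.
  # Starts from the current directory when $dir is unset.
  find "${dir:-.}" -type f -not -path '*/.git/*' | while read file; do
    cnt=`wc -l < $file`
    for d in `paths $file`; do
      printf "%s\t%s\n" $cnt "$d"
    done
  done | awk '{
    hierarchy[ $2 ] += $1;
  }
  END {
    for (dir in hierarchy) {
      printf("%8d %s\n", hierarchy[dir], dir)
    }
  }' | sort -rn | head | sort -k 2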
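For anyone who would rather not spawn a `wc` per file, the same roll-up can be sketched in Python. This is not Ted's implementation, just the idea from the shell version: every file's newline count is credited to each of its ancestor directories below the root. The name `lines_per_directory` is made up.

```python
import os
from collections import Counter


def lines_per_directory(root: str) -> Counter:
    """Sum newline counts of every file into each ancestor directory under root."""
    totals = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata, as the shell version does for .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        rel_dir = os.path.relpath(dirpath, root)
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as fh:
                count = sum(
                    chunk.count(b"\n")
                    for chunk in iter(lambda: fh.read(1 << 16), b"")
                )
            # Credit the file to its directory and every parent below root.
            rel = rel_dir
            while rel not in (".", ""):
                totals[rel] += count
                rel = os.path.dirname(rel)
    return totals
```

Something like `for d, n in lines_per_directory(".").most_common(10): print(f"{n:8d} {d}")` gives roughly what the `sort -rn | head` stage does.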

> Easy to see everything, without too much clutter

Looks extremely cluttered to me... should be in a table, left side showing the file tree, middle showing LOC, right side showing storage.

I feel like this tool could significantly benefit from a ncdu like interface.

ncdu itself can accept a json file in a specific format [1]. Perhaps it will be a good idea to add an option to export the data in this format?

[1] https://dev.yorhel.nl/ncdu/jsonfmt
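A rough sketch of what such an export could look like, based on my reading of the format at that link: a top-level array of major version, minor version, a metadata object, and a directory entry, where a directory is an array whose first element describes the directory itself. Check it against the spec before relying on it. The input shape (a nested dict of line counts), the `progname` value, and both function names are assumptions of mine, and stuffing line counts into the size fields is a hack so that ncdu sorts by them.

```python
import json
import time


def to_ncdu(name: str, tree: dict) -> list:
    """Convert a nested {name: line_count or subtree} dict into a directory entry."""
    entry = [{"name": name}]
    for child, value in sorted(tree.items()):
        if isinstance(value, dict):
            entry.append(to_ncdu(child, value))
        else:
            # Line counts go into the size fields so ncdu will sort by them.
            entry.append({"name": child, "asize": value, "dsize": value})
    return entry


def export_ncdu(root_name: str, tree: dict) -> str:
    """Wrap the directory entry in the version header and metadata object."""
    meta = {
        "progname": "loc-export",  # hypothetical exporter name
        "progver": "0.1",
        "timestamp": int(time.time()),
    }
    return json.dumps([1, 0, meta, to_ncdu(root_name, tree)])
```

The result should be loadable with `ncdu -f file.json`, though the counts would be displayed with byte units rather than as lines.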

Side question: how do you _get_ the code from this site? It _seems_ like it's a git repo of some sort, but I can't figure out what to clone. Is this just a ... representation of a git repo somewhere else? What am I missing?

By cloning that same URL. Except there's the snag that it's not a Git repo, it's a Mercurial repo.

    hg clone https://humungus.tedunangst.com/r/watc

From the go.mod file:

> module humungus.tedunangst.com/r/watc

Implies a “go get” should pick it up. Likely a git clone of that URL would do.

Another commenter noted that this appears to be a Mercurial repo, not a git repo. Does `go get` support Mercurial repos? If yes, what else does it support, besides git and Mercurial?

it looks like `tree`[0] can do this functionality natively, with `tree -dh --du`

[0] https://formulae.brew.sh/formula/tree#default

That only prints file and directory sizes, not lines of code. Counting lines of code correctly is the hard part. Ted's utility just uses a simple file line count as a proxy, but that's probably good enough for this purpose.
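To make the gap between the proxy and a "real" count concrete, here is a small sketch that reports both the raw newline count and the number of lines that are neither blank nor whole-line comments. It is still crude: it assumes `//` line comments and ignores block comments and comment markers inside strings. `count_lines` is an illustrative name only.

```python
def count_lines(text: str, comment_prefix: str = "//") -> tuple:
    """Return (newline_count, code_lines) for a source text.

    code_lines skips blank lines and lines that start with the
    comment prefix; trailing comments still count as code.
    """
    newline_count = text.count("\n")
    code_lines = sum(
        1
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )
    return newline_count, code_lines
```

For finding where the bulk of a tree lives, the two numbers tend to rank directories the same way, which is presumably why the simple count is good enough here.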
