Hacker News new | comments | ask | show | jobs | submit login
Reading the NSA’s codebase: LemonGraph review (ayende.com)
135 points by ohjeez 6 months ago | hide | past | web | favorite | 67 comments

Often you wrap a library api with your own to minimize impact of updating library versions to the rest of the codebase or to provide cross version support, not just to say swap it for Oracle. At a game studio I was at we shimmed all of the PS3 API. No system Apu was allowed to leak past one layer in our codebase. It helped us move to PS4 quicker. Or to windows (with everything nerfed) between the consoles.

I like to do that too. Having a thin layer over the actual API makes maintenance much easier and you can test your layer better. But it's hard to implement this in a bigger team. A lot of people don't see the point.

It really depends on the API. For stuff that is likely to change it makes sense to abstract away the API calls. eg either because the API provider likes to break stuff or because you need your code to be agnostic towards that particular library eg for portability reasons. But that isn't always a concern and if you're abstracting away an API just for the sake of it then you'll have to question whether you're adding unnecessary complexity and overhead just for the sake of perceived modularity.

Like everything in software development, it's about evaluating the right methodologies for the problems rather than hitting every screw with the same hammer.

Likeliness to change is certainly a factor, but I think the level to which you are agnostic to the particular technology is more important. If I am choosing the technology to provide functionality that can be provided with pretty much the same semantics by other technologies, then I think it often makes sense to create the abstraction so that they can be swapped out. On the other hand, if I am choosing technology for semantics that are particular to that technology, I don't think creating the extra abstraction makes sense.

For instance, caching is functionality that could be backed by redis, memcached, etc. and I think it makes sense to create a cache API in most cases. On the other hand, if I am processing data with Kafka, I want to interface with Kafka directly for all the special functionality and nuance that makes it great for particular use-cases.

'Like everything in software development, it's about evaluating the right methodologies for the problems rather than hitting every screw with the same hammer."

Very true.

Hitting screws with hammers, eh? Ever drive a nail with a drill?

It's a reasonable way to drill a few holes in wood if you're in a bind.

I do it on my current server stuff as well. Generally I don't talk to dynamo directly, I make a pretty simple wrapper that speaks the language of the project. This helps with mocking and with isolating the business logic from shifts in schema. Also makes it easier to use test dynamo db.

I would recommend this approach to any project that deals with databases. Having SQL or other query languages spread around the codebase makes it extremely hard to refactor or test and can also cause security issues.

I always write simple functions like "CustomerFunctions.GetCustomersByCountry" and keep all SQL in that file. Makes the code more readable and testable.

Is the point: thin layers defeat NSA?

why would it be hard to implement this in a bigger team? if it's hard, then it's leadership failure.

It's easy to implement if you are the dictator who can just tell people that that's the way how things are done. Or you have worked with people for years and learned your lessons. If you have to convince people it's much harder. I find that a lot of people these days avoid writing code and just want to plug components from GitHub or npm together. They don't see the point of writing an API layer if it's not coming from GitHub.

I somehow understand your comment. Logically you are right. Having the first and last word is always the easiest way. I mean, the easiest way to get what you want for a limited amount of time, until all hell breaks loose.

I hear a lot of frustration in your words, as if you had difficulties trying to explain to your colleagues why a certain thing could be done better in a certain way. I think that sometimes they might be right, sometimes reinventing the wheel is pointless, sometimes it needs to be invented, and sometimes you need to add that layer of complexity. It seems that information is not shared correctly at your workplace, as not everyone is probably aware of the same problem to solve, or maybe not everyone has the same skills, in which case you need to mentor and explain. Or it seems that you don't have constructive design discussions.

And no, this does not depend on the company/team size. Probably in larger teams it's actually easier to convince, as generally there is a process in place for such things like design, refactoring, etc.

It's not unique. I've worked in large teams with benevolent dictators (who were open to feedback and suggestion), and large teams where everyone wanted to be at the same level.

In the latter scenario, generating momentum is damned difficult, as everyone is trying to herd cats to their own priorities.

Smaller teams in that regard can be more flexible / nimble, as there are fewer total competing priorities, but that isn't to say that it is always true in either direction.

Gary Bernhardt, as usual, has an excellent screencast about that. https://www.destroyallsoftware.com/screencasts/catalog/wrapp...

You know what they say: Every problem can be solved by introducing another layer of indirection. Etc. ;-)

But seriously, that sounds like a very good idea if one has to target multiple platforms work with a change-prone API.

Well think about it, if you're unit testing or functional testing (full stack api based but mocked out downstreams) then you have at least 2 maybe 3 platforms to run against. The real, the mock and maybe the more fancy mock for the functional tests. So say you were talking directly to dynamo api for all of this, it's a lot harder to mock out the dynamo API than it is to mock out a simple 3 method interface.

How would you rate this wrapping? I mean how do you prevent the 'leaky abstraction' where you just wrap a method with your own which has the same signature and struct parameters? Or do you have your own language or domain to wrap e.g. the ps3/4 apis?

I would say it worked pretty well. We also had some pretty dogmatic rules that were revised once per project and people didn't argue about until the end of a project. This was good and bad. It made the devs who had been there 3+ years super effective with like no mental overhead. It was very difficult on new poeople to learn the system and the company AND fight with a lot of other rules at CR time. The company was biasing for lots of long term happy programmers who were super effective so I understand why they did it. The average tenure was 5+ years on the dev staff.

we wrapped it into our engine language. Even though we were a sony only shop we still just talked our stuff. For instance we didn't use any reference to any of the input apis, we just had a global input class g_Input and Input that did all of the input logic. It was pretty easy to make this work for direct input on windows vs the state machine polling system on PS3.

My takeaway here is surprise that a game development company biases for long term employees (my understanding was to push for long weeks and early burnouts)

It depends on the company for sure and can change over time. I think there are also often 2 tiers of employees at game companies. The ones who are there for the project (their choice or the company choice) and those they feel are the nucleus and treat well. And well if you're at a place like EA the size of the nucleus may be differen due to shifting across projects vs at a single project shop where there are no other projects and loss of the nucleus means the next project fails too.

It just takes so long. And sometimes you end up with the minimal api featureset across N platforms.

Sure.. but we were also only a sony shop, so we had 3 platforms ever, and only 2 were supported at any time (during transitions). So it was more that moving from PS3 -> Windows (while ps4 was only a gfx card) -> PS4 had little impact on the rest of the code base. I'm not saying we were windows / mac / linux / ios / android / ooya / ps4 / ps3 / switch / xbone / 360. I have no experince into that.

"As an aside, I’m loving the fact that you can figure out the call graph for methods based on the number of underscores prefixed to the name. No underscore, public method. Single underscore, private method used internally. Two underscores, hidden method, used sparingly. At three underscores, I could tell you what it means, but given that this is NSA’s codebase, I’ll have to kill you afterward."

To be fair, the way malloc is used is actually the preferred way. If you change the type of the pointer variable, the malloc is still valid.

I agree. I always do it this way for this exact reason.

However, I'm surprised the author did not point out that they are calling malloc() without checking for a NULL return value.

Checking the return of malloc() for NULL is not currently considered good practice, because (a) on most platforms malloc is guaranteed not to return NULL, even if the system is out of memory, and (b) on the few platforms where malloc can return NULL, handling that case in practice basically never works.

> Checking the return of malloc() for NULL is not currently considered good practice


To quote malloc(3):

  On error, these functions return NULL.  NULL may also be returned by a
  successful call to malloc() with a size of zero, or by a successful call to
  calloc() with nmemb or size equal to zero.
I'm not disagreeing with your assertion that on linux, in practice, it doesn't return NULL -- (in fact, I didn't know that! and would appreciate being pointed at whatever I need to RTFM...) -- I just prefer code that's conservative, that follows the "letter of the law" when it comes to return values.

If you write code that tests for null out of malloc, you have to commit to that code as a feature and have a test strategy for it. Simply porting the program to a platform where malloc returns null isn't enough; there are various cases to probe, like your very first malloc failing because the program was started in a low-memory environment. Or the case that the 99th malloc fails after 98 good ones.

Oh, I'm not saying you should test for NULL and then, uh, not care about the result!

If you port to some new platform, of course you need to go back and check the docs for new error codes.

For instance, I generally abort() on NULL after printing a message.

> For instance, I generally abort() on NULL after printing a message.

Ok, but...how are you printing the message?

On many platforms, with many standard libraries (glibc, for instance), printf et al will themselves malloc memory, so you're testing for a system telling you it couldn't allocate memory and then in response, potentially triggering further quite-likely-failing allocations that may or may not be handled by the library.

This is why people generally don't bother to check for NULL mallocs. Even if you're on a system that will even tell you you're in that situation (rare, these days, outside of the embedded world), there's almost nothing useful you can do.

You’re absolutely correct, in this manner of development you just hit a wall and call it a day.

A few years back I read quite a few papers on NASA’s development standards and did quite a bit of kernel dev also.

I no longer do any allocations on daemonized or driver code like this. It’s far better to preallocate pools, store temp objects on stack and generally never malloc outside of the pool management.

When we hit the alloc wall, we now have well-defined behaviors and can gracefully choose whether to reject, clean or error out. Additionally our code is generally more performant, less buggy, and easier to debug.

You're absolutely correct. (full stop)

Honestly, in my own code I try to avoid lots of little heap allocations (duh), and when writing embedded things I don't have printf or malloc anyway.

"care about the result" means:

- come up with a requirement about handling the bad result

- put in the code which does it

- test the code

> I generally abort() on NULL after printing a message.

On most modern systems, accessing the null pointer will also abort; you just won't get the nice error message. It will look the same as if the program hit a runaway recursion, or corrupt memory. The location in the program where it happened is equally traceable either way (with a core dump or debugger). Neither a segfault nor abort() do any graceful cleanup of anything. Basically, this is only a tiny increment in software quality over not handling the NULL at all.

>on most platforms malloc is guaranteed not to return NULL

This is only true on Linux, and isn't even guaranteed there. I'm pretty sure the default overcommit handling mode is still to refuse allocations that are much larger than there is free memory to support. (This was the case around 2.6.)

It's generally preferable for software to check if malloc() failed and have a chance to recover.

Malloc will in some cases return null when the allocation is too large. This might happen long before system is out of memory and it is very much indeed preferred and practical to abort allocations in those cases.

Those are rather specific cases, though, aren't they? I'd imagine malloc only returns NULL when virtual address space is exhausted, not physical memory. So if you make an exceedingly large allocation for a very specific, massive thing, then yes, you might want to handle that. But if your allocation was rather small, then you are back to what your parent commentator said and are probably better off aborting.

It depends on whether the system you're using overcommits memory. Some OSes (eg Linux) provide settings to limit virtual memory use to some percentage of physical memory. See http://engineering.pivotal.io/post/virtual_memory_settings_i...

The only time you should avoid checking malloc() return values are for special mallocs that really cannot return NULL. One example is FreeBSD's kernel malloc() when called with the M_WAITOK flag.

On a 32bit process that is easily achieved and also if the virtual memory has been fragmented.

I've encountered many situations where malloc will return null in real world applications. Some of which we really wanted to handle that scenario and fail gracefully or continue with less memory usage (and thus less performance).

Eh? When did that get started? You are claiming resources. Not checking if you got them is a leap of faith, period. If nothing else, you can abort the program before nastier things happen.

This is correct. What happens when you are trying to rely on malloc'd memory for sensitive calculations? Reading an invalid pointer is undefined behavior. I see no reason why "it almost never works" to check for NULL and abort if necessary.

> Checking the return of malloc() for NULL is not currently considered good practice

This is simply false. I was a TA for the intro C class at CMU last year, and we still teach to check for NULL. In fact, we mark down students who do not do so.

I'm mildly surprised to see the NSA using malloc instead of calloc, especially for allocating what seems to be a partially initialized struct.

I understand the question, but as @AndyKelley taught me (https://github.com/ziglang/zig/pull/993#commitcomment-289183...), it is actually worse to zero a value unnecessarily if you can use Valgrind/sanitizers to check for uninitialized values. Initializing the value as zero will prevent the detection of a bad value.

Really depends on whether they know that all fields will be written before return or not.

This way hey are calling malloc() is very common in professional code bases and a safety measure; I’m surprised the author was surprised.

The way I read it, the author is surprised that struct node_t is typedef'd to node_t* and not simply node_t

I am not used to that style so it seems odd to me, but I suppose if it was common in a codebase one would get used to it.

Most compilers in the 80’s and 90s had pretty restricted grammars based on K&R before C89 and newer standards fixed these sort of things. I’ve seen a bit of code like this from then... Off the top of my head, I’m thinking the Amiga toolkit and Xwindows. Old embedded code may still have some of this going on as well.

Sometimes limits are good. They lead to people playing it safe.

I've spent most of my life dealing with C code dating from 1986 and earlier to present day. Once you get used to the stylistic conventions one of the things that falls out is the simplicity even for huge, old legacy code bases. Often little more than grep or etags is sufficient to competently navigate and follow the flow of this kind of code. Simple things are simple: what callstacks can end up here? etc. The code often has the property that if you printed it you'd still be able to navigate it without issue. Given a random page and a line, getting to somewhere would be possible.

In comparison so many modern code bases are just completely incomprehensible without a tool in the form of an IDE (and often still quite challenging with them because even the IDE isn't sure, and so you must resort to runtime analysis). So many "tricks" used which require tools and require the tools to be bug free and robust.

The closest I have come to "pick it up and you can read and understand it without assistive technologies" is Go.

Passing sizeof(value) to malloc is really fine. But I consider the typedef of a pointer to a node is to the same identifier as the node itself to be pretty bad.

It's confusing for readers and new team members. And can even lead to strange when someone forgets to type "struct".

Why would that be? You don't cast the result of malloc in C. It's wholly unnecessary, adds clutter to the code, and potentially hides an error if you forgot to include stdlib.h.

So why would that be "the preferred way"? I've been writing C for some time and no one who knows C casts the return value of malloc.

The parent comment was referring to the usage of sizeof().

Also, the fact that they cast the result of malloc makes me think they were using a C++ compiler, which will complain that there's no cast.

>The parent comment was referring to the usage of sizeof().

Ah ok, makes more sense in re-reading now. I was lazy and went off of a child comment.

>Also, the fact that they cast the result of malloc makes me think they were using a C++ compiler, which will complain that there's no cast.

Definitely could be that.

If you're compiling for C++, you should be using static_cast<> instead.

> To be fair, the way malloc is used is actually the preferred way.

Is it really right? In C, isn't it considered bad to cast malloc result as it masks some mistakes? Also, the code doesn't check for NULL for the malloc result.

I am not referring to the lack of null check. Just the fact that it uses `malloc(sizeof(*x))` instead of using `malloc(sizeof(Type))`.

The latter creates Heisenbugs if you change the type of x and don't track down every malloc to update the type. It is better to avoid using sizeof(typename) if possible.

Now the profligate use of leading underscores. That is an issue.

> Now the profligate use of leading underscores. That is an issue.

This is a pretty common way of shoehorning access control into languages that do support them natively.

Good point about the type change risk. However, in this case that was obviously not the reason, since there is an explicit cast (although the compiler would prevent it from becoming a HB). His hand was forced by the pointer hiding in the typedef. As I see it, this code is half-way smart, which is worse than no-way smart. (KISS!)

where's the difference if you put in the numbers by hand? Which numbers do you put in to avoid mentioning the size of the type?

You just feed sizeof the instance of the object you're allocating memory for. That entails funky pseudo-dereferences for pointers so the compiler doesn't give you the size of a pointer. It will look up the object's type and requisite size to fill in the argument to malloc(). If the object in question has an unknown type (maybe all you've got is a void *) then you will have to provide the size by other means.

Malloc does not fail under Linux unless you are in very peculiar circumstances.

Not everyone is running Linux.

And anyway, it is dirty - like crossing the street without looking for cars because traffic around here is very mellow, or peeing without washing your hands afterwards.

(That being said, I once helped somebody to write a program whose task it was to deliberately get the host system to the point where malloc(3) failed. It was not as trivial as one might think.)

Malloc will still fail for overly large allocation sizes. If you try to malloc(-1), for example.

What's with the name??

There is already a well-known and popular library for graph analysis (in C++) named "Lemon Graph Library". It is 10 years old. As far as I can tell, this new library is not related in any way. (Am I mistaken?)

Did they not simply google for the words "lemon graph" before they published this code base? Or am I missing something?

My guess of how this evolved: LMDB => LemonDB.

Do you seriously think they care about re-using a name for a throw-away non-classified project? People have better things to do with their time.

The idea of the NSA googling `lemon graph` then saying "Welp, looks like someone else already has it. Back to the drawing board", is actually pretty hilarious.

On the topic of code reading, can anyone endorse any tools specific to reading a code base, rather than development?

I've been using vim and cscope forever, but I'm wondering if there's anything new and interesting out there.

...or even screencast recommendations of folks that are particularly proficient at this.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact