NSA built a NoSQL database (apache.org)
139 points by rschildmeijer on Sept 3, 2011 | 55 comments

"The core codebase consists of 200,000 lines of code (mainly Java) and 100s of pages of documentation."

100s of pages of documentation is a promising start for any open source project.

It depends on what the documentation is. If it's 100 pages of "AbstractClassFactoryClassFactoryFactory is a class that builds AbstractClassFactoryClassFactory objects", then that's useless.

Also explains why it's 200,000 lines of code, for something that should be an order of magnitude smaller.

Why do you say that it has to be an order of magnitude smaller? Other BigTable clones like HBase are at least 100K lines of code, if not more.

A project I'm working on with a similar level of intricacy as BigTable has way less than 100K, and that's in a language more verbose than Java.

200k lines ... of which 85k lines are generated from Thrift and 10% from other projects, according to related ML discussions.
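
For a rough sense of scale, those figures can be worked through. This is only a back-of-the-envelope sketch: it assumes the 10% is a share of the 200k total (the mailing-list discussion may have meant something else), and the function and constant names are made up for illustration.

```python
# Rough estimate of hand-written code, using the figures quoted above.
TOTAL_LINES = 200_000
THRIFT_GENERATED = 85_000      # lines generated from Thrift definitions
THIRD_PARTY_PERCENT = 10       # assumption: 10% of the total, not of the remainder


def handwritten_lines(total, generated, third_party_percent):
    """Lines left after removing generated and third-party code."""
    third_party = total * third_party_percent // 100
    return total - generated - third_party


print(handwritten_lines(TOTAL_LINES, THRIFT_GENERATED, THIRD_PARTY_PERCENT))
```

Under that reading, a bit under half of the headline number would be code the project's developers wrote by hand.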

Another Java hater? :s

Government & consulting projects generally require the code and the use of the code to be fully documented. I'm actually looking forward to that part of it more than the software if it is any good.

NSA uses Java. Great to know. Apparently the professionals at the NSA are not swayed by fashion.

That's a ridiculous defense: they were obviously swayed by fashion at some point. It just happened several years ago.

You appear to be tilting at windmills of your own making. What I find ridiculous is your claim of having had a window into the operational decisions of NSA back when they opted for Java.

Let me see if I understand your argument.

If they had chosen a newer language you didn't approve of, you would have glibly dismissed their engineering decision as "being swayed by fashion."

Since they chose Java, though, the engineers are being "professional" and anyone saying that their decision was swayed by fashion is being "ridiculous," claiming to have a "window into the operational decisions of NSA."

A little bias here perhaps?

They use more "fashionable" languages too. Just not for distributed databases. Most popular open source databases are also written in C/C++/Java.

200k. Why do Java programs always balloon up?

Zawinski's Law.

Some time ago I met some people from the .gov cyber security, NSA and other offices. The head of the .gov office on cyber security was really nice and invited me to go to dinner with them.

The guy from the NSA was hands down the biggest evil piece of shit I have ever experienced in my life. The way he talked, what he said, and the fact that he was given free rein to commit crimes in his training, which he openly bragged about, made me want to murder the guy right then and there.

I lost any and all respect for what the government and the NSA do.

...because of one guy, who may have been lying through his teeth in an attempt to impress the other people at dinner?

I'm not trying to defend this guy, certainly it's a big problem if he did what he said, but you can't fix something when you paint the whole fucking thing with a brush you picked up looking at one of the uglier parts.

Though I'm not defending the NSA, you should be careful not to make broad assumptions based on a single data point.


It sounds like Oracle's Label security applied to Hadoop.

It'll be a challenge because Label security slows things down quite a bit.

Prediction: There will be a backdoor.

C'mon Zed, really?

a) The code will be open source - the community can verify the code for anything untoward

b) Given the nature of the product, most implementations are going to be behind a firewall anyway, with the storage layer talking to business logic. Even if there was a backdoor, and I'm sure there isn't, not sure how NSA could get in.

Do you think there's a backdoor in NSA's open-source algorithm for SHA-1 too?

I applaud the government for putting tax dollars back into open source. My only gripe is the lack of transparency as to what this is primarily used for within the NSA (to be expected I guess). I generally like to know what I'm helping commit code to go do - although granted you have no idea what other open source projects are used for regardless of whether the lead sponsor is government or private company.

If there are plenty of good uses for the code, I'd still want to improve it, even if I find out it's used by the Kitten Krusher 3000.

Unless a "please don't use this code for evil" license is legally binding, that's just the nature of open source.

A "please don't use this code for evil" license would, by definition, not be open-source. (Also, such a license would almost certainly be ignored by evildoers.)

I don't necessarily think there will be one, but I wouldn't be surprised either.

Security flaws can be extremely subtle and 200,000 lines of code is a lot to review... Given that there's plausible deniability (we didn't do it intentionally, it was a genuine bug!), if you were them, wouldn't it at least cross your mind to try it?

Also, at some point, if it becomes popular, some sysadmin at a large foreign government agency or company will forget to firewall off a box running it (ignoring that they could also be connecting back directly - automatic updates anyone?)

But if there is a back door, doesn't releasing it as open source open the possibility that China's or Iran's equivalent of the NSA will audit the code and find it too?

That's why they should stop doing that. We aren't the smartest country on the planet anymore.

In which case the NSA says "Oops, it was a genuine mistake. Sorry." With 200,000 lines of code, there will almost certainly be unintentional security holes that haven't been found.

"My only gripe is the lack of transparency as to what this is primarily used for within the NSA (to be expected I guess)."

It's likely used exactly how you think it would be: to hold massive amounts of key/value data. No doubt the NSA has tons of data to work with, and a NoSQL approach seems beneficial for this use case.

I think it was just a joke, chill :)

A joke?! A JOKE?! You jest good sir. I merely put on my tinfoil hat and thought, "Hmmm, didn't this happen to OpenBSD, Windows, every crypto system ever, numerous databases, and probably SELinux?" Then extrapolated out to a very valid point.

How dare you claim I am not deadly serious about the NSA putting a back door in a database that is intended to be secure for the internet. How. Dare. You!

I still can't tell if you're joking.

I've seen a possible back door or two in this or that, but nothing like "every crypto system ever".

If you have evidence of a back door in AES, SHA-2, or anything NIST has standardized (other than Dual_EC_DRBG or openly weakened stuff like export SSL) lots of people would like to hear about it.

Didn't the NSA actually make DES stronger?

Yes, the story goes that the NSA assisted IBM in its development by tuning the specific values in the S-boxes to be resistant to differential cryptanalysis, which had not yet been publicly discovered.

They also reduced the key length from 64 to 56 bits. I found this suspicious and didn't accept the explanation that those 8 bits were needed for "parity". Yet, respected cryptographers say this actually brings the key size more in line with the effective strength. So those additional 8 bits in the key were not contributing to the security and it improves the "truth in labeling".
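
The "parity" point can be made concrete. In the published DES standard, a key is stored as 8 bytes, and the low bit of each byte is an odd-parity bit, so only 7 bits per byte carry key material. The sketch below illustrates just that layout; the function names are hypothetical, and it performs no encryption.

```python
def set_odd_parity(key: bytes) -> bytes:
    """Return the key with each byte's low bit set so the byte has odd parity."""
    if len(key) != 8:
        raise ValueError("a DES key is 8 bytes")
    out = bytearray()
    for b in key:
        high7 = b & 0xFE                          # the 7 bits that carry key material
        ones = bin(high7).count("1")
        parity_bit = 0 if ones % 2 == 1 else 1    # make the total number of 1 bits odd
        out.append(high7 | parity_bit)
    return bytes(out)


def effective_key_bits(key: bytes) -> int:
    """Count the bits that are not parity bits."""
    return len(key) * 7


a = set_odd_parity(bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]))
b = set_odd_parity(bytes([0x13, 0x35, 0x57, 0x79, 0x9B, 0xBD, 0xDF, 0xF1]))
print(a == b)                  # the two inputs differ only in their low bits
print(effective_key_bits(a))   # 8 bytes * 7 bits
```

Two 64-bit inputs that differ only in the low bit of each byte normalize to the same key, which is why the 64-bit container holds 56 bits of secret.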

Why would they build weaknesses into standard blocks, the biggest consumer of which is the US government itself?

When the NSA has insisted on an upper limit for a protocol's security (e.g., export crypto), it has usually required a simple cap on the number of secret bits in the key. When they've submitted fixes, they tend to be elegant and minimal (e.g. SHA-0 to SHA-1).

Can you elaborate on the "openly weakened stuff" part?

I don't know much about security, but I am vaguely aware that there were some efforts by various governments to control, regulate, weaponize and even outlaw crypto, but I don't know where these efforts have left us. Are there any crypto systems with acknowledged backdoors? Are there any which are not only widely considered to be secure, but are known to have actually prevented three-letter agencies from getting their way?

Back in the 90s the US Government prohibited export of SSL stronger than 40 bits. I believe this is what they're referring to.
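
To see why a cap on key bits counts as "openly weakened", it helps to compare keyspace sizes. This is a minimal sketch: the guessing rate is a made-up round number, not a claim about any real attacker, and the helper names are hypothetical.

```python
def keyspace(bits: int) -> int:
    """Number of distinct keys for a key with the given number of secret bits."""
    return 2 ** bits


def seconds_to_exhaust(bits: int, keys_per_second: int) -> float:
    """Worst-case time to try every key at a fixed guessing rate."""
    return keyspace(bits) / keys_per_second


ASSUMED_RATE = 10 ** 9   # hypothetical: one billion key guesses per second

print(keyspace(56) // keyspace(40))            # 65536: each extra bit doubles the work
print(seconds_to_exhaust(40, ASSUMED_RATE))    # seconds for the 40-bit export limit
print(seconds_to_exhaust(128, ASSUMED_RATE))   # seconds for a 128-bit key
```

At the same assumed rate, the 40-bit search finishes in minutes while the 128-bit one does not finish on any practical timescale. A simple bit cap is therefore a visible, quantifiable weakening rather than a hidden one.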

Warning Zed! This is a humor free zone.

Damn, you'd think with 200k lines of awesome Java that needs to be documented with a manual that's hundreds of pages long that uses 3 other massive Java projects and released by a government agency that's done backdoors in everything from crypto systems, operating systems, to even backdoors themselves, that there'd be at least a plausibility of them putting one in.


That's just from a quick google. Back in the day there were stories of "A Visit from Mr. Brown" or something like that. The NSA or "some agency" would go around to anyone making crypto or operating systems and ask to be given backdoors in exchange for deals on export restrictions. Periodically a government agency in another country would find them and we'd be embarrassed. These days it's not as common since crypto exports aren't restricted (much) so the threat of, "If you don't add a backdoor we'll label your software a weapon and you can't sell it to the world." doesn't work.

Then again, could all just be a huge conspiracy.....mwhahahaah.

Bruce Schneier explains why that is not a back door.


So other than this, are there any other NSA "backdoors in everything from crypto systems, operating systems, to even backdoors themselves"?

Oh, the great Bruce Schneier says so, so therefore it must be. How do you know he's not a shill for Microsoft and the NSA? Hmm?

The great thing about backdoors is, when they get discovered they have perfect plausible deniability. "Oh that key named NSAKEY isn't for the NSA it's for...uh...this other agency. Yeah that's it! It's not even a key. Right Bruce? Right?!"

Are you going to deny any contrary evidence as being a shill for the people you're accusing? Down that path lies madness and conspiracies, Zed.

Oh, so the great Zed Shaw says Schneier saying it's so doesn't make it so.

The great thing about people accusing others of subterfuge is how it may never reach a suitable end.

How do we know Zed Shaw is not a shill for those opposing those for whom Bruce Schneier is a shill?? What to do!

They are an eco-friendly CO2 emission reducing measure to reduce workload, in a desperate attempt to comply with Kyoto. They needed it to conform to the Energy Star certification scheme from the DoE.


Just curious, what backdoors have been discovered from the NSA?

Oop sorry, I replied wrong. Your answer is above.

Given that the charter of given agency is certainly not to produce FLOSS, and most certainly not for the pleasure of a foundation which has its worst adversaries as founders (hint: Ben Laurie).

It would be most plausible to have direct access to the build infrastructure, which in turn would give access to ... without the hoops of going through Oracle and IBM or whatever corporate projects.

And if you read the Spiegel article (which has to do with Ben's past and present), it is clear that the USA is on the "offensive". The surest way to discredit any anonymity provider for whistle-blowers is to discredit the providers, which has just happened in the last few days (note that the contents of the 7z were already past 0-day, and therefore valueless, as a US official noted in the article).

    svn co https://svn.apache.org/repos/asf/incubator/accumulo 
doesn't seem to work

It's just a proposal atm. The svn-repo is one of the items they are requesting from ASF

Now we know how they store the data gleaned from wiretaps!

It seems that the tags for cells are an important feature of this database, and they also mention it is appropriate for places where "privacy is important". Can someone explain the connection between these two? If I'm understanding right, the labeling makes it easy to address individual cells, but I'm not sure how that enhances privacy.

They seem to be access tags, not just arbitrary tags. I interpret it as item-level ACLs, something like row-level security in SQL databases.

The labeling likely refers to Mandatory Access Control (MAC) where objects (data, cells) are assigned classification labels (e.g. Top Secret, Confidential) and subjects (users, processes) can only access objects that match the subject's assigned classification level.
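
As an illustration of the MAC idea, here is a minimal sketch of one common reading, where a subject may read anything at or below its own level ("no read up"). The level names and the `can_read` helper are assumptions for illustration, not anything taken from the Accumulo proposal.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical classification levels, ordered from least to most sensitive."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


def can_read(subject_clearance: Level, object_label: Level) -> bool:
    """'No read up': a subject may read objects at or below its clearance."""
    return subject_clearance >= object_label


print(can_read(Level.SECRET, Level.CONFIDENTIAL))   # True
print(can_read(Level.SECRET, Level.TOP_SECRET))     # False
```

The point of "mandatory" is that the check is enforced by the system from the labels alone; the owner of the data cannot waive it the way they could with an ordinary ACL.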

I would imagine that this is similar to other ACL products in which the NSA has previously expressed interest, like SELinux. The "labeling" probably means setting permission levels.

"There is a risk that Accumulo will be criticized for not providing adequate security. The access labels in Accumulo do not in themselves provide a complete security solution, but are a mechanism for labeling each piece of data with the authorizations that are necessary to see it."

I'm guessing that the idea is to make it easy to enforce permissions at the application layer. You give permissions, and you get only cells that the current query-er is allowed to see. With HBase, it would be pretty easy to put permissions by the row (add a permission column, or column family if it's complicated enough), but if you want some columns in a row to have some permissions and some to have different ones, it would get unpleasant and inefficient fast.

And regardless, all of the filtering would have to occur at the application layer, meaning you'd have to wrap every get/scan to have it do the filtering for you. The Accumulo way also gets you some efficiency because it never even has to transfer the cells that get filtered by the permissions (or even fully read their content from disk, possibly).

Even though each cell isn't separately encrypted to get you true security at the cell level (which would destroy your performance, I'd guess), this seems like a huge win if you want to have permissions at the cell level.
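
A toy model of what I mean by filtering inside the store rather than in the application. This is not Accumulo's API: the `Cell` class, the `scan` function, and the rule "you see a cell if you hold every label on it" are all assumptions for illustration, and the real label syntax may well be richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required: frozenset   # authorizations a reader must hold to see this cell


def scan(cells, authorizations):
    """Yield only the cells whose required labels are all held by the reader."""
    auths = frozenset(authorizations)
    for cell in cells:
        if cell.required <= auths:    # subset test: every required label is held
            yield cell


table = [
    Cell("alice", "name", "Alice", frozenset()),
    Cell("alice", "salary", "100000", frozenset({"hr"})),
    Cell("alice", "diagnosis", "flu", frozenset({"hr", "medical"})),
]

for cell in scan(table, {"hr"}):
    print(cell.column, cell.value)
```

A reader holding only "hr" gets the name and salary cells from the same row, and the diagnosis cell is never handed back. Because the check runs inside the scan, the caller doesn't have to wrap every get/scan with its own filter.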

Awesome NSA...

So the NoSQL approach makes all those skiddies' SQLi attacks moot.

Still 200k lines of code = ~2000 bugs...

So, opening it to the public will expose some of those, and fixes will be created. Now, when are you going to show off that really kool advanced A.I. you guys are sitting on!
