Exploding Git Repositories (kate.io)
447 points by ingve on Oct 12, 2017 | 73 comments

I wonder what the author means by "a lot" of RAM and storage. I tried it for fun. The git process pegged one CPU core and swelled to 26 GB of RAM over 8 minutes, after which I had to kill it.

Yeah I tried it too. Killed at 65G. Disappointed that Linux killed Chrome first.

    Oct 12 15:47:52 x99 kernel: [552390.074468] Out of memory: Kill process 7898 (git) score 956 or sacrifice child
    Oct 12 15:47:52 x99 kernel: [552390.074471] Killed process 7898 (git) total-vm:65304212kB, anon-rss:63789568kB, file-rss:1384kB, shmem-rss:0kB

Interesting. Linux didn't kill Chrome, it died on its own.

    Oct 12 15:42:21 x99 kernel: [552060.423448] TaskSchedulerFo[8425]: segfault at 0 ip 000055618c430740 sp 00007f344cc093f0 error 6 in chrome[556188a1d000+55d1000]
    Oct 12 15:42:21 x99 kernel: [552060.439116] Core dump to |/usr/share/apport/apport 16093 11 0 16093 pipe failed
    Oct 12 15:42:21 x99 kernel: [552060.450561] traps: chrome[16409] trap invalid opcode ip:55af00f34b4c sp:7ffee985fb20 error:0
    Oct 12 15:42:21 x99 kernel: [552060.450564]  in chrome[55aeffb76000+55d1000]
    Oct 12 15:47:52 x99 kernel: [552390.074289] syncthing invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0

Seems Chrome faulted first, but it was probably capturing all signals and didn't handle OOM. Next, an allocation by syncthing invoked the oom-killer, which correctly selected 'git' to kill.

> [..] and didn't handle OOM.

How would Chrome 'handle' an OOM anyway? As far as I'm aware, malloc doesn't return ENOMEM when the system runs out of memory, only when you hit RLIMIT_AS and alike.

Or when you hit 4G VIRT on 32-bit.

Took me a good day's worth of debugging before some bright spark piped up and said "wait, you said you were on x86-32...?"

...yeah, I use really old computers.

I'm setting up my last machine for my wife for gaming. Athlon X4 630, and 16 GB of RAM. I loaded Windows up and it said it had ~2 GB free and I was like "oh crap, the RAM sticks must be dead" (because the last motherboard that I just replaced broke some RAM slots).

I fixed my old video card, a GTX 560, and wanted to see what it could run. I loaded steam and PUBG said "invalid platform error". It took me a moment. I hit alt-pausebreak, presto, Windows 32-bit. Whoops.

Hadn't had that problem in a long time except at clients running ancient windows server versions complaining about why Exchange 2003 won't work with their iPhones anymore "it used to work and we didn't change anything!" (Yeah... but the iPhone DID change--including banning your insecure 2003 Exchange protocols.)

Humblebrag ;)

Nowadays 32GB of RAM go for as little as 170$. Some mid-tier graphics cards cost much more than that.

They went for around 100$ during summer 2016, now the cheapest DDR4 is around 240$:


Wow, I didn't notice just how much fluctuation there has been in RAM prices. My Newegg order history shows I paid $65 for 16 GB of DDR3/1600 at the end of 2015. Now the exact same product is sold by Newegg for $122. Crazy!


I sometimes forget that people use desktops or systems with the ability to add extra RAM.

I'm curious how this was uploaded to GitHub successfully. I guess they do less actual introspection on the repo's contents than I thought. Did it wreak havoc on any systems behind the scenes (similar to big repos like Homebrew's)?

There isn't anything wrong with the objects. A 'fetch' succeeds but the 'checkout' is what blows up.

Good point. For those that are curious:

Clone (--no-checkout):

    $ git clone --no-checkout https://github.com/Katee/git-bomb.git
    Cloning into 'git-bomb'...
    remote: Counting objects: 18, done.
    remote: Compressing objects: 100% (6/6), done.
    remote: Total 18 (delta 2), reused 0 (delta 0), pack-reused 12
    Unpacking objects: 100% (18/18), done.

From there, you can do some operations like `git log` and `git cat-file -p HEAD` (I use the "dump" alias[1]; `git config --global alias.dump 'cat-file -p'`), but not others, like `git checkout` or `git status`.

[1] Thanks to Jim Weirich and Git-Immersion, http://gitimmersion.com/lab_23.html. I never knew the guy, but, ~~8yrs~~ (corrected below) 3.5yrs after his passing, I still go back to his presentations on Git and Ruby often.

Edit: And, to see the whole tree:

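  NEXT_REF=HEAD  # assumed starting point; the original snippet did not show the initialization
  while [ -n "$NEXT_REF" ]; do
    echo "$NEXT_REF"
    git dump "${NEXT_REF}"
    # Follow the first entry (d0 or f0) one level down; this comes back empty once we reach the blob.
    NEXT_REF=$(git dump "${NEXT_REF}^{tree}" 2>/dev/null | awk '$4 == "d0" || $4 == "f0" { print $3 }')
  done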

Sad one to nitpick, but Jim died in 2014. So ~3.5 years ago.

Had the pleasure of meeting him in Singapore in 2013.

Still so much great code of his we use all the time.

Thanks for the correction, he truly was a brilliant mind. One of my regrets was not being active and outgoing enough to go meet him myself. I lived in the Cincinnati area from 2007-2012. I first got started with Ruby in 2009, and quickly became aware of who he was (Rake, Bundler, etc) and that he lived/worked close by. But, at the time, I wasn't interested in conferences, meetups, or simply emailing someone to say thanks.

I too was curious about this.

https://github.com/Katee/git-bomb/commit/45546f17e5801791d4b... shows:

"Sorry, this diff is taking too long to generate. It may be too large to display on GitHub."

...so they must have some kind of backend limits that may have prevented this from becoming an issue.

I wonder what would happen if it was hosted on a GitLab instance? Might have to try that sometime...

Since GitHub paid a bounty and Ok'd release, perhaps they've patched some aspects of it already. Might be impossible to recreate the issue now.

My naive question is whether CLI "git" would need or could benefit from a patch. Part of me thinks it doesn't, since there are legitimate reasons for each individual aspect of creating the problematic repo. But I probably don't understand god deeply enough to know for sure.

is this a git->god typo, or a statement about your feelings towards Linus?

Please don't let Linus read this

Yes, hosting providers need rate limiting mitigations in place. GitHub's is called gitmon (at least unofficially), and you can learn more at https://m.youtube.com/watch?v=f7ecUqHxD7o

Visual Studio Team Services has a fundamentally different architecture, but we have some similar mechanisms despite that. (I should do some talks about it - but it's always hard to know how much to say about your defenses lest it give attackers clever new ideas!)

> how much to say about your defenses lest it give attackers clever new ideas

attackers will try clever new ideas anyway if their less clever old ideas don't work :P

How does the saying go? Something like "security through obscurity isn't security"?

It's not security through obscurity. It's defense in depth.

GitLab uses a custom Git client called Gitaly [0].

> Project Goals

> Make the git data storage tier of large GitLab instances, and GitLab.com in particular, fast.

[0]: https://gitlab.com/gitlab-org/gitaly

Edit: It looks like Gitaly still spawns git for low level operations. It is probably affected.

Spawning git doesn't mean that it can't just check for a timeout and stop the task with an error.
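
A minimal sketch of that idea, assuming a hypothetical wrapper named `run_with_timeout` (this is not how Gitaly or GitLab actually does it): the parent spawns the child, waits up to a deadline, and reports a timeout instead of hanging.

```python
import subprocess
import sys

def run_with_timeout(cmd, seconds):
    """Run cmd and return (exit_code, timed_out).

    Hypothetical helper for illustration. subprocess.run() kills the
    child and reaps it before raising TimeoutExpired.
    """
    try:
        done = subprocess.run(
            cmd,
            timeout=seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return done.returncode, False
    except subprocess.TimeoutExpired:
        return None, True

if __name__ == "__main__":
    # Stand-in for a long-running git invocation such as
    # ["git", "rev-list", "--objects", "HEAD"].
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, 0.5))
```

A timeout bounds wall-clock time only; it does not stop the child from allocating a lot of memory before the deadline, so a memory limit would still be needed alongside it.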

Someone will probably have to actually try an experiment with Gitlab.

Tested locally on a GitLab instance: trying to push the repo results in a unicorn worker allocating ~3GB and pegging a core, then being killed on a timeout by the unicorn watchdog.

    Counting objects: 18, done.
    Delta compression using up to 4 threads.
    Compressing objects: 100% (17/17), done.
    Writing objects: 100% (18/18), 2.13 KiB | 0 bytes/s, done.
    Total 18 (delta 3), reused 0 (delta 0)
    remote: GitLab: Failed to authorize your Git request: internal API unreachable
    To gitlab.example.com: lloeki/git-bomb.git
     ! [remote rejected] master -> master (pre-receive hook declined)
    error: failed to push some refs to 'git@gitlab.example.com:lloeki/git-bomb.git'

I had "Prevent committing secrets to Git" enabled, though. Disabling this makes the push work. The repo can then be browsed from the web UI at the first level only; clicking into any folder breaks the whole thing, with multiple git processes hanging on git rev-list.

EDIT: reported at https://gitlab.com/gitlab-org/gitlab-ce/issues/39093 (confidential).

Thanks. Here is the comment from a GitHub engineer addressing the root cause:


Because that page is AMP by default, it takes about 7 seconds to load the page on my laptop. AMP is really slow in some cases.

Edit: see my comment below before you downvote me.

Huh, I've tested on a bunch of devices/connections and haven't encountered that. Do you know what causes AMP to be that slow for you? I'll take a look at serving non-AMP pages by default. It will require tweaking how image inclusion works.

For people who use extensions or browsers that block third party JS, AMP pages will take many seconds to load in non-mobile Web browsers.

Here is information about some of the other problems with AMP:






Fix your browser /shrug

It isn't just my browser. AMP performs very badly in some non-mobile browsers (no extensions).

Fix your website

Would you please remove amp entirely?

Same here. The page just stays blank for a few seconds, and then pops into existence.

(I do use uMatrix to block 3rd party JS.)

Why not just always run git under memory limits?

For example:

  %  ulimit -a
  -t: cpu time (seconds)              unlimited
  -f: file size (blocks)              unlimited
  -d: data seg size (kbytes)          unlimited
  -s: stack size (kbytes)             8192
  -c: core file size (blocks)         0
  -m: resident set size (kbytes)      unlimited
  -u: processes                       30127
  -n: file descriptors                1024
  -l: locked-in-memory size (kbytes)  unlimited
  -v: address space (kbytes)          unlimited
  -x: file locks                      unlimited
  -i: pending signals                 30127
  -q: bytes in POSIX msg queues       819200
  -e: max nice                        30
  -r: max rt priority                 99
  -N 15:                              unlimited
  %  ulimit -d $((100 * 1024)) # 100 MB
  %  ulimit -m $((100 * 1024)) # 100 MB
  %  ulimit -l $((100 * 1024)) # 100 MB
  %  ulimit -v $((100 * 1024)) # 100 MB
  %  git clone https://github.com/Katee/git-bomb.git
  Cloning into 'git-bomb'...
  remote: Counting objects: 18, done.
  remote: Compressing objects: 100% (6/6), done.
  remote: Total 18 (delta 2), reused 0 (delta 0), pack-reused 12
  Unpacking objects: 100% (18/18), done.
  fatal: Out of memory, malloc failed (tried to allocate 118 bytes)
  warning: Clone succeeded, but checkout failed.
  You can inspect what was checked out with 'git status'
  and retry the checkout with 'git checkout -f HEAD'
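
The same cap can be applied from a parent program rather than an interactive shell. A rough sketch, assuming Linux and a hypothetical helper name `run_with_address_space_limit`; it sets `RLIMIT_AS` in the child, which is the limit `ulimit -v` controls:

```python
import resource
import subprocess
import sys

def run_with_address_space_limit(cmd, limit_bytes):
    """Run cmd with its address space capped; return the exit code.

    Hypothetical helper. Only the soft limit is lowered, so this also
    works when the hard limit is already below limit_bytes.
    """
    def apply_limit():
        # Runs in the child after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    done = subprocess.run(
        cmd,
        preexec_fn=apply_limit,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return done.returncode

if __name__ == "__main__":
    GIB = 1024 ** 3
    # Stand-in for ["git", "checkout", "-f", "HEAD"]: a child that
    # tries to allocate far more than the cap allows.
    hog = [sys.executable, "-c", "bytearray(4 * 1024**3)"]
    print(run_with_address_space_limit(hog, 1 * GIB))
```

With the limit in place, allocations in the child fail with an out-of-memory error instead of growing until the kernel's OOM killer steps in, which matches the `malloc failed` message in the transcript.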

Run this to create a 40K file which expands to 1GiB

  yes | head -n536870912 | bzip2 -c > /tmp/foo.bz2

I would imagine you could also do something really creative with ImageMagick to create a giant PNG file that'll make browsers, viewers, and editors crash.

PNG has dimensions in the header so the decoder should know when it's decompressed enough.
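
To illustrate: the width and height sit in the IHDR chunk right after the 8-byte signature, so a decoder can read them from the first 24 bytes and refuse before inflating any pixel data. A small sketch; `check_png_size` and its `max_pixels` default are made-up names and values, not taken from any real decoder:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_dimensions(header):
    """Return (width, height) from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG header")
    # Chunk layout: 4-byte big-endian length, 4-byte type, then data.
    length, chunk_type = struct.unpack(">I4s", header[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    width, height = struct.unpack(">II", header[16:24])
    return width, height

def check_png_size(header, max_pixels=64_000_000):
    """Reject images whose declared pixel count exceeds max_pixels (an arbitrary cap)."""
    width, height = png_dimensions(header)
    if width * height > max_pixels:
        raise ValueError(f"refusing to decode {width}x{height} image")
    return width, height

if __name__ == "__main__":
    fake = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 640, 480)
    print(check_png_size(fake))
```

The declared size is only a claim by the file, so a careful decoder would also stop inflating once it has produced the number of bytes those dimensions imply.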

You can take it a step further using Zip Bombs[0].

[0]: https://en.wikipedia.org/wiki/Zip_bomb
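
On the receiving end, one common defence against the bzip2 and zip examples above is to decompress incrementally and give up once the output passes a budget. A sketch using Python's `bz2.BZ2Decompressor`; the function name `bunzip2_bounded` and the limits are illustrative assumptions:

```python
import bz2

def bunzip2_bounded(data, limit, chunk_size=64 * 1024):
    """Decompress bzip2 data, refusing to produce more than `limit` bytes."""
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    pending = data
    while not decomp.eof:
        # max_length caps how much output a single call may return.
        chunk = decomp.decompress(pending, max_length=chunk_size)
        pending = b""  # unconsumed input stays buffered inside the decompressor
        if not chunk and decomp.needs_input:
            raise EOFError("truncated bzip2 stream")
        out += chunk
        if len(out) > limit:
            raise ValueError(f"output exceeds {limit} bytes, giving up")
    return bytes(out)

if __name__ == "__main__":
    bomb = bz2.compress(b"y\n" * 500_000)  # small cousin of the 1 GiB example
    print(len(bomb), "compressed bytes")
    try:
        bunzip2_bounded(bomb, limit=10_000)
    except ValueError as err:
        print(err)
```

At most `limit` plus one chunk is ever held in memory, regardless of what the compressed stream would expand to.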

You can also make archives that contain themselves:


Odd. It's surprising to me that this example runs out of memory. What would be a possible solution?

Admittedly I don't know that much about the inner-workings of git, but off the top of my head, perhaps something with traversing the tree depth-first and releasing resources as you hit the bottom?

You need a problem to have a solution to it. What do you consider to be the problem here?

This is essentially something that can be expressed in relatively few bytes that expands to something much larger.

Imagine I had a compressed file format for blank files "0x00" the whole way. It is implemented by writing in ascii the size of the uncompressed file.

So the contents of a file called terabyte.blank is just ASCII "1000000000000" ... or the contents of a file called petabyte.blank is "1000000000000000"

I cannot decompress these files... what is the solution?

> You need a problem to have a solution to it. What do you consider to be the problem here?

> This is essentially something that can be expressed in relatively few bytes that expands to something much larger.

That seems to be the problem. I mean, if an object expands to something much larger to the point that it crashes services just by the sheer volume of the resources it takes... That is pretty much the definition of an attack vector of a denial-of-service attack.

There is a problem here, but it's not with data. It's with the service.

Being able to express trees efficiently in a data format is a useful feature, but it requires the code processing it not to be lazy and assume people will never create pathological tree structures.

I'm not following; why can't you decompress it? Of course you can't decompress it into memory, but if it's trying to do that then there's a problem in the code (problem identified).

Naive solution, just write to the end of the file and make sure you have enough disk. More sophisticated solution, shard the file across multiple disks.
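
For the hypothetical ".blank" format from the parent comment, the streaming approach might look like the sketch below. The names `expand_blank` and `write_blank` are invented for illustration; the point is that memory use stays at one chunk, and that the declared size can be checked against a budget before any byte is written.

```python
def expand_blank(declared_size_ascii, chunk_size=1 << 20):
    """Yield the zero bytes a '.blank' file describes, one chunk at a time."""
    remaining = int(declared_size_ascii)
    if remaining < 0:
        raise ValueError("size must be non-negative")
    zeros = bytes(chunk_size)  # reused for every chunk
    while remaining > 0:
        n = min(chunk_size, remaining)
        yield zeros[:n]
        remaining -= n

def write_blank(declared_size_ascii, out, max_bytes):
    """Stream the expansion into `out`, refusing sizes above max_bytes."""
    if int(declared_size_ascii) > max_bytes:
        raise ValueError("declared size exceeds the allowed budget")
    for chunk in expand_blank(declared_size_ascii):
        out.write(chunk)

if __name__ == "__main__":
    total = sum(len(chunk) for chunk in expand_blank("2500000"))
    print(total)
```

This handles the memory side only. Whether the destination has room for terabyte.blank is a separate question, which is the objection raised in the reply that follows.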

That's not a solution, that's sweeping the problem under the rug: "just have the OS provide storage, therefore it's not my problem any more, solved. (Never mind that with a few more layers, the tree would decompress into a structure larger than all the storage ever available to mankind)"

Git assumes it can keep a small struct in memory for each file in the repository (not the file contents, but a fixed per-file size). This repository just has a very large number of files.

Large as in 10 billion. Even if git only needed 1 byte in memory per file, it would need 10GB.
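
The arithmetic can be checked with a toy model. This is a sketch that assumes 10 levels of trees with 10 entries each, all pointing at the same child, which is consistent with the 10 billion figure and the d0/f0 names in this thread; the dictionary layout and names like `build_bomb` are made up and are not git's real object format.

```python
from functools import lru_cache

def build_bomb(depth=10, fanout=10):
    """Return (objects, root_id): each tree maps to a list of (name, child_id)."""
    objects = {}
    child = "blob"  # the single shared file
    for level in range(depth):
        prefix = "f" if child == "blob" else "d"
        tree_id = f"tree{level}"
        # Every entry in a tree points at the SAME child object.
        objects[tree_id] = [(f"{prefix}{i}", child) for i in range(fanout)]
        child = tree_id
    return objects, child

def count_files(objects, root):
    """Count files in the expanded checkout without expanding it."""
    @lru_cache(maxsize=None)
    def count(obj_id):
        if obj_id not in objects:  # anything that is not a tree is a file
            return 1
        return sum(count(child) for _, child in objects[obj_id])
    return count(root)

objects, root = build_bomb()
print(len(objects) + 1, "stored objects (trees plus one blob)")
print(count_files(objects, root), "files in a full checkout")
```

Eleven stored objects describe ten billion paths, which is why the fetch is cheap and anything that enumerates every path is not.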

One option is to modify each of the utilities so that it doesn't have a full representation of the whole tree in memory. I doubt this is feasible in all cases, though for something like 'git status' it should be doable.

If the tree object format was required to store its own path, then you wouldn't be able to repeat the tree a bunch of times. The in-memory representation would be the same size, but you would now need that same number of objects in the repository. No more exponential fanout.

But that would kind of defeat the purpose of Git for real use cases (renaming a directory shouldn't make the size of your repo blow up).

Have git (the client) monitor its own memory usage and abort if it gets above a set limit (say, default, 1GB), with a message that tells you how to change or disable the limit.

Would this be possible with a patch-based version control system like Darcs or Pijul? Does patch-based version control have other analogous security risks, or is it "better" in this case?

If the patch language includes a recursive copy then it's possible to reproduce this problem in that setting.

If I understood correctly, this problem isn't caused by recursive copies but simply by expanding references. The example shows that the reference expansion leads to an exponential increase in resources required by the service.

This means the same in this context; if it was just expanding references one by one while walking through the tree this would not happen - the bomb requires copies of expanded references to be stored in memory.
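
The difference between walking references one at a time and materializing every expanded entry can be sketched with a generator over a toy tree (plain dictionaries, not git's data structures; `iter_paths` is a hypothetical name). Memory use follows the depth of the tree, not the number of paths.

```python
from itertools import islice

def iter_paths(objects, obj_id, prefix=""):
    """Depth-first walk that yields one file path at a time."""
    for name, child in objects[obj_id]:
        path = f"{prefix}/{name}" if prefix else name
        if child in objects:  # child is another tree: descend lazily
            yield from iter_paths(objects, child, path)
        else:                 # child is a file
            yield path

# Three levels with two entries each; every entry shares the same child.
objects = {
    "t0": [("f0", "blob"), ("f1", "blob")],
    "t1": [("d0", "t0"), ("d1", "t0")],
    "t2": [("d0", "t1"), ("d1", "t1")],
}

# Take only the first three paths; the rest are never generated.
print(list(islice(iter_paths(objects, "t2"), 3)))
```

A lazy walk keeps memory flat, but visiting all paths of the real repository would still take time proportional to their count, so it helps with the out-of-memory failure more than with the CPU cost.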

Bare for the win.

    git clone https://github.com/Katee/git-bomb.git --bare

Going to second level on Github breaks commit name for me - it gets stuck with "Fetching latest commit..." message. Curiously, go one level deeper and the commit message is again correct.


(INB4 The article suggests Github is aware of this repo, so I have no qualms posting this link here.)

Directory hard links would "fix" this issue since `git checkout` could just create a directory hard link for each duplicated tree. I wonder why traditional UNIX does not support this for any filesystem.

(Yes you would need to add a loop detector for paths and resolve ".." differently but it's not like doing this is conceptually hard.)

Has anyone tried to see how well BitBucket and Gitlab handle this?

What happens if you try to make a recursive tree?

You can't make a valid recursive tree without a pre-image attack against SHA1. However `git` doesn't actually verify the SHA1s when it does most commands. If you make a recursive tree and try `git status` it will segfault because the directory walking gets stuck in infinite recursion.

As in a tree that points to itself? You cannot, since a tree would have to point to its own SHA1. So this would require you to know your own tree's SHA and embed it in the tree.
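
To make that concrete: in a SHA-1 repository an object's id is the hash of a small header plus its body, and a tree's body contains the raw ids of its children. The sketch below computes ids that way with hypothetical helper names, then shows that embedding a guessed id in a tree produces a different id for that tree.

```python
import hashlib

def git_object_id(kind, body):
    """SHA-1 over b'<kind> <size>\\0' followed by the body."""
    header = b"%s %d\x00" % (kind, len(body))
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries):
    """Serialize (mode, name, hex_id) entries, assumed already sorted by name."""
    return b"".join(
        b"%s %s\x00" % (mode, name) + bytes.fromhex(hex_id)
        for mode, name, hex_id in entries
    )

# Sanity checks against two well-known ids.
print(git_object_id(b"blob", b""))  # the empty blob
print(git_object_id(b"tree", b""))  # the empty tree

# Try to build a tree whose entry d0 points at the tree itself.
guess = "0" * 40
actual = git_object_id(b"tree", tree_body([(b"40000", b"d0", guess)]))
print(guess == actual)
```

Any change to the embedded id changes the resulting hash, so a valid self-referencing tree needs an id that hashes to itself, which is the hard problem the replies above are pointing at.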

Reminded me of the GIF that displays its own MD5 hash:


So it's possible, but impractical?

I think it's possible.

If we all click "Download ZIP" on this repo we can crash GitHub together!

Just click here: https://codeload.github.com/Katee/git-bomb/zip/master

I hope and expect that GitHub has the basic infrastructure to monitor excessive processes and kill them.

Scratches head

...I clicked Download a few seconds ago.

GitHub is still thinking. :/

Edit: After about a minute I got a pink unicorn.

Wouldn't that just do a `git fetch` and therefore not have the issue?

"Download ZIP" downloads the repository’s files as a zip. No Git involved for the downloader.

I expect the download zip to be implemented as running 'git archive --format zip | write-http-response-stream'

Hmm I'd hope they do a caching step in between ;)

I thought it would self destruct after cloning or forking before clicking :)
