Serving a website from a Git repo without cloning it (mediocregopher.com)
110 points by todsacerdoti on Feb 19, 2024 | 53 comments


> It's fairly common to use git repositories as a vehicle for serving websites. The webdev pushes their changes to some branch of a publicly available git repository

This doesn't require the git repo to be public. My go-to "day one" static website deploy strategy is to set up a git remote on some VPS, point nginx at its public html folder, then "deploy" to that with a git push, authed by ssh key.
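Roughly, with hypothetical paths (/srv/site.git for the bare repo, /var/www/site as the nginx root) and main as the branch:

  # on the VPS: a bare repo to push to
  git init --bare /srv/site.git

  # /srv/site.git/hooks/post-receive (chmod +x): check the push out into the web root,
  # which is the directory nginx's root directive points at
  #!/bin/sh
  GIT_WORK_TREE=/var/www/site git --git-dir=/srv/site.git checkout -f main

  # on the dev machine: "deploy" is just a push over ssh
  git remote add prod ssh://user@vps.example.com/srv/site.git
  git push prod main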

You'd be surprised how far you can get with simple tricks like this. I served roughly the first 6 months of what would become a multi-hundred-million-dollar company with this hilariously simple setup, running on a $20/mo DO VPS.


I do the same. My blog on a 9€/mo Contabo VPS survived multiple HN/Reddit hugs.

Lately I've been playing with more configuration options. For example, I have two servers behind a load balancer, weighted heavily toward server 1. Server 2, which hosts an almost identical site but with ads, is only used when server 1's response time goes above 500 ms. So under normal circumstances nobody sees ads on my blog unless I get a huge spike (usually more than 600 concurrent users).


I've been publishing my personal web page like this for more than 15 years now (the repo's log history goes back to early 2009). I use a post-update hook that makes sure on each push that the working directory matches the master branch (so that things pushed to other branches aren't made public), and then the public site is rebuilt (using a Makefile and custom scripts making heavy use of xml2, 2xml, sed, and the coreutils, but that's another story).

I can see why the author of the linked blog post would want to serve directly from the Git repository: it's fun. But I agree that it seems uselessly inefficient compared to (re)generating a static version of the site only when changes occur.

If there is no processing needed, as in the case of the blog post, a simple post-update hook can easily do the trick: go to the public directory served by the web server, set the GIT_DIR environment variable to point to the .git directory of your repository, then run `git checkout main-branch -- .` to get all the static files out of the repository. This has exactly the same external behavior as what is proposed in the linked blog post, but is far more efficient (except, of course, if you have no visitors and update your website very, very often).
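A minimal sketch of such a hook (the paths, branch name, and public directory are placeholders):

  #!/bin/sh
  # post-update hook in the pushed-to repository
  cd /var/www/public || exit 1
  # with GIT_DIR set and no GIT_WORK_TREE, git treats the current directory as the work tree
  export GIT_DIR=/srv/site/.git
  git checkout main-branch -- .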


Oh I really like that idea!

Over at pico.sh we are trying to make it dead simple to host N static sites using common tools: rsync, scp, and sftp.

https://pgs.sh

It really should be as simple as you are describing!


This approach is fine for static websites, but for web applications you should probably block requests to /.git so that people can’t clone your application off of the web.
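With nginx, for instance, a sketch (assuming the repo root doubles as the document root) could be as simple as:

  # refuse any request whose path contains /.git
  location ~ /\.git {
      return 404;
  }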


The .git directory doesn't necessarily need to be in the public directory at all.


It's always a good reminder anyway to tell people to either make the public dir something other than the root of the repo, or to block .git.


If anyone wants to read more about this, search for `GIT_DIR` and `GIT_WORK_TREE` environment variables.


Or, put your files under html/ in your repo, and push to /var/wwwroot (using the common /var/wwwroot/html setup)


If you're just serving static files, wouldn't an S3 website be easier and more robust?


With unbounded running costs... There is something comforting about knowing it's always going to cost you $20.


True, I didn't think about it from the limited-cost perspective. It would take a lot of traffic to get past $20, but it's a possibility, like you say.


Put CF in front of that VPS and it is better than S3.


One can also have CloudFront before S3, IIRC.


That is my go-to setup for static or semi-static sites. In theory you could still see a runaway bill with CloudFront, although I haven't checked the numbers to see what magnitude of traffic would be needed for it to really hurt.

Edit: did a quick check. By my definition of hurt, about 1,000 requests per second sustained for the month, or more than 12 TB of traffic, would hurt. But then again, I'm sure there is some other weird edge case that AWS bills for that could incur more severe costs with less traffic, and that is what I take 0xFF0123 to imply.


Last time I checked it was a bit clunky to get these working with HTTPS.


If you use CloudFront it's easy now. Some years back you either had to do a complicated setup or pay something like $2k for a certificate!


Hmm, yeah, doesn't look too bad: https://repost.aws/knowledge-center/cloudfront-serve-static-... Docs are confusing and you have to be careful to choose the option that's actually end-to-end HTTPS.


It's very easy these days. I use nginx + NixOS for hosting and it was one line to say use HTTPS.


I was talking about hosting a website in an S3 bucket.


What was your multi hundred million dollar company?


Unfortunately, it wasn't my hundred million dollar company.


I was checking whether this would be possible with GitHub, but querying the dumb protocol at https://github.com/<org>/<repo>.git/info/refs just returns the following:

  Please upgrade your git client.
  GitHub.com no longer supports git over dumb-http: https://github.com/blog/809-git-dumb-http-transport-to-be-turned-off-in-90-days


I just SSH in periodically, run a git pull, and it's done. I have a text file with the various things I need to do. It takes less than 2 minutes of work. I'm not yet convinced that automating this process would save me much time.


The cheapest solution I've found for static sites is an S3 bucket with CloudFront sat in front of it. Costs next to nothing.


Cheapest option I found is a Jekyll site deployed out of a GitHub repo (can be private) to Cloudflare S̶i̶t̶e̶s̶ Pages. No cost at all, even with a custom domain.


Cloudflare Pages, I think. I agree though, it costs nothing. They don't even have my card on file.

And nothing is free forever, but my sense is that this is going to stay free longer than most other free options.


Cloudflare Pages is great for static sites! I use it for a Hugo blog, and it "just works" - plus, deployments are really fast, like a few seconds.

It's all just so simple and fast that it reminds me of the days of SFTP'ing files to prod. Aaahhh.


Yep, CF Pages indeed! I agree with your sentiment. It always surprised me how much CF offers under their free tier and wouldn't shock me if they start pulling profitability levers and charging for many of these offerings.


Why add Cloudflare to the mix?


You're right. GitHub offers all of this at no charge, but your repo has to be public. Cloudflare allows you to use a private repository.


Or use GitLab and you can use a private repo


I see a ton of different paid solutions in these comments for hosting static sites. Why not just use GitHub Pages?

I've had stuff on the front page a handful of times and never had a complaint about slowness or errors; it's just a static site.


I have found Firebase Hosting to be quite solid for my static sites.

I usually have a ./sh directory which is like my "control panel" for the project, then I run ./sh/deploy.sh from the root which builds and does a `firebase deploy`.
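For reference, such a script can be little more than this (a sketch; the build command is a placeholder for whatever generates the static output):

  #!/bin/sh
  # ./sh/deploy.sh - build the site, then push it to Firebase Hosting
  set -e
  npm run build                  # placeholder build step
  firebase deploy --only hosting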

Firebase handles HTTPS and seems to have the lowest latency of the CDNs I have tried, even though they all claim to be the "fastest".

I prefer to have an explicit "button" to deploy rather than working with git commits.



It’s hugged to death now so I can’t re-load the page. But I think the author of this post went groveling into the objects/ directory for their contents. That will work fine until the repo is GC’d, and then those loose objects get moved into a packfile.


Why not just use Partial or Shallow Clones? [0]

[0] https://github.blog/2020-12-21-get-up-to-speed-with-partial-...


git clone --depth=1 is fast because it doesn't grab the full commit history, but OP is fetching a specific single file directly.

It is neat; I'm not aware of a git incantation that fine-grained. I hope it gets built into Git directly.
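The closest approximation I can think of is `git archive --remote`, which can pull a single path without a clone, but only if the server has upload-archive enabled (GitHub, for one, doesn't allow it). A sketch with placeholder host and path:

  # fetch one file from a remote, assuming the server permits git-upload-archive
  git archive --remote=ssh://git@example.com/repo.git HEAD path/to/file.html | tar -xO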


Ah, thank you for the clarification -- as far as I know that is indeed unique.

It's a thimbleful-sized shallow clone, one could say.


You can do a shallow clone and partial checkout.


Cool! Why you gotta tease me like this without busting out a one liner? :)

Gemini disagrees with you btw:

Unfortunately, directly combining git shallow clone and git partial checkout to grab just one specific file isn't straightforward. Here's why:

Shallow clone: This limits the downloaded commit history to a specific depth, but it still fetches all files involved in those commits. While reducing data, it wouldn't restrict files solely based on your needs.

Partial checkout: This lets you specify which files to include in your working directory, but it requires a full clone initially.

However, you have a few alternative approaches to achieve your goal:

1. Shallow clone + Sparse checkout:

Use git clone --depth=<commit_depth> --single-branch=<branch> to shallow clone the specific branch and limit history. Create a .git/info/sparse-checkout file containing only the path to the desired file. Run git read-tree -u to update the index based on the sparse-checkout file. This method downloads a limited history and only keeps the specified file in your working directory.

2. Partial clone with server support (limited availability):

Check if the server supports partial clones (currently implemented on Github and some Gitlab self-hosted instances).

Use git clone --filter=blob:none <url> for a "blobless" clone that only contains file content, no history or directory structure.

Add the desired file path to the .git/info/sparse-checkout file and run git read-tree -u as before.

This approach minimizes downloaded data but requires server support and won't work everywhere.
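FWIW, on a reasonably recent Git (2.25 or later) against a server that allows partial clone, something along these lines seems to combine all three; the URL, path, and branch are placeholders:

  git clone --depth=1 --filter=blob:none --no-checkout https://example.com/repo.git
  cd repo
  git sparse-checkout set path/to/dir
  git checkout main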


Because then some process on the server has to fetch/checkout the clone each time there's a push?

Whereas reading directly from the repo gets around that.


But then some process on the server has to fetch multiple files from the repository each time there's a request.

It seems this solution amounts to avoiding a bit of work when it is actually necessary (on website update) by instead doing a lot of work over and over.


FTA:

> This sounds like a lot of steps to serve a single file, but there's two key optimizations which can be made. The first is to cache the root tree's hash in memory, which skips two lookups right at the beginning. The root tree's hash will only change when the latest commit of the branch changes, so it's enough to cache it in memory and have a separate background process periodically re-check the latest commit.

> The second optimization is to cache tree objects in-memory using their hash as a key. The object identified by a hash never changes, so this cache is easy to manage, and by caching the tree objects in memory (perhaps with an LRU cache if memory usage is a concern) all round-trips to the remote server can be eliminated, save for the final round-trip for the file itself.

Also, the "background process periodically re-check the latest commit" seems like a bit of overkill if the repo is local; just caching and checking the mtime of `refs/heads/main` should be enough to decide whether the root tree needs re-reading.
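(For reference, the hash and mtime in question can be read with plumbing, assuming a local repo whose refs/heads/main is a loose ref rather than packed:)

  # the commit the branch points at, and that commit's root tree
  git rev-parse refs/heads/main
  git rev-parse refs/heads/main^{tree}
  # cheap change detection: mtime of the loose ref file (GNU stat)
  stat -c %Y .git/refs/heads/main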


I serve my cloned repos directly and remove access to . files with

  RewriteCond %{THE_REQUEST} ^.*/\.
  RewriteRule ^(.*)$ - [R=404]
in my .htaccess


I guess it wasn't such a good idea to do, huh?

  This site can't be reached
  mediocregopher.com took too long to respond.
  ERR_TIMED_OUT


Congratulations! That means you basically figured out how the clone procedure works and found a way to do it in a partial (though also unsafe) way. But it is a cool idea, nonetheless.

Also check out the Scalar [1] project and its predecessor, GVFS [2], both from Microsoft to manage their monorepo via a VFS layer.

[1]: https://github.com/microsoft/scalar

[2]: https://github.com/microsoft/VFSForGit


Why are you congratulating the author? There's no reason to be condescending.


What is unsafe about it?


I think it means that if you serve it to the public, a hacker might eventually find a way to enumerate your entire .git repo.


Then don't put sensitive material in a repo that you decide to essentially make public and serve to the whole world.


And then what happens?


Stevefan’s configuration of the website apparently



