Hacker News
Faster Gitlab CI/CD pipelines (nimbleways.com)
147 points by iduoad on Dec 9, 2021 | 58 comments

One improvement that would be nice in GitLab is an extension of / similar feature to the FF_USE_FASTZIP flag.

The default way to cache things is zip, which, no matter the compression, isn't a great fit; you really want a format that is streamable, like tar. When it's streamable you can download the tar and unpack it at the same time, rather than waiting for the full download before beginning to unpack. Obviously you would (normally) compress the tar with lz4, zstd, or similar; with this approach you should generally see a reduction in the total time to fetch and unpack the cache. node_modules benefits even more, since it generally has zillions of small files, so the unpack time can be quite high.

It'd be nice if GitLab supported this approach OOTB.
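To illustrate the streaming idea, here's a minimal sketch using gzip (chosen only because it's universally available; lz4 or zstd would be faster, and the cache URL in the comment is hypothetical):

```shell
# Sketch: a streamable cache format lets download and unpack overlap.
# Locally we simulate the network read with `cat`; in CI it would be
# something like `curl -s "$CACHE_URL" | tar -I zstd -x` (URL hypothetical).
set -eu

mkdir -p src/node_modules/pkg out
echo 'module.exports = 1' > src/node_modules/pkg/index.js

# Cache creation: tar is written sequentially, so it can be streamed.
tar -C src -czf cache.tar.gz node_modules

# Cache use: tar starts unpacking as soon as the first bytes arrive,
# instead of waiting for the whole archive to land on disk first.
cat cache.tar.gz | tar -C out -xzf -

test -f out/node_modules/pkg/index.js && echo "streamed unpack ok"
```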

Edit: An even better way, if your environment supports it, is to use a mountable image (esp. for node_modules) for your caches. This basically removes the entire unpack phase of making the cache available: instead of unpacking, you just mount the image and off you go.

On macOS this looks like a sparseimage attached via hdiutil; on Linux, a squashfs image with a writable overlay mount (you need a privileged container for this if running in a container). Since the OOTB gitlab.com runners are a root-user Linux VM (I think?), this approach should work quite well. SquashFS images are an especially great fit for node_modules, as they move the creation of the zillions of tiny files to cache-creation time rather than cache-use time. If you share the cache images via a hostPath mount or similar for existing images, you can effectively make caches available in 0 seconds (just mount an already-downloaded image and you're done).
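A rough sketch of the Linux squashfs variant (mount paths are made up for illustration; it needs squashfs-tools and root, so the script skips itself when run unprivileged):

```shell
# Sketch: pay the many-tiny-files cost once at cache-creation time
# (mksquashfs), then make the cache available near-instantly by mounting.
set -eu
status="skipped: needs root and squashfs-tools"

if command -v mksquashfs >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
  mkdir -p node_modules/pkg
  echo 'module.exports = 1' > node_modules/pkg/index.js

  # Cache creation: one image file instead of zillions of small files.
  mksquashfs node_modules node_modules.sqsh -noappend >/dev/null

  # Cache use: mount the image read-only, add a writable overlay on top.
  mkdir -p /mnt/nm_ro /mnt/nm_up /mnt/nm_work /mnt/nm
  mount -t squashfs node_modules.sqsh /mnt/nm_ro
  mount -t overlay overlay \
    -o lowerdir=/mnt/nm_ro,upperdir=/mnt/nm_up,workdir=/mnt/nm_work /mnt/nm

  test -f /mnt/nm/pkg/index.js
  umount /mnt/nm /mnt/nm_ro
  status="mounted cache ok"
fi
echo "$status"
```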

Thanks for the feedback. I passed it along to the pipeline authoring team.

> The default way to cache things is zip, which, no matter the compression, isn't a great fit; you really want a format that is streamable, like tar. When it's streamable you can download the tar and unpack it at the same time, rather than waiting for the full download before beginning to unpack.

You absolutely can stream-extract a zip file: the index at the end of a zip file is merely an optimization mechanism added to a format that is otherwise somewhat tar-like at its base. While it is theoretically possible to create files that decompress correctly only via one or the other data structure (see http://www.saurik.com/id/17), such a file wouldn't be compliant, and no one normally does this.

The only real tradeoff with a zip file vs tar+gzip is that the compression is per file. This means you gain random access (but don't lose streaming support) at the cost of being unable to share compression statistics across multiple small files (so the resulting archive can be a lot larger).

Yep, specially crafting a zip file can make it streamable, but that's not the general default. Can GNU unzip read from stdin? Not sure, but I don't think so?
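For what it's worth: Info-ZIP's unzip does want a seekable file, but libarchive's bsdtar walks the per-entry local headers sequentially and happily extracts a zip from a pipe. A quick check (guarded, since zip/bsdtar may not be installed):

```shell
set -eu
result="skipped: needs zip and bsdtar"

if command -v zip >/dev/null 2>&1 && command -v bsdtar >/dev/null 2>&1; then
  mkdir -p in zout
  echo 'hello' > in/a.txt
  (cd in && zip -q ../c.zip a.txt)

  # unzip seeks to the central directory at the end of the file;
  # bsdtar instead reads local entry headers in order, so stdin works.
  cat c.zip | bsdtar -xf - -C zout

  test -f zout/a.txt
  result="streamed zip extract ok"
fi
echo "$result"
```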

The latter should already be possible.

I believe you can specify a docker image to be used for running a job so you can bake things like node_modules into the image.


That's the approach I took when I realized how slow Gitlab CI was with node-based projects: https://dsebastien.net/blog/2020-04-12-speeding-up-your-ci-c...

That helped a whole lot, but it was still far from great.

Yep, true. Generally, baking caches into build images is a bit fragile though; the more caches and different types of jobs you have, the larger the set of image combinations you need to cover them. It often ends up easier to decouple the build image from the caches, and ideally calculate a cache key based on a yarn.lock or similar (not supported by GitLab).

Also, image pull performance is generally pretty poor from most registries; you will get much higher speeds out of a straight S3 download than from pulling an image via docker or containerd, even though ECR, for example, is backed by S3.

I haven't actually used OOTB GitLab caching in quite a long time because of these limitations, but it would be nice for it to work great OOTB!

I guess it depends on what you're building.

You can also leverage your Dockerfile to reuse the layers that haven't changed; for example, see the order of commands I offered here: https://news.ycombinator.com/item?id=29393532

In some projects, you can just do something like the following (Java centric example, but applies to other stacks as well):

  - have Nexus (or another solution) running with a Docker registry next to your GitLab Runner nodes, internal network
  - setup Nexus to proxy all of the Maven/NuGet/whatever packages as well, in addition to Docker images in the registry
  - setup GitLab CI to build the base container images that you need weekly (e.g. Maven with whatever you need on top of it for builds, another base JDK image as a runner etc.); this includes not just the images in your Dockerfile, but also those that you'll use for Runners (assuming Docker executor)
  - setup GitLab CI to build an intermediate base image for your project with all of the dependencies every morning (assuming that your project has lots of dependencies that don't change often though)
  - base your actual builds on that image, which is based on the Maven image, which will essentially cut out the need to pull everything from the Nexus Maven repo 
  - it will still allow you to add new dependencies during the day (e.g. new library), which will be pulled from the Internet and then cached both in Docker layer cache for your current pom.xml file, as well as your Nexus intermediate repo
You don't even need the intermediate dependency images, as long as you have control over which Runners your project has; with dedicated ones the Docker cache should be sufficient on its own. Right now I no longer have to fear Docker rate limits, and most of my builds don't hit the Internet at all.

I applied some of those principles at a large enterprise project and the builds went from about 7 minutes to 3, with no drawbacks to speak of. Of course, the next step would be incremental code compilation and not wasting time there, however seeing as even some of the above is overkill for many scenarios, that's probably situational.
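A sketch of what the scheduled image builds above might look like in .gitlab-ci.yml (job names, Dockerfile path, and `mvn` usage are hypothetical; `$CI_REGISTRY_*` are GitLab's predefined variables):

```yaml
# Rebuild the dependency base image on a schedule (e.g. every morning),
# so daytime builds start from an image with most packages pre-fetched.
build-deps-image:
  image: docker:24
  services:
    - docker:24-dind
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/deps:latest" -f Dockerfile.deps .
    - docker push "$CI_REGISTRY_IMAGE/deps:latest"

# Daytime builds base themselves on the prebuilt image.
build:
  image: $CI_REGISTRY_IMAGE/deps:latest
  script:
    - mvn package   # mostly hits the baked-in local repo
```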

Ah ok, sounds like a good setup. I could never get past the slow image pull times for systems like that, though; maybe I missed a setting. containerd especially was quite slow at pulling images: an image of ~2 GB could take a couple of minutes to come down, an order of magnitude more than a straight S3 download. Once you've got your images it's fine, but if your build nodes are elastic you can easily hit cold/new nodes and pay the penalty.

> Once you've got your images it's fine, but if your build nodes are elastic you can easily hit cold/new nodes and pay the penalty.

If you have your own Nexus or Artifactory, or even just a registry that's configured as a pull through cache (https://docs.docker.com/registry/recipes/mirror/), then it shouldn't be a problem.
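For reference, the vanilla registry's pull-through mode is just a small fragment in its config.yml (per the linked docs; credentials are optional):

```yaml
# registry config.yml: act as a mirror / pull-through cache for Docker Hub
proxy:
  remoteurl: https://registry-1.docker.io
  # username/password can be set here too; authenticated pulls
  # get higher Docker Hub rate limits
```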

If those two servers are in an internal network, even better; otherwise just get a VPS with a good port speed. Honestly, even 100 Mbps moves 2 GB in under three minutes, and with 1 Gbps that time drops to roughly 16 seconds: https://techinternets.com/copy_calc

The difference here would be that you're using your own software on your own servers and therefore can deal with the load that you generate yourself with the full allotment of resources that you have, vs having to rely on public registries that are used by tens of thousands of other developers/processes at the same time.

That's not that different from S3 either (apart from maybe more capacity being available to you on private buckets, depending on the vendor), since you can also use something like MinIO or Zenko on your own servers as well instead of relying on the cloud.

The dependency proxy in GitLab can help with caching the Docker images.


When Docker Hub introduced rate limits last year, one way to mitigate its impact was to make the dependency proxy available for everyone.



Another way can be maintaining your own group which maintains all base images for your environment, and stores them in the GitLab package registry.

Using your own images can help enforce security policies, avoiding containers introducing vulnerabilities (e.g. when always pulling the :latest tag, or sticking to old tags). Reminds me of Infrastructure as Code security scanning, which can help detect things like this. Played with it when released in GitLab 14.5, examples in https://gitlab.com/gitlab-de/playground/infrastructure-as-co... + https://gitlab.com/gitlab-de/playground/infrastructure-as-co...

Depending on the languages and environments involved, builder images might be needed to reduce the complexity of installing build tools. Like, C++ with gcc and controlling the exact version being used (major versions may introduce ABI-breaking changes; I've seen it with armhf and gcc6 a while ago). Builder images also reduce the possibility for users to make mistakes in CI/CD before_script/script sections. An optimized pipeline may just include a remote CI/CD template, with the magic happening in centrally maintained projects using the builder images.

Another thought on builder images - multi arch images with buildx. I've read https://medium.com/@tomwillfixit/migrating-a-dockerized-gitl... yesterday, need to learn more :)

Great insights, thank you KronisLV.

I believe there are ongoing discussions on supporting other cache/artifact encryption/archiving options.


The mountable images for cache is a great idea, thank you

Great post, thanks for sharing. We should link that in the Pipeline Efficiency docs: https://docs.gitlab.com/ee/ci/pipelines/pipeline_efficiency....

I've given a talk about similar ideas for efficient pipelines at Continuous Lifecycle, the slides have many URLs inside to learn async: https://docs.google.com/presentation/d/1nq7Q4WMv6rQc6WFJCRqj...

And if you want to dive deeper, there is a free full-day workshop with exercises to practice config, resources, caches, container images and more. I created it for the Open Source Automation Days in early October.

Slides with exercises: https://docs.google.com/presentation/d/12ifd_w7G492FHRaS9CXA...

Exercises+solutions: https://gitlab.com/gitlab-de/workshops/ci-cd-pipeline-effici...

I did not have time yet to write a blog post sharing more insights on the exercises, but they should be self-explanatory from the slides, with solutions in the repository. Let me know how it goes, feel free to repurpose for your own blog posts, and please send documentation updates :)

> We should link that in the Pipeline Efficiency docs

I've shared the resources and this HN topic with GitLab's technical writing team to make tutorials more visible on docs.gitlab.com


I went ahead and blogged about the workshop, thanks y'all for the inspiration :)


If you learn a new trick or gem, please blog and share with our community :-)

Overview of the topics inside the workshop:

- Introduction: CI/CD meets Dev, Sec and Ops

- CI/CD: Terminology and first steps

- Analyse & Identify

- Learn to use the GitLab CI Pipeline Exporter to monitor the exercise project throughout the workshop.

- Efficiency actions

- Config Efficiency: CI/CD Variables in variables, job templates (YAML anchors, extends), includes (local, remote), rules and conditions (if, dynamic variables, conditional includes), !reference tags (script, rules), maintain own CI/CD templates (include templates, override config values), parent-child pipelines, multi project pipelines, better error messages to fix failures fast

- Resource Use Efficiency: Identification, max pipeline duration analysis, fail fast with stages grouping, fail fast with async needs, analyse blocking stages pipeline (solution with needs), matrix builds for parallel execution (practice: combine matrix and extends, combine matrix and !reference), extends merge strategies (with and without !reference)

- CI/CD Infrastructure Efficiency: Optimization ideas, custom build images, optimize builds with C++ as example, GitLab runner resource analysis (sharing, tags, external dependencies, Kubernetes), local runner exercise, resource groups, storage usage analysis, caching (Python dependency exercise, including when:always on failed jobs)

- Auto-scaling: Overview, AWS auto-scaling with GitLab Runner with Terraform, insights into Spot Runners on AWS Graviton

- Group discussion

- Deployment Strategies: IaC, GitOps, Terraform, Kubernetes, registries

- Security: Secrets in CI/CD variables, Hashicorp Vault, secrets scanning, vulnerability scanning

- Observability: CI/CD Runner monitoring, SLOs, quality gates, CI/CD Tracing

- More efficiency ideas: Auto DevOps, Fast vs Resources, Conclusion and tips

So much to learn here, thank you for sharing!

Great article, wish I had something like that 3 years ago.

Adding my personal tips:

- Do not use GitLab-specific caching features, unless you love vendor lock-in. Instead, use multi-stage Docker builds. This way you can also run your pipeline locally, and all your GitLab jobs will consist of "docker build ..."

- Upvote https://gitlab.com/gitlab-org/gitlab-runner/-/issues/2797 . Testing GitLab pipelines should not be such a PIA.
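A sketch of the "every job is a docker build" idea (job and target names are made up; it assumes a multi-stage Dockerfile with `test` and `release` targets):

```yaml
# Each CI job just invokes docker build against one multi-stage target,
# so the exact same commands run identically on a laptop.
test:
  script:
    - docker build --target test .

release:
  script:
    - docker build --target release -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
```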

In a previous life, I set up CI runner images (Amazon AMIs) that had all of our Docker base images pre-cached, and ran a custom Docker cleanup script that excluded images with certain tags. This meant that a new runner would be relatively quick off the blocks, and get faster as it built/pulled more images.

You can get better cache hit rates by tagging your GitLab runners and pinning projects to certain tags.

Also this: https://medium.com/titansoft-engineering/docker-build-cache-...

... Sharing cache on multiple hosts using buildkit + buildx

I use GitLab's private registry + scheduled pipelines to prebuild our base images, but that's definitely some extra spice. Thanks for sharing!

Great tips, thank you!

> Instead, use multi-stage Docker builds. This way you can also run your pipeline locally, and all your GitLab jobs will consist of "docker build ..."

There's a section in the pipeline efficiency docs with more tips and tricks for optimizing Docker images: https://docs.gitlab.com/ee/ci/pipelines/pipeline_efficiency....

One point worth of note when it comes to Docker caching, more specifically pulling images, is the rate-limiting on Docker Hub.

While hosted GitLab might make use of a transparent pull-through cache (as I've gathered from glancing at relevant parts of the docs), you can benefit a lot by using one with your own local GitLab instance (assuming it does not already provide it via container registry).

We ended up switching to Harbor[1] from the vanilla registry and almost by chance stumbled on the fact that it supported a pull-through cache from various other sources (including Docker Hub).

This was especially useful after we hit the rate-limit after one of our pipelines got out of hand and decided to rebuild every locally hosted Docker image (both internal and external).

[1]: https://goharbor.io/

It’s hard to know whether to cache CI or not.

On one hand, without the cache builds can be very slow.

But on the other hand, in a lot of projects you'll see random commits like "blow away corrupted cache", which makes you wonder whether building the cache from scratch is an important step of reproducible builds. I'd personally rather let the builds run longer and be absolutely certain.

Maybe there's a good middle ground for dev commits vs final merge commits, but unfortunately there's no machinery in e.g. GitHub to specify a commit as final before merge.

Yep, we do a scheduled daily build with all caching off, to test that everything works without caching; it's also a good way to rebuild your caches without old dependencies/files etc. A good backstop.

NixOS Gitlab Runners are quite nice in this regard. Caching for "free" (cost of admission: learning Nix)

Finally found a reason to learn Nix, thank you!

It is a lot of learning - you have to both run gitlab-runner on NixOS _and_ Nixify your project to get caching.

If you climb the learning curve that far though - you'll be hooked.

True, CI caches are only one way to look at slow pipelines.

When you are in control of the infrastructure where jobs and runners consume resources, other strategies might also help.

Package dependency installs taking some time? Move the runners into network segments with blazing fast connections, next to mirrors of package registries (if not already provided).

Possibility to use a CDN in front of the runners? Make sure the application code is capable of doing so.

C/C++ with ccache can consume a lot of disk space, and caching itself can slow down the pipelines. Calculate the cost of pipeline runs and employees waiting for builds to finish, and compare it to buying more compute resources with xyz cores, lots of memory and direct SSD disk access to speed up the builds.

That said, many different environments and workflows make decisions harder. Better observability for CI/CD is needed, having better insights into pipelines and see if and how efficiency improvements can have an impact. I have shared more thoughts in https://news.ycombinator.com/item?id=29520577

I remember there is an empty-cache button in the GitLab UI.

Yep, in the pipeline view at the top right.


I agree, tracking down a failed CI build that ends up being a cache problem is much more frustrating to me than waiting a little longer for the build.

The attention to pipeline speed is great; I don't think I have seen anything this detailed before.

Having implemented a pipeline cache that reduces wall time by 15 minutes for each execution, I wouldn't want to go back to no cache.

But it's important to be resilient to a faulty cache. If that is hard to achieve, then it makes good sense to avoid caching.

We cache builds on my work c++ projects. A clean build would be 2+ hours, using cached artifacts is about 5 minutes.

Can you share more details about the size of the project (files count, build file size) and which caching settings you are using? I assume it is ccache and am curious how you use it.

It's a ue4 based game project. Unsure of the size offhand, sorry (and it's Sunday, so I'll check tomorrow).

> I assume it is ccache and am curious how you use it.

Nope. We just use the same machines for the same jobs (we have 3 agents, and we use teamcity which lets us pin our configurations to agents), and incremental builds. Really basic stuff. The machines are managed by terraform and packer and we rebuild the images once a month or so, so semi frequently pay the full compile hit. I am going to look into actual persistent caching for Mac very soon, it's likely we'll go with fastbuild!

It's not perfect but you could have a word in the commit message that the pipeline looks for and acts upon, so default using cache but let you not use it with "NO-CACHE: <message>"

I had thought about this as an option. It's a good idea, but maybe it's not promoting a good separation of concerns between the CI machinery and the source control system.

Great idea for a DevOps role, but the average developer will most likely not be aware of these features.

Really? I assumed developers managed that bit of the CI pipeline

Most developers here are focused on business related development, to the point that many technical details are forgotten.

It might be different in other companies though.

Would a two step process work there? A staging branch which is always built from scratch and which is then merged into main? Or main & then tagged commits for releases?

I work in games, mostly Unity but also backend services and such. I haven't had a lot of success with GitLab caches. In my experiments, it's usually faster and less error-prone to simply not use the GitLab shared cache. Most caching benefits seem to come from proper .gitignore usage and forgoing the network and file IO cost of GitLab's cache system.

What kinds of things should be cached? What kind of success are others seeing?

My impression is that Github actions are more convenient, as jobs are split into steps, and steps share the filesystem state

Not so convenient if your code is on GitLab.

GitLab team member here - thanks for this write up! Great to see your thought process throughout.

+1 - thanks for sharing!

Serious question: what's the energy/resource usage tradeoff for CI? When are we burning too many resources on pointless testing and heating data centers?

I'm not saying CI is bad, but where is the threshold where it becomes wasteful, how big should the tested configuration matrix be?

I see this part:

    key: yarn.lock
Is this going to make it start without a cache every time that yarn.lock changes? Isn't that a bit overkill? Normally `yarn install` only downloads updated packages.

Cache keys are unique identifier names for a cache; otherwise a global cache ('key: default') is used by all jobs.



The key identifier can also be the job name, the branch as commit ref, or something else unique for this job, or project pipeline. The example in the blog post could also use

    key: yarn-cache-$CI_COMMIT_REF_SLUG

to better reflect its purpose.

GitLab 13.11 added support for multiple cache keys per job: https://about.gitlab.com/releases/2021/04/22/gitlab-13-11-re...

If you are using a monorepo, or work with submodules and different package systems, you may have Python, Ruby, and NodeJS in the same CI job. Previously, a cache definition took all 'path' entries as a single list, using the same global cache.

Specifying multiple keys with different path locations allows you to keep caches separate, and as such get better performance for each specific job. Some jobs may not need NodeJS, and can specify only the Ruby cache key, for example.

In case you'd like to invalidate the cache every time a specific file (yarn.lock, go.sum, etc.) changes, you can explicitly configure this behavior using cache:key:files https://docs.gitlab.com/ee/ci/yaml/index.html#cachekeyfiles
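In .gitlab-ci.yml that looks roughly like this (the cached paths are illustrative):

```yaml
# Cache key derived from the lock file: the cache is rebuilt only
# when yarn.lock actually changes.
cache:
  key:
    files:
      - yarn.lock
  paths:
    - node_modules/
```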

This can help prevent corrupted caches, e.g. having downloaded older packages which are stale and not used by current dependency trees. Your code still optionally imports them, and jobs fail because of the old dependency. You cannot reproduce the problem in your dev environment, though, starting with a fresh container and no caches. I have been debugging these things before; it takes a while to identify local job caches as the culprit. That said, I'd suggest going with a little less performance gain and invalidating caches when dependencies change, if that makes sense for a package manager with often-changing recursive dependencies. I've seen it with Python.

Tip for failing jobs: by default, the caches are not saved, meaning that a large pip install remains volatile even if only the user-defined unit test command failed afterwards.

To avoid a slow down in the pipeline, you can use cache:when:always to always save the cache. https://docs.gitlab.com/ee/ci/yaml/index.html#cachewhen
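For example (job name and paths are hypothetical):

```yaml
# Save the pip cache even when the unit tests fail, so the next run
# doesn't have to re-download every package.
test:
  cache:
    key: "pip-$CI_COMMIT_REF_SLUG"
    paths:
      - .cache/pip
    when: always
  script:
    - pip install -r requirements.txt --cache-dir .cache/pip
    - pytest
```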

Exercises to learn with Python are at slide 109 https://docs.google.com/presentation/d/12ifd_w7G492FHRaS9CXA...

Good information!

> In case you'd like to invalidate the cache every time a specific file (yarn.lock, go.sum, etc.) changes, you can explicitly configure this behavior using cache:key:files https://docs.gitlab.com/ee/ci/yaml/index.html#cachekeyfiles

They're using cache:key:files, so it will install all of them each time yarn.lock changes. When a build is triggered where yarn.lock hasn't changed, it does a build without downloading all the packages.

Come to think of it, builds don't always run in chronological order, so it could wind up with extra packages. Yarn has autoclean, but its docs say to avoid using it. NPM seems to be quite OK with it, though: https://docs.npmjs.com/cli/v7/commands/npm-prune

I think caching two folders - one that contains the downloads and one that contains the installed packages - could be the way to go. Yarn and npm have caches to prevent downloading files. And maybe only cache the downloads on the main branch.
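A sketch of that two-folder idea using Yarn 1's --cache-folder flag (the folder names are made up):

```yaml
# Cache both the download cache and the installed tree; yarn reuses
# downloads from .yarn-cache even when node_modules must be rebuilt.
install:
  cache:
    key:
      files:
        - yarn.lock
    paths:
      - .yarn-cache/
      - node_modules/
  script:
    - yarn install --cache-folder .yarn-cache
```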

Thank you for the great thoughts :)

> And maybe only cache the downloads on the main branch.

$CI_COMMIT_REF_SLUG resolves to the branch when executed in a pipeline. Using it as the value for the cache key, Git branches (and related MRs) use different caches. It can be one way to avoid collisions, but requires more storage for the multiple caches. https://docs.gitlab.com/ee/ci/variables/predefined_variables...

In general, I agree, the more caches and parallel execution you add, the more complex and error prone it can get. Simulating a pipeline with runtime requirements like network & caches needs its own "staging" env for developing pipelines. That's a scenario not many have, or might be willing to assign resources onto. Static simulation where you predict the building blocks from the yaml config, is something GitLab's pipeline authoring team is working on in https://gitlab.com/groups/gitlab-org/-/epics/6498

And it is also a matter of insights and observability: when the critical path in the pipeline has a long max duration, where do you start analysing, and how do you prevent this scenario from happening again? Monitoring with the GitLab CI Pipeline Exporter for Prometheus is great; another way of looking into CI/CD pipelines can be tracing.

CI/CD Tracing with OpenTelemetry is discussed in https://gitlab.com/gitlab-org/gitlab/-/issues/338943 to learn about user experiences, and define the next steps. Imho a very hot topic, seeing more awareness for metrics and traces from everyone. Like, seeing the full trace for pipeline from start to end with different spans inside, and learning that the container image pull takes a long time. That can be the entry point into deeper analysis.

Another idea is to make app instrumentation easier for developers, providing tips for e.g. adding /metrics as an http endpoint using Prometheus and OpenTelemetry client libraries. That way you not only see the CI/CD infrastructure & pipelines, but also user side application performance monitoring and beyond in distributed environments. I'm collecting ideas for blog posts in https://gitlab.com/gitlab-com/marketing/corporate_marketing/...

For someone starting with pipeline efficiency tasks, I'd recommend setting a goal - like shown in the blog post X minutes down to Y - and then start with analysing to get an idea about the blocking parts. Evaluate and test solutions for each part, e.g. a terraform apply might depend on AWS APIs, whereas a Docker pull could be switched to use the Dependency proxy in GitLab for caching.

Each environment has different requirements - collect helpful resources from howtos, blog posts, docs, HN threads, etc. and also ask the community about their experience. https://forum.gitlab.com/ is a good spot too. Recommend to create an example project highlighting the pipeline, and allowing everyone to fork, analyse, add suggestions.

I think it would be amazing if GitLab CI allowed sending CI pipeline traces to an OTLP endpoint; I could then decide via the OTel Collector where I want to send the trace spans, e.g. Google Trace or Jaeger etc.

This is an excellent article. How does GitHub CI/CD compare to this?

Here’s an example of how to enable caching in GitHub Actions:


The example is a bit Erlang specific but the cache action it contains is quite generic.
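A generic actions/cache sketch along those lines (the path and key here are hypothetical, using an Erlang/rebar3 flavor to match):

```yaml
# GitHub Actions workflow step: cache keyed on the lock file's hash
- uses: actions/cache@v3
  with:
    path: ~/.cache/rebar3
    key: ${{ runner.os }}-rebar3-${{ hashFiles('**/rebar.lock') }}
    restore-keys: |
      ${{ runner.os }}-rebar3-
```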

As far as I know, you can do pretty much the same thing with GitHub CI/CD (GitHub Actions).

Is there any good way to cache a built container and use it as the image for later steps?

I think that caching files is much less effective than caching a built container which already has dependencies installed. For tools like apt and pip, the time it takes to "install" can be longer than the time it takes to "download".

Just push it to some image registry (e.g. the one that comes with GitLab), and then use it as the image for the next job.

You can also use tags to enforce that the same runner runs both jobs, so that pulling the image becomes instant.
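Put together, the two jobs might look like this (job names, tag, and make target are hypothetical):

```yaml
# Build the dependency-laden image once, push it, and run later
# stages inside it on the same runner.
build-image:
  tags: [my-runner]        # pin to one runner so the later pull is a local hit
  script:
    - docker build -t "$CI_REGISTRY_IMAGE/ci:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE/ci:$CI_COMMIT_SHA"

test:
  tags: [my-runner]
  image: $CI_REGISTRY_IMAGE/ci:$CI_COMMIT_SHA
  needs: ["build-image"]
  script:
    - make test
```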

Shared this thread into a new community forum topic for valuable resources: https://forum.gitlab.com/t/ci-cd-pipeline-efficiency-resourc...
