Set a global timeout for your jobs. Seriously. Think you don't need one? You're wrong. Set a global timeout for your jobs. Whoever pays that bill will thank you later. Private actions don't give a damn if `setup-node` is taking a whole hour to install Node. They don't care if your hosting service is having trouble and runs a deployment for 5 hours. You will be billed for that, and it adds up. Set a global timeout for your jobs.
Is there a way to set a global timeout now? Last I checked, you could only set them per job, i.e. you have to copy-paste "timeout-minutes: 15" into every single job in every single workflow in every single repo, and hope you didn't miss one. That's been the case for years[1] and a quick search shows it's still true[2].
The cynic in me thinks they like having the extra revenue.
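For reference, the per-job workaround looks like this; the 15-minute cap and the script path are just illustrative:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    # Must be copy-pasted into every job; without it the default cap is
    # 360 minutes (6 hours) of billable time.
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v3
      - run: ./scripts/build.sh   # hypothetical build script
```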
This is a problem of misaligned incentives, not just in GitHub, but pretty much every CI provider out there.
In fact, the incentives are diametrically opposed: almost every one of them makes more money when our builds take longer to run, regardless of the reason. So they are financially disincentivized to build anything that could make builds faster, or even to make it easier to limit their duration, as is the case here. When improvements do happen, it's a rare triumph of genuinely well-intentioned people, customer demand, and competitive pressure over the demand for financial returns that every company eventually has to come to terms with, and that isn't sustainable over the very long term.
The ones that let us host our own runners at least offer an escape hatch: we can opt out of the diametrically opposed incentive structure and back into a still-not-aligned but neutral one. But then we give up many of the benefits of CI as a SaaS and have to spend engineering hours building and maintaining our own build infrastructure at significant cost.
Let's not forget that traditional CI is already a commodity: providers sell us dumb CI minutes, we spend our own engineering hours building deployment and testing solutions on top of them, and eventually we sink entire full-time engineering teams' worth of hours into fighting the natural tendency of these systems to get slower as we add more code and people.
I believe the solution is deployment and testing platforms tailored to specific technologies, meticulously engineered to be so ridiculously fast that they can reasonably be offered as an all-you-can-eat plan for a fixed monthly price per seat, instead of the industry-standard usage-based pricing of traditional CI providers. This aligns incentives much better: slow builds hurt the provider's bottom line as much as they hurt customers' engineering productivity. On the flip side, it financially incentivizes the provider to keep making the system faster, since faster builds mean they can serve more customers on the same hardware and pocket the difference as profit.
Shameless plug: I've been building one of these platforms at https://reflame.app. Reflame can deploy client-rendered React web apps in milliseconds, fast enough to make deploying to the internet feel like local dev.
For me, act works fairly well, though it isn't exactly the same as Github. Matrix builds didn't work properly (though I just noticed that has been fixed), and the base images aren't quite the same.
Github should send a bunch of money to the act developer - I know I wouldn't have used Github actions at all without act existing, I'm sure other people must be in the same situation. (Though I'm not paying Github either, so perhaps I'm not a target customer...)
Like many things, the more complex the workflow, the more useful running it locally becomes.
I've found it especially useful for fixing complex workflows or working with custom actions. It's not strictly needed, but it does speed up your workflow once you figure out the kinks.
Unfortunately, it only supports running single jobs. More complex tasks that require dependencies, variables, or job creation context (MR, Trigger, Web, etc.) can't be tested.
I believe pull requests from forks are not triggered by default because some people were using this to mine cryptocrap on CPU using the quotas of other projects.
Yeah, while this restriction makes sense, it does mean that you have to jump through a few weird hoops to get fork-based workflows to allow commenting on pull requests from the action.
I've been waiting since 2013 for crypto to have a facebook-level contribution but it's still just the same crap. Every 2-3 years, the world remembers crypto exists and they funnel money into it only to be fleeced. I've yet to see one piece of tech from the crypto craze that has substantially altered peoples' lives besides losing most people money.
If you're looking to get the jump on a new piece of tech, it's probably too late for crypto anyways. My mom knows about it. lol
Facebook substantially altered lives: it has been used for genocide and recruiting insurrectionists, and it ruins mental health -- and nobody here is wringing their hands about working on the broader ecosystem. Engineers who worked to build out DTC brands aren't considered lepers.
It feels like a double standard and pulling the ladder up, so even if the next generation of startup programmers get work, they'll be "tainted" by starting in crypto. I am in a sophisticated niche of the industry just like many readers here are in their comfortable niche of their industries and reasonably distant from the most harmful practices.
Maybe I just need to accept that SV programmers don't like to look in the mirror and see that most of our work is ultimately as pointless/destructive as crypto but with a couple extra steps.
This comment is so confusing. You first seemed to take offense at the term "cryptocrap" but then here call crypto "pointless/destructive." Facebook is obviously mostly garbage these days (IMO) however I think it's hard to argue that it provides no utility to people. My family members are spread out across four different countries and mostly stay connected via Facebook. I'd argue that Facebook provides more utility to people than crypto at this point. That says a lot more about crypto than it does about Facebook.
The tech industry is like the Wild West: you can do (almost) anything you want! It's big enough that there will always be opportunities in our lifetime for tech workers to make money positively contributing to society. On the other hand, there's plenty of funding for exploitative/harmful applications. It doesn't matter what other people are doing or have done; your choice and your integrity belong to you.
When picking a potential employer, programmers often think "what about Google?" or "what about Facebook?" or "What about the hottest a16z startup?". When those programmers then compare/contrast the behavior of the companies they're considering, it is not considered a logical fallacy, but "making a list of pros/cons."
If I am going to trade my time and energy for money, I think it's fair to compare my options, and not an act of irrationality.
You're completely missing the point of my comment. You say "crypto is not as bad as facebook" which is a textbook whataboutism. "crypto" isn't as bad as the Shoah, so what?
Cryptocurrencies are almost as old as the smartphone and yet not one actual application has emerged.
They aren't a good hedge against inflation, as their prices collapsed when inflation re-appeared. They aren't a good vehicle for transactions, which is why crypto transactions occupy less than 0.1% of the world's financial transactions. And scams and ripoffs are rife.
I think your view of crypto might be skewed by this. I think working in coal or surveillance tech or fracking is bad, and that isn't changed by the fact that it is someone's best chance at getting rich from it.
Note that most programmers didn't get rich out of social media or the boom from the last decade: maybe 100,000 out of 20 million programmers became millionaires, and most of the rest had generally middle-class salaries.
The UI for seeing logs is driving me insaaaane.
It's extremely slow and sluggish. Sometimes you have to refresh the page to see the actual latest line of the log.
No it doesn't. The browser's built-in search doesn't work since the page loads lazily as you scroll. Searching for a term that is not on the screen will not work.
* Scheduled actions basically never run anywhere close to on schedule. If you schedule something to run every 13 minutes, it may just run 1-3 times an hour, with random 30-minute to 1-hour waits between executions (see the sketch after this list).
* Triggering a workflow as a result of an action from another workflow doesn't work if you're using the GITHUB_TOKEN as part of the action. Github does this to prevent accidental recursion, but it forces you to either use insecure PATs or rearchitect how to handle chained events: https://docs.github.com/en/actions/using-workflows/triggerin...
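To illustrate the first point, a scheduled workflow is written like this, but the cron spec is best-effort only (the task script is hypothetical):

```yaml
on:
  schedule:
    - cron: '*/13 * * * *'   # "every 13 minutes" on paper; in practice runs drift and skip

jobs:
  poll:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - run: ./scripts/poll.sh   # hypothetical task that assumes it runs on time
```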
Yet another pitfall: Changing the system clock on runners can throw off billing and calculation of used minutes. Colleague of mine told me about that one last year.
I miss the days of setting the clock/date to avoid time bombs in software builds. "back in my day, things were so much easier!" Now I would not be surprised if the teams working on these kinds of lockouts are larger than the teams building the product.
Potentially, maybe? I didn't follow it closely enough, but the crux was a group trying to test out various time-related calls, and they'd set the base time of the containers to different times ahead of and behind 'now'. There were hundreds/thousands of minutes 'billed', going over the account threshold and stopping all builds. But... I don't remember hearing if there was some difference between 'negative' charges and 'positive' charges.
Because sometimes you want to change it to a date many months or years in the future. That kind of thing as a test to make sure nothing goes wrong based on the date, like Y2K/Year 2038, or malicious code that's designed to lay dormant until some point in the future.
That is a very naive way to go about that, I think, for reasons exactly like jacobyoder discovered. It seems to me that it would be much better to write code which allows you to override the time seen by your software, by setting some delta specified in an environment variable or something, or by having test cases which test your logic without requiring that the notion of the current time be changed.
Changing the current time will do unknown things to other software running on the system, and there are too many question marks there for that to seem reasonable, to me.
"But you can't really know what it will be like if you just change an env in one area - what if you miss something? Or what it someone changes some code later that doesn't respect the env var?"
"well... that's what tests are for, right? We'll write more tests to ensure usages of these timezone/date/time specific areas are covered".
"Someone will miss something. I missed something years ago and it was bad. We can't rely on that. We have to change the actual server instance clock to ensure we can verify this behaviour works correctly".
When I pointed out that what we were testing was a scenario that wouldn't ever really exist in real life - mobile clients not using 'real' time (they're almost all synced to satellites and time servers you typically can't override) hitting a server that also may be set to a vastly different time (not synced to a timeserver with a few ms drift).... I was dismissed as "not understanding the problem".
There's a need to ask "how might this perform in 8 months, when DST changes?" But... you'd set both client and server to the same future (or past) time. This was setting up a server to be days/weeks in the future, then testing 'now' clients against it, which "worked" somehow, but IMO was just giving false assurance that a hypothetical scenario worked, one that basically couldn't happen in real life under normal conditions.
Don't rely on public infrastructure of unknown usage to do scheduled tasks if precise execution time is required. Public GitHub runners offer few guarantees.
I welcome their anti-recursion measures because I fought in the recursive clone wars and no one should have to support any systems that allow that.
I'm not talking about using this on a free tier or something. Github actions are billed monthly. This goes way beyond just not having a tight SLA. Precision isn't even the ask here. It's one thing if a job scheduled to run every 10 minutes occasionally takes 12 or 13 in between runs. It's a completely different matter if it takes an hour.
Having some safeguards against unbounded recursion is one thing, but the escape hatch for it right now is to use less secure credentials. That's just madness.
The biggest pitfall I see is people inadvertently making them "smart" which causes massive headaches when debugging them.
As much intelligence as possible ought to be pushed down to and tested and debugged on the script level so that you're left with little more than a linear sequence of 4-5 commands in your YAML.
The debugging tooling on github actions is, frankly, abysmal. You need a third party ngrok action to even run ssh.
+1. The logic in any CI server should be "call the build script". This makes it so much easier to debug failures, and easy to switch to another CI setup when the current director of IT quits and a new one comes in and forces everyone to use his/her favorite CI server.
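In other words, the whole workflow can stay close to this kind of sketch (script names are made up):

```yaml
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # All real logic lives in version-controlled scripts, so it can be
      # run and debugged locally and carried to the next CI server intact.
      - run: ./ci/build.sh
      - run: ./ci/test.sh
      - run: ./ci/package.sh
```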
Absolutely. The problem is when you have a lot of time-consuming steps and you only want to re-run the failed one with a slight change and then continue where it left off. Make can do that of course, but you need to make Make do it, and save/restore a workspace/artifacts. I haven't done that in GHA and GHA has lacked a lot of core ci/cd functionality for a long time, so I don't know if it's possible in GHA.
Use steps that only cache their final artifact on success, plus an if condition on the skippable steps so they only run if the artifact doesn't exist. I'm using this to prevent re-running builds when a successful build for the same SHA already exists, for instance.
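Roughly what that looks like with actions/cache keyed on the commit SHA (the path and build script are illustrative):

```yaml
steps:
  - uses: actions/checkout@v3
  - id: build-cache
    uses: actions/cache@v3
    with:
      path: dist/                    # illustrative artifact directory
      key: build-${{ github.sha }}   # one cache entry per commit
  # actions/cache only saves in its post step when the job succeeds, so a
  # hit implies a previous successful build for this SHA and we can skip.
  - if: steps.build-cache.outputs.cache-hit != 'true'
    run: ./ci/build.sh
```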
Dear GitHub actions. I want &pointers so I don’t have to repeat myself.
Also, I like that you build the hypothetical merge of branch + main. But that commit SHA is gone after that successful build. Give me a way to track this. I need to store artifacts related to this build, as I don’t want to build those again!
They do seem to be capable of saving most things people call artifacts. If you're looking for something more along the lines of caching parts of the build for future builds, you can adjust that pretty easily by changing what the cache key is based on.
The one note here is that clearing that cache / cache management isn't straightforward currently (although they are improving it); there are a few acceptable workarounds, though.
Also a fun one is the "on" key that specifies when the workflow should run. "on" is magic in yaml, and some implementations will convert it to the string (!) "True" when it occurs as a key (I'm not talking about values here). This was a bit confusing when I tried to replace a hand-written yaml with a generated json ... They were identical, except for the on/True key. It's still not clear to me whether this is according to yaml spec or not, but in any case a json "on" key does work ... So I wonder, does GitHub Actions internally look for both "on" and "True"?
You might be getting the string True because in Yaml 1.1 the scalars “y”, “n”, “yes”, “no”, “on”, and “off” (in all their casings) are Boolean literals.
I believe YAML supports non-string keys, so your key would be parsed to the corresponding Boolean value (true); if the pipeline then goes through JSON, where only string keys are supported, the serialiser could simply stringify the key rather than raise an error, leading to "True".
And that’s one of the billion reasons why barewords are bad.
I think this has been fixed in Yaml 1.2, but there’s a lot of Yaml 1.1 libraries out there, and they can’t just switch since they could break user code.
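To make the quirk concrete, here's roughly what a YAML 1.1 parser can do with it:

```yaml
# What you write at the top of a workflow:
on:
  push:
    branches: [main]

# What a YAML 1.1 parser may hand downstream, because a bare `on` is a
# boolean literal in 1.1 and the key then gets stringified for JSON:
#
#   { "True": { "push": { "branches": ["main"] } } }
#
# Quoting the key ("on":) or generating JSON directly sidesteps this.
```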
I once worked on a project where the input was YAML config files and a lot of different programs would read/write the files. Every different parser had at least one implementation-specific quirk. Often we would run into the quirks because someone edited the YAML by hand, and one parser would handle it fine, while another would barf.
That's when I found out the YAML spec explicitly says it's human-readable, not human-writeable. Our mistake was assuming YAML was a configuration format, when actually it's a data serialization format (again, spec explicitly says this) that is easy to read.
Now I only write YAML files with a YAML generator, because just running a hand-edited file through a parser may fall victim to a parser quirk.
I've hit problems where YAML generated by one implementation will hit parsing quirks in another implementation. Now my advice is: If you have something that consumes YAML, generate JSON and feed it that instead. YAML is defined to be a superset of JSON, and every implementation should be able to handle it fine.
For those interested, the problem was with the string "08". At least at the time, the pyYAML generator I was using would render it as 08 (without quotes), which is not parsable as a number because the leading 0 indicates the number is in octal, but 8 isn't a valid octal digit. Since it wasn't parsable as a number, it should default to being treated as a string. However the golang parser disagreed and instead raised an error because "8" was not a valid octal digit.
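A sketch of the disagreement (the key name is made up):

```yaml
# One library happily emits this without quotes:
zip_prefix: 08   # looks octal; 8 isn't an octal digit, so per the comment above it
                 # should fall back to a string, but the Go parser errors out instead
# The form every parser agrees on:
zip_prefix: "08"
```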
Yeah exactly, generate json instead. I now have some GitHub Actions config in Python, and some in Nix. Nix is actually really nice for this. Like yaml it has less line noise than json and it supports comments, but unlike yaml the language is extremely simple, without weird corner cases to make it "easier", and it has variables and functions so you can write reusable jobs.
Yeah, I’ve lived through many a transition from one CI server to the next, so nowadays I just have CI call a script. You want to really minimize the amount you depend on a particular CI server’s features, to make switching very easy. Even if you never have to switch, it will be easier to maintain.
I've used Github Actions at work for the past year and I'm a fan overall. The clearest sign of this is that my feedback for improvement is almost entirely about missing features instead of broken ones. For example, it'd be nice:
1. for Github to natively allow CI management for several repos in a centralized way, so repo setup can just be "select this CI config" instead of "copy this YAML file and change the project name in some places"
2. to mandate certain CI steps at the organization level (such as running `black`) so it isn't opt-in
Me, I looove that the actions config has to be in a file in the repo, so I know where to find it, and if I have read access to the repo, I have access to the config. (Don't even need to be logged into a GH acct, although I usually am).
If they allowed config to come from an internal setting not visible in the repo, I'm sure repos I collaborate with would start using that feature, and I would not be able to find their Actions configs.
(I work mostly on open source, which may lead to different patterns of access and such).
It’s still early days. You can’t use “act” locally if you have a reusable workflow. And what bit me was that you can’t pass environment variables from the caller. My workaround was to write them to a file and then cat the file into $GITHUB_ENV in the reusable workflow.
However, that then exposed me to the up thread bug about files. So now I also have to delete the file before creating it. Sigh.
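For context, the $GITHUB_ENV part of that workaround looks roughly like this inside the reusable workflow (the file name and variable are hypothetical):

```yaml
steps:
  - name: Import variables written by the caller
    run: |
      # env-vars.txt is a hypothetical file of KEY=value lines produced earlier;
      # appending it to $GITHUB_ENV makes the variables visible to later steps.
      cat env-vars.txt >> "$GITHUB_ENV"
  - name: Use an imported variable
    run: echo "$SOME_VAR"   # SOME_VAR is assumed to be one of the lines above
```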
The most annoying thing for me is that a workflow_dispatch-only workflow can't be launched manually until it's pushed to the default branch, as such workflows aren't listed. I can understand the web UI not listing them, but even the GitHub CLI can't launch them. Only once they appear in the default branch are you free to launch them on any branch.
There's something in here I don't understand, and I thought I knew the reason why it does this (for at least some workflow types, maybe not workflow_dispatch).
If you're only free to run those workflows when they land in the default branch, does that also mean that the workflow that runs is the one from the default branch and if you change the workflow in a PR, it will only run the new workflow on merge?
I know there's something in here to permit non-owned commits (from an external contributor) to be tested against a trusted workflow from the main branch, but I don't think it has anything to do with workflow_dispatch. I would expect that if you're able to run workflows and target any branch, then if the workflow you run is the one contained in that branch, you'd be able to select any workflow that is named and defined in the branch's configuration.
I'm not saying that's how it works, I'm saying that's how I'd imagine it to work. If someone knows "the rule" that we can use to disambiguate this and understand it from the precepts that went into the design, maybe speak up here? I don't get it.
> If you're only free to run those workflows when they land in the default branch, does that also mean that the workflow that runs is the one from the default branch and if you change the workflow in a PR, it will only run the new workflow on merge?
The premise of your question is wrong. You can trigger workflow_dispatch workflows in any branch via the UI if a workflow by that name also exists in the default branch, and only via the API if no workflow by that name exists in the default branch.
Maybe the premise of my question is about an entirely different misunderstanding then. There is a locus of control issue, and a story about how the original permissions model of GitHub Actions was chronically broken.
> In order to protect public repositories for malicious users we run all pull request workflows raised from repository forks with a read-only token and no access to secrets. This makes common workflows like labeling or commenting on pull requests very difficult.
> In order to solve this, we’ve added a new pull_request_target event, which behaves in an almost identical way to the pull_request event with the same set of filters and payload. However, instead of running against the workflow and code from the merge commit, the event runs against the workflow and code from the base of the pull request. This means the workflow is running from a trusted source and is given access to a read/write token as well as secrets enabling the maintainer to safely comment on or label a pull request.
That's the user story I was thinking of. Completely unrelated to the default branch issue that GP was describing, I guess.
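For what it's worth, that announcement's pattern comes out looking roughly like this (the label name and action versions are just illustrative):

```yaml
# Runs the workflow definition from the base branch, with secrets, even for
# PRs from forks, so keep it minimal and never execute the PR's own code here.
on: pull_request_target

jobs:
  label:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.addLabels({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              labels: ['needs-review'],   // illustrative label name
            });
```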
Agreed. The workaround my team uses is to first merge an empty action with the right parameters set, then open a second branch to implement the action steps. Once the action hits main, you can start it, using the definition from any branch.
The API supports workflow_dispatch in non-default branches, even if that workflow doesn't exist in the default branch; you just need to call the API to do it. curl makes it pretty easy, but not as easy as clicking a button, I agree.
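For reference, a bare workflow_dispatch definition and the REST endpoint that fires it on an arbitrary ref (the file name and input are illustrative):

```yaml
# .github/workflows/manual-task.yml  (hypothetical file name)
on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        default: 'staging'

# Trigger it on any branch via the REST API:
#   POST /repos/{owner}/{repo}/actions/workflows/manual-task.yml/dispatches
#   body: {"ref": "my-branch", "inputs": {"environment": "staging"}}

jobs:
  task:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Running against ${{ github.event.inputs.environment }}"
```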
Another pitfall I have encountered is the lack of a true ephemeral agent runner solution for running the actions runner agent in your own infrastructure. The way it works (the last time I checked) is when you register a worker as "ephemeral: true" it automatically deregisters itself from your runner pool and kills the agent process when a job is completed, but it is up to you to clean things up. This leads to somewhat hacky scripts to delete the compute instance after the agent process exits. There is also no officially supported kubernetes controller for creating ephemeral agents but the community created one [1] is often mistaken as an official github project.
actions-runner-controller has been worked on by GitHub for a while and is effectively an official project. I don't think the distinction is important at this point.
My employer used some code from philips-labs to support ephemeral runners. Works great after a few customizations.
I wrote a shell script and a very small Go program to support ephemeral MacOS runners on-premise.
Their self-hosted runners are pretty jank. If your workflow writes something to the docker container's user's home directory, you will see it in the next workflow you run. Due to this and other things, I need a "preamble" action that runs right after checkout. Oh, and if you don't check out at the beginning of your workflow, you will be using the previous workflow's copy of the repository.
I'm 100% sure they don't use this internally, as these are glaring issues that impact anyone using the self-hosted runner. They also recommend running the container as root[1] instead of designing something more secure and sane.
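The kind of "preamble" that ends up being necessary looks something like this (a sketch, not an official recommendation; paths are made up):

```yaml
steps:
  - uses: actions/checkout@v3
    with:
      clean: true   # scrub leftovers from the previous job's checkout
  - name: Clean stale state left on the shared runner
    run: |
      # Hypothetical cleanup; adjust to whatever your jobs actually leave behind.
      git clean -ffdx
      rm -rf "$HOME/.cache/my-build-cache"
```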
It's not about security or sanity; it's because people run containers whose UIDs do not match the host system, and they write to the host system by mounting volumes for the container to use.
The result is that root or another user inside the container can write root-owned files, because they have the same UID as root on the container host.
My employer runs an orchestrator and destroys each runner VM after a single job, so this only bites the user who causes it, and not anyone else.
There’s an awkward gotcha/incompatibility between “Required status checks” and workflows that get skipped [1], eg due to setting a “paths” property of a push/pull_request workflow trigger [2].
The checks associated with the workflow don’t run and stay in a pending state, preventing the PR from being merged.
The only workaround I’m aware of is to use an action such as paths-filter [3] instead at the job level.
A further, related frustration/limitation - you can _only_ set the “paths” property [2] at the workflow level (i.e. not per-job), so those rules apply to all jobs in the workflow. Given that you can only build a DAG of jobs (ie “needs”) within a single workflow, it makes it quite difficult to do anything non trivial in a monorepo.
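If the paths-filter in question is dorny/paths-filter, the job-level workaround looks roughly like this (filter names and paths are illustrative):

```yaml
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      backend: ${{ steps.filter.outputs.backend }}
    steps:
      - uses: actions/checkout@v3
      - id: filter
        uses: dorny/paths-filter@v2
        with:
          filters: |
            backend:
              - 'backend/**'
  backend-tests:
    needs: changes
    # Skipped at the job level, so the required check still resolves
    # instead of hanging as a pending workflow.
    if: needs.changes.outputs.backend == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: ./ci/test-backend.sh   # hypothetical
```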
We recently converted all our projects to Github Actions, and while it really brings a lot of convenience it also feels to me like a very brittle solution with lots of gotchas and messy API surfaces.
Of course, the nature of running various commands on virtual machines and shells is inherently messy, but GHA could have done a lot to hide that. Instead I feel like I'm forced to mix YAML, bash, PowerShell and various higher-level scripting languages (that came with the projects) in an unholy concoction that is hard to get right (return codes, passing values and escaping come to mind) and that is even harder to debug, due to it running somewhere else (act helps, a little, but it doesn't properly replicate the GHA environment).
I kind of wish I could write all my workflows cross-platform from the start in some well-known but full-fledged scripting language. (Which of course I could, and just use GHA to call that script.) What options are out there to make this whole thing less brittle?
Another surprise: "ubuntu-latest" is not the latest ubuntu! It is stuck at 20.04. If you want 22.04 you need to specify "ubuntu-22.04". Similar issue with macos-latest.
"ubuntu-latest" isn't necessarily the latest Ubuntu, it's the latest version that has been fixed to the point of having no workflow-breaking known issues, I believe.
I rely on that repo to build my own images and it is a frequent cause of failed builds. I'm going to convert almost all of it to installation via homebrew instead, I think. It works well for macOS, anyway.
> The problem is that exec does not return a non-zero return code if the command fails. Instead, it returns a rejected promise.
> While this behavior can be changed by passing ignoreReturnCode as the third argument ExecOptions, the default behavior is very surprising.
This is the same behavior as Node's child_process exec when wrapped by util.promisify[1]. If something returns a promise (async func), it should be expected that it has the possibility of being rejected.
Another pitfall I ran into recently with a workflow I've been working on [1]: Checks and CI that are made with GitHub Actions are reported to the new Checks API, while some (all?) external services report to their old Statuses API. This makes it needlessly difficult to ascertain whether a PR/branch is "green" or not. They finally decided to create a "statusRollUp" that combines the state of the two APIs, but it's not available in their REST api, only their GraphQL API.
Docker request limits are kind of a pain to deal with in Github Actions. This recently bit us, and no amount of logging into a _paid_ docker account would rectify the problem.
As it turns out, images are pulled at the start of the run, which means your docker login will have no effect if you're currently bumping into these pull limits. This is made worse by the fact that the images themselves are controlled in the remote actions you're using, not something in your own codebase.
So you're left with either: forking the action and controlling it yourself, or hoping the maintainer will push to the Github registry.
We have a new integration at Cronitor to fully monitor your GitHub actions - works with both hosted and self-hosted runners. This is in beta right now. If anybody would like to check it out, let me know, Shane at cronitor