
Scheduled tasks are a great way to brown-out your downstream dependencies.

In one instance, the monthly mdadm RAID check caused P99 latency spikes on the first Sunday of every month [0] (the default schedule). It caused a lot of pain for our customers until the check was IO-throttled, which meant the spikes weren't as high but lasted longer.
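The comment doesn't say how the throttling was done, but the usual knob for this is the kernel's md resync speed limits. A sketch as a sysctl fragment (hypothetical file name, illustrative values):

```
# /etc/sysctl.d/10-md-throttle.conf  (hypothetical file, illustrative values)
# Cap the per-device bandwidth md will use for checks/resyncs (KiB/s),
# trading a longer-running scrub for less impact on foreground IO.
dev.raid.speed_limit_min = 1000
dev.raid.speed_limit_max = 10000
```

The same values can be poked at runtime via `sysctl -w` or /proc/sys/dev/raid/.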

Scheduled tasks are a great way to brown-out yourself.

In another case, a client process hadn't set a socket timeout on a blocking TCP connection [1] (the default), so it would routinely run out of worker threads blocked in recv() whenever the server (fronted by a reverse proxy) started rejecting incoming connections due to overload. Only a restart of the process would recover the client.
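A minimal sketch of the fix (Python for illustration; the helper name is made up): bound how long a blocking recv() can hang, so a stalled server can't pin a worker thread forever.

```python
import socket

def fetch_with_timeout(host, port, payload, timeout_s=5.0):
    """Send a request and read a reply, never blocking longer than
    timeout_s on connect or recv (hypothetical helper)."""
    # create_connection() applies timeout_s to the connect and leaves
    # it set on the socket, so the recv() below is bounded too.
    sock = socket.create_connection((host, port), timeout=timeout_s)
    try:
        sock.sendall(payload)
        return sock.recv(4096)  # raises socket.timeout if the server stalls
    finally:
        sock.close()
```

The stdlib default timeout is None, i.e. block forever, which is exactly how each stalled connection ends up eating a worker thread until the process is restarted.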

Scheduled tasks are a great way to prove HAProxy will scale way better than your backend. Thanks u/wtarreau

Speaking of HAProxy: it fronted a thread-based server processing all sorts of heavy and light queries, with segregated thread pools for the various workloads. During peak, the proxy would accept connections and queue work faster than the worker threads could handle. On occasion, the work queue would grow so big that it contained not only retries of the same work scheduled by desperate clients, but a laundry list of work that was no longer valid (already processed by some other backend in the retry path, or simply expired because time-to-service exceeded the SLA). Yet there the backend was, in a quagmire, chugging through never-ending work, in constant overload, when ironically the client wasn't even waiting on the other end and had long since closed the connection. The health checks were passing because, well, those ran on a separate thread pool with a different work queue. Smiles all around.
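One mitigation for the stale-work part (a sketch with made-up names, not the actual service): carry the client's deadline with each queued request and shed anything already expired at dequeue time, so workers never burn CPU on requests nobody is waiting for.

```python
import time
from queue import Queue, Empty

def drain_fresh(work_queue: Queue, handle, now=None):
    """Process queued (deadline, request) pairs, dropping items whose
    deadline passed while they sat in the queue. Returns counts."""
    now = time.monotonic() if now is None else now
    processed = dropped = 0
    while True:
        try:
            deadline, request = work_queue.get_nowait()
        except Empty:
            return processed, dropped
        if now > deadline:
            dropped += 1      # client has given up; skip the work entirely
        else:
            handle(request)
            processed += 1
```

The complementary fix is bounding the queue at the proxy (e.g. HAProxy's per-server maxconn and `timeout queue`), so overload is rejected at the edge instead of silently deferred.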

Least conns and the event horizon. Tread carefully.

Least-conns load balancing bit us hard on multiple occasions, and is now banned for reasons similar to those outlined here: https://rachelbythebay.com/w/2018/04/21/lb/

[0] https://serverfault.com/questions/199096/linux-software-raid...

[1] https://stackoverflow.com/questions/667640/how-to-tell-if-a-...

> Scheduled tasks are a great way to brown-out your downstream dependencies.

I've been trying to convince my division to prioritize adding events to our stats dashboard.

Comparing response times to CPU times is just the expected level of effort for interpreting the graphs. But you don't have any visibility into how a cron job, service restart, rollback, or host reboot caused these knock-on effects. And without that data you get little adrenaline jolts at regular intervals when someone reports a false positive. Especially in pre-prod, where corners get cut on hardware budgets, and deploying a low-churn service may mean it's down for the duration.

This might end up being a thing I have to do myself, since everyone else just nods and says that'd be nice.

I guess it would be nice to have a "fuzzy" cron job setting, so you could specify "run it at a random time during the day/hour/minute"

Someone pointed out in comments on the blog post (https://about.gitlab.com/2019/08/27/tyranny-of-the-clock/#co...) that systemd actually has some good functionality for that.
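For reference, the relevant knob is `RandomizedDelaySec=` in a timer unit (sketch; the unit name is made up, see systemd.timer(5)):

```
# nightly-cleanup.timer  (hypothetical unit)
[Unit]
Description=Nightly cleanup, jittered so hosts don't fire in lockstep

[Timer]
OnCalendar=daily
# Delay each activation by a random amount between 0 and 1h.
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```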

Elsewhere this is referred to as "jitter". Most systems with regular schedules are built without it, and eventually many will grow into facemelting hordes.

General best practice is to add fuzzing to your scheduled task. For example, wrap it in a shell script that starts with `sleep $((RANDOM % 100))`, which sleeps between 0 and 99 seconds. Use `RANDOM % 10` instead if you run the task every minute, to fuzz the start by 0 to 9 seconds.

This seems to be the behaviour of AWS CloudWatch 'rate' events. E.g. `rate(1 hour)` or `rate(10 minutes)` don't necessarily run on the hour. Not sure if they're relative to when you set them up, or random.

