At Dollar Shave Club, we use CircleCI 2's Scheduled Workflows to run "monitors" against our production services every minute. These are idempotent, analytics-disabled API & Browser (via Puppeteer) tests that we also run as CI tests on every commit.
We send all monitor metrics to DataDog. When a monitor fails, the appropriate teams get a Slack notification with the full stack trace, and a DataDog monitor is also triggered to alert them.
For browser monitors, we upload screenshots and Puppeteer tracing files to S3, then share links within each Slack hook. This allows people to figure out what's going on just by clicking links in Slack.
We plan to improve this setup in the future, but it's good enough for us right now. For example, CircleCI degrades fairly frequently, so we sometimes get spotty coverage. We basically spend < $200/month with CircleCI to monitor about 300 APIs/pages every minute.
I second this. Been using the free plan of UptimeRobot for years without a problem. It pings your website every 5 minutes and you get an email alert when it’s down.
I've had a bad experience with Uptime Robot - it only checked our server via IPv4, so we didn't know for far too long that IPv6 was down (it was a DNS problem).
Guess I'm not the only one, and of course my personal site was down for the last three days because the domain registration had expired and couldn't be renewed on an outdated card.
I facepalmed.
But once my setups get a bit more complex, I'm thinking I'll build webhooks into an analytics server and ping those from each server with a JSON request that includes health data for that server's databases and other services.
I'm nowhere near that, so I'll probably sign up for the free plan on UptimeRobot mentioned in other comments.
I basically add a /global-health endpoint to my server. It executes a bunch of checks programmatically (e.g. database connection, rendering, etc.). It would be easy to add a "fail if cert expires in the next month" check.
Then I monitor just that one endpoint with Stackdriver (because it's easy). If any of the checks fail, it logs the failure, prints details, and returns a 500 status code. Adding new checks is just a code change.
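That pattern is only a few lines. Here is a minimal sketch in Python/Flask (the framework and the individual checks are illustrative assumptions, not the poster's actual code):

    # Minimal /global-health sketch (Flask assumed; checks are placeholders).
    import datetime
    import socket
    import ssl

    from flask import Flask, jsonify

    app = Flask(__name__)

    def check_database():
        # e.g. run "SELECT 1" against your database; stubbed out here
        return True

    def check_cert_expiry(host="example.com", days=30):
        # Fail if the TLS certificate expires within `days` days
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(socket.create_connection((host, 443)),
                             server_hostname=host) as sock:
            not_after = sock.getpeercert()["notAfter"]
        expires = datetime.datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
        return expires - datetime.datetime.utcnow() > datetime.timedelta(days=days)

    CHECKS = {"database": check_database, "cert": check_cert_expiry}

    @app.route("/global-health")
    def global_health():
        results = {name: check() for name, check in CHECKS.items()}
        # 500 on any failure so a dumb uptime poller can alert on it
        return jsonify(results), 200 if all(results.values()) else 500

Adding a check is then just another entry in CHECKS.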
What tools do you use to monitor uptime of your web apps and/or APIs?
I use a custom AWS Lambda function. It fires every four hours or so and tries to make an HTTPS connection to each configured URL (the URLs are stored in a file in S3). If a site is down, or there is an SSL error (which probably means an expired certificate), it sends me a text message using SNS.
The whole thing is about 50 lines of code, and that's in Java. It doesn't come close to exceeding the free tier limit of Lambda calls, so it hasn't cost anything so far.
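For anyone curious about the shape of it: the original is Java, but a rough Python sketch of the same idea (the bucket, key, and topic ARN are made-up placeholders) looks like this:

    # Rough Python sketch: read URLs from S3, try HTTPS, text via SNS on failure.
    import ssl
    import urllib.error
    import urllib.request

    import boto3

    BUCKET, KEY = "my-monitoring-bucket", "urls.txt"         # hypothetical names
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:alerts"  # hypothetical ARN

    def handler(event, context):
        s3, sns = boto3.client("s3"), boto3.client("sns")
        urls = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode().splitlines()
        for url in urls:
            try:
                urllib.request.urlopen(url, timeout=10)  # raises on HTTP errors too
            except (urllib.error.URLError, ssl.SSLError) as exc:
                # An SSL error here usually means an expired or invalid certificate
                sns.publish(TopicArn=TOPIC_ARN, Message=f"{url} check failed: {exc}")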
To be fair, I could have used a 3rd party service, but writing this thing was my first foray into using Lambda, so I did it as much for the learning experience as anything. But it works really well, so I doubt I'll replace it anytime soon.
Yeah, it's just using the standard AWS scheduled events stuff. Nothing fancy. Then again, our needs (for now) are pretty simple. I basically just want to know if our website goes down or if our SSL certificate expires or something.
One of the sites is a SaaS offering, but it's not live yet, so I don't need to stay super-on-top-of-it. Once it's live we'll want more frequent monitoring and some other stuff, so we might either move to another approach, or supplement this with something else.
If you want a light (and free) DIY solution, it's pretty straightforward to build a basic uptime monitor (with connections to Slack, SMS, whatever) using Standard Library [0] and Scheduled Tasks. One of our engineers just posted an article on how to do it (you can build from your browser).
Another option if you don't feel in the mood for DIY is TJ Holowaychuk's Apex Ping: https://apex.sh/ping/. Great service, run by a solo developer, reasonable price.
Sorry, but that has got to be one of the worst company/product names ever. Either it's completely un-searchable, since literally every programming language has a "standard library", or, if it actually takes off, it will push down actually relevant results for niche programming language docs.
Yes — you’re not the first person to come at me with a pitchfork for the domain and name and won’t be the last. I do appreciate the feedback, but we’re committed to the name. :)
We haven’t historically had a problem with “stdlib”, we’re already the top Google result. “Standard Library” (full name) is new for us as we expand to a less technical cohort of customers. We’re working with some pretty great people and companies (Stripe, Slack) on our mission to build a, well, Standard Library — so if you can get over the name choice you should check out our online development environment! -> https://code.stdlib.com/
We have been fairly happy with Runscope [0] for simple monitoring of API response codes and body payloads. My biggest complaint is probably the lack of individualized response data after 24 hours.
Still waiting to see if the CA Technologies acquisition [1] makes things worse or not.
There are more resources working on Runscope than ever before. CA continues to invest in stability, new features and support. CA is also going through an acquisition and that introduces more variables, but as of today (1+ year after acquiring us), CA has been extremely supportive of Runscope.
Protectumus monitors website uptime, speed, and DNS changes; scans the website for malware like a traditional antivirus; and blocks bad bots, custom IPs, and countries. Protectumus acts as a firewall. We are the only security company specialized in SEO security, and we offer unique SEO services such as search engine cloaking monitoring, Google DMCA complaints, blacklist monitoring & removal, and more.
> Protectumus acts as a web application firewall (WAF) and scans the website for known malware. Once the malware is found it will be automatically removed.
How does a WAF remove malware? Is there an additional agent sitting on the server or something? And surely when a server is pwned, it needs to be reformatted, not just the malware 'removed'.
Thanks for your question. We have both a Firewall and a traditional Antivirus. They work separately.
The firewall blocks bad bots and hack attempts like SQL injection, XSS, CSRF, and more.
The antivirus scans for known malware (we have a big list of malware definitions), but we also use AI and machine learning, so the antivirus learns from previous detections and is able to act on its own. The script can automatically remove the malware once it is found.
To piggyback on the question, has anyone had a good experience using Prometheus and Grafana for monitoring? I'm looking into trying it. I've looked at Zenoss, but from what I gather it's slow.
I love it. Coupled with Alertmanager it's a really easy to use and powerful platform. We also started serving custom Prometheus exporters for our product usage which has been helpful and really easy to implement.
To me, this solution leans too much toward graphing, and that's not the primary feature you need for health checking services.
Also, Prometheus only keeps 15 days' worth of data by default, and since the developers aren't willing to implement downsampling, it's not well suited to keeping long-term metric history. You can overcome this by using InfluxDB as the storage backend, but I would rather use something like monit, which is simpler and easier to set up and gets the actual job done well.
We <3 Prometheus & Grafana :) Mostly because they get us the classic monitoring of hosts/services like Nagios etc., but also app-specific metrics as well.
We are using it and it's great; it's so easy to also plug in your own metrics from whatever application you're running. I insisted with management that we needed it on a new project, and they were like "fine, if you insist; we do have Google Analytics and Uptime Robot, you know". Now they call me as soon as it's down (we installed it on a crappy server because it was just my fixation :) )
Another happy user here, except we use Telegraf to gather the metrics on the servers; I've found that it's a bit more flexible than the standard node exporter.
For alerts we have a mix of Grafana and Alertmanager.
Yes, of course. It's a great tool to track statistics and write alerts. Prometheus is designed to scrape the stats from the services directly, but it's also possible to do some kinds of active checks. For checking the SSL cert expiry date, you can use the blackbox exporter: https://github.com/prometheus/blackbox_exporter
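The wiring looks roughly like this (a sketch, not a drop-in config; the exporter address and target URL are examples). You point a scrape job at the exporter's /probe endpoint, and it exposes a probe_ssl_earliest_cert_expiry metric you can alert on:

    # prometheus.yml scrape job for the blackbox exporter (sketch;
    # 'blackbox:9115' and the target URL are placeholder values)
    scrape_configs:
      - job_name: blackbox
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets: ['https://example.com']
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox:9115

    # Example alert expression: cert expires within 30 days
    #   probe_ssl_earliest_cert_expiry - time() < 86400 * 30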
I have no affiliation with any of these companies.
I use StatusCake for basic uptime monitoring of websites. I switched to them from Pingdom because they are cheaper. The only downside I've had with StatusCake is that when something is down it doesn't give you the cause, whereas Pingdom would show you the traceroute. That made it hard to tell what was wrong, and I would get quite a few false positives saying sites were down. I haven't had false positive issues for months now, though. Their paid plans have SSL/domain monitoring.
You need to realize health checking doesn't need to be cool.
It's a good solution for checking statistics with human eyes, but there are simpler and more effective solutions for health checking.
If you're looking beyond uptime + certs, we do functional + visual browser testing at https://ghostinspector.com/. Lots of folks use it for monitoring their website or application (in addition to their CI process). We have a free tier that includes scheduling. [Disclosure: I'm the founder]
I have been using Ghost Inspector for a while now, and so far it has been fantastic. It is really nice to be able to push to a branch and get notified in Slack a couple of seconds later with any issues, including screenshots.
Nagios, in combination with check_mk 'raw edition' (https://mathias-kettner.com/editions.html). The Nagios configuration is automatically generated via Puppet resource collection.
SSL certificate expiry is easily checked with a Nagios check (use the -C flag on check_http). If you use Let's Encrypt with a client like acmetool (https://github.com/hlandau/acme) your certs will never expire. Of course the Nagios check is still necessary to ensure acmetool keeps doing its job!
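For reference, the invocation is a one-liner (this example warns at 30 days out and goes critical at 14; host is a placeholder):

    # check_http certificate check: warn at 30 days, critical at 14
    check_http -H example.com --ssl -C 30,14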
Domain name expiry checking could also be a Nagios job, or alternatively you could write a small script that checks whois output and run it regularly with cron.
Configuring your registrar for auto-renewal helps avoid a certain class of errors ("I forgot to renew!") but not others ("my credit card expired and the e-mail notifications from my registrar didn't reach me").
TICK stack: Telegraf, InfluxDB, Chronograf, Kapacitor. They have a massive array of plugins that can watch/store/graph/alert on just about anything.
I agree with this. I just set this up at work and it's really cool. Still in active development and not fully mature yet, but gets the job done and is pretty good quality.
I wish the documentation was better, but Telegraf's documentation is light years ahead of collectd's, which is similar software.
Kapacitor needs some more examples and the default Chronograf-generated TICKscript needed to be thoroughly modified to meet my needs. It took me way too long to figure out how to use stateChangesOnly() to prevent me from getting constant notifications once something went into an alarm state.
That said, the stack works well, even if it has a few rough edges. Thanks to InfluxData for the open source stuff. High quality open source software makes me want to endorse them and purchase the paid products.
Agree on the TICKscript criticism! Just not enough Stack Overflow answers to go around. Btw, stateChangesOnly() can take a time argument, which is really, really handy!
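For anyone hitting the same wall, the relevant bit of an alert node looks something like this (a sketch; the measurement and threshold are made up):

    // TICKscript sketch: only notify on state changes, and at most
    // every 10m while the alert remains active
    stream
        |from()
            .measurement('cpu')
        |alert()
            .crit(lambda: "usage_idle" < 10)
            .stateChangesOnly(10m)
            .slack()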
Pingdom, SiteUptime, and Montastic. None of these are relied upon for mission critical monitoring. I set them up as a sort of "canary in the coal mine" for each of my servers, to help alert me to issues.
I also created a very hacky, browser start page which has GIF's that are pulled from each of my servers. Sort of like how "game copy world" used to setup their mirror page. Super unsophisticated but it allows me to do an "uptime check" every time I open my browser.
I see that it's trending but without any comments - so allow me a shameless plug, I created a tool to monitor my APIs (can schedule calls, do response content checks, send alerts etc): http://www.apilope.com
If you drop me a line after you signed up I can flag you as a demo user that's free forever - or at least until you want to pay or cancel :)
I run Selenium integration tests inside a docker container every 5 minutes or so and attach the results to a sentry.io logger. I put up a boilerplate version on github: https://github.com/mikedh/selenium-simple
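A stripped-down version of that pattern (not the actual repo contents; the DSN and URL are placeholders) is just:

    # Minimal Selenium check reporting failures to Sentry.
    import sentry_sdk
    from selenium import webdriver

    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/login")  # placeholder URL
        assert "Log in" in driver.title, "login page title changed or missing"
    except Exception as exc:
        sentry_sdk.capture_exception(exc)  # shows up in sentry.io with a stack trace
        raise
    finally:
        driver.quit()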
This is almost literally what I used as inspiration for the site transaction monitoring part of https://checklyhq.com, as in a browser emulation feeding a monitoring system at regular intervals. Selenium was always a bit of a hog, so I jumped on Puppeteer when it came around.
I use Ping [0], never used an alternative so not sure how it compares on features, but it has everything I need: uptime alerts, header & body requests. All packaged in a nice interface with a solid pricing structure.
Re SSL expiry: we've taken the route of running VirtualMin on our 'commodity' servers. Sites set up in there have an auto-renew policy if they use SSL, so it requires no action and auto-renews whenever certs come up for expiry (IIRC it's every three months, but I might be wrong: it requires no thought nor action from me).
For domain names, we use Joker (Swiss company), we've found them to be good in all aspects, and they send renewal notices well in advance (including a link that anyone can use to renew the domain, even without having an account with them).
Uptime monitoring is a whole different kettle of fish, and we manage it on a per client basis depending on needs. For the sites we host, generally if the server's up, then the sites are up — but we also do more specific monitoring if it's a client requirement.
I'm using the same, but I'm looking around for alternatives after I received a couple of false alerts (about 3-4 over the last year) from them. Still, overall a good service.
I was unable to find a solution that could monitor my websites and servers using a single service, so I decided to develop something[0] to scratch my own itch. It has been out for almost a year now and is slowly gaining some traction. :) It allows you to monitor your servers for resources, and your websites for uptime, SSL certificates, broken links, and mixed content errors. It also gives you a public status page. The setup is extremely easy too, as it is just a single binary (and it is open source!) that you run from your own crontab.
Feel free to let me know what you think about it, feedback is always greatly appreciated.
I've written custom bash scripts that Zabbix runs and alerts based on their output.
For example, domain expiry: I have a script that Zabbix runs once a day that does a whois and grabs the expiry date for the domain in question, converts that to a unix timestamp, and subtracts the current date from it. The script echoes the result.
In Zabbix we can then alert if that item's value is under 30 days. It's similar for SSL certificates, and the web monitoring stuff is built in.
Edit: Oh actually on the whois, I remember it was a huge pain in the butt getting the expiry for a variety of different domains - I now use https://jsonwhoisapi.com/ to get the whois info
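The shape of such a check, sketched here in Python rather than bash (the expiry field name varies by TLD, which is exactly the pain mentioned above):

    # Rough Python version of the whois expiry check (the commenter's is bash).
    import re
    import subprocess
    import time
    from datetime import datetime, timezone

    def seconds_until_expiry(domain):
        out = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
        # .com-style field; other TLDs use different names/formats
        match = re.search(r"Registry Expiry Date:\s*(\S+)", out)
        expiry = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
        return expiry.replace(tzinfo=timezone.utc).timestamp() - time.time()

    # Zabbix can then alert when this drops below 30 days (2592000 seconds)
    print(int(seconds_until_expiry("example.com")))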
External monitoring: Pingdom/Site24x7. Lesson learnt - have the alerts route to at least a few emails outside of your company domain if you use the same domain for email.
Site monitoring: NewRelic
PagerDuty/OpsGenie: For alert routing if you have more than 2 people.
- Custom health check endpoint: checks a bunch of internal status metrics (NewRelic metrics for speed and error rate, DB connection check, a few E2E tests). Returns a simple overall status: OK, Warning (error rate high but not critical, slower than usual response times), Critical (DB down, errors high, anything else). Benefit of this approach is that it's easy to add new health checks any time you add a feature (example: is redis up?).
- Pingdom: polls custom endpoint every minute. Sends notification to PagerDuty if critical, or if warning for more than 5 mins.
- PagerDuty to notify team
- SSL Expiry: calendar notifications (whole team) and reminder emails from our SSL cert issuer
I started using Site24x7 from Zoho about a year ago for all of my apps and my clients' "basic" websites.
It ended up being a great tool for me because it allows the most basic ping tests and content checks to be set up in 10 minutes. It also has the ability to add reporting based on apps, servers, and databases. The AWS add-ons helped me tune my usage, as I was paying way too much for a couple of services that I could downgrade without impacting my apps.
I feel it is priced right and a good value up to the $89/month plan.
People in the market for API monitoring and site transaction monitoring, have a look at https://checklyhq.com. We offer full API monitoring and add Puppeteer-based site transaction monitoring. We aim to be a one-stop shop, so we give you a big dashboard and nice things like SSL cert monitoring and SMS alerting.
Many of our larger clients use https://www.splunk.com for enterprise-level monitoring, alerting, and log analysis. It has proven effective for debugging issues in environments where I don't have direct log access.
https://nodeping.com - cheap, has an API, can graph a simple JSON response in addition to ICMP, HTTP, etc. Generally reliable, and we've been using it as our backstop monitoring for hundreds of nodes for several years.
DataDog. Our API is 100% serverless microservices (AWS Lambda), and 90% of them connect to Elasticsearch / Dynamo. If we start getting high error rates, an alarm goes off and Slack / email lights up. We are monitoring upwards of 300 Lambdas this way.
KeyChest.net is great for managing your TLS certificates.
It can use the certificate transparency (CT) logs to detect new certificates for your domain, so once set up you don't have to maintain it. Make sure to enable the weekly report email too.
Virtually every monitoring tool has already been mentioned, but I just want to add that you can always get collectd and StatsD up and running in seconds, for free, on Linux hosts. Very lightweight, and they can measure virtually anything.
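If you go that route, pushing a metric from app code to a local StatsD daemon is about as small as it gets (using the common statsd Python client; the metric names are examples):

    # Emitting app metrics to a local StatsD daemon with the `statsd`
    # Python package (pip install statsd); metric names are examples.
    import statsd

    client = statsd.StatsClient("localhost", 8125)
    client.incr("myapp.health_check.ok")         # counter
    client.timing("myapp.response_time_ms", 42)  # timer, in milliseconds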
For my needs -- several websites -- a free and simple solution will be enough. Pings a few times per day would do. Even once per day might do.
I discovered earlier this week that Keybase.io also offers pseudo-monitoring: it emailed me that the proof on my website was no longer valid due to broken HTTPS.
I installed a node.js module that takes a screenshot of a web page, a module that compares two images, and a module that sends e-mails. Glued them together and put it on a $1/month VPS. I now get an e-mail every time a web page that I'm interested in changes.
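In the same spirit, a hedged Python take on that glue (assuming headless Chrome is installed; the URL, file paths, and addresses are placeholders):

    # Headless-Chrome screenshot, pixel diff with Pillow, email on change.
    import os
    import smtplib
    import subprocess
    from email.message import EmailMessage

    from PIL import Image, ImageChops

    URL = "https://example.com/page-i-care-about"  # hypothetical

    subprocess.run(["google-chrome", "--headless", "--screenshot=new.png", URL],
                   check=True)

    try:
        old = Image.open("old.png").convert("RGB")
        new = Image.open("new.png").convert("RGB")
        changed = ImageChops.difference(old, new).getbbox() is not None
    except FileNotFoundError:
        changed = True  # first run: no baseline screenshot yet

    if changed:
        msg = EmailMessage()
        msg["Subject"] = f"Page changed: {URL}"
        msg["From"] = msg["To"] = "me@example.com"
        msg.set_content("The page changed since the last check.")
        smtplib.SMTP("localhost").send_message(msg)
        os.replace("new.png", "old.png")  # promote to new baseline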
UpDown is nice but I'm considering moving because they don't support multiple domains on the same status page. I don't need a single status page for each of the domains I own, that would be ridiculous.
You can read more about the Dollar Shave Club setup described at the top of the thread here:
- https://engineering.dollarshaveclub.com/monitor-all-the-thin...
- https://circleci.com/blog/how-dollar-shave-club-3x-d-velocit...
- https://github.com/dollarshaveclub/monitor