At Dollar Shave Club, we use CircleCI 2's Scheduled Workflows to run "monitors" against our production services every minute. These are idempotent, analytics-disabled API & Browser (via Puppeteer) tests that we also run as CI tests on every commit.
We send all monitor metrics to DataDog. When a monitor fails, the appropriate teams get a Slack notification with the full stack trace, and a DataDog monitor is also triggered to alert them.
For browser monitors, we upload screenshots and Puppeteer tracing files to S3, then share links within each Slack hook. This allows people to figure out what's going on just by clicking links in Slack.
We plan to improve this setup in the future, but it's good enough for us right now. For example, CircleCI degrades fairly frequently, so we sometimes get spotty coverage. We basically spend < $200/month with CircleCI to monitor about 300 APIs/pages every minute.
I second this. Been using the free plan of UptimeRobot for years without a problem. It pings your website every 5 minutes and you get an email alert when it’s down.
I've had a bad experience with Uptime Robot - it only checked our server via IPv4, so we didn't know for far too long that IPv6 was down (it was a DNS problem).
Guess I'm not the only one, and of course my personal site was down for the last three days because the domain registration had expired and couldn't be renewed on an outdated card.
I facepalmed.
But once my setups get a bit more complex, I'm thinking I'll build webhooks into an analytics server and ping those from each server with a JSON request that includes health data for that server's databases and other services.
I'm nowhere near that, so I'll probably sign up for the free plan on UptimeRobot mentioned in other comments.
I basically add a /global-health endpoint to my server. It executes a bunch of checks programmatically (e.g. database connection, rendering, etc.). It would be easy to add a "fail if cert expires in the next month" check.
Then I monitor just that one endpoint with Stackdriver (because it's easy). If any of the checks fail, it logs the failure, prints details, and returns a 500 status code. Adding new checks is just a code change.
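That pattern is only a few lines. Here is a minimal sketch in Python/Flask (the framework and the individual checks are illustrative assumptions, not the poster's actual code):

    # Minimal /global-health sketch (Flask assumed; checks are placeholders).
    import datetime
    import socket
    import ssl

    from flask import Flask, jsonify

    app = Flask(__name__)

    def check_database():
        # e.g. run "SELECT 1" against your database; stubbed out here
        return True

    def check_cert_expiry(host="example.com", days=30):
        # Fail if the TLS certificate expires within `days` days
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(socket.create_connection((host, 443)),
                             server_hostname=host) as sock:
            not_after = sock.getpeercert()["notAfter"]
        expires = datetime.datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
        return expires - datetime.datetime.utcnow() > datetime.timedelta(days=days)

    CHECKS = {"database": check_database, "cert": check_cert_expiry}

    @app.route("/global-health")
    def global_health():
        results = {name: check() for name, check in CHECKS.items()}
        # 500 on any failure so a dumb uptime poller can alert on it
        return jsonify(results), 200 if all(results.values()) else 500

Adding a check is then just another entry in CHECKS.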
What tools do you use to monitor uptime of your web apps and/or APIs?
I use a custom AWS Lambda function. It fires every four hours or so and tries to make an HTTPS connection to each configured URL (the URLs are stored in a file in S3). If a site is down, or there is an SSL error (which probably means an expired certificate), it sends me a text message using SNS.
The whole thing is about 50 lines of code, and that's in Java. It doesn't come close to exceeding the free tier limit of Lambda calls, so it hasn't cost anything so far.
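For anyone curious about the shape of it: the original is Java, but a rough Python sketch of the same idea (the bucket, key, and topic ARN are made-up placeholders) looks like this:

    # Rough Python sketch: read URLs from S3, try HTTPS, text via SNS on failure.
    import ssl
    import urllib.error
    import urllib.request

    import boto3

    BUCKET, KEY = "my-monitoring-bucket", "urls.txt"         # hypothetical names
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:alerts"  # hypothetical ARN

    def handler(event, context):
        s3, sns = boto3.client("s3"), boto3.client("sns")
        urls = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode().splitlines()
        for url in urls:
            try:
                urllib.request.urlopen(url, timeout=10)  # raises on HTTP errors too
            except (urllib.error.URLError, ssl.SSLError) as exc:
                # An SSL error here usually means an expired or invalid certificate
                sns.publish(TopicArn=TOPIC_ARN, Message=f"{url} check failed: {exc}")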
To be fair, I could have used a 3rd party service, but writing this thing was my first foray into using Lambda, so I did it as much for the learning experience as anything. But it works really well, so I doubt I'll replace it anytime soon.
Yeah, it's just using the standard AWS scheduled events stuff. Nothing fancy. Then again, our needs (for now) are pretty simple. I basically just want to know if our website goes down or if our SSL certificate expires or something.
One of the sites is a SaaS offering, but it's not live yet, so I don't need to stay super-on-top-of-it. Once it's live we'll want more frequent monitoring and some other stuff, so we might either move to another approach, or supplement this with something else.
If you want a light (and free) DIY solution, it's pretty straightforward to build a basic uptime monitor (with connections to Slack, SMS, whatever) using Standard Library [0] and Scheduled Tasks. One of our engineers just posted an article on how to do it (you can build from your browser).
Another option if you don't feel in the mood for DIY is TJ Holowaychuk's Apex Ping: https://apex.sh/ping/. Great service, run by a solo developer, reasonable price.
Sorry, but that has got to be one of the worst company/product names ever. Either it's completely un-searchable, since literally every programming language has a "standard library", or, if it actually takes off, it will push down actually relevant results for niche programming language docs.
Yes — you’re not the first person to come at me with a pitchfork for the domain and name and won’t be the last. I do appreciate the feedback, but we’re committed to the name. :)
We haven’t historically had a problem with “stdlib”, we’re already the top Google result. “Standard Library” (full name) is new for us as we expand to a less technical cohort of customers. We’re working with some pretty great people and companies (Stripe, Slack) on our mission to build a, well, Standard Library — so if you can get over the name choice you should check out our online development environment! -> https://code.stdlib.com/
We have been fairly happy with Runscope [0] for simple monitoring of API response codes and body payloads. My biggest complaint is probably the lack of individualized response data after 24 hours.
Still waiting to see if the CA Technologies acquisition [1] makes things worse or not.
There are more resources working on Runscope than ever before. CA continues to invest in stability, new features and support. CA is also going through an acquisition and that introduces more variables, but as of today (1+ year after acquiring us), CA has been extremely supportive of Runscope.
Protectumus monitors website uptime, speed, and DNS changes; scans the website for malware like a traditional antivirus; and blocks bad bots, custom IPs, and countries. Protectumus acts as a firewall. We are the only security company specialized in SEO security, and we offer unique SEO services such as search engine cloaking monitoring, Google DMCA complaints, blacklist monitoring & removal, and more.
> Protectumus acts as a web application firewall (WAF) and scans the website for known malware. Once the malware is found it will be automatically removed.
How does a WAF remove malware? Is there an additional agent sitting on the server or something? And surely when a server is pwned, it needs to be reformatted, not just the malware 'removed'.
Thanks for your question. We have both a Firewall and a traditional Antivirus. They work separately.
The firewall blocks bad bots and hack attempts like SQL injection, XSS, CSRF, and more.
The antivirus scans for known malware (we have a big list of malware definitions), but we also use AI and machine learning, so the antivirus learns from previous detections and is able to act on its own. The script can automatically remove the malware once it is found.
To piggyback on the question, has anyone had a good experience using Prometheus and Grafana for monitoring? I'm looking into trying it. I've looked at Zenoss, but from what I gather it's slow.
I love it. Coupled with Alertmanager it's a really easy to use and powerful platform. We also started serving custom Prometheus exporters for our product usage which has been helpful and really easy to implement.
To me, this solution leans too much toward graphing, and that's not the primary feature you need for health checking services.
Also, Prometheus only keeps 15 days' worth of data by default, and since the developers aren't willing to implement downsampling, it's not well suited to keeping long-term metric history. You can overcome this by using InfluxDB as the storage backend, but I would rather use something like monit, which is simpler and easier to set up and gets the actual job done well.
We <3 Prometheus & Grafana :) Mostly because they get us the classic monitoring of hosts/services like Nagios etc., but also app-specific metrics as well.
We are using it and it's great; it's so easy to also plug in your own metrics from whatever application you're running. I insisted with management that we needed it on a new project, and they were like "fine, if you insist; we do have Google Analytics and Uptime Robot, you know". Now they call me as soon as it's down (we installed it on a crappy server because it was just my fixation :) )
Another happy user here, except we use Telegraf to gather the metrics on the servers; I've found that it's a bit more flexible than the standard node exporter.
For alerts we have a mix of Grafana and Alertmanager.
Yes, of course. It's a great tool to track statistics and write alerts. Prometheus is designed to scrape the stats from the services directly, but it's also possible to do some kinds of active checks. For checking the SSL cert expiry date, you can use the blackbox exporter: https://github.com/prometheus/blackbox_exporter
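The wiring looks roughly like this (a sketch, not a drop-in config; the exporter address and target URL are examples). You point a scrape job at the exporter's /probe endpoint, and it exposes a probe_ssl_earliest_cert_expiry metric you can alert on:

    # prometheus.yml scrape job for the blackbox exporter (sketch;
    # 'blackbox:9115' and the target URL are placeholder values)
    scrape_configs:
      - job_name: blackbox
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets: ['https://example.com']
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox:9115

    # Example alert expression: cert expires within 30 days
    #   probe_ssl_earliest_cert_expiry - time() < 86400 * 30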
I have no affiliation with any of these companies.
I use StatusCake for basic uptime monitoring of websites. I switched to them from Pingdom because they are cheaper. The only downside I've had with StatusCake is that when something is down it doesn't give you the cause, whereas Pingdom would show you the traceroute. That made it hard to tell what was wrong, and I would get quite a few false positives saying sites were down. I haven't had false positive issues for months now, though. Their paid plans have SSL/domain monitoring.
You need to realize health checking doesn't need to be cool.
It's a good solution for checking statistics with human eyes, but there are simpler and more effective solutions for health checking.
If you're looking beyond uptime + certs, we do functional + visual browser testing at https://ghostinspector.com/. Lots of folks use it for monitoring their website or application (in addition to their CI process). We have a free tier that includes scheduling. [Disclosure: I'm the founder]
I have been using Ghost Inspector for a while now, and so far it has been fantastic. It is really nice to be able to push to a branch and get notified in Slack a couple of seconds later with any issues, including screenshots.
Nagios, in combination with check_mk 'raw edition' (https://mathias-kettner.com/editions.html). The Nagios configuration is automatically generated via Puppet resource collection.
SSL certificate expiry is easily checked with a Nagios check (use the -C flag on check_http). If you use Let's Encrypt with a client like acmetool (https://github.com/hlandau/acme) your certs will never expire. Of course the Nagios check is still necessary to ensure acmetool keeps doing its job!
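For reference, the invocation is a one-liner (this example warns at 30 days out and goes critical at 14; host is a placeholder):

    # check_http certificate check: warn at 30 days, critical at 14
    check_http -H example.com --ssl -C 30,14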
Domain name expiry checking could also be a Nagios job, or alternatively you could write a small script that checks whois output and run it regularly with cron.
Configuring your registrar for auto-renewal helps avoid a certain class of errors ("I forgot to renew!") but not others ("my credit card expired and the e-mail notifications from my registrar didn't reach me").
TICK stack: Telegraf, InfluxDB, Chronograf, Kapacitor. They have a massive array of plugins that can watch/store/graph/alert on just about anything.
I agree with this. I just set this up at work and it's really cool. Still in active development and not fully mature yet, but gets the job done and is pretty good quality.
I wish the documentation was better, but Telegraf's documentation is light years ahead of collectd's, which is similar software.
Kapacitor needs some more examples and the default Chronograf-generated TICKscript needed to be thoroughly modified to meet my needs. It took me way too long to figure out how to use stateChangesOnly() to prevent me from getting constant notifications once something went into an alarm state.
That said, the stack works well, even if it has a few rough edges. Thanks to InfluxData for the open source stuff. High quality open source software makes me want to endorse them and purchase the paid products.
Agree on the TICKscript criticism! Just not enough Stack Overflow answers to go around. Btw, stateChangesOnly() can take a time argument, which is really, really handy!
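For anyone hitting the same wall, the relevant bit of an alert node looks something like this (a sketch; the measurement and threshold are made up):

    // TICKscript sketch: only notify on state changes, and at most
    // every 10m while the alert remains active
    stream
        |from()
            .measurement('cpu')
        |alert()
            .crit(lambda: "usage_idle" < 10)
            .stateChangesOnly(10m)
            .slack()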
Pingdom, SiteUptime, and Montastic. None of these are relied upon for mission critical monitoring. I set them up as a sort of "canary in the coal mine" for each of my servers, to help alert me to issues.
I also created a very hacky, browser start page which has GIF's that are pulled from each of my servers. Sort of like how "game copy world" used to setup their mirror page. Super unsophisticated but it allows me to do an "uptime check" every time I open my browser.
I see that it's trending but without any comments - so allow me a shameless plug, I created a tool to monitor my APIs (can schedule calls, do response content checks, send alerts etc): http://www.apilope.com
If you drop me a line after you signed up I can flag you as a demo user that's free forever - or at least until you want to pay or cancel :)
I run Selenium integration tests inside a docker container every 5 minutes or so and attach the results to a sentry.io logger. I put up a boilerplate version on github: https://github.com/mikedh/selenium-simple
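A stripped-down version of that pattern (not the actual repo contents; the DSN and URL are placeholders) is just:

    # Minimal Selenium check reporting failures to Sentry.
    import sentry_sdk
    from selenium import webdriver

    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/login")  # placeholder URL
        assert "Log in" in driver.title, "login page title changed or missing"
    except Exception as exc:
        sentry_sdk.capture_exception(exc)  # shows up in sentry.io with a stack trace
        raise
    finally:
        driver.quit()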
This is almost literally what I used as inspiration for the site transaction monitoring part of https://checklyhq.com, as in a browser emulation feeding a monitoring system at regular intervals. Selenium was always a bit of a hog, so I jumped on Puppeteer when it came around.
I use Ping [0], never used an alternative so not sure how it compares on features, but it has everything I need: uptime alerts, header & body requests. All packaged in a nice interface with a solid pricing structure.
Re SSL expiry: we've taken the route of running VirtualMin on our 'commodity' servers. Sites set up in there have an auto-renew policy if they use SSL, so it requires no action and auto-renews whenever certs come up for expiry (IIRC it's every three months, but I might be wrong: it requires no thought nor action from me).
For domain names, we use Joker (Swiss company), we've found them to be good in all aspects, and they send renewal notices well in advance (including a link that anyone can use to renew the domain, even without having an account with them).
Uptime monitoring is a whole different kettle of fish, and we manage it on a per client basis depending on needs. For the sites we host, generally if the server's up, then the sites are up — but we also do more specific monitoring if it's a client requirement.
I'm using the same, but I'm looking around for alternatives after I received a couple of false alerts (about 3-4 over the last year) from them. Still, overall a good service.
I was unable to find a solution that could monitor my websites and servers using a single service, so I decided to develop something[0] to scratch my own itch. It has been out for almost a year now and is slowly gaining some traction. :) It allows you to monitor your servers for resources, and your websites for uptime, SSL certificates, broken links, and mixed content errors. It also gives you a public status page. The setup is extremely easy too, as it is just a single binary (and it is open source!) that you run from your own crontab.
Feel free to let me know what you think about it, feedback is always greatly appreciated.
I've written custom bash scripts that Zabbix runs and alerts based on their output.
For example, domain expiry: I have a script that Zabbix runs once a day that does a whois and grabs the expiry date for the domain in question, converts that to a unix timestamp, and subtracts the current date from it. The script echoes the result.
In Zabbix we can then alert if that item's value is under 30 days. It's similar for SSL certificates, and the web monitoring stuff is built in.
Edit: Oh actually on the whois, I remember it was a huge pain in the butt getting the expiry for a variety of different domains - I now use https://jsonwhoisapi.com/ to get the whois info
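The shape of such a check, sketched here in Python rather than bash (the expiry field name varies by TLD, which is exactly the pain mentioned above):

    # Rough Python version of the whois expiry check (the commenter's is bash).
    import re
    import subprocess
    import time
    from datetime import datetime, timezone

    def seconds_until_expiry(domain):
        out = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
        # .com-style field; other TLDs use different names/formats
        match = re.search(r"Registry Expiry Date:\s*(\S+)", out)
        expiry = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
        return expiry.replace(tzinfo=timezone.utc).timestamp() - time.time()

    # Zabbix can then alert when this drops below 30 days (2592000 seconds)
    print(int(seconds_until_expiry("example.com")))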
External monitoring: Pingdom/Site24x7. Lesson learnt - have the alerts route to at least a few emails outside of your company domain if you use the same domain for email.
Site monitoring: NewRelic
PagerDuty/OpsGenie: For alert routing if you have more than 2 people.
- Custom health check endpoint: checks a bunch of internal status metrics (NewRelic metrics for speed and error rate, DB connection check, a few E2E tests). Returns a simple overall status: OK, Warning (error rate high but not critical, slower than usual response times), Critical (DB down, errors high, anything else). Benefit of this approach is that it's easy to add new health checks any time you add a feature (example: is redis up?).
- Pingdom: polls custom endpoint every minute. Sends notification to PagerDuty if critical, or if warning for more than 5 mins.
- PagerDuty to notify team
- SSL Expiry: calendar notifications (whole team) and reminder emails from our SSL cert issuer
I started using Site24x7 from Zoho about a year ago for all of my apps and my clients' "basic" websites.
It ended up being a great tool for me because it allows the most basic ping tests and content checks to be set up in 10 minutes. It also has the ability to add reporting based on apps, servers, and databases. The AWS add-ons helped me tune my usage, as I was paying way too much for a couple of services that I could downgrade without impacting my apps.
I feel it is priced right and a good value up to the $89/month plan.
People in the market for API monitoring and site transaction monitoring, have a look at https://checklyhq.com. We offer full API monitoring and add Puppeteer-based site transaction monitoring. We aim to be a one-stop shop, so we give you a big dashboard and nice things like SSL cert monitoring and SMS alerting.
Many of our larger clients use https://www.splunk.com for enterprise-level monitoring, alerting, and log analysis. It has proven effective for debugging issues in environments where I don't have direct log access.
https://nodeping.com - cheap, has an API, can graph a simple JSON response in addition to ICMP, HTTP, etc. Generally reliable, and we've been using it as our backstop monitoring for hundreds of nodes for several years.
DataDog. Our API is 100% serverless microservices (AWS Lambda), and 90% of them connect to Elasticsearch / Dynamo. If we start getting high error rates, an alarm goes off and Slack / email lights up. We are monitoring upwards of 300 Lambdas this way.
KeyChest.net is great for managing your TLS certificates.
It can use the certificate transparency (CT) logs to detect new certificates for your domain, so once set up you don't have to maintain it. Make sure to enable the weekly report email too.
Virtually every monitoring tool has already been mentioned, but I just want to add that you can always get collectd and StatsD up and running in seconds, for free, on Linux hosts. Very lightweight, and they can measure virtually anything.
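If you go that route, pushing a metric from app code to a local StatsD daemon is about as small as it gets (using the common statsd Python client; the metric names are examples):

    # Emitting app metrics to a local StatsD daemon with the `statsd`
    # Python package (pip install statsd); metric names are examples.
    import statsd

    client = statsd.StatsClient("localhost", 8125)
    client.incr("myapp.health_check.ok")         # counter
    client.timing("myapp.response_time_ms", 42)  # timer, in milliseconds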
For my needs -- several websites -- a free and simple solution will be enough. Pings a few times per day would do. Even once per day might do.
I discovered earlier this week that Keybase.io also offers pseudo-monitoring: it emailed me that the proof on my website was no longer valid due to broken HTTPS.
I installed a node.js module that takes a screenshot of a web page, a module that compares two images, and a module that sends e-mails. Glued them together and put it on a $1/month VPS. I now get an e-mail every time a web page that I'm interested in changes.
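In the same spirit, a hedged Python take on that glue (assuming headless Chrome is installed; the URL, file paths, and addresses are placeholders):

    # Headless-Chrome screenshot, pixel diff with Pillow, email on change.
    import os
    import smtplib
    import subprocess
    from email.message import EmailMessage

    from PIL import Image, ImageChops

    URL = "https://example.com/page-i-care-about"  # hypothetical

    subprocess.run(["google-chrome", "--headless", "--screenshot=new.png", URL],
                   check=True)

    try:
        old = Image.open("old.png").convert("RGB")
        new = Image.open("new.png").convert("RGB")
        changed = ImageChops.difference(old, new).getbbox() is not None
    except FileNotFoundError:
        changed = True  # first run: no baseline screenshot yet

    if changed:
        msg = EmailMessage()
        msg["Subject"] = f"Page changed: {URL}"
        msg["From"] = msg["To"] = "me@example.com"
        msg.set_content("The page changed since the last check.")
        smtplib.SMTP("localhost").send_message(msg)
        os.replace("new.png", "old.png")  # promote to new baseline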
UpDown is nice but I'm considering moving because they don't support multiple domains on the same status page. I don't need a single status page for each of the domains I own, that would be ridiculous.
You can read more about the Dollar Shave Club setup described at the top of the thread here:
- https://engineering.dollarshaveclub.com/monitor-all-the-thin...
- https://circleci.com/blog/how-dollar-shave-club-3x-d-velocit...
- https://github.com/dollarshaveclub/monitor