Ask HN: How do you monitor your websites?
143 points by skies on Oct 24, 2018 | 100 comments
What tools do you use to monitor uptime of your web apps and/or APIs? Also, how do you track SSL/domain name expiry?



At Dollar Shave Club, we use CircleCI 2's Scheduled Workflows to run "monitors" against our production services every minute. These are idempotent, analytics-disabled API & Browser (via Puppeteer) tests that we also run as CI tests on every commit.

We send all monitor metrics to DataDog. When a monitor fails, the appropriate teams will get a Slack notification with the full stack trace. A DataDog monitor will also be triggered, alerting the appropriate teams.

For browser monitors, we upload screenshots and Puppeteer tracing files to S3, then share links within each Slack hook. This allows people to figure out what's going on just by clicking links in Slack.

We plan to improve this setup in the future (for example, CircleCI degrades fairly frequently, so we sometimes get spotty coverage), but it's good enough for us right now. We spend less than $200/month with CircleCI to monitor about 300 APIs/pages every minute.

You can read more here:

- https://engineering.dollarshaveclub.com/monitor-all-the-thin...

- https://circleci.com/blog/how-dollar-shave-club-3x-d-velocit...

- https://github.com/dollarshaveclub/monitor


UptimeRobot is quick to set up and has a lot of options, including a white-label status portal [0]. It's reliable and cheap ($54/year).

I'm not affiliated with them, just a happy customer.

[0] E.g. status page: https://status.appdrag.com/


I second this. Been using the free plan of UptimeRobot for years without a problem. It pings your website every 5 minutes and you get an email alert when it’s down.



I've had a bad experience with Uptime Robot - it only checked our server via IPv4. We didn't know IPv6 was down for far too long (it was a DNS problem).


Sounds like you didn't check that they checked IPv6. Businesses aren't going to tell you where their services are lacking.


+1 for uptimerobot. Been using them from the start... (not affiliated)


Once every few months, I go to the website.


Guess I'm not the only one, and of course my personal site was down for the last three days because the domain registration had expired and couldn't be renewed on an outdated card.

I facepalmed.

But once my setup gets a bit more complex, I was thinking I'd build webhooks into an analytics server and ping those from each server with a JSON request that includes health data for that server's databases and other services.

I'm nowhere near that, so I'll probably sign up for the free plan on UptimeRobot mentioned in other comments.


Well, this solution works for a lot of people. Though for me it's mostly weekly for some sites, and daily for others.


Really??? Teach me! How do you manage doing that???


Step 1: Go to the website.

Step 2: After a few months, repeat step 1.


I basically add a /global-health endpoint on my server. It executes a bunch of checks programmatically - e.g. database connection, rendering working, etc. It would be easy to add in "fail if cert expires in next month".

Then, I monitor just that one endpoint with Stackdriver (because it's easy). If any of the checks fail, it logs the failure, prints details, and returns a 500 status code. Adding new checks is just a code change.
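A minimal sketch of such a /global-health endpoint (the check names here are made up; each one would really hit your database, template renderer, and so on):

```python
# Aggregate health endpoint: run every registered check, return 500 if any fail.
import json

def check_database():
    return True  # e.g. run "SELECT 1" against the database

def check_rendering():
    return True  # e.g. render a known template and verify the output

CHECKS = {"database": check_database, "rendering": check_rendering}

def global_health():
    """Run every registered check; return (http_status, json_body)."""
    results = {name: fn() for name, fn in CHECKS.items()}
    ok = all(results.values())
    # Adding a new check is just a code change: register another function.
    return (200 if ok else 500), json.dumps({"ok": ok, "checks": results})
```

An external monitor then only needs to watch one URL for a non-200 response.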


Are you worried about that endpoint being used for DoS or is it light enough?


The way I do it is to run the checks in the background every x minutes and write the result to a JSON file. It's fast and safe.


I'd guess you'd then also want to return a timestamp in the response and then fail if the timestamp is older than x minutes too
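That staleness guard could look roughly like this (the file name and the five-minute window are assumptions; the background job is assumed to write `{"ok": ..., "checked_at": <unix time>}`):

```python
# Serve a cached health result, but fail if the background checker has died.
import json
import time

MAX_AGE_SECONDS = 5 * 60  # assumed freshness window

def read_health(path="health.json", now=None):
    """Return (http_status, message) from the cached check result."""
    now = time.time() if now is None else now
    with open(path) as f:
        result = json.load(f)
    if now - result["checked_at"] > MAX_AGE_SECONDS:
        return 500, "stale: background checker may have died"
    return (200, "ok") if result["ok"] else (500, "checks failing")
```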


It's rate-limited and behind Cloudflare. But, the checks are like "can I connect to Redis", so it is pretty lightweight.


What tools do you use to monitor uptime of your web apps and/or APIs?

I use a custom AWS Lambda function. It fires every four hours or so, and tries to make an https connection to each configured URL (the URLs are stored in a file in S3) and if the site is either down, or if there is an SSL error (which probably means an expired certificate) then it sends me a text message using SNS.

The whole thing is about 50 lines of code, and that's in Java. And it doesn't even come close to exceeding the free tier limit of Lambda calls, so it doesn't even cost anything so far.

To be fair, I could have used a 3rd party service, but writing this thing was my first foray into using Lambda, so I did it as much for the learning experience as anything. But it works really well, so I doubt I'll replace it anytime soon.
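The same idea sketched in Python rather than Java (the URL list and the SNS call are stubbed; the real function loads its URLs from S3):

```python
# Check each URL over HTTPS; report sites that are down or have TLS problems.
import ssl
import urllib.error
import urllib.request

def check_site(url, timeout=10):
    """Return None if the site looks healthy, else a short failure description."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
    except urllib.error.HTTPError as e:
        return f"HTTP {e.code}"
    except urllib.error.URLError as e:
        if isinstance(e.reason, ssl.SSLError):
            return f"SSL error (expired cert?): {e.reason}"
        return f"unreachable: {e.reason}"
    return None

def handler(event, context):
    """Lambda entry point: check every URL, report failures (SNS stubbed out)."""
    urls = ["https://example.com"]  # really loaded from a file in S3
    failures = {url: msg for url in urls if (msg := check_site(url)) is not None}
    if failures:
        pass  # e.g. boto3.client("sns").publish(TopicArn=..., Message=str(failures))
    return failures
```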


Just wondering, did you use Scheduled Events or what did you use to implement firing on an interval?


Yeah, it's just using the standard AWS scheduled events stuff. Nothing fancy. Then again, our needs (for now) are pretty simple. I basically just want to know if our website goes down or if our SSL certificate expires or something.

One of the sites is a SaaS offering, but it's not live yet, so I don't need to stay super-on-top-of-it. Once it's live we'll want more frequent monitoring and some other stuff, so we might either move to another approach, or supplement this with something else.


If you want a light (and free) DIY solution, it's pretty straightforward to build a basic uptime monitor (with connections to Slack, SMS, whatever) using Standard Library [0] and Scheduled Tasks. One of our engineers just posted this article (you can build from your browser):

https://hackernoon.com/build-an-uptime-monitor-in-minutes-wi...

Another option if you don't feel in the mood for DIY is TJ Holowaychuk's Apex Ping: https://apex.sh/ping/. Great service, run by a solo developer, reasonable price.

[0] https://stdlib.com

Disclaimer: I founded Standard Library. :)


Sorry but that has got to be one of the worst company/product names ever. Either it is completely un-searchable since literally every programming language has a "standard library" or if it actually takes off it will push down actually relevant results for niche programming language docs.


Yes — you’re not the first person to come at me with a pitchfork for the domain and name and won’t be the last. I do appreciate the feedback, but we’re committed to the name. :)

We haven’t historically had a problem with “stdlib”, we’re already the top Google result. “Standard Library” (full name) is new for us as we expand to a less technical cohort of customers. We’re working with some pretty great people and companies (Stripe, Slack) on our mission to build a, well, Standard Library — so if you can get over the name choice you should check out our online development environment! -> https://code.stdlib.com/


We have been fairly happy with Runscope [0] for simple monitoring of API response codes and body payloads. My biggest complaint is probably the lack of individualized response data after >= 24 hours.

Still waiting to see if the CA Technologies acquisition [1] makes things worse or not.

[0] https://www.runscope.com/ [1] https://blog.runscope.com/posts/301


Note that the retention is not time-based: https://www.runscope.com/support/kb/test-result-retention Hopefully we have a better solution for this in the future.

There are more resources working on Runscope than ever before. CA continues to invest in stability, new features and support. CA is also going through an acquisition and that introduces more variables, but as of today (1+ year after acquiring us), CA has been extremely supportive of Runscope.


Try Protectumus: https://protectumus.com

Protectumus monitors website uptime, speed, and DNS changes, scans the website for malware like a traditional antivirus, and blocks bad bots, custom IPs, and countries. Protectumus acts as a firewall. We are the only security company specialized in SEO security; we offer unique SEO services such as search-engine cloaking monitoring, Google DMCA complaints, blacklist monitoring & removal, and more.

Details here - https://news.ycombinator.com/item?id=18295381

Full disclosure: I'm the founder.


From your site:

> Protectumus acts as a web application firewall (WAF) and scans the website for known malware. Once the malware is found it will be automatically removed.

How does a WAF remove malware? Is there an additional agent sitting on the server or something? And surely when a server is pwned, it needs to be reformatted, not just the malware 'removed'.


Thanks for your question. We have both a Firewall and a traditional Antivirus. They work separately.

The firewall blocks bad bots and hack attempts like SQL injection, XSS, CSRF, and more.

The antivirus scans for known malware (we have a big list of malware definitions), but we also use AI and machine learning, so the antivirus learns from previous detections and is able to act on its own. The script can automatically remove the malware once it is found.


Thank you for the info.


In case you're interested in my product I prepared a 100% discount coupon, when you register use this coupon code: 193AE710F68353B8C17774D73BE52466


I use uptimerobot.com to monitor uptime for personal or client stuff that isn't really mission-critical.

It's not perfect but I hardly have issues with it.

Honestly, I don't do expiry checks...


To piggyback on the question, has anyone had a good experience using Prometheus and Grafana for monitoring? I'm looking into trying it. I've looked into Zenoss, but from what I gather it's slow.


I love it. Coupled with Alertmanager it's a really easy to use and powerful platform. We also started serving custom Prometheus exporters for our product usage which has been helpful and really easy to implement.


To me, this solution leans too heavily toward graphing, and that's not the primary feature you need for health-checking services.

Also, Prometheus only keeps 15 days of data by default, and since the developers aren't willing to implement downsampling, it's not great for long-term metric history. You can work around this by using InfluxDB as the storage backend, but I would rather use something like monit, which is simpler, easier to set up, and gets the actual job done well.


We <3 Prometheus & Grafana :) Mostly because it gets us the classic monitoring of hosts/services like Nagios, etc., but also gets us app-specific metrics as well.


We are using it and it's great; it's so easy to also plug in your own metrics from whatever application you're running. I insisted to management that we needed it on a new project and they were like "fine, if you insist; we do have Google Analytics and UptimeRobot, you know". Now they call me as soon as anything is down (we installed it on a crappy server because it was just my fixation :) ).


Another happy user here. Only we use Telegraf to gather the metrics on the servers, I've found that it's a bit more flexible than the normal node exporter.

For alerts we have a mix of Grafana and Alertmanager.


Yes, of course. It's a great tool for tracking statistics and writing alerts. Prometheus is designed to scrape stats from services directly, but it's also possible to do some kinds of active checks. For checking SSL cert expiry dates you can use the blackbox exporter: https://github.com/prometheus/blackbox_exporter


I have no affiliation with any of these companies.

I use StatusCake for basic uptime monitoring of websites. I switched to them from Pingdom because they are cheaper. The only downside I've had with StatusCake is that when something is down, it doesn't give you the cause; Pingdom would show you the traceroute. That made diagnosis harder, and I used to get quite a few false positives saying sites were down. I haven't had false-positive issues for months now, though. Their paid plans have SSL/domain monitoring.

Monitis for cheap Linux CPU/RAM/Load monitoring.


I've also had great success with StatusCake. The SSL cert monitoring is a lifesaver.


I have Telegraf running on all my machines that pipes data to an InfluxDB for storage and then Grafana for visualizations.

Gives me all sorts of useful information that I use to make decisions.


This is such a cool way of doing monitoring.


You need to realize health checking doesn't need to be cool. It's a good solution for eyeballing the statistics, but there are simpler and more effective solutions for health checking.


If you're looking beyond uptime + certs, we do functional + visual browser testing at https://ghostinspector.com/. Lots of folks use it for monitoring their website or application (in addition to their CI process). We have a free tier that includes scheduling. [Disclosure: I'm the founder]


I have been using Ghost Inspector for a while now, and so far it has been fantastic. It is really nice to be able to push to a branch and get notified in Slack a couple of seconds later with any issues, including screenshots.


\o/ Really glad to hear that! Reach out if we can ever help with anything.


Nagios, in combination with check_mk 'raw edition' (https://mathias-kettner.com/editions.html). The Nagios configuration is automatically generated via Puppet resource collection.

SSL certificate expiry is easily checked with a nagios check (use the -C flag on check_http). If you use Letsencrypt with a client like acmetool (https://github.com/hlandau/acme) your certs will never expire. Of course the nagios check is still necessary to ensure acmetool keeps doing its job!

Domain name expiry checking could also be a nagios job, or alternatively you could write a small script that checks whois output and execute it regularly with cron.

Configuring your registrar for auto-renewal helps avoid a certain class of errors ("I forgot to renew!") but not others ("my credit card expired and the e-mail notifications from my registrar didn't reach me").
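If you'd rather not go through check_http, the cert-expiry side can also be done in a few lines of Python using the standard library (the host and threshold here are just examples):

```python
# Connect with TLS and compute days until the peer certificate expires.
import socket
import ssl
import time

def cert_days_remaining(host, port=443):
    """Days until the TLS certificate served by host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    # notAfter looks like "Jan  1 00:00:00 2030 GMT"; the ssl module can parse it.
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - time.time()) // 86400)
```

A cron job could call this daily and alert (or just email) when the result drops below 30.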


Is there a reason to pick Nagios today other than "I know it well"?


TICK stack: Telegraf, InfluxDB, Chronograf, Kapacitor. They have a massive array of plugins that can watch/store/graph/alert on just about anything.

https://www.influxdata.com/time-series-platform/


I agree with this. I just set this up at work and it's really cool. Still in active development and not fully mature yet, but gets the job done and is pretty good quality.

I wish the documentation was better but telegraf's documentation is light years ahead of collectd, which is similar software.

Kapacitor needs some more examples and the default Chronograf-generated TICKscript needed to be thoroughly modified to meet my needs. It took me way too long to figure out how to use stateChangesOnly() to prevent me from getting constant notifications once something went into an alarm state.

That said, the stack works well, even if it has a few rough edges. Thanks to Influxdata for the open source stuff. High quality open source software makes me want to endorse them and purchase the paid products.


Agree on the TickScript criticism! Just not enough stack overflow answers to go around. Btw, stateChangesOnly() can take a time argument which is really really handy!


Pingdom, SiteUptime, and Montastic. None of these are relied upon for mission critical monitoring. I set them up as a sort of "canary in the coal mine" for each of my servers, to help alert me to issues.

I also created a very hacky browser start page which has GIFs that are pulled from each of my servers. Sort of like how "Game Copy World" used to set up their mirror page. Super unsophisticated, but it lets me do an "uptime check" every time I open my browser.


I see that it's trending but without any comments - so allow me a shameless plug, I created a tool to monitor my APIs (can schedule calls, do response content checks, send alerts etc): http://www.apilope.com

If you drop me a line after you sign up, I can flag you as a demo user that's free forever - or at least until you want to pay or cancel :)


Amazon Route 53 health checks for monitoring. Cert renewal is either AWS Certificate Manager or Let's Encrypt (depending on the use case), so it's handled automatically.


I run Selenium integration tests inside a docker container every 5 minutes or so and attach the results to a sentry.io logger. I put up a boilerplate version on github: https://github.com/mikedh/selenium-simple


This is almost literally what I used as an inspiration for the site transaction monitoring part of https://checklyhq.com, as in a browser emulation feeding a monitoring system at regular intervals. Selenium was always a bit of hog so I jumped on Puppeteer when it came around.


I use Ping [0], never used an alternative so not sure how it compares on features, but it has everything I need: uptime alerts, header & body requests. All packaged in a nice interface with a solid pricing structure.

[0] https://apex.sh/ping/


Answering the second part of your question:

Re SSL expiry: we've taken the route of running Virtualmin on our 'commodity' servers. Sites set up in there have an auto-renew policy if they use SSL, so certs auto-renew before they expire with no action needed (IIRC it's every three months? but I might be wrong: it requires no thought nor action from me).

For domain names, we use Joker (Swiss company), we've found them to be good in all aspects, and they send renewal notices well in advance (including a link that anyone can use to renew the domain, even without having an account with them).

Uptime monitoring is a whole different kettle of fish, and we manage it on a per client basis depending on needs. For the sites we host, generally if the server's up, then the sites are up — but we also do more specific monitoring if it's a client requirement.


uptimerobot.com. Use automatic Let's Encrypt certs and forget about it.


I'm using the same, but I'm looking around for alternatives after I received a couple of false alerts (about 3-4 over the last year) from them. Still, overall a good service.


I was unable to find a single service that could monitor both my websites and servers, so I decided to develop something[0] to scratch my own itch. It has been open for almost a year now and is slowly gaining some traction. :) It lets you monitor your servers' resources, and your websites for uptime, SSL certificates, broken links, and mixed-content errors. It also gives you a public status page. Setup is extremely easy: it's just a single binary (and it's open source!) that you run from your own crontab.

Feel free to let me know what you think about it, feedback is always greatly appreciated.

Disclaimer: I'm the developer.

[0] https://servitor.io


I've written custom bash scripts that Zabbix runs and alerts based on their output.

For example, domain expiry - I have a script that Zabbix runs once a day that does a whois lookup and grabs the expiry date for the domain in question, converts that to a Unix timestamp, subtracts the current date, and echoes the result.

In Zabbix we can now alert if that item's value is <30d. Similar for SSL certificates, and the web monitoring stuff is built in.

Edit: Oh actually on the whois, I remember it was a huge pain in the butt getting the expiry for a variety of different domains - I now use https://jsonwhoisapi.com/ to get the whois info
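A sketch of that daily script: parse the expiry out of raw whois output, convert it to a Unix timestamp, and echo the seconds remaining for Zabbix to alert on. The field name ("Registry Expiry Date" below) is exactly the painful part, since it varies by TLD:

```python
# Daily Zabbix item: seconds until the domain's registration expires.
import re
import subprocess
import time
from datetime import datetime, timezone

def seconds_until_expiry(whois_text, now=None):
    """Seconds until the 'Registry Expiry Date' found in raw whois output."""
    now = time.time() if now is None else now
    match = re.search(r"Registry Expiry Date:\s*(\S+)", whois_text)
    if match is None:
        raise ValueError("no expiry field found in whois output")
    expiry = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    return int(expiry.replace(tzinfo=timezone.utc).timestamp() - now)

def check_domain(domain):
    """What the cron/Zabbix item would run: whois, then echo the number."""
    out = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
    print(seconds_until_expiry(out))
```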


External monitoring: Pingdom/Site24x7. Lesson learnt - have the alerts route to at least a few emails outside of your company domain if you use the same domain for email.

Site monitoring: NewRelic

PagerDuty/OpsGenie: For alert routing if you have more than 2 people.


- Custom health check endpoint: checks a bunch of internal status metrics (NewRelic metrics for speed and error rate, DB connection check, a few E2E tests). Returns a simple overall status: OK, Warning (error rate high but not critical, slower than usual response times), Critical (DB down, errors high, anything else). Benefit of this approach is that it's easy to add new health checks any time you add a feature (example: is redis up?).

- Pingdom: polls custom endpoint every minute. Sends notification to PagerDuty if critical, or if warning for more than 5 mins.

- PagerDuty to notify team

- SSL Expiry: calendar notifications (whole team) and reminder emails from our SSL cert issuer
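The tiered status in that endpoint might map to code like this (the thresholds and signal names are invented for illustration):

```python
# Collapse individual health signals into OK / Warning / Critical.
def overall_status(error_rate, p95_ms, db_up):
    """Most severe level wins: check critical conditions first."""
    if not db_up or error_rate > 0.05:
        return "critical"
    if error_rate > 0.01 or p95_ms > 800:
        return "warning"
    return "ok"

def health_response(error_rate, p95_ms, db_up):
    level = overall_status(error_rate, p95_ms, db_up)
    # Warning still returns 200 so the poller can apply its own "warning for
    # more than 5 minutes" rule; only critical is an immediate 500.
    return (500 if level == "critical" else 200), level
```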


I started using Site24x7 from Zoho about a year ago for all of my apps and my clients' "basic" websites.

It ended up being a great tool for me because it allows both basic ping tests and content checks to be set up in 10 minutes. It also has the ability to add reporting for apps, servers, and databases. The AWS add-ins helped me tune my usage, as I was paying way too much for a couple of services that I could downgrade without impacting my apps.

I feel it is priced right and a good value up to the $89/month plan.


People in the market for API monitoring and site transaction monitoring, have a look at https://checklyhq.com. We offer full API monitoring and add Puppeteer-based site transaction monitoring. We aim to be a one-stop-shop, so we give you a big dashboard and nice things like SSL certs monitoring and SMS alerting.

Full disclosure: I'm the founder.


I use PA Server Monitor. The Web Page monitor can check a URL, submitting form data if configured, and check for text on the returned page.

It can also alert if the SSL cert is within X days of expiring.

https://www.poweradmin.com/help/pa-server-monitor-7-1/monito...


I have uptimerobot.com send me an email when it’s down. That email is forwarded to my cell phone via text using [phonenumber]@txt.att.net.


Many of our larger clients use https://www.splunk.com for enterprise-level monitoring, alerting, and log analysis. It's proven effective to debug issues in environments where I don't have direct log access.


https://nodeping.com - cheap, has an api, can graph a simple JSON response in addition to ICMP, http etc. Generally reliable and we’ve been using it as our backstop monitoring for hundreds of nodes for several years.


DataDog. Our API is 100% serverless microservices (AWS Lambda), and 90% of them connect to Elasticsearch / Dynamo. If we start getting high error rates, an alarm goes off and Slack / email lights up. We are monitoring upwards of 300 Lambdas this way.


KeyChest.net is great for managing your TLS certificates.

It can use the certificate transparency (CT) logs to detect new certificates for your domain, so once set up you don't have to maintain it. Make sure to enable the weekly report email too.


Virtually every monitoring tool has already been mentioned. But I just want to add that you can always get CollectD and StatsD up and running in seconds for free on linux hosts. Very lightweight, and can measure virtually anything.


For uptime monitoring: https://uptimerobot.com

For backend and frontend monitoring: https://atatus.com


Not websites as such, but I tend to monitor the underlying web servers and services with netdata.

https://github.com/netdata/netdata


I have very simple needs in this regard, so a very simple tool does the trick and does it very well: https://servercheck.in



For my needs -- several websites -- a free and simple solution will be enough. Pings a few times per day would do; even once per day might do.

What are free solutions?

One is Google Docs. What else?


New Relic is great for APM, uptime etc

SSL & domain name on auto-renew. SSL via lets encrypt with a renewal cron and domain is set to auto-renew in domain registrar dashboard.


https://www.statuscake.com/ free account has been working fine for me.


I discovered earlier this week that Keybase.io also offers pseudo-monitoring: it emailed me that the proof on my website was no longer valid due to broken HTTPS.


We have everything set to give us email notifications and have a group (softwareadmin@company.com) that the emails go to.



Love its simple config language and ease of setup.


NodePing has powerful monitoring for web app, APIs, and websockets. Checks for SSL and domain expiry as well.


I recommend NodePing as well. https://nodeping.com/

A ton of different types of checks. A lot of value for not a lot of money. The public status page style is minimalistic, but exactly what I want.

Example: https://status.regexplanet.com/


Along with all our other services, they're monitored by our off-prem Icinga instance.


I installed a node.js module that takes a screenshot of a web page, a module that compares two images, and a module that sends e-mails. Glued them together and put it on a $1/month VPS. I now get an e-mail every time a web page that I'm interested in changes.
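The change-detection part of that glue, sketched in Python (the original used node.js modules; here the "compare two images" step is simplified to a byte-level digest, which works when the screenshot tool emits identical bytes for identical pixels, while a real pixel-diff library would tolerate encoding differences):

```python
# Compare the latest screenshot to the previous one; email only on change.
import hashlib

def digest(path):
    """SHA-256 of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def page_changed(previous_path, current_path):
    """True if the new screenshot differs from the previous one."""
    return digest(previous_path) != digest(current_path)
```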


what's the name of the module that takes screenshots?


Wild guess: maybe puppeteer and its page.screenshot.

https://github.com/GoogleChrome/puppeteer/blob/v1.9.0/docs/a...


I get pinged through a discord webhook when my site crashes (restarts).


I have a super tiny personal website run on a VPS. I use uptimerobot.


We wait until the CIO calls since he gets all of the alerts anyway.


You can try WebGazer. It's a very user-friendly website monitoring tool and it's cheap. I haven't gotten any false alarms.

https://www.webgazer.io/


keychest.net for SSL expiry. Great free service; please consider donating to it!


Montastic. Simple. $5.


I use datadog


https://updown.io/

Best value of any tool I've ever used. It does literally everything you asked. I didn't even know it checked SSL expiry till it pinged me.


UpDown is nice but I'm considering moving because they don't support multiple domains on the same status page. I don't need a single status page for each of the domains I own, that would be ridiculous.


I am curious to see other people's solutions.



