
They say "you need to add monitoring, metrics and logging" at the 500k user mark. That should be happening from user 0, with few exceptions that I can think of.



I feel like this is the reason stuff doesn't get shipped. If you couldn't release v0.1 without adding a robust analytics suite, nothing would ever get released.

Nothing wrong with releasing a v0.1 and not prioritizing anything but basic functionality until you start to grow. Half of this means you need to build functionality that doesn't require heavy monitoring, metrics, and logging to get the job done. All that infrastructure stuff is more fun to code when you've blown through the free tier of S3 storage anyway.


Nothing heavy needed. You should know that your thing is working: a basic health check. You should know what errors are happening, with basic logging. As soon as you have a paying user, you should have some form of reporting or alerting on this. You should have some basic metric(s) in place very early on, like requests per second or sign-ups. This should take you on the order of hours, not days, to set up. Nothing heavy. But without basic visibility into your system, you are asking for problems.
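
To make that concrete: the health check can be one extra route that exercises your single critical dependency. A minimal sketch in Python, assuming a Flask app and a SQLite file (the route name and DB are illustrative, not a prescription):

    import sqlite3

    from flask import Flask

    app = Flask(__name__)

    @app.route("/health")
    def health():
        # Touch the dependency that matters. If this raises, Flask
        # returns a 500 and whatever pings this URL will notice.
        conn = sqlite3.connect("app.db")  # hypothetical DB file
        conn.execute("SELECT 1")
        conn.close()
        return "ok", 200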


What tools would you use to set this up?


Personally, if I were just starting something (assuming non-AWS), I would use a free pinging service or write a quick cron job to do it. In the event that it did not come back healthy, I would send myself an email or an SMS; the health check can also report error counts, or use storage to track the number of restarts per unit of time. Logging would just be basic log rotation with stdout piped to a file. Basic metrics could be as simple as a metrics table that you can query with SQL. When you eventually need fancy graphs (maybe a bit later), there are lots of options.
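
As a sketch of that cron approach, assuming a /health URL and a local mail server for the alert (all names hypothetical), the whole pinger fits in a few lines of Python:

    import smtplib
    import urllib.request
    from email.message import EmailMessage

    HEALTH_URL = "https://myapp.example/health"  # hypothetical endpoint

    def alert(body):
        msg = EmailMessage()
        msg["Subject"] = "health check failed"
        msg["From"] = "cron@myapp.example"    # hypothetical addresses
        msg["To"] = "me@myapp.example"
        msg.set_content(body)
        with smtplib.SMTP("localhost") as s:  # assumes a local MTA
            s.send_message(msg)

    try:
        # urlopen raises on network errors and on 4xx/5xx responses,
        # so anything unhealthy lands in the except branch.
        urllib.request.urlopen(HEALTH_URL, timeout=10)
    except Exception as e:
        alert("health check failed: %s" % e)

Run it from cron every few minutes ("*/5 * * * * python3 /opt/ping.py" or similar). The metrics table is the same spirit: insert a row per event, GROUP BY when you want numbers.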

I'm not overly familiar with AWS services, but I'd be very surprised if they didn't give you a basic, good-enough solution to this with CloudWatch, which is even less work than outlined above.
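
They do: CloudWatch alarms cover this case. A hedged sketch with boto3, assuming an ALB to alarm on and an SNS topic already wired to your email (the names, dimension value, and ARN are made up):

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="too-many-5xx",  # hypothetical
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_Target_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-lb/abc123"}],
        Statistic="Sum",
        Period=300,               # five-minute windows
        EvaluationPeriods=1,
        Threshold=10,             # alarm past 10 errors per window
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
    )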


Depends on your stack and requirements (do you want to know about errors ASAP, or is a 2-5 minute delay ok?), but I personally love NewRelic because of how easy it is to set up (and the number of features that it has).

If you want tools that you can manage yourself, then a combination of StatsD + Grafana for metrics, and Sentry for errors. For logs, Graylog if you want to set up something complicated but powerful, and syslog-ng if you just want to dump the logs somewhere (so you can simply grep them).
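
For scale, wiring in that stack is typically one client call per event. A sketch with the statsd and sentry-sdk Python packages, assuming a StatsD daemon on the default local port (the metric names and DSN are made up):

    import sentry_sdk
    from statsd import StatsClient

    # Unhandled exceptions now get reported to Sentry automatically.
    sentry_sdk.init(dsn="https://examplekey@sentry.example/1")  # hypothetical DSN

    statsd = StatsClient("localhost", 8125)  # default StatsD port

    def create_account(user):
        ...  # your existing signup logic (stub)

    def handle_signup(user):
        create_account(user)
        statsd.incr("signups")          # counter, graph it in Grafana
        statsd.timing("signup.ms", 42)  # example timer, in milliseconds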


Most of these tools cost too much to scale past 1M users.


You could write your own service... some thin agent that runs on your boxes and dumps the files every hour to some storage-optimized boxes (your data lake)... where another process picks up those files periodically (or by getting a notification that something new is present) and loads them into a Postgres instance (you actually probably want something column-oriented).
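
The load step can stay short. A sketch with psycopg2, assuming the agent drops CSV files into one directory and an events table already exists (all names hypothetical):

    import glob
    import os

    import psycopg2

    conn = psycopg2.connect("dbname=metrics")  # hypothetical DSN
    files = glob.glob("/var/lib/agent/out/*.csv")  # the hourly dumps

    with conn, conn.cursor() as cur:
        for path in files:
            with open(path) as f:
                # COPY is the fast bulk-load path in Postgres
                cur.copy_expert(
                    "COPY events (ts, name, value) FROM STDIN WITH CSV", f
                )

    # The with-block committed the transaction, so it is now safe
    # to delete the loaded files and avoid double-loading them.
    for path in files:
        os.remove(path)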

Running every hour, you won’t get up-to-the-second or up-to-the-minute data points. For more critical sections of code, maybe have your code log directly to an agent on the machine that periodically flushes those calls to some logging service.
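
That "log to an agent that flushes periodically" part is mostly a buffer plus a timer. A minimal in-process sketch (in a real setup the agent would be a separate daemon shipping to your logging service):

    import threading
    import time

    class BufferedLogger:
        """Collects log lines in memory, flushes them in one batch."""

        def __init__(self, path, interval=5.0):
            self.path = path
            self.buf = []
            self.lock = threading.Lock()
            t = threading.Thread(
                target=self._flush_loop, args=(interval,), daemon=True
            )
            t.start()

        def log(self, line):
            with self.lock:
                self.buf.append(line)  # cheap; no I/O on the hot path

        def _flush_loop(self, interval):
            while True:
                time.sleep(interval)
                with self.lock:
                    pending, self.buf = self.buf, []
                if pending:
                    with open(self.path, "a") as f:
                        f.write("\n".join(pending) + "\n")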


1. Collectd and Graphite serve this well or can be modified

2. Nine of these commercial services gives you per second granularity that I have seen


s/Nine/None/

#if there are 9 i'd like to know of them :-)


Recently I tried logz.io. In the free plan you get 3 days of retention, 2 GB of logs daily, and automatic notifications (insights: your syslog is parsed and common errors are recognized). And they have one-click setup of visualisations in Kibana.




