I work as the backend developer at a mobile app startup, and we don't currently have any centralized logging.
So... how do you do it? Is there any way to have something similar to AWS X-Ray, to trace a single chain of events across platforms? Unless it's a bad idea? I really don't know ^^'
In Python, use the RotatingFileHandler to avoid running out of space.
2) Incrementally forward your log files to a server using something like fluentd that can pre-aggregate and/or filter messages.
A big advantage of logging to disk: if the logging server is unreachable, the forwarder can resume once it's up again. If you log directly over the network and things fail, the very log messages you need to troubleshoot the failure are potentially gone.
3) Visualize. Create alerts.
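For the first step, a minimal sketch of size-capped file logging with the standard library's RotatingFileHandler. The logger name, format string, and size limits are assumptions for illustration, not values anyone in this thread recommended.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=10_000_000, backup_count=5):
    """Return a logger that writes to `path` and rotates the file by size."""
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Once the file would exceed max_bytes it is renamed to path.1, path.2, ...
    # and the oldest backup beyond backup_count is deleted.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Disk usage then stays at roughly max_bytes * (backup_count + 1), and a forwarder such as fluentd can tail the current file.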
I've evaluated a bunch of logging solutions. Splunk is the best, and affordable at low data volumes (they have a pricing calculator, you can check for yourself). It's medium-hard to set up.
Sumo Logic is the easiest to set up, and at low data volumes, prices are similar to Splunk. You can get something working within an hour or less.
The ELK stack is free only in licensing cost, not in engineering time.
I've not actually tried Sentry.io but I saw it at PyCon and it looks pretty impressive. If you only care about tracking errors/events and not about general-purpose logging functionality per se, I would take a serious look at it.
For most people, the network being down means they can't reach the disk.
Buffering unsent logs via local disk or RAM is critical due to network flakiness for sure, but not logging over the network as well is a bad idea 100% of the time.
If you're talking about cloud solutions, that's what instance stores are for. Logging to local disk and then forwarding is still the best answer. And there's still a lot of world out there that's running on actual hardware.
It takes a kind of "ticket" approach to messages, it'll deduplicate and combine similar errors, and you go into a dashboard and see "We got ten thousand of this error, let me track it down, fix it, ack it and see if we keep getting it."
This simplified the ELK stack setup a whole bunch.
I think it needs at least a few more months or years to be a no-brainer to choose.
If the network is down, how are your users going to reach your app?
All syslog instances push to a central instance, also running rsyslog. This allows us to tail logs on each instance, as well as tail / grep system-wide on the central instance.
Central instance pushes everything directly into elasticsearch.
Using Kibana for searching and aggregating. Using simple scripts for generating alarms and reports.
Every day a snapshot of the previous day is uploaded to S3 and indexes from 14 days ago are removed. This allows us to easily restore historical data from the past, but also keeps our ES instance relatively thin for daily usage / tracking / debugging. It also makes it possible to replace our central log instance without losing too much.
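A rough sketch of the retention half of that daily job: given the index names currently in the cluster, pick the ones that are 14 or more days old. The `logs-YYYY.MM.DD` naming scheme is an assumption; the snapshot upload and the actual delete calls are left out.

```python
from datetime import date, timedelta


def expired_indices(index_names, today, max_age_days=14, prefix="logs-"):
    """Return daily indices that are at least `max_age_days` old, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of our daily log indices
        try:
            year, month, day = map(int, name[len(prefix):].split("."))
            index_day = date(year, month, day)
        except ValueError:
            continue  # name does not end in a valid YYYY.MM.DD date
        if index_day <= cutoff:
            expired.append((index_day, name))
    return [name for _, name in sorted(expired)]
```

Running the snapshot first and deleting only after it succeeds keeps the S3 copy as the source of truth for anything older than the cutoff.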
All devs use some simple convention (ideally built into the logging libs) to make searching and tracing relatively easy. These include "request ids" for all logs pertaining to a single process of work, and "thread ids" for tracing multiple related "requests".
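One way to build such a convention into the logging library, sketched with the Python standard library. The names `request_id_var`, `start_request`, and `configure` are made up for this example; the comment above does not say how its convention is implemented.

```python
import contextvars
import logging
import uuid

# Holds the ID of the unit of work currently being processed.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def start_request(request_id=None):
    """Set the ID for the current unit of work; generate one if none is given."""
    request_id = request_id or str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id


def configure(handler):
    """Attach the filter and a format that prints the ID on every line."""
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s")
    )
    return handler
```

Developers keep calling `logger.info(...)` as usual; the ID shows up on every line without anyone having to remember to pass it.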
I documented how I have rsyslog and elasticsearch set up here:
I imagine in those cases something like logstash may help, but I don't really know as I tend to avoid logstash.
Think of a multi-stage pipeline for getting raw data from your transactional/interaction systems and extracting insights and intelligence out of them.
Stage-1: Ingestion – Keep this simple. Don't mess this up. It's a serious headache if you do.
1. Generate a request-id or message-id at the genesis of the request and propagate it throughout your call graph from client to servers (across any number of api call graph hops).
2. At each node in the call graph, emit whatever logs you want to emit, include this id.
3. Use whatever format you find natural and easy to use in each tech stack environment. Key is to make the logging instrumentation very natural and normal to each codebase such that the instrumentation does not get accidentally broken while adding new features.
4. Build a plumbing layer (agnostic of what is being logged) that can locally buffer these log messages, periodically compress and package them with added sequence and integrity verification mechanisms, and reliably transmit them to a central warehouse. Use this across all your server-side nodes. Build a similar one for each of your client-side platforms.
5. At the central warehouse, immediately persist these log packages durably, and only then respond to the client indicating it is safe to purge those packages on their local nodes.
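Steps 4 and 5 could look roughly like this. It is only a sketch: the package layout, the function names, and the in-memory `store` list (a stand-in for a durable write) are all assumptions, and it assumes individual log lines contain no newlines.

```python
import gzip
import hashlib


def pack_batch(sequence, lines):
    """Sender side: compress a batch and attach a sequence number and checksum."""
    payload = gzip.compress("\n".join(lines).encode("utf-8"))
    return {
        "sequence": sequence,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "payload": payload,
    }


def receive_batch(package, store, last_sequence):
    """Warehouse side: verify, persist, and only then acknowledge."""
    if hashlib.sha256(package["payload"]).hexdigest() != package["sha256"]:
        return {"ack": False, "reason": "checksum mismatch"}
    if package["sequence"] <= last_sequence:
        # Retransmission of something already stored: safe for the sender to purge.
        return {"ack": True, "sequence": package["sequence"]}
    if package["sequence"] != last_sequence + 1:
        return {"ack": False, "reason": "sequence gap"}
    store.append(package)  # stand-in for a durable write
    return {"ack": True, "sequence": package["sequence"]}


def unpack_batch(package):
    """Recover the original log lines from a stored package."""
    return gzip.decompress(package["payload"]).decode("utf-8").split("\n")
```

The sender deletes its local copy only after it sees `ack` set to true, so a crash on either side leads to a retransmission rather than a lost batch.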
Stage-2: Use-case driven ETLs.
6. Come up with use cases to consume this data. Define the data tables (facts and dimensions) needed to support these use cases.
7. Build a high-performance stream processing system that can process the raw log packages for doing ETL (extract, transform and load) on the raw data in different formats to the defined consumable data tables.
Stage-3: Actual Use-case data applications.
Run your analytics and machine learning systems on top of these stable consumable data formats.
Keep the stages separate and decoupled in code and systems. Don't do end-to-end optimizations that break the boundaries. Recognize that the actors/stakeholders involved in each stage are different. The job of the data team is to be the guardian of these stages and to run the systems and org processes that support them.
You seem to be describing low level principles, not high level ones. A high level principle would be "forward your logs to a centralized logging service and let the logging library and the service do 100% of the work for you", which I think is what nearly everyone should do (and which most are already doing).
Super simple, super useful, not everyone does it.
A simple example for a Python Flask app: http://www.structlog.org/en/stable/examples.html
import uuid
import structlog

logger = structlog.get_logger()
log = logger.new(request_id=str(uuid.uuid4()))
log.info("user logged in", user="test-user")
# gives you:
# event='user logged in' request_id='ffcdc44f-b952-4b5f-95e6-0f1f3a9ee5fd' user='test-user'
If I just copied your example I would probably have a few different ID:s for the same request in different parts of the application (unless it was a single-service app directly exposed to the internet).
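One way around that, as a sketch: only generate a new ID when the incoming request does not already carry one, and pass the same ID along on every outgoing call. The header name `X-Request-ID` is an assumption here; use whatever your services agree on.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name


def get_or_create_request_id(headers):
    """Reuse the caller's request ID if present, otherwise start a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())


def outgoing_headers(request_id, extra=None):
    """Headers to attach to calls this service makes on behalf of the request."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

With this, only the first service that sees the request (or the mobile app itself) creates the ID; everything downstream just reuses it.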
>the backend developer at a mobile app startup,
How about syslog, ELK stack or something and focus on building the app
Some good points in there like correlation IDs etc all the same
Have you tried something like opentracing.io ?
My GAE apps and google services just log there automatically. My non-GCP services require a keyfile and couple lines of fairly trivial setup.
I have a single logging console across my entire system with nearly zero effort and expense. It works incredibly well. Doing this in-house is a waste of engineering resources.
Not sure about other use cases such as visualization and triggering events. I assume they have an API or integrations for such things, just haven't needed it as of yet.
Their pricing changed recently, don't remember the details, but I do remember previously that non Google Cloud nodes did incur an additional cost. Free limits are decent, haven't paid yet for personal side stuff. But YMMV, check the pricing page https://cloud.google.com/stackdriver/pricing
After using Stackdriver, setting up your whole logging mechanism in AWS, at least, feels so backward.
For error tracking, which is mostly what you'll care about, use a service like honeybadger, or rollbar, or whatever fits well with your stack.
For performance metrics use a dedicated service for that as well. NewRelic, or Skylight, or whatever works well for your stack.
You do need to instrument your applications to emit traces, but don't go overboard. Make sure everything can extract the trace ID from headers / metadata and that requests they generate include the trace ID. Most languages have plugins for their HTTP / gRPC server and client libraries to do this automatically.
You will want your edge proxy to start the trace for you; this is very easy with Envoy and ... possible ... with nginx and the opentracing plugins.
I use structured logs (zap specifically), so I wrote an adaptor that accepts a context.Context for a given request, extracts the trace ID from that (and x-b3-sampled), and logs it with every line. This means that when I'm looking at logs, I can easily cut-n-paste the ID into Jaeger to look at the entire request, or if I'm looking at a trace, type the ID into Kibana and see every log line associated with the request. (The truly motivated engineer would modify the Jaeger UI to pull logs along with spans since they're both stored in ES. Someday I will do this.)
As for log storage and searching, every existing solution is terrible and you will hate it. I used ELK. With 4 Amazon-managed m4.large nodes... it still takes forever to search our tiny amount of logs (O(GB/day)). It took me days to figure out how to make fluentd parse zap's output properly. And every time I use Kibana, I curse it as the query language does overly-broad full-text searches, completely ignoring my query and then spending a minute to return all log lines that contain the letter "a" or something. "kubectl logs xxx | grep whatever" was my go-to searching solution. Fast and free.
If anyone wants to pay me to write a sane log storage and searching system... send me an email ;)
You can pry lingo/sawzall from my cold, dead hands.
Google has a system for that... but when I was there, it was awful. Meanwhile in the real world, we have ELK... and it's even worse. People stop looking at logs once the kubelet rotates it. It's just too slow and flaky.
(But yes... one key aspect to lingo/sawzall's design is that logs are sharded. And naturally, logs are sharded. Each program produces a log file over a period of time, and so (time, pod) forms a natural shard. Introduce something like ELK, and your sharding is thrown away, so you can never properly parallelize searching. A properly designed logging system would maintain shards and ensure that workers have replicas of those shards, so that you can use lots of computers to quickly get you the result you want. Of course, as much should be indexed as practical, so you can find the shard you're looking for without looking at every shard. Lots of work that could be done here, and it's all super easy. That's why it makes me mad that nobody has done this.)
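To make the sharding idea concrete, here is a toy sketch in which shards are keyed by (hour, pod) and only the shards that can match are scanned, in parallel. The in-memory dict and the thread pool are stand-ins; a real system would keep shard replicas on separate workers and use an index to pick shards.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(lines, needle):
    """Scan one shard; in a real system this would run next to a replica."""
    return [line for line in lines if needle in line]


def search(shards, needle, pods=None, hours=None, workers=4):
    """Search only the (hour, pod) shards that can match, in parallel."""
    selected = [
        key for key in sorted(shards)
        if (hours is None or key[0] in hours) and (pods is None or key[1] in pods)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda key: search_shard(shards[key], needle), selected)
        # Keep only shards that produced at least one hit.
        return {key: hits for key, hits in zip(selected, results) if hits}
```

Narrowing by pod or hour before scanning is what keeps the work proportional to the shards you care about rather than to the whole log volume.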
For general-purpose logging, we deploy Papertrail's remote_syslog2 https://github.com/papertrail/remote_syslog2 - which is more or less a set-it-and-forget-it setup. E.g., specify which text files I want to aggregate, and then watch them flow into the live tail viewer.
For logging in more limited environments (can't sudo or apt-get install), we use Loggly's http API (https://www.loggly.com/docs/http-endpoint/). Also, Loggly's JSON support allows us to answer questions like: "how many signup events failed since the last deployment". Or "What is the most common signup error".
Bonus! If you're looking for trace-level reporting and integrating that with your logs, check out the AppOptics and Loggly integration: https://www.loggly.com/blog/announcing-appoptics-loggly-inte...
There is still active server-side development that does not show up in the rs2 repo on GitHub.
I will forward this comment along to our product team as feedback - thanks!
If you try to add logging across the entire infrastructure in one shot, you won't know what logs you actually need. And when it comes time to diagnose a problem, you probably won't be capturing the correct data.
For me, this looked like logging to a ring buffer and then dumping that log with an associated error report when an exception occurred. That was good enough for 99% of the errors I debugged, and we never actually needed a log-shipping solution. Logs were kept on disk and requested to be uploaded on demand when investigating specific issues.
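A minimal sketch of that ring-buffer approach using the standard library. The class and function names are invented for this example, and the "error report" is just a dict; the original setup is not described in more detail than the paragraph above.

```python
import logging
from collections import deque


class RingBufferHandler(logging.Handler):
    """Keep only the most recent `capacity` formatted records in memory."""

    def __init__(self, capacity=1000):
        super().__init__()
        self.buffer = deque(maxlen=capacity)  # old entries fall off the front

    def emit(self, record):
        self.buffer.append(self.format(record))

    def dump(self):
        """Return the buffered lines, oldest first, for an error report."""
        return list(self.buffer)


def run_with_error_report(logger, ring, func):
    """Run `func`; on failure, return a report with the recent log lines."""
    try:
        return {"ok": True, "result": func()}
    except Exception as exc:
        logger.exception("unhandled error")  # also lands in the ring buffer
        return {"ok": False, "error": repr(exc), "recent_logs": ring.dump()}
```

Because the buffer has a fixed size, verbose debug logging costs memory but never disk or network until something actually goes wrong.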
It depends on what kind of startup you are in, what kind of product you ship, what kind of user base you have, and what kind of solution you have. If you cobble together a set of SaaS solutions, ETL will be your integration challenge.
The logs are written from containers to CloudWatch and subsequently forwarded to ElasticSearch, where we use Kibana and LogTrail to view and search them.
It's nowhere near as nice as XRay and other APM solutions but it hardly took any time to throw together. Fundamentally, this is how XRay works, only there is a specific format for the ID.
However, XRay now supports our runtime so we'll take another look at that. It looked like an interesting option at the time.
For a mobile app you'd want to assign a guid or some sort of user id to the device itself so you can track the distinct API calls it makes. I believe XRay and other systems support this but we don't have a mobile app so I don't know how that'd work for you.
Check it out. It's done wonders for me. You can manipulate, sort, retain and do other things on log events with it. It uses elasticsearch to store the logs.
It has SIEM like functionality with alerts and they are continuing to make it more suitable as a SIEM replacement.
And it does have cloudtrail support.
Maybe it's better now since I tried but it was a real negative when trying to import some of them to find out later they were incompatible.
You seem to be interested in tracing or APM, which also has many providers.
Lots of people do a local Elasticsearch, Logstash, Kibana stack which can be done without licensing with a variety of forwarders.
You might be most interested in Envoy Proxy or Elastic APM (there are many others)
Zipkin (https://github.com/openzipkin) and OpenTracing (https://github.com/opentracing) purport to be vendor/platform agnostic tracing frameworks and have support with various servers/systems/etc.
X-Ray was pretty trivial to use in AWS land w/ Java as a client.
It took me some time to... build up the courage to read through all of your answers, and you have been of tremendous help. I've learned quite a lot. Thank you very much! I deeply appreciate it!
I'll steer clear of self-hosted ELK, for now, mostly because being the only backend, I can't really take the risk of holding the whole team back while getting it up and running or maintaining it.
I'll look into Splunk, Sumo Logic, Sentry & a few others, while keeping in mind the more general guidelines that were laid down here.
Also, thank you for the terminology! It's much easier to find the proper resources now that I know what to look for!
Edit: I'll also take some time to answer to the different comments; but it really felt rude of me to be procrastinating while you all had taken the time to properly answer
Since it sounds like you also control the app, maybe make an HTTP header that the app sends that has some kind of UUID for that transaction. When your backend gets it, keep passing it on and logging it as part of your context when you emit log lines. Then using whatever log aggregation system you use, you can search for that UUID.
As for collecting your logs, I like ELK stacks, and they are easy to set up and get all your syslogging to go there. There are also ready made helm charts to install these into a kubernetes cluster if you're using that, and they will automatically scoop up everything logged to stdout/stderr.
Central rsyslog server uses mmnormalize/liblognorm to parse the apache logs and load them into Elasticsearch.
haproxy logs directly to rsyslog via a domain socket, RELP to central server, lognorm to load into ES.
ELB logs go into S3, and logstash pulls them down and loads them into ES.
The remainder of syslog messages just go into files on the central server.
We also have Sentry set up with some newer applications logging into that.
Next, you'll want to track analytics centrally. Etsy and Netflix have been pioneers in this area. Their engineering blogs are very good to follow. Think: something like a timeseries database (like Influx / Prometheus) and getting data into it. Use tools like Grafana to get data out of it in dashboards or reports. This is separate from your application debug / error logging system.
The next step after this is developing something that consumes data from both of those systems and provides alerts based on unusual activity -- something that provides early warning to devops.
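As a very rough illustration of "alerts based on unusual activity", the check below flags a value (say, errors per minute) that sits far above its recent history. The threshold, the minimum sample count, and the choice of a simple standard-deviation test are all assumptions; real alerting systems usually do something more robust.

```python
from statistics import mean, pstdev


def is_unusual(history, current, threshold=3.0, min_samples=5):
    """Flag `current` if it is more than `threshold` deviations above the mean."""
    if len(history) < min_samples:
        return False  # not enough data to judge
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any increase counts as unusual.
        return current > baseline
    return (current - baseline) / spread > threshold
```

The history could come from the time-series database and the current value from the log aggregator, which is the kind of cross-system consumer described above.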
Having spent months on this with my team, I found this to be the best high-performance stack for cloud & on-premise solutions for our clients
- Tracing and/or APM
The above are true for systems and applications, but let's talk applications. Your decision should be based on an assessment of at least the following:
- Do you have compliance requirements? (e.g. GDPR)
- What is your logs/metrics/traces retention period? (let's assume 30 days)
- What are your logs/metrics/traces lifecycle requirements? (Are you going to need logs older than 30 days? If not, I'd say don't bother: delete everything older. Keeping them around has managerial and hosting costs.)
I advise taking a look at ElasticSearch:
- ElasticSearch for hosting logs
- For sending logs, metrics and traces you can use filebeat, metricbeat and ElasticSearch APM or Jaeger.
If you are a small startup, I'd say go with ElasticSearch Cloud and use their tools. They do all you need and more.
I prefer metricbeat over prometheus/grafana because it solves the high availability headache for those who already have an ES cluster, and you don't have to support (set up, monitor, manage, scale) an additional stack. You can use a push model, which has its own pros and cons.
ps. No affiliation with elastic, I just spent some time with a variety of their products and like what I see so far.
Aside from the classic print statements and grepping log files manually.
Our services send to fluentd running on each instance which aggregate and flush to a Kinesis stream in AWS with KCL workers responsible for putting it through a separate pipeline that allocates the logs to specific indexes depending on the service(s) they come from as well as applying ACLs on a per-index basis.
That's how you get started. There are plenty of tools on this stack to do APM, security auditing, request logging, etc. If you are using a decent application server stack that produces metrics, it can handle those too.
This has worked flawlessly for years and is relatively easy to set up. Couldn't recommend it more.
For each individual instance of some class of things, generate a unique identifier. For example, each network request the mobile app makes to the backend should have a request ID. The mobile app includes that request ID in all its log entries and sends it with the request. The backend plumbs it through everywhere and all its log entries have it, too. If you have multiple instances of things in the backend, like batches of queries sent to a database, log an identifier for them as well.
Then you dump all of this into one big index in some semi-structured data store and use the identifiers to pull out all related entries.
4-core machine with 32 GB RAM
About 3000 logs per second. 70% CPU usage and 80% SSD usage.
Quite happy with the setup, since the SSD can be upgraded to a faster one. Also, a more powerful machine could handle about 10000 logs per second.
Would love to hear other numbers from Splunk or similar solutions.
Costs: nearly zero. Some time to set up and bring redis + curator into play.
In our legacy environment we're writing to files and sending them up to cloudwatch using awslogs.
Cloudwatch is kind of ass for logging, but they added insights somewhat recently; it upgraded cloudwatch logs from being unusable to just being a pain in the ass to use.
This works for us so far because it's super simple and we don't have a major need for log analytics, just the occasional production debugging session.
I did a PoC for fluentd + logdna/logz/etc and that also seemed to work pretty well.
If you need just text output logging, there are a few solutions already described. But at this point you should really make a decision - are you after simple text logs, or can you put in work to get structured events or tracing out of your app.
I'd probably use Graylog or some ELK stack variation though if our client would let us, since Splunk is $$$.
Advanced: log structured objects (keys and string values) to Riemann. Write smart rules in Riemann, then send those to ES and explore the structured objects in Kibana.
Works great, can handle thousands of messages per second on modest hardware.
But before you get there, you can standardize on a "request ID" header that gets passed through your call stack and logged by whatever services receive it. You can search for it in your log aggregator (SumoLogic, Splunk, etc.) and get a good idea of which services your request went to, what time they got it, how long it took, etc.
2) Use time series databases to log your server metrics. Eg. InfluxDB
3) Familiarize yourself with CLI tools like cat, less, tail, grep, sed for when you have to get your hands dirty with raw data.
4) Logrotate is a great choice to cap the size of different program logs.
NewRelic might also help.
If you're asking about logging generally - it's a vast subject and you probably need to ask more specific questions. On a StackExchange site probably.
Have made a small attempt to fix it in node.js
Logs are number one for privacy violations. Pin codes, passwords, social security numbers etc. People remember to hash the data in their databases, but logs are often forgotten about, then stored and archived containing data that are illegal under many laws such as GDPR. And obviously also a security risk.
Most of the time developers remember to scrub the data out of actual log messages, but forget that traces and rawer logged data also go into some log aggregators.
I am sure I have accidentally done it as well, though I try my hardest not to.
* Twitter: https://twitter.com/TwitterSupport/status/992132808192634881
* Monzo: https://www.zdnet.com/article/monzo-admits-to-storing-paymen...
* Github: https://www.bleepingcomputer.com/news/security/github-accide...
* Facebook: https://www.wired.com/story/facebook-passwords-plaintext-cha...
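The scrubbing idea can be sketched in Python as a logging filter that masks values for a list of sensitive keys before any handler sees the message. The key list and the `key=value` / `key: value` matching are assumptions for illustration; a pattern like this catches common cases, not everything, and is no substitute for not logging secrets in the first place.

```python
import logging
import re

SENSITIVE_KEYS = ("password", "pin", "ssn", "token")  # assumed field names
_PATTERN = re.compile(
    r"\b(" + "|".join(SENSITIVE_KEYS) + r")\b([\"']?\s*[=:]\s*)(\"[^\"]*\"|'[^']*'|[^\s,;&]+)",
    re.IGNORECASE,
)


def redact(text):
    """Mask the value of any `key=value` or `key: value` pair for a sensitive key."""
    return _PATTERN.sub(lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Scrub the fully formatted message before a handler writes it."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True
```

Attaching the filter to every handler (rather than to one logger) means it also covers records that arrive from libraries and child loggers.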
If your system observed something, it writes it down (logs it). If you want your system to react to that thing happening, then the log is going to have to be machine-parsable.
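A small sketch of what machine-parsable can mean in practice: one JSON object per log line, produced by a custom formatter for the standard logging module. The field names and the `fields` extra are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Hypothetical convention: structured data is passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, default=str)
```

A call such as `logger.info("user logged in", extra={"fields": {"user": "test-user"}})` then yields a line that a downstream consumer can load with `json.loads` and act on without regex parsing.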
Helps us see what happened across the whole system.
on newer projects, Serilog with combo of text file logging, sql sink and just recently to centralised ElasticSearch
Since then Logary has expanded with excellent support for sending both metrics and tracing data to a large number of targets. In production, I use this setup;
client browser -> Logs & metrics to Logary Rutta HTTP ingestion endpoint via Logary JS
nginx-ingress -> Traces to Jaeger Agent via opentracing C++ client
nginx-ingress -> Metrics to InfluxDB
nginx-ingress <- Metrics via Prometheus scrape annotation
Our NextJS site and GraphQL server:
site -> Traces to Jaeger Agent via opentracing
site -> stdout logs via Logary JS (also get added as Logs in the Span of OpenTracing)
site <- Metrics via Prometheus scrape annotation and prom-client
api -> Traces to Jaeger Agent via Logary's F# API
api -> Metrics to InfluxDB via Logary's F# API
api <- Metrics via Prometheus scrape annotation and Logary.Prometheus
api -> Events to Mixpanel via Logary.Targets.Mixpanel
api -> Logs to Stackdriver via Logary.Targets.Stackdriver (hosted on GCP)
Also, Kubernetes ships logs via FluentD to Stackdriver in GCP, but they are not structured, and the remaining infrastructural services also send traces to Jaeger if they can.
Logary Rutta is a stand-alone log router, written in Hopac + F# (like Concurrent ML), and used by some of the largest Swedish software companies for thousands of logs and metrics per second. It's capable of shipping to a large number of targets https://github.com/logary/logary/tree/master/src/targets Since it talks HTTP and UDP with a number of encodings (JSON, plain, binary), it's easy to plug into an existing infrastructure and existing log shippers. It can also connect point-to-point to itself with a high-perf binary encoding. Because you can send any JSON into it, it's very easy to get started with together with mobile apps.
Logary for JS currently has support for user logs, and I'm currently testing rudimentary metrics and browser info.
Logary for .Net supports the OpenTelemetry spec, structured logging and metrics.
Of course you can pick any toolchain you want, but I've had great success (and great fun!) writing and using the above. You can see I don't keep logs on disk; it causes disks to fill up. If your network is down, your service is down, and then you know it's the network anyway.
Once in Logary, you can choose where you send them. I've done an analytics/ETL pipeline based on Logary with its Stackdriver+BigQuery+GooglePubSub targets and with Flink, with great success as well. Logary is free to use for non-profit and then I have a pricing calculator on the home page, for when you start selling the software you build. Pricing aside, how Logary is structured and how I've used it might give you some hints on how to do it yourself.
If you're doing distributed containers, lambdas, or other more ephemeral things, you just can't do logging at leaf unfortunately.