Grafana Loki – Like Prometheus, but for logs (github.com/grafana)
444 points by bovermyer on Dec 12, 2018 | 162 comments

I still remember when Grafana first came out, because of InfluxDB. It was amazing at the time. I tried to set up Graphite, eventually found Grafana+InfluxDB, and went with it from there.

Grafana moves really fast. Very practical. At first, if I remember correctly, it used Elasticsearch/InfluxDB to store its own config. Then it moved to its own SQLite. Then MySQL/Postgres. Eventually it added alerting, then multiple data sources.

Nowadays my team even uses it with SQL as a cheap version of Periscope/Mode.

And now they are figuring out this log thing. Elasticsearch is great, but try running it yourself. On a 50-node k8s cluster, I bet your first ES config will go down within 10 minutes the first time FluentD comes up and sends every log under the sun since the cluster came up.

Given that the Grafana guys know how to deal with visualization, I trust them to deliver a great experience again for logs.

Look at their history, I'm going to bet hard on this.

> And now they are figuring out this log thing. Elasticsearch is great, but try running it yourself. On a 50-node k8s cluster, I bet your first ES config will go down within 10 minutes the first time FluentD comes up and sends every log under the sun since the cluster came up.

Tell me about it. I have not one, but multiple k8s clusters (some over 50 nodes), sending data to a single Elasticsearch cluster. A lot of tuning was done, and it's not yet perfect.

Elasticsearch is amazing, but it requires external support, and the tools available to do that are not on par. Curator specifically – it's dumb as a rock and is very limited on what it can do. Even a hot/warm architecture is a challenge to do on curator alone, you should ideally have custom scripts to manage it. Which is a shame as this is one of the "reference" architectures by Elastic. Whatever Curator does should really be part of ES itself.

The Kibana + ELK combo won't give you things like log tailing (logtrail is hackish and hard on the servers). The log forwarders are horrible to work with (be it logstash, filebeat – or worse: fluentd).

Some days it just works and you are happy. Some other days, you wonder why we moved away from syslog senders and text files...

I actually came to the same conclusion as you. ES isn't easy to tune and run well.

One thing many overlook is restarting it. You cannot just go and `systemctl restart` it; once that occurs, constant rebalancing happens because it puts the load on the rest of the system and eventually brings them down, all around :(. That makes it harder to operate in K8s, where it needs time to detach/attach volumes, plus the overhead of the overlay network.

So I don't know of a perfect tuning myself. But the log search in Kibana is super helpful.

That's why I'm kind of sold on this Loki thing. I have a good feeling that software written in Go tends to require less config/tuning compared with Java. E.g., InfluxDB.

> Some other days, you wonder why we moved away from syslog senders and text files...

I wonder the same. I actually like grepping logs with tail more than using a crappy web UI like Kibana, Sumologic, etc.

Tailing logs in a web UI is clunky.

> once [a reboot] occurs, constant rebalancing happens because it puts the load on the rest of the system and eventually brings them down, all around

I have this same problem with a 3 node cluster, no Kubernetes. This was using Elasticsearch 2.3 and it worked for a year or two and then all of a sudden, any network interruptions would cause the thing to get split brain and all indices would go red. It would take 8+ hours to go back to green and usually required a reboot of all servers or else it would never recover.

It was happening often enough that the decision was to just run a single ES node, since the total data was only 25GB :(

> Whatever Curator does should really be part of ES itself.

Just as an FYI, Index Lifecycle Management (ILM) is landing in 6.6 and provides some of the features that Curator supports: https://www.elastic.co/guide/en/elasticsearch/reference/6.x/...

Basically, it allows you to define things like hot/warm/cold architecture, rollover, retention, etc. at the index level, inside Elasticsearch. It will be landing as beta.

> Whatever Curator does should really be part of ES itself.

It is becoming part of ES. Index Life Cycle management UI is coming up in Kibana.

> The Kibana + ELK combo won't give you things like log tailing (logtrail is hackish and hard on the servers)

Recently, in 6.5.0 log tailing was added. Now one can look at streaming logs live.

> The log forwarders are horrible to work with (be it logstash, filebeat – or worse: fluentd)

I'd like to know the problem you faced.

>"The log forwarders are horrible to work with (be it logstash, filebeat – or worse: fluentd)."

Why are they horrible to work with? Why is fluentd worse than the other two?

> And now they are figuring out this log thing. Elasticsearch is great, but try running it yourself. On a 50-node k8s cluster, I bet your first ES config will go down within 10 minutes the first time FluentD comes up and sends every log under the sun since the cluster came up.

Put a buffer in front of it, pace the initial ingestion and Bob's your uncle. If that was the main problem operating an Elasticsearch cluster it would be all roses.

I like the E?K stack a lot and I admit that operating ES is not exactly easy, but the initial ingestion issue is the easy one to solve.
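To make "put a buffer in front of it, pace the initial ingestion" concrete, here is a minimal sketch, assuming a bounded in-memory queue that hands the indexer a fixed-size batch per tick. The names (`PacedBuffer`, `capacity`, `max_per_tick`) are invented for illustration; this is not how Fluentd or any Elasticsearch client is actually configured, and a real deployment would more likely use a durable queue.

```python
from collections import deque


class PacedBuffer:
    """Bounded buffer that releases at most `max_per_tick` entries per drain call."""

    def __init__(self, capacity, max_per_tick):
        self.queue = deque()
        self.capacity = capacity
        self.max_per_tick = max_per_tick
        self.dropped = 0

    def push(self, line):
        # When the buffer is full, count the drop instead of blocking the producer.
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(line)
        return True

    def drain(self):
        # Hand the indexer a bounded batch, however large the backlog is.
        batch = []
        while self.queue and len(batch) < self.max_per_tick:
            batch.append(self.queue.popleft())
        return batch


buf = PacedBuffer(capacity=1000, max_per_tick=100)
for i in range(1200):  # burst: every node ships its backlog at once
    buf.push(f"line {i}")

batches = []
while True:
    batch = buf.drain()
    if not batch:
        break
    batches.append(batch)
```

The indexer only ever sees batches of 100 lines, and the overflow is counted in `buf.dropped` rather than knocking the cluster over. Whether to drop, block, or spill to disk when the buffer fills is the real design decision.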

That's what I ended up doing, though. We no longer write directly to ES but go through Kafka.

But take a step back and think about how many components you have in the system just to make it work. Compare that with the Telegraf/InfluxDB/Grafana stack for metrics, which is seamless to operate.

That's my high hope for Loki. I hope Loki can eventually be just as easy to operate as InfluxDB.

I'd be interested in replacing an in-house logs filtering/aggregation solution with Loki, but there's a lot that is not clear to me from the initial announcement & materials. Could you please help me understand what are the answers to the following questions?

- Can I use Loki without Prometheus? I'd like to feed raw logs to it, with custom-generated metadata. I don't want to have to use Prometheus, nor InfluxDB.

- Can I edit (modify) metadata for some old log line after it was already inserted? Specifically, I need to be able to rebuild metadata later, if I add some new "filters" to my logs (I want to be able to apply them retroactively).

- Can I run aggregate queries on the metadata (sum, avg, min/max)? If not, what can I do with the metadata? Can I graph the metadata on normal Grafana graphs?

- Is the text of the logs compressed? If yes, what compression algorithm is used? If not, why?

- Where can I find the API docs (or at least the API source code) for Loki?

1. Yes, you can. We initially targeted Kubernetes because that's what we use and packaging for it is simple. We'll soon release packages for all distros, support syslog/journald etc.

2. Hmm, please open an issue regarding this with the use-case that prompted this? This is not a use-case we have but if this is important for more people, we'd be happy to support it!

3. You can only select based on the metadata, the metadata is just a set of kv pairs. Like {app=cassandra, namespace=prod, instance=cassandra-minion-0}

4. Yes, we used gzip. Please see the sheet referenced at the end of the design doc [0] to see what we compared with.

5. It's mainly protobufs right now [1] over HTTP, but we'll be adding more docs soon. Mind opening an issue for this so that we don't forget?

[0] https://docs.google.com/document/d/11tjK_lvp1-SVsFZjgOTr1vV3...

[1] https://github.com/grafana/loki/tree/master/pkg/logproto
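To illustrate answer 3, selecting on metadata can be pictured as matching a set of key-value pairs against each stream's labels. This is a toy model with invented data and a made-up `matches` helper, not Loki's implementation or query syntax.

```python
def matches(selector, labels):
    """True when every key-value pair in the selector is present in the stream's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical streams, each identified only by its label set.
streams = [
    {"app": "cassandra", "namespace": "prod", "instance": "cassandra-minion-0"},
    {"app": "cassandra", "namespace": "dev", "instance": "cassandra-minion-0"},
    {"app": "nginx", "namespace": "prod", "instance": "nginx-0"},
]

selected = [s for s in streams if matches({"app": "cassandra", "namespace": "prod"}, s)]
```

Only the first stream is selected. Nothing in this model can sum or average a label value, which is the point of the answer: labels pick streams, they are not data to aggregate.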

Hm, why is zstd marked as "BROKEN"? zstd would've been my first choice for something like this.

Author here! Really excited to release this at KubeCon; happy to answer any questions you might have.

I'm curious about this, because at Square we maintain our own homegrown log aggregation system, and it's not really a core competency.

While a lot of our logging needs seem like they would be fulfilled by this system — because we attach trace IDs to log messages, and because (at least in Payments) you can usually find the appropriate trace ID by searching for a Payment ID, which could be annotated too — there are definitely many times I've copy/pasted the text in quotes from a log-generating line of code in a Java or Go file, to find out if it's being executed, or as a handle into a subsection of code/logging.

In the linked design doc, you include a motivating tweet near the top, saying, “just give me log files and grep, I am dying”. But unless I'm misreading things, there's no `grep` here. Right?

I'm guessing you could narrow down (using metadata) and then grep, but if the narrowest metadata you have is app name and time range, you're still going to be grepping over a lot of data…

It’s definitely something that’s missing from the readme, and perhaps not that obvious in the grafana explore view either - but it is there! You can push a regexp match server side and have that distributed to each Loki node, giving you distributed grep.

Will make it more obvious. Davkals has an iteration of the UI that makes it a separate field, which will also help.

Maybe I'm not understanding this - the docs say Loki is all about storing compressed log data with metadata, such that only the metadata is indexed. Are you saying you can search the compressed, unindexed data using regex? If so, wouldn't that potentially be incredibly slow?

The good thing about this is that the grepping can be parallelised and distributed on to several nodes. Having said that, once you select the relevant metadata right, you should be able to narrow it down enough for the queries to be snappy enough.

While this will definitely be slower than something that indexes the contents, you'll be able to store much more in Loki at much lower costs.
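A rough sketch of the trade-off described here, under stated assumptions: chunks are gzip-compressed blobs tagged with labels, only the labels are used to narrow the search, and the remaining chunks are decompressed and regex-scanned in parallel. Threads stand in for separate nodes, and the chunk layout and function names are invented for illustration rather than taken from Loki.

```python
import gzip
import re
from concurrent.futures import ThreadPoolExecutor

# Toy "chunks": a label set plus gzip-compressed log text. Only the labels are indexed.
chunks = [
    ({"app": "api", "namespace": "prod"}, gzip.compress(b"GET /health 200\nPOST /pay 500 timeout\n")),
    ({"app": "api", "namespace": "prod"}, gzip.compress(b"POST /pay 200\nPOST /pay 500 refused\n")),
    ({"app": "db", "namespace": "prod"}, gzip.compress(b"checkpoint complete\n")),
]


def grep_chunk(blob, pattern):
    """Decompress one chunk and return the lines that match the regex."""
    text = gzip.decompress(blob).decode("utf-8")
    return [line for line in text.splitlines() if pattern.search(line)]


def query(selector, regex):
    pattern = re.compile(regex)
    # Step 1: the label index narrows the search space without touching log contents.
    selected = [blob for labels, blob in chunks
                if all(labels.get(k) == v for k, v in selector.items())]
    # Step 2: brute-force grep over what is left, fanned out to workers.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda blob: grep_chunk(blob, pattern), selected))
    return [line for lines in results for line in lines]


hits = query({"app": "api"}, r" 500 ")
```

The cost of step 2 grows with the bytes left after step 1, so a selector that only names an app over a long time range still means decompressing and scanning everything that app wrote.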

Yeah, I am thinking about the worst case here, but never underestimate the power of your users to perform very silly queries!

For your hosted service, will you put in place any restrictions on, for example, the size of the time range that can be queried?

Also for your hosted service, will the degree of parallelisation vary by pricing tier?

What are you using to run the regex? ripgrep could make up for some of the loss from not having it indexed.


Looks like the Go regex lib, which isn't super performant, so it could potentially be improved if it ends up being an issue.

There is some documentation about the Loki search syntax in the Grafana docs:


Please make it more obvious with exact examples of how to do this.

I was going to suggest you look at OKlog [0] but it seems that the project was recently archived and is no longer maintained. A pity because I really like its simple and minimalistic design. It does still work of course and the code is all there. Loki would do well to look to it for some inspiration. I particularly like the super simple install, and the lack of any need for some clunky UI to tail and grep your distributed logs.

[0] https://github.com/oklog/oklog

Agreed - Loki was heavily influenced by OKLog. We really like its ease of use and simplicity - hopefully we managed to get some of that right with Loki. OTOH we felt like we needed a little bit of metadata and index to help find the right logs - the brute force approach of only having grep doesn't seem like the right tradeoff to me. We use a minimal label index to narrow down the search space + then let you do distributed grep...

There is a lightweight CLI for Loki too, you don't have to use the Grafana UI: https://github.com/grafana/loki/blob/master/docs/logcli.md

That's great. With a lightweight CLI you can run this in development and just use the CLI. Then add Grafana to your production environment.

Thanks! Always love Peter's stuff. We actually have a perfectly viable working system internally. We're just (always) considering alternatives :-)

Looks awesome!

Three questions:

1. It sounded like it was easy to set key-value metadata in the promtail tool. Things like hostname, availability zone, etc.

But can you also append metadata via the log-files themselves?

Basically, our log files are JSONL (http://jsonlines.org/) and look like this:

    { "user-id" : "abc", "client-type" : "mobile", "etc" : "…", "messages" : ["Parsing incoming http request", "Saving user data, valid model", "Finishing up http request, sending response to client"] }
    { "user-id" : "def", "etc" : "you get the idea…" }
    { "third log line here" : "and so on" }

Can we ingest these into Loki and have the "user-id" metadata appended as key-value labels to each message?

2. Sometimes it makes sense to group related log lines. In the example above, we'll have ~20-30 log lines from a single http request. It would be convenient if we somehow could group them together, for example based on a unique X-Request-Id value. And then use that in the UI to see all related log lines together easily.

3. We currently store metadata about log-lines that is numeric. For example, things like request time. Will it be possible to query on that type of numerical value, to e.g. find all the log lines where requests took more than 500 ms?
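To pin down what question 1 is asking for, here is a small sketch that turns one JSONL record into (labels, message) pairs. The `explode` function and the choice of label keys are hypothetical; this is not a promtail feature, just an illustration of the desired result.

```python
import json

raw = ('{ "user-id" : "abc", "client-type" : "mobile", '
       '"messages" : ["Parsing incoming http request", "Saving user data, valid model"] }')


def explode(line, label_keys=("user-id", "client-type")):
    """Turn one JSONL record into (labels, message) pairs, one per message."""
    record = json.loads(line)
    labels = {key: record[key] for key in label_keys if key in record}
    return [(labels, message) for message in record.get("messages", [])]


entries = explode(raw)
```

Each message ends up carrying the same `user-id` and `client-type` labels, which also makes it easy to see how the number of distinct label sets would grow with the number of users.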

I think the problem with setting key-value metadata via the log output is that it permits unbounded cardinality.

The current design inherently limits cardinality to the number of pods you have running and the various labels applied to them.
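The cardinality concern can be made concrete with a back-of-the-envelope count. The label names and sizes below are made up, and the product is only a worst-case upper bound on distinct label combinations, not a measurement of any real system.

```python
import math

# Labels that come from the deployment itself: a small, bounded set of values.
pod_labels = {
    "app": ["api", "worker"],
    "namespace": ["prod", "staging"],
}
# Hypothetical: a per-user label extracted from log content.
user_ids = [f"user-{n}" for n in range(10_000)]


def worst_case_streams(label_values):
    """Upper bound on distinct label combinations the index would have to track."""
    return math.prod(len(values) for values in label_values.values())


bounded = worst_case_streams(pod_labels)
with_user_id = worst_case_streams({**pod_labels, "user-id": user_ids})
```

Here `bounded` is 4 while `with_user_id` is 40,000, and the second number keeps growing with every new user.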

For 3 I'd say that doesn't make sense with this design. I'd suggest taking a look at something like scalyr.com. You can configure a log parser and then your logs become both queryable, and you can create time-series on numeric fields such as request time and look at the 99th percentile, min, max, etc.

I clicked through and read this on the grafana page:

> Loki is meant to be complementary to existing solutions like Elasticsearch and Splunk that do full text indexing

Can you elaborate a bit on this please? If I'm already using Elasticsearch, Splunk or the like, why would I want to add on another, less powerful logging service? (not trying to be a dick, genuinely want to understand why I'd want this!)

Simple answer: Cost.

When debugging you'd want as much info as possible and you'd want to be able to simply tail + grep it. I've been told to log less because the amount I was logging would burn a hole in the pocket when deployed to production. Sometimes, at scale people only send WARN (maybe even only ERROR) and above in production which is sometimes not enough when trying to debug a system on fire.

While ELK does a great job of indexing the contents, and if you depend on it for BI, you should definitely still keep it for use-cases where just select+grep won't suffice. But for just storing logs, and being able to select, stream and grep laaarge quantities of logs, Loki will come in handy.

So for example, you might send only ERROR level to Splunk, and everything to Loki?

Or maybe you'd send everything to Splunk, but with a small retention period, whereas you'd send everything to Loki but with a 90 day retention period (or whatever)?

Few more questions:

1. Batching logs will allow better compression ratios and bigger blobs (which means lower per-operation costs), but must be balanced with the risk of data loss - what is your strategy here?

2. Will this handle multi-line logs? Say a regex matches part of a multi-line log, will it then return all the lines for that log?

3. Can you add your own labels, or are you limited only to those assigned by Loki?

4. If you are limited to labels assigned by Loki, how will you handle labelling as you expand out of k8s and accept logs from other sources (e.g. syslog)?

Thought of another one - logs really have 2 timestamps associated with them: the time the log was generated, and the time the log was ingested by Loki.

Will Loki be able to parse the log generation time out of logs (where it's included, and it usually is), or will it only use the log ingestion time for time range searches?

What if I have a service that logs quite verbosely and shows anomalous behaviour over say 10 minutes? I assume I'd have to scan all those log messages sequentially as they're not indexed at all, right? Will Grafana offer a UI for doing so or would I have to dump the log segment and grep it?

Loki is designed to complement Prometheus; as such we envisage you using Prometheus metrics to isolate the service and time range exhibiting the anomalous behaviour (by looking at latency and error metrics, for instance) and then "pivot" to Loki to see the logs. As Loki uses the same label metadata as Prometheus, that pivot is automatic and almost "magic" - showing you the relevant logs for a given PromQL query.

Grafana is offering a logging UI for Loki in the upcoming v6 release called Explore; you can enable it on the master builds right now (see https://grafana.com/blog/2018/09/21/grafanas-explore-ui-taki...). It makes it super easy to start exploring and sifting through your logs.

Also, Loki allows you to push regexp matches server side, so you can distribute the "grep" among multiple machines for extra points :-)

Thanks, makes sense. Seems like a great idea, kudos!

Hi! Is there a way to plug it into alertmanager and trigger alerts if a specific message is found in the monitored log? E.g. a corruption message in a PostgreSQL instance's log.

Good luck today, will be watching from my office ;)

Looks like a good light-weight alternative for setting up a full elasticsearch cluster for some light log insights.

I've been kind of building something similar myself, but this looks much better.

The free cloud demo does look like it is getting hammered right now ..

>Looks like a good light-weight alternative for setting up a full elasticsearch cluster for some light log insights.

Exactly! Elastic is a really powerful system but I think there is growing sentiment that it's overkill for container logs. This is exactly where we see Loki really helping - almost complementing Elastic even.

> I've been kind of building something similar myself, but this looks much better.

:blush: thanks! We'd love your input on Loki too...

> The free cloud demo does look like it is getting hammered right now ..

Yeah, looks like I'm going to be scaling that all day... Our motivation for offering the free service for the next few months is to really battle-harden the system - anyone sending us data is really helping us iron out the kinks and improve the open source. It's early days, but expect it to get much better over the next weeks and months..

> Elastic is a really powerful system but I think there is growing sentiment that it's overkill for container logs

I don't know much about kubernetes, but Loki looks super interesting for our application logs too.

Could someone maybe ELI5 what the difference is between "container logs" and, well, any other kind of logs? Don't most dockerized applications send their stdout to Docker, and aren't those, then, the container logs? What kind of logs _aren't_ "container logs" and therefore a better fit for ElasticSearch than Loki?

Thanks :-)

I work with Tom on Loki. When we said we wanted to focus on Kubernetes, we meant we wanted to correlate metrics with logs and focusing on Kubernetes lets us do that well.

Having said that, this will work for any logs as long as you can tag the logs meaningfully. We'll soon be releasing packages for all major distros and journald.

So this is something like mtail [1], with the ability to also drill down into the raw logs, and integrated into Grafana? Sounds great!

We're heavy users of Grafana, Graphite, Prometheus, and Elasticsearch, and while the latter started in a BI role it's expanded to take on pretty much anything we can throw at it. However, there's still tons of system and service logs we're not gathering yet, because the effort to get them to Elasticsearch and store/maintain them is not worth it, especially since the value is not always clear until well after the pipeline is set up.

I'm definitely excited to see what loki can do for us, and just upgraded one of our test Grafanas to nightly to start playing with Explore, looks great!

[1] https://github.com/google/mtail

Grafana Explore author here. It's still in Beta and would love feedback. Simply open an issue at [0].

What we tried to get right is the seamless switch from Prometheus to Loki, retaining the labels of the query to essentially find the logs that come from the same, e.g., "job". The assumption is that your relabelling rules are consistent between Prometheus and Loki.

[0] https://github.com/grafana/grafana/issues/new

Exactly! In fact, we're moving to use some of mtail's libraries and functionality [0]. The main difference is that we do centralised, scalable persistence of the logs themselves, whereas IIUC mtail just processes them and exports metrics.

Thanks for the feedback, and feel free to reach out with any questions.

[0] https://github.com/grafana/loki/pull/51

Serious question. Why not just put logs in postgres? Rich query language. Indexes. Has support for time-series based indexes. Optionally can support full text search if needed.

Apart from the query support, which others have mentioned, there's the sheer scale of data. The most recent time series implementation I worked on had to ingest terabytes per minute from a wide variety of sources in a single factory... And the goal was a single system combining every factory worldwide. Easily measured in exabytes per hour, with thousands of sources. A traditional relational database - even a really good one like postgres - just isn't built with that kind of use case in mind. There really is something to a database specifically designed for a particular kind of query on ridiculous quantities of data.

According to my napkin math that's more than the LHC's raw output, do they really need all that for factories? Seems nuts.

Need? Probably not. But often you don't know what metrics you care about until something goes wrong - so you over-engineer it and log all the things.

There are things like TimescaleDB[0] that build on PG to give you columnar support. Time series DBs need to handle a ton of entries and be able to slice on a column, which is supported by vanilla PG[1], but most of the security tools are largely not looking at time-series PG.

[0] https://www.timescale.com/

[1] https://blog.dbi-services.com/optimized-row-columnar-orc-for...

Timescale isn't really a columnar database, it's more like an advanced partitioning extension for time-series data ("any data you want to shard based on a time column") where you can optionally include other partition keys for the sharding. But it can be used very well for analytic cases like this thanks to that.

Real columnar databases like MemSQL or Clickhouse are a different beast -- for example they give very good column-wise compression in the dataset, which can save dramatic amounts of space. They're also very good for use cases like this, since they're heavily optimized for OLAP style workloads.

There is also cstore_fdw which does offer columnar, compressed storage for PostgreSQL as a foreign table, but it won't hold a candle to something like MemSQL or Clickhouse in terms of raw performance. Maybe one day.

Ultimately it's not about columnar storage or partitioning support, though, it's about the data and the queries you want to run on it, in what amount of time. Timescale can do pretty good for a lot of cases like this I bet, and I'm investigating it myself for a project.

Thank you for clarifying. Obviously I only have a shallow understanding of what the hell I'm talking about. :)

I have been evaluating TimescaleDB and my company currently uses Splunk, ES, and Prometheus. I'm going to be giving Loki a go this weekend for our k8s cluster.

Your explanation here really helped clarify some things I've only explored a bit.

I have been wondering the same thing. But, I spent the last hour looking for something like Kibana that would work with Postgres.

Postgres full text search, and Timescale extension would be ideal for logging kind of application. I could just set data retention to something like 2 months to keep the data volume simple and manageable.

But, all "data explorer" like tools for SQL databases obviously focus on having their end users write SQL. I'm hard pressed to find a good search interface that talks to Postgres at the backend..

We store our logs in PostgreSQL and use Metabase for visualisations & analysis: https://metabase.com/

It's still focused on SQL but the "Question" flow is quite simple, and doesn't require you to write SQL. You can, of course, write raw SQL if you want to and have those results still visualised.

I am playing with Metabase trying to see if there is any way for me to get a "search" on a column of table. It feels like I'm looking for a different use case.

Metabase seems excellent for building a dash board composed of several SQL based questions, and to generally have a searchable list of questions(which are internally SQL queries). But, searching for logs seems to not be one of its use cases...

It's weird, but even at a small company you can easily generate something like 50GB a day :(. Some logs are very verbose, and developers want to put a lot more context into them. JSON logs are super verbose and take up more space. And people expect to query arbitrary fields out of them.

Last time I used it (2 years ago), Postgres wasn't able to keep up. Though I just used default RDS and didn't do anything to tune it.

I believe you can.

I asked myself that question over the past few months and ended up building a prototype logging pipeline using Postgres/TimescaleDB.

I wrote about it in a blog post: https://www.komu.engineer/blogs/timescaledb/timescaledb-for-...

That could work for some people, but when you're ingesting 10+ GB / day (which is not difficult, even just from access logs) and need to keep them for several years, it quickly becomes infeasible.

Hmmm curious how much of a difference we're talking about between:

Logging in plain text - syslog

Logging to ELK(+indexes)

Logging to a DB like Postgres

Systemd's binary logs

This Grafana Loki

10+ GB/day is big to deal with, as you say. But I'm wondering whether it would be around the same regardless of your choice above..

With all of the choices though, we can always set staging rules and stage old data out to archives.

Personally, I'm struggling to find an equivalent search UI for Postgres.

We generate 1T of logs per day. Doing this with PostgreSQL would be hilariously bad.

We have way too many technologies in our industry with names borrowed from culture.

Before clicking to look at the comments, I totally thought this was about some interesting creation mythos with a character that taught humanity how to harness wood from nature to build the first wood dwellings.

I mean, that's totally something that could feasibly get posted to HN and do well.

There are only two hard problems in computer science: cache invalidation and naming things.

> There are only two hard problems in computer science: cache invalidation, naming things, and off by one errors.

That's how I learned the phrase.

And Loki has all three ;-)

> There are only two hard problems in computer science: cache invalidation and naming things.

There are only two hard problems in computer science: cache invalidation, naming things and off-by-one errors.

Alternate between vowels and consonants, look for the resulting string in various search engines, the first one that gets no results is the new name.

Keep track of all items (like "main page", "search results for $blah") that component items (like "body of post 234", "avatar of user 3540") are used in, and when a component item changes, invalidate its cache and that of all items it was used in.
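The invalidation scheme described above can be sketched as a reverse index from each component to the items built from it. This is a minimal illustration with invented names (`DependencyCache`, `component_changed`); it handles one level of dependency only and ignores concurrency.

```python
from collections import defaultdict


class DependencyCache:
    def __init__(self):
        self.cache = {}
        self.used_in = defaultdict(set)  # component -> items rendered from it

    def store(self, item, value, components=()):
        self.cache[item] = value
        for component in components:
            self.used_in[component].add(item)

    def component_changed(self, component):
        # Invalidate the component's own entry and every item that was built from it.
        self.cache.pop(component, None)
        for item in self.used_in.pop(component, set()):
            self.cache.pop(item, None)


cache = DependencyCache()
cache.store("body of post 234", "<p>post</p>")
cache.store("avatar of user 3540", "<img>")
cache.store("main page", "<html>main</html>",
            components=["body of post 234", "avatar of user 3540"])
cache.store("search results for blah", "<html>results</html>",
            components=["body of post 234"])

cache.component_changed("avatar of user 3540")
remaining = sorted(cache.cache)
```

After the avatar changes, only the post body and the search results stay cached. The hard part in practice is making sure every `store` call lists all of its components, which is exactly where the quote gets its bite.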

> Alternate between vowels and consonants, look for the resulting string in various search engines, the first one that gets no results is the new name.

I'm pretty sure the quote was originally talking about function, variable and data structure naming, where Google may or may not be of help. Really, I've always found that naming problems there have more to do with correctly predicting the future, since the most egregious problems I encounter are when something that used to be named at least passably has changed over time to be very confusingly, if not outright misleadingly, named.

If you don't know what to name a variable, name it something like _whatisthisreplacemelater_ so if you ever come up with a good name, you can easily replace it. What could go wrong, right? Just an easy fix for later today..

We built a system similar to this at Datastreamer to store the logs from our web crawl.

We used Grafana + KairosDB and turned the logs into tags essentially.

Our entire production system had easy 'taps' that you could annotate to monitor everything about our crawler.

For example, number of HTTP requests, their status codes, the language of the content.

We also record intersections of the tags like lang+domain.

The downside of a system like this is that you have to know all your metrics a priori... If not and you need them at runtime you're out of luck.

The UPSIDE is that you use like 1/100th of the total size you would originally need for raw logs.

What we found is that you quickly converge on the tags you need and then you don't end up adding many more.

Everyone says 'disk' is cheap but in our situation our logs outpace the amount of data we would collect. We'd have 1000s of petabytes of logs by now.

Is it just me or can no one else find how to actually send the log data to Loki either?

I don't believe that you have 1000s of petabytes by now. Unless you are Amazon AWS or Google or IBM.

From the design doc:

> We will be able to pass on these savings and offer the system at a price a few orders of magnitude lower than competitors. For example, GCS cost $0.026/GB/month, whereas Loggly costs ~$100/GB/month[0]

Maybe I'm misunderstanding, but I don't see how Loggly costs anything like $100/GB/month? It seems clear to me that the $99 plan costs $100/GB/day (or $100/30GB/month). Not exactly cheap, but nowhere near as expensive as the design doc reckons.

[0] https://www.loggly.com/plans-and-pricing/
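The arithmetic behind this objection, using only the figures quoted in the thread (the GCS price from the design doc, and the roughly $100 plan read as 1 GB/day over a 30-day month). This is a rough comparison of list prices, not a like-for-like cost model, since one is raw storage and the other a hosted service with its own retention.

```python
gcs_per_gb_month = 0.026       # figure quoted from the design doc
loggly_plan_per_month = 100.0  # the roughly $100 plan, as read in the comment above
plan_gb_per_month = 30         # 1 GB/day over a 30-day month

# Price per GB ingested under the corrected reading.
loggly_per_gb = loggly_plan_per_month / plan_gb_per_month

# How many times more expensive than GCS, as written in the doc vs corrected.
ratio_as_written = 100.0 / gcs_per_gb_month
ratio_corrected = loggly_per_gb / gcs_per_gb_month
```

Under the corrected reading the price is about $3.33 per GB, so the gap to GCS shrinks from roughly 3,800x to roughly 130x: about two orders of magnitude instead of more than three.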

Yes, you're correct. We'll correct the mistake in the doc.

This premise was very far off, and cost effectiveness is key to the design rationale - do you think this impacts upon the utility and attractiveness of Loki?

Yes, we have an internal doc where we compared with more providers (including loggly) where we got the pricing right. The line currently mentioned in the doc is a typo.

The rationale still holds and one of the primary reasons we built Loki is to have an easy-to-manage, scalable "open-source" solution.

I personally have been told to log less at a previous company because of the associated costs of logs and I don't think I utilised anything beyond grep (with a few exceptions). I personally feel the trade-offs are right and a simple greppable/tailable solution that is cheap is missing from the eco-system.

A good response, thanks for confirming that you've built this with the correct pricing in mind.

Where do people recommend logging infrastructure changes (e.g. database changes, server configuration) for all things potentially not in code?

Whether something is "in code" or not is orthogonal to the problem of tracking infrastructure changes. E.G. You can commit a change to Terraform, but that doesn't tell someone when "terraform apply" occurred.

What you should use depends on what tools you have available. We currently use Datadog, so we use their API for publishing events. We can then search for them in the event stream or overlay them on dashboards.

You could look at the datastores Grafana supports for annotations and choose whichever of them you're comfortable operating.


We need more context. Server configuration should definitely be in code/source control these days. Database changes, well, what kind of changes are you talking about?

These days, as we have multiple departments and services, it’s hard to track down any issues we identify without knowing what kind of changes have happened to potentially cause them. As we don’t have a single development platform (multiple acquisitions), this is hard for us, at least until we have unified systems (which will take time). I’m referring to any infrastructure or operational change to the stack. For example, deployments/rollouts, scaling servers up or down, port changes, updates to software on key servers, changes to database table structure, manual changes to database data. Any changes at all to the production stack.

You might want to take a look at Architectural Decision Records in addition to Change logs. A lightweight format for tracking changes to infra that can be checked into source control.

One example - https://github.com/joelparkerhenderson/architecture_decision...

Choose a documentation source-of-truth (Confluence, etc) and keep a changelog[1]. I've also seen a dedicated email group (e.g. change-mngt@company.com) used for tracking non-code changes.

[1]: https://keepachangelog.com/en/1.0.0/

Terraform checked into version control is pretty common

It's all in code now... At least how I do it.

If you're just tailing logs on the server (or with kubectl) but need a bit more power, angle-grinder [1] is pretty good (and of course doesn't need to be hosted or anything, it's just a CLI app)

[1] https://github.com/rcoh/angle-grinder

Is this the same technology kausal.co guys had? It’s a shame it was just shut down after the acquisition.

This is a continuation of our ideas from Kausal, yes! Really glad we joined Grafana as they have given us the time and resources to pursue this.

And not shutdown at all! Everything we did at Kausal is now part of Grafana and Grafana Cloud - David's PromQL-completion UI is in Grafana v6 as the explore view, and Cortex is the backend for Grafana Cloud's hosted Prometheus...

Thanks! What about the distributed tracing feature?

It's coming :-)

Are you talking about the Cortex Prometheus service? They're working on transferring that to Grafana branding.

Is this logging already available on Grafana Cloud? Because someone I know is looking for a hosted logging solution and was looking at ELK till now.

Do you have integration with Docker? Especially Logspout?

> Is this logging already available on Grafana Cloud? Because someone I know is looking for a hosted logging solution and was looking at ELK till now.

There is a free preview right now: http://grafana.com/loki We'll be introducing a paid tier in the next few months, once things are more stable.

> Do you have integration with Docker? Especially Logspout?

Not yet! That sounds like a great idea though - mind filing an issue in the repo for it so I don't forget?

if this is like splunk that doesn't make me cry unicorn blood tears every time an invoice arrives then i'm all in. unfortunately it doesn't look like it and honestly i'm not sure what it is. 'prometheus for logs' doesn't quite explain it.

Hi please see the design doc for more inspiration as to why we built it: https://docs.google.com/document/d/11tjK_lvp1-SVsFZjgOTr1vV3...

We're soon coming out with a blog post explaining the motivations and architecture in detail. But yes, the pricing is one of the motivations to build this, and as logs will be stored in S3 or another object store, the cost will be several orders of magnitude less.

Grafana Explore author here: just posted my slides from the announcement talk [1] that hopefully clarify a bit what it is. Re Splunk, it depends what features you rely on. Loki is a no-frills log aggregation system. Would be great to hear what features you need. Feel free to write to me at david at grafana or open an issue.

[1] https://speakerdeck.com/davkal/on-the-path-to-full-observabi...

For Grafana folks: I got all the way through sign up and sending logs from k8s cluster + setting up logging data source to Loki hosted instance, but could not find a way to actually explore the logs in the hosted grafana instance :)

I figured out that you have to click "Explore" next and then see the Logging tab, although it was not easy for me to find. Maybe highlight it as a first-class link on the left panel?

Looks cool, is the source code for promtail available? I can see there's an image available https://hub.docker.com/r/grafana/promtail (which is used in the readme) but I can't find the source anywhere. Anyone know where it is?

Of course! Everything is open source, all Apache-licensed:


Cool, thanks. No idea how I failed to find it...

Thanks! I swear I searched the repo and didn't get any useful results, but yeah it's there plain as day...

This looks similar to Graylog (https://www.graylog.org/)?

From Loki's readme, it still stores the full log (but compressed), and indexes values in each log line.

Hi, we don't index the values, but rather we index metadata about the logstream, for example, service name, instance ip, and things like that and not the actual contents of the log lines.

Inside Kubernetes the metadata would be pod name, namespace, deployment name, container name, etc...

> indexes values in each logline.

We batch together sets of lines into compressed "chunks", and only add index entries per chunk. In practice, chunks have 1000s of lines in them - so the index tends to be many orders of magnitude smaller than in other systems.
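To make the chunk idea concrete, here is a toy sketch in Python. It is not Loki's implementation: `ChunkStore`, the zlib compression, exact-match label lookup, and the chunk size are all assumptions chosen for illustration. It also assumes single-line entries and non-decreasing timestamps per stream.

```python
import zlib
from collections import defaultdict


class ChunkStore:
    """Toy store: one index entry per compressed chunk, not per log line."""

    def __init__(self, chunk_size=1000):
        self.chunk_size = chunk_size
        self.buffers = defaultdict(list)  # stream labels -> pending (ts, line)
        self.chunks = []                  # compressed blobs
        self.index = defaultdict(list)    # stream labels -> (first_ts, last_ts, chunk_id)

    @staticmethod
    def _key(labels):
        return tuple(sorted(labels.items()))

    def append(self, labels, ts, line):
        key = self._key(labels)
        self.buffers[key].append((ts, line))
        if len(self.buffers[key]) >= self.chunk_size:
            self.flush(key)

    def flush(self, key):
        buf = self.buffers.pop(key, [])
        if not buf:
            return
        text = "\n".join(f"{ts}\t{line}" for ts, line in buf)
        self.chunks.append(zlib.compress(text.encode("utf-8")))
        self.index[key].append((buf[0][0], buf[-1][0], len(self.chunks) - 1))

    def query(self, labels, start, end, needle=""):
        """Pick chunks by labels and time range, then grep the decompressed lines."""
        key = self._key(labels)
        self.flush(key)  # make pending lines searchable
        hits = []
        for first, last, chunk_id in self.index.get(key, []):
            if last < start or first > end:
                continue  # chunk lies entirely outside the time range
            rows = zlib.decompress(self.chunks[chunk_id]).decode("utf-8").split("\n")
            for row in rows:
                ts, line = row.split("\t", 1)
                if start <= int(ts) <= end and needle in line:
                    hits.append((int(ts), line))
        return hits


store = ChunkStore(chunk_size=1000)
labels = {"namespace": "prod", "pod": "api-0"}
for i in range(2500):
    status = "500" if i % 1000 == 0 else "200"
    store.append(labels, i, f"request {i} status={status}")

hits = store.query(labels, 0, 2499, needle="status=500")
entries = store.index[ChunkStore._key(labels)]
print(len(entries), "index entries for 2500 lines;", len(hits), "matches")
```

The index grows with the number of chunks rather than the number of lines, and the content search is a brute-force scan over only the chunks the index selects.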

What is the purpose of log aggregation without indexing?

That's a really good question. I feel like it's pretty well covered in the design doc[0], and you should also read about OKlog[1], which really championed this idea.

To be clear: Loki indexes metadata about the streams, and the streams themselves are indexed by time. We don't full-text index the streams though. We think, when combined with metrics, this represents a nice trade off in terms of complexity vs features. This simplification not only makes Loki easier to understand, but also scale and operate.

[0] https://docs.google.com/document/d/11tjK_lvp1-SVsFZjgOTr1vV3... [1] https://github.com/oklog/oklog

Awesome, can't wait to dig into this. Nice work.

I ran the curl ... | kubectl command on my cluster.

What would be the reason that promtail is not tailing log files for all of the running pods?

Does this integrate with Prometheus's Alertmanager then for sending alerts based on certain logging events?

Is there a way to send logs to Loki without using the tail agent? For example, a REST API for storing messages?

There is a REST API, yes [0]. Currently you send snappy-compressed protobufs[1] containing your logs to an HTTP endpoint.

We initially started using fluentd as the agent, but we found its metadata "enrichment" facilities weren't reliable enough - we'd get log lines without the pod tags, for instance. For something like Loki, which depends really heavily on the metadata for its index, this was super important. So we wrote promtail.

[0] https://github.com/grafana/loki/blob/master/docs/api.md [1] https://github.com/grafana/loki/blob/master/pkg/logproto/log...

Awesome, thanks!

Hi, we're currently only exposing an API that we wrote. But we have plans to integrate with other agents such as fluentd for log collection. If the use-case has enough support, we might even be supporting a REST API.

Why not a unixish simple tool to handle metrics? assuming $prog outputs log info to stdout: $prog | collect | action and 'collect | action' could all be done somewhat simply with awk and 1000x more portable.

And store it where? And query it how? What will render my NOC dashboards?

With the Graphite protocol the metric stream is already plain text with just 3 fields per line (path value time); it's pumped over a socket to a collector... but that's where the hard stuff starts.
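For reference, the three-field plaintext line looks like this in a minimal Python sketch. Port 2003 is Carbon's default plaintext port; the host, the metric path, and the helper names `format_metric` and `send_metrics` are placeholders.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one Graphite plaintext line: '<path> <value> <epoch seconds>\\n'."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {timestamp}\n"


def send_metrics(lines, host="localhost", port=2003):
    """Write already-formatted lines to a Carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


line = format_metric("servers.web01.requests", 42, 1544572800)
print(line, end="")
```

Emitting the line is the easy part, as the comment says; storage, querying and dashboards all sit behind the collector.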

First I'm not trashing the project, I'm just wondering why simple unix like solutions aren't used.

Store it wherever you want; this isn't a magical datastore that makes things faster. Use clickhouse-client, whatever, it doesn't matter.

There is a widening disconnect between the unix way and how new projects are created.

This is not metrics, this is logs.

Logs contain a lot of detail that is not appropriate for metrics. Log output tends to be expensive: it costs a lot of time and IO to emit, compress, send, and store logs. Logs also tend to contain PII: usernames, IP addresses, trace IDs, etc. We don't want that kind of cardinality in metrics.

Metrics are another thing. They tend to be internal counters inside a single process. With multi-threaded languages like Go, Java, C++, we can track metric data in memory very efficiently.

For metrics, see https://openmetrics.io/

If you read my earlier posts, logs vs. metrics doesn't matter in this context, which is why I said to stream to clickhouse-client...

From what I can tell you're talking about processing logs to derive metrics, but this project is specifically about storing the raw logs themselves. Two different (albeit potentially related) problems.

What you are proposing for deriving metrics from logs exists in plenty of forms. Check out https://github.com/google/mtail , for example. Of course, you still have to store them somewhere, visualize them somehow, etc. That won't simply be part of your shell pipeline.
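As a rough illustration of what "deriving metrics from logs" means, here is a Python sketch that counts HTTP status codes in access-log-style lines. It is not mtail's own program syntax; the regex, the sample lines, and `count_statuses` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: a 3-digit status code right after the quoted request.
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_statuses(lines):
    """Turn raw log lines into a counter keyed by status code."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    '10.0.0.1 - - "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - "GET /missing HTTP/1.1" 404 0',
    '10.0.0.1 - - "POST /login HTTP/1.1" 200 64',
]
counts = count_statuses(sample)
print(dict(counts))
```

The counters still have to be exported, stored and graphed somewhere, which is the part a shell pipeline does not cover.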

because some people want prebuilt solutions rather than messing with command-line scripts for everything.

So people want non unixish tools on unix, seems like a road we keep going down over and over and eventually goes back to the unix way. Also piping isn't 'scripting'

But I just set up Elasticsearch and Kibana yesterday!

A few years ago, I set up centralized logging with Elasticsearch and Kibana on a site with very little traffic (less than 100 requests per minute). After a few months, it apparently fell behind on indexing and worked the logging server's hard drives 24x7. After a few more months, the server died. (It refused to power up and I wasn't interested in going deeper.) I freely admit that I don't know how to configure Elasticsearch correctly, but I feel like it should have been able to handle the indexing load in its default configuration.

I think I'll try out this new logging solution. I only want basic indexing.

I try to stay away from long-lived processes written in Java. Maybe it's just me, but I think they consume insane amounts of memory.

No you didn't. You brought an ELK cluster up. Now the fun begins.

Unless you have a small load, six months from now you'll say "I've spent the last 6 months setting up and tuning an ELK cluster. It looks like it might be ready for production workloads now, but let's make it a beta for the time being – I don't want to see a crash like we had last time..."

Yeh, the trick with ELK is to spin it up on fewer, but beefier physical servers. We've had great success with this, compared to our first attempts, after failing by following internet memes about running ELK across lots of small virtuals....

Very interesting, will check it out!

As immature as I am, my first thought was that the logo looks like a dick to me. :|

When I used to do branding and would be training new designers, I would tell them the first rule of logo design is "Always Be Checking to Make Sure Logo is not a Penis". For some reason a logo that is slightly vaginal is mostly OK though.

There is another school of thought that is "Always Be Sneaking Dicks Into Your Art". I have some respect for people in the second camp.

Anyone wondering what the logo looked like before being updated, see here: https://github.com/grafana/loki/blob/a5d1c56ce8755136c5dde02...


And what'd be wrong with that?

You cannot unsee it lol

But on a serious note the developers should indeed look into this. I recall this comment from HN [0] which highlights why neutral branding and naming are important.

0: https://news.ycombinator.com/item?id=14702513

I could never unsee it in this old easymock logo https://www.javacodegeeks.com/wp-content/uploads/2012/10/eas.... They have since changed it to something much more neutral, thankfully!

I can't see it. In my world, penises are a little bit bigger than those proportions... might as well see a heart in the logo. I actually like the logo, as it looks like two things overlapping each other which can be moved to be aligned.

That's fair; the logo was a last-minute thing, we'll work on a new one...

Many people seem to agree with you so they should probably change it even though I must admit that it's not that obvious to me.

That being said I'm also not really sure what it's supposed to look like anyway (is it an L?) so it's probably worth changing for that as well.

Same here. Looks to me like it can even indicate 2 different states.

This was the first thought in my office when we looked at Loki today.

Immature or not, there are at least a reasonable number of people who see just that when they look at the logo.

At the end of the day, it's all immaterial: they change it, or they don't. I'm not sure it really matters.

If I'm not mistaken, some designers even go to lengths to make their mundane logos more memorable.

It's not likely I'll be forgetting the name of Loki any time soon so their branding probably (?) succeeded?

Agree, that is an unfortunate logo.

They are either going to have to take it off the readme, shrink it, or likely, redesign. Maybe just rotate it 90-120 degrees clockwise.

lol at shrink it

We've put a new logo up now - what do you think?

It is definitely less phallic now that you moved to straight lines instead of curves.

That was quick! Thank God I can concentrate on the content now ;).

My first impression was "Airbnb has toppled over", but yeah, it kind of does...

It was the very first thing that popped in to my head. I'm glad they changed it.

Now that they've changed it to straight lines it looks like a cigarette to me.

Well, he is the God of Mischief (in the Marvelverse anyway).

Hah, that's actually worth pointing out. I personally don't see it, but I'm sure other people would. If anything - and continuing the immature thread - it looks like dickbutt to me.

can it just replace ELK with a single golang binary? big if it does

If all you use ELK for is simple process/container logs then yes! That’s the whole idea.

But Loki is not a replacement for Elastic - we don’t have any complex query support, and we don’t do full text search. Elastic is great for analytics or BI, but we think Loki is the way forward for your container/pod logs.


its just nvidia was over-valued and the market is correcting itself now alongside all the other tech stocks getting smashed, but I think investors should double down on nvidia now while the getting is good. The company still has the best gpu engineers on the planet $$$.

Grafana, are you sure you want to name a piece of software after a guy famous for lying, betrayal, and other forms of unreliability? Besides, I already know two cats, a dog, two children, and a Saturn hatchback named Loki. The Saturn is named appropriately, by the way; let's hope your software isn't.

Loki was reliable in that he was chaos incarnate. To misappropriate another bit of pop culture, "you can always trust a dishonest man to be dishonest."

I'm just a really big Marvel fan though - he did redeem himself in the last film, no?

In the actual Norse myths, he is a rather bad guy and in the end fights against the gods when Ragnarök (armageddon) comes.

I was thinking it would be funny to call it Logki.
