
I disagree with the premise of your statement. It's typical that a log will be accessed zero times. Collecting, aggregating, and indexing logs is usually a mistake made by people who aren't clear on the use case for the logs.

Absolutely, the vast majority (95%+) of logs are never read by a human. Therefore, processing them is enormously wasteful. A good architecture will write once and not touch anything until it is needed.

I spent years working on a system handling 50+ PB/day of logs. No database or ELK stack can handle that, and even if one could, it would be prohibitively expensive.

Where did you work? CERN?

It's adorable when people think scientific computing has the same scale as a Google or Microsoft.

Ignoring the fantasy b.s. in the second half of the article, the stuff at the top is exactly what I mean.

A mighty 400 GB/s: i.e. much less than the > 50 PB/day of logs the other person mentioned;

1600 hours of SD video per second: i.e. about 1-2 million concurrent HD streams, or much less than the amount actually served by YouTube.

IBM Summit "world's most powerful supercomputer": < 5000 nodes, i.e. much below the median cell size described in the 2015 Borg paper. Summit is a respectable computer but it would get lost in a corner of a FAANG datacenter.
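For anyone who wants to check the video comparison above, here is a rough sketch of the arithmetic. The 3x to 5x HD-to-SD bitrate ratio is my own assumption, not a figure from the article or the thread; change it and the HD range moves accordingly.

```python
# Back-of-the-envelope check of the "1600 hours of SD video per second" figure.
SECONDS_PER_HOUR = 3600


def concurrent_streams(hours_of_video_per_second: float) -> float:
    """Seconds of video carried per wall-clock second.

    This equals the number of real-time streams at that bitrate.
    """
    return hours_of_video_per_second * SECONDS_PER_HOUR


sd_streams = concurrent_streams(1600)

# ASSUMPTION (not from the thread): one HD stream costs 3x to 5x the
# bitrate of one SD stream.
HD_TO_SD_RATIO_LOW = 3
HD_TO_SD_RATIO_HIGH = 5

hd_streams_high = sd_streams / HD_TO_SD_RATIO_LOW
hd_streams_low = sd_streams / HD_TO_SD_RATIO_HIGH

print(f"SD-equivalent streams: {sd_streams:,.0f}")
print(f"HD-equivalent streams: {hd_streams_low:,.0f} to {hd_streams_high:,.0f}")
```

Under that assumed ratio the result lands inside the "1-2 million concurrent HD streams" range quoted above.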

CERN is a correct example. The LHC reportedly generates 1PB per second: https://home.cern/news/news/computing/cern-data-centre-passe...

If you define “generates” to mean “discards” then yes.

It still gets processed, though, and only the uninteresting events get discarded.

Otherwise the tape alone to store it on would exceed their total operating budget in a day, so they have to be a bit clever about it.

I think that even Google cannot save 1 PB/s in 2020.

The numbers are not fantasy at all. This will be a huge radio telescope: one square kilometer of pure collecting area and thousands of receiving antennas (for reference, Arecibo has around 0.073 km^2). We are talking about data input to the correlator on the terabit/s scale. Technology demonstrations with ASKAP are well under way, and ALMA is working quite well by now too (> 600 Gb/s with just a 50-antenna array).

It’s adorable how proud you are to have worked at a FAANG and how angry you get at the idea that some other organisation handles equivalent scale.


400 GB/s is about 35 PB/day, so not quite as big a difference.
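A quick sketch of that conversion, assuming decimal units (1 PB = 1,000,000 GB) and a sustained rate over a full day. The function and variable names are only illustrative.

```python
# Unit conversion for the two throughput figures quoted in the thread.
# ASSUMPTION: decimal units, 1 PB = 1,000,000 GB, and a constant rate all day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
GB_PER_PB = 1_000_000


def gb_per_s_to_pb_per_day(gb_per_s: float) -> float:
    return gb_per_s * SECONDS_PER_DAY / GB_PER_PB


def pb_per_day_to_gb_per_s(pb_per_day: float) -> float:
    return pb_per_day * GB_PER_PB / SECONDS_PER_DAY


article_pb_per_day = gb_per_s_to_pb_per_day(400)  # the "mighty 400 GB/s"
logs_gb_per_s = pb_per_day_to_gb_per_s(50)        # the "50+ PB/day of logs"

print(f"400 GB/s  = {article_pb_per_day:.2f} PB/day")
print(f"50 PB/day = {logs_gb_per_s:.1f} GB/s")
print(f"ratio     = {50 / article_pb_per_day:.2f}x")
```

So the log figure is roughly one and a half times the telescope figure, not an order of magnitude above it.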

So, YouTube puts whole streams into their logs? Interesting.

Indeed, I mostly stop looking at logs once I get the metrics from mtail in prometheus/grafana.
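To illustrate the idea of turning log lines into counters and then ignoring the raw lines, here is a minimal Python stand-in. It is not mtail (which has its own program language) and the log format is hypothetical; it only shows the shape of the approach.

```python
import re
from collections import Counter

# HYPOTHETICAL log format, for illustration only:
#   "<METHOD> <path> <status> <latency_ms>"
LINE_RE = re.compile(r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)$")


def fold_into_metrics(lines, counters=None):
    """Update per-status request counters from raw log lines.

    The lines themselves are not stored; only the counters survive.
    """
    counters = Counter() if counters is None else counters
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            # Count lines we could not parse instead of silently dropping them.
            counters[("unparsed_lines_total", "")] += 1
            continue
        counters[("requests_total", match["status"])] += 1
    return counters


sample = ["GET / 200 12", "GET /x 500 40", "garbage", "POST /y 200 7"]
metrics = fold_into_metrics(sample)
for (name, status), value in sorted(metrics.items()):
    print(name, status, value)
```

A real exporter would expose these counters for scraping rather than printing them.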

What is the use case for logs?

There isn't a universal one. If you don't have a concrete one in mind, you shouldn't produce the log at all.

I appreciate the zen-like nature of this advice, but I think you also know how unreasonable it is most of the time, unless by 'concrete' you allow something as vague as, "troubleshoot production issues".

Ad-hoc production troubleshooting is a reason to keep, at most, 7 days of logs. Usually you want the most recent minute or hour. Troubleshooting usually does not need collection, aggregation, and indexing because either the problem is isolated to a host or the logs of a single host, pod, or process are representative of what is happening in the rest of the fleet. Even if you want to access all logs, it's still better to leave them where they were produced and push a predicate out to every host; your log-producing fleet has far, far more compute resources than your poor little central database, no matter how big that DB is.
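A toy sketch of "push a predicate out to every host", for readers who have not seen the pattern. Everything here is an assumption for illustration: the fleet is simulated in-process as a dict of host name to log lines, and `scan_local` stands in for whatever agent, RPC, or remote shell would actually run on each machine.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_local(host, lines, predicate):
    """Runs where the logs live and returns only the matching lines."""
    return host, [line for line in lines if predicate(line)]


def fan_out(fleet, predicate, max_workers=8):
    """Send the predicate to every host and collect only the matches.

    `fleet` maps host name -> that host's local log lines (simulated here).
    Hosts with no matches are omitted from the result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(scan_local, host, lines, predicate)
            for host, lines in fleet.items()
        ]
        results = dict(future.result() for future in futures)
    return {host: hits for host, hits in results.items() if hits}


# HYPOTHETICAL fleet and log lines.
fleet = {
    "web-1": ["INFO start", "ERROR txn=42 timeout"],
    "web-2": ["INFO ok"],
    "web-3": ["ERROR txn=42 retry", "ERROR txn=7 fail"],
}

hits = fan_out(fleet, lambda line: "ERROR" in line and "txn=42" in line)
for host, lines in sorted(hits.items()):
    print(host, lines)
```

The point of the design is that only the matching lines cross the network; the filtering work is spread over the machines that produced the logs.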

What a bunch of odd and arbitrary statements. Examples: I often use logs older than 7 days for troubleshooting. I rarely troubleshoot using only the last minute or hour of data. I need aggregation most of the time when troubleshooting. I also treat most runtime environments as cattle, so relying on them to keep logs locally would be wrong.

All but trivially reproducible bug reports require, or benefit immensely from, logs about the transaction in question. The pipeline from support to product to engineering to an actual investigation is usually much more than 7 days.

May I ask what kind of production environments you have in mind? Are these large-scale FAANG-style deployments or something else?

Well, only to the extent that the management of small amounts of logs is not very interesting. There is not an ACM SIG for very small databases.

Anyway, GDPR requires you to have a purpose for any log that contains any IP address. Keeping logs for undefined purposes and unlimited time frames is no longer OK.

It could be an e-commerce site, which depending on the case can produce a shitload of logs. Imagine having a few hundred thousand users daily and recording every page they view, along with heatmaps and whatnot. In most cases those logs should never be touched by a human in raw format. You feed them into your analytics engine and start making decisions about your conversion. And then you delete them because the goalpost is constantly moving.

To be fair, a lot of problems came and still come from 'let us store this stuff just in case'. It also helps in some unforeseen cases, but in both scenarios you essentially end up in an unstructured, unknown situation, which is generally not what you want in IT, in business, or as a person.
