A few quick questions. Let's say I have ~500 VM instances with my services deployed on them, each producing ~500 MiB of log data per hour. Can this amount of data be managed by Logrange? What would be a typical environment on AWS/GCP for that type of load? And what CPU/memory load would the log collector, which would be deployed there, put on my nodes?
That is about 250 GiB/hour of data, which is ~70 MiB/second. I would say writing it is not a big deal, because a single instance with an SSD drive can write hundreds of megabytes per second.
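The back-of-envelope arithmetic behind those numbers can be sketched like this (plain math, nothing Logrange-specific):

```go
package main

import "fmt"

// throughputMiBPerSec returns the aggregate write rate for n VMs,
// each producing miBPerHour MiB of logs per hour.
func throughputMiBPerSec(n int, miBPerHour float64) float64 {
	return float64(n) * miBPerHour / 3600
}

func main() {
	// 500 VMs x 500 MiB/hour = 250,000 MiB/hour (~244 GiB/hour),
	// which works out to roughly 70 MiB per second.
	rate := throughputMiBPerSec(500, 500)
	fmt.Printf("aggregate write rate: %.1f MiB/s\n", rate) // ~69.4 MiB/s
}
```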
> Can this amount of data be managed by Logrange?
Yes.
> What would be a typical environment on AWS/GCP for that type of load for Logrange?
Logrange can be deployed in the cloud, on-premises, on k8s, standalone, etc. There are no special requirements about where it runs. Considering the resource consumption for your example, I would say a 2-CPU instance with 8 GiB of RAM (mostly for FS buffers) and a disk big enough to store the traffic. Logrange itself adds very little overhead to the resource consumption.
> What would cpu/memory load be on my nodes from the log collector which shall be deployed there?
Not too much. On the nodes themselves, I would expect the collector to stay within 1% CPU, and maybe 10-30% of a CPU on the database itself for writing the traffic.
Very interesting. I get your point that a full-text index is not required for logs. But Logrange is presented as a "streaming database", and the word "database" kind of implies some indexing, doesn't it? For a common use case, we might need to search for some words or phrases in logs going a long time back, like a few months.
The point is that full-text indexing may be required, but not at the moment the data is persisted. Logrange groups data by tags, which is already a kind of sparse indexing. Also, if records have timestamps, like the records of an application log, Logrange will use the time index when serving requests.
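To illustrate the idea of a time index, here is a minimal sketch. The chunk layout and names are assumptions for illustration, not Logrange's actual data structures: each chunk is assumed to cover a known time range, so a binary search over chunk boundaries replaces a scan over every record.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical chunk descriptor: each chunk covers the time range
// [minTs, maxTs] of the records it stores (Unix seconds).
type chunk struct {
	name  string
	minTs int64
	maxTs int64
}

// chunksForRange returns the chunks that may contain records in
// [from, to], using binary search over the time-ordered chunk list
// instead of scanning every record — the essence of a sparse time index.
func chunksForRange(chunks []chunk, from, to int64) []chunk {
	// First chunk whose max timestamp is >= from.
	lo := sort.Search(len(chunks), func(i int) bool { return chunks[i].maxTs >= from })
	var out []chunk
	for i := lo; i < len(chunks) && chunks[i].minTs <= to; i++ {
		out = append(out, chunks[i])
	}
	return out
}

func main() {
	chunks := []chunk{
		{"c1", 0, 999}, {"c2", 1000, 1999}, {"c3", 2000, 2999},
	}
	// Only the chunks overlapping the query window get touched.
	for _, c := range chunksForRange(chunks, 1500, 2100) {
		fmt.Println(c.name) // prints c2, then c3
	}
}
```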
Once the data is persisted and stored in a partition (an analogue of a database table), an external index can be built for accessing this data as well.
Even with terabytes of data from multiple applications in one database, a full-text index may not be needed for rare requests like searching for a phrase. The query usually narrows the subset of data that should be considered, by tags and by the time window within which the text is looked for.
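The tag-based narrowing can be sketched as follows. This is illustrative only; the `partition` type and tag names are assumptions, not Logrange's API:

```go
package main

import "fmt"

// Hypothetical partition: records are grouped by a tag set such as
// {service=payments}; the field names here are illustrative only.
type partition struct {
	tags map[string]string
	size int64 // bytes stored
}

// partitionsByTag picks only the partitions whose tags match the
// query, so a phrase search never touches unrelated data.
func partitionsByTag(parts []partition, key, val string) []partition {
	var out []partition
	for _, p := range parts {
		if p.tags[key] == val {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	parts := []partition{
		{map[string]string{"service": "payments"}, 2 << 30},
		{map[string]string{"service": "nginx"}, 40 << 30},
	}
	hit := partitionsByTag(parts, "service", "payments")
	// A search for "PayPal refund" would scan 1 of 2 partitions here.
	fmt.Printf("scanning %d of %d partitions\n", len(hit), len(parts))
}
```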
OK, but let's consider an example. Say I'm looking for logs related to PayPal payments in my system, and I need to search the logs for 'PayPal', or, for example, 'PayPal withdraw' or 'PayPal refund', including cases with regexes.
If I need to run this query once in a while, is the response time acceptable? I want to see some lines right away and probably get a count of matches for a few months back. I don't expect it instantly, but in a timely manner.
The response time depends mainly on two things:
1. how you form your query, i.e. whether you can narrow down your search to a particular partition(s);
2. the amount of data in the partition(s) you're performing your search against.
As an example, if you want to look for errors in your Nginx servers, you can write something like:
SELECT FROM server=nginx WHERE msg contains "error"
This query does exactly the 2 steps described above: it narrows the search down to the partition(s) where the nginx logs are stored, then performs the search/filter against that data only. The response would take several seconds if the partition is about a couple of gigabytes (and the disk is an SSD).
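The "several seconds" estimate follows from simple sequential-read arithmetic. The 500 MiB/s read rate below is an assumption for a typical SATA SSD, not a measured Logrange figure:

```go
package main

import "fmt"

// scanSeconds estimates how long a sequential scan of a partition
// takes, given its size and the disk's sequential read rate.
func scanSeconds(partitionGiB, diskMiBPerSec float64) float64 {
	return partitionGiB * 1024 / diskMiBPerSec
}

func main() {
	// A ~2 GiB nginx partition read at ~500 MiB/s:
	fmt.Printf("~%.0f s\n", scanSeconds(2, 500)) // ~4 s
}
```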
But I would say this is not the primary use case for Logrange, i.e. running occasional single-user queries on the raw collected data. Logrange's primary use case is to store everything in one place, to be efficient, reliable and fast at that task, and later to build some process on top of the collected data which can effectively serve user queries in accordance with the specifics of the user's domain.
Good! Say my nginx log is huge, is there such a thing as automatic partitioning? By timestamp, most naturally. And each partition can run a search request in parallel, I assume.
A partition in Logrange consists of chunks which, in general, can be distributed across multiple storages. When a query is executed, it runs against one or multiple chunks (which may be done in parallel). So even if your nginx log is huge, the query will not take much longer than a scan over a single chunk. In other words, the bigger your partition, the weaker the correlation between query execution time and data size.
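The parallel-chunk idea can be sketched with one goroutine per chunk. This is illustrative only; Logrange's internal scheduling is certainly more involved than this:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countMatches scans every chunk of a partition concurrently and
// merges the per-chunk match counts. Each chunk is modeled here as
// a slice of log records (an assumption for the sketch).
func countMatches(chunks [][]string, needle string) int {
	counts := make([]int, len(chunks))
	var wg sync.WaitGroup
	for i, ch := range chunks {
		wg.Add(1)
		go func(i int, records []string) {
			defer wg.Done()
			for _, r := range records {
				if strings.Contains(r, needle) {
					counts[i]++
				}
			}
		}(i, ch)
	}
	wg.Wait()
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	chunks := [][]string{
		{"GET /index 200", "GET /pay error"},
		{"POST /refund error", "GET /ok 200"},
	}
	// Both chunks are scanned in parallel; wall time is bounded by
	// the slowest single chunk, not by the total data size.
	fmt.Println(countMatches(chunks, "error")) // 2
}
```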
If your data is a set of time series, the time index can be involved, which will speed up the execution of your query dramatically.