
Ask HN: Generating reports on time series data without killing performance? - Kendrick2
I have an application which contains approximately 10 million time series records growing at a rate of 1 million per year. This application generates reports on subsets of the time series data. These reports have poor performance characteristics because for each record in a time series, another subset of time series data must often be loaded as part of the computations behind the report. Effectively this can lead to N^2 performance. The current workaround is to implement these reports as stored procedures in the Oracle database where the time series data resides. This saves network roundtrips for every time series record.

I've found the stored procedures to be insufficiently flexible to handle complex requirements compared to modern programming languages (requirements come in for new reports on a regular basis). I'd like to generate these reports in application code (C#) but can't see a way around the performance issue. Has anyone dealt with similar challenges, and how did you work around them?
======
eschutte2
That volume of data should be handled pretty easily by the right indexes and
joins. At one place with that order of magnitude of data we kept an offline
synced clone of the data with indexes designed for the queries we needed to
run (this is basically OLAP). Have you already looked into that?

~~~
Kendrick2
Yes, currently all the reports operate quite efficiently on a set of indexes.
The performance of the database is currently very good with the reports in
PL/SQL. The problem arises when the reports are generated using application
logic on another server. Let's say we have a report for the last 10,000
records. For each of those records, I have to query the database again for
some other time series information. If the latency between the reporting
server and database server is 2ms, that report is going to take 20 seconds to
generate.

Edit: If you're saying to mirror the time series data onto the application
server to avoid the network roundtrip, I guess that is possible, but we would
still face the SQL overhead of each query.

~~~
eschutte2
Yeah, you definitely don't want to be making those roundtrips. Why doesn't a
join work?

~~~
Kendrick2
Complex domain-specific requirements preclude simple joins, unfortunately. This
is why I'd like to redevelop these queries in C#. There are many imperative
steps performed on each time series record to decide how these reports are
generated.

To give a very broad example of one report: take the latest 10,000 time series
records from geographic location X. Now, for each of those records, load the
latest 100 time series records from locations Y and Z. Multiply all the records
from Y and Z together, then divide by the value at location X. Throw in lots of
business rules at every one of these steps, so there's no chance of
representing this as a tidy join operation and no assumptions can be made about
the data we'll need to load to generate the report :)
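
In rough C# terms, the shape of that report today is something like this
(names invented; the real business rules are far messier):

    using System;
    using System.Collections.Generic;

    public sealed record SeriesPoint(DateTime Time, double Value);

    public interface ISeriesStore
    {
        IReadOnlyList<SeriesPoint> Latest(string location, int count);
        IReadOnlyList<SeriesPoint> LatestBefore(
            string location, DateTime cutoff, int count);
    }

    public static class CurrentReportShape
    {
        // Every loop iteration issues two more queries: roughly 20,000
        // extra roundtrips for a 10,000-row report.
        public static List<double> Build(ISeriesStore store)
        {
            var rows = new List<double>();
            foreach (var x in store.Latest("X", 10_000))
            {
                var ys = store.LatestBefore("Y", x.Time, 100); // roundtrip
                var zs = store.LatestBefore("Z", x.Time, 100); // roundtrip

                var product = 1.0;
                foreach (var p in ys) product *= p.Value;
                foreach (var p in zs) product *= p.Value;

                // ...plus business rules at every step in practice...
                rows.Add(product / x.Value);
            }
            return rows;
        }
    }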

~~~
eschutte2
If it's not possible to do a few big queries to pull all the necessary base
data (all the related Ys and Zs for the last 10,000 Xs) and then join/filter
it yourself in code, then I think you are stuck with many queries. The options
I can think of are: 1) keep the logic in stored procedures; 2) find a way to
fetch the raw data in a few large queries, then join yourself; 3) let the
reports be slow and generate them in the background; 4) write one complex SQL
query that does all the crunching (if possible, you're saying it's not).
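
For (2), the kind of thing I mean is roughly the sketch below (invented names;
it assumes the Y/Z history the report can touch fits in memory): pull
everything in a couple of bulk queries, index it once, and keep the per-X work
local.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public sealed record Pt(string Location, DateTime Time, double Value);

    public static class BulkReport
    {
        // xs: last 10,000 X rows (one query)
        // related: the Y/Z rows those Xs could touch (one or two queries)
        public static List<double> Build(IReadOnlyList<Pt> xs,
                                         IReadOnlyList<Pt> related)
        {
            // Index once: location -> points sorted newest-first.
            var byLoc = related
                .GroupBy(p => p.Location)
                .ToDictionary(
                    g => g.Key,
                    g => g.OrderByDescending(p => p.Time).ToList());

            var rows = new List<double>(xs.Count);
            foreach (var x in xs)
            {
                var ys = Latest(byLoc["Y"], x.Time, 100);
                var zs = Latest(byLoc["Z"], x.Time, 100);
                var product = ys.Concat(zs)
                                .Aggregate(1.0, (acc, p) => acc * p.Value);
                // ...your business rules stay ordinary C# here...
                rows.Add(product / x.Value);
            }
            return rows;
        }

        // Latest `count` points at or before `cutoff`; input is newest-first.
        static IEnumerable<Pt> Latest(
            List<Pt> sorted, DateTime cutoff, int count)
            => sorted.Where(p => p.Time <= cutoff).Take(count);
    }

Whether that's workable depends on how wide the set of Ys and Zs can get, but
a couple of large queries will beat 20,000 small ones.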

------
brudgers
1. 10 million records is not big data.

2. If the rate of data acquisition is 1 million records a year, then 9
million records are the same as last year. There's no reason to hit the
database again, because time series data [to be time series data] does not
change.

3. This suggests that processing the data from the Oracle database is not a
requirement...i.e. it could be moved to another system and processed there.
Again, it's small data and doesn't change, so duplication presents neither a
storage issue nor a consistency issue.

4. The size and static nature of the data suggest that it might fit into
memory on a single, moderately specced PC. An AWS-type approach is also
possible.

5. The right data storage format depends on the workload...maybe something
column-oriented?

Good luck.

------
NumberCruncher
When working with time series it is good practice to move the data into memory
and vectorise both the data and the data-manipulation logic. Doing so, you can
take advantage of battle-tested, fast linear algebra libraries like BLAS and
LAPACK (or even move the calculation to the GPU). The last time I worked with
time series data I used python-pandas. Even without prior Python knowledge it
took only a couple of weeks to get productive. It also has a wrapper around
SQLAlchemy, so you don't have to deal with the ORM directly.

------
thorin
I think you're probably wrong that you can't handle it in stored procedures
(packages). You may even be able to do a lot of the processing in a single SQL
statement? Try this first, then PL/SQL. Do the bits you can't do in PL/SQL in
a Java class stored in the database...

------
tmaly
I am currently working on a task that processes 4 million records to produce a
report. It takes about 10 seconds in the code.

You could always use a sliding-window type algorithm to keep up with the growth
of the data.
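
Very roughly, the sliding-window idea (made-up sketch; assumes the rows arrive
ordered by time):

    using System;
    using System.Collections.Generic;

    public static class SlidingWindow
    {
        // Rolling sum over a time window, updated incrementally so only
        // the points inside the window are ever held in memory.
        public static IEnumerable<double> RollingSum(
            IEnumerable<(DateTime Time, double Value)> points,
            TimeSpan window)
        {
            var inWindow = new Queue<(DateTime Time, double Value)>();
            var sum = 0.0;
            foreach (var p in points)
            {
                inWindow.Enqueue(p);
                sum += p.Value;
                while (inWindow.Peek().Time <= p.Time - window)
                    sum -= inWindow.Dequeue().Value;
                yield return sum;
            }
        }
    }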

------
ak39
If the requirements allow, consider replicating the data in memory with
SQLite's :memory: option. 10 million records will be nothing for it.
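
Roughly, with Microsoft.Data.Sqlite (table and column names made up):

    using System.Collections.Generic;
    using Microsoft.Data.Sqlite;

    static class InMemoryMirror
    {
        public static SqliteConnection Open()
        {
            // The :memory: database lives and dies with this connection,
            // so keep it open for the lifetime of the report run.
            var db = new SqliteConnection("Data Source=:memory:");
            db.Open();

            using var create = db.CreateCommand();
            create.CommandText =
                "CREATE TABLE series(location TEXT, ts TEXT, value REAL);" +
                "CREATE INDEX ix_series ON series(location, ts)";
            create.ExecuteNonQuery();

            // ...bulk-insert the rows from Oracle here, in one transaction...
            return db;
        }

        public static IEnumerable<double> Latest100(
            SqliteConnection db, string location, string cutoff)
        {
            using var cmd = db.CreateCommand();
            cmd.CommandText =
                "SELECT value FROM series " +
                "WHERE location = $loc AND ts <= $cutoff " +
                "ORDER BY ts DESC LIMIT 100";
            cmd.Parameters.AddWithValue("$loc", location);
            cmd.Parameters.AddWithValue("$cutoff", cutoff);

            using var reader = cmd.ExecuteReader();
            while (reader.Read())
                yield return reader.GetDouble(0);
        }
    }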

Let us know.

------
maigrait
Are these Java stored procedures?

