Great to see such benchmarks and another dataset to test with!
Just wanted to add a small explanation, for those interested, of why at least OctoSQL[0] falls behind in the first and last tests.
OctoSQL supports retractions. That means an input for OctoSQL can not only produce records, but also say "I'm taking this record back", and OctoSQL will calculate the correct output based on that with minimal processing overhead. It's a dataflow system in that sense.
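To make the idea concrete, here's a hypothetical sketch (not OctoSQL's actual code) of a retraction-aware count-per-key aggregate: each event carries a sign, +1 for a new record and -1 for a retraction, and the aggregate stays correct incrementally without reprocessing earlier input.

```python
from collections import defaultdict

def apply_events(events):
    """Maintain per-key counts under productions (+1) and retractions (-1)."""
    counts = defaultdict(int)
    for sign, key in events:
        counts[key] += sign
        if counts[key] == 0:
            del counts[key]  # fully retracted keys disappear from the output
    return dict(counts)

# The last event retracts one of the two "a" records.
events = [(+1, "a"), (+1, "a"), (+1, "b"), (-1, "a")]
print(apply_events(events))  # {'a': 1, 'b': 1}
```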
To achieve that, it has to buffer data at the output to be able to display the current output state (which might have rows removed from it). The same applies to LIMIT, which is currently quite suboptimal in OctoSQL: because the top 100 rows can actually change if you have an ORDER BY clause specified and new rows/retractions arrive, it buffers the whole output and then displays the top 100 rows.
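A toy illustration (again hypothetical, not OctoSQL's implementation) of why LIMIT with ORDER BY needs the full buffer when retractions are possible: retracting a row that sits inside the top N pulls a row in from outside it, so the tail can never be safely discarded.

```python
def top_n(rows, n):
    """Recompute the ordered top-n from the full buffer."""
    return sorted(rows)[:n]

buffer = [5, 1, 9, 3, 7]
print(top_n(buffer, 2))  # [1, 3]
buffer.remove(1)         # a retraction arrives for row 1
print(top_n(buffer, 2))  # [3, 5] -- 5 was outside the previous top 2
```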
Running with the stream_native output format should drastically reduce the memory overhead and possibly processing time (I’ll post an update when the data is downloaded and I can compare), at least in the first case.
Obviously, OctoSQL could be much smarter about detecting whether a source can even produce retractions, and optimize its behavior based on that. There's no point in buffering if retractions are not possible (which is the case for e.g. a JSON file). This would make LIMIT and the output table much faster and lighter.
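The optimization described above could look roughly like this (a minimal sketch with made-up names, not OctoSQL's API): if no source in the plan can emit retractions, stream rows straight through in O(1) memory; otherwise fall back to the buffered path.

```python
def output_rows(rows_iter, can_retract):
    """Stream eagerly when retractions are impossible; buffer otherwise."""
    if not can_retract:
        yield from rows_iter          # eager path: constant memory
    else:
        yield from sorted(rows_iter)  # buffered path (grossly simplified)

print(list(output_rows(iter([3, 1, 2]), can_retract=False)))  # [3, 1, 2]
print(list(output_rows(iter([3, 1, 2]), can_retract=True)))   # [1, 2, 3]
```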
It’s definitely a big priority for me for future development to fix this (and make LIMIT and the table output format usable with big outputs), so stay tuned! :)
And congrats to SPyQL for the first place here, its design is awesome.
EDIT: Running the first test with `--output stream_native` (which just sends the productions and retractions to the output as they come, instead of buffering and sorting them) does indeed run 2-3x faster and use ~50 MB of RAM instead of gigabytes.
Thanks @cube2222! Great! Sorry for overlooking that option. I will definitely add it to the OctoSQL runs to make the comparison fair. Should I add it to all 3 queries?
No worries! I think it'll be most fair if I just let you know when these are fixed, and you can properly use them, performantly, with a sane output option (not stream_native, but something like JSON). E.g., LIMIT won't work with stream_native.
If you want to add it to the first one, I won't oppose, but don't feel pressured to do it. Especially the third query does highlight a very realistic limitation of OctoSQL as it works right now, so it's only fair as it is.
I've just released OctoSQL v0.6.0 which fixes the issue and adds eagerly printed CSV and JSON output, as well as a non-buffered LIMIT implementation. It's now just slightly slower in the benchmark than SPyQL.
[0]: https://github.com/cube2222/octosql