
Filter Before You Parse: Faster Analytics on Raw Data with Sparser - bandwitch
https://dawn.cs.stanford.edu/2018/08/07/sparser/
======
zerebubuth
Sounds a lot like the "on the fly parsing" (§3.1) in Alagiannis' NoDB (see
[https://stratos.seas.harvard.edu/files/stratos/files/nodb-ca...](https://stratos.seas.harvard.edu/files/stratos/files/nodb-cacm.pdf)
for details).

------
carterschonwald
I’m more interested in the work some folks are doing on using techniques from
succinct data structures to accelerate parsing:
[https://github.com/haskell-works/hw-json](https://github.com/haskell-works/hw-json)

It’s still relatively immature, but it’s a more algorithmic approach that I
think plays nicely with pretty much any source of semi-structured data. SIMD
acceleration is certainly pretty sweet too, though.

------
X6S1x6Okd1st
Has anyone figured out how to get a Spark instance up and running with Sparser?

------
CalChris
#define PREPROCESSING 1

------
PaulHoule
I have done that for a long time.

~~~
mamcx
How do you do that? If I have JSON/CSV data in tabular form, how do I apply
this idea in a simple way? Naively, I'd think it isn't possible without parsing
the data.

~~~
virgilp
It's quite simple, really - at least in some instances.

Say you're looking for log lines (or JSON records) that contain the username
'mamcx': you first filter out the lines that do NOT contain 'mamcx', then parse
only the remaining ones (and apply the condition again to the properly parsed
ones, since the raw match may sit in some other field).
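
A minimal sketch of that two-step filter in Python, assuming one JSON record per line and a plain ASCII username that is not written with escape sequences in the raw text. The function name and the sample lines are made up for illustration; this is not Sparser's implementation.

```python
import json

def filter_then_parse(lines, username):
    """Return (matching records, number of lines that had to be parsed)."""
    matches = []
    parsed_count = 0
    for line in lines:
        # Cheap raw substring check: lines without the needle are never parsed.
        if username not in line:
            continue
        parsed_count += 1
        record = json.loads(line)
        # Re-apply the real condition: the substring may be in another field.
        if record.get("username") == username:
            matches.append(record)
    return matches, parsed_count

# Hypothetical sample data.
LOG_LINES = [
    '{"username": "mamcx", "version": 3}',
    '{"username": "kornish", "version": 1}',
    '{"username": "virgilp", "note": "reply to mamcx"}',
]

matches, parsed_count = filter_then_parse(LOG_LINES, "mamcx")
print(matches, parsed_count)
```

The third line passes the raw filter but fails the check on the parsed record, which is why the condition has to be applied twice.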

~~~
mamcx
OK, but if I find a match, I'm stuck with an incomplete result:

    
    
            //Dude, and the data above what?
            "username": "mamcx",
            "version": 3
        }
    

So does this need to backtrack a lot?

Or does it scan for the separators ({}) first, and then scan again inside each
piece it finds?

    
    
        for chunk in data.scan("{", "}"):    # pseudocode: split on the separators
            if chunk.find("mamcx") >= 0:     # then search inside each chunk
                parse(chunk)
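
That second option could be sketched roughly as follows in Python. This assumes the input is a sequence of valid top-level JSON objects (no comments or other stray text), and `split_objects` and `find_records` are hypothetical helper names. The splitter only tracks brace depth and string state, so it is much lighter than a full parse, but it still touches every character; it shows the structure of the idea rather than a fast implementation.

```python
import json

def split_objects(text):
    """Yield the raw text of each top-level {...} object without parsing it."""
    depth = 0
    start = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string, braces are data; only watch for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

def find_records(text, needle):
    """Parse only the objects whose raw text contains `needle`."""
    for raw in split_objects(text):
        if needle in raw:
            yield json.loads(raw)

# Hypothetical pretty-printed log with one entry spread over several lines.
PRETTY_LOG = """
{
    "username": "kornish",
    "version": 1
}
{
    "username": "mamcx",
    "version": 3
}
"""

records = list(find_records(PRETTY_LOG, "mamcx"))
print(records)
```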

~~~
kornish
You may want to rethink your log format to have a single (minified JSON) entry
per line - it can make your life easier in a variety of ways. Out of interest,
what library are you using that logs single entries on multiple lines?
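
As a rough illustration of the one-entry-per-line layout, here is a sketch using Python's standard `json` module; the helper name and sample records are invented for the example. `json.dumps` escapes newlines inside strings, so each serialized entry stays on a single physical line and a plain line-by-line substring filter works again.

```python
import json

def to_json_line(record):
    """Serialize one record as minified JSON on a single line."""
    return json.dumps(record, separators=(",", ":"))

# Hypothetical records; the second one contains a newline inside a string.
entries = [
    {"username": "mamcx", "version": 3},
    {"username": "kornish", "msg": "line one\nline two"},
]

log_text = "\n".join(to_json_line(e) for e in entries)

# Line-level filter first, full parse only for the lines that survive.
hits = [json.loads(line) for line in log_text.splitlines() if '"mamcx"' in line]
print(log_text)
print(hits)
```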

~~~
mamcx
I don't always control my inputs. I use a variety of libraries (as explained
here:
[https://www.reddit.com/r/rust/comments/8ygbvy/state_of_rust_...](https://www.reddit.com/r/rust/comments/8ygbvy/state_of_rust_for_iosandroid_on_2018/)),
so I'm more interested in the overall idea.

But for this specific case I use
[https://www.newtonsoft.com/json](https://www.newtonsoft.com/json).

Eventually I could add the idea to my arsenal and maybe build a
format-streaming pipeline utility that I can use across platforms.

