
Indexing JSON logs with Parquet - vistarchris
http://labs.vistarmedia.com/2016/12/27/indexing-json-logs-with-parquet.html
======
justinsaccount
Parquet would be a lot more interesting if it could be freed from all the
Java/Hadoop/Spark baggage.

I don't want Hadoop. I don't want Spark. I don't want Drill. I don't want
Presto.

I don't have big data. I do happen to have a few hundred gigabytes of
compressed CSV files, which would likely be a lot faster to slice and dice if
they were stored in a compact columnar format.

The other day I did something like

      $ zcat logs/*.gz | cut -f 3,5 | fgrep -w 23 | count_distinct

It took about 30 minutes on a single machine, mostly from the IO/decompression
overhead.

All I want to do is something like

      $ zcat logs/*.json.gz | json2parquet parquet_logs
      $ parquet-filter --output-fields src dport=23 parquet_logs | count_distinct

~~~
zten
I can see why you'd want to keep it non-Java and simple. I will admit that I
am biased since I work with Spark all day, but I think you'll eventually want
to write a query where you wish you had Spark SQL or could manipulate it as a
Spark RDD.

Spark actually isn't too painful for this because you can run it locally in a
single process. It is not exactly as elegant as the UNIX way you outline in
your second example, but it isn't as horrifying as submitting a YARN job or
spinning up a cluster.

In spark-shell:

        // Skip this conversion if you're just running one query. Do it if you're running many.
        spark.read.json("/path/to/json/logs").write.parquet("/target/path")
        spark.read.parquet("/target/path").createOrReplaceTempView("logs")
        spark.sql("select count(distinct src) from logs where dport = 23").show()

The incremental conversion of your JSON data set to Parquet will be a little
bit more annoying to write in Scala than the above example, but is very much
doable. There is also a small amount of overhead with the first
spark.read.parquet, but it's faster on a local data source than it is against
something like S3.
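
Roughly, the incremental version might look something like this (this assumes one
gzipped JSON file arrives per day, and the paths here are made up):

        // Convert just the newly arrived day and append it to the existing Parquet data set.
        val day = "2016-12-28"
        spark.read.json(s"/logs/json/$day.gz")
          .write
          .mode("append")
          .parquet("/logs/parquet")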

~~~
justinsaccount
Yeah.. I agree. It's not really that I don't want to use spark, it's more that
I don't want to have to use spark just to convert some data files.

If I can write a short tool that wraps the

      spark.read.json("/path/to/json/logs").write.parquet("/target/path")

as

      spark.read.json(args(0)).write.parquet(args(1))

and then use it like

      json_to_parquet /logs/json/2016-12-28.gz /logs/parquet/2016-12-28

or something, that would work.
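
For what it's worth, that wrapper is only a few lines of Scala. A minimal sketch
(the object name is made up; run it with spark-submit in local mode):

      import org.apache.spark.sql.SparkSession

      object JsonToParquet {
        def main(args: Array[String]): Unit = {
          // args(0): input JSON path or glob (.gz is fine), args(1): output Parquet directory
          val spark = SparkSession.builder()
            .appName("json_to_parquet")
            .master("local[*]")   // single machine, no cluster
            .getOrCreate()
          spark.read.json(args(0)).write.parquet(args(1))
          spark.stop()
        }
      }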

I just wish there were some more CLI-friendly tools for the cases where the
grep | cut | sort pipeline works, but you'd just like to make it a bit more
efficient.

------
whatnotests
I'm curious about where the threshold lies between "ELK Stack is good enough"
and "We need Parquet".

~~~
coredog64
My employer is doing a lot of this, mostly for cost reduction. Keeping live
data in Elasticsearch (or some other horizontally scalable system) is
expensive. It's far cheaper to keep a small portion live and the rest in S3.

Also worth mentioning that managed ES in AWS has a hard limit of 20 data nodes
with 512GB of EBS. We've outgrown that and are now looking hard at
alternatives like this.

