Hacker News
Ask HN: How to run queries on 20B clicks log
4 points by xstartup on Nov 10, 2017 | 6 comments
I have a click log with 20 billion clicks. I want to perform group-by, count, sum, and average queries on this data set and make the results available to all of our customers. Each record includes browser name, country, carrier, and campaign ID. Queries include, e.g.: a) get the number of clicks from X to Y date, for each specific browser. What options do I have?
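For concreteness, the kind of aggregate described in query (a) can be sketched against a tiny stand-in table. This uses an in-memory SQLite database purely to show the query shape; the column names (`ts`, `browser`, etc.) are assumptions based on the record fields listed above, and the real engine (BigQuery, Druid, ClickHouse) would accept essentially the same SQL:

```python
import sqlite3

# In-memory SQLite table standing in for the real 20B-row click log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks (ts TEXT, browser TEXT, country TEXT,"
    " carrier TEXT, campaign_id INTEGER)"
)
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?, ?, ?)",
    [
        ("2017-11-01", "Chrome", "US", "ATT", 1),
        ("2017-11-02", "Chrome", "IN", "Jio", 1),
        ("2017-11-03", "Firefox", "US", "ATT", 2),
    ],
)

# Query (a): number of clicks per browser between dates X and Y.
rows = conn.execute(
    """
    SELECT browser, COUNT(*) AS clicks
    FROM clicks
    WHERE ts BETWEEN '2017-11-01' AND '2017-11-02'
    GROUP BY browser
    ORDER BY browser
    """
).fetchall()
print(rows)  # [('Chrome', 2)]
```

At 20B rows the hard part is not the SQL but the storage engine behind it, which is what the replies below are really debating.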



BigQuery would probably be the easiest, lowest-maintenance approach. Normally I would leverage the Hadoop ecosystem for this type of ad-tech/analytics problem, but I am not 100% certain of your use case, budget, etc. Providing clients read-only SQL access would be much better than a REST API.


You mention making it available to customers: are you looking for APIs?

https://github.com/o1lab/xmysql - one command to generate a REST API for any MySQL database in seconds. (Shameless plug - something I've been working on.)

It supports grouping, count, and where clauses as query params of the API. I'll be happy to help - please raise an issue if you get stuck. Thank you.


This is not going to perform well with this amount of data: HTTP responses for these queries are likely to time out. You need a job-based queue instead.
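The job-queue pattern mentioned here can be sketched in a few lines: rather than holding an HTTP connection open while a slow aggregate runs, the server enqueues the query, a worker executes it, and the client polls for the result by job id. A minimal in-process sketch (the real thing would use a broker like Celery/RQ and persistent result storage):

```python
import queue
import threading

jobs = queue.Queue()   # pending (job_id, callable) pairs
results = {}           # job_id -> completed result

def worker():
    # Pull queries off the queue and record their results.
    while True:
        job_id, run_query = jobs.get()
        try:
            results[job_id] = {"status": "done", "rows": run_query()}
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Enqueue a stand-in aggregate query; a client would poll results["job-1"].
jobs.put(("job-1", lambda: [("Chrome", 2)]))
jobs.join()  # wait for the worker to finish the queued job
print(results["job-1"])
```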


I'd go for Druid + Pivot (or BigQuery if you happen to be in GCloud already).

http://druid.io/

https://docs.imply.io/pivot/
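For reference, query (a) from the question maps naturally onto a Druid native groupBy query. A sketch as a Python dict (the datasource name `clicks` and the dimension/interval values are assumptions based on the question):

```python
import json

# Druid native groupBy query: clicks per browser over a date range.
query = {
    "queryType": "groupBy",
    "dataSource": "clicks",
    "granularity": "all",
    "dimensions": ["browser"],
    "aggregations": [{"type": "count", "name": "clicks"}],
    "intervals": ["2017-11-01/2017-11-10"],
}
print(json.dumps(query, indent=2))
```

This JSON would be POSTed to the Druid broker; Pivot generates queries like this for you from a UI.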


Have you actually deployed this in production? Last year we had lots of issues with Druid. Do you know about ClickHouse?


Can you write briefly about the issues you had with Druid? We've been looking at bringing it into our stack for our OLAP queries and aggregates and would appreciate a war story.





