Ask HN: What is the recommended/straightforward GCP option for data processing?
1 point by s-xyz on April 18, 2020 | hide | past | favorite | 3 comments
What is, for you, the most fitting and straightforward solution for performing data processing tasks (cleaning raw files and pushing them to BigQuery)?

1. Cloud Functions with Pub/Sub: for each file that comes in, execute an isolated function that cleans and stores that single file

- Pro: simple to set up; great for parallel requests (e.g. send 1000 Pub/Sub messages for 1000 files); error reporting included

- Cons: a single invocation is capped at 9 minutes (540 s); possibly expensive as GHz-seconds consumption increases
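
For option 1, the function itself can stay very small. A minimal sketch, assuming a Pub/Sub-triggered background function whose message carries the GCS path (the `path` field and the `download`/`load_to_bigquery` helpers are hypothetical):

```python
import base64
import json

def clean_lines(raw: str) -> list:
    """Hypothetical cleaning step: strip whitespace, drop empty lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]

def handle_file(event, context):
    """Background Cloud Function entry point for a Pub/Sub trigger."""
    # Pub/Sub delivers the message payload base64-encoded in event["data"].
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    path = payload["path"]  # illustrative message field

    raw = download(path)          # hypothetical GCS read helper
    rows = clean_lines(raw)
    load_to_bigquery(rows)        # hypothetical BigQuery load helper
```

Publishing 1000 messages then just fans this out to up to 1000 parallel invocations.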

2. Dataflow: create a pipeline and trigger it using GCF/Pub/Sub

- Pro: seems able to handle heavy compute workloads, and the name/description indicates it's designed for building proper data processing pipelines.

- Cons: I personally find the startup time very long for individual executions, e.g. when doing batch processing; I guess I would need stream processing to speed things up. The biggest issue is that it seems complex (many steps involved) to set up properly, including error monitoring, especially compared to Cloud Functions and App Engine. For simply processing files it seems like a lot of work.
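
The pipeline code itself for option 2 is short once the scaffolding is in place; the complexity is mostly in deployment and monitoring. A sketch assuming Apache Beam's Python SDK, with placeholder GCS path, table and schema:

```python
# Pure per-line transform, kept separate so it can be unit tested
# without Beam installed. Assumes a two-column "name,value" CSV.
def parse_csv_line(line: str) -> dict:
    name, value = line.split(",", 1)
    return {"name": name.strip(), "value": int(value)}

def run():
    # Beam imported lazily; this runs on the DirectRunner locally and
    # on the DataflowRunner when submitted to GCP.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | beam.io.ReadFromText("gs://my-bucket/raw/*.csv")   # placeholder
         | beam.Map(parse_csv_line)
         | beam.io.WriteToBigQuery(
               "my_dataset.my_table",                          # placeholder
               schema="name:STRING,value:INTEGER"))
```

The slow per-run startup mentioned above is the worker VMs spinning up; a streaming pipeline keeps workers warm at the cost of always-on billing.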

3. App Engine and Cloud Tasks: create a Flask app with a route and parameters (e.g. file path) that processes a single file when called, similar to the GCF logic. Run requests asynchronously as background tasks with Cloud Tasks.

- Pro: can handle large-memory computations and long-running tasks. Easy to set up and deploy. Auto-scales with the flex environment.

- Cons: it confuses me that you need to set up a Flask app, which makes me doubt whether it is intended to be used the way I am using it.
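
The Flask app in option 3 really only needs a single worker route. A sketch assuming Cloud Tasks POSTs a JSON body with the file path (the `path` field, `clean_record` logic, and the `read_records`/`load_to_bigquery` helpers are illustrative):

```python
def clean_record(record: dict) -> dict:
    """Hypothetical cleaning step: lowercase keys, strip string values."""
    return {k.lower(): (v.strip() if isinstance(v, str) else v)
            for k, v in record.items()}

def create_app():
    # Flask imported lazily so clean_record stays testable without it.
    from flask import Flask, request
    app = Flask(__name__)

    @app.route("/process", methods=["POST"])
    def process():
        # Cloud Tasks delivers the task payload as the POST body.
        path = request.get_json()["path"]      # illustrative field
        records = read_records(path)           # hypothetical GCS reader
        cleaned = [clean_record(r) for r in records]
        load_to_bigquery(cleaned)              # hypothetical loader
        return {"processed": len(cleaned)}, 200

    return app
```

The app is just an HTTP shim around the worker logic, which is why it feels like boilerplate: Cloud Tasks can only deliver work over HTTP.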

4. Compute Engine and Cloud Tasks: same as App Engine but with a fixed, non-auto-scaling VM

- Pro: probably cheaper than App Engine

- Cons: no autoscaling capacity (unless you set it up yourself). I also find it a lot of work to set up the deployment process (as opposed to App Engine, where you can just run gcloud app deploy).
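
For either option 3 or 4, enqueueing the per-file work looks the same. A sketch assuming the google-cloud-tasks client and an HTTP-target worker (project, location, queue name and URL are placeholders):

```python
import json

def task_body(path: str) -> bytes:
    """Build the JSON payload the worker route expects (illustrative field)."""
    return json.dumps({"path": path}).encode("utf-8")

def enqueue(path: str):
    # Client imported lazily so task_body stays testable without it.
    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "file-queue")
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://my-service.example.com/process",  # placeholder
            "body": task_body(path),
        }
    }
    return client.create_task(parent=parent, task=task)
```

The queue also gives you retry and rate-limit controls, which neither App Engine nor a plain VM provides on its own.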



Where are said files coming from? What sorta latency requirements do you have?


Cloud Storage, no latency requirements


ymmv but I kinda like Airflow for automating data ingestion tasks - mainly using its scheduling to run Python scripts that ETL data into BigQuery.
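
A sketch of that setup: a daily Airflow DAG wrapping a Python ETL script (the DAG id, schedule, and the extract/load helpers are placeholders; the pure transform is kept importable without Airflow):

```python
def transform(rows):
    """Hypothetical cleaning step: drop rows with a missing value."""
    return [r for r in rows if r.get("value") is not None]

def run_etl():
    rows = extract_from_gcs()              # hypothetical helper
    load_to_bigquery(transform(rows))      # hypothetical helper

def build_dag():
    # Airflow imported lazily; in a real deployment this would live
    # as a top-level module in the dags/ folder.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG("gcs_to_bigquery_etl",        # placeholder DAG id
             start_date=datetime(2020, 4, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        PythonOperator(task_id="etl", python_callable=run_etl)
    return dag
```

On GCP, Cloud Composer is the managed way to run this, though it costs more than any of the options above.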

BigQuery can also directly query data on GCS now, I believe, so maybe that's an option if you don't need to do any transforms.
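
A sketch of that federated-query approach, assuming the google-cloud-bigquery Python client and a temporary external table definition over CSV files (bucket, prefix and query are placeholders):

```python
def gcs_uri(bucket: str, prefix: str) -> str:
    """Wildcard URI for the external table's source files."""
    return f"gs://{bucket}/{prefix}/*.csv"

def query_gcs_directly():
    # Client imported lazily; gcs_uri stays testable without it.
    from google.cloud import bigquery

    ext = bigquery.ExternalConfig("CSV")
    ext.source_uris = [gcs_uri("my-bucket", "raw")]   # placeholder
    ext.autodetect = True

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(table_definitions={"raw_files": ext})
    return client.query("SELECT * FROM raw_files LIMIT 10",
                        job_config=job_config).result()
```

No files are loaded into BigQuery storage; the query reads straight off GCS, so this works best when the raw files are already clean.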



