
Ask HN: What is the recommended/straightforward GCP option for data processing? - s-xyz
What is, for you, the most fitting and straightforward solution for performing data processing tasks (cleaning raw files and pushing them to BigQuery)?

1. Cloud Functions with Pub/Sub: for each file that comes in, execute an isolated function that cleans and stores that single file.

- Pros: simple to set up, great for parallel requests (e.g. send 1000 Pub/Sub messages for 1000 files), error reporting included

- Cons: max duration of 10 min for a single task, maybe expensive as GHz consumption increases

2. Dataflow: create a pipeline and trigger it using GCF/Pub/Sub.

- Pros: seems to be able to handle heavy computing problems, and the name/explanation indicates it's designed for setting up proper data processing pipelines

- Cons: I personally find the startup time very long for individual executions, e.g. when doing batch processing. I guess I would need stream processing to speed things up. The biggest issue I find is that it seems complex (many steps involved) to set up properly (including error monitoring), in particular compared to Cloud Functions and App Engine. For simply processing files it seems like a lot of work.

3. App Engine and Cloud Tasks: create a Flask app with a route and parameters (e.g. file path) that processes a single file when called, similar to the GCF logic. Run the requests as asynchronous background tasks with Cloud Tasks.

- Pros: can handle large-memory computations and long-duration tasks. Easy to set up and deploy. Autoscales with the flex environment.

- Cons: I am confused that you need to set up a Flask app, which makes me doubt whether it is intended to be used the way I am using it.

4. Compute Engine and Cloud Tasks: same as App Engine but with a fixed, non-autoscaling VM.

- Pros: probably cheaper than App Engine

- Cons: no autoscaling capacity (unless you set this up). I also find it a lot of work to set up the deployment schedule (as opposed to App Engine, where you can just run gcloud app deploy).
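
For reference, the per-file logic in option 1 could look roughly like the sketch below. It assumes a 1st-gen background function triggered by a Cloud Storage notification delivered through Pub/Sub, where event["data"] is a base64-encoded JSON payload with "bucket" and "name" fields. The names parse_gcs_event, clean_csv and handle are hypothetical, the cleaning rules are placeholders, and the actual download and BigQuery load are left as comments.

```python
import base64
import csv
import io
import json


def parse_gcs_event(event):
    """Return (bucket, object_name) from a Pub/Sub-wrapped GCS notification.

    Assumes event["data"] holds base64-encoded JSON with "bucket" and "name".
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return payload["bucket"], payload["name"]


def clean_csv(raw_text):
    """Hypothetical cleaning step for one file.

    Trims whitespace in headers and values, drops rows whose values are all
    empty, and returns newline-delimited JSON (one object per row).
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    lines = []
    for row in reader:
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore overflow columns without a header
        }
        if any(cleaned.values()):
            lines.append(json.dumps(cleaned, sort_keys=True))
    return "\n".join(lines)


def handle(event, context=None):
    """Entry point sketch: work out which file the message refers to."""
    bucket, name = parse_gcs_event(event)
    # In a real function you would download gs://{bucket}/{name} here,
    # pass its contents to clean_csv, and load the result into BigQuery.
    return f"gs://{bucket}/{name}"
```

Keeping the cleaning step as a plain function means the same code could be called from a Flask route (options 3 and 4) without changes.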
======
mattbillenstein
Where are said files coming from? What sorta latency requirements do you have?

~~~
s-xyz
Cloud Storage, no latency requirements

~~~
mattbillenstein
YMMV, but I kinda like Airflow for automating data ingestion tasks, mainly
using its scheduling to run Python scripts that ETL data into BigQuery.

BigQuery can also directly query data on GCS now, I believe, so maybe that's an
option if you don't need to do any transforms.
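
A rough sketch of what that could look like: the helper below only builds the DDL string for an external table over files in a bucket, relying on schema auto-detection. The function name, the table name, and the bucket path are made up for illustration; you'd run the resulting statement yourself in the BigQuery console or with a client.

```python
def external_table_ddl(table, uri_pattern, fmt="CSV", skip_leading_rows=1):
    """Build a CREATE EXTERNAL TABLE statement for files stored in GCS.

    table and uri_pattern are placeholders supplied by the caller.
    """
    options = [f"format = '{fmt}'", f"uris = ['{uri_pattern}']"]
    # skip_leading_rows only makes sense for CSV files with a header line.
    if fmt == "CSV" and skip_leading_rows:
        options.append(f"skip_leading_rows = {skip_leading_rows}")
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table}`\n"
        "OPTIONS (\n  " + ",\n  ".join(options) + "\n)"
    )


if __name__ == "__main__":
    # Hypothetical dataset and bucket names.
    print(external_table_ddl("my_dataset.raw_files", "gs://my-bucket/raw/*.csv"))
```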

