The book is still very relevant, but it would be nice to get a refresh; many things have changed. Perhaps the biggest one is the execution engine. Jordan gave a nice talk on this in March:
I would love to know why BigQuery has a limit of 1,000 loads per day, per project.
I'm the founder of an early-stage company that helps companies ETL their data into various data warehouses, and I'm frequently met with companies that have evaluated BigQuery but ended up with Redshift or Snowflake because the maximum inserts/updates/deletes are too low for their ETL process. Until these limits are increased, BigQuery is not an alternative to Redshift or Snowflake for a vast number of analytics teams that query thousands of tables that are updated hourly.
The limitation is per table and for "batch loads" only. If you have per-row inserts, you can always use BigQuery's amazing streaming API, and you can also do materialization and federated query loads.
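To make the streaming alternative concrete, here's a rough sketch of the client-side batching it implies. The table name and row shape are made up, and the actual `insert_rows_json` call from the `google-cloud-bigquery` client is left as a comment since it needs a project and credentials:

```python
# Sketch: batching per-row events for BigQuery's streaming insert path
# (tabledata.insertAll). BigQuery recommends keeping each streaming
# request to at most 500 rows, so we chunk the rows client-side.

def batch_rows(rows, max_per_request=500):
    """Split rows into chunks no larger than max_per_request."""
    return [rows[i:i + max_per_request]
            for i in range(0, len(rows), max_per_request)]

events = [{"user_id": i, "action": "click"} for i in range(1200)]
batches = batch_rows(events)

# With the real client (table name is hypothetical):
# from google.cloud import bigquery
# client = bigquery.Client()
# for batch in batches:
#     errors = client.insert_rows_json("my-project.my_dataset.events", batch)
#     if errors:
#         raise RuntimeError(errors)

print(len(batches))  # 3 requests for 1,200 rows
```

The key point is that streaming inserts are counted under a separate (and much higher) quota than batch load jobs, so per-row ingestion never touches the per-table load limit.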
Batch limits are there to keep folks from creating too many small files, which hurts performance (I believe Jordan's book explains this in greater detail).
Would love to know more about your use cases, since this type of thing shouldn't be the deciding factor in choosing an analytics data warehouse and should be easy to architect for. As someone pointed out, most quotas are negotiable. If anything, BigQuery's ingest process is pretty amazing and unique in the industry.
Apologies for brevity, I'm out of office on paternity leave.
Check out, as well as the blog posts on the required reading list on the topic of inserts. There's also a pretty good Kafka-bq connector.
It sounds like "ingest" was the big factor in your choice of a data warehouse. Regardless, I think BigQuery has the best story in the industry here (but I'm biased).
I'd help more hands on but sadly not online for a while :)
50,000 requests per project per day, which can be increased.
Which has a limit of 50,000 loads per project, per day.
edit: corrected limit.
Your initial concern of 1,000 loads per project is actually 50,000, and that limit can be raised.
The limit of 1,000 loads per TABLE per day is there and can't be raised; a couple of posts above explain how to work within it. This limit is also higher for partitioned tables.
Apologies for brevity.
In practice, I've never hit either of these quotas, because our data architecture is such that we'd have to have some kind of major operational screw-up to encounter them (e.g., a runaway failed import job that somehow manages to fail 50,000 times in a single day). Regardless, you can work around the per-table limit by using Wildcard Tables.
In my case, the source of my data is strictly batch-oriented, so importing data into distinct tables with a common prefix solves this problem for me. Where I do use partitioned tables, it's highly unlikely that I'd update a single partition 1,000 times in a day (I'd be curious about any use case for that).
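As a rough illustration of that common-prefix pattern (the table prefix, project, and dataset names are hypothetical): each day's batch load lands in its own date-suffixed shard, so the per-table load limit applies per shard, and a single wildcard query reads across all of them:

```python
from datetime import date, timedelta

# Sketch of the date-sharded-table pattern that wildcard queries rely on.
# Each batch load targets its own table (events_YYYYMMDD), so the
# 1,000-loads-per-table limit applies per shard, not to the whole dataset.

def shard_name(prefix, day):
    return f"{prefix}_{day.strftime('%Y%m%d')}"

anchor = date(2018, 6, 1)  # fixed date for a reproducible example
tables = [shard_name("events", anchor - timedelta(days=d)) for d in range(3)]
print(tables)  # ['events_20180601', 'events_20180531', 'events_20180530']

# A single wildcard query then reads across the shards, filtering on
# the pseudo-column _TABLE_SUFFIX:
query = """
SELECT action, COUNT(*) AS n
FROM `my-project.my_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20180530' AND '20180601'
GROUP BY action
"""
```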
The DML features of standard SQL are probably best viewed as a "helpful backup" rather than a general solution. Yes, you can run UPDATE and INSERT statements against your non-partitioned tables, but you probably want to limit that to situations where you are trying to correct something specific, or on an ad-hoc basis, and not as a general solution.
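For example, an ad-hoc correction via standard SQL DML might look like the following (the table name and values are hypothetical, and the `client.query` call is left as a comment since it needs credentials); the point is that this is a one-off fix, not a per-row write path:

```python
# Hypothetical one-off data fix using standard SQL DML. Fine as an
# occasional correction; running statements like this per row defeats
# the purpose of an append-oriented warehouse.
fix_sql = """
UPDATE `my-project.my_dataset.events`
SET action = 'view'
WHERE action = 'vew'  -- correct a misspelled value in one pass
"""

# With the real client:
# from google.cloud import bigquery
# bigquery.Client().query(fix_sql).result()
```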
In any case, there are a number of tools you can use to avoid being impacted by this particular limitation. For example, you could use streaming. And while streaming has its own quota (100,000 inserts per project per second), it's straightforward to ask for an increased quota.
The trick is to think of BQ like you do Hive or other "non-typical" SQL databases: yes it has a SQL query interface, but if you treat it like PGSQL or Oracle, you aren't going to get the benefits out of it.
If you are going to try to perform OLTP operations against BQ, you are going to fail. It's a data warehouse, not a transactional system. Working with log-structured data, or append-only data, is where it really shines.
The limit of 1,000 load jobs per table per day cannot be raised.