Hacker News
Google makes BigQuery available in Sheets (techcrunch.com)
345 points by truth_seeker 13 days ago | 122 comments

All this fancy stuff, but where's my Sets at? Google took down the website like 15 years ago. I used that thing daily. Then they put it into Google Sheets. You could type "apple" then "pear" then "orange" and then select cells, drag, and you'd get a list of fruits! Some you never heard of before!

I looked around and Google has a patent on it, and seemingly no public implementation. This makes me super sad because I honestly considered Sets one of the most useful exploration and ideation tools ever created.

BigQuery in Sheets, cool I guess? I want Sets back.

That's the same functionality, but the results are not as good.

I put in: fat obese porcine

I get: wild-type hyperactive amnesic misshapen narcissistic slimy cancerous motile putrid gravid

I wonder how enforceable that patent is.

What's a use case for this?

Still not as cool as Google Squared, where 10 years ago, you could create rows of any nouns and cols of any adjectives, and it would automagically research and fill out all the data for you according to its best guess. Create rows of car models, cols of price, seating, horsepower, safety rating, stuff like that.

That’s all in Google Knowledge Graph. They dropped it because it makes finding information too easy. And hence, competitor search engines could get info out of it super cheaply.

This reason is so much more disappointing than what I had previously assumed.

Seems unlikely to be the real reason

Seems convincing. Likely the same reasoning behind removing Real-Time results.

Boo! I’d love to have nice hooks into Knowledge Graph.

I don't know how Google's implementation of sets worked, but a simple implementation you could do yourself in a Jupyter notebook is to load up some pretrained word vectors, normalize all the vectors, look up the vectors for all words in your seed set, take their mean, and then look for the nearest neighbours.
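For the curious, that recipe fits in a few lines. The embeddings below are a made-up toy vocabulary purely for illustration; in practice you'd load pretrained vectors (GloVe, word2vec, etc.):

```python
# Toy sketch of a Sets-style "grow a set" lookup. Real use would load
# pretrained vectors; this tiny hand-written vocabulary just shows the shape
# of the computation.
import numpy as np

# Hypothetical embeddings, already reduced to 3 dimensions.
vectors = {
    "apple":  np.array([0.9, 0.1, 0.0]),
    "pear":   np.array([0.8, 0.2, 0.1]),
    "orange": np.array([0.85, 0.15, 0.05]),
    "mango":  np.array([0.8, 0.1, 0.2]),
    "car":    np.array([0.0, 0.9, 0.4]),
    "truck":  np.array([0.1, 0.85, 0.45]),
}

def expand_set(seed_words, vectors, k=2):
    """Return the k nearest neighbours of the mean of the seed vectors."""
    # Normalize every vector so a dot product equals cosine similarity.
    normed = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    centroid = np.mean([normed[w] for w in seed_words], axis=0)
    centroid /= np.linalg.norm(centroid)
    candidates = [w for w in normed if w not in seed_words]
    candidates.sort(key=lambda w: normed[w] @ centroid, reverse=True)
    return candidates[:k]

print(expand_set(["apple", "pear"], vectors))  # fruit-like words rank first
```

With real pretrained vectors the neighbour list gets much richer, including "some you never heard of before".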

Stick it on a website and then the rest of the world can use it too!

For all the complaints about spreadsheets, attempts to replace them and such... They are a very resilient paradigm.

The way the paradigm and learning curve work just seems to work for a lot of people. I reckon it's still the most popular platform for coding, data engineering, analytics & data science... using liberal definitions for all those terms.

It also seems like something that's in google's wheelhouse.

Earlier this week, someone was passing this tweet around the office, which I believe captures the cold truth succinctly:

> "All enterprise software competes with Excel.

> All productivity software competes with emailing things to yourself."


This is disappointingly true.

You'd think we would have come up with better answers. I'm constantly looking for ways to better manage bug tracking, todo lists, and even managing job searches and professional networking, because I'm still not really happy with anything I've found.

As I bounce back and forth between different tools, I always land back on just using Excel or Google Sheets. I'm never entirely happy with those either, but I find that they're no worse at the job than the dedicated tools.

I kind of smile at it, fwiw.

It's a sort of tribute to users-know-best and a "I made it so I understand it" mentality. Spreadsheets are a rare example where people make their own tools in a modern work environment, where most tools are highly politicised.

For me, I'd like to see spreadsheets improve at the tasks they're doing, not hand over to some cookie-cut app procured by management.

to close the loop: https://twitter.com/patconnolly/status/1116722347849977856

re: spreadsheets and recursive publics ;)

Have you given notion a try? It sort of feels like one of those tools that are so general they can actually compete with spreadsheets and emails and simple lists. And pretty user-friendly to boot. I have started using it for more and more simple lists and other personal stuff.

I have, I like the product and I agree about the user-friendliness, but it doesn't quite do it for me.

Actually, I should phrase it a different way. For me, Notion competes with OneNote, and for my personal use they're somewhat comparable. Notion doesn't take the spot of Excel/Sheets in my toolkit because Excel/Sheets can be gradually automated out of initially fuzzy requirements.

I just looked at notion, and it's unfortunately not general enough to warrant making a version that works on a non-mobile device. I'll check it out when I can use a version on my desktop (or at least on the web).

Notion has MacOS and Windows versions:


And importantly, you know Excel/Sheets well probably. Other tools start at a major disadvantage in that they're competing with a tool you're familiar with. That's a powerful adversary.

I would place my knowledge of Excel/Sheets at the level you'd expect from a typical programmer - I don't know the keyboard shortcuts, but I know how to make pivot tables and write formulas.

My main issue is that the specialized tools always impose some sort of opinion on how ____ should work that is close to mine but different in impactful ways. Case in point, JIRA. With how infinitely customizable the columns are, it might as well be a spreadsheet with links to email threads.

That's selling JIRA short on how much the UX tries to do for you, but I'm intentionally making a point about how much overhead there is in paperwork generated by the fact that you're using JIRA. When I compare how much paperwork JIRA saves me vs. how much paperwork JIRA generates, I often find, even as a PM, that I come out behind. As a developer, I pretty much never feel like I'm coming out ahead on time.

I don't mean to pick on JIRA specifically because every bug tracker I've used has this problem, but JIRA was the tool I had to use for work for the longest time, so I wanted to stick to what I experienced firsthand.

This is very, very true in my experience. After years and years of trying out all, I keep coming back to a list of tasks in a sheet so I've combined the two. It's just so easy to reconfigure or query. And with Google Sheets on my phone, it is portable. Add an IFTTT note widget to add a row to my sheet and I've got instant capture. Finally, you can do whatever you want with it using Google apps script (e.g. send reminders). It's hard to beat. Despite all the capabilities, I still keep it simple and focus on capture and review. I could switch this to a list in a text file or on paper very easily and maintain the core of my functionality.

I always think "people built cathedrals and pyramids without all this fancy software, it isn't the key to getting things done".

At my current job we were supposed to have switched to Rally. However, the PMs prefer Excel. The result is that a spreadsheet on the PM's computer is the real system of record and sometimes things make it into Rally. This is a terrible experience. I've abandoned both systems and just work on whatever I feel like. Luckily I have enough clout and a slightly unique role that I can get away with doing this. I feel awful for the devs who are supposed to be working on the normal dev tasks.

Interesting, I don't relate much to either of the statements (not that my experience negates their applicability for other people).

On a bit of thought, I find it useful to organize and break down things and maintain lists (the daily/weekly list habit works best for me currently). This is different than emailing things to myself because I maintain an actively curated list that's hopefully logically organized in big and small tasks.

On the enterprise software front, I am a Linux-friendly software engineer, so command-line tools are my best friend. I am not sure Excel can ever beat the productivity of vim + Linux tools, let alone other specialized tools like SQL. One thing to note is that a lot of the time I deal with production data, so it's a huge effort to actually move data to Excel.

I tried using sc where I usually used Excel, but the learning curve was steep and I didn't stick with it. Have you given that a try?

What's sc?

It's a text mode spreadsheet program. [0]

[0] https://en.wikipedia.org/wiki/Sc_(spreadsheet_calculator)

How very true. I know almost every person who uses Excel on a regular basis would kill you if you tried to take it away (I once heard a switch to OpenOffice proposed for cost reasons; never again).

And I have found keeping a todo.txt file on my desktop is still better than any productivity software. Although, these days, it's a symlink to a cloud-storage folder on my various machines.

And here I was thinking that only I emailed things to myself all the time. It's true that there really is nothing I've found to be easier than emailing things to myself.

Meaning just how awesome email and Excel are!

> They are a very resilient paradigm.

I've done some tech consulting work and we found a way to deal with this, even at a "traditional" SME level.

The answer:

enter Excel (not CSV) -> decode data with SheetJS into a web app + web frontend to "make the stuff happen" -> export to Excel with SheetJS (not CSV).

My 2 cents interpretation: everyone uses Excel because everyone uses Excel. The above workflow doesn't raise any barrier for the client, and he's happy with the time and performance gains.

Anyone know of a tool that can extract the calculation/dependency DAG from an Excel workbook or Google sheet?

Ideally I'd like a Python package that I can point at an .xlsx file and it would give me back an object graph that replicates the values and calculations of the workbook.

Thank you, thank you, thank you. That is exactly what I was looking for.

Excel is just a zip of xml so your graph is in there already. Just xslt it to whatever format you prefer.
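The formulas do live in the workbook's XML (libraries like openpyxl will read them out for you). Once you have the formula strings, extracting a rough dependency graph is mostly reference parsing. A very simplified sketch — it ignores ranges, cross-sheet references, and functions like INDIRECT, and the example formulas are invented:

```python
# Rough sketch: pull single-cell references out of formula strings with a
# regex and build a cell -> dependencies graph. A real tool would first read
# the formulas from xl/worksheets/*.xml inside the .xlsx zip.
import re

CELL_REF = re.compile(r"\b([A-Z]{1,3}[0-9]+)\b")

def dependency_graph(formulas):
    """Map each cell to the set of cells its formula reads."""
    return {cell: set(CELL_REF.findall(expr)) for cell, expr in formulas.items()}

# Hypothetical workbook: C1 depends on A1 and B1, D1 depends on C1.
formulas = {"C1": "=A1+B1", "D1": "=C1*2"}
graph = dependency_graph(formulas)
print(sorted(graph["C1"]))  # ['A1', 'B1']
```

From a graph like this, a topological sort gives you the recalculation order, which is the core of replicating the workbook's calculations outside Excel.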

I end up outputting a great deal of data to spreadsheets simply because a lot of users who want to work with data know how to do it in a spreadsheet.

And really I don't mind as some of those users are really good at it.

I wish spreadsheets didn't get as much unnecessary hate as they do. They are a pretty great tool, and can do almost anything. They are to finance/accounting guys as shells are to many of us.

My hate for spreadsheets comes from working in institutions that copy/paste around hideous VBA-infested sheets for metrics/reporting instead of using a database.

There is a lot of painful excel abuse in business.

Most of the time that happens because the internal tools/databases built or maintained by IT are unusable or just don't do the necessary reporting. Most companies don't have in-house dev teams that are capable of building usable business-relevant tools.

I used to hate on spreadsheets because when I was in IT as a kid I had to install Excel and Office and it really sucked back then.

I didn't understand why finance liked them so much and I just looked upon finance groups as boring old finance nerds.

Man was I wrong... I love Sheets and spreadsheets in general.

I've literally spent hours in Sheets running financial analysis of our operational stack and saved thousands and thousands of dollars a month by analyzing our hardware spend.

Shell is a great way to think about it: bash scripts mutating flat files can be reasonable for a personal adhoc task, and maybe even share for reusability, but you don’t want to find them at the core of a backend architecture where an RDBMS ought to be.

I don't think they get unnecessary hate, it's just that they have a tendency to grow in complexity to the point where a spreadsheet is no longer the right tool for the job but is still in use.

I don't think it happens as much any more but the same analogy could be used for shell scripts. You may have a process that a couple shell scripts are useful for, but they can also metastasize to the point where only the original developer (if even) can maintain them.

Exactly. To me, if the sheet is larger than what I can keep on a decent sized screen, it's probably time to move it. Not definitely, but probably.

They should be treated like Jupyter notebooks: used for exploration. Productionizing Excel workflows is not fun at all, especially if people have gotten into the ActiveX/VBA control stuff.

Similar to MS Access or other low code environments, it is great until it is absolutely unacceptable due to system limitations.

Finance guys and gals have shells like R or Python.

Spreadsheets are GUIs for people who prefer that to shells.

Don't kid yourself, most finance guys still use Excel. Maybe a few younger ones in the valley use tools like R, Python, etc. but not in most places.

> Don't kid yourself, most finance guys still use Excel.

I didn't say anything about that.

You said spreadsheets are like shells for finance people. I was just pointing out that shells for finance people exist and that most prefer not to use them.

Google's own blog [1] announcing the feature for G Suite Business, from back in January. This is a great feature that I have been searching for for a long time.

[1] https://cloud.google.com/blog/products/g-suite/connecting-bi...

That looks different. The Data Connector, which you linked, lets you write a query to BQ and have the results in Sheets. Connected Sheets sounds like it's actually a front-end to the underlying data.

Notably, the Data Connector has a row limit of 10,000 results, while Connected Sheets claims to surpass the regular row limits of Google Sheets in general.

How does pricing work here? Say I have a pivot table over 10 TB of underlying data. I share the sheet with 10 people, and each one refreshes it 100 times a day. Do I pay for each data refresh, i.e. 10 TB * 10 * 100 = 10,000 TB/day? That would be $50,000/day if I read the pricing table right. Can someone more familiar with BigQuery verify that?
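The arithmetic in that worst case checks out, assuming every refresh re-scans the full 10 TB at BigQuery's on-demand rate of $5/TB (in practice caching and column pruning should reduce this, as other replies note):

```python
# Back-of-the-envelope check of the worst-case scenario above.
tb_scanned_per_refresh = 10          # pivot table reads the full 10 TB
users = 10
refreshes_per_user_per_day = 100
price_per_tb = 5                     # USD, on-demand pricing

daily_tb = tb_scanned_per_refresh * users * refreshes_per_user_per_day
daily_cost = daily_tb * price_per_tb
print(daily_tb, daily_cost)  # 10000 TB/day -> $50,000/day
```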

I had used Data Studio to provide visualizations of data I had in BQ. I wasn't able to create anything useful without creating a bunch of custom views in BQ. Data Studio seemed to refresh the data a few times a day, which queried the BQ views. It ended up being annoying enough from a pricing perspective that I stopped using Data Studio.

Give BI Engine a try! https://cloud.google.com/bi-engine/docs/. The model is that you pay for storage--currently free during the beta--but any queries from Data Studio don't incur costs. It was just announced during Next this week.

As a disclaimer, I'm an engineer on the BigQuery team (but not BI Engine specifically).

what happens if the data in bigquery changes, for example a row inserted? Are there any additional costs related to invalidation of the in-memory cache (ie the BI Engine) ?

Tables created in projects that have BI Engine enabled support streaming inserts, DML operations, etc. just like other BigQuery tables. Any queries you run using BigQuery itself incur regular query costs (https://cloud.google.com/bigquery/pricing#queries) but there are no extra charges beyond the storage costs (https://cloud.google.com/bi-engine/pricing) for BI Engine to manage the data for you behind the scenes.

We'll have to wait for details about connected sheets, but BQ itself caches queries by default for 24 hours and you're not charged for cache hits. I assume something similar applies.

That would assume the underlying data does not change. If you use BigQuery with ingestion running 24/7, you often don't want to be served from cache. Besides, filters and slicers can be modified on a per-user basis, which would trigger recalculations. Again, not sure if a cache would handle this.

It's important to remember how BigQuery actually stores its data: in a columnar datastore called Capacitor [0]. While you write data as rows, BigQuery does a certain amount of preprocessing to turn it into column-oriented data. Writing a single new row is non-trivial, so it tends to write data in batches (though there is some other magic going on here I probably don't understand).

[0] https://cloud.google.com/blog/products/gcp/inside-capacitor-...

BigQuery is usually intended for data warehouse use cases, where missing the most recent 23 hours of data is less of a big deal. It's intended to feed things like exec dashboards, not production systems.

I believe the views will refresh when the underlying data changes, so if you are streaming or inserting frequently the costs could add up

BigQuery isn't a great fit for data that needs to be queried repeatedly and is updated that frequently, I don't think. With streaming inserts, you should generally only be doing inserts on the latest date/time-based partition, so that you can take advantage of cached queries for all but the most recent partition.

That depends on how many columns your pivot table uses. Fewer columns = less data = lower price. If the implementation is smart enough, it could use BigQuery partitions for further cost savings.

Is the data size calculated based on raw data on which queries operate (for example all transactions with amount column), or on query result size (for example sum(amount) -single row) ?

BigQuery bills based on the amount of data read. BigQuery is a column-store database, so a read will always touch every row of a table, but only the columns actually used. So e.g. "SELECT * LIMIT 1" will read the entire table, while "SELECT SUM(transaction.total)" will only read 1 column out of potentially hundreds. The latter query will be billed identically to a filtered query like "SELECT SUM(transaction.total) WHERE transaction.total > 10." Filtering on a different column will be billed more, because it needs to read the second column as well, regardless of how many rows (if any!) contribute to the result set.

Idiomatic BigQuery will make use of partitioning, so that a large dataset will span multiple tables in a way that you only read the tables of interest. (E.g. the Google Analytics integration partitions by date, so reporting on 1 month of data will only read 30 tables out of the 1,500 you might have.)
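A toy model of those two billing rules — you pay for the full size of each column you touch, but only in the partitions your query reads. All table names and sizes here are invented for illustration:

```python
# Sketch of BigQuery's on-demand billing: bytes billed = sum of the
# touched columns over the touched partitions, regardless of row filters.
PRICE_PER_TB = 5.0  # USD, on-demand pricing

# Hypothetical date-partitioned table: bytes per column, per partition.
partitioned_table = {
    "2019-04-01": {"total": 4e9, "user_id": 2e9, "note": 40e9},
    "2019-04-02": {"total": 4e9, "user_id": 2e9, "note": 40e9},
}

def query_cost(table, columns, partitions=None):
    """Return (bytes scanned, USD billed) for a query touching the given
    columns, restricted to the given partitions (default: all of them)."""
    parts = partitions or table.keys()
    scanned = sum(table[p][c] for p in parts for c in columns)
    return scanned, scanned / 1e12 * PRICE_PER_TB

# SELECT SUM(total): one column, all partitions.
print(query_cost(partitioned_table, ["total"]))
# Adding WHERE user_id > 10 reads a second column -> costs more,
# however few rows actually match.
print(query_cost(partitioned_table, ["total", "user_id"]))
# Restricting to one date partition prunes half the data.
print(query_cost(partitioned_table, ["total"], ["2019-04-01"]))
```

Note how the wide "note" column never costs anything as long as no query touches it — which is why schema design and column selection matter so much for BigQuery bills.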

It's worth noting that depending on the amount of data you capture in GA, each date table could be anywhere from 1s to 10s of GBs, and it can get pretty pricey to query from those tables.

1s to 10s of GB is pretty much nothing in BigQuery-land. (unless you are using them in production, which as noted elsewhere is a bad idea)

Especially with column-based querying and partitioning.

Bigquery also uses caching which presumably is used here.

Similar functionality called PowerPivot/PowerQuery[1][2] exists in Excel. It's hands down one of my favorite things to use Excel for, and also one of the least understood/known features by most Excel users.

I can't wait to see how Google's flavor of it for Sheets works.

[1] It's officially called "Get & Transform" now, but when it came out for Excel 2010 it was called PowerQuery, and it still gets referenced that way everywhere (including the official help docs).

[2] https://support.office.com/en-us/article/get-transform-in-ex...

This was one of the things I missed when moving from Excel to Sheets and led to me building https://seekwell.io. It lets you grab data from your database and drop it in Sheets using just SQL (Postgres, MySQL, SQL Server and Snowflake).

We've also built a full SQL and Python IDE with a shared code repository, but solving the simple problem of "write SQL -> get data in a Sheet" was where we started.

I’ve had trouble getting more than 10k rows in google sheets. It will be interesting to see how well this works.

I have a couple of sheets that are very close to the 2-million-cell limit, in which almost every cell either does a VLOOKUP, is being queried, or is displaying some Google Sheets QUERY result.

Sheet takes time to load but still works fine!

Have to ask, what is the use for such a Frankenstein?

Boy, I'm not sure quite why, but the idea of a spreadsheet with nearly 2 million cells stresses me out...

Why that over a DB?

Spreadsheets are a DB. Do you mean over a DB server? Or a relational DB? Or a DB without a bundled Model view controller?

Not for a lack of programming skills, but I really felt (and still stand by it) that Google sheets was the perfect solution for that problem.

Well, for >10k rows it is most likely better to use more specialized tools to create pivot tables. Google Sheets, as a browser app, just isn't intended for this.

I've tried using the current version of Apps Script in Sheets for some advanced business functionality, but it's horrendously slow, taking over 10 seconds to loop over just 5,000 rows. And they're using an ancient version of JS, making development annoying.

I really hope this announcement heralds a renewed interest on their side in improving Sheets as a whole.

Yes, I found it much faster to do everything in memory, minimizing the number of reads/writes.

And it was a bit of a shocker writing ES5 JS again.

It's worth noting that there has been a Sheets connector for BigQuery since January, although only for G Suite customers: https://cloud.google.com/blog/products/g-suite/connecting-bi...

This announcement seems like an expansion on what's above (e.g. pivot tables). I've tested what's above for my org and there's a lot of promise. (My only criticism is that formulas created in the query don't work automatically, since they are returned as Text cells; the workaround is to cast to Automatic.)

Last time I checked, it was still hard to get a Google CloudSQL DB into BigQuery, so I’m surprised they did the sheets integration first. We used a third party that just copied data from our CloudSQL to BigQuery.

Yea this is surprisingly annoying. I did CloudSQL -> csv -> dataprep -> bigquery. For about 10 tables with ~20 columns each.

A job that should be less than an hour took over 6 hours due to various annoying datatype issues.

Another interesting thing is that BigQuery doesn't have full-text search, so you need to go back into Postgres or use Elasticsearch for FTS, which is beyond annoying.

Check out the newly announced data fusion

What tool did you use?

If you find this interesting you should take a look at https://www.sigmacomputing.com/

I wish Google would let individual users upgrade to GSuite Business to try these tools, rather than forcing the whole organization to upgrade. The pricing model is impractical.

I use GSuite as an individual. I just had to use a personal domain and forward my other account to it.

Yeah, but I really want to use it in my organization. It's too expensive to upgrade for a single user.

Hey, you can create pivot tables from BigQuery without a paid GSuite subscription at all; this is possible with SeekTable (a free BI tool): https://www.seektable.com/help/bigquery-pivot-table

How's bigquery performance in comparison to clickhouse? When is it worth switching from one to the other?

For ClickHouse, speed and performance are the #1 priority, so it beats most other databases.

So as an in-house solution, ClickHouse most probably would be the fastest option (if your use case suits OLAP requirements).

For clouds / PaaS it's hard to compare directly. Do you know how many servers will process your BigQuery request? AFAIK BigQuery usually shows a bit higher performance than a single mid-level ClickHouse server (but you can also have a cluster of ClickHouse servers).

If your use is only to run 4 different queries on NYC taxi data, they are about similar: https://tech.marksblogg.com/benchmarks.html

Otherwise it's really difficult to compare without knowing how much data you have, how often you query, etc.

Not sure you can compare them as BigQuery is serverless/SaaS. Guess it is more of a question of if you want to run it yourself.

BQ is considerably more performant.

This is uninformed and unsubstantiated.

The answer depends massively on the intended use case.

What are the use cases where one would shine and the other wouldn't? I'm interested in real life usage of any of those on >100 GB datasets.

With BigQuery the cost is per query. If you do a "SELECT * FROM AAA" you'll pay for whatever the size of table AAA is, the price being $5 per TB.

If you do a "SELECT field1, field2 FROM AAA" you'll pay only for the total size of the field1 and field2 columns.

So you usually want to use BigQuery in situations where you don't need to query data all the time, but rather a fixed number of times a day.

Now about performance: BigQuery queues queries, so you don't have a guaranteed time. It can take a couple of seconds before your query starts running, it can take longer. If you need something that responds in < xxx ms, BigQuery is not it.

But the queries themselves are fast. If you need to query across petabytes of data, a simple BigQuery query will gladly run on however many dozens or hundreds of instances it needs, at no additional cost to you (since you only pay by the size of the data queried).

It's really a great example of serverless. You can run your query across 100 instances, but you only use those instances for a few seconds.

Can't think of a realistic use case where such a query would be appropriate, even from just the performance standpoint. In fact, for a very long time the internal counterpart of BigQuery didn't even support "SELECT star", and nobody complained too badly. If you'd like to give Google a gift, however, sure "SELECT star" all you want. :-)

  > "SELECT field1, field2 FROM AAA" you'll pay only for the total size of field1 and field2 rows.
Moreover, if you also use a WHERE clause, you'll pay even less.

This can happen in data cleaning/loading where you load unclean data into a table that is ready for analysis. I have loaded data through staging tables regularly. There may be multiple stages.

Another example is materialized view creation. It's common for these to scan large quantities of data to compute aggregates.

That's not the recommended way of loading data into BigQuery, though.


That is not true

A where clause still searches the entire column, unless it is conditioned on the partition column.

Not if you filter on clustered columns, like you should if you care about performance.

True, but now we have moved pretty far from just saying "use a where clause".

"Partition your table and create clustered columns and filter on those"


Nerd on the streets, BigQuery in the sheets?

Cool. I wonder when they will shut it down.

I largely ignore Google products announcements these days. Why bother getting excited about things that won't last?

This is getting tired and adds nothing to the discussion.

If one is not familiar with Google's background on the matter, it's the number one thing they should know.

No it is not the number one thing. This comment is especially irrelevant in Google Cloud context.

No, the first thing you should know is that Google consumer products are not Enterprise features.

Is Sheets consumer or enterprise?

Sheets integration with bigquery that has an enterprise pricing plan is enterprise.

There's also a decent argument to be made that sheets is an enterprise product too. It certainly is for enterprise customers.

Disclosure: I work in Google Apps, but not on sheets or docs.

Feel free to list the Google Cloud products they have shut down as part of your argument. A better argument would be that you do not trust them not to do a massive price increase once they become dominant enough with a product. Like they did with the Maps API.

>I largely ignore Google products announcements these days.

Well clearly not...

> The beta of connected sheets should go live within the next few months.

In case anyone’s looking for it, there’s the buried lede. Whenever Google announces something like this, don’t hold your breath. Remember that robo-calling assistant they demoed a while back? I remember that was supposed to go live within a few months, too.

Why do you spread such negativity if you haven't even fact checked that what you're saying is true?

That's the quote from the article. Do you have any additional info on when that would become available? The link from another reply is completely unrelated, as of writing this.

It's already working on Pixel devices in the US. Source: I used it.

Sorry, just to clarify, is it the BigQuery integration into Sheets you are talking about? I can't find it in the web UI.

No I was referring to the robocalling feature. It's an actual product now.

I was spreading a quote from the article.
