Why Can't Gmail Search? (designbygravity.wordpress.com)
131 points by cschanck 3032 days ago | 85 comments

I once spoke to someone in the know about this, and the main reason is that it's quite expensive for them to do any reasonable job with stemming, etc.

I know that sounds weird (but it's google, they're omnipotent!), but it makes sense: It's worth their while to stem content they crawl and index off the web, cause everybody could in theory access any given page. However, with email, the only person who'll ever benefit is the recipient.

They could use Gears for some more in-depth local indexing, then it would be you bearing extra computational and storage costs.

That's a really interesting notion. Have any web applications moved indexing to the client-side before with gears and/or js?

You could probably build a gadget that does this, yes?

It would at least make sense to deal with plurals. I can't tell you the number of times I've searched for something using an s (or not) at the end, and failed to find what I was looking for, only to remember later that I should try to add or remove an s from my search term.
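To make the plural case concrete, here is a minimal sketch of the kind of folding being asked for. It assumes a crude "strip one trailing s" rule, which is far weaker than a real stemmer, and the function names are made up for illustration; nothing here reflects how Gmail actually works.

```python
import re

def normalize_plural(word: str) -> str:
    """Crude plural folding: lowercase, then drop a single trailing 's'.

    Assumption: words ending in 'ss' (e.g. 'glass') and very short words
    are left alone. A real stemmer handles many more cases.
    """
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

def plural_insensitive_match(query: str, text: str) -> bool:
    """True if any word in text equals the query after plural folding."""
    q = normalize_plural(query)
    words = re.findall(r"[a-z0-9]+", text.lower())
    return any(normalize_plural(w) == q for w in words)

print(plural_insensitive_match("earbud", "Two earbuds arrived today"))   # True
print(plural_insensitive_match("earbuds", "One earbud is missing"))      # True
print(plural_insensitive_match("earbud", "Your headset has shipped"))    # False
```

The same folding has to be applied both when indexing and when querying, which is where the extra per-mailbox cost discussed above would come from.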

If it's expensive, then why not reflect that by providing it to people who pay for their accounts?

Perhaps it would be an option that they could sell to people?

Exactly....I have always wondered why there is no google premium...I would gladly pay $100/year to have a collection of certain domains easily removed from my google searches, or pick from a list of my favorite sites for a "site:" search, etc etc.

Well, you can do at least that last bit fairly easily. http://www.google.com/coop/cse/

It seems like they've worked on mitigating this on the web with suggestions for alternate spellings and by displaying related searches. While you can see stemming at work in many Google searches, I'm pretty sure they don't build extensive substring indices on the web end either. For example, I've had searches where a substring returns 0 results and the exact phrase returns a handful.

> However, with email, the only person who'll ever benefit is the recipient.

I'm not sure that makes sense - if they add stemming, all users of GMail benefit. Going by your explanation, it wouldn't make sense to add any expensive features to GMail, because the only person who would ever benefit from them is the single user.

I think you misunderstand the nature of stemming; the point is that each and every user's inbox would have to be processed individually, and apparently Google doesn't think the overhead is worth the results.

I understand stemming. I just think the webpage contrast is not a good explanation for why they're not doing it. Building an unstemmed search index per user is also expensive, and helps only the recipient, but they do it because they think the expense is necessary. They stem webpages because they think the expense is necessary. They don't stem mails because they think the expense is not justified, not because the only person who benefits is the recipient.

It is also expensive, but it is less expensive than doing that AND stemming. They just decided that stemming was a line where the benefit (add'l users, more use of Gmail, more AdSense revenue, whatever metric) wasn't worth the (development and ongoing processing) costs.

This is a drawback to putting everything in the cloud: features will be weighed by the CPU cycles and storage required by the providers. Can't wait til we come full circle and get back to client/server computing ;). I'm only half joking. I actually can't wait until things mature enough so we have a hybrid of both models. Then I can decide just how much stuff I want indexed and also not worry what happens when my cable modem flakes out.

> Can't wait til we come full circle and get back to client/server computing ;).

The next big hype cycle: OS as an OS.

Finally, someone else who feels this way! I've never understood why people praise the search in GMail so much.

Gmail search is fast. On outlook, a desktop mail client, it can take up to an hour to search for a simple term, in my experience.

if I may quote some scripture:

"Gerald Weinberg tells the story of a programmer who was flown to Detroit to help debug a troubled program. The programmer worked with the team who had developed the program and concluded after several days that the situation was hopeless.

On the flight home, he mulled over the situation and realized what the problem was. By the end of the flight, he had an outline for the new code. He tested the code for several days and was about to return to Detroit when he got a telegram saying that the project was canceled because the program was impossible to write. He headed back to Detroit anyway and convinced the executives that the project could be completed.

Then he had to convince the project's original programmers. They listened to his presentation, and when he'd finished, the creator of the old system asked, "And how long does your program take?"

"That varies, but about ten seconds per input."

"Aha! But my program takes only one second per input." The veteran leaned back, satisfied that he'd stumped the upstart. The other programmers seemed to agree, but the new programmer wasn't intimidated.

"Yes, but your program doesn't work. If mine doesn't have to work, I can make it run instantly."

- _Code Complete_, pp.595-596

I love that story and have quoted it many times (though, Code Complete? really, just read Weinberg). But it doesn't apply here. Gmail search certainly works. It just doesn't have every feature.

> Gmail search certainly works. It just doesn't have every feature.

It's not "completely broken," but no hits for a query of "zag" in an email that contains "zagg" comes uncomfortably close to "doesn't work." (FWIW, I use gmail and haven't had any huge problems with search, although I have had to do way more work to find something than I would expect given that it's from Google).

> Code Complete? really, just read Weinberg

'twas just a quick copy and paste from my fortune file. Although as far as I'm concerned there ain't nothing wrong with Code Complete.

> there ain't nothing wrong with Code Complete

It's ok but relatively mediocre, and in my observation usually indicates a programmer who hasn't sought out better sources.

> in my observation usually indicates a programmer who hasn't sought out better sources.

Those being?

I don't really have an opinion to state on the actual discussion at hand, but I figured I'd toss in an old link which contains what some of "those" might be:


Javier Kragen Sitaker's article/mail "My Evolution as a Programmer" recounts his growth as a programmer, his career, and his exposure to a variety of books along the way. It's really an excellent read, and contains some comparisons between a few good books - particularly Code Complete and The Practice of Programming. I quote,

During this time, I read "The Practice of Programming", which is a lot like "Code Complete", but shorter and much higher in quality. I had read the same authors' "The Elements of Programming Style" back in 1995, on much the same subjects, but that book is nearly unreadable today --- it's written in PL/1 and FORTRAN IV. TPoP, aside from being written with modern programming languages, also contains insights from several decades more of the authors' experience.

-- The author in question is Brian Kernighan. At any rate, I leave any interested person to go check the article out, if you haven't seen it already.

It's a bad sign when Jeff Atwood is your biggest proponent :/

I take solace in the fact that people actually read Code Complete - it's not just a bookshelf ornament. Yes, I know, Knuth is God, and his work is a masterpiece - the point is nobody actually reads the friggin thing.

There are zillions, obviously. What do you want, a list? I don't know, SICP, CLRS, Knuth? Non-mainstream languages? Functional programming? For that matter, a single classic CS paper that wasn't assigned as homework? Code Complete, its grounding in software-engineering literature notwithstanding, isn't very deep.

I consider the lack of depth in Code Complete one of its great strengths. You couldn't hand Knuth to a newbie programmer and expect them to get anything useful out of it, whereas Code Complete will teach them a lot of really useful things they can use their entire career.

If you've been working as a programmer for 12 years it probably won't teach you too many new things, but if you've been working for 12 weeks it is a great book.

In my experience, it's a more reliable heuristic to fault people for what they haven't read than for what they have read. Some of the smartest people I know can and do quote from children's books and anime.

He got a... telegram? Safe to assume this isn't a recent tale :)

(not that its age should detract anything of course, I just found the obvious historic reference amusing).

Have you used Outlook 2007 with the built-in search indexer? Incredibly quick.

I've been using Outlook 2007 since it launched. It is incredibly quick... when it works. I frequently find that I can be LOOKING AT THE EMAIL I WANT. Then type a phrase that I am staring at, and not have that email appear in the results. Never mind that sometimes it will randomly start taking 60 seconds or more to return results until you restart the program.

I didn't know. I'm stuck using Outlook 2003 at work :(

Try Xobni search plugin - it's fast and free for the basic version. It will certainly alleviate your search pains with Outlook 2003.

So it sounds like you're using Outlook 2003 on XP and don't have Windows Search (what powers Outlook instant search) installed.

Hound your incompetent IT department and say you need to search your email and you need Windows Search 4 installed on your machine, instead of blaming a 6 year old product and an 8 year old operating system.

If you're stuck on Outlook 2003, Lookout is an add-on for '03 that adds blazing fast search. I find it works faster than Xobni.

My favourite one is Lookeen! Lookout is old and cannot search for or inside docx files. Xobni is a matter of taste, nothing for me!

+1 for Lookout...free and works like a charm!

This whole thread (outlook 2003 being slow, 07 being a lot faster) makes it seem like xobni should have sold for 20mm

My outlook box is pretty large and it never takes more than 3 or 4 seconds. It's not the greatest search either, but it does handle substrings nicely and automatically. It works most of the time (though I'd still be much happier if it worked all of the time.)

An hour, really? I don't use Outlook, so I don't know if that's right - but I suppose that's why the various Outlook search plugins are so popular.

I haven't had a problem with Gmail search as described, but I would not classify it as fast. Often I'll type in a simple query and have it spend 10-20 seconds before it returns results. If I perform the same query on the same dataset in Spotlight in Mac OS X, it starts to return results instantly, with the search completing within 5-10 seconds.

It may be fast, but what you want is results.

I think some people also want timely results. I certainly do.

Ideally, yes, but that doesn't appear to be an option. Given the choice between instant and incomplete results, and slightly slower but complete results, I'll take the latter.

Considering the difference in speed, it's faster for me to find what I was looking for by sifting through GMail's results, than to wait for Outlook's results to show up at all. Outlook's search essentially renders my whole machine useless until it's done searching.

Gmail's search works well. Except when you need substring search or fuzzy search, in which case it absolutely sucks.

Most of the things in Gmail are done very well but non-exact search really needs improvement.

I love Gmail search because I've seen the difference - have you ever tried searching email in Mobile Me? It's like they ignore your query and search a random string.

Probably because most other email search sucks even worse. Frankly I think a lot of people's threshold for saying something "sucks" is too low. Come on, it could be better, but it doesn't totally suck.

Reminds me of Louis C K's bit on Conan: http://videogum.com/archives/late-night/the-videogum-louis-c...

In my estimation, not being able to do substring searches in emails, sucks.

The problem seems to be substring searching, which I guess isn't something I've ever tried to use.

I tend to think of the gmail search as being fairly powerful and much faster than my usual mail client (Mail.app), it's one of the few reasons I ever use the gmail web interface.

If anyone is curious, the docs are here http://mail.google.com/support/bin/answer.py?hl=en&answe... (I had never bothered to look them up until now).

I actually heard that it's quite costly to index emails for gmail (not from a gmail source, just random web chatter). It makes sense. Most emails are not important, they are just huge amounts of random chatter. I'd imagine indexing emails (full-text) properly would require some effort. The gmail team is probably on a budget :)

But then again, if they can index the web. Why not email.

I'm no expert, but reMail does it on my freakin iPhone - I'm guessing they could do even better on a cluster...

Agreed, Google could probably do it better. But imagine maintaining separate indexes for each and every gmail acct and constantly updating those indexes. reMail used to do something similar to that.

Well, they already do (and have to) maintain separate indexes for each and every gmail account (or else how would you search on the metadata fields like To: and From: in your inbox at all?).

Supposedly the issue is that they don't perform more computationally-expensive linguistic analysis during the indexing phase. If they tokenize each word but don't perform any stemming or lemmatization, for example, the result would be similar: only full-word non-substring matching.
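To illustrate that last point, here is a minimal sketch of a per-mailbox inverted index that tokenizes on word boundaries and does nothing else. The sample messages and function names are made up, and this is only an assumption about the general shape of such an index, not a description of Gmail's implementation.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens; no stemming or lemmatization."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(messages):
    """Map each whole token to the set of message ids that contain it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for token in tokenize(body):
            index[token].add(msg_id)
    return index

def exact_search(index, term):
    """Look up one term; only an exact whole-token match can hit."""
    return index.get(term.lower(), set())

# Hypothetical mailbox contents for illustration.
messages = {1: "Your Zaggs order has shipped", 2: "Lunch on Friday?"}
index = build_index(messages)

print(exact_search(index, "zaggs"))  # {1}
print(exact_search(index, "zagg"))   # set()
print(exact_search(index, "zag"))    # set()
```

A lookup is a single dictionary access, which is why this style of search is fast, and also why "zagg" finds nothing when only "zaggs" was indexed.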

As others have pointed out, it's probably a cost-benefit decision by Google to not spare grid cycles on full-fledged linguistic analysis for individuals' email accounts. Google CAN do better at it, as is evidenced by their web search index.

Not only that, I also found that if you search by sender name and the sender name is never in the subject of any of your emails, it doesn't find it.

use "from:example.com" or similar... important part being the "from:" bit

Speaking of which: you and I, as developers, understand search axes, and use them intuitively. But they're utterly opaque for someone who hasn't encountered that interface before--and given that Google itself doesn't support search axes in its main product, I don't think they're an interface someone's likely to learn elsewhere, either. If you're unaware of search axes, you have to click, "Show search options," which is in an extremely tiny font, next to a button that very notably says "Search the Web"--not search my inbox. Furthermore, once you actually discover that, and use it, you still don't know about the axes, because that search box doesn't use them. The only place where you can discover search axes is by clicking on a label, which results in "label:foo" being shown in the search box--but that's actually a lie, because the output if you dump that into the search field, versus if you click on a label name, don't match if you have more than about 30 messages in that label. Go try it.

So, yeah. Axes are great. And they're completely unintuitive and impossible to discover in Gmail unless you read the help, which no user ever does. So you've done a great job finding a part of Gmail's search interface that's at least as broken as the underlying implementation.

> The only place where you can discover search axes is by clicking on a label, which results in "label:foo" being shown in the search box (...)

That's how I found it, and I learned a little more about it by a need to get all mail to/from a specific client for a specific month and doing advanced searches.

Granted, that's also how I learned about "site:", by going to advanced search on Google and specifying a single site to search, the resulting page shows the "site:" in the search box.

> (...) but that's actually a lie, because the output if you dump that into the search field, versus if you click on a label name, don't match if you have more than about 30 messages in that label. Go try it.

Good point, though as far as I can tell the number of results is the only difference... but still, it's a bit misleading.

"the output if you dump that into the search field, versus if you click on a label name, don't match if you have more than about 30 messages in that label. Go try it."

The output is different because it only shows me 20 rows per page when I search on label:label, versus 100 rows if I click the label, but the results are the same. Were you seeing something more broken than the number of results per page?

FWIW, I've tried it on a few labels, none having fewer than 10,000 messages in them.

There are many subtle differences: as you noted, the rows per page differs; clicking to select all offers "Select all conversations that match this search," rather than "Select all conversations in <Label>"; the "Remove label <Foo>" button disappears; an Archive button appears; "Move to" disappears, but "Move to Inbox" appears instead; etc. It's not broken in the sense that it doesn't work, but it's broken in the sense that Google's implying an equivalence that is not there.

This is just not true.. at least it doesn't seem to be, for me. Do a lot of people face this??

I think I hit the same problem occasionally, but I can usually keep trying different keywords until I find one that works. If he only tried 4 keywords like he writes, then there was probably some word in the email that would have been at least semi-unique. Maybe "earbud" or "order" or something. Then browsing or date-based filtering can usually do the rest. Kind of a pain, but not enough for me to ditch GMail completely.

His experiment fails.

He didn't apply the same inputs to both systems, therefore his findings must be discarded.

He changed it up, and had he STARTED at Yahoo and performed the exact same searches, he would yield the same results, only reversed, and his blog post would be about why he dumped Yahoo instead.

Here's my case:

1) He searched Google for "Zags" which is NOT in "Zaggs". So, no results.

2) Then he goes to Yahoo and searches for "Zag" which IS in "Zaggs" - AH HA, he gets a result (of course he does!)

3) He RETURNS to Google and searches for "Zagg" which IS in "Zaggs", and boom, results. Surprised? Me neither.

I guess my point is, I don't have enough karma to have down arrows next to this post, so I'm going to write a cranky comment about this article.

> 3) He RETURNS to Google and searches for "Zagg" which IS in "Zaggs", and boom, results. Surprised? Me neither.

You might want to hold off on your cranky comment. Would you be surprised if the search for 'Zagg' did NOT turn up the results for 'Zaggs'? Because for better or worse, that's what actually happens.

GMail search does not perform stemming (like removing that final 's') and also does not allow for substring searches. So in fact, a search for 'Zagg' will return nothing. While this isn't a fatal flaw, it is a drag.
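For comparison, one common way to support substring queries is a trigram index. The sketch below is only an illustration under that assumption; nobody in this thread knows what Yahoo or Gmail actually do, and the sample messages and function names are made up.

```python
from collections import defaultdict

def trigrams(s):
    """All 3-character windows of the lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_trigram_index(messages):
    """Map each trigram to the set of message ids whose text contains it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for gram in trigrams(body):
            index[gram].add(msg_id)
    return index

def substring_search(index, messages, query):
    """Narrow candidates by trigram, then confirm with a real substring test."""
    q = query.lower()
    grams = trigrams(q)
    if not grams:
        # Query shorter than 3 characters: no trigram to use, scan everything.
        candidates = set(messages)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {m for m in candidates if q in messages[m].lower()}

# Hypothetical mailbox contents for illustration.
messages = {1: "Your Zaggs order has shipped", 2: "Zigzag pattern attached"}
index = build_trigram_index(messages)

print(substring_search(index, messages, "zagg"))     # {1}
print(substring_search(index, messages, "zag"))      # {1, 2}
print(substring_search(index, messages, "headset"))  # set()
```

The trade-off is index size: every message contributes an entry for each distinct 3-character window rather than each distinct word, which fits the cost argument made elsewhere in the thread.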

He did search for "zag" in Gmail. Right after "zbuds" and before "headset".

GAH! All that frustration over nothing. Thank you.

it seems to me that this problem isn't just limited to gmail. other than google search itself, I think search in other google products (android included) is lacking when it comes to what I presume to be basic functionality.

Typical of someone to glibly say 'where's regex matching?' when it's very advanced, and it's possible Google simply don't have the cpu for making (and searching) a search tree for many megabytes of mail.

I don't know how Yahoo do it, but this guy should at least present a solution to the problem. It reminds me of that Alexei Sayle sketch where he says 'I blame the council' (which in the UK is the local authority and handles all sorts of things in a town or city), and at the end, wanting someone to blame for blaming the council? He blames the council!

What's interesting is that after reading this article I've realized I always expected e-mail search to be bad. Why? Possibly because, when another entity holds a large portion of my social information, it feigning ignorance about deep technical knowledge of my personal life (which is likely better than my own understanding) makes me feel good.

It's less 'big brother'y, which may explain why poor e-mail search doesn't bother me. It's as though it's waiting for me to get the answer first.

Good points there. When I search in google I know I can be sloppy with my typing because 90% of the time it's quicker to get it wrong and click the "did you mean?" results, rather than edit my input.

That feature's absence is very obvious when struggling to find data in gmail.

Personally I use subject line tags for stuff I want to filter on. (Like 'music' 'todo' 'idea'), and when I store something in gmail I want to remember I make sure the key words I will search for are very obvious (and easy to spell).

I have hit this problem too. If you subscribe to the git email list, but only want to look at discussion, not patches, it ought to be possible to filter out the patches with 'subject:patch'. But this doesn't completely work, because quite often the patches have patchv2 patchv3 etc, which doesn't match.
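The patchv2 case only needs prefix matching, which is cheaper than general substring search. A minimal sketch, assuming the index keeps its terms in a sorted list (the term list and function name here are hypothetical):

```python
import bisect

def prefix_matches(sorted_terms, prefix):
    """Return every term starting with prefix, using two binary searches.

    Assumption: sorted_terms is a sorted list of distinct terms, and no term
    contains the character U+FFFF, which is used as an upper-bound sentinel.
    """
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

# Hypothetical subject-line terms for illustration.
terms = sorted({"patch", "patchv2", "patchv3", "pat", "path", "re"})

print(prefix_matches(terms, "patch"))  # ['patch', 'patchv2', 'patchv3']
print(prefix_matches(terms, "rfc"))    # []
```

A filter like 'subject:patch' could then expand to all matching terms before looking up messages, so patchv2 and patchv3 would be caught as well.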

The curious thing is that google groups is even more painful; where you would think it would be more worth having the indexes, because more people are going to search the same data.

I think Google found that they can make more money by directing Usenet searches to crappy archive sites full of Google ads.

Which is a shame, since 5 years ago Usenet search was absolutely wonderful.

I find it easier to let an external service index my list subscriptions:


(or http://git.markmail.org/search/?q=#query:type%3Adevelopment if you want to use the list-specific site).

I hope they build more robust searching into their Wave client.

If lack of substring search is bad, how much worse is the fact that it sometimes doesn't even return all the results for a correct query. I use labels a lot, and I often use boolean searches to search for exactly which Venn intersection I need. A search for "in:inbox is:unread -label:Triangle" (without the quotes) in my gmail turns up messages labeled Triangle. I have similar other problems and cannot trust gmail's search.
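For reference, the semantics I would expect from that query can be sketched as plain set operations. This assumes each message carries a set of tags such as "in:inbox" or "label:triangle"; the data model and function names are made up for illustration and say nothing about Gmail's internals.

```python
def parse_query(query):
    """Split a query into required terms and '-'-prefixed excluded terms."""
    required, excluded = set(), set()
    for term in query.lower().split():
        if term.startswith("-"):
            excluded.add(term[1:])
        else:
            required.add(term)
    return required, excluded

def label_search(messages, query):
    """Ids of messages that have every required tag and no excluded tag."""
    required, excluded = parse_query(query)
    return {
        msg_id
        for msg_id, tags in messages.items()
        if required <= tags and not (excluded & tags)
    }

# Hypothetical mailbox: message id -> set of tags.
messages = {
    1: {"in:inbox", "is:unread", "label:triangle"},
    2: {"in:inbox", "is:unread"},
    3: {"in:inbox", "label:square"},
}

print(label_search(messages, "in:inbox is:unread -label:Triangle"))  # {2}
```

Under these semantics a message labeled Triangle can never appear in the result, which is why seeing one there reads as a bug rather than a missing feature.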

My solution was to use fetchmail, mb2md, and fgrep on my local server. Now I feel like I can almost trust the setup I have.
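A rough Python equivalent of the fgrep step, for anyone who prefers it, is sketched below. It assumes the mail already sits in a Maildir directory (for example after fetchmail and mb2md), uses only the standard library mailbox module, and does a plain case-insensitive substring test on subjects and text/plain parts. The function name and the path in the usage comment are hypothetical.

```python
import mailbox

def grep_maildir(path, needle):
    """Return the subjects of messages whose subject or text body contains needle."""
    needle = needle.lower()
    hits = []
    box = mailbox.Maildir(path, create=False)
    for _key, msg in box.items():
        subject = str(msg.get("Subject", ""))
        parts = [subject]
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, errors="replace"))
        if needle in "\n".join(parts).lower():
            hits.append(subject)
    return hits

# Example call (the path is an assumption):
# print(grep_maildir("/home/me/Maildir", "zag"))
```

This scans every message on every query, so it is slow on a large mailbox, but it cannot miss a substring the way a token index can.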

That's really clever how he auto-forwards all his gmail messages to yahoo. I wish I had done something like that from the get-go. Too late now. Here is a Lifehacker article explaining how to automate gmail backups with sendmail and cygwin:


I'd suggest he tries using Lotus Notes' built-in "search". If there was ever a more useless function, it's LN's pathetic attempt at searching for stuff. It'll find stuff to every search term you enter! Just nothing that a) contains that search term, and b) resembles anything you wanted to find.

At least gmail filters out spam better than any other mail client and yes it includes Yahoo Mail.

Oh yeah, the gap in gmail's search ability does hint at possible startup ideas to solve this problem. However I wonder why some great startups have passed on directly competing in this sphere (reMail being a good example)

I did a search for a mail I was sure I had in gmail a few weeks ago, and it returned 0 results. I came to the conclusion that I'd simply deleted the mail.

Now I'm wondering if it wasn't me at all.

I'm going to keep my fingers crossed for "search as you type" across all Google properties. This has got to happen...someday...right?

At least it's better than Google Groups search.

just an observation: lot of attacks on gmail today/recently.
