I once spoke to someone in the know about this, and the main reason is that it's quite expensive for them to do any reasonable job with stemming, etc.
I know that sounds weird (but it's google, they're omnipotent!), but it makes sense: It's worth their while to stem content they crawl and index off the web, because everybody could in theory access any given page. However, with email, the only person who'll ever benefit is the recipient.
It would at least make sense to deal with plurals. I can't tell you the number of times I've searched for something using an s (or not) at the end, and failed to find what I was looking for, only to remember later that I should try to add or remove an s from my search term.
Exactly....I have always wondered why there is no google premium...I would gladly pay $100/year to have a collection of certain domains easily removed from my google searches, or pick from a list of my favorite sites for a "site:" search, etc etc.
It seems like they've worked on mitigating this on the web with suggestions for alternate spellings and by displaying related searches. While you can see stemming at work in many Google searches, I'm pretty sure they don't build extensive substring indices on the web end either. For example, I've had searches where a substring returns 0 results and the exact phrase returns a handful.
> However, with email, the only person who'll ever benefit is the recipient.
I'm not sure that makes sense - if they add stemming, all users of GMail benefit. Going by your explanation, it wouldn't make sense to add any expensive features to GMail, because the only person who would ever benefit from them is the single user.
I think you misunderstand the nature of stemming; the point is that each and every user's inbox would have to be processed individually, and apparently Google doesn't think the overhead is worth the results.
I understand stemming. I just think the webpage contrast is not a good explanation for why they're not doing it. Building an unstemmed search index per user is also expensive, and helps only the recipient, but they do it because they think the expense is necessary. They stem webpages because they think the expense is necessary. They don't stem mails because they think the expense is not justified, not because the only person who benefits is the recipient.
It is also expensive, but it is less expensive than doing that AND stemming. They just decided that stemming was a line where the benefit (add'l users, more use of Gmail, more AdSense revenue, whatever metric) wasn't worth the (development and ongoing processing) costs.
This is a drawback to putting everything in the cloud: features will be weighed by the CPU cycles and storage required by the providers. Can't wait til we come full circle and get back to client/server computing ;). I'm only half joking. I actually can't wait until things mature enough so we have a hybrid of both models. Then I can decide just how much stuff I want indexed and also not worry what happens when my cable modem flakes out.
"Gerald Weinberg tells the story of a programmer who was flown to Detroit to help debug a troubled program. The programmer worked with the team who had developed the program and concluded after several days that the situation was hopeless.
On the flight home, he mulled over the situation and realized what the problem was. By the end of the flight, he had an outline for the new code. He tested the code for several days and was about to return to Detroit when he got a telegram saying that the project was canceled because the program was impossible to write. He headed back to Detroit anyway and convinced the executives that the project could be completed.
Then he had to convince the project's original programmers. They listened to his presentation, and when he'd finished, the creator of the old system asked, "And how long does your program take?"
"That varies, but about ten seconds per input."
"Aha! But my program takes only one second per input." The veteran leaned back, satisfied that he'd stumped the upstart. The other programmers seemed to agree, but the new programmer wasn't intimidated."
"Yes, but your program doesn't work. If mine doesn't have to work, I can make it run instantly."
Gmail search certainly works. It just doesn't have every feature.
It's not "completely broken," but no hits for a query of "zag" in an email that contains "zagg" comes uncomfortably close to "doesn't work." (FWIW, I use gmail and haven't had any huge problems with search, although I have had to do way more work to find something than I would expect given that it's from Google).
Code Complete? really, just read Weinberg
'twas just a quick copy and paste from my fortune file. Although as far as I'm concerned there ain't nothing wrong with Code Complete.
Javier Kragen Sitaker's article/mail "My Evolution as a Programmer" recounts one coder's growth as a programmer, his career, and his exposure to a variety of books throughout. It's really an excellent read, and contains some comparisons between a few good books - particularly Code Complete and The Practice of Programming. I quote,
During this time, I read "The Practice of Programming", which is a lot like "Code Complete", but shorter and much higher in quality. I had read the same authors' "The Elements of Programming Style" back in 1995, on much the same subjects, but that book is nearly unreadable today --- it's written in PL/1 and FORTRAN IV. TPoP, aside from being written with modern programming languages, also contains insights from several decades more of the authors' experience.
-- The author in question is Brian Kernighan. At any rate, I'll leave any interested person to go check the article out, if you haven't seen it already.
I take solace in the fact that people actually read Code Complete - it's not just a bookshelf ornament. Yes, I know, Knuth is God, and his work is a masterpiece - the point is nobody actually reads the friggin thing.
There are zillions, obviously. What do you want, a list? I don't know, SICP, CLRS, Knuth? Non-mainstream languages? Functional programming? For that matter, a single classic CS paper that wasn't assigned as homework? Code Complete, its grounding in software-engineering literature notwithstanding, isn't very deep.
I consider the lack of depth in Code Complete one of its great strengths. You couldn't hand Knuth to a newbie programmer and expect them to get anything useful out of it, whereas Code Complete will teach them a lot of really useful things they can use their entire career.
If you've been working as a programmer for 12 years it probably won't teach you too many new things, but if you've been working for 12 weeks it is a great book.
In my experience, it's a more reliable heuristic to fault people for what they haven't read than for what they have read. Some of the smartest people I know can and do quote from children's books and anime.
I've been using Outlook 2007 since it launched. It is incredibly quick... when it works. I frequently find that I can be LOOKING AT THE EMAIL I WANT. Then type a phrase that I am staring at, and not have that email appear in the results. Never mind that sometimes it will randomly start taking 60 seconds or more to return results until you restart the program.
So it sounds like you're using Outlook 2003 on XP and don't have Windows Search (what powers Outlook instant search) installed.
Hound your incompetent IT department and say you need to search your email and you need Windows Search 4 installed on your machine instead of blaming a 6 year old product and an 8 year old operating system.
My outlook box is pretty large and it never takes more than 3 or 4 seconds. It's not the greatest search either, but it does handle substrings nicely and automatically. It works most of the time (though I'd still be much happier if it worked all of the time.)
An hour, really? I don't use Outlook, so I don't know if that's right - but I suppose that's why the various Outlook search plugins are so popular.
I haven't had a problem with Gmail search as described, but I would not classify it as fast. Often I'll type in a simple query and have it spend 10-20 seconds before it returns results. If I perform the same query on the same dataset in Spotlight in Mac OS X, it starts to return results instantly with the search completing within 5-10 seconds.
Considering the difference in speed, it's faster for me to find what I was looking for by sifting through GMail's results, than to wait for Outlook's results to show up at all. Outlook's search essentially renders my whole machine useless until it's done searching.
I actually heard that it's quite costly to index emails for gmail (not from a gmail source, just random web chatter). It makes sense. Most emails are not important, they are just huge amounts of random chatter. I'd imagine indexing emails (full-text) properly would require some effort. The gmail team is probably on a budget :)
But then again, if they can index the web, why not email?
Well, they already do (and have to) maintain separate indexes for each and every gmail account (or else how would you search on the metadata fields like To: and From: in your inbox at all?).
Supposedly the issue is that they don't perform more computationally-expensive linguistic analysis during the indexing phase. If they tokenize each word but don't perform any stemming or lemmatization, for example, the result would be similar: only full-word non-substring matching.
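To make that concrete, here's a minimal sketch of a per-mailbox inverted index built with and without a normalization step at indexing time. Everything in it is an assumption for illustration: `naive_stem` is a toy plural-stripper (not Porter stemming, and not whatever Google actually runs), and the mailbox contents are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(token):
    """Toy stemmer (hypothetical): strip a single trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def build_index(messages, normalize=lambda t: t):
    """Map each (optionally normalized) token to the ids of messages containing it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for token in tokenize(body):
            index[normalize(token)].add(msg_id)
    return index

def search(index, term, normalize=lambda t: t):
    """Exact lookup of one term; there is no substring matching here."""
    return index.get(normalize(term.lower()), set())

# Hypothetical mailbox
messages = {
    1: "Your order of Zaggs earbuds has shipped",
    2: "Lunch on Friday?",
}

plain = build_index(messages)
stemmed = build_index(messages, naive_stem)

print(search(plain, "zagg"))                # set(): only whole tokens match
print(search(plain, "zaggs"))               # {1}
print(search(stemmed, "zagg", naive_stem))  # {1}
```

The stemmed index costs one extra function call per token at indexing time in this toy version; a real stemmer or lemmatizer does considerably more work per token, which is where the cost argument comes from.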
As others have pointed out, it's probably a cost-benefit decision by Google to not spare grid cycles on full-fledged linguistic analysis for individuals' email accounts. Google CAN do better at it, as is evidenced by their web search index.
Speaking of which: you and I, as developers, understand search axes, and use them intuitively. But they're utterly opaque for someone who hasn't encountered that interface before--and given that Google itself doesn't support search axes in its main product, I don't think they're an interface someone's likely to learn elsewhere, either. If you're unaware of search axes, you have to click, "Show search options," which is in an extremely tiny font, next to a button that very notably says "Search the Web"--not search my inbox. Furthermore, once you actually discover that, and use it, you still don't know about the axes, because that search box doesn't use them. The only place where you can discover search axes is by clicking on a label, which results in "label:foo" being shown in the search box--but that's actually a lie, because the output if you dump that into the search field, versus if you click on a label name, don't match if you have more than about 30 messages in that label. Go try it.
So, yeah. Axes are great. And they're completely unintuitive and impossible to discover in Gmail unless you read the help, which no user ever does. So you've done a great job finding a part of Gmail's search interface that's at least as broken as the underlying implementation.
> The only place where you can discover search axes is by clicking on a label, which results in "label:foo" being shown in the search box (...)
That's how I found it, and I learned a little more about it by a need to get all mail to/from a specific client for a specific month and doing advanced searches.
Granted, that's also how I learned about "site:", by going to advanced search on Google and specifying a single site to search, the resulting page shows the "site:" in the search box.
> (...) but that's actually a lie, because the output if you dump that into the search field, versus if you click on a label name, don't match if you have more than about 30 messages in that label. Go try it.
Good point, though as far as I can tell the number of results is the only difference... but still, it's a bit misleading.
"the output if you dump that into the search field, versus if you click on a label name, don't match if you have more than about 30 messages in that label. Go try it."
The output is different because it only shows me 20 rows per page when I search on label:label, versus 100 rows if I click the label, but the results are the same. Were you seeing something more broken than the number of results per page?
FWIW, I've tried it on a few labels, none having fewer than 10,000 messages in them.
There are many subtle differences: as you noted, the rows per page differs; clicking to select all offers "Select all conversations that match this search," rather than "Select all conversations in <Label>"; the "Remove label <Foo>" button disappears; an Archive button appears; "Move to" disappears, but "Move to Inbox" appears instead; etc. It's not broken in the sense that it doesn't work, but it's broken in the sense that Google's implying an equivalence that is not there.
I think I hit the same problem occasionally, but I can usually keep trying different keywords until I find one that works. If he only tried 4 keywords like he writes, then there was probably some word in the email that would have been at least semi-unique. Maybe "earbud" or "order" or something. Then browsing or date-based filtering can usually do the rest. Kind of a pain, but not enough for me to ditch GMail completely.
3) He RETURNS to Google and searches for "Zagg" which IS in "Zaggs", and boom, results. Surprised? Me neither.
You might want to hold off on your cranky comment. Would you be surprised if the search for 'Zagg' did NOT turn up the results for 'Zaggs'? Because for better or worse, that's what actually happens.
GMail search does not perform stemming (like removing that final 's') and also does not allow for substring searches. So in fact, a search for 'Zagg' will return nothing. While this isn't a fatal flaw, it is a drag.
it seems to me that this problem isn't just limited to gmail. other than google search itself, I think search in other google products (android included) is lacking when it comes to what I presume to be basic functionality.
Typical of someone to glibly say 'where's regex matching?' when it's very advanced, and it's possible Google simply don't have the cpu for making (and searching) a search tree for many megabytes of mail.
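For a sense of what substring search costs without going all the way to regex, here's a rough sketch of a character-trigram index, which is one common way to do it. This is an illustrative assumption, not a claim about how Google or Yahoo implement anything, and the mailbox is invented.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text, lowercased."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_trigram_index(messages):
    """Map each trigram to the ids of the messages that contain it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for gram in trigrams(body):
            index[gram].add(msg_id)
    return index

def substring_search(index, messages, query):
    """Find messages containing query as a substring (query needs 3+ characters)."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least 3 characters long")
    # A match must contain every trigram of the query...
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # ...but the trigrams could be scattered, so confirm with a real substring check.
    return {m for m in candidates if query.lower() in messages[m].lower()}

# Hypothetical mailbox
messages = {
    1: "Your order of Zaggs earbuds has shipped",
    2: "Lunch on Friday?",
}

index = build_trigram_index(messages)
print(substring_search(index, messages, "zag"))  # {1}
print(substring_search(index, messages, "bud"))  # {1}
print(len(index))  # distinct trigrams, far more entries than distinct words
```

The last line is the point: even two short messages produce dozens of index keys, so the index grows much faster than a word-level one.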
I don't know how Yahoo do it, but this guy should at least present a solution to the problem. It reminds me of that Alexei Sayle sketch where he says 'I blame the council' (which in the UK is the local authority and handles all sorts of things in a town or city), and at the end, wanting someone to blame for blaming the council? He blames the council!
What's interesting is that after reading this article I've realized I always expected e-mail search to be bad. Why? Possibly because, when another entity holds a large portion of my social information, it feigning ignorance about deep technical knowledge of my personal life (which is likely better than my own understanding) makes me feel good.
It's less 'big brother'y, which may explain why poor e-mail search doesn't bother me. It's as though it's waiting for me to get the answer first.
Good points there. When I search in google I know I can be sloppy with my typing because 90% of the time it's quicker to get it wrong and click the "did you mean?" results, rather than edit my input.
That feature's absence is very obvious when struggling to find data in gmail.
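A "did you mean?" over a single mailbox doesn't need much, at least in sketch form. The following uses Python's standard `difflib.get_close_matches` against the set of words already in the index; the vocabulary and the `did_you_mean` helper are hypothetical, and a real system would presumably rank suggestions differently.

```python
import difflib

# Hypothetical vocabulary: every distinct token already in the mailbox index
vocabulary = ["earbuds", "invoice", "shipping", "zaggs"]

def did_you_mean(term, vocabulary, n=3):
    """Suggest indexed words close to term when term itself is not indexed."""
    term = term.lower()
    if term in vocabulary:
        return []  # exact hit, nothing to suggest
    # cutoff=0.6 is difflib's default similarity threshold
    return difflib.get_close_matches(term, vocabulary, n=n, cutoff=0.6)

print(did_you_mean("zagg", vocabulary))    # ['zaggs']
print(did_you_mean("invoce", vocabulary))  # ['invoice']
print(did_you_mean("zaggs", vocabulary))   # []
```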
Personally I use subject line tags for stuff I want to filter on. (Like 'music' 'todo' 'idea'), and when I store something in gmail I want to remember I make sure the key words I will search for are very obvious (and easy to spell).
I have hit this problem too. If you subscribe to the git email list, but only want to look at discussion, not patches, it ought to be possible to filter out the patches with 'subject:patch'. But this doesn't completely work, because quite often the patches have patchv2 patchv3 etc, which doesn't match.
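Here's a small sketch of why that happens and what prefix matching would buy you, assuming a whole-token index over subject lines. The subjects are invented and `exact` / `prefix` are hypothetical helpers, not Gmail operators.

```python
import bisect
import re

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical subject lines from a patch-heavy mailing list
subjects = {
    1: "[PATCH] fix rebase segfault",
    2: "[PATCHv2 1/3] fix rebase segfault",
    3: "Re: how should rebase handle merges?",
}

# token -> ids of subjects containing exactly that token
index = {}
for msg_id, subject in subjects.items():
    for token in tokenize(subject):
        index.setdefault(token, set()).add(msg_id)

sorted_terms = sorted(index)

def exact(term):
    """Whole-token lookup, the behavior described above."""
    return index.get(term, set())

def prefix(term):
    """Union of postings for every indexed token that starts with term."""
    start = bisect.bisect_left(sorted_terms, term)
    hits = set()
    for token in sorted_terms[start:]:
        if not token.startswith(term):
            break  # sorted order means no later token can match
        hits |= index[token]
    return hits

print(exact("patch"))                   # {1}: misses the PATCHv2 mail
print(prefix("patch"))                  # {1, 2}
print(set(subjects) - prefix("patch"))  # {3}: discussion only
```

Prefix lookup over a sorted term list is cheap compared to full substring search, since it is one binary search plus a short scan.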
The curious thing is that google groups is even more painful; where you would think it would be more worth having the indexes, because more people are going to search the same data.
If lack of substring search is bad, how much worse is the fact that it sometimes doesn't even return all the results for a correct query. I use labels a lot, and I often use boolean searches to search for exactly which Venn intersection I need. A search for "in:inbox is:unread -label:Triangle" (without the quotes) in my gmail turns up messages labeled Triangle. I have similar other problems and cannot trust gmail's search.
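For reference, this is what that query ought to mean if you read it as plain set algebra over labels. It's a minimal sketch with invented messages; treating `in:`, `is:` and `label:` as the same kind of label test is a simplifying assumption, not a description of Gmail's internals.

```python
# Hypothetical mailbox: message id -> set of labels/states on that message
messages = {
    1: {"inbox", "unread", "Triangle"},
    2: {"inbox", "unread"},
    3: {"inbox"},
    4: {"unread", "Triangle"},
}

def evaluate(query, messages):
    """Evaluate a space-separated AND query; a leading '-' negates a term.

    Every operator (in:, is:, label:) is treated as a plain label test.
    """
    results = set(messages)
    for term in query.split():
        negate = term.startswith("-")
        _, _, label = term.lstrip("-").partition(":")
        matching = {m for m, labels in messages.items() if label in labels}
        results = results - matching if negate else results & matching
    return results

print(evaluate("in:inbox is:unread -label:Triangle", messages))  # {2}
```

Under this reading no message labeled Triangle can ever appear in the output, which is why seeing one in the real results looks like a bug rather than a missing feature.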
That's really clever how he auto-forwards all his gmail messages to yahoo. I wish I had done something like that from the get-go. Too late now. Here is a Lifehacker article explaining how to automate gmail backups with sendmail and cygwin:
I'd suggest he tries using Lotus Notes' built-in "search". If there was ever a more useless function, it's LN's pathetic attempt at searching for stuff. It'll find stuff to every search term you enter! Just nothing that a) contains that search term, and b) resembles anything you wanted to find.
Oh yeah, the gap in gmail's search ability does hint at possible startup ideas to solve this problem. However I wonder why some great startups have passed on directly competing in this sphere (reMail being a good example)