

Stop Me If You've Seen This Word Before - Oompa
http://www.codinghorror.com/blog/archives/001186.html

======
dwwatk01
I ran into a similar problem teaching myself Perl a couple years ago by doing
a short tutorial then foolishly jumping into a co-worker's code. "What the
hell is '$.'? Hmm, well I'm sure Google can help me. What? No matching
documents?!? What is this crazy s.o.b. doing here??"

~~~
a-priori
Glad to hear I'm not the only one who's had trouble with this. I remember
reading some C code once and I saw some strange operators I had never seen
before... <? and >?, if I remember correctly.

I eventually figured out they were GCC extensions for "min" and "max", but
only after a painful experience trying to get more info out of Google than
"here's how you do addition and multiplication in C!"

~~~
technoguyrob
[http://www.google.com/codesearch?hl=en&lr=&q=lang%3A...](http://www.google.com/codesearch?hl=en&lr=&q=lang%3Ac%2B%2B+%22+%3E%3F+%22&sbtn=Search)

------
robg
_Way back in 2004, I ran a little experiment with Google -- over a period of a
week, I searched for an entire dictionary of ~110k individual English words
and recorded how many hits Google returned for each._

Of course, a word can appear on a page multiple times. That's why, I think,
folks used to ignore the stopwords. They introduced noise when trying to
access the content words. Now, with span constraints, you can incorporate them
into the analysis. So "a matrix" and "the matrix" return very different
results, even without quotes.

------
whacked_new
> "The" is one of the most common words in the English language

"the" is THE most common word in English.

------
gills
It makes sense that low-information terms would have a lower preference when
searching without any context. If your index models the context around terms,
you can get better results from a low-information search.

I think...I'm kind of shooting from the hip here relating it to context
modeling in lossless compression schemes like CABAC and PPM.

Could you overcome stop words with some sort of Bayesian phrase matching over
some learned hidden states?
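
The point about low-information terms can be made concrete with inverse
document frequency. This is only a rough sketch: the toy corpus and the `idf`
helper are made up for illustration, and the smoothed formula is one common
variant, not necessarily what any real search engine uses.

```python
import math

# Toy corpus, invented for illustration only.
docs = [
    "the matrix is a film",
    "the band the the released an album",
    "the matrix is a grid of numbers",
]

def idf(term, documents):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(documents)
    # df = number of documents that contain the term at least once
    df = sum(1 for d in documents if term in d.split())
    return math.log((1 + n) / (1 + df)) + 1

# A term found in every document gets the lowest weight.
for word in ("the", "matrix", "album"):
    print(word, round(idf(word, docs), 3))
```

With no context, "the" carries the least weight of the three terms, which is
roughly why an index that ignores context gains little from it.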

------
jgrahamc
POPFile has stopwords because people in the community insisted on it. My
commercial email filtering software does not because it turned out in my
tests that the accuracy difference they made was so small as to be in the
noise. And they were costly in terms of time to check, and to maintain across
different languages.

------
randomuser7
I guess the idea was to allow plain-English search queries (i.e. exclude words
people were using to describe their query but shouldn't be searched for).
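
A minimal sketch of that idea, assuming a tiny hypothetical stop list
(`STOP_WORDS` and `strip_stop_words` are made-up names, not any engine's real
API). It falls back to the original words when everything would be filtered,
which is the case that trips up a query like "the the".

```python
# Hypothetical stop list for illustration; real lists are longer and
# language-specific.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "how", "do", "i", "what"}

def strip_stop_words(query):
    """Drop stop words, but keep the original words if nothing would remain."""
    words = query.lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return kept if kept else words

print(strip_stop_words("how do I sort a list in Python"))
print(strip_stop_words("the the"))
```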

------
liuliu
It's about how to rank results when the query contains stop words. The
traditional tf-idf method didn't work well because it carries no information
about each word's position relative to its context. A simple method is to
index the word group "the the" instead of the single word "the". I guess that
is what Google does now with "to be or not to be". Word grouping is, however,
a common technique in CJK full-text search.
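
Word-group indexing can be sketched roughly as below. This is an illustration
under assumptions, not a description of what Google does: the function names
and toy documents are hypothetical, and intersecting pairs only yields
candidate documents, since it does not check that the pairs occur in order.

```python
from collections import defaultdict

def build_bigram_index(documents):
    """Map each adjacent word pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        words = text.lower().split()
        for pair in zip(words, words[1:]):
            index[pair].add(doc_id)
    return index

def search_phrase(index, phrase):
    """Return ids of documents that contain every adjacent pair of the phrase."""
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

# Toy documents, invented for illustration only.
docs = [
    "the the is an english band",
    "the matrix is a film",
    "to be or not to be that is the question",
]
index = build_bigram_index(docs)
print(search_phrase(index, "the the"))
print(search_phrase(index, "to be or not to be"))
```

Because the pair ("the", "the") is its own index key, the query no longer
depends on the near-useless single term "the".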

------
jcromartie
I like how the top Google words are all generic web marketing words, with the
two exceptions of "hotels" and "women."

------
tumult
[http://www.google.com/search?hl=en&safe=off&q=%22the...](http://www.google.com/search?hl=en&safe=off&q=%22the+the%22+band&btnG=Search)

as soon as the article asserted this wouldn't work, i tried googling, and it
worked fine. i stopped reading after that.

edit: for whatever reason, if you follow the link directly, the search results
are wrong. you might have to submit the query again after the page loads to
get the right results. weird! maybe he was onto something (nope)

~~~
jwilliams
He says in the article: _Google doesn't seem to use stop words any more_. (in
fact he makes a point of that fact).

~~~
fhars
And in this case, Yahoo seems to be better than Google:
<http://twingine.no/search.php?q=the+the>

Now all we need is a MySQL stopword story contest...

