Improvements to searching for special characters in programming languages (blog.google)
105 points by TheQwerty on Mar 2, 2017 | 29 comments

I can now find the C+@ programming language. So it's not heavily special cased for common programming languages.

Google Code Search (2006-2013) [1] was more useful. I miss that. Its search allowed regular expressions.

[1] https://en.wikipedia.org/wiki/Google_Code_Search

It doesn't seem to work perfectly. Doing a verbatim search for "C+@" programming language produces a lot of results without the "C+@" on the page.

This is great. I feel Google has slowly become too user friendly. My mobile results are always way less technical than my desktop results. If I'm in the car (passenger) and want to look up a problem I'm having while programming, I get mostly related queries that are a simplified version of what I'm looking for.

I really believe that the technical crowd drives what becomes popular (app recommendations for family and friends). I feel a lot of the "Google hacking" queries have become less obvious and the search bubble stuff was getting bothersome. This is definitely a step in the right direction. Hopefully I'll be a little less frustrated with results in the future.

Google tailors its results to the kind of person it thinks you are. For example, if you immediately search "python", you will get results about snakes. But if you search for programming first, and then python second, it will now give back programming results on the second search. This continues to apply if you searched "programming" last week.

This behavior is actually very nuanced and impressive to watch, once you understand what's going on.

I don't think Google is becoming more user friendly at the expense of being technical. It certainly isn't for me. What your problem sounds like is that it's built two separate profiles for you: one for what you're likely to search on desktop, and the other for what you're likely to search on mobile.

That specific example isn't actually true. Even with no prior history if you search "python" the first result is the programming language.

That's because Google is smart and despite the fact that more people know of Python as a snake, when somebody types just "python" into a search query, it's almost definitely true that they mean the programming language. Few people Google for types of snakes.

A similar thing is true for "ruby" and "rust".

I don't think the term user friendly is the right term. More like they have been targeting a different user over the years (also the content on the web has exploded).

Although most of the time Google will give me technical results if I can coax it.

To make this change at Google's scale is a triumph. Even dealing with this type of tokenization on our vastly smaller document set can be challenging.

Genuine question: why should this be any more difficult than searching for any other type of character? I've long found it hard to understand why Google is so bad at searching for non-alphanumeric characters.

When indexing documents (and querying for them), the terms go through a process that splits them up and then normalizes them so they can be found more easily. A trivial example is that you want to find "can't" when someone searches for "cant". Typically special characters are removed for several reasons: the vocabulary of terms becomes smaller, which saves space and time; you can ignore punctuation (like when searching for things that butt up against parentheses); you can remove accent marks and diacritics; and a host of other things.
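To make that concrete, here is a minimal sketch of the kind of normalization being described. It is not how Google does it; the function names `normalize` and `tokenize` are made up for illustration, and whitespace splitting is an assumption to keep it short.

```python
import re
import unicodedata


def normalize(term: str) -> str:
    """Fold case, strip diacritics, and drop every non-word character."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", term)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything that is not a letter, digit, or underscore is discarded.
    return re.sub(r"[^\w]", "", without_marks.lower())


def tokenize(text: str) -> list[str]:
    """Split on whitespace (an assumption) and keep non-empty normalized terms."""
    tokens = (normalize(raw) for raw in text.split())
    return [t for t in tokens if t]
```

With this, "can't" and "cant" collapse to the same term, and "(Hello, café)" becomes `hello` and `cafe`. The side effect is the one this thread is about: a token like `>>=` normalizes to the empty string and disappears from the index entirely.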

This is hard because (a) you have to have contextual awareness of punctuation in certain places, like the '&' character in john&jane vs the logical &&. (b) your vocabulary of terms becomes larger - which is probably not a big deal for most folks but if you are Google then a 0.0001% increase in the vocab is a killer in space.

--EDIT-- The vocab increase is probably not as much as I noted above - but even adding a dozen terms can have an impact at Google's scale.

> A trivial example is you want to find "can't" when someone searches for "cant".

In that case, I think you want to just store "can't" and treat "cant" the way you would any other potential near-miss spelling of a more common word.

OK, thanks - that sounds plausible on the face of it, but why wouldn't you store special characters and then ignore them when matching patterns? You could then make an exception for strings in quotes (or some other option for activating a more precise search).

Maybe Google hasn't previously thought the extra space/complexity was worth the special treatment but given the relative quantity of data they already index and the usefulness of this feature I'm surprised.

[ex-Googler, used to work on search, this issue came up repeatedly during my tenure there].

The storage cost was prohibitive. Search engines rely on a data structure known as an inverted index; it's basically a list, for each token, of every document that contains the token, and for a context-aware search engine like Google it usually contains the position within the document of the token as well. Single-character punctuation marks like periods, commas, parentheses, dashes etc. appear in literally every sentence. That means that the inverted index for periods or commas would have to contain an entry for literally every single sentence on the web.
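A toy version of a positional inverted index shows why. This is only a sketch: `build_index` is a hypothetical name, the two documents are invented, and it assumes punctuation has already been separated by spaces so that whitespace splitting is enough.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each token to a posting list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return dict(index)


# Two invented documents, with punctuation pre-separated by spaces.
docs = {1: "a monad is a burrito .", 2: "use >>= to bind ."}
index = build_index(docs)
```

Here `index[">>="]` holds a single posting, while `index["."]` already has one entry per sentence. Scale the second list up to every sentence on the web and you get the storage problem described above.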

There's a similar problem for common words like 'a', 'the', prepositions, etc, but these are usually already solved by stopwording.
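Stopwording in its simplest form is just a filter applied before indexing. A minimal sketch, with a tiny made-up stopword list (real lists are longer and language-specific) and a hypothetical `index_terms` helper:

```python
# A deliberately tiny, illustrative stopword list.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "to"})


def index_terms(text: str) -> list[str]:
    """Return the lowercased terms that would actually get posting lists."""
    return [t for t in text.lower().split() if t not in STOPWORDS]
```

For "The bind operator in a monad" only `bind`, `operator`, and `monad` survive, so the huge posting lists for "the", "in", and "a" are never built.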

That's why this announcement only covers groups of punctuation with 2-3 characters. These don't appear in ordinary text, and so you can generate posting lists for them that are reasonably-sized. (I suspect that the economics of the index have changed as well, making storage costs cheaper, but this work happened after I left and so I don't know details.)
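As a rough illustration of that rule (not Google's actual tokenizer, which I can't speak to), here is a sketch that keeps runs of 2-3 punctuation characters as tokens and drops single marks. The regex, the length bounds, and the `tokenize` name are all assumptions for the example.

```python
import re

# A word run, or a run of characters that are neither word chars nor whitespace.
TOKEN = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s]+)")


def tokenize(text: str) -> list[str]:
    """Keep words, plus punctuation runs that are 2 or 3 characters long."""
    tokens = []
    for match in TOKEN.finditer(text):
        piece = match.group()
        if match.lastgroup == "word":
            tokens.append(piece.lower())
        elif 2 <= len(piece) <= 3:
            # Short runs such as ">>=" or "&&" get their own posting list.
            tokens.append(piece)
        # Single marks like "." or "," are dropped, as before.
    return tokens
```

So "x >>= f, then a && b." yields `x`, `>>=`, `f`, `then`, `a`, `&&`, `b`, while "john&jane" still yields just `john` and `jane`. The sketch is naive about runs like `).` at the end of a parenthetical, which do show up in ordinary prose; a real system would need extra filtering for those.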

You need to double the size of the index. You now need an index with punctuation and without punctuation.

Previously if a document contained "(hello" it would just be stored in the index once: as "hello". With this change, it needs to be stored in the index twice, as "(hello" and "hello", so that people searching either term can find it.
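In code, the double indexing described here might look like the following sketch. The names `index_forms` and `build_index` are hypothetical, and the set of characters stripped from the edges of a token is an arbitrary choice for the example.

```python
from collections import defaultdict

# Punctuation stripped from token edges; an illustrative choice only.
EDGE_PUNCTUATION = "()[]{}.,;:!?\"'"


def index_forms(raw: str) -> set[str]:
    """Return every form a raw token is indexed under: as written and stripped."""
    forms = {raw.lower()}
    stripped = raw.strip(EDGE_PUNCTUATION)
    if stripped:
        forms.add(stripped.lower())
    return forms


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each indexed form to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.split():
            for form in index_forms(raw):
                index[form].add(doc_id)
    return dict(index)
```

A document containing "(hello" ends up under both `(hello` and `hello`, while a plain "hello" is stored once. Only tokens that carry punctuation are duplicated, so in this toy version the index grows by less than 2x; how close a real index gets to doubling depends on how much of the text touches punctuation.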

Meanwhile, code searching at GitHub completely ignores characters like =, $, {. And, it's case insensitive. Argh.

It's the most frustrating "feature" I've ever seen. GitHub, the platform for hosting code, has a search function that doesn't work for code. How does that make any sense?!

Fixing that seems like PM101 material, yet here we are in 2017 with this still being a thing...

Only very slightly related:

When I was a teenager I made music under the name shark^^bait

The ^^ is what made it stand out from others.

The issue is there is no efficient way to search for that phrase with the special characters.

I have no idea if I can still find the absolutely god awful music I made back then.

Using the phrase match in Google just searches for sharkbait, which doesn't help at all.

It doesn't help that years later a little movie called Finding Nemo came out.

This will be extremely helpful next time I have to use a Haskell library that decides to implement everything as infix operators named "~<$>" and ".~=" and stuff.

Hoogle is the way to go, man. https://www.haskell.org/hoogle/

I know Hoogle exists, but that just searches one kind of documentation. Despite the cute name, it's not Google. You can't Hoogle an error message and see if anyone else got it.

Usually I use http://symbolhound.com

Catching up with DuckDuckGo?

Needs to index perlvar for this; very few of the variables from there get any hits, and for the ones that do, the results that come up are for bash only

it's google though shouldn't be surprised

I am a bit sad that no Haskell results show up in this search:

">>= operator": https://www.google.com/#q=%3E%3E%3D+operator&*

But it's a sight better than it was before. It actually shows meaningful programming language results. And if I call the operator by its Haskell name at the same time, I get very good results:

">>= bind": https://www.google.com/#q=%3E%3E%3D+bind&*

Or just the language name:

">>= Haskell": https://www.google.com/#q=%3E%3E%3D+haskell&*
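For anyone wondering about the `%3E%3E%3D` in those links: it is just `>>=` percent-encoded, with `+` standing in for the space. A small sketch using Python's standard `urllib.parse`; the `encode_query` wrapper is a made-up helper, and this only covers the encoding step, not how Google interprets the URL.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_query(query: str) -> str:
    """Percent-encode a search query, turning spaces into '+'."""
    return quote_plus(query)


encoded = encode_query(">>= operator")
decoded = unquote_plus("%3E%3E%3D+bind")
```

`encoded` comes out as `%3E%3E%3D+operator`, matching the first link above, and `decoded` is `>>= bind`.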

For your first search, I see "Operator Glossary - Haskell Lang" as the 9th result.

Ah, Google's personalization of searches. Here's what I see:

Operators in C++ - TutorialsPoint https://www.tutorialspoint.com/cplusplus/cpp_operators.htm

Operators in C++ - Learning C++ in simple and easy steps : A beginner's tutorial ... Right shift AND assignment operator, C >>= 2 is same as C = C >> 2. C++ Loop Types · Conditional operator · C++ Pointer Operators · Increment operator

Assignment operators - JavaScript | MDN https://developer.mozilla.org › ... › JavaScript reference › Expressions and operators

Feb 3, 2017 - An assignment operator assigns a value to its left operand based on the value of its right operand. ... Right shift assignment, x >>= y, x = x >> y.

Operators in C and C++ - Wikipedia https://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B

This is a list of operators in the C and C++ programming languages. All the operators listed exist in C++; the fourth column "Included in C", states whether an ...

Right Shift Assignment Operator (>>=) - MSDN - Microsoft https://msdn.microsoft.com/en-us/library/y9h99e01(v=vs.100)....

Using this operator is almost the same as specifying result = result >> expression, except that result is only evaluated once. The >>= operator shifts the bits of ...

<<= Operator (C# Reference) - MSDN - Microsoft https://msdn.microsoft.com/en-us/library/ayt2kcfb.aspx

Jul 20, 2015 - except that x is only evaluated once. The << operator shifts x left by the number of bits specified by y . The <<= operator cannot be overloaded ...

C# Operators https://msdn.microsoft.com/en-us/library/6a71f45d.aspx

Jul 20, 2015 - x >>= y – right-shift assignment. Shift the value of x right by y places, store the result in x , and return the new value. => – lambda declaration.

-= Operator (C# Reference)1 - MSDN - Microsoft https://msdn.microsoft.com/en-us/library/d31sybc9.aspx

Jul 20, 2015 - except that x is only evaluated once. The / operator is predefined for numeric types to perform division. The /= operator cannot be overloaded ...

What does this ">>=" operator mean in C? - Stack Overflow stackoverflow.com/questions/17769948/what-does-this-operator-mean-in-c

Jul 21, 2013 - unsigned long set; /*set is after modified*/ set >>= 1;. I found this in a ... The expression set >>= 1; means set = set >> 1; that is right shift bits of set ...

java - What does "|=" mean? (pipe equal operator) - Stack Overflow stackoverflow.com/questions/14295469/what-does-mean-pipe-equal-operator

Jan 12, 2013 - |= reads the same way as += . notification.defaults |= Notification.DEFAULT_SOUND; .... 2 <<= Left shift AND assignment operator C <<= 2 is same as C = C << 2 >>= Right shift AND assignment operator C >>= 2 is same as ...

C++ Operator Precedence - cppreference.com en.cppreference.com/w/cpp/language/operator_precedence

Oct 12, 2016 - Precedence, Operator, Description, Associativity. 1, :: Scope resolution ... For relational operators > and ≥ respectively. 9, == != For relational ...

Fantastic! As a Lisper and Clojurian, I can say that this really helps beginners to search for reader macros.

Any tips on searching for C and not C++ or C#? (google or ddg)

Long overdue, considering DuckDuckGo is quite developer friendly.

Agree. On DuckDuckGo just do:

<your programming problem> !so

Takes your search directly to Stack Overflow

If you don't like the results, try it again with !g and your search is submitted to Google.
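If you want to script this, a bang is just part of the query string. A minimal sketch, assuming the usual `https://duckduckgo.com/?q=...` search URL; `ddg_bang_url` is a hypothetical helper, not any official API.

```python
from urllib.parse import urlencode


def ddg_bang_url(query: str, bang: str = "so") -> str:
    """Build a DuckDuckGo search URL with a !bang appended to the query."""
    # Assumes the standard ?q= search parameter; the bang rides along in the query.
    return "https://duckduckgo.com/?" + urlencode({"q": f"{query} !{bang}"})


stack_overflow_url = ddg_bang_url("haskell >>=")
google_url = ddg_bang_url("haskell >>=", bang="g")
```

Opening `stack_overflow_url` in a browser should hand the search to Stack Overflow, and `google_url` does the same with Google, exactly as typing `!so` or `!g` would.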

They've got 9000+ bangs now:

