Falsehoods Programmers Believe About Search (opensourceconnections.com)
337 points by binarymax 5 months ago | 179 comments

I've implemented search engines for small to relatively large organizations. Even at the companies where nobody knew anything about how search works, hardly any of these falsehoods were believed.

Also, this doesn't work well as a "Falsehoods programmers believe" subject, because those lists are not about technologies but about non-technological things that programs commonly need to handle. Hence "Falsehoods programmers believe about":

Names, Phone Numbers (sort of technical but it's not falsehoods about how phones work, but rather about how phone numbers are structured and what they 'mean'), Credit Cards, Addresses

Good possible future Falsehoods programmers believe about:

Sleep patterns, Personal identifiers, Genders

In fact I am currently dealing with a falsehoods programmers believe about versioning of laws and standards at work.

Eric Myers needs to write an article called “Falsehoods developers believe about writing falsehoods developers believe articles”.

"Falsehoods programmers believe considered harmful"

Falsehoods programmers... you wouldn't believe what comes next?!!

'Why "Falsehoods programmers believe" is not my favorite genre of programming article'

Already tried to compile such a list, "Falsehoods Programmers Believe About Falsehoods Lists": https://kevin.deldycke.com/2016/12/falsehoods-programmers-be... :D

Yeah, I can't imagine most if any developer believing, "You don’t need to monitor search queries, results, and clicks"

This is less what programmers believe, and more a "Things to keep in mind" sort of list. But I suppose the less accurate title is a bit more click-baity.

I don't know, some of the best ones are technical, there's apparently a Github Repo on "falsehoods" articles. With a section on software engineering even:

- https://github.com/kdeldycke/awesome-falsehood#software-engi...

I don't know, looking through those, they seem to be more examples of the same thing as the linked article, not classical "Falsehoods programmers believe". For example, this one, https://pozorvlak.livejournal.com/174763.html, "Falsehoods programmers believe about build systems": quite a lot of them are really "implementation decisions of build systems do not cover every need."

I’m not sure if this is parody.

Falsehoods Google, Microsoft, JIRA, and others seem to believe about search:

That, when searching for a string, I don't want exact matches to appear in the results.

If your search ever DOESN'T return exact matches (barring common misspelling correction), you're doing something seriously wrong.

This drives me crazy - and it happens far too often!

Even worse, Google sometimes shows results that don't even contain text you've specified an exact match for with double quotes, e.g. "find me"

This is a misfeature that Google started using some years ago.

There is a tab-style link directly below the search box called "Tools" in the search results page. Once clicked it displays a few settings and one of them can be set to "Verbatim".

Choose that and your search terms will actually be what is searched for, as opposed to some arbitrary subset of it. I wish this was better documented.

It looks like that resets with each search, but thanks for the tip.

DuckDuckGo for search results, Google for “Google Search” results.

Amazon. As in the shop: I can do one search and get a result, even a 'featured' result, then change the query to something more generic that's an exact match for the _title_, sort in a way that should make it easy to spot again (e.g. by ascending price), and it doesn't show (e.g. the first result is more expensive than the product I just saw).

It's infuriating, because presumably I don't always happen to see the unknowingly hidden option.

it seems like it used to be much easier to get Google to return exact matches. Just my subjective experience of course, but as accuracy for word-sense-disambiguation significantly improved it seems like Google has become much more comfortable returning what it believes are close or related matches. Overall it's probably better search, but I find myself having to put things in quotes more than I used to when looking for a very precise result.

Does Google get false positives (incorrectly detecting a mistake and “fixing” it?) or false negatives (failing to detect a mistake you make) more often? Is it possible that cognitive biases prevent you from giving Google credit when it fixes your mistakes? Is it possible Google has done extensive studies to find the preferred trade-off? Is it possible that your preferences are different than the average person because you’re an engineer?

I once was curious about this so I kept track of my Google searches for a week, and it was overwhelmingly the case that what it returned was what I wanted but not what I asked vs the other way around.

Another frustrating thing with google search is when it translates (or attempts to translate) queries for you and then fails to show you any results in the language of what you typed in. Even with languages specified in my google account I can't get it to stop without quoting part of search. I've switched to using yandex for a lot of searches just to avoid it.

No kidding. The other day I searched for a gas station on google maps and got bus station instead

Quit thinking you know what you want better than the machines.

I'm not sure I understand you.

If I search for "restaurants", I want search results that ARE restaurants, not search results which have the word "restaurants" in them.

What do you want to have happen?

It depends what you're looking for. If it's an error message, you probably want that exact string, not results that are errors but don't include the string.

It used to be that searching literally "restaurants" (i.e. with quotes) would search for an exact match (particularly useful for multi-word searches in those days), but no more. It's taken as a 'hint' or something, I believe, but not an absolute instruction.

And this is how the AI takeover began.

If I’m doing a google search, you’re probably right. If I’m searching my gmail account, you’re probably wrong. I’ve searched for exact phrases that occur in my email (for both gmail and outlook) and failed to get the matches anywhere in the results (and had to find them by other means). Same with Jira. It’s very frustrating to have to sort through hundreds or thousands of messages for the result you wanted.

The gmail one in particular drives me absolutely bonkers. Like, I don’t care if the search needs to take 15 seconds to do, just find the email with the phrase that I know is there!

This is even more frustrating when you do a date constrained search and google tells you there are no emails from that date, but if you page through manually, it’s there. I feel like gmail is constantly gaslighting me.

Is it so weird that websites for restaurants would literally have the word “restaurant” on it somewhere? Eg

> Foobar Canteen is a 2 Michelin star restaurant located in the heart of Soho.

This used to be how search engines knew what was a page about restaurants and what wasn’t.

But in any case, the problem with not returning exact strings is those times when you do need exact strings. Like researching a famous quote, passage of text or software error message.

If I search for "python lea", I want information about the python package "lea". I do not want general information about "learn python".

Ran that search just now, got 5 results about using lea, the 6th was about scikit-learn.

It's all part of their attempt to de-commoditize their stuff, changing from an indexing-and-keyword-tool to invasive "assistant" that Knows What You Meant To Say.

However, as someone who already learned to translate my desires into keywords, it's freaking annoying.


There's also:

That we want well known standards like CTRL + F in a browser to be hijacked and replaced by default with a custom search experience that's a lot worse than a browser's search.

Try CTRL + F'ing on Stripe's documentation: https://stripe.com/docs/api/plans

Often those custom search implementations are there because the "page" you're on isn't really a page, the scrollbar is fake, text is inserted and removed automatically as you "scroll", and as a result of all that Ctrl-F doesn't actually work by default. Of course you could argue that these heavyweight designs are a bad idea in the first place, but that's a trickier discussion. I think it's rare that web sites hijack Ctrl-F when leaving it alone is an option -- but I could be wrong about that.

>Of course you could argue that these heavyweight designs are a bad idea in the first place, but that's a trickier discussion.

I would argue that, and I don't think it's a particularly tricky discussion; If your site design subverts the normal, expected behaviour and functionality of the browser to such an extreme degree, then you created a poor user experience.

On Chrome, you can hit CTRL + G, which does the same as CTRL + F, but is not hookable by web sites

Thanks for that, I didn't know it! It seems that F3 also works, and in fact CTRL+G is an alias of F3, and both work in Firefox as well.

The only issue is that in Firefox, it is only equivalent for the first search; once you close the bottom bar, subsequent F3/CTRL-G just do "find next occurrence" and do not display the bar anymore. Chrome always displays the search input on the other hand.

Edit: since we're talking shortcuts, in Firefox ' (apostrophe) is like CTRL-F but searches only hyperlinks (and you can cycle through multiple matches with F3/CTRL-G), which is extremely useful for quickly navigating pages via keyboard only.

Ctrl + G certainly is hookable[0], folks just rarely know that it's an alias for 'find'. If you REALLY want the browser's search, in Chrome, you can use the mouse to open the menu and choose "Find". You could also use any keyboard shortcut that focuses the URL bar (so keyboard events are no longer sent to the page) and press Ctrl-f then.

0: In Google Sheets, for example, Ctrl-g opens the JS-driven find bar, or, if it's already open, advances the match.

Somewhat related: The last version of Chromium said "Press Alt+F and then X" instead of Ctrl-Shift-Q when I tried to quit it using that key combo.

Unfortunately, Alt+F is trivially overridable by web pages (Twitch.tv in this case -- to move to the search bar), so that doesn't really work.

Chromium devs have no idea what the impact of their decisions is... and judging by the issue trackers they don't care.

I kinda want to burn down the world after reading this comment. How did we let computing get to be such a garbage fire?

Holy shit I had no idea! I was tired of Chrome's bookmark manager hijacking CTRL + F. I'll use CTRL + G from now on!

Luckily, Stripe is polite enough to allow users to fall back to default search with another tap of ctrl+f.

I would second the discouragement to override this behaviour. Although they have handled it well with a CTRL + F + F again bringing it back to native search behaviour.

I don't see an issue, their widget allows me to go back to my default ctrl-f by pressing it again.

The main issue is it's on by default and it's a vastly inferior search UI to what everyone has been using to search / skim a page since browsers existed.

Ace editor does this too - possibly because large documents might not all actually be in the DOM.

In Stripe's case, the docs are all rendered server side and are viewable without Javascript.

I'm not sure if you can hook into the native CTRL + F search tool and see what a user typed (my gut says no way there's an API for that), so I guess Stripe just wanted to track as much information as possible on what people are searching for, even if it makes the user experience a lot worse.

(I am an engineer who worked on this feature)

The docs are indeed viewable without JS[0] (in a limited way) but the default experience relies on JS to render text.

We don't render all content on the page at once for performance reasons, which is (as a sibling speculated) the driving reason for overriding cmd+f/ctrl+f by default.

I hope to write an engineering blog post soon about how we build the Stripe api docs, with some focus on the performance and UX tradeoffs at play here.

[0] https://stripe.com/docs/api?javascript=false

This. Stripe's overengineered, custom Ctrl+F is unusable on Firefox. they could've just put a search bar for their own "search" feature instead of breaking the Ctrl+F browser convention that we're so familiar with.

For me it shows the option to use the native search by pressing CMD+F a second time.

oof, that's completely unusable on Firefox for me -- it seems like the loading is blocking my keyboard input or something.

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0

(I have worked on this feature) Thanks for reporting – I can reproduce and will take a look at fixing shortly. Sorry for the troubles!

Why don’t you fix it by removing it. Most web users don’t want native features overridden. It’s obnoxious. It breaks UI and UX paradigms.

Whoever thought it was a good idea should get shot out of a cannon.

Decisions like this are why I do not support adding more functionality into web browsers. Most web developers have proven to be inept and incompetent. As demonstrated by this dumpster fire of a “feature”

Can you fix this issue by not hijacking keyboard shortcuts needlessly?

Thanks for taking a look!

What does that do?

(Mobile, cannot invoke keyboard on page, JS disabled.)

And behaviour may change.

Just tell us.

Instead of being able to hit CTRL + F and immediately search and then have your browser highlight matches and decorate your scrollbar with where results are (so you can skim), they decided to override that behavior and introduce their own take on what search results should look like.

One that takes multiple seconds to get a response on a search and it's all contained in a tiny modal dialog box that has no skimmability and when you click one of the results it does a new page load to bring you to the results. Stripe is usually a superb developer experience. Truthfully I have no idea how it ended up in production as a default option.

It's something like "Go to Resource" in code editors. Tries to navigate to methods / things based on what you type

* When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon

One of my favorite sets of local search bugs involve interpreting "near me" as "near maine".

Trying to fix every single problem in the search module/layer/service is an anti-pattern in and of itself.

There's an anecdote[1] from early days of Google Search where a certain domain was ranking 1st for an unrelated query (i.e., a false positive). The managers refused to move ahead before that got fixed, but the bug/edge case proved a head scratcher for several weeks on end.

Finally, one of the engineers solved the problem by buying the domain and taking it offline.

Point being, if you can fix the problem outside of the code domain, do just that.

[1] sadly can't seem to find it - mostly getting spam articles related to SEO

I'm pretty sure I've read a similar story in the book "I'm Feeling Lucky". It goes like this:

In the early days of Froogle, a shopping search engine made by Google, searching for "sneakers" always yielded a garden gnome wearing sneakers, one unit on sale, as the top result. This was considered bad, as someone searching for "sneakers" probably wanted to buy sneakers, not garden gnomes. The whole team tried to fix it, but they didn't want to just hardcode an exception. It eluded them for a while. Finally, it was not there anymore. They asked around for who had solved it, no one answered. Finally, one colleague arrived late - and placed the gnome on their desk.

"Buy the Gnome" should become a saying, like "eat the frog".

That's a very good story !

A lot of these entries are probably better handled with improved feedback than changed behavior. If you can tell whether the user meant 'either' or 'Oregon', that's great, but spending a week on the problem is a lot less urgent than just displaying "including results for Oregon".

Does Google have some kind of cultural allergy to special-casing or writing fallback rules around its recommendation systems? I ask because Chrome's spellcheck still lacks a lot of words that you can find in an abridged dictionary; it seems as though fallback rules like "the first hit needs at least one keyword match" or "never flag words found in Merriam-Webster as unknown" are basically never employed.

I see you’ve never made any embarrassing email mistakes.



I live in Oregon and always search using OR. I'm so used to that being the abbreviation for Oregon that this is the first time it has occurred to me that it could be confused with boolean or!

There's a fantastic and subtle bug in the Google Home Hub and Nest integration: when the conditions are met, the Nest integration can be seen in the pull-down menus but voice integration doesn't work.

So, when one says "Hey Google, show me the front yard." Instead of showing the camera feed-- one gets information about a bar in LA called "The front yard".

Reminds me of a Siri inquiry I had last week:

Me: Hey, Siri, how long does it take to drive to Yellowstone National Park?

Siri: OK, one option I see is US National Commercial Real Estate Services on W Park Run Dr.

(Yes, that's verbatim — I screenshotted it)

I can't fathom how it arrived at that answer?

Huh. I just worked on a bug around this at work. We just added some quoting and that solved it.

It's only a bug when you're not in maine. :)

Except when people are planning a trip to maine.

Howdy. Author here. Really cool to see so much good discussion on this. I want to turn several of them into blog posts on their own with explanations/stories/what-have-you. Taking votes for what you'd like to see first. For the record, my fave is "Languages don’t change".

Thank you for the thorough and practical write-up.

About the only thing I would add to it is i18n concerns.

A few quick ones off of the top of my head:

  - Words are separated by whitespace or dashes.
  - Customers only ever enter ASCII.
  - Customers only ever enter accented characters with/without accents.
  - A "Unicode-capable" system will happily take in any valid unicode.
  - A "Unicode-capable" system will pass through any valid unicode undisturbed.
  - Software systems perform Unicode normalization.
  - WinNT API is UTF-16.
  - There is 1-to-1 mapping between uppercase and lowercase.
  - Unicode collation algorithm is optimal for every single language.
  - Unicode collation algorithm is optimal for multi-language document sets.
  - Distinguishing/coalescing plural and singular forms of words is easy.
  - There are separate plural/singular forms of words.
  - Words have stem and optional suffixes, but not prefixes.
  - Soundex etc. works for every language.
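
A couple of these are easy to see from a Python prompt. This is only a minimal sketch: the `search_key` helper is a made-up name, and stripping accents is an assumption that suits some languages and is plainly wrong for others (which is the point of the accent item above).

```python
import unicodedata


def search_key(text):
    """Hypothetical helper: NFKD-decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


# The same visible word can arrive precomposed or as letter + combining mark.
composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
print(composed == decomposed)                          # False
print(search_key(composed) == search_key(decomposed))  # True

# Upper/lower case is not a 1-to-1 mapping: one character becomes two.
print("ß".upper())                                     # SS
print("Straße".casefold() == "STRASSE".casefold())     # True
```

Whether "café" should match "cafe" at all is a product decision, not something normalization can settle for you.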

> There are separate plural/singular forms of words.

Or that there are just two plural/singular forms (1 and many) for translating strings, or that which form to pick is clear.

While English has one form for 1, and one form for 0/many:

- French pluralises 0 the same way as 1,

- Czech has a form for exactly 2-4 items,

- Irish has forms for exactly 3-6 and 7-10 items,

- Polish has a form for all numbers that end in 2-4 (except those ending in 12-14),

- Russian has a form for all numbers that end in 1 (except those ending in 11),

- Arabic has forms for exactly 0 and 2 items, ending in 03-10, and many more.

A strings table will need at least 10+ variants if you want to translate strings referring to number of items.
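
To make that concrete, here is a rough sketch of picking a plural form for integers in English and Polish. The function names and the little strings table are invented for illustration; the category names ("one", "few", "many", "other") follow the usual CLDR-style convention, and fractions are ignored entirely.

```python
def plural_category_en(n):
    """English: one form for exactly 1, another for everything else."""
    return "one" if n == 1 else "other"


def plural_category_pl(n):
    """Polish, integers only: 1 / ends in 2-4 but not 12-14 / the rest."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


# Hypothetical strings table: one entry per plural category, per language.
FILES_PL = {"one": "{n} plik", "few": "{n} pliki", "many": "{n} plików"}


def format_files_pl(n):
    """Pick the Polish string variant that matches the count."""
    return FILES_PL[plural_category_pl(n)].format(n=n)


for count in (0, 1, 2, 5, 12, 22):
    print(format_files_pl(count))
```

Note that 12 and 22 land in different categories, which is exactly the kind of thing a two-variant strings table cannot express.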

yes! tokenization and problems with word boundaries alone would be great to dive into!

Thanks! Nice additions!

Having spent a fair chunk of my career dealing with search, I went down that list nodding in agreement to nearly every single bullet point save for about 10...

...and those I had to classify as "problems I probably had and didn't recognize" or "will surely encounter soon"

So often we underestimate this thing...

I found the list of falsehoods about phone numbers (https://github.com/google/libphonenumber/blob/master/FALSEHO...) really enlightening because it gave a short rationale or example for each point. I think that's way more helpful and useful than the more traditional snarky format.

May I propose an additional falsehood?

"Users won't want to turn search highlighting off."

Maybe it's just me, but this[1] seems distracting.

[1]: https://docs.python.org/3/library/pickle.html?highlight=pick...

You can increase recall without adding noise. Customer wanted to match substrings of words and then was like, why are all these irrelevant documents returning?

I think this article is setting up a pretty high bar for search. For small datasets, you can very well just add an automatically generated "description" column in your database, and then do a SQL LIKE query: it's a simple substring matching.

It's by no means smart, doesn't handle misspellings or anything, but it works reasonably fast and predictably. This is basically how almost every desktop app with a search bar works. This is how word processors and editors work when users search within the document.
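
For what it's worth, the whole thing fits in a few lines. A minimal sketch using SQLite from Python; the `items` table, its columns and the helper names are all assumptions for the example, and the one non-obvious part is escaping `%` and `_` so user input is matched literally.

```python
import sqlite3


def escape_like(term):
    """Escape LIKE wildcards so the user's text is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def simple_search(conn, term):
    """Plain substring match on the description column."""
    sql = (
        "SELECT id, description FROM items "
        "WHERE description LIKE ? ESCAPE '\\' "
        "ORDER BY id"
    )
    return conn.execute(sql, (f"%{escape_like(term)}%",)).fetchall()


# Hypothetical data set, small enough that a full scan does not matter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO items (description) VALUES (?)",
    [
        ("Garden gnome wearing sneakers",),
        ("Running sneakers, size 10",),
        ("100% cotton socks",),
        ("1000 piece puzzle",),
    ],
)

print(simple_search(conn, "SNEAKERS"))  # SQLite LIKE ignores ASCII case
print(simple_search(conn, "100%"))      # only the row with a literal '%'
print(simple_search(conn, "sneekers"))  # misspelling: no rows
```

The last query shows the limitation mentioned above: a single wrong letter and the user gets nothing back.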

"I think this article is setting up a pretty high bar for search."

All of the "Falsehoods Programmers Believe About ..." genre articles do.

The way to use them is not to view them as an immutable checklist that all programs must conform to or else they are forever and always nothing but total crap, but as a list of things professionals should at least have some clue about, and that you should generally make deliberate decisions about, rather than accidental ones. Are you a pizza place with ten locations in a single state? Then by all means, take US-only addresses, and hard-code the time zone on your web site, and probably just ignore search, and expect first & last names or whatever. Just do it as a deliberate tradeoff, with an understanding of what it may take to undo it later.

Are you working in an international company serving customers all around the world, with the need to provide some search functionality? Well, you probably need to be able to fulfill a lot more of the relevant lists.

I actually went to town on the features for the search on an internal project... And all anyone ever uses is pulling up tickets by number or by client or technician names.

"It has x, y, z features that the Google search has..." "I didn't know Google did that."

There's even an explanatory popup with how to do anything fancier than a straight text search.... But they just really use numbers and names. Advanced feature usage is once in a blue moon.

At least they're happy with the search function, but lesson learned - for a lot of usages, people aren't expecting much more than a simple text match.

Really, searching for proper names is almost the entirety of what I see clients and internal users do with search, across various projects.

When was the last time your average (non programmer) user expected search to behave like CTRL+F?

I'm not sure you're doing your users a service implementing search with SQL LIKE. I think it's probably better to divert them to Google, use a full text SQL index, use a managed search service like Algolia, or not do search. Otherwise, you're just promising them functionality that is almost always going to fail them.

Why is that?

Users have been pretty heavily conditioned to use search in specific ways that are different from finding text in documents. They have a broad range of needs that a wildcard 'find in files' search doesn't really support. And most frustratingly, users expect a single search bar to support them all. Some needs are known-item: finding an item by name (like contacts on your phone). Other needs are informational: finding a fact or idea by expressing requirements. Sometimes it's about getting a survey of information about a topic, and sometimes it's about comparing and contrasting different products.

The primitives available in SQL LIKE don't really lend themselves to solving any of these problems. There's no concept of relevance ranking, there's only exact, direct, case-insensitive matching, not to mention it's going to do a full table scan on every search...

(You'd have my ear more if we were talking about full text search features in SQL.)

I'm pretty sure we are talking about different ideas of "search."

This is more or less what I recently told a client who wanted search on a utility I built him. He wanted rich search, like using quote marks and database operators, but his budget for the whole project was about $1500. I told him that I could build the entire project, plus simple searches on the description fields with wildcard matching. Or I could give him fully featured search by using third party software, but for triple the budget. Explaining it that way got the message across, and it turned out that simple search was enough.

Use your database's ngram/trigram module on that description column (and full text index) to handle misspellings:
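
As a rough idea of what trigram matching does under the hood, here is a sketch in plain Python rather than any particular database's module. The padding scheme, the Jaccard-style score, the 0.3 threshold and the function names are all assumptions chosen for illustration.

```python
def trigrams(text):
    """Split a string into overlapping 3-character chunks, with padding."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Share of trigrams the two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def rank(query, candidates, threshold=0.3):
    """Return candidates scoring at or above the threshold, best first."""
    scored = [(similarity(query, c), c) for c in candidates]
    kept = [(score, c) for score, c in scored if score >= threshold]
    return [c for score, c in sorted(kept, key=lambda pair: -pair[0])]


print(similarity("sneakers", "sneekers"))  # 0.5
print(similarity("sneakers", "gnome"))     # 0.0
print(rank("sneekers", ["sneakers", "garden gnome", "speakers"]))
```

A misspelling still shares most of its trigrams with the intended word, so it clears the threshold while unrelated terms do not.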



Sorry but that’s going to result in a pretty horrible search experience. If you are putting a search bar on your page and that’s your search backend - you might as well skip search entirely because it’s only going to cause you and your customers pain.

The difference with find on page is that it’s obvious and transparent what is being searched and the expectations of the interface. Trust me when I say that a search bar to a layperson on your site is them thinking “oooh I can google”

I appreciate your experience, but you may have also noticed that there's a number of sibling comments from developers whose customers definitely did want a Ctrl-F type search, and not some near-AI match-what-I-really-meant thing. I've certainly worked in vertical markets where that's what the customer actually wanted.

I think it more goes back to the real most important lesson of programming - you must know your customer and their needs first. If a google-like experience is what your customers demand, then you better understand all of that stuff and build it. If they just want to search for names and ticket numbers, any more advanced intelligence is a waste of time that could have been used to build other features that the customer actually wants.

Sounds like you need to redesign your search bar on the page to convey to users the expectation that this is a simple search.

Do you have an example of what that simple search bar could look like?

I'm not a professional UI/UX designer but here's a guess. It should not be the prominent thing on your page. Prominent search bars like the one on Google's home page convey to users the idea that this is the primary way to navigate and use this website, and are therefore loaded with higher expectations. So don't make the search bar look prominent.

Next, make it context-specific. Don't put a search bar at the top of the page suggesting that this bar can search for everything. If you use a simple implementation like a SQL LIKE to implement search, put the search bar right next to the thing that is displaying results from the table. Make it look like it's filtering the table.

Finally, label the search bar using words like "Keywords," which also suggest to users that they should be typing keywords instead of a more complicated natural language phrase.

Those are interesting ideas thanks for the thoughts. FWIW I’ve seen users try to use even non-prominent search boxes like those as if they can do more than SQL LIKE. Most users have no idea how any of this works and just want answers.

Mostly I think this whole thread demonstrates the point of the original article, but I appreciate your response.

Yeah, I rolled my eyes after opening the article, it's a load of tosh depending on what your need is.

I have written search engines for a couple of sites that combined serve about a million uniques a year.

It's not great, but it's not terrible, and took less than a week. People search for places and names, so it's quite easy to match them.

We looked at one of the open source engines, but it was a lot of effort for not a lot of gain, and essentially adds another significant moving part to go wrong.

F*ck it, I am ditching VS and coding in assembly from now on.

Yes!!! Am joining the revolution!

Agreed, simple and effective. But also, most databases have more advanced full-text search functions. For example, in about an extra hour of work you can easily set up simple boolean searches using MySQL's MATCH against a full-text index over multiple columns.

>This is basically how almost every desktop app with a search bar works.

not windows 10

Oh god......

This is both awesome and so so discouraging. Does anyone have some direction on how to produce good search systems??

Focus on measuring search quality and methodology first. Be a scientist. Great search teams obsess about methodology. Treat everything you try as a hypothesis, not guaranteed to work. Create a feedback loop that improves the pace of experimentation.
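
The feedback loop can start very small. A hedged sketch: hand-judge a few queries, then compare two rankers with precision at k. The metric choice, the judgment data and every name below are made up for the example; real teams tend to use richer metrics and far more queries.

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in results[:k] if doc in relevant) / k


def mean_precision_at_k(run, judgments, k):
    """Average precision@k over every judged query."""
    scores = [
        precision_at_k(run.get(query, []), relevant, k)
        for query, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical hand-made judgments: query -> set of relevant document ids.
judgments = {"sneakers": {"s1", "s2", "s3"}, "garden gnome": {"g1"}}

# Two hypothetical rankers to compare: query -> ordered result ids.
baseline = {"sneakers": ["g1", "s1", "x9"], "garden gnome": ["g1", "s2", "x9"]}
candidate = {"sneakers": ["s1", "s2", "g1"], "garden gnome": ["g1", "x9", "s3"]}

print(mean_precision_at_k(baseline, judgments, k=3))
print(mean_precision_at_k(candidate, judgments, k=3))
```

Once a number like this exists, each change to the ranker becomes a hypothesis you can check instead of a hunch.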

Other than that, the solution space is just as wide open as regular programming. It's just in many ways more frustrating because nobody knows what they really want from search, they just "know it when they see it" and no two users really can agree on what a good result is! :)

This is a very, very insightful point. I would add: never expect a singular "perfect" algorithm, but rather build a framework that lets you blend (and evaluate/weight) the signals from various hacks, workarounds, heuristics, and "proper" algorithms.

In addition to @softwaredoug's comment, there is his book "Relevant Search", which is a great starting point! https://www.manning.com/books/relevant-search

I found "Search User Interfaces" by Marti Hearst very informative. It's available online for free: https://searchuserinterfaces.com/

Perhaps https://yacy.net could help you.

Search engines work like databases - Too vague but arguably yes in the abstract.

Search can be considered an additional feature just like any other - Yes? How do you falsify this?

Search can be added as a well performing feature to your existing product quickly - Yes if you're using a CMS with search already there like Drupal, or you can use that thing where your search uses/directs to Google.

I wonder if that one was meant to be controversial as it was the first item. My pedant sense started tingling immediately.

Search engines don't work like your standard RDBMS with SQL and whatnot. You can't just make a SQL query with a LIKE operator and just call it a day if you want modern, featureful searching.

But a search engine is absolutely a database. Lots of things are databases even if they aren't RDBMS and can't be queried with SQL.

Although, as a side note, I have seen some interesting projects that allow you to query things like file systems and operating systems using SQL, or at least syntax largely inspired by SQL.

> Search can be added as a well performing feature to your existing product quickly - Yes if you're using a CMS with search already there like Drupal

Adding a feature by using a product that already has that feature is not "adding a feature to a product". It's "doing nothing since there's nothing to do". ;-)

Using Google search for pages might work for simple sites that mostly host text content, but not for things like "find all foos that are between 20 and 30 kg".

If it's just a few things like "find all foos that are between 20 and 30 kg" then that might be nothing but building a simple query out of a few criteria. Not all searches need to be or even aspire to be a super-general search like Google. The ebay search probably isn't all that complicated (relatively) for example. If you're trying to make another Google for some strange reason then the article applies more.
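
Something like this, as a minimal sketch with SQLite from Python. The `foos` table, its columns and the whitelist are assumptions for the example; column names are checked against the whitelist because they cannot be passed as bound parameters.

```python
import sqlite3

# Hypothetical whitelist of columns that may be filtered by range.
RANGE_COLUMNS = {"weight_kg", "price"}


def build_range_query(criteria):
    """Turn {column: (low, high)} into a parameterised SELECT."""
    clauses, params = [], []
    for column, (low, high) in sorted(criteria.items()):
        if column not in RANGE_COLUMNS:
            raise ValueError(f"cannot filter on {column!r}")
        clauses.append(f"{column} BETWEEN ? AND ?")
        params.extend([low, high])
    where = " AND ".join(clauses) if clauses else "1"
    return f"SELECT name FROM foos WHERE {where} ORDER BY name", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foos (name TEXT, weight_kg REAL, price REAL)")
conn.executemany(
    "INSERT INTO foos VALUES (?, ?, ?)",
    [("anvil", 25, 90), ("feather", 0.01, 1), ("barbell", 20, 45), ("piano", 300, 5000)],
)

sql, params = build_range_query({"weight_kg": (20, 30)})
print([row[0] for row in conn.execute(sql, params)])  # ['anvil', 'barbell']
```

No relevance ranking, no tokenization, and for this kind of search that is fine.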

* All customers may see the same data

God, how I hate that authorization woes find a way to make everything else 5x more complicated.

I wonder if that's aimed at permissions-based stuff or, like, search bubbling?

I'm talking about permission-based stuff.
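A toy illustration of the permission case, assuming each document carries a set of groups allowed to see it (the Doc type, the group names, and the documents are invented for this sketch). The filter is applied inside the search itself, so counts and paging never leak hidden documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: int
    text: str
    allowed_groups: frozenset


# Hypothetical corpus with per-document access groups.
DOCS = [
    Doc(1, "quarterly salary report", frozenset({"hr"})),
    Doc(2, "public holiday report", frozenset({"hr", "staff"})),
    Doc(3, "incident report draft", frozenset({"security"})),
]


def search(term, user_groups):
    """Return ids of documents that match the term AND are visible to the user."""
    groups = frozenset(user_groups)
    return [
        doc.doc_id
        for doc in DOCS
        if term in doc.text.split() and doc.allowed_groups & groups
    ]


print(search("report", {"staff"}))  # one visible document
print(search("report", {"hr"}))     # two visible documents
```

The same query returns different results for different users, which is exactly why "all customers may see the same data" is on the list.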

On misspellings (since there are quite a few lines here dedicated to them), I had the fun responsibility of learning / knowing too much about how our search worked (we were/are using an old version of Solr), and started telling people that there's a way to at least do something about misspellings.

After conversations with two or three product managers, it became clear that the best course of action was to do nothing at all. I'm definitely not an expert on search or human behavior, and running through all the possible interpretations of how to handle misspelled words and what the customer wants was way more work than I was prepared to do.

I'll even point out that my initial suggestion was, "Let's just copy Google and do, 'Did you mean to type _______?'" Even that was met with, "what if the customers X" "what if the customers Y" etc. etc. Wasn't worth the time (at the time).

You could call it related searches and only display the suggestion when all words are either in the product catalog or in the dictionary, also checking that the query returns something with a phrase search. That can help with typos without ever being too weird.
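One possible reading of that rule, sketched with Python's difflib. The catalog, the 0.8 similarity cutoff, and the suggest function are assumptions made for illustration; a real system would use its own dictionary and its search backend's phrase query rather than a substring check over titles.

```python
import difflib

# Hypothetical product catalog; the vocabulary is derived from it.
CATALOG = ["red running shoes", "blue running shorts", "trail shoes"]
VOCAB = sorted({word for title in CATALOG for word in title.split()})


def suggest(query):
    """Return a corrected query, or None when no safe suggestion exists."""
    words = query.lower().split()
    if not words:
        return None
    corrected = []
    for word in words:
        if word in VOCAB:
            corrected.append(word)
            continue
        close = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
        if not close:
            return None  # an unknown word with no near match: stay silent
        corrected.append(close[0])
    candidate = " ".join(corrected)
    if candidate == " ".join(words):
        return None  # nothing was changed, so there is nothing to suggest
    # Stand-in for a phrase search: the corrected words must appear
    # together, in order, in at least one catalog title.
    if not any(f" {candidate} " in f" {title} " for title in CATALOG):
        return None
    return candidate


print(suggest("runing shoes"))
```

Returning None in every doubtful case is what keeps the feature from "being too weird": the user only sees a suggestion that is known to produce results.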

A list of postulations without examples or explanations is not useful.

Agreed. It leaves no room for debate or for understanding the assumptions involved.

Also, while many items in the list are insightful, I find what bothers me in this and similar lists is when you could swap anything for "search" (or "time", "addresses" or whatever the other lists happen to mention).

See for example, replacing "search" with an X:

  - Choosing the correct X is easy and you will always be happy with your decision

  - Once setup, X will work the same way forever

  - Once setup, X will work the same way for a while

  - Once setup, X will work the same way for the next week

  - The default X settings will deliver a good X experience

The problem with these assertions is that, while cute, they are so broad and generic they tell us almost nothing about the specific problem of search engines. For almost every decision in software design and implementation, the above assertions hold true.

Quite true!

Or even enough context to interpret:

> Search can be considered an additional feature just like any other

Is that a falsehood? - what does it even mean?

Almost nothing. I guarantee that for any non-trivial feature, you could just say:

"<non-trivial feature F> can be considered an additional feature just like any other"

And everyone will agree that's probably false. They could have written "search is almost never a trivial feature, and you should take your time to consider complications", but I suppose that wouldn't sound as cute as a "Falsehoods Programmers Believe" list.

Related to "Customers who know what they are looking for will search for it in the way you expect", many people don't understand that a search engine works by matching text strings (albeit in an often sophisticated way). They see it as sending commands that the search engine understands, and will then find results for...

I know VSCode had an issue where people would type whole sentences into the settings search bar. They got around it by incorporating some of Bing’s NLP logic. Goes to show, even amongst the “technically inclined” (those who not just use VSCode, but also try to modify things in it), this still holds.

Search interfaces should have a configuration for smart users:

  [ ] Disable fuzzy parsing hacks (reject my queries if they have bad syntax).
  [ ] Don't search for sound-alikes; assume I spelt everything rite.
  [ ] Respect the non-alphanumeric characters in my query, which I put there for a reason.

> A customer using the same query twice expects the same results for both searches

Really, this is a falsehood? Like, I want the same query to give the same results given the same dataset always. When do you not want that?

>>> A customer using the same query twice expects the same results for both searches

Of course this is false; please consider:

- customers expect to see in search results whatever new information they added/updated in the system (this is related to "Customers don’t expect near real time updates");

- customers expect "personalized" search results; having built up a history of searches centered around particular subjects (say, programming), you'll expect much different results for "string" than the general population gets;

- customers expect new/more results having logged in, or having gained new permissions/roles;

- customers running "knowledge" or "command" queries ("what is the weather?" "password 16") expect varying results

Or, for a short query string, the user may have a different intent without realizing they've put the same query in.

I might dash off a search for "sneakers" when I am researching footwear. A week later, I might be thinking about movies and enter the same query string, expecting IMDB results.

> - customers expect "personalized" search results; having built up a history of searches centered around particular subjects (say, programming), you'll expect much different results for "string" than the general population gets;

Not always the desired behavior. This should be toggleable. It becomes very difficult to find results outside of what google thinks you want.

The gist of this is that customers sometimes re-enter the same query after it failed thinking they'll get what they want the second time. The lesson here is that you can't assume what the customer wants because you don't know. Information needs can be unconscious and contexts between the same query entered twice may have switched.

When the datasets are not the same -- the web is ever evolving, and if I just upgraded Ubuntu, I want the latest results for my search query about why a software package isn't working.

Any query that could possibly return "news" should return new results whenever that news is updated.

Bug reports, newspaper articles, blog entries, sports scores...

Even your corporate internal wiki is going to have newer and older articles. Reindexing is a thing.

Temporal context.

If I google "waitrose closing time" I want the closing time of my local supermarket today.

When I googled that yesterday, I got a different result, and that's what I want.

Since there is no mention of a time parameter, these two queries can be separated by a certain amount of time in which the engine learns more about the user.

Perhaps the user is a business user rather than a developer and once profiled correctly results can be adjusted.

That I actually want Sublime Text to stop responding for 5 mins while searching for a single space character across my entire project.

I'm just waiting for the inevitable article titled "Falsehoods Programmers Believe Lists Considered Harmful."

Is everything we believe about everything wrong?!

"All models are wrong, but some are useful" (generally attributed to the statistician George Box).

A belief, or a system of beliefs, is but a model. It's virtually guaranteed to be wrong. It also may very well serve the important function of being simple enough to handle in-core, while at the same time being close enough to substitute for the real thing.

> virtually guaranteed to be wrong

I would go a step further and say all formal models are proven to be wrong. After all, that's what Gödel and Turing kept going on about.

We can't prove any non-trivial program ever halts or does not halt. In fact, we can't (or don't) prove much about our programs we run anywhere.

All programs are a collection of assumptions. To bring this back to the topic at hand, if all of our search assumptions are useful to some meaningful number of people then it really doesn't matter how many "falsehoods" we trip over. Those falsehoods fall away, becoming mere insignificant edge-cases. Satisfying all people all the time in all cases is a fool's errand.

Articles like this are good at letting you know your blindspots so you can choose your blindspots rather than succumb to them. But don't let it become dogma.

>all formal models are proven to be wrong

Your point certainly holds true for any physical entity as far as we know - probabilistic quantum effects, Heisenberg's Uncertainty, chaotic systems, and all that.

However, if you were to model a theoretical entity, given a few more constraints (like strict computability, which precludes Turing-complete systems), you can indeed have correct models. Alas, in practice this is a rather rare example.

On a related note, a hell of a lot of strife in the world seems to boil down to people insisting that their preferred taxonomy is the correct one, no matter what the context, rather than accepting that taxonomies aren't facts in the first place, they are tools.

On which note, the answer to a list like this isn't necessarily "memorize it and avoid all these problems". The benefit can simply be in making these tradeoffs consciously, so you can judge your model better.

If you're Google, differentiating 'or' as in either from 'OR' as in Oregon is a task you need to take on. But if you're writing a National Park lookup tool, you probably just don't want to worry about that case. In that case it's still worth knowing; you might be able to save users some time by at least showing clearly how you reinterpreted their input.

>The benefit can simply be in making these tradeoffs consciously, so you can judge your model better.

Very much so; engineering is all about choosing the trade-offs, and hopefully improving them in the future. The list also helps with solving some of the unknown-unknowns problem in regard to what the customer expectations may be; even whole new domains of expectations (like immediacy of update, or handling of accented/non-english characters).

Side note:

As far as I can tell, Google got rid of the special-cased "OR" in the general search - right now it's a word, not a predefined/reserved symbol.

They were able to do so by adding an "implicit OR-like" operator between all the words in the query. Not quite an implicit OR, not quite an implicit AND; something a bit more complex in between.

The words of the query get weighted against matches on their own, but also as adjacent words (higher weight) and whole phrases (yet higher weight). All in all, the problem got solved by an improved matching & sorting algorithm, not by somehow smartly detecting when "OR" is meant as "OR", or OR, or or.

The problem got solved in the match scoring/sorting domain, rather than in the query parsing domain.
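As a rough sketch of the idea of scoring single words, adjacent pairs, and the whole phrase with increasing weights: the weights (1, 2, 4), the score function, and the sample documents below are all invented for illustration, and this is certainly not Google's actual algorithm.

```python
def score(query, document, w_term=1.0, w_pair=2.0, w_phrase=4.0):
    """Score a document: single words < adjacent word pairs < whole phrase."""
    q = query.lower().split()
    d = document.lower().split()
    d_terms = set(d)
    d_pairs = set(zip(d, d[1:]))

    # Every distinct query word found anywhere in the document.
    total = w_term * sum(1 for term in set(q) if term in d_terms)
    # Every adjacent query word pair found adjacent in the document.
    total += w_pair * sum(1 for pair in set(zip(q, q[1:])) if pair in d_pairs)
    # The whole query found as one contiguous phrase.
    if q and any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1)):
        total += w_phrase
    return total


docs = [
    "hotels or hostels in portland",
    "cheap portland or hotels downtown",
    "oregon weather",
]
ranked = sorted(docs, key=lambda doc: score("portland or hotels", doc), reverse=True)
print(ranked[0])
```

Note that "or" is treated as just another word here; the document that contains the query as a phrase wins on score alone, without the parser ever deciding what "or" means.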

Probably. Still, nothing says you can't write an alternative article; falsehoods non programmers believe about programming.

That might actually be a worthwhile article that helps programmers communicate effectively with non-programmers.

One that tripped me up a few years ago: non-programmers think that 'strings' are long fibrous things that cats play with. The connection between the word 'string' and the concept of text is not intuitively obvious to people who don't already know the lingo. Seems obvious now, in retrospect.

My life got a lot better when I stopped believing anything and just decided that there are maybe one or two things I'm fairly confident are true. As in, I'm pretty confident that I exist, and fairly confident that you do. But I couldn't prove either of those, and everything else is basically up for grabs.


But how do you deal with the following situations?

-- talking to people, since you have no opinion

-- understanding the people around you, building a mental model of them

-- general confidence

> talking to people, since you have no opinion

That's a big tragedy in our society, that you're expected to have a definite opinion on everything. Myself, I have very few strong opinions, and those that I have I hold loosely. When someone asks, I usually try to sketch the space within which I believe the answer lies (e.g. "I suspect X, but then there's Y and Z, and also V I'm not sure what to do with"). This has a nice side effect of making strongly-opinionated regulars suddenly unsure about their own opinions.

I am also this way by nature but it drives a lot of people nuts so I’ve learned to temper it for the particular audience, expressing confidence appropriate to the context of our shared assumptions.

I don't think the OP said that they have no opinion. Beliefs are a conviction based on cultural or personal faith, morality, or values. Opinions are viewpoints, we all have them, but it's good to be aware that they're not based on facts.

I've implemented a fair number of search engines. Usually the problems are with non-technical people in a project. I've had to coach quite a few product owners and UX designers on the basics of search. There are two issues I tend to have with them: 1) they avoid things that they think are hard that just aren't 2) they are unaware of features that e.g. Elasticsearch would support that are highly relevant to their project and therefore don't plan for using those.

A UX person thinks of search as a text box "like google". However, a lot of search UIs have a lot going on when you start typing and when you get results back: ways to refine search results, "did you mean" (DYM) corrections, breakdowns/aggregations, suggestions, etc. A lot of these features require careful planning and design and are not necessarily easy to bolt on if you don't.

I've also had to do basic things like patiently explaining the difference between sorting and ranking and humbly suggesting that, maybe, having a multi column layout with sortable columns isn't necessarily the right thing for presenting search results where the output is a list of stuff in order of relevance.

Engineers are easier to deal with once you sit them down and talk them through how stuff works.

As with many of the other commenters here, I wonder how many programmers truly believe these things. Maybe as recently as the 90s or 2000s. Maybe developers who are fresh out of school.

But we've had search engines as a major part of our lives for about two decades now. Most of us use one at least daily. We're familiar with the complexities of search engines and how they differ from simply searching a document for an exact string or even a regular expression. Many programmers like me work with tools like analytics and log aggregators that expose the complexities of search to us in a way that's more intimate than the veneers of Google and Amazon.

Maybe I'm just lucky in that my experiences have dispelled these notions of search being easy or simple. But I hope I'm not alone.

Also, there's a disparity between what search is and what your users expect. Technically, I could make a really simplistic "search engine" that amounts to a SQL LIKE query. It may not be good or what users might expect coming from Google/Amazon/etc, but it would be a search engine. (Oops. Looks like my pedant hat slipped back on when I wasn't looking.)

I don't know, the first third of the list contained about ten things I don't think any programmer believes about search, so I gave up at that point.

cough "setup" is a noun, "set up" is a verb </pedant>

I would add to the list of falsehoods:

- customers are always searching for a specific item, rather than an entire category

- customers know that a search engine for one kind of item (e.g. products for sale) won't also search the entire rest of your website

One major annoyance: it's hard to search for any topic related to C programming online; one has to wade through mountains of results on C++ and C#.
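One plausible cause, sketched below: a tokenizer that keeps only alphanumeric characters collapses "C", "C++", and "C#" into the same token. Both tokenizers here are hypothetical and written just for this illustration; real analyzers differ, and some engines do special-case these terms.

```python
import re


def naive_tokens(text):
    """Keep only runs of letters and digits, as many simple analyzers do."""
    return re.findall(r"[a-z0-9]+", text.lower())


def symbol_aware_tokens(text):
    """Like naive_tokens, but keep trailing '+' and '#' attached to a token."""
    return re.findall(r"[a-z0-9]+[+#]*", text.lower())


for title in ["C pointers", "C++ templates", "C# generics"]:
    print(naive_tokens(title), symbol_aware_tokens(title))
```

With the naive tokenizer all three titles index under the token "c", so a query for C cannot tell them apart.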

> Choosing the correct search engine is easy and you will always be happy with your decision

I laughed, but I don't think this is a correct representation of something many programmers genuinely believe. It's worded in such a way that it's clear this is a joke. Not sure if I should read the full list if it's just going to be jokes like this one.

Yeah so that one is kinda a niche search engineer joke of the old Solr vs Elasticsearch battle that's been going on in the space for years. Sorry that some of the tongue-in-cheek-ness turned you off, but many of these items resonate closely with those of us in the search/relevance engineering space.

How many genuinely unique search engines are there really to choose from? (Not counting those based on the same underlying libraries)

Adding some shameless self promotion :)


My favorite is "languages don't matter and I can just throw text in there"

"Once setup, search will work the same way forever" - I don't know a single programmer who believes this about any software.

> Regular Expressions have minimal performance impact

REs and FSMs equivalent.

For real? Is RegEx actually an FSM behind the scenes? Or are you trying to say something else?

Yes and no. The theoretical "regular expressions" are indeed Type-3 grammars in Chomsky's hierarchy.

In practice, the common "RegEx" implementations add a lot of extras that break the theoretical backing and also exhibit highly non-linear behaviors. Cf. this excellent paper by Russ Cox: https://swtch.com/~rsc/regexp/regexp1.html

Thank you for this reference!

also thank you for this. I really need to study Chomsky's UG stuff.

Ha. I like the way you think!

Textbook regular expressions correspond precisely to DFAs, so they’re definitely a type of FSM.

Most Regexp implementations in the wild are more powerful than textbook regexps, so they not only encode all languages accepted by DFAs, but can also encode other languages. E.g. back-references are not a feature of regular languages.
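A small demonstration of the back-reference point using Python's re module. The language "n a's, then b, then the same n a's" is not regular, since a finite automaton cannot count an unbounded n, yet a back-reference matches it. The same_count helper name is made up for this example.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of "a"
# are forced to have the same length.
PATTERN = re.compile(r"(a+)b\1")


def same_count(text):
    """True when text is n a's, a single b, then exactly n a's (n >= 1)."""
    return PATTERN.fullmatch(text) is not None


print(same_count("aaabaaa"))  # equal runs on both sides
print(same_count("aabaaa"))   # unequal runs
```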

Depends on the implementation, but I believe grep actually builds a Finite State Machine from its input regular expression. More complicated (non 'regular' regex) engines don't use this approach, but in theory the two are equivalent.

This equivalence is one of the fundamental findings of CS, and exposure to this concept is pretty much mandatory for acquiring a degree in the field. Sadly, this perspective is not often shared among bootcamps and autodidacts, even though it's moderately documented in https://en.wikipedia.org/wiki/Regular_expression#Deciding_eq...

But the more mindblowing aspect is that you can use nondeterministic Finite Automata for the same purpose.
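A minimal sketch of the NFA approach for one textbook regex, (a|b)*abb. The transition table is written by hand for this example rather than compiled from the pattern, and Python's re is used only as a cross-check. The simulation tracks the set of states the automaton could be in, so it reads each input character once and never backtracks.

```python
import re

# Hand-built NFA for (a|b)*abb: state 0 loops on a/b, and "guesses"
# the start of the final "abb" by also moving to state 1 on "a".
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}


def nfa_accepts(text):
    """Simulate the NFA by tracking every state it could currently be in."""
    current = {START}
    for ch in text:
        current = set().union(*(NFA.get((state, ch), set()) for state in current))
        if not current:
            return False  # no state can consume this character
    return bool(current & ACCEPTING)


def regex_accepts(text):
    """Reference answer from the standard library's regex engine."""
    return re.fullmatch(r"(a|b)*abb", text) is not None


for sample in ["abb", "babb", "abba", "ab"]:
    print(sample, nfa_accepts(sample), regex_accepts(sample))
```

Because the set of current states can never be larger than the number of NFA states, the work per character is bounded, which is the property the Russ Cox article linked above builds on.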

thank you very much.

My bad I forgot the verb, I meant “are equivalent”.

Edit: I don’t understand the downvotes though.

I think the article implies you're a programmer implementing search, not that you're taking an off-the-shelf system and just plugging it in.

Just like "falsehoods programmers believe about websites" wouldn't make sense if you were using Wix...

Algolia looks good, but are there any OSS alternatives for those of us trying to bootstrap a search system?

For a blog/static website, Tipue Search [1], or maybe Datasette [2]. There are Pelican [3]/Jekyll [4] plugins for the former.

[1] http://www.tipue.com/search/

[2] https://24ways.org/2018/fast-autocomplete-search-for-your-we...

[3] https://github.com/getpelican/pelican-plugins/tree/master/ti...

[4] https://github.com/jekylltools/jekyll-tipue-search

Pretty much none of these are things programmers believe about search.

Putting limited effort into creating a mediocre search feature doesn't mean that you believe these falsehoods; it just means that you're too resource constrained to put serious investment into creating and improving a high quality search feature.
