I didn't submit this, and I didn't know anyone was submitting it, or I would have written a post about why it is different from other search engines. Whether it is better is still up for grabs because it is new.
Samuru doesn't use link authority. It analyzes pages, matches your query to the types of pages, and picks the best matches.
Let me give you an example.
You search for "How to Make cupcakes"
Google says, "Give me the pages that have the most inbound links (an oversimplification) that contain all those words."
The winner is Brandon's Cupcakes (not really but play along for a minute) because it says, "We know how to make the best cupcakes, because we have been doing it for 25 years"
That is not a useful result. Samuru on the other hand says "how to make cupcakes is a search for instructions" and it looks for pages that match the words, and are written as instructions.
We weigh other factors too, such as whether an author is associated with the article and whether they routinely write about the topic.
We do this for reviews, products and other things as well.
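A minimal sketch of the query-type to document-type matching described above. The cue phrases, the numbered-step heuristic, and the two sample pages are illustrative assumptions, not Samuru's actual rules or score factors.

```python
import re

# Hypothetical cue phrases; a real system would use far richer signals.
INSTRUCTION_QUERY_CUES = ("how to", "how do i", "steps to")
# Lines that start like "1." or "2)" or "Step 3" are treated as steps.
STEP_PATTERN = re.compile(r"^\s*(?:step\s+\d+|\d+[.)])", re.IGNORECASE | re.MULTILINE)


def query_type(query):
    """Guess what kind of document the query is asking for."""
    q = query.lower()
    if any(cue in q for cue in INSTRUCTION_QUERY_CUES):
        return "instructions"
    return "general"


def instruction_score(text):
    """Count lines that look like numbered steps."""
    return len(STEP_PATTERN.findall(text))


def rank(query, pages):
    """Order pages by keyword hits, boosting step-by-step pages for how-to queries."""
    words = re.findall(r"\w+", query.lower())
    wants_steps = query_type(query) == "instructions"

    def score(page):
        body = page["text"].lower()
        keyword_hits = sum(1 for w in words if w in body)
        bonus = instruction_score(page["text"]) if wants_steps else 0
        return keyword_hits + bonus

    return sorted(pages, key=score, reverse=True)


pages = [
    {"url": "bakery.example", "text": "We know how to make the best cupcakes."},
    {"url": "recipe.example", "text": "How to make cupcakes\n1. Mix the batter\n2. Bake for 20 minutes"},
]
best = rank("How to Make cupcakes", pages)[0]["url"]
```

Both pages contain every query word, so the page written as numbered steps wins only because the query was classified as a search for instructions.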
To be a full replacement for Google we need driving directions, image search, and a lot of other things. But in order to do all the other things we are doing (related content, analysis, speed testing, building a corpus of words), we needed a search engine.
Responses get better if you search something someone else has searched or do a second search 30 seconds later. This is because we haven't deep indexed the entire Internet yet, and so we don't have all the deep data.
Re: your portrayal of how Google works...An "oversimplification"? It's just plain wrong. Google has, for quite a while, not depended on sites containing all the words of a query...and natural language processing plays a huge part in analyzing the intent of a query.
I applaud this ambitious project but I'm skeptical you'll achieve what you aim for if you're way off the mark in understanding how Google is so successful...I mean, to even talk of replacing Google at this stage -- and saying it's just a matter of providing rich snippets and other ancillary features as if that was your engine's main deficiency compared to Google -- is quite bold and a little cart before horse, IMO
-
Edit: an example...I did a search for my own name, something I do habitually because I'm locked in an eternal struggle with a younger, better looking, more talented namesake for the top Google result. However, your search engine returns neither me nor my singing rival as the top result...instead you return the domain that is my first and last name with a hyphen, which is exactly the superficial result that Google was designed to avoid.
Right, but the success of a technical product is often based on results that can be tested and verified, and less on arguments from authority, though if Cutts is willing to say something like, "Wow, the OP has created something that surpasses Google in [whatever metric]", I admit I'd take his word on it.
But the OP's claim stands on its own and makes assertions that can easily be verified. Are you arguing that Google's search engine is as simple and literal as the OP claims?
Google Brandon Wirtz Greatest Living American. I am pretty good at understanding how Google works.
While Google gets things right without all of our language stuff, a lot of that is because they have user data about what people are clicking on and which results they come back from after reading.
That data means that if they can get the stuff onto the front page, they can "crowd source" the rest.
We don't have that user base for a feedback loop. We have to get our results entirely based on software.
So you think Google knows if a piece of content is instructional?
Or know that it is a Review? (not just has a rating)
And you have to have all the words or a synonym of the words. But more importantly Google doesn't know what kind of content something is. Or what questions the content answers. Our system knows "this document answers what do aardvarks eat"
I will happily concede that you and your colleagues know more about search than I will ever know, and so I find it strange that we're having this argument...your perception of Google's limitations seems so far off that it's as if you've mistakenly referred to them instead of AltaVista.
To answer your questions, yes and yes, Google can derive the meaning of my search without relying on literal interpretation of the search terms. In fact, Google can return what I want even if I deliberately spell every word in the query incorrectly:
"couk besr ribz" instead of "cook best ribs" brings up recipes of how to do good ribs:
I'm willing to take you at your word, that there is a better way to interpret context and meaning from a search query than however Google does it...but if that is the entire raison d'etre for your search engine, can you come up with at least a few examples where this is the case? The "cupcakes" example you've contrived is clearly hypothetical (and not at all close to what happens in reality), and the ones that I've tried don't seem to show any improvement on human-friendly results. Which is not to say that samuru is a bad product...what you claim to do is incredibly difficult and is exactly the feature that makes Google such a useful, ubiquitous engine...I would love to be surprised but I'm skeptical that a new engine with a fraction of Google's processing power, never mind the resources for test engineering and algorithm design, can compete with Google here...this isn't a "well, why hasn't anyone ranked search queries in this way before?" in the same way that PageRank/BackRub was 15 years ago...Search engines have been analyzing queries for intent, and their shortcomings in this area are due to it being a very hard problem, and not for lack of desire.
Either way, it's a deliberately contrived example to show that Google was more sophisticated than the GP indicates...I have no comment on whether amazingribs.com really does have the best ribs preparation tips
While we get the same top result for this, we also show that WikiHow.com linked to the Python Docs. So you can see closely related articles grouped together.
But Google requires the words or the synonyms to be on the results page. So do we, but we know that a search for http://www.samuru.com/?q=cook+bbq+ribs is not just about how to make BBQ ribs, but about how to make the best BBQ ribs, because this is a subjective topic: I like honey BBQ, someone else likes mesquite. That factors into our results.
Regional variations might account for the differences. Do you live in a place where people don't eat ribs? (If so, please accept my sincere condolences)
Google can't tell that something is instructions. That's not something they do. It may know that ehow has a lot of pages with How To on them, but it doesn't know if those are pages with step by steps for how to do something, nor does it know that Rotten Tomatoes has pages with Opinions and Points, and conclusions that make up a review.
For the sake of realism only: any basic supervised machine learning algorithm does that, given a properly labeled dataset. I have built many. I can assure you this is so classic that Google has it, and that many other companies have it. And there are much better and more accurate solutions in place for combining all signals of this kind.
Well their move to semantic technologies e.g. Schema.org is a step in the direction of understanding these things better. Like if you mark up the page with http://schema.org/Review wouldn't you agree that they know it's a review?
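To make the Schema.org point concrete, here is a small sketch that checks whether a page declares itself a review through microdata. It uses only Python's standard `html.parser`; the `declares_review` helper and the sample markup are made up for illustration, and the check shows only that the page claims to be a review, not how any search engine actually uses that claim.

```python
from html.parser import HTMLParser


class ItemTypeCollector(HTMLParser):
    """Collect every microdata itemtype declared on the page."""

    def __init__(self):
        super().__init__()
        self.item_types = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "itemtype" and value:
                # itemtype may hold several space-separated URLs.
                self.item_types.extend(value.split())


def declares_review(html):
    """Return True if any element is marked up as a schema.org Review."""
    collector = ItemTypeCollector()
    collector.feed(html)
    return any(t.rstrip("/").endswith("schema.org/Review") for t in collector.item_types)


sample = (
    '<div itemscope itemtype="http://schema.org/Review">'
    '<span itemprop="reviewBody">Great ribs.</span></div>'
)
is_review = declares_review(sample)
```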
But more importantly Google doesn't know what kind of content something is. Or what questions the content answers.
I'm not sure I agree with this. Google may not know it explicitly but it can effectively know it, perhaps in some cases even better than the average human would if given the same search keys. It is the difference between knowing what a word means by the way people actually use it versus by looking it up in the dictionary.
Google essentially crowdsources its results, taking advantage of the fact that people entering similar queries probably have similar intentions. If your users are only using keywords, with little regard for word order, I don't see how you can do better than Google on average. You may do better for obscure queries where the best search result cannot be easily inferred from the keywords present, but how many pages are like this? Furthermore, if there is a set of keywords where semantic analysis suggests page A is the best result but the page that most people actually click on is page B, which do you return first? What if your index doesn't even have B? This is a challenge you will face when trying to do better than Google on average. Nevertheless, I applaud your work and will definitely keep Samuru at the ready for the queries that Google struggles with.
Who said we don't use similar search to infer intent?
We can't really use the click-throughs until we have users. We need feedback to improve results. But this also powers the related search in our TLDR Products ( http://www.tldrstuff.com ), and those have to work with much more abstract queries, because they aren't user queries, they are generated queries.
I'm not saying that you don't use word order. I think you must to some extent if you are doing a comprehensive semantic analysis of the keywords and websites. I'm saying that your users may not use correct word order because they are lazy and because they are used to the behavior of search engines that don't require it but still give okay results.
Perhaps you have solved this problem, but I just don't see how you can offer better results than Google on average (without having a similar sized index) when users are just throwing together a bunch of words related to what they are looking for. It seems to me that if we want to take advantage of systems like yours for search and if we want to get better results than Google, we need to change users' behavior; they need to learn to give more precise queries.
I think technology should strive for Zero Learning Curve. If I type "Chicken Chord On Blue" it should figure out that I probably need to know that I spelled it wrong, what it is, and how to make it. A user shouldn't have to know the answer they are looking for in advance.
We have that issue because it turns out many brand pages don't have any content. We are still balancing the domain bonus, but think how many home pages for brands have no text. No text means we have nothing to analyze.
When you are building a technology, you have to isolate and test. We use indicators that are harder to game than inbound links like traffic. But it is a balancing act. We have just shy of 100 score factors we can tweak, and getting them right takes a bit of time.
I am enjoying it. We went with the approach of how can we make this impossible for Brandon Wirtz to game. How can we make this about the content more than a popularity contest.
Now that we are both in the business of stopping spam we should grab lunch sometime.
Quick comment on the interface. Looks like the initial page is optimized for 1024x768. I'm on a netbook at 1024x600 (even less since I'm not at fullscreen). While I realize that I'm in the minority, it looks like the issue is that everything is positioned absolutely. The only reason that the bottom of the page is cut off is that there is a bunch of empty space between the top of the page and the logo: http://imgur.com/vDFwY8b
I realize that this is a bit of a nitpick, but I felt the need to mention it.
This is really strange. I just searched for "how to ride a bike" and the first links from Samuru are completely useless whereas the first link from Google is exactly what I wanted, instructions from wikihow. How do you explain that?
Pre-Google, this is pretty much how search engines worked... by analysing page content and weighting that rather than the network of links around the page.
Having just played with it, it feels both backwards and refreshing to go back to that. The results are different enough to feel good for the terms I used.
Other features I should have mentioned:
Threaded results. If a result is cited by other results, they will be grouped so that you can see the conversation across sites.
Better Social Media integration. We do Facebook, Twitter, Google Plus not just Google Plus for showing authors.
Voice Input if you are on Chrome 25 or higher.
Results are returned with Summaries not Snippets.
With that, I am falling asleep. I have enjoyed answering questions on this and the https://news.ycombinator.com/item?id=5579336 thread, but 5 hours of it has worn me out. If you leave comments, I promise to get back to them.
Hi, congratulations. I really like Samuru, even if it's not perfect. I wanted to ask you two questions:
1) Are you sure that giving a "bonus" to domains containing part of a query is a good idea? I understand the reason behind it, and I know that you need time before you can turn this "bonus" off, but in the meantime, are you really sure it is a good idea?
From the third position onward, though, the results seem great.
2) How does the search suggest work?
I'm a French user, and in our language we have a lot of accents, like "é è ù à". Many people do not use them when typing a search query. When I correctly type a query with the accents, Samuru suggests the same query but without accents. This is wrong, and that's why I'm wondering where the data used by the search engine to provide these query suggestions comes from.
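One way to handle the accent problem described above is to match suggestions on an accent-folded key while always displaying the stored, correctly accented spelling. The sketch below uses Python's standard `unicodedata` module; the `suggest` function and the sample queries are assumptions for illustration and say nothing about how Samuru's suggest actually works.

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks so that 'é' and 'e' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest(typed, known_queries):
    """Match on the folded form, but return the stored spelling with accents intact."""
    key = fold_accents(typed).lower()
    return [q for q in known_queries if fold_accents(q).lower().startswith(key)]


# Hypothetical suggestion data.
known = ["crème brûlée recette", "creme solaire"]

without_accents = suggest("creme b", known)
with_accents = suggest("crème b", known)
```

Both spellings of the typed prefix find the same stored query, and the suggestion shown keeps its accents either way.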
It is now very hard to get Google results that contain all your search terms, which is why I am starting to dislike it... For example, it gives you "synonyms", or the terms are completely missing... I will certainly give yours a try.
We will as more people use us. We think that because we provide a summary of your page rather than a snippet, we drive more traffic if you are deserving of it. Snippets don't really "sell" your content or your writing style. Summaries do. We think that by giving people more insight into what to expect, rather than part of a few sentences with the keywords you searched for in them, we help users make better decisions about what to read.
OK. My first try was to search for "plato dialogue concerning friendship". Google gave me the result I expected (a reference to the Lysis dialogue) through wikipedia and a bunch of articles about it (the most helpful being a link to the Stanford Encyclopedia of Philosophy, ranked third). It didn't link the text though in the first page (it only appears in the third page of results, with a copy at the MIT Classics archive). Samuru gives me a bunch of general articles on Plato first (oddly enough, the first results are articles from the SEP, but not the article on "Plato on Friendship and Eros".), some noise and then information on the Lysis. The text itself appeared at 25th place.
Something I find interesting is that one of the snippets samuru gave me (on the 5th result) has a pretty good description of the lysis as the item most likely to be the "plato dialogue concerning friendship": "the dramatically later Lysis presents Plato's more developed understanding of love and friendship than the dramatically earlier Symposium and Phaedrus". From this description of the Lysis one could gather that the text of the Lysis itself should be a very relevant result to the query; at the very least, that information about it should be weighted as more relevant to the query than info on the Symposium or the Phaedrus, and then info on those over all else. From this, I think, one could build a better representation of a good answer to the query than in google or samuru.
I think natural language analysis is very promising here. I hope work on this area yields good results, but it seems like a hard problem.
Counterpoint. I was surprised by it. A couple weeks ago I decided to start recording any search phrases which I felt were tricky or required good language modelling.
"baby features kept in adulthood" is the only one I've thought worth recording so far. You can compare the results in Google, Bing, DDG. Only Samuru and Google have it on page 1. Samuru has it as the first result. But this is just one example so I can't draw any conclusions. Curious to see how well it performs in general.
Just for the record: I didn't want to sound dismissive of samuru. I actually think the results I had were not bad at all (the fact that it gave me a link to the lysis in the front page was great UX-wise, even though it is worth pointing out that samuru yields more results per-page than google). The case I pointed at was simply a case where I thought information could be gathered from the dataset that would lead to better results than those presented by samuru, or google (I also tested bing, but the results imho were poorer than google's).
"baby features kept in adulthood" is a weird one too. You are right samuru yields the best result in first place if one meant to get info on neoteny, but then, on first sight, it is the only relevant result in the first page. And the same thing happens with google.
About 1 month ago I switched from google to bing. There are different queries that I use to measure 'better'.
For simple queries 'strncmp', 'giraffe', 'sound transit schedule' ...
Google, Bing and Samuru perform pretty well. But Samuru is extremely slow.
For more complex queries like, 'seattle dumpling restaurant that is famous in singapore' or 'how to zip a list in ruby'. I find that Google always comes out on top, bing lacks the previous search history to personalize my searches and often thinks I mean (zip as in zipfile)... But samuru gave me relevant results for all three which is rather surprising.
Another type is one for people/social related searches... Bing's facebook/twitter/linkedin/yelp integration actually makes it better than google because the 'snapshot' bar it has is super helpful. However Samuru results are on par with Google and Bing results here (minus the snapshot bar).
Overall I was skeptical, but other than it being unbearably slow (Google spoilt us with speed), Samuru does have very good search results for what I assume is not a multibillion dollar product.
I like slash tags and Booleans, but the truth is search should work without the need for those things. We support - to make things go away. Later we will expose some of the cooler things we do behind the scenes, like "reviews" or "instructional" or "oped" searches, or "Simple English", but we want to do that in a way that doesn't require syntax.
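For reference, the one piece of syntax mentioned here, a leading - to exclude a term, can be sketched in a few lines. The `parse_query` and `matches` helpers below are hypothetical and use plain whole-word matching, which is an assumption rather than a description of Samuru's matching.

```python
def parse_query(query):
    """Split a query into required terms and terms prefixed with '-'."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-") and len(token) > 1:
            exclude.append(token[1:])
        else:
            include.append(token)
    return include, exclude


def matches(text, query):
    """True if the text has every required term and none of the excluded ones."""
    include, exclude = parse_query(query)
    words = set(text.lower().split())
    return all(t in words for t in include) and not any(t in words for t in exclude)


parsed = parse_query("jaguar -car")
```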
"If it requires syntax it isn't user friendly" is our internal battle cry.
Yeah, it's definitely the case that users don't want to learn or type syntax in a search engine -- Daniel Russell of Google says that a majority of searchers think they are advanced users of search engines, but a majority of them don't know about or use "" or -.
That's why blekko and izik both invoke that syntax "under the hood", automatically -- starting in November 2011.
Do your search a second time. Our index isn't exhaustive yet, and we are slow if we haven't seen enough of the results pages before. We generate the summaries and a bunch of other things after something has appeared in a search result. This is because we aren't a billion dollar company and have to be efficient in our indexing.
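The "slow the first time, fast the second time" behavior described above is what you would expect from generating summaries lazily and caching them. A minimal sketch, assuming an in-memory dictionary as the cache and a placeholder summarizer that simply keeps the first sentences (both are illustrative stand-ins, not Samuru's implementation):

```python
summary_cache = {}


def summarize(text, max_sentences=2):
    """Placeholder summarizer: keep the first few sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."


def get_summary(url, fetch_text):
    """Build the summary the first time a URL shows up in results, then reuse it."""
    if url not in summary_cache:
        # Expensive path: only taken on the first appearance of this URL.
        summary_cache[url] = summarize(fetch_text(url))
    return summary_cache[url]


fetch_calls = []


def fake_fetch(url):
    """Stand-in for downloading and extracting page text."""
    fetch_calls.append(url)
    return "First sentence. Second sentence. Third sentence."


first = get_summary("a.example", fake_fetch)
second = get_summary("a.example", fake_fetch)
```

The second call returns the cached summary without fetching the page again, which is why repeating a search is faster.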
I don't understand. Is this spam? There is no context or accompanying article for the claim. I searched my name and the results weren't nearly as good. One data point, sure, but first impression is everything.
Edit for context: original title read: "This search engine is better than google."
Nope it doesn't. We decided that it was hard enough getting advertising without having "adult" search. We focus on text analysis so we aren't very good at porn searches.
I need to look why you didn't get a message saying we don't do those kinds of searches.
Disabling ads is actually pretty hard. We had Adsense running until we got kicked for having results on "Jail Bait" those two words alone are not dirty. But I didn't focus on building long lists of dirty topics so we were returning results on that.
Google is excellent. Bing is also excellent (with minor differences). DDG and Blekko are adding interesting and useful features.
But they all feel a bit like they're a mono-culture, and thus vulnerable to gaming. Black-hat seo seems to be something that Google is pretty good[1] at dealing with. White hat SEO and ads have changed the web drastically from what I remember.
So it's really nice to have an alternative method of search that searches in a different way. Your post (https://news.ycombinator.com/item?id=5580321) highlights a few things I find frustrating in search at the moment.
[1] It's odd that all the work they do isn't noticed.
I'm willing to have an open mind about this, but I think some sort of explanation on what samuru is hoping to achieve in distinction from other search engines would be helpful.
Exactly. Building an MVP of a search engine is hard, so it is okay for the quality of results to be lacking when you first launch, on HN at least. Even DDG is still only trying to catch up.
But to keep the engine running, and to keep hackers interested, you should say what distinction samuru is trying to achieve with its search engine.
And perhaps this query http://www.samuru.com/?q=porn should not be blocked by default, rather provide tools for safe search. Heard of the porn cookie guy? Just copy his footsteps, I'd say.
Summaries instead of Snippets.
Document Type to Query Type Matching (looks like you are looking for a review we favor reviews, looks like you are looking for instructions we favor how to's)
I wonder why you decided to follow the same old search formula. There is so much to innovate in this area. For example, Nuuton uses #hashtags for trending results. Say you go and make a search. All related terms would appear as #hashtags somewhere in the page. These are created by the users and by the system. It also uses the / and the ! to filter results in different ways. Say: /Honda !modified, gets you pages of modified Hondas. Click on a #hashtag, say #turbocharged, and you would get turbocharged Hondas. Why so many tools that appear to do the same thing? They are close in functionality, but affect different factors in the back end.
How do I teach 2nd graders, their 62 year old teacher, and my mom to do those things?
We are focused on making interfaces that are Zero Learning Curve. Our goal is to allow you to ask for what you want and get it without having to know how to ask.
Selling your own ads is hard. Especially at low volumes getting started. So your choices are basically Google and Microsoft. (Chitika doesn't pay anything)
In what way? Writing something that meets our qualifications for "what is a review" is much harder to game than link spamming. You can game the system only by writing content that is useful to the user.
The only easy-to-game part is that we give brands a pretty big bonus for themselves. Sony.com/playstation will always be the top hit for Sony PlayStation, even if we should favor a .gov result that says they are recalled for bursting into flames. But as that rarely becomes an issue, we are ok with that being number 2.
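A rough sketch of how a brand or domain bonus like this can dominate a ranking. The weight `BRAND_BONUS`, the content scores, and the example URLs are invented for illustration; the real system is described as balancing close to 100 factors.

```python
from urllib.parse import urlparse

BRAND_BONUS = 5.0  # hypothetical weight, chosen only for this example


def brand_bonus(query, url):
    """Award a bonus when a query term equals one label of the host name."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    terms = query.lower().split()
    return BRAND_BONUS if any(t in labels for t in terms) else 0.0


def rank(query, results):
    """Sort (url, content_score) pairs by content score plus brand bonus."""
    return sorted(results, key=lambda r: r[1] + brand_bonus(query, r[0]), reverse=True)


results = [
    ("http://recalls.example.gov/playstation", 3.0),  # better content score
    ("http://sony.com/playstation", 1.0),             # brand's own page
]
ranked = rank("Sony PlayStation", results)
top = ranked[0][0]
```

Here the brand page wins despite a lower content score, which mirrors the trade-off described above and the domain-match concern raised elsewhere in the thread.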
There seems to be a strong emphasis in search results on domain name match, similar to Google several years back. e.g., search "dog training" and examine the results - there's a much higher mix of spammier content mixed in with helpful content than you'll see in the other big name search engines' results.
Anyway, keep cracking at it; I'm sure you'll get it sharper as you go.
"Su samuru" (literal translation: water sable) is Turkish for otter. Turkish is an agglutinative language; that last u is actually a possessive affix and doesn't make sense when the word stands by itself.
We run on Google AppEngine. So we use Google for a lot of things. Building all of the pieces that make google is more than 10 people can do in a year. We have 5 developers. And most of those only came in the last 6 months. We may build analytics, but it will be a while.
Doesn't seem to tailor results to your location so might not be as useful for people outside of the US? Or did I just try a stupid search? I performed a vanity search and it was listing different names before there was anything about me. Same search in Australia on Google has me in four of the top six spots.
Google has 100 people searching everything that can be searched. We have to do the work when you do the search. We get faster the more people use us. Exponentially.
How can they trademark the words "Liquid Helium"? The first search I did on Samuru was for Liquid Helium and it brought back about a half million results, all of which I assume are violating its purported trademark.
You don't have to submit; we will find you. We have a bot. I apologize, I don't recall the user agent at the moment... It comes from a Google IP address, since we are running on Google AppEngine. So we have less control over the bot's user agent than I would like.