But for some crazy reason, I kinda like this. It feels like the 90s internet. The links included so far have that same random mix of lots of nerdy links, homepages & personal blogs, a few religious sites, and the occasional big news website. Because there's no crawler yet, it's limited to the specific pages people thought were noteworthy. And because the index is so limited, I'm stumbling on interesting things.
It's so weird looking at this and thinking "Y'know, maybe this could also work if the links were curated into yet another hierarchical officious oracle", or "if this site let me pay to show a small text ad on the side when someone searched for a relevant keyword, I might spend a few dollars here".
Someone submitted the "Strawberry Pop-Tart Blow-Torches" page, which is one of my earliest internet memories. Whoever submitted that, thank you for the nostalgia!
This thing isn't slaying anything.
If this were really like AltaVista, I'd get 3 trillion results and have to use advanced Boolean logic to cut that down to the most useful 7,000 - so I guess having no results is sort of easier...
Since the index had only five or six entries a couple of hours ago, I set the matching to be wide instead of narrow. I'm also experimenting with loading the model with phrases only, phrases and words, or words only. I might have f-ed up the query parsing because of that. Remember, this is 0.1, fresh off the press.
Searching the tree is here: 
Tokenization is here: 
Thanks for the link.
Stop and ponder that for a moment.
 First Blood
 Rambo: First Blood Part II
 Rambo III
(the best one in the bunch)
Did you see that South Park episode where they try to pick a name for their startup but all names are taken? It's very, very funny.
So yeah, I want the perfect name for this. What _is_ the perfect name?
And maybe also the 'you'?
1. Does the site do any crawling on its own, or is the public index only fed from submissions?
2. It appears Umlaut/Unicode handling needs some work: When I search for "Käse" (German for 'cheese'), I get the response "0 results for 'Käse' in 'www' (0 ms)".
At this point I'm not sure whether there are actually 0 results or whether it was searching for the escaped string. (A sketch of the decoding I'd expect is below.)
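If it's the latter, the fix may be as simple as percent-decoding the query to UTF-8 before it hits the index. A minimal sketch of what I mean, assuming the term arrives URL-escaped (the raw value here is illustrative):

    # Hypothetical sketch: decode a percent-encoded query to UTF-8 before
    # matching, or the index gets scanned for the literal escaped bytes.
    from urllib.parse import unquote

    raw_query = "K%C3%A4se"       # what the server may actually receive
    decoded = unquote(raw_query)  # -> "Käse"
    print(decoded)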
1. You may submit a page. When I have a little more capacity than just 1 CPU/1 GB RAM, I will also crawl.
2. I'll look into it. Thank you.
That's why Google wants every drop of data.
You are right though, the index had only a handful of entries an hour ago.
Is there any further technical documentation than this (besides the source code)?
I tried searching some of the terms in this description on Google, but found little specific information. One search turned up k-d trees. Is this related?
In broad terms: it's a 16-bit vector space (65,536 dimensions) in which you can encode anything you like. I have chosen to encode phrases and words as bags of characters. This separates terms from each other enough that they can be searched for reliably (in almost all cases).
Terms that share a character have vectors that intersect, and we can measure the cosine of the angle between them. That's the score.
That is represented as a binary tree.
A scan of the tree gives you the closest match and an address into a file on disk: a list of document IDs.
At query time, Boolean logic is applied to the result (document ID list) from each query clause (AND/OR/NOT key:value).
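A rough sketch of that scheme (illustrative Python only, not the project's actual code; the real model lives in the 65K-dimensional space and is stored in a binary tree on disk):

    # Illustrative: encode terms as bag-of-characters vectors and score
    # candidates by the cosine of the angle between the vectors.
    from collections import Counter
    import math

    def bag_of_chars(term):
        # each character is a dimension; its count is the component
        return Counter(term)

    def cosine(a, b):
        dot = sum(a[ch] * b[ch] for ch in a)  # shared characters intersect
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = bag_of_chars("cheese")
    for term in ("cheese", "chess", "see", "banana"):
        print(term, round(cosine(query, bag_of_chars(term)), 3))

    # At query time, each clause resolves to a set of document IDs that
    # Boolean logic then combines:
    a, b = {1, 2, 3}, {2, 3, 4}
    print(a & b, a | b, a - b)  # AND, OR, NOT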
I'll write something up.
It would be surprising (to me) to get the same results for e.g. "strange" and "garnets".
A secondary index of the most popular terms might be needed, to resolve which anagram is the right one.
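To make the collision concrete: under a pure bag-of-characters encoding, anagrams produce identical vectors, so they score a perfect cosine of 1.0 against each other:

    from collections import Counter

    # "strange" and "garnets" are anagrams: identical character bags,
    # hence identical vectors and a cosine similarity of exactly 1.0.
    print(Counter("strange") == Counter("garnets"))  # True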
Love the ambition, but a long way to go.
I think it's problematic to have random people submit to the index with no incentive. I'm just becoming interested in tai chi, for instance, but I don't run any such webpage (and site owners are the ones who usually submit). There might be a way to gamify or otherwise incentivize people, but that's a very non-scalable approach. Really, only automated crawling can significantly widen your index. It's just very resource intensive... but good luck! I hope you can go far!
"There might be a way to gamify"
I hear you. First of all, you guys aren't random people to me. You're my favorite internet people.
There are already some hundred entries in the index, all from you guys. If I analyzed the contents right now it would probably tell me something about us, as a group.
One of the entries is pornhub.com. We have at least one male in the group.
Maybe organic growth of the index has already started. And once I teach you how to use the public HTTP API and not just the web GUI, perhaps you will all start to see how useful this service already is. And it will grow even more.
Someone just donated 5 servers, big ones. Didyougogo will be around for a while, at least.
I searched for "apple". The top result was the archive.org Mac OS page that showed up here on HN recently; the 2nd and 3rd were apple.com, indexed 10 seconds apart.
Then some odd results, though they do include the word "apple" on the page, just once. The IMDb page for 12 Monkeys appears 3 times.
I guess you're not trimming duplicates? Seems like you need some way to weight rankings, too.
I wish you every success - search definitely needs some competition.
Did you submit both a query and a URL?
Did you go go?
Thanks for submitting.
Still, all that being said, I agree with the idea of erring on the side of safety. But either way, what you do in the privacy of your own device isn't really constrained by licenses, so of course there's no reason you couldn't just start working on it now if you wanted to, and then worry about distribution and such when the license itself changes. Sort of a "fair use" type thing, IMO.
I just added an MIT license. Not sure that was the right one, but to be clear: I want anyone to be able to fork it and run a business/do whatever with it, without me being able to sue them. At no time can I sue them.
The more forks the better. As long as they adhere to certain principles, like not destroying the current HTTP APIs, they will all be able to talk to each other, which is how I would like this to scale.
By having many people running search services, load and storage will be distributed.
Why would they run a search service? Well, they might need one for their site, and once it's up and loaded with your content, you can start to query it for data that you don't host. Others host it. I host a "www" index. You might host a "my_data" index. So you can create queries that span those two indexes.
That's the (business) idea.
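To make that concrete, a hedged sketch of what a cross-index query might look like from the client side; the hostnames, endpoint, and parameter names are all invented for illustration, since the HTTP API isn't documented yet:

    # Hypothetical client of two public search nodes; the endpoint and
    # parameter names are assumptions, not the real API.
    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def search(host, index, query):
        params = urlencode({"q": query, "index": index})
        with urlopen(host + "/search?" + params) as resp:
            return json.load(resp)["documents"]

    # one node hosts a "www" index, another a "my_data" index;
    # the caller merges results from both
    results = (search("http://node-a.example", "www", "tai chi")
               + search("http://node-b.example", "my_data", "tai chi"))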
That's a very interesting idea that I hadn't considered. So basically site owners could host their own nodes that only index their own website. But since the nodes can communicate the end result is an index of many different websites.
Anyway, noted that it's a very early version, so good luck with it!
Yep, I have probably messed up the relevancy a bit because of constantly experimenting with how to load the model/index. Right now I'm using phrases (sentences) as well as words, both extracted during the tokenization process. Initially I used only phrases, because with the current 65K vector-space model that would still match any word to any phrase containing that word. There are perhaps side effects of reinforcing each word like that.
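For illustration, a minimal sketch of a tokenizer that emits both phrases (sentences) and the words inside them; the splitting rules here are simplifications, not the project's actual tokenization:

    import re

    # Emit each sentence as a phrase-level term, then each word in it.
    def tokenize(text):
        for phrase in re.split(r"[.!?]+", text):
            phrase = phrase.strip().lower()
            if not phrase:
                continue
            yield phrase                            # phrase term
            yield from re.findall(r"\w+", phrase)   # word terms

    print(list(tokenize("Hello world. Did you go go?")))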
"long way to go"
I don't think so. The real bitch was to figure out how to maintain a good representation of the language model on disk. How to update it. Remove data from it. Now I anticipate a couple of months fine-tuning the balancing of the tree and testing relevance. From what I have heard so far, relevance is a little sub-par.
Scaling is the next thing. I have a great plan for that of course, mentioned somewhere in this thread.
Is that still yahoo.com?
I like the "Surprise Me" button, where it takes you to a random page from the index. (I got a 90s era Babylon 5 fan page.) It could be interesting if didyougogo added that, but it would need to add a NSFW filter.
I'm not quite sure about the exact privacy trade-off, but for things that I consider non-sensitive, I certainly prefer the non-HTTPS web.
There’s a whole class of traffic I don’t care about, like this guy’s prototype or your mom’s blog or whatever.
And I like segregating stuff I care about vs stuff I don’t.
Also note that with SSL, Google can still do all this, but they have the same pressure my ISP does if they ever try it.
On the other hand, there are downsides to not using it (which have been previously mentioned).
But I think the most obvious downside is that OP is the only one working on this, and any time spent working out SSL is time away from feature development. SSL is not a key feature of OP's product, so there may be other features more important.
Simplicity is an important design principle. There are many things that have "no downside [other than the cost to set up and maintain]" but no clear value driver.
It's quite possible that all the important stuff gets built out before users make the value of SSL really clear.
Well, I'm sold!
Or archive.org: http://web.archive.org/web/20180813020050/http://didyougogo....
(I still miss proper boolean queries.)
I've done it with a static set of data, the UTZoo Usenet data...
Shame it died on the vine; distributed and curated search was a powerful tool in the days of Veronica and Archie.
No it wasn't. I'm old enough to have (tried to) use it, and it was terrible.
It was usually quicker, and got better results, to manually connect to FTP sites and run directory listings on likely directories until you found what you were looking for.
Non-centralized personal search engines have a few challenges to solve before they're feasible.
1) The web cannot support thousands or even millions of spiders/crawlers.
2) Search indexes are (probably?) too huge to distribute. See the Common Crawl project: it's terabytes for a few billion pages.
3) Assuming a single crawler collects the necessary data, indexes can be easily distributed, and the search engine software is simple to set up, who is going to subsidise this effort?
Why would I use this over duckduckgo? (Assuming that we're some time on and the index is comparable?)
I made it public yesterday on https://fts.fail/
Good luck slaying that dragon though.
Also, plans to add HTTPS?
This looks cool, though, good luck!
I know (almost) nothing about search engines but I hope something like this succeeds.
Definitely need a better one.
91 results for 'hello world' in 'www' (32615 ms)
As always, the question is how it scales.
Scaling out technically and socially seems a little bit related. I want to scale out like this: a public search server (node) knows about other public nodes and the semantic topics their data carries. When a node cannot sufficiently answer a query it can reach out to other nodes by looking up a map of topic/list of nodes. Sharding by table/collection can also be solved the same way. That way, people owning public nodes can create queries that span tables they don't even host. They can build analytics using _their_ data _and_ the world's data. That's super-powerful.
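Under stated assumptions (the topic map, node addresses, and the remote call are all hypothetical), the routing could look roughly like this:

    # Hypothetical: a node keeps a map of semantic topic -> public nodes
    # and fans a query out when it can't answer locally.
    TOPIC_MAP = {
        "movies":  ["http://node-a.example", "http://node-b.example"],
        "cooking": ["http://node-c.example"],
    }

    def remote_search(node, query):
        # stub standing in for an HTTP call to another node's search API
        return []

    def route(topic, query, local_results):
        if local_results:              # answered locally: done
            return local_results
        results = []
        for node in TOPIC_MAP.get(topic, []):
            results.extend(remote_search(node, query))
        return results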
Searched Red Dead Redemption 2 - no game info
Searched "bobs" - no bobs
There was a (failed) attempt by the EU I know about. And I don’t see that happening in the near future.
The US isn't even capable of providing a search interface to its own web sites that competes with commercial offerings (e.g., using Google is better than the sites' built-in search).
The EU attempt was called Quaero, and it wasn't an exact Google/didyougogo competitor, as I think its focus was on video and audio. But they spent at least $99M from 2005-2013 and had absolutely trash results.
It's kind of weird how hard it is for some organizations to do some things. You would think that with a hundred million bucks you would get something. DDG was self-funded initially and then raised $3M, and they are pretty useful despite 30x less funding.
It's unclear to me how I am supposed to help improve this.
There is something wrong currently with relevance, probably because of query parsing errors but perhaps also in how text is tokenized. This whole idea revolves around relevance so this is of course embarrassing. But it's 0.1 alpha. And it _did_ work on my machine.
Thanks for trying.
This project looks neat. I think first experiences with it would be much improved if you could seed it with some content.
Maybe this could run my search against other search engines too, to compare results and gain insights.
Yes, server capacity. Once I have a better hosting situation I'll start crawling.
Thank you, I've tried to be neat this time around.
With regards to full-text search, the didyougogo search engine should be able to replace Elasticsearch (which is laughable relevance-wise, in my eyes) or Solr, once the alpha bugs are gone.
Perhaps HN members might offer some spare CPU cycles.
Maybe a few sites could be crawled, and some sample searches indicated that could be run in the meantime.