DuckDuckHack is now in Maintenance Mode (duckduckhack.com)
209 points by frabcus 98 days ago | 83 comments



As someone who has actively participated in DDH for a while now, here are my views:

- A non-trivial share of recent contributions were "cheat sheets", which IMO required a lot of effort to ensure correctness/usability but don't really improve search results (I don't think I've used the feature more than 3-4 times in the past 1.5 years), so this should free up DDG staff time to focus on the more important instant answers and features.

- The community has been getting smaller and contributing less for a while now, which is backed by data from the official repos (the number of commits over time, that is)[1]. After all, there are only a finite number of instant answers before new ones become redundant.

- The current model for triggers (what determines when an instant answer gets displayed) is quite restrictive: it's just regex-based. IMO, a lot more growth could be achieved using ML models for triggering, A/B testing, etc.
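For context, a trigger in this model is a fixed pattern matched against the raw query. A hypothetical sketch of the idea (names are illustrative; the real DuckDuckHack triggers are declared in Perl):

```python
import re

# Hypothetical regex trigger for a "roll dice" instant answer.
# The answer fires only when the query matches this exact pattern.
DICE_TRIGGER = re.compile(r"^roll (\d+)d(\d+)$")

def matches_trigger(query):
    """Return (count, sides) if the query triggers the dice answer, else None."""
    m = DICE_TRIGGER.match(query.strip().lower())
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))
```

A query like "roll 2d6" fires, but a natural phrasing like "throw two six-sided dice" does not, which is exactly the restrictiveness being pointed at.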

I'm still kind of disappointed by this. Perhaps unrelated, but does anyone have any suggestions for people willing to work on similar open source projects?

[1]: https://github.com/duckduckgo/zeroclickinfo-spice/graphs/con... , https://github.com/duckduckgo/zeroclickinfo-goodies/graphs/c...


Kiwix - most people are too conditioned to think that search has to happen online and don't even realize what is possible offline.

Entire web archives, such as full dumps of Wikipedia and Stack Exchange (including media and search indexes), can be stored locally. The missing piece is Google-level search quality on the local machine. Brute-force substring search can already process gigabytes in seconds, and on enterprise-grade server hardware things are reaching 1000 GB/s. At this rate, there is no reason to think that in a couple of years local search of all recorded human knowledge can't happen on a local device at Google-level result quality.
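The brute-force claim is easy to sanity-check on any machine; a rough, machine-dependent sketch (the numbers printed are illustrative, not a benchmark):

```python
import time

# Build a ~108 MB buffer and time a full brute-force scan for an
# absent needle, which forces bytes.find to examine the whole buffer.
haystack = b"lorem ipsum dolor sit amet " * 4_000_000
needle = b"needle-not-present"

start = time.perf_counter()
pos = haystack.find(needle)  # scans the entire buffer
elapsed = time.perf_counter() - start

gb_per_s = len(haystack) / elapsed / 1e9
print(f"scanned {len(haystack)/1e6:.0f} MB at ~{gb_per_s:.1f} GB/s")
```

Actual throughput depends heavily on CPU and memory bandwidth, but even a laptop typically lands in the gigabytes-per-second range.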

For anyone interested in the search space, look into what's possible today in local offline search.


This is a great observation & seems to dovetail with technologies like IPFS.[1]

[1]: https://ipfs.io


You might be right, but human knowledge is also expanding, of course. The question is: will it expand faster than hardware capabilities?

Anyway, I wish we'd see more search and NLP related posts here on HN. It deserves far more attention than it gets.


For the average person this rate does not matter. They don't need access to the cutting edge of quantum physics, astronomy, dance, art or javascript.

All you have to do is look at the speed at which new info is being added to Wikipedia and Stackoverflow which is stabilizing, i.e. it is not growing as it once was. Basic/foundational knowledge is more or less all covered. https://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%...

And that sum total comes to 50-60 GB compressed. Think about that number. It's not big.


The sum total of our collective intelligence is equal to an install of GTA V... Crazy.


Wikipedia is not the sum of our collective knowledge. It's little more than the preface.


We're talking about the "long tail" of information, which is huge also outside of science. Think popular culture.


It would be awesome if you could download dumps of Wikipedia filtered by category so you can get the size down. There's probably a lot of information in there that's useless to me.


Kiwix does this, at least to a certain degree: http://wiki.kiwix.org/wiki/Content


Listen to Wikipedia http://listen.hatnote.com


NLP is rightly ignored.

https://en.m.wikipedia.org/wiki/Neuro-linguistic_programming...

Edit: Fortunately I'm left feeling foolish, rather than horrified.



The average user's needs are so small.

You do not even need "Google level" for most of today's web users.

You can deliver what users need with respect to web search with much less than "Google level".

For example, a simple "<title>" search. This is how Google started.

The entry point into the web should be search for domains. A "<title>" search can do that.
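A toy illustration of a "<title>" search over domains (the index data is entirely hypothetical; a real index would be built by a crawler):

```python
# Toy <title> index: map page titles to domains and match on query words.
# Illustrative data only.
TITLE_INDEX = {
    "Hacker News": "news.ycombinator.com",
    "DuckDuckGo - Privacy, simplified.": "duckduckgo.com",
    "Kiwix - Offline Wikipedia": "kiwix.org",
}

def title_search(query):
    """Return domains whose page title contains every query word."""
    words = query.lower().split()
    return [domain for title, domain in TITLE_INDEX.items()
            if all(w in title.lower() for w in words)]
```

Even something this crude maps "search terms" to "websites", which is the domain-discovery use case described above.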

Most users today do not do much searching within websites via Google. They search for websites using Google.

Anyway, you are right about storage space and offline search but obviously that truth misaligns with the "cloud" business narrative and coaxing users to store all their personal data in datacenters instead of on their desk or in their pocket.

Expect much opposition to this simple truth.


http://web.archive.org/ now provides full-text search, mostly of website titles.

Try it out. You'll find that it's... it feels like a trip back to 1998.


I'd say the average user especially benefits from a search system that's somewhat clever and finds things even if they don't ask exactly the right query.

And searching for domains is only a tiny part of it, especially now that a lot of information is stuck on large general-purpose sites (wikis, Q&A sites, social media sites) rather than on special-interest sites. And for many generic searches, the special-interest domains are various levels of spam/affiliate marketing.


PCIe 3 x16 devices have a 16GB/s theoretical max, so 1000GB/s is still out of reach for single machine I/O (though it's not as though search needs anywhere near these bandwidths anyway).


The Intel i9-7900X has 44 PCIe 3.0 lanes, and Wikipedia tells me each lane has a throughput of 984.6 MB/s, so that's ~43 GB/s; maybe fast compression could add a small integer multiple.
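Sanity-checking the lane arithmetic (a rough upper bound; real throughput loses some of this to protocol overhead):

```python
# PCIe 3.0: ~984.6 MB/s usable per lane (the Wikipedia figure cited above)
lanes = 44
per_lane_mb_s = 984.6
total_gb_s = lanes * per_lane_mb_s / 1000
print(f"{total_gb_s:.1f} GB/s aggregate")  # -> 43.3 GB/s aggregate
```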

https://www.intel.com/content/www/us/en/products/processors/...


AMD Threadripper has 64 in all available models: https://en.wikipedia.org/wiki/Zen_(microarchitecture)



That blog seems to imply you're using a distributed architecture, i.e. not a single machine.


I've been using Google and Wolfram Alpha for these things over the years, but it has always irked me that I'm sending this info to a third-party, to run through their services that I have no way to read or improve the code, and knowing that these things are only available to me if I'm online. I was really happy when I found out the DuckDuckGo Instant Answers modules' source code is open.

It's been on my list of things that I will almost certainly never take the time to actually work on, but what I wish I had is (A) a browser extension or GNOME extension that incorporates an offline version of all the DuckDuckHack modules, and (B) the same thing in an open source mobile app. (This kind of thing could just as easily live in a command line app, though, and I'd be super happy if a project maintainer incorporated them into something like GNU Units.) I looked into it, especially for (B), but I realized that the DuckDuckHack code depends on Perl.


Well, regarding offline availability, a large number of instant answers (Spices and Fatheads, that is) use external APIs or indexed databases from websites, so they can't work offline.

DDG does have official (and unofficial) browser extensions and apps for iOS/Android.


> Well, about offline availability, a large number of instant answers [...] can't work offline

Sure, but there are a large number of instant answers that can and do work offline because they're simple, static tables, or are self-contained—existing only to apply transformations on the input (e.g., cheatsheets, natural language unit conversions, and calculations).
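Those self-contained answers really can run offline; a minimal sketch of a unit-conversion "goodie" (the table and function are illustrative, not DDG's actual code):

```python
# Minimal offline unit-conversion goodie: a static table plus a transform.
# Factors are exact by definition (1 mi = 1609.344 m, 1 ft = 0.3048 m).
TO_METERS = {"m": 1.0, "km": 1000.0, "mi": 1609.344, "ft": 0.3048}

def convert_length(value, src, dst):
    """Convert a length between units by going through meters."""
    return value * TO_METERS[src] / TO_METERS[dst]
```

No network, no API: just a lookup and a multiplication, which is exactly why this class of answer could live in an extension or a command-line tool.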

> DDG does have official(and unofficial) browser extensions and apps for iOS/Android

A browser extension that just sends the query the same as it would if you hit their homepage is in the "what's the point?" category, just like mobile sites that nag you to install their app when all it does is show you the same content that is (or could be) on the mobile site itself. The "is a browser extension" is not the interesting part. "Doesn't send data to a third party" and "can operate without being connected to the network" are.


Why can't we have an intermediary search service that grabs search results from Google and posts them on a search website anonymously?


Startpage [1] is what you're looking for.

[1] https://startpage.com


Right. StartPage.com delivers Google search results in privacy. Plus, it offers a free proxy with every search result so you can visit websites through StartPage anonymously, too.


In DuckDuckGo, !g more or less does this, in that it disables search bubbling, but I think Google can see your client IP when the results are served to your browser.


Banging into Google using !G is like searching Google directly. Banging from DDG doesn't confer any privacy protections. A lot of people don't know this.
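Mechanically, a bang is little more than a query rewrite that sends your browser to the other engine, which is why that engine sees your IP. A sketch of the rewrite (illustrative, not DDG's code; the URL template is Google's standard search endpoint):

```python
from urllib.parse import quote_plus

def bang_redirect(query):
    """Strip a leading !g bang and build the Google URL the browser is sent to."""
    if query.lower().startswith("!g "):
        return "https://www.google.com/search?q=" + quote_plus(query[3:])
    return None  # no bang: DDG handles the query itself
```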


Startpage does just that. DDG something and use !sp to search there.


Let me save you a lot of time for the future:

!s is enough to redirect to Startpage. :-)


Searx proxies user requests to different search engines.

https://github.com/asciimoo/searx

There are different instances: https://github.com/asciimoo/searx/wiki/Searx-instances


The subtitle is "Past, Present, and Future", but I'm really missing what the future will hold. All they mention is that "We’re not sure what the next community initiative will look like".


Only vaguely related: Is there any fully FOSS general purpose web search engine which gets close to DDG? It seems by now it should be possible to run a community supported completely transparent search engine with relatively limited means ($X00k/year).


> It seems by now it should be possible to run a community supported completely transparent search engine with relatively limited means ($X00k/year).

I'd argue the opposite, the time when such a thing would have been possible is long past. If you want to get anywhere near to the quality in results that the big engines offer then you're going to be spending some pretty big cash.


In my experience, the bar for quality has been rapidly dropping as of late. At this point, most of the things I type into Google come back null or with random results -- even queries that used to return relevant data.


I have noticed this too. Entire vast websites seem very poorly indexed (I can never find relevant reddit comments, Tweets, etc), and the web feels smaller and smaller, if I judge by how many useful results I get in a search.

I have no idea why that is, but it's really hard to find relevant information for anything outside mainstream stuff. Just the other day I was looking for how to post JSON in a test with Flask, and I hadn't found anything after tens of searches. Surely, something must be referenced in some code on GitHub, or a reddit comment, or some blogpost somewhere, but it proved impossible for me to find.


I agree that there are problems searching large volumes of user-generated content when it isn't referenced much long term (Reddit, Twitter, etc.). If I don't keep record of Tweets myself I rarely find them again.


Well, I'm glad it isn't just me. Though I've been seeing it on other sites like Netflix and YouTube. Hell, YouTube keeps suggesting videos that I've already watched (search history is on): watch a video, go home, and it's in the suggestions. Sometimes Google amazes me (like when I typed in "mm unit" and it returned "molar mass", which is what I intended) but most of the time it feels like it's going downhill and I have to start using the macros.


Can you please provide an example of this type of search?


I look for scientific papers a lot and have found that I'm getting a lot more sites like IFLS or Ars that are reporting on said paper. Or I'll get related studies, but not the one I'm looking for, when the related studies definitely don't contain a word I'm using.

This is even expanding into my code searches. Like I'll type "do something os related python linux" and get commands for Windows as the first few hits. Clearly I don't want Windows.


Like a link please, I want to see "null or with random results". Not sure how customized results affect things though.


I can only account for my experiences. Only time I've ever had null results was when looking for really obscure things. I'm not the person that you originally replied to.


Thanks for taking the time to provide a specific example where the results weren't up to your standards.


I could spend a good hour or two of my time scouring through my web history for the things that you ask for proof of, or you could take what I said (Like anything on the internet, or based on recollection) with a pinch of salt. It happens just infrequently enough that it is both difficult to find proof of, and annoying when it happens. The null case does not happen so often as the case where nonsensical answers are returned.

Needless to say I am not going to do the former, because it was intended to be an anecdote indicative of the general case -- so example-specific solutions are not likely to be transferable, because I don't hand portions of my web history to random people to prove something, and also because I have better things to do.


Thanks for 138 words to explain no proof of your original 43 (making what I personally consider to be a rather extraordinary claim) since you have better things to do? I'll go with several times more than a pinch and keep an eye out to document this phenomenon a bit more thoroughly should it ever happen to me.

Success to you!


It's interesting to me that it took me one minute to write 138 words, thank you for that neat fact. And congratulations on a comment that is otherwise pretty useless.

For the record, you do not have the right to demand that someone use their time to prove something that is only of interest to you, nor do you have the right to demand that someone debate you. It is extremely arrogant and self-righteous of you to state otherwise.


Since we are going "for the record" (on a site where comments can't be deleted after an hour and a particular user record's karma column value is doubtless a part of consideration for ~$70,000 or whatever it is now) ...

> Can you please provide an example of this type of search?

My "arrogant and self-righteous" "demand that someone use their time to prove something that is only of interest to" me

>> demand that someone debate you

Not found. Apologies for any failures of communication or reading comprehension on my part?

Peace!


Yes, you are correct, it wasn't the question. It was the response to not getting what you wanted that made the original question tone-shift into a demand.

  >> demand that someone debate you
  > Not found.
Sure. I left that in for posterity.


If you have time, would you mind sharing what you would consider to be a non-demanding example of how I could thank you for following up, explain that I was doing exactly what you asked (re: salt), and promise that going forward I would be on the lookout for examples of my own -- all without any risk of confusion on my part being interpreted as a "tone-shift into a demand"?

I am surprised to discover the hard way how the intricacies of tone could cause anyone to document "for the record" an implication that I wrote something that I didn't write (re: demand), and then claim that I have written "arrogant and self-righteous" statements demonstrating my incorrectly supposed possession of two oddly specific non-existent rights (re: 1. others' use of time, and 2. debate) when I wrote no such thing.

I truly appreciate the potential value of this opportunity to learn from your feedback; please don't interpret this new request as a demand! (You are of course as always welcome to just say no [thanks?] or to not respond, among many other options.)

Perhaps my takeaway should be to add a disclaimer similar to the one immediately preceding this sentence to future questions; my mistake may be rooted in taking for granted what I believed to be understood by all as core to the function of this particular means of communication.

PS. FWIW, I have upvoted all of your replies as is my habit in very small part to thank you for your time.


It's worth pointing out that blekko raised $63 million. That gives you an idea of what a multi-billion-page crawl and index requires.


That's only if you're trying to be competitive with Google et al as a business venture. If all you're going for is good enough, that's not indicative of your necessary costs. What's more, the FOSS movement has a long history of accepting these kinds of compromises.

I call this the "Retina watermark fallacy"—when you equate something not being the latest and greatest with being unacceptable. When Apple introduced the "Retina" Hi-DPI display for the iPhone 4, it was good. But what's more, it was supposed to show that everything else was junk. And yet, if you looked at Apple marketing materials from ~5 years prior, you could find breathless ad copy about their then-latest displays that were (necessarily) not Retina quality. That means one of two things is true:

1. either Apple was selling unusable junk prior to the introduction of the Retina displays, and they managed to mistakenly convince themselves and everyone else that this stuff was acceptable when it actually wasn't, or

2. pre-Retina displays were good enough, and Hi-DPI displays are simply better

The truth lies in 2.

So, the takeaway as I understood it from the original question would be whether or not the FOSS world could produce today a search engine on par with, let's say, 2002-era Google. (I remember 2002-era Google, and not only did it work, but it was good!)


You can't build 2002-era Google today; it's 2017 and the web is a much more hostile place.

If there was any way the FOSS community could fund a reasonable search engine, I'd be happy to work at non-profit wages to make it happen. I don't see any way.


There's YaCy, but as for achieving the results, there's a very large amount of network traffic and storage you'd need to have available to you.


It's not exactly what you're looking for, but I'd recommend taking a look at Apache Solr:

http://lucene.apache.org/solr/

It allows you to run a document search engine that can be distributed over multiple machines. It could be adapted to create a web search engine. Interestingly, it looks like DDG uses Solr, though I'm not sure if it's used in their core search features or not.
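Querying Solr is plain HTTP against its standard /select handler; a sketch assuming a local instance with a core named `webpages` and a `title` field (both names are hypothetical examples):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical local Solr core; adjust host/core/field names to your setup.
SOLR_BASE = "http://localhost:8983/solr/webpages/select"

def solr_query_url(text, rows=10):
    """Build a URL for Solr's standard /select query handler."""
    params = {"q": f"title:({text})", "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}?{urlencode(params)}"

def search(text):
    """Run the query; requires a running Solr instance at SOLR_BASE."""
    with urlopen(solr_query_url(text)) as resp:
        return json.load(resp)["response"]["docs"]
```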


If you're up in the billion+ document range, you'll probably want something more efficient like Xapian. Solr and Elastic are great at what they do, but web scale is bigger.


You might be interested in searx.me


"Searx is a metasearch engine", so while it has many good uses any biases present in the results it retrieves for you are still going to be in the final result.


How can it be "for you" if the instance is shared between different users?


Check out http://yacy.net. Don't expect an experience anywhere close to DDG, though...


Am I understanding correctly that the instant answers (the actual content) are not on GitHub, and are only available on a web page semi-locked down from scraping attempts by JavaScript paging?

If this is not correct, anyone have a link to the exact repo I should be looking at? The link in TFA only goes to the main account page, not any specific repo, and the repo names are not clear enough to tell if they have what I'm looking for.


The Instant Answers are all on GitHub but in four separate repos which, I agree, can be confusing.

* https://github.com/duckduckgo/zeroclickinfo-goodies - "Goodies" which are generally static answers such as cheat sheets, colour picker or unit conversions.

* https://github.com/duckduckgo/zeroclickinfo-spice - "Spice" for using public APIs, e.g. weather, transport status or currency conversions.

* https://github.com/duckduckgo/zeroclickinfo-fathead and https://github.com/duckduckgo/zeroclickinfo-longtail - "Fathead" and "Longtail" are less common and are for text lookups, e.g. of programming docs.

Disclaimer: DuckDuckGo staff


Thanks for clarifying!



Does this mean the end for Instant Answers? I hope not - it's one of the information sources my bot uses to research the world.

It's amazing to see how much human effort went into this project and the full 1,200-word list. I thought I had read somewhere that this was automation backed by Wikipedia, but apparently it was entirely manual?


DDG staff here - Instant Answers aren't going anywhere :)


This is why I'll never use another service by DuckDuckGo ever again. First they shut down DuckDuckReader and now this.


Just to clarify for those missing the joke, we don't (or didn't) have a service called DuckDuckReader. Got a nice ring to it though :-)

Disclaimer: DuckDuckGo staff


I thought I was familiar with most of DDG's operations. What was DuckDuckReader? RSS reader or something?


That was a joke essentially ripping on people throwing fits about Google shutting down products, most famously Google Reader.


[Unrelated]

I was not aware of this being open source.

A quick look-through led to this sample search -- "Movies with Keira Knightley". However, "Keira Knightley movies" fails to give the same instant answer. Any permutation of the words "Keira", "Knightley" and "movies" on Google seems to give the list of movies -- which is how the behaviour should be, I guess. Will probably open an issue/PR :)


What was DuckDuckHack?


Seems to be an editing community for instant answers on DuckDuckGo search.


I started making something with DuckDuckHack, but soon realized I had bitten off more than I could chew. I wanted to delete what I did so at least the name would be available to someone who wants to do a good job, but I have no idea how to delete it.


> That's over 5,000 pull requests, 250,000 lines of code and hundreds of squashed bugs!

I was expecting "hundreds of new bugs".


How can one join the DuckDuckHack community? And what are the selection criteria?


You pretty much submit an instant answer or two and they add you to the duckduckhack-community group on GitHub.


It seems that discarding your community once you have made enough money is trending.


How does this sort of conspiracy theory make any sense in this context? Or even in general?

If they are making so much money, why would they end the program?


To be fair, the comment from DuckDuckGo in the Reddit thread[1] says that they will continue to put resources into it ("staff are still improving the Instant Answers we have, and will create any new Instant Answers we see are needed"). Which means the only change really happening here is to shut off the contributions that DDG receives from others for free, which doesn't really make sense as a business decision, either.

On the other hand, if you view it as an announcement that they're going to be taking Instant Answers closed source to keep future changes in-house, then it makes sense.

1. https://www.reddit.com/r/duckduckgo/comments/6ymjj8/duckduck...


Just to clarify, the Instant Answers will remain open source on GitHub and maintained in public.

Disclaimer: DuckDuckGo staff


I don't know, the whole post reads like a "So long, and thanks for all the fish" to me. They don't need a community anymore because they got bigger and at this point it's just easier to close it rather than having employees taking care of the user contributions.

Reddit has just pulled the plug on all their open source repositories. This will make it harder to develop and keep third-party clients updated, like the ones without ads, for example.


I'm not sure how you reached this conclusion from this news, given the insider info in other comments.

Maybe you have a relevant story to tell about DDG?



