DARPA Has Open-Sourced 'Dark Web' Search Tech (forbes.com)
188 points by aburan28 on Apr 19, 2015 | 62 comments



BTW, I'm one of the co-authors of streamparse, one of the DARPA-supported projects that is being developed by my company, Parse.ly. It lets you integrate Apache Storm cleanly with Python.

I just gave a talk about streamparse at PyCon US (https://www.youtube.com/watch?v=ja4Qj9-l6WQ) a few days ago; it was entitled "streamparse: defeat the Python GIL with Apache Storm". I'm glad to answer any questions about it.
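
For anyone curious what the integration looks like, here is roughly the shape of a streamparse bolt. This is a minimal sketch; the class name and word-count logic are just illustrative, and the exact method signatures may differ between streamparse versions, so check the docs.

    from collections import Counter

    from streamparse import Bolt

    class WordCountBolt(Bolt):
        """Counts words flowing through a Storm topology."""

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]                  # value from the upstream spout
            self.counts[word] += 1
            self.emit([word, self.counts[word]])  # pass (word, count) downstream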


With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.

> ...DARPA said Memex wasn’t about destroying the privacy protections offered by Tor, even though it wanted to help uncover criminals’ identities. “None of them [Tor, the Navy, Memex partners] want child exploitation and child pornography to be accessible, especially on Tor. We’re funding those groups for testing...”

Doesn't this sound like the same "protect the kids" line embedded in every press release for not-so-subtle government spy programs? $1 million is a lot of money, and I'm sure being able to name-drop DARPA in any conversation about your company carries its own cachet -- surely you feel pressured to design your optimizations to fit DARPA's needs. Does it feel weird to write code that's being used to track people? Or is that off base?


To be clear, our projects (at Parse.ly) don't have anything to do with Tor. In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.

But, I'll address your general question, which is, do I have a moral/ethical problem with DARPA funding some of our open source work, such as streamparse and pykafka?

The answer is a resounding "no". There are very few funding sources for open source work. Part of DARPA's funding supports fundamental tech advancements (famously, the Internet itself and GPS) and recently, important open source projects (such as, Apache Spark and the Julia language).

Now, there is no doubt in my mind that open source software is used for intelligence purposes, regardless of its funding source. To restrict one's contribution to F/OSS based on the fear that some government or entity may use it toward an end you disagree with seems like a battle you can only win through willful ignorance.

The nature of open source software is that people can use it however they please (within legal limits, of course). This is a trade-off I accept with eyes wide open, and in my mind, the benefit to the community for F/OSS always wins out.


> In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.

This reminds me of the movie Cube :(.


Brilliant little unknown film. First time I've seen it mentioned, ever.


It's a means of searching public-facing (albeit cloaked) content, not a means of tracking people specifically.

If anything, it's ethically much superior to what the NSA is doing: law enforcement searches for content that is clearly criminal (child pornography, actual terroristic threats, murder-for-hire services), then requests a warrant after showing the content to a judge. That's how the process should work; identify something illegal at the front, identify probable cause, then go in through the back with court approval. These search engines can only find content that is already accessible to other users.

The NSA is already in the back, looking for justification for already being there, then after finding something, lying and saying they went in through the front.

Of course, this software could theoretically be used to search a database of data unethically exfiltrated without a warrant, but that's not what the stated goal is and there doesn't seem to be any evidence of that.


They're using it to search for "human trafficking", by which they seem to mean adult women having sex in exchange for money. Oh, sorry, adult women who describe themselves as "latina" having sex for money - mustn't forget that part. (Seriously. Look at the pictures in the article.) Minor details like whether the women in question are actually trafficked, or whether they should be deporting them right back into the hands of the people who trafficked them if they are, have never been terribly important to the police in the US. This will be used to hurt vulnerable women.


> Does it feel weird to write code that's being used to track people?

Does it feel weird to design mechanical implements designed for the sole purpose of destroying human life?

I'm not speaking of drones and missiles, mind you; I'm speaking of small arms, the very same tools so staunchly defended by libertarian lovers of the Second Amendment everywhere.

There are plenty of valid reasons to want to track someone over a network like Tor, just as there are insidious reasons. E.g. all the reasons that make legal, warrant-protected wiretaps a legitimate function of governments worldwide.

But even if there weren't valid reasons, other countries will develop (or already have) similar capabilities, so making DARPA your line in the sand for this is missing the point anyways.


>With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.

You're not adding to the conversation by pointing this out. We can all clearly see this for what it is.


How does one get their commercial project supported by DARPA?


Hmm, not sure I could answer that question, as in this case, DARPA is supporting our open source projects, not our commercial projects. Or is that what you are asking?

That said, FastCompany covered the story of how we got involved with DARPA here:

http://www.fastcompany.com/3040363/the-future-of-search-brou...


I looked at the MEMEX page and saw a bunch of companies represented and was genuinely curious... thanks for sharing the link.


Send your proposal in response to the Broad Agency Announcements (BAA) that the agency puts out.


What's the GitHub URL?



Solid. Apache licensed. You're inside tmux too.

This is legit.

Docs: http://streamparse.readthedocs.org/en/latest/

How did you make that screenshot / animated preview?


I used a Linux program called byzanz. The bash alias I use to record gif screencasts is here:

https://github.com/amontalenti/home/blob/master/.bash_aliase...


This is the link to the project:

http://www.darpa.mil/opencatalog/MEMEX.html


Of additional interest would be the other DARPA programs participating in the open catalog, of which MEMEX is but one.

http://www.darpa.mil/opencatalog/

There are many interesting 'pieces of the puzzle' here that you can glue together to make something awesome.

If anyone cares about what success means in our case, it's the transition of technology to other agencies and companies (not necessarily military!). We want the research funded here to live beyond the program instead of dying when the program ends (truly satisfying from a software engineering perspective, where many R&D deliverables are either too specialized or never see the light of day).


Roger Dingledine was asked about the Tor Project's involvement with the things mentioned in this article. His comment is here:

https://lists.torproject.org/pipermail/tor-talk/2015-April/0...


Thanks for posting - this is amazing... I was actually working on a far-from-finished article, "How a Search Engine Startup company could compete with Google" http://bitexperts.com/Question/Detail/42/how-a-search-engine...

and then this announcement came across. I will be looking at what kind of crawlers they release. Hopefully some modern ones based on a WebKit/Chromium core that expose the DOM and are suitable for navigating all these AJAX-fueled modern web interfaces.

Also very interested to see what kind of machine learning / classifiers they are using. When working on our search engine, we used purely statistical classifiers - variations of Bayes, SVMs (Support Vector Machines), and C4.5 decision trees - glued together with some custom algos. We did not use neural nets at all. Nowadays, neural nets have a new name - "deep learning" - and seem to be everywhere.
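
To make that concrete, the kind of purely statistical setup described above can be sketched in a few lines with scikit-learn. This is a toy illustration (the category names and training texts are made up), not the actual engine's code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training set: classify page text as "commerce" or "forum".
    texts = ["buy cheap widgets free shipping",
             "how do I configure my router",
             "discount sale widgets order now",
             "anyone else seeing this error message"]
    labels = ["commerce", "forum", "commerce", "forum"]

    # TF-IDF features feeding multinomial Naive Bayes: the classic statistical combo.
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)

    print(clf.predict(["widgets on sale today"]))  # most likely ['commerce']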

Really, really interesting to think about what people will build with an open source search engine. Watch out, Google :)


One of the supported projects is splash, which is basically WebKit-as-a-service. It takes an interesting approach to crawling where it renders the page using WebKit, and then exposes the "rendered DOM" -- so that your crawling code doesn't need to actually use JavaScript for information extraction. See:

https://github.com/scrapinghub/splash

People often use Scrapy + Splash together in the Python community for crawling more dynamic websites.
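
If you just want to poke at Splash without Scrapy, its HTTP API is enough on its own. A minimal sketch, assuming a local Splash instance on its default port 8050 (e.g. via the scrapinghub/splash Docker image) and using example.com as a stand-in URL:

    import requests

    # render.html returns the page's DOM *after* JavaScript has run,
    # so a plain HTML parser can handle dynamic, AJAX-heavy pages.
    resp = requests.get("http://localhost:8050/render.html",
                        params={"url": "https://example.com/", "wait": 0.5})
    resp.raise_for_status()
    print(resp.text[:200])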

A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context", it's called scrapy-cluster. The idea is scrapy workers running across machines and a single crawling queue (in the current prototype, powered by Redis) in between them all.

https://github.com/istresearch/scrapy-cluster
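
The shared-queue idea itself is simple enough to sketch with plain redis-py; this is just an illustration of the concept (the queue name and URL are made up), not scrapy-cluster's actual code:

    import redis

    r = redis.Redis()  # assumes a Redis server on localhost:6379

    # Any worker, on any machine, can enqueue newly discovered URLs...
    r.lpush("crawl:queue", "https://example.com/page-1")

    # ...and any worker can block-pop the next URL to fetch.
    _key, url = r.brpop("crawl:queue")
    print(url.decode())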


    It takes an interesting approach to crawling where
    it renders the page using WebKit, and then exposes
    the "rendered DOM" -- so that your crawling code
    doesn't need to actually use JavaScript for
    information extraction.
It is an interesting approach. There's evidence that Google crawls the web that way, though I don't know if it's been confirmed by the company.

GoogleBot indexes content rendered by Javascript - even content delivered by an AJAX request. They've announced they are going to start penalizing sites that don't work well on mobile. I don't know the specifics of that (and they probably haven't shared them) but I do know that I've received automated email from Google Webmaster Tools and/or AdSense about one of my sites not working great on mobile: small UI elements grouped too closely together, content that's too wide, etc.


This is the tool recommended to me by a person on the AdWords team:

https://www.google.com/webmasters/tools/mobile-friendly/

According to them, starting April 21st it will be a ranking factor.


    april 21st
Great. April 2011 was when Google launched Panda 1.0, from which I don't think my slang dictionary site has ever recovered.

Thanks for the link. I guess I better hop to it.


Thanks! I've briefly looked at Splash and related projects like ScrapingHub, etc - looks like this niche is live and kicking...

The distributed scrapy-cluster is the way to go if you need to crawl anything of decent size (maybe even Amazon - 300+ MM webpages, j/k :)

I see a lot of Python-based projects recently, even in the Bitcoin niche; we even have a local Toronto-based Python meetup. Looks like the Python dev community is active.

I have a domain name PYFORUM.com - would it be a good idea to launch a forum site? With Bitcoin tipping built in? So instead of saying "Thanks", people would be able to send $0.25 in Bitcoin to those who helped them in the forums or made them laugh? What are the most established Python forums out there?

Thanks!


Launch a forum actually using Python...

Even the largest 'Python forum' is on phpBB...


This is a nice surprise! I wrote to ask about public access to this in February and got no reply. I had forgotten about it until I got a lengthy reply from someone at DARPA in mid-March saying that they were considering it - and sure enough, here it is: http://www.darpa.mil/opencatalog/MEMEX.html


It looks like we can add Forbes to the list of publications that don't understand what the "dark web" is. It's too bad that this article starts right off with gross misinformation.


To be honest, "dark web" seems to lack a clear, well-defined meaning and should be avoided.


Once again someone confuses "deep web" for "dark web".


This is wonderful news. Information (including Tor onion services) wants to be freely findable ;)

Obscurity should never be the major aspect of OPSEC.

Even with the sorry state of Tor onion services, sites can authenticate users at both network level, using the "stealth" authorization protocol, and application level. Of course, prudent users will use end-to-end encryption for sensitive information. And they'll consider carefully before sharing.

All of the "hidden markets", by the way, are little more than honeypots in waiting, enticing customers to pwn themselves.

Edit: See Roger Dingledine's reply in the "Clarification of Tor's involvement with DARPA's Memex" thread on tor-talk <https://lists.torproject.org/pipermail/tor-talk/2015-April/0....


The idea that a single company can perfect search - even one employing a significant percentage of the world's PhDs - is absurd. However, Google will likely continue to dominate search even in the face of superior alternatives because of their position in the marketplace and in the minds of consumers, just as Microsoft continues to dominate the OS space long after the introduction of viable free alternatives. This is why, if you have ever tried to pitch a search startup to VCs, you have almost certainly received a less-than-polite "no" along with your coconut water.

These tools will be interesting to use for a handful of techies like myself, but Google will remain safely ensconced in its fortress of cash for at least the next several decades regardless of this and whatever else comes along that may enable the creation of superior search engines.


> long after the introduction of viable free alternatives

Maybe Ubuntu (or Mint?) has learned UX design and found a way to sand down all its rough edges in the 3 years since I switched to OS X, but I doubt it. Running Linux requires the time and patience of a knowledgeable person. Linux is far too expensive for most people.


If you ever have to service or install Windows (and everyone has to eventually) then that is even harder because it's not primarily designed to be worked with at that level. Windows rests on a cushion of someone else providing IT support. Why should Linux perform usability miracles no other OS does?


How many times have you had issues running software on Windows compared to Linux?

Even with "consumer friendly" distros like Ubuntu, compatibility is an issue, and you can easily screw up co-hosted software and even the entire OS due to the current dependency mess which plagues the F/OSS world, especially on Linux.

Back in the '90s and maybe the early 2000s you would get silly error messages like "cannot find XYZ.dll" or "cannot find ZYX.ocx" on Windows, but today? I don't think so. Other things like configuring network sharing, backup, and more advanced network setups are still considerably easier on Windows than on Linux.

And it's not just the UI: both Gnome and KDE offer pretty much 95% coverage of anything you can configure on Linux (core services only, of course) in the UI, but their UI is still sucky.

I hadn't used Windows Server since 2003, jumped into 2012, and could find everything; any new features were self-explanatory. Trying to make OS changes in Gnome, meh, I'd rather just find the config file and edit it manually.

Linux needs to get its shit together, and hopefully one day it will, because at the moment, even with all the nice and user-friendly package managers, at best half of your software will come in archives which a basic consumer doesn't know what to do with (nor should they), and at worst comes completely uncompiled, forcing you to make it on your own machine and hope you have all the dependencies and header files you need.

And for the software you can get from your distro's "app store", you're still crossing your fingers hoping one application won't cause all of your other ones to brain fart because the dependencies it uses are of a slightly newer or older version.

I've worked for a company that tried to switch their desktops to Linux, and I've heard plenty of stories from people who encountered similar cases. It just never bloody works, just like everything else, and then they panic-hire an entire new IT department that still doesn't manage to get everything working properly for everyone, and they scrap the entire project 6-8 months down the line, losing probably several orders of magnitude more than they would ever have gained from reducing licensing costs across their desktops.


Elementary OS [1] attempts to do that with a design-first philosophy. However, it was still too much of a hassle to get it working with 3 monitors last time I tried it.

I've used various Linux distros for years, and there's always something that doesn't work right. Whether it's the speakers not turning off when I plug my headphones in through my ThinkPad dock in Arch, or monitor position settings changing every time I reboot my desktop in Elementary OS, the lack of polish really shows. I understand why this happens, and I don't fault the people working on the various distros. I've even dived in and fixed a few problems myself.

But nevertheless, it remains a problem, so for now I've just settled on Windows for day-to-day use with a VM running Linux when I need it.

[1] http://elementary.io/


While OS X is probably easier to use still, I've been using Ubuntu on the desktop since 2005, and every time I have to do something on Win8 I cry in desperation. Win7 was frustrating but workable. Win8.1 is a horrible mess, as was Win8 before it; the Metro/Old dichotomy is intellectually painful, distracting, and damaging to any sane workflow.


I strongly disagree. Market dominance is almost meaningless in search. First off, everyone is trying to game your system, so being ranked #2 is actually a large advantage from a quality standpoint. Second, switching costs are next to nothing; just look at all the people using DuckDuckGo despite a vastly smaller team.


Duck Duck Go actually proves my point. It is arguably the most successful of the alternative search engines. Yet, if you stopped 100 average people on any street in the US and asked them, I'm guessing that less than 3 of them will know what Duck Duck Go is. Ask them what Google is, and you'll get at least 99% accuracy.


You completely miss the point: people using DuckDuckGo can use !g and do a Google search. https://duckduckgo.com/bang.html

More importantly, DuckDuckGo is actually considered worse than Google even by many people who use it, so this is considered a significant feature. Granted, it's debatable how much worse, but Bing and DuckDuckGo are viewed as competitive search engines, not better search engines. Plenty of people do regular head-to-head comparisons, using different search engines for the same query; if they can't find it on one, they use the other. After a while the 'better' engine generally becomes their default.

Windows, on the other hand, has a lot of inertia due to third-party integration. For example, there is a much smaller selection of games on OS X than on Windows, which creates a feedback loop. There is no way OS X could hit 50% market share in 3 years, let alone a brand new OS starting from scratch, but search has seen massive swings in market share fairly recently.

PS: The !g feature is also great for fine-tuning their search engine, providing a listing of every pain point where people gave up.


DDG definitely has sore spots. But the ! shortcuts more than make up for it.


So, browsing through these projects I see a lot of crawlers.

Where is the search tech? Other than that one project which seems to be a JS-frontend for Lucene/Solr, I can't find anything having to do with Information Retrieval.

Unless I'm missing something, the article and the previous one it links to seem to be a bit overblown.

"Could The U.S. Military's New Search Engine Replace Google?" -> Not if the U.S. Military's New Search Engine is the vanilla Lucene/Solr.

Link: http://www.darpa.mil/opencatalog/MEMEX.html


This is where the "Memex" name comes from: http://en.wikipedia.org/wiki/Memex

A memex is a "hypothetical proto-hypertext system that Vannevar Bush described in his 1945 The Atlantic Monthly article 'As We May Think'."

I'm rather disappointed that DARPA chose it as the name for a project that, according to Wired (as quoted on Wikipedia), "aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity".


It's explicitly designed to work outside that single sphere:

Memex seeks to develop software that advances online search capabilities far beyond the current state of the art. The goal is to invent better methods for interacting with and sharing information, so users can quickly and thoroughly organize and search subsets of information relevant to their individual interests....

Memex would ultimately apply to any public domain content; initially, DARPA plans to develop Memex to address a key Defense Department mission: fighting human trafficking.

http://www.darpa.mil/Our_Work/I2O/Programs/Memex.aspx


No, DARPA plans to expand Memex so it works outside that single sphere. I'm not reading anything that suggests it's designed to work outside that sphere.

But we could go back and forth on the distinction between "seeks to" and "designed to" all night, so I'll leave it at that. :)


You're both right. The trend with these I2O programs (and other departments) is to design a system of loosely coupled pieces that are tailored to a specific problem domain, in this case human trafficking, and then generalize over the next couple of years. By the third or fourth year you could expect to see multiple engagements with corporate industry and Gov't/DoD/LE/etc. to apply these tools to their data, and thus generalize and polish the final product. The end goal is adoption of the technology and successful transition to some partner. The good programs tend to run for 3-4 years if they can justify their existence to the DARPA leadership. This is also roughly the tenure of a program manager, although a program manager may have more than one program going at a time.


I agree, but it is worth noting that DARPA isn't known for their lack of ambition, nor for their lack of long term commitment. If they "seek to do" something that generally means that is what they are aiming for.


Yeeeeeah... There's much more to Google than just search. Also, there is much more to a search engine than just search (SEO spam countermeasures come to mind). Good luck replacing Google on its home turf.


However, the idea is interesting.

I once used Chrome. I now use Firefox.

Nothing lasts forever.

Then again, what's to stop Google adopting the same technology?


In order to replace Google, just implement their patents and then wait until they expire.

At least, that is how it should work.


Except you'll be 20 years behind their current version.


You are assuming that search quality will keep growing without an asymptotic limit. It could very well be that we will soon reach a point where search is good enough for most people.


You're assuming a sort of static model of search where it makes sense to talk about it growing to an asymptotic limit.

But it's a dynamic and adversarial environment: the corpus evolves to destroy the quality of search results (spam/SEO). A 20-yo version of Google will be way worse than Google was 20 years ago, because when it goes live it'll be attacked by 19 years of spam innovation (if I can use that word in this context).



IMHO, spammers have little chance against Google in the long run. Here is why:

- It is of the highest priority for Google to stay ahead of SEO spammers, or they stand to lose their multi-billion-dollar search business. So resources to fight spam will always be allocated.

- It would take only one smart Google person to write a smart algorithm, and it gets amplified by 100,000 server cores at Google. And Google employs far more than one smart engineer.

- The Chrome browser has 50% market share and can collect enormous SERP quality / engagement metrics on what end users actually click in search results, how much time they spend there versus the amount of content on that page, and which page finishes each search quest (i.e. the user has found what they were looking for) - these are in fact some of the best "quality signals" available to Google.

- And Google could completely ignore signals from any Chrome instance under even the slightest suspicion of being manipulated (Chrome is a native app, so it could monitor mouse movement patterns, etc.).

- I would say Google is much better equipped than spammers to stay ahead. Of course, there are going to be some short-term advances in black hat SEO or social network manipulation, but eventually every serious loophole will get closed.


Anyone who's used a search engine disagrees with you. And as far as Google goes, their only "gripe" with SEO spammers is that they get the money instead of Google.

So yes, Google does work to stay ahead of SEO spammers by providing a better service than them for those who are willing to pay, because consumers don't care; and as long as you don't get a Viagra commercial and 17 types of malware by clicking on a promoted search result in Google, neither will you.


The problem is that search is a constant battle between Google and spammers/SEO. It's a never ending arms race that ensures there won't be a perfect solution.

And in this arms race, Google's massive resources give them a huge advantage. I think Google will probably maintain their dominance until they inevitably get caught doing something egregious enough that the government decides to step in and bust them up.


Do they patent the important parts from their search engine? I thought they just kept it secret.


For example, PageRank - the core algorithm - is patented [0].

[0] http://www.google.com/patents/US6285999
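
The core idea behind that patent fits in a few lines of power iteration. A toy sketch for intuition (the tiny example graph is made up, and this is nothing like Google's production implementation):

    import numpy as np

    def pagerank(adjacency, damping=0.85, tol=1e-9, max_iter=100):
        """Toy power-iteration PageRank; adjacency[i, j] = 1 if page j links to page i."""
        n = adjacency.shape[0]
        out_degree = adjacency.sum(axis=0)
        # Column-stochastic transition matrix; dangling pages jump uniformly.
        transition = np.where(out_degree > 0,
                              adjacency / np.maximum(out_degree, 1),
                              1.0 / n)
        rank = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            new_rank = (1 - damping) / n + damping * transition @ rank
            if np.abs(new_rank - rank).sum() < tol:
                break
            rank = new_rank
        return rank

    # Tiny 3-page web: 0 -> 1, 1 -> 2, and 2 -> 0, 2 -> 1.
    A = np.array([[0, 0, 1],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    print(pagerank(A))  # prints the stationary rank vector for the 3 pages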


Is that still an ongoing trend?


> just implement their patents

That presupposes that the "disclosures" of the patents in question actually disclose useful information. One of the big criticisms of software patents is that the "disclosure" really isn't disclosing anything of value, like an algorithm or something.



