
DARPA Has Open-Sourced 'Dark Web' Search Tech - aburan28
http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partners-show-off-memex/
======
pixelmonkey
BTW, I'm one of the co-authors of streamparse, one of the DARPA-supported
projects that is being developed by my company, Parse.ly. It lets you
integrate Apache Storm cleanly with Python.

I just gave a talk about streamparse at PyCon US
([https://www.youtube.com/watch?v=ja4Qj9-l6WQ](https://www.youtube.com/watch?v=ja4Qj9-l6WQ))
a few days ago; it was entitled "streamparse: defeat the Python GIL with
Apache Storm". I'm glad to answer any questions about it.

~~~
strgrd
With only a brief skim of your talk, I wonder what you think of the moral
implications of this project being DARPA-supported.

> ...DARPA said Memex wasn’t about destroying the privacy protections offered
> by Tor, even though it wanted to help uncover criminals’ identities. “None
> of them [Tor, the Navy, Memex partners] want child exploitation and child
> pornography to be accessible, especially on Tor. We’re funding those groups
> for testing...”

Doesn't this sound like the same "protect the kids" line embedded in every
press release for not-so-subtle government spy programs? $1 million is a lot
of money, and I'm sure being able to name-drop DARPA in any conversation about
your company carries its own cachet -- surely you feel pressured to design your
optimizations to fit DARPA's needs. Does it feel weird to write code that's
being used to track people? Or is that off base?

~~~
pixelmonkey
To be clear, our projects (at Parse.ly) don't have anything to do with Tor. In
fact, I didn't know much about Tor until researching DARPA and the other
participants involved in the program.

But I'll address your general question, which is: do I have a moral/ethical
problem with DARPA funding some of our open source work, such as streamparse
and pykafka?

The answer is a resounding "no". There are very few funding sources for open
source work. Part of DARPA's funding supports fundamental tech advancements
(famously, the Internet itself and GPS) and, recently, important open source
projects (such as Apache Spark and the Julia language).

Now, there is no doubt in my mind that open source software is used for
intelligence purposes, regardless of its funding source. To restrict one's
contribution to F/OSS based on the fear that some government or entity may use
it toward an end you disagree with seems a battle you can only win through
willful ignorance.

The nature of open source software is that people can use it however they
please (within legal limits, of course). This is a trade-off I accept with
eyes wide open, and in my mind, the benefit to the community for F/OSS always
wins out.

~~~
saurik
> In fact, I didn't know much about Tor until researching DARPA and the other
> participants involved in the program.

This reminds me of the movie Cube :(.

~~~
api
Brilliant little unknown film. First time I've seen it mentioned, ever.

------
farresito
This is the link to the project:

[http://www.darpa.mil/opencatalog/MEMEX.html](http://www.darpa.mil/opencatalog/MEMEX.html)

~~~
phy6
Of additional interest would be the other DARPA programs participating in the
open catalog, of which MEMEX is but one.

[http://www.darpa.mil/opencatalog/](http://www.darpa.mil/opencatalog/)

There are many interesting 'pieces of the puzzle' here that you can glue
together to make something awesome.

If anyone cares about what success means in our case, it's the transition of
technology to other agencies and companies (not necessarily military!). We
want the research funded here to live beyond the program instead of dying when
the program ends (truly satisfying from a Software Engineering perspective,
where many R&D deliverables are either too specialized or never see the light
of day).

------
schoen
Roger Dingledine was asked about the Tor Project's involvement with the things
mentioned in this article. His comment is here:

[https://lists.torproject.org/pipermail/tor-talk/2015-April/0...](https://lists.torproject.org/pipermail/tor-talk/2015-April/037538.html)

------
alexmobile
Thanks for posting - this is amazing... I was actually working on a
far-from-finished article, "How a Search Engine Startup company could compete
with Google":
[http://bitexperts.com/Question/Detail/42/how-a-search-engine...](http://bitexperts.com/Question/Detail/42/how-a-search-engine-startup-company-could-compete-with-google)

and then this announcement came along. I'll be looking at what kind of
crawlers they release. Hopefully some modern ones based on a WebKit / Chromium
core that expose the DOM model and are suitable for navigating all these
AJAX-fueled modern web interfaces.

Also very interested to see what kind of machine learning / classifiers they
are using. When working on our search engine, we used purely statistical
classifiers - all variations of Bayes, SVMs (Support Vector Machines), and
C4.5 decision trees - glued together with some custom algos. We did not use
neural nets at all. Nowadays neural nets have a new name - "deep learning" -
and seem to be everywhere.
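
For a rough flavor of that kind of glued-together pipeline, here is a toy
sketch using scikit-learn names as stand-ins (not our actual code):

    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    # naive Bayes + linear SVM + decision tree, combined by majority
    # vote over TF-IDF features of the page text
    clf = make_pipeline(
        TfidfVectorizer(),
        VotingClassifier([
            ("bayes", MultinomialNB()),
            ("svm", LinearSVC()),
            ("tree", DecisionTreeClassifier()),
        ]),
    )
    clf.fit(["cheap pills here", "memex crawler release"], ["spam", "ok"])
    print(clf.predict(["buy cheap pills"]))  # most likely ['spam']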

Really, really interesting to think about what people will build with an open
source search engine. Watch out, Google :)

~~~
pixelmonkey
One of the supported projects is splash, which is basically WebKit-as-a-
service. It takes an interesting approach to crawling where it renders the
page using WebKit, and then exposes the "rendered DOM" -- so that your
crawling code doesn't need to actually use JavaScript for information
extraction. See:

[https://github.com/scrapinghub/splash](https://github.com/scrapinghub/splash)

People often use Scrapy + Splash together in the Python community for crawling
more dynamic websites.
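
Using it is just an HTTP call. A rough sketch, assuming a Splash instance
running on its default port:

    import requests

    # ask the local Splash instance to render the page and hand back
    # the post-JavaScript DOM as plain HTML
    resp = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "http://example.com", "wait": 0.5},
    )
    html = resp.text  # rendered DOM; parse it like any static page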

A team I collaborated with is also working on a project to make Scrapy usable
in a "cluster context"; it's called scrapy-cluster. The idea is Scrapy workers
running across machines with a single crawling queue (in the current
prototype, powered by Redis) between them all.

[https://github.com/istresearch/scrapy-cluster](https://github.com/istresearch/scrapy-cluster)
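
The gist of it, as a toy sketch (just the idea, not scrapy-cluster's actual
code):

    import redis

    r = redis.Redis()  # the shared crawl queue lives in one Redis instance

    def enqueue(url):
        # any worker (or a seed script) pushes discovered URLs here
        r.lpush("crawl:queue", url)

    def next_url():
        # each crawl worker blocks here until a URL is available
        _, raw = r.brpop("crawl:queue")
        return raw.decode()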

~~~
WalterGR

        It takes an interesting approach to crawling where
        it renders the page using WebKit, and then exposes
        the "rendered DOM" -- so that your crawling code
        doesn't need to actually use JavaScript for
        information extraction.
    

It is an interesting approach. There's evidence that Google crawls the web
that way, though I don't know if it's been confirmed by the company.

Googlebot indexes content rendered by JavaScript - even content delivered by
an AJAX request. They've announced they are going to start penalizing sites
that don't work well on mobile. I don't know the specifics of that (and they
probably haven't shared them) but I do know that I've received automated
emails from Google Webmaster Tools and/or AdSense about one of my sites not
working great on mobile: small UI elements grouped too closely together,
content that's too wide, etc.

~~~
hayksaakian
This is the tool recommended to me by a person on the AdWords team:

[https://www.google.com/webmasters/tools/mobile-friendly/](https://www.google.com/webmasters/tools/mobile-friendly/)

According to them, starting April 21st it will be a ranking factor.

~~~
WalterGR

        April 21st
    

Great. April 2011 was when Google launched Panda 1.0, from which I don't think
my slang dictionary site has ever recovered.

Thanks for the link. I guess I better hop to it.

------
anigbrowl
This is a nice surprise! I wrote to ask about public access to this in
February and got no reply. I had forgotten about it until I got a lengthy
reply from someone at DARPA in mid-March saying that they were considering it
- and sure enough, here it is:
[http://www.darpa.mil/opencatalog/MEMEX.html](http://www.darpa.mil/opencatalog/MEMEX.html)

------
dublinben
It looks like we can add Forbes to the list of publications that don't
understand what the "dark web" is. It's too bad that this article starts right
off with gross misinformation.

~~~
ryanlol
To be honest, "dark web" seems to lack a clear, well-defined meaning and
should be avoided.

------
elorant
Once again someone confuses "deep web" with "dark web".

------
mirimir
This is wonderful news. Information (including Tor onion services) wants to be
freely findable ;)

Obscurity should never be the major aspect of OPSEC.

Even with the sorry state of Tor onion services, sites can authenticate users
at both the network level, using the "stealth" authorization protocol, and the
application level. Of course, prudent users will use end-to-end encryption for
sensitive information. And they'll consider carefully before sharing.
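
At the network level, the "stealth" setup is just a few torrc lines. Roughly
(the client names here are hypothetical, and each client's cookie comes from
the service's hostname file):

    # service side: unlisted clients can't even reach the service
    HiddenServiceDir /var/lib/tor/hidden_service/
    HiddenServicePort 80 127.0.0.1:8080
    HiddenServiceAuthorizeClient stealth alice,bob

    # each client's torrc: onion address plus its auth cookie
    HidServAuth youronionaddress.onion <cookie-from-hostname-file>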

All of the "hidden markets", by the way, are little more than honeypots in
waiting, enticing customers to pwn themselves.

Edit: See Roger Dingledine's reply in the "Clarification of Tor's involvement
with DARPA's Memex" thread on tor-talk
<[https://lists.torproject.org/pipermail/tor-
talk/2015-April/0...](https://lists.torproject.org/pipermail/tor-
talk/2015-April/037538.html>).

------
downandout
The idea that a single company can perfect search - even one employing a
significant percentage of the world's PhDs - is absurd. However, Google will
likely continue to dominate search even in the face of superior alternatives
because of their position in the marketplace and in the minds of consumers,
just as Microsoft continues to dominate the OS space long after the
introduction of viable free alternatives. This is why, if you have ever tried
to pitch a search startup to VCs, you have almost certainly received a less-
than-polite "no" along with your coconut water.

These tools will be interesting to use for a handful of techies like myself,
but Google will remain safely ensconced in its fortress of cash for at least
the next several decades regardless of this and whatever else comes along that
may enable the creation of superior search engines.

~~~
afarrell
> long after the introduction of viable free alternatives

Maybe Ubuntu (or Mint?) has learned UX design and found a way to sand down
all its rough edges in the 3 years since I switched to OS X, but I doubt it.
Running Linux requires the time and patience of a knowledgeable person. Linux
is far too expensive for most people.

~~~
pekk
If you ever have to service or install Windows (and everyone has to
eventually) then that is even harder because it's not primarily designed to be
worked with at that level. Windows rests on a cushion of someone else
providing IT support. Why should Linux perform usability miracles no other OS
does?

~~~
dogma1138
How many times have you had issues running software on Windows compared to
Linux?

Even with "consumer friendly" distro's like Ubuntu compatibility is an issue
and you can easily screw up co-hosted software and even the entire OS due to
the current dependency mess which plagues the F/OSS world especially on Linux.

Back in the '90s and maybe early 2000s you would get silly error messages
like "cannot find XYZ.dll" or "cannot find ZYX.ocx" on Windows, but today? I
don't think so. Other things, like configuring network sharing, backups, and
more advanced network setups, are still considerably easier on Windows than on
Linux.

And it's not just the UI: both GNOME and KDE have offered UI coverage of
pretty much 95% of anything you can configure on Linux (core services only, of
course), but their UIs are still sucky.

I hadn't used Windows Server since 2003; I jumped into 2012 and could find
everything, and any new features were self-explanatory. Trying to make OS
changes in GNOME? Meh, I'd rather just find the config file and edit it
manually.

Linux needs to get its shit together, and hopefully one day it will, because
at the moment, even with all the nice and user-friendly package managers, at
best half of your software will come in archives which a basic consumer user
doesn't know what to do with (nor should they), and at worst comes completely
uncompiled, forcing you to make it on your own machine and hope you have all
the dependencies and header files you need.

And for the software you can get from your distro's "app store", you're still
crossing your fingers hoping one application won't cause all of your other
ones to brain-fart because the dependencies it uses are of a slightly newer or
older version.

I've worked for a company that tried to switch their desktops to Linux, and
I've heard plenty of stories of people who encountered similar cases. It just
never bloody works, and then they panic-hire an entire new IT department that
still doesn't manage to get everything working properly for everyone, and the
company scraps the entire project 6-8 months down the line, losing probably
several orders of magnitude more than they would ever have gained from
reducing licensing costs across their desktops.

------
kbwt
So, browsing through these projects I see a lot of crawlers.

Where is the search tech? Other than that one project which seems to be a JS-
frontend for Lucene/Solr, I can't find anything having to do with Information
Retrieval.

Unless I'm missing something, the article and the previous one it links to
seem to be a bit overblown.

"Could The U.S. Military's New Search Engine Replace Google?" -> Not if the
U.S. Military's New Search Engine is the vanilla Lucene/Solr.

Link:
[http://www.darpa.mil/opencatalog/MEMEX.html](http://www.darpa.mil/opencatalog/MEMEX.html)

------
WalterGR
This is where the "Memex" name comes from:
[http://en.wikipedia.org/wiki/Memex](http://en.wikipedia.org/wiki/Memex)

A memex is a "hypothetical proto-hypertext system that Vannevar Bush described
in his 1945 The Atlantic Monthly article 'As We May Think'."

I'm rather disappointed that DARPA chose it as the name for a project that,
according to Wired (via Wikipedia), "aims to shine a light on the dark web and
uncover patterns and relationships in online data to help law enforcement and
others track illegal activity".

~~~
nl
It's explicitly designed to work outside that single sphere:

 _Memex seeks to develop software that advances online search capabilities far
beyond the current state of the art. The goal is to invent better methods for
interacting with and sharing information, so users can quickly and thoroughly
organize and search subsets of information relevant to their individual
interests....

Memex would ultimately apply to any public domain content; initially, DARPA
plans to develop Memex to address a key Defense Department mission: fighting
human trafficking._

[http://www.darpa.mil/Our_Work/I2O/Programs/Memex.aspx](http://www.darpa.mil/Our_Work/I2O/Programs/Memex.aspx)

~~~
WalterGR
No, DARPA _plans to_ expand Memex so it works outside that single sphere. I'm
not reading anything that suggests it's _designed to_ work outside that
sphere.

But we could go back and forth on the distinction between "seeks to" and
"designed to" all night, so I'll leave it at that. :)

~~~
phy6
You're both right. The trend with these I2O programs (and other departments)
is to design a system of loosely coupled pieces that are tailored to a
specific problem domain, in this case human trafficking, and then generalize
it over the next couple of years. By the third or fourth year you could expect
to see multiple engagements with corporate industry and Gov't/DOD/LE/etc. to
apply these tools to their data, and thus generalize and polish the final
product. The end goal is adoption of the technology and successful transition
to some partner. The good programs tend to run for 3-4 years if they can
justify their existence to the DARPA leadership. This is also roughly the
tenure of a program manager, although a program manager may have more than one
program going at a time.

------
Drdrdrq
Yeeeeeah... There's much more to Google than just search. Also, there is much
more to a search engine than just search (SEO spam countermeasures come to
mind). Good luck replacing Google on its home turf.

~~~
XJOKOLAT
However, the idea is interesting.

I once used Chrome. I now use Firefox.

Nothing lasts forever.

Then again, what's to stop Google adopting the same technology?

------
amelius
In order to replace Google, just implement their patents and then wait until
they expire.

At least, that is how it should work.

~~~
CaveTech
Except you'll be 20 years behind their current version.

~~~
amelius
You are assuming that search quality will keep on growing with no asymptotic
limit. It could very well be that we will soon reach a point where search is
good enough for most people.

~~~
gwern
You're assuming a sort of static model of search where it makes sense to talk
about it growing to an asymptotic limit.

But it's a dynamic and adversarial environment: the corpus evolves to destroy
the quality of search results (spam/SEO). A 20-yo version of Google will be
way worse than Google was 20 years ago, because when it goes live it'll be
attacked by 19 years of spam innovation (if I can use that word in this
context).

~~~
alexmobile
IMHO, spammers have little chance against Google in the long run. Here is why:

- it is of the highest priority for Google to stay ahead of SEO spammers, or
they stand to lose their multi-billion-dollar search business, so resources to
fight spam will always be allocated

- it would take only 1 smart Google person to write a smart algo, and it will
get amplified by 100,000 server cores at Google. And Google employs >> 1 smart
EE

- the Chrome browser has ~50% market share and may collect enormous SERP
quality / engagement metrics: what "end users" actually click in search
results, how much time they spend there vs. the amount of content on that
page, and which page finishes each search quest (i.e. the user has found what
they were looking for) - these are in fact some of the best "quality signals"
available to Google

- Google could completely ignore signals from any Chrome instance under even
the slightest suspicion of being manipulated (Chrome is a native app, so it
could monitor mouse movement patterns, etc.)

- I would say Google is much better equipped than spammers to stay ahead. Of
course, there are going to be some short-term advances in black-hat SEO or
social-network manipulation, but eventually every serious loophole gets closed

~~~
dogma1138
Anyone who's used a search engine disagrees with you. And as far as Google
goes, their only "gripe" with SEO spammers is that they get the money instead
of Google.

So yes, Google does work to stay ahead of SEO spammers by providing a better
service than they do for those who are willing to pay, because consumers don't
care, and as long as you don't get a Viagra commercial and 17 types of malware
by clicking on a promoted search result in Google, neither will you.

