I just gave a talk about streamparse at PyCon US (https://www.youtube.com/watch?v=ja4Qj9-l6WQ) a few days ago; it was entitled "streamparse: defeat the Python GIL with Apache Storm". I'm glad to answer any questions about it.
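For anyone who'd rather skim code than watch the video: the core idea is that you write Storm topology components as plain Python classes, and Storm runs many of them in parallel across processes and machines, which is how you sidestep the GIL. A minimal word-count bolt looks roughly like this (adapted from the streamparse quickstart; exact class attributes vary between streamparse versions):

```python
# A minimal streamparse bolt: counts words coming off the stream.
# Storm runs many instances of this in parallel Python processes.
from collections import Counter

from streamparse import Bolt


class WordCountBolt(Bolt):
    outputs = ["word", "count"]

    def initialize(self, conf, ctx):
        self.counter = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counter[word] += 1
        self.emit([word, self.counter[word]])
```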
> ...DARPA said Memex wasn’t about destroying the privacy protections offered by Tor, even though it wanted to help uncover criminals’ identities. “None of them [Tor, the Navy, Memex partners] want child exploitation and child pornography to be accessible, especially on Tor. We’re funding those groups for testing...”
Doesn't this sound like the same "protect the kids" line embedded in every press release for not-so-subtle government spy programs? $1 million is a lot of money, and I'm sure being able to name-drop DARPA in any conversation about your company carries its own cachet -- surely you feel pressured to design your optimizations to fit DARPA's needs. Does it feel weird to write code that's being used to track people? Or is that off base?
But I'll address your general question, which is: do I have a moral/ethical problem with DARPA funding some of our open source work, such as streamparse and pykafka?
The answer is a resounding "no". There are very few funding sources for open source work. Part of DARPA's funding supports fundamental tech advancements (famously, the Internet itself and GPS) and, recently, important open source projects (such as Apache Spark and the Julia language).
Now, there is no doubt in my mind that open source software is used for intelligence purposes, regardless of its funding source. To restrict one's contributions to F/OSS out of fear that some government or entity may use them toward an end you disagree with seems a battle you can only win through willful ignorance.
The nature of open source software is that people can use it however they please (within legal limits, of course). This is a trade-off I accept with eyes wide open, and in my mind, the benefit to the community for F/OSS always wins out.
This reminds me of the movie Cube :(.
If anything, it's ethically much superior to what the NSA is doing: law enforcement searches for content that is clearly criminal (child pornography, actual terroristic threats, murder-for-hire services), then requests a warrant after showing the content to a judge. That's how the process should work: identify something illegal at the front, establish probable cause, then go in through the back with court approval. These search engines can only find content that is already accessible to other users.
The NSA is already in the back, looking for justification for already being there, then after finding something, lying and saying they went in through the front.
Of course, this software could theoretically be used to search a database of data unethically exfiltrated without a warrant, but that's not what the stated goal is and there doesn't seem to be any evidence of that.
Does it feel weird to design mechanical implements whose sole purpose is destroying human life?
I'm not speaking of drones and missiles, mind you; I'm speaking of small arms, the very same tools so staunchly defended by libertarian lovers of the Second Amendment everywhere.
There are plenty of valid reasons to want to track someone over a network like Tor, just as there are insidious reasons. E.g. all the reasons that make legal, warrant-protected wiretaps a legitimate function of governments worldwide.
But even if there weren't valid reasons, other countries will develop (or already have) similar capabilities, so making DARPA your line in the sand for this is missing the point anyways.
You're not adding to the conversation by pointing this out. We can all clearly see this for what it is.
That said, FastCompany covered the story of how we got involved with DARPA here:
This is legit.
How did you make that screenshot / animated preview?
There are many interesting 'pieces of the puzzle' here that you can glue together to make something awesome.
If anyone cares about what success means in our case, it's the transition of technology to other agencies and companies (not necessarily military!). We want the research funded here to live beyond the program instead of dying when the program ends -- truly satisfying from a software engineering perspective, where many R&D deliverables are either too specialized or never see the light of day.
And then this announcement came along. I'll be looking at what kind of crawlers they release. Hopefully some modern ones based on a WebKit / Chromium core that expose the DOM and are suitable for navigating all these AJAX-fueled modern web interfaces.
Also very interested to see what kind of machine learning / classifiers they are using. When working on our search engine, we used purely statistical classifiers -- all variations of Bayes, SVMs (Support Vector Machines), and C4.5 decision trees -- glued together with some custom algos. We did not use neural nets at all. Nowadays, neural nets have a new name, "deep learning", and seem to be everywhere.
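To make that concrete, here's a toy version of that kind of pipeline in scikit-learn: a Bayes variant, an SVM, and a decision tree (sklearn's CART standing in for C4.5), glued together with a simple majority vote. The data and the voting "custom algo" here are invented for illustration, not what we actually shipped:

```python
# Toy sketch of a "purely statistical" text classifier ensemble.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier  # CART, standing in for C4.5

# Made-up toy data: 1 = spam-ish, 0 = ham-ish
train_docs = ["buy cheap watches now", "kernel patch review notes",
              "limited offer click here", "distributed crawler design"]
train_labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

models = [MultinomialNB(), LinearSVC(), DecisionTreeClassifier()]
for m in models:
    m.fit(X, train_labels)

def classify(doc):
    """Majority vote across the three classifiers."""
    x = vec.transform([doc])
    votes = sum(int(m.predict(x)[0]) for m in models)
    return 1 if votes >= 2 else 0

print(classify("cheap offer: buy now"))  # -> most likely 1
```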
Really really interesting in terms of what people would build with an Open Source Search Engine. Watch out Google :)
People often use Scrapy + Splash together in the Python community for crawling more dynamic websites.
A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context"; it's called scrapy-cluster. The idea is Scrapy workers running across machines with a single crawling queue (in the current prototype, powered by Redis) in between them all.
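The shape of the idea is simple enough to sketch in a few lines with redis-py. To be clear, this is just an illustration of the single-shared-queue concept (host name and the fetch() callable are placeholders), not scrapy-cluster's actual code or API:

```python
import redis

r = redis.StrictRedis(host="queue-host", port=6379)  # hypothetical shared Redis

def enqueue(url):
    # sadd returns 1 only the first time a member is added -> cheap dedup
    if r.sadd("crawl:seen", url):
        r.rpush("crawl:queue", url)

def worker_loop(fetch):
    # Every worker, on any machine, pops from the same queue.
    while True:
        _key, raw = r.blpop("crawl:queue")  # blocking pop
        url = raw.decode("utf-8")
        for link in fetch(url):             # fetch() returns discovered links
            enqueue(link)
```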
It takes an interesting approach to crawling: it renders the page using WebKit, and then exposes the "rendered DOM" -- so that your crawling code sees the page as a browser would, rather than the raw HTML.
According to them, starting April 21st it will be a ranking factor.
Thanks for the link. I guess I better hop to it.
The distributed scrapy-cluster is the way to go if you need to crawl anything of decent size (maybe even Amazon - 300+ MM webpages, j/k :)
I see a lot of Python-based projects recently, even in the Bitcoin niche, and we even have a local Toronto-based Python meetup. Looks like the Python dev community is active.
I have a domain name, PYFORUM.com - would it be a good idea to launch a forum site? With Bitcoin tipping built in? So instead of saying "Thanks", people would be able to send $0.25 in Bitcoin to those who helped them in the forums or made them laugh. What are the most established Python forums out there?
Even the largest 'Python forum' is on phpBB...
Obscurity should never be the major aspect of OPSEC.
Even with the sorry state of Tor onion services, sites can authenticate users at both network level, using the "stealth" authorization protocol, and application level. Of course, prudent users will use end-to-end encryption for sensitive information. And they'll consider carefully before sharing.
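For reference, the network-level piece is just a couple of lines in the service's torrc (a sketch for v2 onion services; the directory path and client names are placeholders):

```
# Service-side torrc: with "stealth" auth, clients without the generated
# cookie can't even learn the service's introduction points.
HiddenServiceDir /var/lib/tor/hidden_service/
HiddenServicePort 80 127.0.0.1:8080
HiddenServiceAuthorizeClient stealth alice,bob
```

Each authorized client then adds the generated cookie to their own torrc with a HidServAuth line.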
All of the "hidden markets", by the way, are little more than honeypots in waiting, enticing customers to pwn themselves.
Edit: See Roger Dingledine's reply in the "Clarification of Tor's involvement with DARPA's Memex" thread on tor-talk <https://lists.torproject.org/pipermail/tor-talk/2015-April/0....
These tools will be interesting to use for a handful of techies like myself, but Google will remain safely ensconced in its fortress of cash for at least the next several decades regardless of this and whatever else comes along that may enable the creation of superior search engines.
Maybe Ubuntu (or Mint?) has learned UX design and found a way to sand down all its rough edges in the 3 years since I switched to OS X, but I doubt it. Running Linux requires the time and patience of a knowledgeable person. Linux is far too expensive for most people.
Even with "consumer friendly" distro's like Ubuntu compatibility is an issue and you can easily screw up co-hosted software and even the entire OS due to the current dependency mess which plagues the F/OSS world especially on Linux.
Back in the '90s and maybe early 2000s you would get silly error messages on Windows like "cannot find XYZ.dll" or "cannot find ZYX.ocx", but today? I don't think so. Other things, like configuring network sharing, backups, and more advanced network setups, are still considerably easier on Windows than on Linux.
And it's not just missing UI: both GNOME and KDE have offered UI coverage for pretty much 95% of anything you can configure on Linux (core services only, of course), but their UI is still sucky.
I hadn't used Windows Server since 2003; I jumped into 2012 and could find everything, and any new features were self-explanatory. Trying to make OS changes in GNOME? Meh, I'd rather just find the config file and edit it manually.
Linux needs to get its shit together, and hopefully one day it will, because ATM, even with all the nice and user-friendly package managers, at best half of your software will come in archives which a basic consumer user doesn't know what to do with (nor should they), and at worst it comes completely uncompiled, forcing you to make it on your own machine while hoping you have all the dependencies and header files you need.
And for the software you can get from your distro's "app store", you're still crossing your fingers hoping one application won't cause all of your other ones to brain-fart because the dependencies it uses are of a slightly newer or older version.
I've worked for a company that tried to switch their desktops to Linux, and I've heard plenty of stories from people who encountered similar cases. It just never bloody works. Then they panic-hire an entire new IT department, which still doesn't manage to get everything working properly for everyone, and the entire project gets scrapped 6-8 months down the line, losing probably several orders of magnitude more than they would ever have gained from reducing licensing costs across their desktops.
I've used various Linux distros for years, and there's always something that doesn't work right. Whether it's the speakers not turning off when I plug my headphones in through my ThinkPad dock in Arch, or monitor position settings changing every time I reboot my desktop in elementary OS, the lack of polish really shows. I understand why this happens, and I don't fault the people working on the various distros. I've even dived in and fixed a few problems myself.
But nevertheless, it remains a problem, so for now I've just decided on windows for day to day use with a VM running linux when I need it.
More importantly, DuckDuckGo is actually considered worse than Google even by many people who use DuckDuckGo, so this is considered a significant feature. Granted, it's debatable how much worse, but Bing / DuckDuckGo are viewed as competitive search engines, not better search engines. Plenty of people do regular head-to-head comparisons, using different search engines for the same query; if they can't find it on one, they use the other. After a while, the 'better' engine generally becomes their default.
Windows, on the other hand, has a lot of inertia due to third-party integration. For example, there is a much smaller selection of games on OS X than Windows, which creates a feedback loop. There is no way OS X could hit 50% market share in 3 years, let alone a brand new OS starting from scratch, but search has seen massive swings in market share fairly recently.
PS: The !g feature is also great for fine-tuning their search engine. It provides a listing of every pain point where people gave up.
Where is the search tech? Other than that one project which seems to be a JS-frontend for Lucene/Solr, I can't find anything having to do with Information Retrieval.
Unless I'm missing something, the article and the previous one it links to seem to be a bit overblown.
"Could The U.S. Military's New Search Engine Replace Google?" -> Not if the U.S. Military's New Search Engine is the vanilla Lucene/Solr.
A memex is a "hypothetical proto-hypertext system that Vannevar Bush described in his 1945 The Atlantic Monthly article 'As We May Think'."
I'm rather disappointed that DARPA chose it as the name for a project that, according to Wired (by way of Wikipedia), "aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity".
> Memex seeks to develop software that advances online search capabilities far beyond the current state of the art. The goal is to invent better methods for interacting with and sharing information, so users can quickly and thoroughly organize and search subsets of information relevant to their individual interests....
> Memex would ultimately apply to any public domain content; initially, DARPA plans to develop Memex to address a key Defense Department mission: fighting human trafficking.
But we could go back and forth on the distinction between "seeks to" and "designed to" all night, so I'll leave it at that. :)
I once used Chrome. I now use Firefox.
Nothing lasts forever.
Then again, what's to stop Google adopting the same technology?
At least, that is how it should work.
But it's a dynamic and adversarial environment: the corpus evolves to destroy the quality of search results (spam/SEO). A 20-yo version of Google will be way worse than Google was 20 years ago, because when it goes live it'll be attacked by 19 years of spam innovation (if I can use that word in this context).
- it is of the highest priority for Google to stay ahead of SEO spammers, or they stand to lose their multi-billion-dollar search business, so resources to fight spam will always be allocated
- it would take only one smart Google person to write a smart algo, and it would get amplified by 100,000 server cores at Google. And Google employs >> 1 smart engineer
- the Chrome browser has 50% market share and can collect enormous SERP quality / engagement metrics: what end users actually click in search results, how much time they spend there vs. the amount of content on that page, and which page finishes each search quest (i.e. the user has found what they were looking for). These are in fact some of the best "quality signals" available to Google (a toy sketch of such a signal follows after this list)
- and Google could completely ignore signals from any Chrome instance that shows even the slightest suspicion of being manipulated (Chrome is a native app, so it could monitor mouse movement patterns, etc.)
- I would say Google is much better equipped than spammers to stay ahead. Of course, there are going to be some short-term advances in black hat SEO or social network manipulation, but eventually every serious loophole gets closed
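To gesture at what the third bullet means in practice, here is a toy version of a dwell-time / "ended the search quest" signal. Everything here (field names, weights, the cap) is made up for illustration; nobody outside Google knows the real formula:

```python
# Toy engagement-based quality signal: long dwell + ending the search
# session suggest the user found what they wanted; quick bounces
# ("pogo-sticking") drag the score down. All weights are invented.
from dataclasses import dataclass

@dataclass
class Click:
    url: str
    dwell_seconds: float   # time spent on the result page
    ended_session: bool    # did the search quest end on this click?

def quality_signal(clicks):
    """Toy score for one result URL from a batch of observed clicks."""
    if not clicks:
        return 0.0
    score = 0.0
    for c in clicks:
        score += min(c.dwell_seconds / 60.0, 1.0)  # cap dwell contribution
        score += 0.5 if c.ended_session else -0.25
    return score / len(clicks)

print(quality_signal([Click("http://example.com", 95.0, True)]))  # -> 1.5
```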
So yes, Google does work to stay ahead of SEO spammers, by providing a better service than them for those who are willing to pay. Consumers don't care, and as long as you don't get a Viagra commercial and 17 types of malware by clicking on a promoted search result in Google, neither will you.
And in this arms race, Google's massive resources give them a huge advantage. I think Google will probably maintain their dominance until they inevitably get caught doing something egregious enough that the government decides to step in and bust them up.