In the Python community, people often use Scrapy + Splash together for crawling more dynamic, JavaScript-heavy websites.
A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context"; it's called scrapy-cluster. The idea is Scrapy workers running across many machines, sharing a single crawling queue (in the current prototype, powered by Redis) between them all.
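To give a flavor of the idea -- this is just a minimal sketch, not scrapy-cluster's actual code, and crawl() below is a placeholder for your fetch-and-parse step -- every worker shares one Redis list as the frontier, plus a set for dedup:

```python
# Minimal sketch of a shared Redis crawl frontier; NOT scrapy-cluster's
# actual implementation. crawl() is a hypothetical fetch-and-parse step.
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(url):
    # sadd returns 1 only the first time a URL is seen, so the set
    # doubles as a cross-worker dedup filter
    if r.sadd("crawl:seen", url):
        r.lpush("crawl:queue", url)

def worker():
    while True:
        # brpop blocks until any machine pushes a URL onto the queue
        item = r.brpop("crawl:queue", timeout=30)
        if item is None:
            break  # frontier drained
        _, url = item
        for link in crawl(url.decode()):  # placeholder: fetch page, extract links
            enqueue(link)
```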
Splash takes an interesting approach to crawling: it renders the page using WebKit and then exposes the "rendered DOM" -- so that your crawling code works against the page as a browser would actually see it, not just the raw HTML.
It is an interesting approach. There's evidence that Google crawls the web that way, though I don't know if it's been confirmed by the company.
Thanks! I've briefly looked at Splash and related projects like ScrapingHub, etc. - looks like this niche is alive and kicking...
The distributed scrapy-cluster is the way to go if you need to crawl anything of decent size (maybe even Amazon - 300+ MM webpages, j/k :)
I see a lot of Python-based projects recently, even in the Bitcoin niche, and we have a local Toronto-based Python meetup. Looks like the Python dev community is active.
I have the domain name PYFORUM.com - would it be a good idea to launch a forum site? With Bitcoin tipping built in? So instead of saying "Thanks", people could send $0.25 in Bitcoin to whoever helped them in the forums or made them laugh. What are the most established Python forums out there?
BTW, I'm one of the co-authors of streamparse, one of the DARPA-supported projects being developed by my company, Parse.ly. It lets you integrate Apache Storm cleanly with Python.
I just gave a talk about streamparse at PyCon US a few days ago (https://www.youtube.com/watch?v=ja4Qj9-l6WQ); it was titled "streamparse: defeat the Python GIL with Apache Storm". I'm glad to answer any questions about it.
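For a taste of the programming model, here's a minimal word-count bolt, roughly what streamparse code looks like (exact signatures vary a bit by version). Each spout/bolt runs as its own Python process on the Storm cluster, which is how it sidesteps the GIL:

```python
from streamparse import Bolt

class WordCountBolt(Bolt):
    outputs = ["word", "count"]

    def initialize(self, conf, ctx):
        self.counts = {}

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] = self.counts.get(word, 0) + 1
        # emit a running count downstream; Storm handles the wiring,
        # parallelism, and delivery guarantees
        self.emit([word, self.counts[word]])
```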
With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.
> ...DARPA said Memex wasn’t about destroying the privacy protections offered by Tor, even though it wanted to help uncover criminals’ identities. “None of them [Tor, the Navy, Memex partners] want child exploitation and child pornography to be accessible, especially on Tor. We’re funding those groups for testing...”
Doesn't this sound like the same "protect the kids" line embedded in every press release for not-so-subtle government spy programs? $1 million is a lot of money, and I'm sure being able to name-drop DARPA in any conversation about your company carries its own cachet -- surely you feel pressured to design your optimizations to fit DARPA's needs. Does it feel weird to write code that's being used to track people? Or is that off base?
To be clear, our projects (at Parse.ly) don't have anything to do with Tor. In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.
But I'll address your general question, which is: do I have a moral/ethical problem with DARPA funding some of our open source work, such as streamparse and pykafka?
The answer is a resounding "no". There are very few funding sources for open source work. Part of DARPA's funding supports fundamental tech advancements (famously, the Internet itself and GPS) and, more recently, important open source projects (such as Apache Spark and the Julia language).
Now, there is no doubt in my mind that open source software is used for intelligence purposes, regardless of its funding source. To restrict one's contributions to F/OSS out of fear that some government or entity may use them toward an end you disagree with seems a battle you can only win through willful ignorance.
The nature of open source software is that people can use it however they please (within legal limits, of course). This is a trade-off I accept with eyes wide open, and in my mind, the benefit to the community for F/OSS always wins out.
It's a means of searching public-facing (albeit cloaked) content, not a means of tracking people specifically.
If anything, it's ethically much superior to what the NSA is doing: law enforcement searches for content that is clearly criminal (child pornography, actual terroristic threats, murder-for-hire services), then requests a warrant after showing the content to a judge. That's how the process should work: identify something illegal at the front, establish probable cause, then go in through the back with court approval. These search engines can only find content that is already accessible to other users.
The NSA is already in the back, looking for justification for already being there, then after finding something, lying and saying they went in through the front.
Of course, this software could theoretically be used to search a database of data unethically exfiltrated without a warrant, but that's not what the stated goal is and there doesn't seem to be any evidence of that.
They're using it to search for "human trafficking", by which they seem to mean adult women having sex in exchange for money. Oh, sorry - adult women who describe themselves as "latina" having sex for money; mustn't forget that part. (Seriously. Look at the pictures in the article.) Minor details like whether the women in question are actually trafficked, or whether deporting them sends them right back into the hands of the people who trafficked them, have never been terribly important to the police in the US. This will be used to hurt vulnerable women.
> Does it feel weird to write code that's being used to track people?
Does it feel weird to design mechanical implements designed for the sole purpose of destroying human life?
I'm not speaking of drones and missiles, mind you; I'm speaking of small arms, the very same tools so staunchly defended by libertarian lovers of the Second Amendment everywhere.
There are plenty of valid reasons to want to track someone over a network like Tor, just as there are insidious reasons. E.g. all the reasons that make legal, warrant-protected wiretaps a legitimate function of governments worldwide.
But even if there weren't valid reasons, other countries will develop (or already have) similar capabilities, so making DARPA your line in the sand for this is missing the point anyways.
Parse.ly has built a real-time content measurement layer for the entire web.
Parse.ly's analytics platform helps digital storytellers at some of the web's best sites, such as Ars Technica, The New Yorker, The Atlantic, The Next Web, and many more. In total, our analytics backend needs to handle over 10 billion monthly page views from 400 million monthly unique visitors.
Our entire stack is in Python, and our team has innovated in areas related to real-time analytics, building some of the best open source tools for working with modern stream processing technologies like Apache Kafka and Storm.
Our UX/design team has also built one of the best-looking dashboards on the planet, using AngularJS and d3.js.
Seems like a similar design to Apache Kafka (http://kafka.apache.org): AP, with partial ordering (Kafka orders within "partitions", but not across a topic).
One difference is that Disque "garbage collects" data once delivery semantics are achieved (client acks) whereas Kafka holds onto all messages within an SLA/TTL, allowing reprocessing. Disque tries to handle at-most-once in the server whereas Kafka leaves it to the client.
It will be good to have some fresh ideas in this space, I think. A Redis-style approach to message queues will be interesting because the speed and client-library support are bound to be pretty good.
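To make the retention difference concrete: with Kafka, consuming doesn't delete anything -- the client just tracks an offset, so rewinding the offset replays history. A sketch with pykafka (handle() is a placeholder for your processing step):

```python
from pykafka import KafkaClient

client = KafkaClient(hosts="127.0.0.1:9092")
topic = client.topics[b"events"]  # hypothetical topic name

consumer = topic.get_simple_consumer(
    consumer_group=b"reprocessor",
    auto_commit_enable=False,  # we decide when our position is recorded
)
for message in consumer:
    handle(message.value)      # placeholder processing step
    consumer.commit_offsets()  # records position only; the log keeps the
                               # message until its retention TTL expires
```

Under Disque's model, by contrast, the ack itself makes the message eligible for garbage collection, so there is no equivalent "rewind and reprocess" move.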
I don't really get it. Storm topologies are built in Java or Clojure using a builder interface, but the data structures for topologies are themselves DAGs that serialize using Thrift. It's true that this is a bit heavyweight compared to something like JSON or EDN, but offering an alternative is being discussed in the community right now. What would your ideal representation of topologies be, actually?
I wasn't aware that they're Thrift-serializable - that's cool, and it offers roughly what Onyx does in terms of its workflow representation.
Onyx goes a little further though in terms of its catalog. I wanted more of the computation to be pulled out into a data structure. That includes runtime parameters, flow, performance tuning knobs, and grouping functions. All of these things are represented as data in Onyx. It's a little harder, at least in my experience, to do these things in Storm.
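Purely as an illustration of "computation as data" (this is neither Onyx's nor Storm's actual format), the catalog idea amounts to something like:

```python
# Hypothetical topology-as-data: structure, tuning knobs, and groupings
# all live in one inspectable value instead of builder-method calls.
topology = {
    "workflow": [
        ("sentence-spout", "split-bolt"),
        ("split-bolt", "count-bolt"),
    ],
    "catalog": {
        "split-bolt": {"fn": "my.ns/split-sentence",  # function named as data
                       "parallelism": 4,              # tuning knob as data
                       "grouping": "shuffle"},
        "count-bolt": {"fn": "my.ns/count-words",
                       "parallelism": 8,
                       "grouping": {"fields": ["word"]}},
    },
}
```

Because it's plain data, you can diff it, generate it programmatically, or validate it before anything runs -- which is the property that's harder to get from a builder interface.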
Our experience was that of a Python shop that got backed into a corner and had to use Apache Pig for our Hadoop batch jobs.
We decided to rewrite some of those jobs from Pig to PySpark, and though there was a bit of a learning curve and some sharp edges, the development experience is so much better than Pig's that my team is generally happy with the switch.
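For a sense of why: a typical Pig GROUP/COUNT script collapses into a few lines of ordinary Python (the HDFS paths here are made up):

```python
from pyspark import SparkContext

sc = SparkContext(appName="pig-rewrite-sketch")

# roughly the PySpark equivalent of Pig's GROUP ... BY / COUNT(...)
events = sc.textFile("hdfs:///logs/events/*")           # hypothetical input
counts = (events
          .map(lambda line: (line.split("\t")[0], 1))   # key on first column
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/event_counts")       # hypothetical output
```

No UDF-registration dance, and the lambdas are unit-testable like any other Python.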