One of the supported projects is splash, which is basically WebKit-as-a-service. It takes an interesting approach to crawling where it renders the page using WebKit, and then exposes the "rendered DOM" -- so that your crawling code doesn't need to actually use JavaScript for information extraction. See:

https://github.com/scrapinghub/splash

People often use Scrapy + Splash together in the Python community for crawling more dynamic websites.
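
For anyone curious what that looks like in practice, here's a minimal sketch (assuming a Splash instance on localhost:8050 and a made-up target page). The spider fetches pages through Splash's render.html HTTP endpoint, so it parses the JavaScript-rendered DOM with ordinary selectors:

    from urllib.parse import quote

    import scrapy

    # Splash's HTTP API renders the page in WebKit and returns the final HTML.
    SPLASH = "http://localhost:8050/render.html?url=%s&wait=2.0"

    class RenderedSpider(scrapy.Spider):
        name = "rendered"
        start_urls = ["http://example.com/js-heavy-page"]  # hypothetical target

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(SPLASH % quote(url, safe=""), callback=self.parse)

        def parse(self, response):
            # response.body is the post-JavaScript DOM; extract as usual.
            for title in response.css("h1::text").extract():
                yield {"title": title}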

A team I collaborated with is also working on a project to make Scrapy usable in a "cluster context"; it's called scrapy-cluster. The idea is Scrapy workers running across machines, with a single crawling queue (powered by Redis in the current prototype) between them all.

https://github.com/istresearch/scrapy-cluster
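
To make that concrete, here's a toy sketch of the shared-queue pattern -- not scrapy-cluster's actual API, the key names and helpers are made up. Every worker pops crawl requests from a single Redis list, so scaling out just means pointing more workers at the same Redis instance:

    import json

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)  # the shared instance

    def enqueue(url, spider="default"):
        # Any machine can submit work to the cluster-wide queue.
        r.lpush("crawl:queue", json.dumps({"url": url, "spider": spider}))

    def next_request():
        # BRPOP blocks until work arrives, and is safe across many workers.
        _, raw = r.brpop("crawl:queue")
        return json.loads(raw)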

-----


    It takes an interesting approach to crawling where
    it renders the page using WebKit, and then exposes
    the "rendered DOM" -- so that your crawling code
    doesn't need to actually use JavaScript for
    information extraction.
It is an interesting approach. There's evidence that Google crawls the web that way, though I don't know if it's been confirmed by the company.

Googlebot indexes content rendered by JavaScript -- even content delivered by an AJAX request. They've announced they are going to start penalizing sites that don't work well on mobile. I don't know the specifics of that (and they probably haven't shared them), but I do know that I've received automated email from Google Webmaster Tools and/or AdSense about one of my sites not working great on mobile: small UI elements grouped too closely together, content that's too wide, etc.

-----


This is the tool recommended to me by a person on the AdWords team:

https://www.google.com/webmasters/tools/mobile-friendly/

According to them, starting April 21st it will be a ranking factor.

-----


    April 21st
Great. April 2011 was when Google launched Panda 1.0, from which I don't think my slang dictionary site has ever recovered.

Thanks for the link. I guess I better hop to it.

-----


Thanks! I've briefly looked at Splash and related projects like ScrapingHub, etc. -- looks like this niche is alive and kicking...

The distributed scrapy-cluster is the way to go if you need to crawl anything of decent size (maybe even Amazon -- 300+ MM webpages, j/k :)

I see a lot of Python-based projects recently, even in the Bitcoin niche, and we have a local Toronto-based Python meetup. Looks like the Python dev community is active.

I have a domain name, PYFORUM.com -- would it be a good idea to launch a forum site? With Bitcoin tipping built in? So instead of just saying "Thanks", people would be able to send $0.25 in Bitcoin to those who helped them in the forums or made them laugh. What are the most established Python forums out there?

Thanks!

-----


Launch a forum actually using Python...

Even the largest 'Python forum' is on phpBB...

-----


BTW, I'm one of the co-authors of streamparse, one of the DARPA-supported projects that is being developed by my company, Parse.ly. It lets you integrate Apache Storm cleanly with Python.

I just gave a talk about streamparse at PyCon US (https://www.youtube.com/watch?v=ja4Qj9-l6WQ) a few days ago; it was entitled "streamparse: defeat the Python GIL with Apache Storm". I'm glad to answer any questions about it.
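
If you'd rather skim code than the video, here's roughly what a streamparse bolt looks like -- a word-count sketch, so check the docs for the exact class layout in the current release. Each bolt instance is a separate Python process inside a Storm worker, which is how CPU-bound work spreads across cores and machines instead of contending on one GIL:

    from collections import Counter

    from streamparse import Bolt

    class WordCountBolt(Bolt):
        outputs = ["word", "count"]

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]
            self.counts[word] += 1
            # Storm handles routing, acking, and parallelism around this.
            self.emit([word, self.counts[word]])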

-----


With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.

> ...DARPA said Memex wasn’t about destroying the privacy protections offered by Tor, even though it wanted to help uncover criminals’ identities. “None of them [Tor, the Navy, Memex partners] want child exploitation and child pornography to be accessible, especially on Tor. We’re funding those groups for testing...”

Doesn't this sound like the same "protect the kids" line embedded in every press release for not-so-subtle government spy programs? $1 million is a lot of money, and I'm sure being able to name-drop DARPA in any conversation about your company carries its own cachet -- surely you feel pressured to design your optimizations to fit DARPA's needs. Does it feel weird to write code that's being used to track people? Or is that off base?

-----


To be clear, our projects (at Parse.ly) don't have anything to do with Tor. In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.

But, I'll address your general question, which is, do I have a moral/ethical problem with DARPA funding some of our open source work, such as streamparse and pykafka?

The answer is a resounding "no". There are very few funding sources for open source work. Part of DARPA's funding supports fundamental tech advancements (famously, the Internet itself and GPS) and, more recently, important open source projects (such as Apache Spark and the Julia language).

Now, there is no doubt in my mind that open source software is used for intelligence purposes, regardless of its funding source. To restrict one's contribution to F/OSS based on the fear that some government or entity may use it toward an end you disagree with seems a battle you can only win through willful ignorance.

The nature of open source software is that people can use it however they please (within legal limits, of course). This is a trade-off I accept with eyes wide open, and in my mind, the benefit to the community for F/OSS always wins out.

-----


> In fact, I didn't know much about Tor until researching DARPA and the other participants involved in the program.

This reminds me of the movie Cube :(.

-----


Brilliant little unknown film. First time I've seen it mentioned, ever.

-----


It's a means of searching public-facing (albeit cloaked) content, not a means of tracking people specifically.

If anything, it's ethically much superior to what the NSA is doing: law enforcement searches for content that is clearly criminal (child pornography, actual terroristic threats, murder-for-hire services), then requests a warrant after showing the content to a judge. That's how the process should work: identify something illegal at the front, establish probable cause, then go in through the back with court approval. These search engines can only find content that is already accessible to other users.

The NSA is already in the back, looking for justification for already being there, then after finding something, lying and saying they went in through the front.

Of course, this software could theoretically be used to search a database of data unethically exfiltrated without a warrant, but that's not what the stated goal is and there doesn't seem to be any evidence of that.

-----


They're using it to search for "human trafficking", by which they seem to mean adult women having sex in exchange for money. Oh, sorry, adult women who describe themselves as "latina" having sex for money -- mustn't forget that part. (Seriously. Look at the pictures in the article.) Minor details like whether the women in question are actually trafficked, or whether they should be deporting them right back into the hands of the people who trafficked them if they are, have never been terribly important to the police in the US. This will be used to hurt vulnerable women.

-----


> Does it feel weird to write code that's being used to track people?

Does it feel weird to design mechanical implements designed for the sole purpose of destroying human life?

I'm not speaking of drones and missiles, mind you; I'm speaking of small arms, the very same tools so staunchly defended by libertarian lovers of the Second Amendment everywhere.

There are plenty of valid reasons to want to track someone over a network like Tor, just as there are insidious reasons. E.g. all the reasons that make legal, warrant-protected wiretaps a legitimate function of governments worldwide.

But even if there weren't valid reasons, other countries will develop (or already have) similar capabilities, so making DARPA your line in the sand for this is missing the point anyways.

-----


>With only a brief skim of your talk, I wonder what you think of the moral implications of this project being DARPA supported.

You're not adding to the conversation by pointing this out. We can all clearly see this for what it is.

-----


How does one get their commercial project supported by DARPA?

-----


Hmm, not sure I could answer that question, as in this case, DARPA is supporting our open source projects, not our commercial projects. Or is that what you are asking?

That said, FastCompany covered the story of how we got involved with DARPA here:

http://www.fastcompany.com/3040363/the-future-of-search-brou...

-----


I looked at the MEMEX page and saw a bunch of companies represented, and was genuinely curious... thanks for sharing the link.

-----


Send your proposal in response to the Broad Agency Announcements (BAAs) that the agency puts out.

-----


What's the GitHub URL?

-----


https://github.com/Parsely/streamparse

-----


Solid. Apache licensed. You're inside tmux too.

This is legit.

Docs: http://streamparse.readthedocs.org/en/latest/

How did you make that screenshot / animated preview?

-----


I used a Linux program called byzanz. The bash alias I use to record GIF screencasts is here:

https://github.com/amontalenti/home/blob/master/.bash_aliase...

-----


Seems unlikely to me that a "natural" group of Hacker News users would upvote a link to a landing page with email capture.

-----


Hi there -- you can view the PDF without entering your email address; the email capture is just if you want to stay updated.

Here is the direct link to the PDF: https://chartmogul.com/wp-content/uploads/2015/04/ChartMogul...

-----


One of the best startup teams in NYC -- and maybe on the planet!

-----


Parse.ly (http://parse.ly) - Fully Remote - Full-Time

Parse.ly has built a real-time content measurement layer for the entire web.

Parse.ly's analytics platform helps digital storytellers at some of the web's best sites, such as Ars Technica, The New Yorker, The Atlantic, The Next Web, and many more. In total, our analytics backend system needs to handle over 10 billion monthly page views from 400 million monthly unique visitors.

Our entire stack is in Python, and our team has innovated in areas related to real-time analytics, building some of the best open source tools for working with modern stream processing technologies like Apache Kafka and Storm.

Our UX/design team has also built one of the best-looking dashboards on the planet, using AngularJS and d3.js.

Some blog posts about our technology:

- The Magical Time Series Backend Behind Parse.ly Analytics => http://blog.parsely.com/post/1633/mage/

- Lucene: The Good Parts => http://blog.parsely.com/post/1691/lucene/

- Whatever It Takes: Building Elegant, Beautiful, and Timely Data Digests => http://blog.parsely.com/post/46/whatever-it-takes/

We are hiring a backend engineer and a UX engineer, with the only requirement being some experience in Python/JavaScript. Apply via work@parsely.com (CV, GitHub link, one-paragraph intro), and make sure to mention this HN post!

-----


Remote anywhere or US only?

-----


Anywhere, but with a preference for the US EST timezone, plus or minus 2 hours.

-----


Source?

-----


Should have added a disclaimer. I work for Amazon.

-----


Seems like a similar design to Apache Kafka (http://kafka.apache.org): AP, with partial ordering (Kafka does ordering within "partitions", but not across topics).

One difference is that Disque "garbage collects" data once delivery semantics are achieved (client acks), whereas Kafka holds onto all messages within an SLA/TTL, allowing reprocessing. Disque tries to handle at-most-once in the server, whereas Kafka leaves it to the client.

It will be good to have some fresh ideas in this space, I think. A Redis-style approach to message queues will be interesting, because the speed and client-library support are bound to be pretty good.
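
For reference, here's the Kafka half of that comparison as a sketch (using the kafka-python client; the topic name and broker address are made up). Messages stay on the broker for the whole retention window, and ordering is only guaranteed within a partition:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "events",                        # hypothetical topic
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",    # replay from the start of retention
        enable_auto_commit=False,        # the client decides when it's "done"
    )

    for msg in consumer:
        # Offsets increase monotonically within msg.partition only.
        print(msg.partition, msg.offset, msg.value)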

-----


Michael, can you explain this more? "[Storm] Topologies are written with functions, macros, and objects. These things are specific to a programming language, and make it hard to work at a distance -- specifically in the browser. JavaScript is the ultimate place to be when creating specifications."

I don't really get it. Storm Topologies are built in Java or Clojure using a builder interface, but the data structures for topologies themselves are actually DAGs that serialize using Thrift. It's true that this is a bit heavy-weight compared to something like JSON or EDN, but offering an alternative is a discussion in the community right now. What would your ideal representation of topologies be, actually?
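
For concreteness, here's the kind of thing I imagine when you say "specifications" -- a purely hypothetical topology-as-data sketch (this is not Storm's Thrift schema or any real format):

    # A DAG as plain data: any language, or a browser, could produce this.
    topology = {
        "spouts": {"sentences": {"parallelism": 2}},
        "bolts": {
            "split": {
                "parallelism": 4,
                "grouping": {"type": "shuffle", "from": "sentences"},
            },
            "count": {
                "parallelism": 4,
                "grouping": {"type": "fields", "fields": ["word"], "from": "split"},
            },
        },
    }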

-----


I wasn't aware that they're Thrift serializable - that's cool, and offers roughly what Onyx does in terms of its workflow representation.

Onyx goes a little further though in terms of its catalog. I wanted more of the computation to be pulled out into a data structure. That includes runtime parameters, flow, performance tuning knobs, and grouping functions. All of these things are represented as data in Onyx. It's a little harder, at least in my experience, to do these things in Storm.

-----


Our experience was as a Python shop that got backed into using Apache Pig for our Hadoop batch jobs.

We decided to rewrite some of those jobs from Pig to PySpark, and though there was a bit of a learning curve and some sharp edges, the development experience is so much better than Pig's that my team is generally happy with the switch.
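
For a flavor of the rewrite, here's a sketch of the kind of Pig GROUP/COUNT job that translates naturally to PySpark (paths and field positions are made up). The win is that the transformation logic is plain Python:

    from pyspark import SparkContext

    sc = SparkContext(appName="pageviews-by-url")

    counts = (
        sc.textFile("hdfs:///logs/pageviews/*.tsv")    # hypothetical input
          .map(lambda line: line.split("\t"))
          .map(lambda fields: (fields[0], 1))          # key on the URL column
          .reduceByKey(lambda a, b: a + b)             # Pig's GROUP ... COUNT
    )
    counts.saveAsTextFile("hdfs:///reports/pageviews")  # hypothetical output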

-----


PySpark is really compelling for Pig/Python shops. If it weren't for Pig on Spark, I'd fear for Pig's future.

-----


My favorite part of this speech: "Everybody Worships."

http://www.pixelmonkey.org/2014/11/26/everybody-worships

-----
