

Show HN: A Python Spider System with Web UI - binux
https://github.com/binux/pyspider

======
meowface
This looks really nice. The API seems more user-friendly than scrapy's.

------
adam-_-
How does this compare to scrapy? Why would I use one over the other, or is
either a fine choice?

~~~
binux
I'm working on a benchmarking suite
[https://gist.github.com/binux/67b276c51e988f8e2c31](https://gist.github.com/binux/67b276c51e988f8e2c31)
and have run into some problems...

pyspider comes from a vertical search engine project. We had two issues:

\- 100+ websites, which may change their templates or go down at any time. We
need a dashboard to monitor the changes and the failures.

\- Updates within 5 minutes: when a website is updated, we need to pick that up
within 5 minutes. We use the update time from the index (list) page to tell
which pages have changed. Pages should also be re-crawled after about 30 days
in case we missed something. A powerful scheduler is needed.
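
The re-crawl rule described above can be sketched in plain Python. This is only an illustration of the decision logic, not pyspider's actual scheduler; the function name `needs_crawl`, its parameters, and the `MAX_AGE` constant are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the "about 30 days" safety window mentioned above.
MAX_AGE = timedelta(days=30)


def needs_crawl(last_crawled: Optional[datetime],
                index_updated: Optional[datetime],
                now: datetime,
                max_age: timedelta = MAX_AGE) -> bool:
    """Decide whether a detail page should be fetched again.

    last_crawled  -- when we last fetched the page (None if never)
    index_updated -- update time shown for the page on the index (list) page
    now           -- current time, passed in so the function is easy to test
    """
    if last_crawled is None:
        return True  # never fetched before
    if index_updated is not None and index_updated > last_crawled:
        return True  # the index page reports a change since our last fetch
    # Safety net: re-crawl old pages in case an update was missed.
    return now - last_crawled >= max_age
```

A scheduler built this way would poll each index page every few minutes and call something like `needs_crawl` for every entry it lists.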

Obviously, I haven't found the right way to do this with scrapy. I'm not very
familiar with scrapy, so I can't say what pyspider can do that scrapy can't.

------
OedipusRex
Can someone explain what this is?

------
mrmondo
Nice project! I do wish it supported a PostgreSQL backend rather than (or as
well as I guess) MySQL.

------
_bitliner
I really like the flow/UX. Congratulations! Nice job!

What is the roadmap?

I do a lot of scraping; it is one of my daily jobs. I would consider
integrating it into one of my architectures.

~~~
_bitliner
Furthermore, what do you mean by `Javascript pages supported`? Can I just
specify where it has to click, or do I need to reverse engineer the ajax
calls?

~~~
binux
[http://demo.pyspider.org/debug/js_test_sciencedirect](http://demo.pyspider.org/debug/js_test_sciencedirect)
is a sample for this.

There is a phantomjs fetcher that can render the page as WebKit does.
Furthermore, you can have some JavaScript run before/after the page is loaded
to simulate a mouse click.

~~~
pknerd
But won't it be slow, assuming it downloads css/images etc.?

~~~
binux
Images are not downloaded by default. Both the fetcher and the phantomjs proxy
are totally async.

------
kidsil
Thanks for making me feel bad about my python-based aggregation solution :)

[https://github.com/AZdv/agricatch](https://github.com/AZdv/agricatch)

------
erikb
What is a "spider system"? Never heard that term before.

~~~
binux
sorry :(

------
bowlofstew
That is a nice tool....nice work!

------
Immortalin
Any plans for a gui based web scraper interface similar to portia?

~~~
binux
Currently, yes and no.

pyspider runs plain Python code, while something like portia is a code
generator (apologies if I'm wrong, I have not used it). So it could be built
as another WebUI module.

But for flexibility, I have no idea how to do it right at the moment. So we
have a css selector helper, but no plan for a complete tool.

~~~
prht
I am not trying to offend you, but I really don't understand when someone says
"yes and no". I hear it more and more these days. Is this becoming a cliche?
It can be "yes" or "no", not both together. "yes and no" is "no" for me.

~~~
smoe
Don't know about other languages, but in German this phrase is pretty common
when there is no clear yes or no answer. Like "yes to some extent but not
completely".

------
bjblazkowicz
How's the performance compared to scrapy?

~~~
binux
[https://gist.github.com/binux/67b276c51e988f8e2c31](https://gist.github.com/binux/67b276c51e988f8e2c31)

------
zbb
Take a look at the source code. The package hierarchy is not pythonic (using
"libs" as the top package is not a good idea).

~~~
paulhauggis
Why isn't it a good idea? I have plenty of projects set up this way and it
works well.

It looks pretty well organized to me.

~~~
vertex-four
The issue is that if you have two packages installed, and both use "libs" as
their top-level package, they'll collide. Use "projectname.common" instead.
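
A minimal sketch of the collision, assuming two hypothetical projects (`project_a` and `project_b`) that each ship a top-level `libs` package. It simulates the situation with two `sys.path` entries; when both are installed into the same site-packages directory, their files would land in one shared `libs` directory instead, which is no better.

```python
import importlib
import os
import shutil
import sys
import tempfile


def make_project(root, project):
    """Create <root>/<project>/libs/__init__.py recording which project owns it."""
    pkg_dir = os.path.join(root, project, "libs")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("OWNER = %r\n" % project)
    return os.path.join(root, project)


def demo_collision():
    """Import `libs` with both projects on sys.path and report which one won."""
    root = tempfile.mkdtemp()
    saved_path = list(sys.path)
    saved_mod = sys.modules.pop("libs", None)
    try:
        path_a = make_project(root, "project_a")
        path_b = make_project(root, "project_b")
        sys.path[:0] = [path_a, path_b]  # project_a is searched first
        importlib.invalidate_caches()
        libs = importlib.import_module("libs")
        return libs.OWNER
    finally:
        # Restore interpreter state and remove the temporary files.
        sys.path[:] = saved_path
        sys.modules.pop("libs", None)
        if saved_mod is not None:
            sys.modules["libs"] = saved_mod
        shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    print(demo_collision())  # prints "project_a"; project_b's libs is shadowed
```

Only one `libs` can exist under that import name, so code in `project_b` that does `import libs` silently gets the other project's package.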

~~~
fmueller
This is not true, you can specify package directories in setup.py.

See
[https://docs.python.org/2/distutils/setupscript.html#listing...](https://docs.python.org/2/distutils/setupscript.html#listing-whole-packages)

~~~
vertex-four
A package name != the actual name of the directory in the source tree. My
point stands.

