
Scrapely: The brains behind Portia, our visual web scraping tool - unsettledtck
https://blog.scrapinghub.com/2016/07/07/scrapely-the-brains-behind-portia-spiders/
======
merraksh
I wonder if this was submitted after reading today's submission
([https://news.ycombinator.com/item?id=12047234](https://news.ycombinator.com/item?id=12047234))
about (real) Portia spiders and googling the subject.

~~~
unsettledtck
We did indeed model our Portia on the real spiders. We'd like to think our
version is a wee bit cuter...

~~~
david-given
After finding this picture of a real Portia, I have to disagree.

[https://c1.staticflickr.com/7/6096/6306406141_3b237e21ee_b.j...](https://c1.staticflickr.com/7/6096/6306406141_3b237e21ee_b.jpg)

Look at those big, soulful black eyes...

------
rkrzr
This looks like a good solution for scrapers where "close enough" is good
enough. If you need 100% accuracy, you can always fall back to using Scrapy
directly and make your scraping logic as accurate as you need. But in many
cases you can live with some false positives, and then this tool looks like it
will fit the bill.

~~~
unsettledtck
We've actually developed a way to convert Portia projects into Scrapy
spiders:
[https://blog.scrapinghub.com/2016/06/29/introducing-portia2c...](https://blog.scrapinghub.com/2016/06/29/introducing-portia2code-portia-projects-into-scrapy-spiders/)

and since this is all open source, here's a link to GitHub:
[https://github.com/scrapinghub/portia2code](https://github.com/scrapinghub/portia2code)

------
abc03
Does anyone know which machine learning methods are state of the art for
data extraction in general (invoices, images, documents, etc.), or where I
could find an overview?

~~~
ahljoh
We would need more context/information about your specific objectives.

\- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)

\- OCR (tesseract, pypdfocr, etc.)

\- Named-entity recognition (NER), i.e. finding and recognizing entities in
text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)

\- coreference resolution, dependency parsing (spaCy, SyntaxNet)

~~~
abc03
Thanks. Some great keywords to investigate. I'm mainly interested in two areas
at the moment:

\- invoices (I guess NER would be partially an option)

\- web scraping (wrapper induction)
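
For anyone unfamiliar with the term: wrapper induction means learning
extraction rules from a few annotated example pages instead of writing them by
hand. Below is a minimal sketch of the idea, assuming the simplest kind of
wrapper (one left and one right delimiter around the value). The function
names and sample pages are made up for illustration, and this is not a
description of how Scrapely itself works.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, annotated_value) pairs.

    Hypothetical helper: the left delimiter is the text shared immediately
    before every annotated value, the right delimiter the text shared
    immediately after it.
    """
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)  # first occurrence of the annotation
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    left = common_suffix(befores)
    right = os.path.commonprefix(afters)
    if not left or not right:
        raise ValueError("examples share no usable delimiters")
    return left, right


def extract(page, wrapper):
    """Apply a learned wrapper to an unseen page; None if it does not match."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Two annotated training pages (invented sample data).
examples = [
    ('<h1>Mug</h1><span class="price">$10</span><p>Red mug</p>', "$10"),
    ('<h1>Bowl</h1><span class="price">$25</span><p>Blue bowl</p>', "$25"),
]
wrapper = induce_wrapper(examples)

# An unseen page with the same layout.
new_page = '<h1>Cup</h1><span class="price">$7</span><p>Green cup</p>'
print(extract(new_page, wrapper))
```

With only one training page the learned delimiters would swallow the whole
page prefix, so the second example is what lets the wrapper generalize. Real
pages need something more tolerant than exact string delimiters, which is
where the "close enough" trade-off mentioned upthread comes in.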

