
HTML to Excel data extraction - changmin
http://www.listly.io/
======
changmin
Thank you, all.

Listly.io is a personal project I built a few days ago. I would like to hear
whether it is useful for you... or not.

Listly.io turns HTML into Excel in seconds without coding. It finds patterns
of repeated structure and extracts all the image links and text. It does not
look for specific tags (table, ul ...), but for the structure itself.
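
As an illustration only (this is not Listly.io's actual implementation, which
is not described in this thread), here is a minimal sketch of one way to detect
repeated structure without relying on specific tags. It assumes reasonably
well-formed HTML; `Node`, `TreeBuilder`, `signature`, `find_repeated` and
`extract_rows` are hypothetical names chosen for the example.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (subset, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child elements only
        self.content = []   # child elements and text, in document order


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; assumes tags are properly nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def signature(node):
    """Shape of a subtree: its tag plus the shapes of its children (text ignored)."""
    return (node.tag, tuple(signature(c) for c in node.children))


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find_repeated(root):
    """Return the largest group (size >= 2) of same-shaped siblings."""
    best = []
    for node in walk(root):
        counts = Counter(signature(c) for c in node.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        if n >= 2 and n > len(best):
            best = [c for c in node.children if signature(c) == sig]
    return best


def collect(node, texts, images):
    """Gather text and <img src> values below a node, in document order."""
    if node.tag == "img" and "src" in node.attrs:
        images.append(node.attrs["src"])
    for item in node.content:
        if isinstance(item, str):
            texts.append(item)
        else:
            collect(item, texts, images)


def extract_rows(html):
    """One row per repeated item: its texts followed by its image links."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []
    for item in find_repeated(builder.root):
        texts, images = [], []
        collect(item, texts, images)
        rows.append(texts + images)
    return rows
```

Exact shape matching like this is brittle on real pages, where items often
differ slightly; a real tool would likely need a fuzzier comparison. Each
returned row could then be written out with the `csv` module and opened in
Excel.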

For developers, I think an API would be the best way to plug this extractor
into other scrapers or your own scraper.

~~~
popey456963
This site is actually really awesome, and has worked for every website I've
tried! My only slight issue with it however is it took me a few minutes to
actually work out what "HTML codes" were and even then it was only from
watching the video. Have you considered renaming it to something like "HTML
Source Code"? It also seems to struggle on web pages where it can't find tables,
such as the following page I made, which contains no information:

[https://hastebin.com/eguluvoquq.html](https://hastebin.com/eguluvoquq.html)

~~~
changmin
I appreciate your thorough testing and feedback. Your suggestion is very
helpful.

Actually, any HTML source code (partial or full) is accepted: <div></div>,
<p></p>, <span></span>, <html></html>, etc. Following your advice, I
changed the placeholder description to "any HTML Source Code".

Secondly, my server returns a 500 error only when there is nothing to extract,
as with your code. I will fix it soon. Thank you.

------
est
I think Google Spreadsheet had something similar

[https://support.google.com/docs/answer/3093339?hl=en](https://support.google.com/docs/answer/3093339?hl=en)

I use it all the time for better sorting, filtering, etc.

~~~
changmin
I have used Google Spreadsheet to extract <TABLE> or <UL> content. It works
very well with them.

Compared to it, listly.io works well with all types of tags, as long as there
are repeated structures.

In my experiments, it has worked well with hundreds of kinds of websites.

e.g. Google/Bing search result, Amazon/Walmart/Ebay product list,
Twitter/Facebook/Tumblr posts, Twitch list, Bloomberg finance info, Threads of
a forum, Instagram comments, etc.

------
LeoPanthera
On macOS, you can literally copy and paste tables from Safari into Numbers.
Numbers can export to Excel, if you need to.

------
webrobots
Tried on Indeed jobs listing, Amazon product search, Craigslist and cannot get
it to work. I suggest you test the tool with top 10-20 most popular websites
that contain listing type data. Our company also did a little side project
similar to yours and packaged it as Chrome extension. We learned that it is
quite hard to make a universal tool to guess where data is, especially since so
many websites use <div> and <ul> with CSS to form table-like structures
instead of plain <table>. If you want, take a look at our tool:
[https://chrome.google.com/webstore/detail/instant-data-scrap...](https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaobibbnahnkdoiiah)

~~~
changmin
I just tested it on Amazon product search,
[https://www.amazon.com/s/ref=nb_sb_noss_2/130-9531298-529675...](https://www.amazon.com/s/ref=nb_sb_noss_2/130-9531298-5296754?url=search-alias%3Daps&field-keywords=tv) .
It works well, though the result comes out slowly (about 45 seconds). On the
result page, you can find the product list with 28 items. For better speed, I
agree that I should publish an API or a Chrome extension.

In addition, it also works well, in seconds, with the Craigslist apts / housing page.
[http://seoul.craigslist.co.kr/search/apa](http://seoul.craigslist.co.kr/search/apa)

Sorry for it being slow. This is a personal project, and I did not anticipate
so many new visitors. I need to scale the server up and out.

------
polm23
Looks like import.io.

[https://www.import.io/](https://www.import.io/)

I tried writing a script to do the same thing before - turns out finding the
element on the page with the most children and assuming each child is an entry
works surprisingly often.
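
A minimal sketch of that heuristic (not the original script, which is not
shown here), using only the Python standard library and assuming reasonably
well-formed HTML. `Element`, `Builder` and `guess_entries` are hypothetical
names chosen for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Element:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child elements only
        self.content = []   # child elements and text, in document order

    def all_text(self):
        """All text below this element, in document order."""
        parts = []
        for item in self.content:
            if isinstance(item, str):
                parts.append(item)
            else:
                parts.extend(item.all_text())
        return parts


class Builder(HTMLParser):
    """Builds a minimal element tree and keeps a flat list of all elements."""

    def __init__(self):
        super().__init__()
        self.root = Element("root")
        self.current = self.root
        self.elements = []

    def handle_starttag(self, tag, attrs):
        el = Element(tag, self.current)
        self.current.children.append(el)
        self.current.content.append(el)
        self.elements.append(el)
        if tag not in VOID_TAGS:
            self.current = el

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def guess_entries(html):
    """Pick the element with the most child elements; treat each child as an entry."""
    builder = Builder()
    builder.feed(html)
    builder.close()
    if not builder.elements:
        return []
    container = max(builder.elements, key=lambda el: len(el.children))
    return [" ".join(child.all_text()) for child in container.children]
```

On a tie, `max` keeps the first element it sees in document order, so a page
whose `<body>` has as many children as the real list would fool this sketch.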

~~~
changmin
The difference:

Import.io needs the user's clicks to determine what to extract; thus, the user
has to repeat them whenever the web page changes.

Listly.io needs only a URL or HTML code. It keeps working even if the web page
changes.

------
haberdasher
I use this extension for tables. Gets the job done:
[https://chrome.google.com/webstore/detail/table-capture/iebp...](https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop)

~~~
changmin
I think a Chrome extension is the best way for end users, too.

------
fenollp
I'm guessing this is doing some kind of tree-diff on the DOM?

Now if you would have this generate a graphQL spec file, you could run a
graphQL server acting as a proxy to lots of websites. That would be
interesting. Not sure how that fares with the websites' owners' ToS though.

~~~
changmin
Thanks for the idea. It makes me think about how to build the API. I need to
take a look at GraphQL.

------
joss82
Parseur is doing the same for emails. It's a bit more manual at first but it
works better IMHO.

[https://app.parseur.com](https://app.parseur.com)

------
fabianmg
I use a chrome web scraping extension:

[http://webscraper.io/](http://webscraper.io/)

------
guipsp
Falls flat on this very website.

~~~
nothrabannosir
To be fair to the guy, hn is among the worst in markup out there. Great lack
of js and css, but boy oh boy is that html ugly.

------
cdolan92
outwit hub is my go to for a large, complex extraction:
[https://www.outwit.com/](https://www.outwit.com/)

their marketing is poor but the product is very powerful

------
snowpanda
This looks good, can't wait for the scraper feature.

