
Web Scraping Indeed for Key Data Science Job Skills - jonbaer
https://jessesw.com/Data-Science-Skills/
======
stewhir
As a little side project, I built a website at:
[http://skill.report](http://skill.report) which instantly does this for any
job title. Go try it! I'd love to hear your feedback :)

It works by sampling job ads from Indeed, then applying some information
extraction/retrieval/NLP algos to extract and weight the presence of
identified skills and qualities. There are some occasional glitches in the algo
(I need to fix some of the disambiguation data), but it usually gives
reasonable results.
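
A minimal sketch of the extract-and-weight step, assuming skills are matched
against a fixed vocabulary (the skill list and postings below are made up, and
the site's actual algorithm is surely more involved):

```python
import re
from collections import Counter

# Hypothetical skill vocabulary and sampled job-ad texts
SKILLS = ["python", "sql", "machine learning"]
postings = [
    "We need Python and SQL experience.",
    "Looking for a Python dev with machine learning chops.",
]

def skill_weights(postings, skills):
    """Weight each skill by the fraction of postings that mention it."""
    counts = Counter()
    for text in postings:
        for skill in skills:
            # Word-boundary match so short skill names don't fire
            # inside longer words (e.g. "r" inside "experience")
            if re.search(r"\b" + re.escape(skill) + r"\b", text, re.IGNORECASE):
                counts[skill] += 1
    return {s: n / len(postings) for s, n in counts.items()}
```

Here "python" would come out at weight 1.0 (mentioned in both ads) and "sql"
at 0.5.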

I was thinking of focusing the algorithm on giving really in-depth feedback
for improving your resume for a specific job. Now if only I could finish my
PhD thesis I might actually have the time to do more with it...

~~~
pc86
Love the design and my first query "developer" turned up some good results.
Some feedback:

- If I enter "c#" as a query, it simply refreshes the page.

- A lot of the "skills" I am getting back are simply rephrased job titles
(e.g. "web developer" returned "web applications, web development, web
services, mobile application development, support, responsibility, web design,
javascript, project and software developer" as the skills list).

Definitely has a lot of promise though, if you can reliably filter out skills
from job descriptions.
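
The "c#" case smells like a URL-encoding bug: a raw "#" starts a URL fragment,
so the query is truncated to "c" before it ever reaches the server. A minimal
sketch of the fix (the /search path and q parameter here are made-up
placeholders, not the site's actual API):

```python
from urllib.parse import urlencode

def search_url(query, base="http://skill.report/search"):
    # An unencoded "#" starts a URL fragment, so a raw "c#" query is
    # truncated to "c" client-side; urlencode escapes it to %23.
    return base + "?" + urlencode({"q": query})

search_url("c#")  # -> http://skill.report/search?q=c%23
```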

------
logn
_look through Indeed 's pages of job results and click on all of the job
links, but only in the center of the page where all of the jobs are posted
(not on the edges)._

I wrote a toolkit to help solve this problem [1]. An issue with taking the
approach of hard-coding the result pattern to scrape is that it can break when
the page changes. E.g., the author's code has:

    page_obj.find(id = 'resultsCol')

If Indeed ever changes that ID, the program won't work. In that respect, it's
better to dynamically figure out where the results are.
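
One defensive sketch of that idea (the fallback heuristic below is made up for
illustration, not ScreenSlicer's actual approach): prefer the known ID, but
fall back to the link-densest block when the layout changes.

```python
from bs4 import BeautifulSoup

def find_results_container(html):
    """Prefer the hard-coded ID, but fall back to a simple heuristic
    (the <div> containing the most links) if the page layout changes."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="resultsCol")
    if container is not None:
        return container
    # Hypothetical fallback: result listings are usually the
    # link-densest block on a search results page
    divs = soup.find_all("div")
    return max(divs, key=lambda d: len(d.find_all("a")), default=None)
```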

And as far as cleaning up unicode and HTML entities goes, I like the "he"
project [2]. Within text fields, these HTML parsing libraries don't do a very
good job, so unfortunately this extra parsing is necessary. And sometimes
properly stripping all duplicate whitespace means getting very familiar with
control characters (and corrupted control characters from bad encoding), as
well as left-behind HTML tags/entities [3].
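
("he" is a JavaScript library; in Python, the stdlib `html.unescape` plus a
couple of regex passes covers similar ground. A rough sketch of that cleanup,
under the assumption that tabs/newlines should collapse to single spaces:)

```python
import html
import re

def clean_text(raw):
    """Unescape HTML entities, drop control characters, and collapse
    runs of whitespace (including non-breaking spaces) to single spaces."""
    text = html.unescape(raw)                              # "&amp;" -> "&"
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)  # control chars
    text = re.sub(r"\s+", " ", text)                       # \n, \t, \xa0, ...
    return text.strip()
```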

1. [https://github.com/MachinePublishers/ScreenSlicer](https://github.com/MachinePublishers/ScreenSlicer)

2. [https://github.com/mathiasbynens/he](https://github.com/mathiasbynens/he)

3. [https://github.com/MachinePublishers/ScreenSlicer/blob/maste...](https://github.com/MachinePublishers/ScreenSlicer/blob/master/common/src/com/screenslicer/common/CommonUtil.java#L791)

~~~
zo1
Could you please elaborate on what you mean by "dynamically figure out where
the results are"? Or how to go about doing it?

Edit: I see your first link sorta answers that. And correct me if I'm wrong,
but when I went there, it seemed that the library caters more towards
automatic searching + paging rather than extracting results?

~~~
logn
It handles extraction too, trying to find where the results are and then
extracting the title/summary/url/date individually.

To elaborate on the general approach I used, it was to take each node in the
web page and get stats about all of them (e.g., position on page, amount of
freetext, etc) and plug those stats into a neural net.
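
A toy version of that per-node featurization (the feature set here is
illustrative, not ScreenSlicer's actual one, and the classifier itself is
omitted):

```python
from bs4 import BeautifulSoup

def node_features(html):
    """Collect simple per-node stats that could feed a classifier:
    tree depth, amount of free text, and number of links underneath."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.find_all(True):  # every tag in the document
        rows.append({
            "tag": node.name,
            "depth": len(list(node.parents)) - 1,   # 0 for top-level tags
            "text_len": len(node.get_text(strip=True)),
            "links": len(node.find_all("a")),
        })
    return rows
```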

I worked on a different project some years ago that took the approach of
looking for repeating tag patterns in the page, focusing especially on
structural tags (as opposed to ones that are purely for formatting).

Another possible approach might be to just plug the whole result page into
something such as Boilerpipe
([https://code.google.com/p/boilerpipe/](https://code.google.com/p/boilerpipe/))
and look at the set of urls in the text block it identifies.

------
dheera
I wish there was a system of web scrapers where the scraping logic is user-
contributed and decentralized at the same time. Being decentralized, there
would be no way for the owners of websites to stop anyone from scraping, and
being user-maintained, the logic gets updated quickly whenever the original
website's HTML template changes.

~~~
gwu78
Assuming that someone has created such a system but has not released it, and
that this person asked you what you might do with her system, what would you
answer?

Do you envision that users would want to run such a system, e.g., if there was
a public benefit to such information sharing?

What if the implementation was a group of small programs written in C that
communicated with each other, and no browser extensions or scripting languages
were required?

What if the system required attachment of dedicated hardware to the user's
LAN, e.g., a $25 single board computer?

~~~
dheera
> Assuming that someone has created such a system but has not released it, and
> that this person asked you what you might do with her system, what would you
> answer?

Build stuff that takes information and uses it in new, interesting, creative,
and useful ways. Right now there is a lot of extremely useful data that is
trapped inside the interfaces of websites and apps that could be used in
amazing ways but unfortunately there's no easy way to get at the data.

I don't think hogging information and intellectual property will last very
long as a means of creating value. We as a society need to think of better
business models and better ways to define progress than this.

> Do you envision that users would want to run such a system, e.g., if there
> was a public benefit to such information sharing?

Sure, if they are getting something out of it too. For example, the new ways
of accessing information should only be usable if they participate in running
the system.

> What if the implementation was a group of small programs written in C that
> communicated with each other, and no browser extensions or scripting
> languages were required? What if the system required attachment of dedicated
> hardware to the user's LAN, e.g., a $25 single board computer?

All this sounds good to me. I'd want the full hardware and software stack to
be open-source though if it's going to be plugged into my home network, so
that there's no chance of it violating my privacy.

One hurdle will be how to enforce that users MUST contribute a piece of their
bandwidth in order to be able to use the fruits of the system (e.g. you need
to help others make scraping API calls before you can issue calls yourself).
Napster did this for music, but as with any centralized system, it will
eventually get sued and shut down.

In order to decentralize this I think a cryptographic currency similar to
Bitcoin will be needed: you get points for offering bandwidth, you need to
spend points in order to make calls on other peoples' bandwidth.

------
flashman
This is very cool. I've been kicking around a similar approach to feed data
into a recommendation engine: collect job listings, filter against a list of
stop words, then see if ones with similar words turn out to be similar jobs.

One of the problems with this approach (in my country at least) is that the
heavy presence of recruitment firms means every job is listed up to four
times: once by the employer, and once by each firm competing to fill the role.
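
A crude sketch of collapsing those duplicates before feeding a recommender,
using word-set Jaccard similarity (the 0.8 threshold and whitespace
tokenization are guesses to tune, not established values):

```python
def jaccard(a, b):
    """Similarity of two texts as the overlap of their word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def dedupe(listings, threshold=0.8):
    """Keep the first of any group of near-identical listings."""
    kept = []
    for text in listings:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept
```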

------
Bammybums
If you don't like coding scrapers, you can always use something like
[http://import.io](http://import.io)

the 'magic' API works on a lot of list websites:
[https://magic.import.io/?site=http:%2F%2Fwww.indeed.co.uk%2F...](https://magic.import.io/?site=http:%2F%2Fwww.indeed.co.uk%2Fjobs%3Fq%3Dcar%26l%3DLondon)

------
jefb

        for script in soup_obj(["script", "style"]):
            script.extract() # Only need these two elements from the BS4 object

That comment is a bit misleading: `.extract()` will remove those tags from
the tree; 'script' and 'style' are the two tags you __don't__ need.

[http://www.crummy.com/software/BeautifulSoup/bs4/doc/#extrac...](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract)

------
sport_billy
Nice work. Although not in Python,
[http://trendyskills.com](http://trendyskills.com) has a broader set of
skills and more countries, with an open API for everyone.

------
Toast_
Perhaps offtopic: Is web scraping a desired skill in the job market? Asking
because unemployed.

~~~
iheartmemcache
Sort of. There are tons of companies offering crawling-as-a-service, where
you can do your own scraping. Then there are tools like import.io which let
you point-and-click data right into your database. There's Scrapy and similar
frameworks for Python, and I'm sure npm has more than a few packages to deal
with dynamically rendered (e.g. infinite scroll) pages. In short, it's a
useful tool to have in your toolbelt, but not the hammer you'll use every day.

------
edem
The page is not available for me.

------
blumkvist
re: python vs. R

I haven't done Python at all, but from reading bits and pieces online, it
seems to me that Python is a lot more about ML than statistics.

Also, how does one decide what a "data scientist" is? Is it only people who
do stats + ML + IT? What about a researcher in economics or biology? Or
marketing? Are those included? They do a lot of stats and increasingly a lot
of ML, though not so much information technology (they're less concerned with
data storage and retrieval because they have other people to do that for
them). I'd venture that researchers like that far outnumber pure-play data
scientists, and it would be interesting to see the technical skills for them.
I bet SPSS and SAS would look a lot more in demand.

~~~
sososoko
You can take a look at the book Analyzing the Analyzers. The authors surveyed
data scientists, asking about their experiences and how they viewed their own
skills and careers. It answers your question of what a "data scientist" is.

[http://www.oreilly.com/data/free/files/analyzing-the-analyzers.pdf](http://www.oreilly.com/data/free/files/analyzing-the-analyzers.pdf)

~~~
blumkvist
That's a great paper. I wonder how I've missed it. I'm happy to say that it
partially confirms my own take on the situation. I thought of it in only two
categories: comp sci/math guys vs. applied statistics guys (mostly from the
humanities). Reality seems more nuanced than that (as always). Very nice read!

