

Ask HN: Help for a newbie - mbm

Hi all,

I've been hanging out here for about a month as I've been learning Python. It's been a great process, and what I've learned is finally being tested by someone other than myself. Today at work (I work for a genomics startup), I faced a problem that likely would have been most elegantly solved through a script. Unfortunately, I was forced to face the Sisyphean task of doing everything manually using (wait for it) Excel. I haven't reviewed enough of the Python Standard Library yet to tackle it alone. If someone on here is just hanging out tonight, and would like to help a newbie, just shoot me an email or IM at the address on my profile. There are two potential reasons you might do this:

(a) It'll probably only take an experienced programmer roughly 15-20 mins, and you'll feel good about yourself for what you did.

(b) Another guy in my office (who works on software) is hacking together a solution this weekend in C. He's an old-school Fortran hacker and lover of all things compiled. I'd love to walk in Monday and compare code length in front of the whole group.

Thanks guys,

matt
======
daviding
Why not post what you need as a comment up here? People could then collaborate
on what they see as the solution plus it might help others who come back to
look at this in the future.

PS Not sure about (b) being a great humanitarian reason :)

~~~
mbm
Sure, no problem.

Like I mentioned in the listing, I work for a startup. One quick thing I
wanted to do today was to get some very basic data on our competitors from an
online source. I wanted to retrieve the following data for each of them:

(a) full company name
(b) current stock price
(c) market cap
(d) price-to-earnings ratio

I assume it wouldn't be too hard to extend this to other data as well, as long
as it's visible in a basic corporate financial profile on Google/Yahoo finance
or something similar. There's functionality to do this through Google Docs (a
function called GoogleFinance()), but I'd like to try to do it manually (so I
can learn).

I've retrieved data from webpages before, but I'm not sure how to ensure I'm
getting the actual values. I know it's probably trivial to get this directly
through some tool that creates a CSV file (e.g.,
<http://www.gummy-stuff.org/Yahoo-data.htm>), but then I couldn't beat Mr. C (I
don't hate C, I just want to show him it can be done with less code).

I've thrown up a small subset of our competitors' info into a public
spreadsheet you can view:

<https://spreadsheets.google.com/ccc?key=0AgbmjbWQrdmgdGZLb3VUWDJ4V0t6a3I5XzdEX21DMnc&hl=en>

Each column is a different competitor. The part before the colon abbreviates
the market the stock trades on (some are international markets, others are
nonstandard), and the part after the colon is the stock symbol.
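
For anyone scripting against the sheet, here is a rough sketch of how a
MARKET:SYMBOL cell could be split in Python. The function name `split_ticker`
and the sample cells are made up for illustration; they are not taken from the
actual spreadsheet.

```python
def split_ticker(cell):
    """Split a 'MARKET:SYMBOL' cell into a (market, symbol) tuple.

    If the cell has no colon, the market is returned as None.
    """
    market, sep, symbol = cell.strip().partition(":")
    if not sep:
        # No colon found: treat the whole cell as the symbol.
        return None, market.upper()
    return market.strip().upper(), symbol.strip().upper()


# Hypothetical sample cells (assumption: not real entries from the sheet).
cells = ["NASDAQ:ABCD", " lon:xyz ", "QRS"]
pairs = [split_ticker(c) for c in cells]
print(pairs)
```

Keeping the market and the symbol separate should make it easier to build
whatever lookup URL or query the chosen data source expects.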

thanks for your help guys!

~~~
arebop
There are a few variations on CSV, but a great many variations on HTML. You
could try to hack together some sort of quick and dirty regular expression
based extractor, but if I were you I'd use
<http://www.crummy.com/software/BeautifulSoup/>. It does a reasonably good job
of parsing HTML, even invalid HTML such as is often found in the wild.

Then it should be pretty straightforward to scrape out the data you want. The
scraper will be reasonably resilient, and because the parsing code is separate
from the code that deals with the structure of your particular data source, it
will be reasonably easy to update when the provider redesigns their web pages.
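
A minimal sketch of that separation, under some loud assumptions: it uses the
standard library's `html.parser` as a dependency-free stand-in for
BeautifulSoup (which would take over the generic parsing layer and cope better
with messy markup), and the `SAMPLE` page, the `FIELDS` labels, and the names
`TableRows` and `extract_profile` are all hypothetical. No real finance page is
guaranteed to look like this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Generic layer: collect each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


# Source-specific layer: which row labels map to which output fields.
# These labels are assumptions about a hypothetical page layout; this is
# the only part that should need changing after a redesign.
FIELDS = {"Company": "name", "Price": "price", "Mkt cap": "market_cap", "P/E": "pe"}


def extract_profile(html):
    """Return a dict of the wanted fields found in label/value table rows."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    profile = {}
    for row in parser.rows:
        if len(row) >= 2 and row[0] in FIELDS:
            profile[FIELDS[row[0]]] = row[1]
    return profile


# Made-up sample markup standing in for a downloaded page.
SAMPLE = """
<table>
<tr><th>Company</th><td>Example Genomics Inc.</td></tr>
<tr><th>Price</th><td>12.34</td></tr>
<tr><th>Mkt cap</th><td>1.2B</td></tr>
<tr><th>P/E</th><td>15.6</td></tr>
<tr><th>Volume</th><td>100</td></tr>
</table>
"""
print(extract_profile(SAMPLE))
```

If the provider renames a label or moves the table, only `FIELDS` (or the small
loop in `extract_profile`) should need touching; the row collector stays as is.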

