
I'd like Caltrain to publish raw train data - britta
http://www.stackallocated.com/caltrain-scraper/
======
tatsiana
We've been working on the solution to this issue since our office is
overlooking the tracks. You can read more here:
[http://svds.com/post/listening-caltrain](http://svds.com/post/listening-caltrain)
and here:
[http://svds.com/post/railroad-modeling-hadoop-scale-hadoop-s...](http://svds.com/post/railroad-modeling-hadoop-scale-hadoop-summit-2014-san-jose)

~~~
bduerst
That's pretty cool. I love the idea of scraping real-world information.

I did something similar using a GoPro and computer vision when I lived next to
one of the 101 off-ramps in SF. I got it to work for most daylight hours
(headlights screwed it up hardcore) before our landlord raised our rent by
$1000/mo and we moved.

I figured it could have been a way to calculate ad impressions for billboards,
but I also figured Clear Channel probably already knows those numbers.

~~~
winslow
I wouldn't be so sure that Clear Channel knows those numbers. You'd be
surprised how little data big companies know. Do you have a github or blog
post on your experiment?

~~~
bduerst
Sadly I don't, and it was a self-project I was doing to learn OCR (to scrape
price tags in grocery stores) and thus I never backed it up on github.

Do you have industry experience with billboard advertising? I might rebuild it
if there is an actual demand for it.

~~~
winslow
Unfortunately I do not have experience in billboard advertising. However, I do
have experience in massive companies (they suck) as a Software Engineer and
I've realized how little they actually know about their own products and the
data associated with them. You could probably contact Clear Channel trying to
advertise with them and just ask for simple data like expected eyeballs/views
and population etc. If they have absolutely no idea then you have your answer
that there might be a market for this. Your market might not even be the
billboard company but rather someone trying to advertise. If I were to
advertise I would want some data behind the advertising platform along with
some way to track its effectiveness. I assume you are the same @bduerst?

------
ZanyProgrammer
Heh, I saw this on Twitter and responded to the author earlier. I'm working on
a data mining project now with public transit times, comparing arrivals vs
scheduled times. Since I live in the Bay Area, it made sense to use local
data. However, 511.org, the repository (it seems) for all Bay Area transit
APIs, doesn't publish any specific vehicle/route number, or what the actual
scheduled time is for an arrival at a stop (though MUNI used to have a nextbus
API that was really nicely detailed; I can't find any public hosting of it
anymore though).

My solution, since I didn't want to do any screen scraping or make trying to
identify individual buses/trains a project in and of itself, was to use
Portland's TriMet API. _That_ API actually returns specific route numbers, and
estimated and scheduled times for each stop (interpolated in the case of
non-time points). I'm originally from the Portland Area, so I'm pretty familiar
with the geography and roads.
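
A rough sketch of the arrivals-vs-scheduled comparison, for anyone curious. The record shape here (dicts with `route`, `scheduled`, and `estimated` keys holding epoch milliseconds) is an assumption for illustration, not a guaranteed match for what TriMet or any other agency returns, and the function names are made up:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional


def delay_seconds(scheduled_ms: int, estimated_ms: Optional[int]) -> Optional[float]:
    """Positive means late, negative means early; None if no estimate exists."""
    if estimated_ms is None:
        return None
    return (estimated_ms - scheduled_ms) / 1000.0


def mean_delay_by_route(arrivals: Iterable[dict]) -> Dict[str, float]:
    """Average delay in seconds per route, skipping arrivals with no estimate."""
    by_route = defaultdict(list)
    for arrival in arrivals:
        delay = delay_seconds(arrival["scheduled"], arrival.get("estimated"))
        if delay is not None:
            by_route[arrival["route"]].append(delay)
    return {route: mean(delays) for route, delays in by_route.items()}
```

Arrivals that only carry a scheduled time are dropped rather than counted as on time, so a missing estimate doesn't skew the average.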

From what I remember in the 511.org Google developer group, people have raised
this exact issue, i.e. Caltrain train numbers. The guy responding from the MTA
said they'd try and integrate it in the future, but these posts were like back
in 2012 (IIRC).

~~~
simoncion
If you're still interested in doing Muni data mining, you'll probably be
interested in this:
[http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf](http://www.nextbus.com/xmlFeedDocs/NextBusXMLFeed.pdf)

NextBus is _the_ source of bus position and predicted arrival times for MUNI,
and appears to be the same for many other transit agencies. I can verify that
(as of three minutes ago) it's still returning reasonable data.

However, if you're looking for the SFMTA schedule [0], I don't think you can
get it through the NextBus API. I do know that you can get it through a GTFS
"feed" found here:
[http://sfmta.com/about-sfmta/reports/gtfs-transit-data](http://sfmta.com/about-sfmta/reports/gtfs-transit-data)

Also, you _might_ be interested in this, if you haven't seen it already:
[http://bdon.org/transit/](http://bdon.org/transit/) (SF MUNI transit delays.
[This isn't my work.])

[0] Why would you want MUNI's schedule? It's not like any of the drivers care
about it! ;)

------
rakoo
Author, you should integrate your scraper into
[http://raildar.fr](http://raildar.fr), they've already started to scratch
that kind of itch for a similar problem.

------
deepsun
Side note: instead of buying Burp Suite, check out just pure free Chrome or
Firefox browsers to watch your HTTP traffic -- they both have pretty good
Developer Tools, even IE does. They will show you the returned HTML formatted,
and let you change it.

~~~
lstamour
mitmproxy, Charles Web Proxy and Fiddler have also worked for me. I've never
understood why someone would pay so much more for Burp if they're not going to
use much of it. And for half of the rest, there's plenty of other tools or
scripting languages you could use and save yourself a pile of money. I'd love
to be convinced otherwise, after all I openly admit I haven't used Burp yet...

~~~
hansnielsen
I use Burp (and much of its featureset) every day at work; that's why I used
it here. The free edition does basically everything you could want except for
saving / restoring states (request history, requests you modified, etc). I've
also used Fiddler to great effect in the past, but the fact that Burp is
written in Java makes it really convenient to use when you deal with multiple
OSes.

------
guard-of-terra
"But that’s just the planned schedule"

Why won't it match the real schedule?

~~~
bowenli
Caltrain is often behind schedule. Trains break down or hit cars. It's a
huge pain for daily Caltrain commuters. See:
[https://twitter.com/Caltrainstatus](https://twitter.com/Caltrainstatus)

~~~
ak217
Among the things Caltrain has to contend with (aside from old equipment prone
to breaking) are several dozen at-grade crossings and freight train traffic on
the same line (!)

~~~
bdamm
Fortunately the freight traffic is mostly after commute hours. Imagine what
will happen with those so-called "high-speed" trains coming through!

~~~
guard-of-terra
Aren't you supposed to have separate high-speed track for high-speed trains?

------
bfung
I also had this idea, but I never executed it as I haven't thought of a way to
solve the real vs. estimated times perfectly. Probably can get close w/some
data mining, but not sure if it's worth the effort.

RE: scraping - instead of putting logic in your scraper, just download the
entire section you need and store it in a file. Then parse and shove it into a
database whenever you feel like it. You could rerun the parsing since you'll
have all the historically scraped website data on disk.
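
Something like this minimal sketch, assuming the fetch happens elsewhere and hands over the raw bytes. The function names, the `.html` extension, and the timestamp-as-filename layout are all just illustrative choices:

```python
import time
from pathlib import Path
from typing import Callable, Iterator, Optional, Tuple


def save_snapshot(raw: bytes, directory: Path,
                  fetched_at: Optional[float] = None) -> Path:
    """Write one raw response to disk, named by its fetch time in epoch ms."""
    directory.mkdir(parents=True, exist_ok=True)
    ts = time.time() if fetched_at is None else fetched_at
    path = directory / f"{int(ts * 1000)}.html"
    path.write_bytes(raw)
    return path


def parse_all(directory: Path,
              parser: Callable[[bytes], object]) -> Iterator[Tuple[int, object]]:
    """Re-run `parser` over every stored snapshot, oldest first."""
    snapshots = sorted(directory.glob("*.html"), key=lambda p: int(p.stem))
    for path in snapshots:
        yield int(path.stem), parser(path.read_bytes())
```

If the parser turns out to be wrong, fix it and call `parse_all` again; the stored snapshots are never modified.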

------
ZanyProgrammer
It'd be neat if they published positional data. I know the old Nextbus public
API for MUNI did that, and it was cool making maps of real time positions of
vehicles. I'm sure the excuse now is security BS.

------
tzm
I'd like Caltrain to accept mobile payments.

~~~
enos_feedler
Use a clipper card? What is the pain?

~~~
tzm
Yes, I have a few clipper cards tied to travel bank accounts. Unfortunately,
Clipper cannot be integrated into third-party vendors / apps and is prone to
24 hour account locks if transactions are declined. Adding money ad-hoc is
troublesome as well: use a POTS terminal, go to an approved retailer
(Walgreens, etc.), or go online ('available within 3-5 days').

Their commerce system is not mobile friendly and is a pain for mobile users.
It could be much more efficient.

