
Crunching subway data – a New Yorker’s busiest stations - danso
http://blogging.alastair.is/crunching-subway-data-a-new-yorkers-busiest-stations/
======
ibejoeb
I noticed some funny stuff about the data, but I haven't tracked down the
source. I'd expect the entry/exit counts to increase over the course of the
day, and they generally do. Sometimes, though, the latest record doesn't have
the greatest number. I suspect it either odd cutoff times, i.e., not aligned
to the day.

If anyone wants to play with the data, here's some stuff to start with
PostgreSQL:

    
    
      create table stats (
          ca varchar(10),
          unit varchar(10),
          scp varchar(10),
          dt timestamp,
          "desc" varchar(20),
          entries integer,
          exits integer
      );
    
      copy stats from '<path>/output.csv' delimiter ',' csv header;
    

Here's query that will show the entire set of exit counts align with both
their greatest and latest values:

    
    
      select
          unit, scp, dt, exits,
          max(exits) over (
              partition by unit, scp, date_trunc('day', dt)
              rows between unbounded preceding and unbounded following
          ) as largest_exits,
          last_value(exits) over (
              partition by unit, scp, date_trunc('day', dt)
              order by dt 
              rows between unbounded preceding and unbounded following
          ) as latest_exits
      from stats
      order by 1, 2, 3;
    

If you want to see the discrepancies I described above, just wrap it up and
find where latest <> greatest:

    
    
      with x as (
      select
          unit, scp, dt, exits,
          max(exits) over (
              partition by unit, scp, date_trunc('day', dt)
              rows between unbounded preceding and unbounded following
          ) as largest_exits,
          last_value(exits) over (
              partition by unit, scp, date_trunc('day', dt)
              order by dt 
              rows between unbounded preceding and unbounded following
          ) as latest_exits
      from stats
      )
      select *
      from x
      where largest_exits <> latest_exits
      order by unit, scp, dt;

------
chimeracoder
I'm a bit surprised to see Penn Station below Grand Central.

Penn Station is the most trafficked train station in North America[0], which I
would imagine would lead to more subway entrances/exits, especially during
rush hours.

Also, Penn Station has the A,C,E, 1, 2, and 3. Grand Central only has the
4/5/6[1]. The 4/5/6 are the only lines on the east side and are therefore
fairly busy, but I find this surprising nonetheless.

[0] This includes non-subway trains, [1] Don't even get me started about the T
(ie, the Second Ave. Line)! :)

~~~
bradleyjg
Not every commuter takes the subway to and from Penn Station. The surrounding
area has a lot of offices, so many people take light rail there and then just
walk to and from work.

~~~
natesm
NJ Transit and the LIRR are heavy rail (as are the subway and the PATH train).

------
dcalacci
I didn't know this data was available! I wish we had the same breadth
available through the MBTA. I'd be interested in using this data to:

\- plan travel directions based on past congestion patterns

\- pair this with any data found for NY's bus system or taxi system to map out
hubs vs destination stations

\- examine stations' frequency of repair and see if congestion at a station
correlates with the frequency of repair; try to predict dates of repair

very cool!

