
Defining Transit Service Areas with Unsupervised Machine Learning - mikez302
https://towardsdatascience.com/how-does-the-bay-area-commute-22f45e00419e
======
steve_gh
Interesting application of clustering.

My group in the UK (Amey Strategic Consulting:
https://www.amey.co.uk/amey-consulting/services/strategic-consulting/) have
been doing very similar things for a while now to optimise utilities
maintenance, for example optimising the number and position of maintenance
depots so that a utilities company can carry out repairs most efficiently.

I'm UK based, and not that familiar with the Bay Area - but I guess there are
a whole bunch of follow-up questions. Most pertinently: what would this
change? Suppose (for example) you created a North Bay Transit Authority: how
would a common ticketing policy affect commuting patterns? Is the issue the
cost and inconvenience of multiple tickets, or the disjoint nature of current
services?

~~~
thanatropism
Are you aware of the classic four-stage traffic assignment model? It's the
linear regression of transportation modeling. It's scary that the authors
didn't even mention that it exists.

https://en.wikipedia.org/wiki/Transportation_forecasting#Four-step_models

My first job was to estimate trip generation models (the first tier) using
simultaneous-equations time-series models in EViews; a transportation engineer
with some experience doing surveys (i.e. coordinating teams of people with
clipboards) worked on the mode choice. We also had some greyheads who had the
MATLAB code for trip distribution and route assignment.

Realistically some machine learning type classification could really help in
mode choice, particularly if raw data (like turnstile pushes) is available.
Trip generation, like most microeconometrics with time-series, is something of
a dark art.

Ideally we would all be doing agent-based simulation by now with super-
disaggregate data like Waze has, but I haven't seen complex systems simulation
really "arrive" for real problems. It'd be fun to train reinforcement learners
on them too :)

------
LeanderK
In Germany, there are a lot of "Verkehrsverbünde" (transit associations) and
they work very well. For example, in Munich there's the MVV (Münchner
Verkehrsverbund), but the S-Bahn (similar to BART) is operated by Deutsche
Bahn and the underground, trams and buses are operated by the MVG (the
municipal transit company). There are, of course, problems in most of the
public-transportation networks in Germany, but I don't think the
Verkehrsverbünde are to blame. I think it's a pattern worth copying.

It even allows for some competition because you separate the provider from the
system. For example, in Nuremberg the contract to operate the S-Bahn for the
next few years was recently auctioned off to another company after frustration
with Deutsche Bahn.

------
n4r9
This is cool. I'm working on something similar to divide up areas for
municipal waste collection based on travel times between properties.

> for this problem, we are defining the distance matrix directly from the
> source data. This results in the unusual property that the feature space
> distance from A to B is likely to be different from the feature space
> distance from B to A, as more commuters will commute in one direction than
> the other. For this project, we made the decision to examine origin to
> destination commute flows only, as this resulted in the clusters that were
> clearly defined in both feature space and real space, while the inverse
> resulted in clusters that were significantly overlapping in real space.

I had a similar problem, since the travel time from A to B can differ
dramatically from that from B to A. I experimented with a few different ways
of symmetrising the matrix and found that taking the maximum of both values
was a pretty good compromise.
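
To make the "maximum of both values" idea concrete, here is a minimal sketch
in plain Python. The function name `symmetrise_max` and the travel times are
made up for illustration; they are not from my actual project.

```python
def symmetrise_max(matrix):
    """Return a symmetric copy where entry (i, j) is max(d[i][j], d[j][i])."""
    n = len(matrix)
    return [[max(matrix[i][j], matrix[j][i]) for j in range(n)]
            for i in range(n)]


# Hypothetical travel times in minutes between three properties.
# Row = origin, column = destination; note A->B differs from B->A.
travel_minutes = [
    [0, 12, 30],
    [20, 0, 9],
    [25, 15, 0],
]

symmetric = symmetrise_max(travel_minutes)
for row in symmetric:
    print(row)
```

Taking the maximum is the pessimistic choice: two properties only count as
close if the trip is short in both directions. Using the mean or the minimum
instead is a one-word change inside the comprehension.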

I also found that hierarchical clustering didn't work as well as
K-means/medoids when the "clusters" were not necessarily very well-defined and
the number of data-points was in the thousands.
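
For anyone who hasn't used K-medoids on a precomputed distance matrix, a
rough sketch of the simple alternating variant follows. This is an
illustration only, not the code I ran: the function names, the naive "first k
points" initialisation and the toy matrix are all assumptions, and a real job
with thousands of points would want a proper library and a better
initialisation.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternate between assigning points and re-picking medoids."""
    n = len(dist)
    medoids = list(range(k))  # assumption: naive initialisation
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties out
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy symmetric distance matrix: points 0-2 and 3-5 form two tight groups.
dist = [
    [0, 2, 3, 20, 20, 20],
    [2, 0, 2, 20, 20, 20],
    [3, 2, 0, 20, 20, 20],
    [20, 20, 20, 0, 2, 3],
    [20, 20, 20, 2, 0, 2],
    [20, 20, 20, 3, 2, 0],
]

medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Because it only ever looks up entries of `dist`, this works with travel times
directly and never needs coordinates, which is the main reason to prefer
medoids over plain K-means here.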

~~~
steve_gh
@n4r9. Where in the world are you working? I would be interested in getting in
touch. You can reach me at stephen dot gooberman hyphen hill at amey dot co
dot uk

------
jonathankoren
This article was very underwhelming. It’s viz porn dressed up as something
insightful. Looking at his clusters, he managed to “find” the transit systems.
That’s not particularly interesting. It’s even less interesting because at the
start he talks about how you can’t combine transit districts, and then he
clusters to find the transit districts and then tries to say “look at all the
transit districts I combined”, when in reality the ones he’s combining are
tiny and inconsequential.

If he wanted to do something, he should demystify Clipper card purchases.
Every agency has its own bespoke fare system, and it's complete bullshit if
you transfer between systems. THAT would have been interesting, especially
given how these things interact with agency funding.

------
tixocloud
Apologies for playing devil's advocate, but in areas where there are few
riders, would it not be a reflection of poor transit service in the first
place? I would assume that if the service is bad in those areas, ridership
would be poor, but you'd still need to check whether the residents there are
using alternative modes of transportation, rather than relying on commute
data alone.

------
AlexTWithBeard
Absolutely amazing research!

But let's now imagine the transit systems of the SF area have been reshaped
according to the article. It will surely trigger a change in commute patterns.
Will this change be sufficient to significantly affect the clustering? Will
the system stabilize eventually?

------
pjc50
Isn't this in fact k-means clustering rather than "machine learning"?

~~~
n4r9
Looks like he used hierarchical clustering (probably single-linkage) rather
than K-means.
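
For contrast with K-means, here is a naive sketch of what single-linkage
agglomerative clustering does. It is only my guess at the family of method
used, not the article's actual code, and the function name and 1-D toy data
are made up.

```python
def single_linkage(dist, k):
    """Merge the two closest clusters until only k clusters remain.

    The gap between two clusters is the smallest distance between any
    pair of points drawn one from each (the single-linkage rule).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = min(dist[i][j]
                          for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical points on a line; distance is the absolute difference.
positions = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]

clusters = single_linkage(dist, k=2)
print(clusters)
```

There are no centroids and no fixed number of clusters baked into the merge
rule; you just cut the merge sequence wherever you like, which is how it
differs from K-means.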

But yes, I'm also not too keen on using a clustering algorithm and calling it
machine learning.

------
dcbadacd
This does not take into account potential riders, am I correct? This seems to
be a common pitfall when planning new routes or systems.

