My group in the UK (Amey Strategic Consulting: https://www.amey.co.uk/amey-consulting/services/strategic-co... ) has been doing very similar things for a while now to optimise utilities maintenance — for example, optimising the number and position of maintenance depots so that a utilities company can undertake repairs most efficiently.
I'm UK-based and not that familiar with the Bay Area, but I guess there are a whole bunch of follow-up questions. Most pertinently: what would this change? Suppose (for example) you created a North Bay Transit Authority: how would a common ticketing policy affect commuting patterns? Is the issue the cost and inconvenience of multiple tickets, or the disjoint nature of the current services?
My first job was to estimate trip generation models (the first tier) using simultaneous-equations time-series models in EViews; a transportation engineer with some experience doing surveys (i.e. coordinating teams of people with clipboards) worked on the mode choice. We also had some greyheads who had the Matlab code for trip distribution and route assignment.
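For anyone unfamiliar with that tier structure: the trip-distribution step is classically a gravity model. A minimal numpy sketch, with made-up zone values rather than anything from the Matlab code mentioned above:

```python
import numpy as np

def gravity_distribution(productions, attractions, cost, beta=0.1):
    """Singly-constrained gravity model: split each origin zone's trips
    across destinations in proportion to attraction times an exponential
    cost-deterrence term. Illustrative only."""
    weights = attractions * np.exp(-beta * cost)   # A_j * f(c_ij)
    weights /= weights.sum(axis=1, keepdims=True)  # normalise per origin
    return productions[:, None] * weights          # trip matrix T_ij

productions = np.array([100.0, 200.0])   # trips produced per zone (hypothetical)
attractions = np.array([1.0, 3.0])       # relative attractiveness per zone
cost = np.array([[1.0, 5.0],
                 [5.0, 1.0]])            # generalised travel cost between zones
T = gravity_distribution(productions, attractions, cost)
# each row of T sums to that zone's production total
```

The singly-constrained form only guarantees the origin-side totals; real implementations typically iterate to balance both margins.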
Realistically, some machine-learning-style classification could really help with mode choice, particularly if raw data (like turnstile pushes) is available. Trip generation, like most microeconometrics with time series, is something of a dark art.
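To make that concrete, here's a toy mode-choice classifier: a plain logistic regression fit by gradient descent on two hypothetical features (door-to-door times), with synthetic labels. Real mode-choice work would use discrete-choice models with many more covariates; this is just a sketch of the classification framing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# hypothetical features: door-to-door minutes by car and by transit
car_time = rng.uniform(10, 60, n)
transit_time = rng.uniform(10, 60, n)
# synthetic labels: commuters mostly pick the faster mode (1 = transit)
y = (transit_time + rng.normal(0.0, 5.0, n) < car_time).astype(float)

# logistic regression fit by gradient descent on standardised inputs
F = np.column_stack([car_time, transit_time])
F = (F - F.mean(axis=0)) / F.std(axis=0)
X = np.column_stack([np.ones(n), F])   # intercept + features
w = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n       # gradient of the log-loss

p = 1.0 / (1.0 + np.exp(-X @ w))
accuracy = np.mean((p > 0.5) == (y > 0.5))
```

On this synthetic data the fitted coefficients pick up the obvious structure: longer car times push commuters toward transit and vice versa.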
Ideally we would all be doing agent-based simulation by now with super-disaggregate data like Waze has, but I haven't seen complex systems simulation really "arrive" for real problems. It'd be fun to train reinforcement learners on them too :)
It even allows for some competition, because you separate the provider from the system. For example, in Nuremberg the contract to run the S-Bahn for the next few years was recently auctioned off to another company after frustration with Deutsche Bahn.
> for this problem, we are defining the distance matrix directly from the source data. This results in the unusual property that the feature space distance from A to B is likely to be different from the feature space distance from B to A, as more commuters will commute in one direction than the other. For this project, we made the decision to examine origin to destination commute flows only, as this resulted in the clusters that were clearly defined in both feature space and real space, while the inverse resulted in clusters that were significantly overlapping in real space.
I had a similar problem, since the travel time from A to B can differ dramatically from that from B to A. I experimented with a few different ways of symmetrising the matrix and found that taking the maximum of the two values was a pretty good compromise.
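The max-of-both-directions symmetrisation is a one-liner in numpy. Toy travel-time matrix for illustration:

```python
import numpy as np

# directional travel-time matrix (minutes); entry [i, j] = time from i to j
D = np.array([[ 0.0, 10.0, 30.0],
              [25.0,  0.0, 12.0],
              [30.0,  8.0,  0.0]])

# symmetrise by taking the worse (longer) direction for each pair
S = np.maximum(D, D.T)
# e.g. S[0, 1] == S[1, 0] == 25.0
```

Taking the max is conservative: a pair of points only counts as "close" if travel is fast in both directions.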
I also found that hierarchical clustering didn't work as well as k-means/k-medoids when the "clusters" were not necessarily very well defined and the number of data points was in the thousands.
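For reference, k-medoids is attractive here precisely because it works directly on a precomputed distance matrix. A toy PAM-style sketch (not a production implementation) under the assumption the matrix has already been symmetrised:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Toy k-medoids on a precomputed symmetric distance matrix: alternate
    between assigning points to their nearest medoid and moving each medoid
    to the member minimising total within-cluster distance."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break                       # converged
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1)

# two well-separated groups on a line, distances precomputed
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(points[:, None] - points[None, :])
labels = k_medoids(D, 2)
```

Like k-means, this only finds a local optimum, so in practice you'd restart from several random initialisations.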
If he wanted to do something, he should demystify Clipper card purchases. Every agency has its own bespoke fare system, and it's complete bullshit if you transfer between systems. THAT would have been interesting, especially given how these things interact with agency funding.
But let's now imagine the transit systems of the SF area have been reshaped according to the article. It will surely trigger a change in commute patterns. Will this change be sufficient to significantly affect the clustering? Will the system eventually stabilise?
But yes, I'm also not too keen on using a clustering algorithm and calling it machine learning.
There's a general tendency to reclassify any kind of statistical analysis that produces a useful result as "machine learning", and I don't think it helps. For example there are lots of machine vision techniques; the neural net ones can sensibly be called "learning" but things like edge detection shouldn't be.
(also, just because something is taught in course X doesn't mean that it is X, it might be an embedded prerequisite)