
Universal Method to Sort Complex Information Found - digital55
https://www.quantamagazine.org/universal-method-to-sort-complex-information-found-20180813/
======
macleginn
The subtitle claims that a “universal way” was found to solve the
nearest-neighbour-search problem for any kind of data, but the result is
actually restricted to the (rather huge, of course) set of normed spaces, i.e.
vector spaces whose distance is induced by a norm and therefore obeys the
triangle inequality.

~~~
sshine
Incidentally, the zone system for the Danish public transit system is not a
metric space. (The ticket you need from A to B is not necessarily the same as
from B to A.)

That's the kind of practical graph for which you want efficient shortest-path
algorithms.

~~~
gumby
Really?? Is that true only when there are multiple paths between A & B or is
there something else involved?

~~~
sshine
For buses, the B stop on the way out may be in a different zone from the B
stop on the way back because they're geographically displaced. For buses and
trains, depending on your type of ticket (2-8 zones, 8+ zones, commuter card)
and whether the zone boundary is on or between stops/stations, you may have to
pay for that zone one way but not the other. Those are edge cases.

But the most confusing part is the ring topology (here shown with the zone-
ring center being the actual city center):

[https://passagerpulsen.taenk.dk/sites/default/files/styles/f...](https://passagerpulsen.taenk.dk/sites/default/files/styles/full_width/public/zonekort_movia_hjside830x400.jpg)

When buying 2-8 zones you don't pay for individual zones, but rather rings
from your origin zone. Going one way, the zone rings look one way. Going back
they look different, because the origin zone is different.

A practical example of this peculiarity:

You go from Hellerup (zone 2) to Friheden (zone 33) via the central station
(zone 1). Your ring 0 is zone 2, and zone 1 and 33 are both in ring 1, so you
only pay for two zone(-ring)s:
[https://dinoffentligetransport.dk/media/1413/svanemollen-fri...](https://dinoffentligetransport.dk/media/1413/svanemollen-friheden.jpg)

You go back from Friheden to Hellerup via the central station: Your ring 0 is
zone 33, zone 2 is in ring 1, but zone 1 is in ring 2 this time, so you pay
for three zone(-ring)s:
[https://dinoffentligetransport.dk/media/1411/friheden-svanem...](https://dinoffentligetransport.dk/media/1411/friheden-svanemollen.jpg)

Your origin zone on the way out was adjacent to both other zones needed
(needing only 1 ring), but your origin zone on the way back was not (needing 2
rings).

So they're not just different prices: The graph that spans train stations via
zones in the ring topology is not metric. So getting a return ticket is not
the inverse of getting the ticket out. And while you have a ticket that's
valid for the timespan of your return, it may not be valid for your return
path.
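The asymmetry can be reproduced with a toy model. This is only a sketch: the
adjacency table below is an assumption that covers just the three zones named
above (zone 2 borders both zone 1 and zone 33, while zones 1 and 33 do not
touch), and `rings_from` / `zones_to_pay` are made-up helper names, not
anything from the official guidelines.

```python
from collections import deque

# Assumed adjacency for the three zones in the example only.
ADJACENT = {1: {2}, 2: {1, 33}, 33: {2}}


def rings_from(origin):
    """Breadth-first search: ring number of every zone, counted from origin."""
    ring = {origin: 0}
    queue = deque([origin])
    while queue:
        zone = queue.popleft()
        for neighbour in ADJACENT[zone]:
            if neighbour not in ring:
                ring[neighbour] = ring[zone] + 1
                queue.append(neighbour)
    return ring


def zones_to_pay(path):
    """Ticket size: the outermost ring the trip touches, plus ring 0 itself."""
    ring = rings_from(path[0])
    return 1 + max(ring[zone] for zone in path)


out = zones_to_pay([2, 1, 33])   # Hellerup -> central station -> Friheden
back = zones_to_pay([33, 1, 2])  # Friheden -> central station -> Hellerup
print(out, back)  # prints: 2 3
```

Same stations, same route reversed, but a different price, so this "distance"
is not symmetric and therefore cannot be a metric.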

~~~
ZeikJT
What the... that was painful to read and comprehend. Was this kind of
complexity really necessary? Did someone get a promotion for coming up with
this?

~~~
sshine
I copied and translated a quite limited subset of the official traveller
guidelines.

There are several reasons why it ended up like this, and I don't know all of
them.

But two reasons I can think of:

1) With the introduction of RFID-based commuter cards, the zone system was
revised for traditional ticket types as well. It appears convenient that a
price can always be determined by knowing the check-in point, the check-out
point and the shortest path in the graph. Unfortunately, with the old ticket
types, you have to perform that calculation yourself.

2) It is the odd shape of zone 2 that causes some tickets to not work on the
way back, even though the trip goes in a straight line. But for a large part
of all trips, you only need zone 1 and 2, or zone 2 and zone (30, 31, 32, 33).
Only the ones that start in zone 2, go through zone 1, go back into zone 2 and
farther into another zone (or vice versa) have the anomaly. (Trips that
actually bend can have this too, but it is slightly more intuitive.)

------
hinkley
> For example, “Manhattan” distance forces you to make 90-degree turns, as if
> you were walking on a street grid. Using Manhattan distance, a point 5 miles
> away as the crow flies might require you to go across town for 3 miles and
> then uptown another 4 miles.

It's interesting that this is called Manhattan distance because it's only
relevant in a town where everyone jaywalks... Like Manhattan. It's far from
true anywhere where jaywalking is frowned upon.

Because of (the lack of) crosswalk synchronization, it's a lot faster to walk
to a place 2 blocks over and 2 blocks up than it is to walk to a place 4
blocks in one direction. Because at the first two lights you have the option
of crossing in either direction, at which point you may only have to wait a
few moments before crossing in the other direction.
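For reference, the quoted 3-and-4-mile example works out as below. This is a
minimal sketch of the two distance functions for 2D points; the coordinates
are simply chosen to match the numbers in the quote.

```python
import math


def euclidean(p, q):
    """Straight-line ("as the crow flies") distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def manhattan(p, q):
    """Street-grid distance: sum of the absolute coordinate differences."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


start, end = (0, 0), (3, 4)    # 3 miles across town, 4 miles uptown
print(euclidean(start, end))   # 5.0
print(manhattan(start, end))   # 7
```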

~~~
SomewhatLikely
The measure is concerned only with the distance traveled, not the time.

~~~
Y_Y
That's quite a narrow view. You can certainly model time in a metric space,
e.g. if you go one block west and one New York minute forward.

------
danharaj
Ah that's so cool. I've been reading about expander graphs lately. Their
interesting properties make them pertinent to lots of questions. In
particular, you can use expander graphs to cook up error correcting codes and
prove the PCP theorem. There's a class of expander graphs called Ramanujan
graphs which are characterized by an analogue to the Riemann hypothesis
involving a zeta function that counts the prime cycles in a graph.

------
EmilStenstrom
I really hope this leads to databases with good support for fast nearest
neighbour search. This would be especially useful in word2vec cases, where you
have millions of word vectors, and want to find "words similar to this one"
without either having all of them in memory, or going through all of them to
find out.
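To make the baseline concrete, here is a sketch of the "go through all of
them" approach that an index would have to beat. The vectors are made-up
3-dimensional toys rather than real word2vec output, and `most_similar` is a
hypothetical helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_word, vectors, k=1):
    """Brute force: score every other word, then keep the k best."""
    query = vectors[query_word]
    scored = [
        (cosine_similarity(query, vec), word)
        for word, vec in vectors.items()
        if word != query_word
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Toy "embeddings", invented for illustration.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.9],
}

print(most_similar("king", TOY_VECTORS))  # ['queen']
```

Every query touches every stored vector, so the cost grows linearly with the
vocabulary; that is the part a nearest-neighbour index is meant to avoid.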

------
fenollp
I wonder what this can mean for fuzzing, optimization, learning or any kind of
task that has to do with tip-toeing into potentially high dimensional spaces?

~~~
fizx
There are a lot of places in practical neural nets with attention where you
want softmax(queryvector · memorymatrix), where memory can be quite large. If
you have a decent ANN implementation, you can approximate by only calculating
the dot product for the vectors of memory that neighbor the query.

There are currently a ton of mediocre ways to do this because nothing really
works very well in high dimensions, and calculating this can easily be the
bottleneck in training and evaluation.
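A rough sketch of that approximation, in plain Python. The neighbour lookup
here is a brute-force stand-in for an ANN index (a real index would return the
ids without scoring every row), and all names are invented for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exact_attention(query, memory):
    """Attention weights over every row of memory."""
    return softmax([dot(query, row) for row in memory])


def neighbour_attention(query, memory, neighbour_ids):
    """Softmax restricted to the given rows; all other rows get weight 0."""
    local = softmax([dot(query, memory[i]) for i in neighbour_ids])
    weights = [0.0] * len(memory)
    for i, w in zip(neighbour_ids, local):
        weights[i] = w
    return weights


def brute_force_neighbours(query, memory, k):
    """Stand-in for an ANN index: ids of the k rows with the largest dot product."""
    order = sorted(range(len(memory)), key=lambda i: dot(query, memory[i]), reverse=True)
    return order[:k]


memory = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [4.0, 6.0]]
query = [1.0, 1.0]
ids = brute_force_neighbours(query, memory, k=2)
print(exact_attention(query, memory))
print(neighbour_attention(query, memory, ids))
```

The approximation is good when the softmax mass is concentrated on a few rows,
as it is here; when the scores are nearly uniform, dropping rows loses more.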

------
QML
I'm curious whether this result will extend to the k-nearest neighbors (k-NN)
algorithms.

Two problems that have to do with k-NN are

1\. It's a non-parametric method: the number of parameters grows linearly with
the size of the training set, since the distance function must be calculated
between the test point and every training point.

2\. The curse of dimensionality: distance metrics like the Euclidean distance
do not perform well in higher dimensions; points which seem "close" in 2D may
be far apart in 3D, 4D, etc. As a result, we would need exponentially more
training data for every additional dimension. Locality-sensitive hashing
tries to combat this by reducing the dimensionality of the data.
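As an illustration of the hashing idea, here is a sketch of one common LSH
family, random-hyperplane hashing for cosine similarity. It is not the method
from the article; the function names and parameters are made up for this
example.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vector)) >= 0)
        for plane in hyperplanes
    )


def build_index(points, hyperplanes):
    """Group point ids by signature; a query only scans its own bucket."""
    buckets = {}
    for point_id, point in enumerate(points):
        buckets.setdefault(signature(point, hyperplanes), []).append(point_id)
    return buckets


planes = make_hyperplanes(n_bits=8, dim=4)
points = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [-1.0, -2.0, -3.0, -4.0]]
index = build_index(points, planes)
print(index)
```

Each 4-dimensional point is reduced to an 8-bit signature. Vectors separated
by a small angle agree on most bits, so candidates can be found by bucket
lookup instead of a full scan, at the cost of occasionally missing a true
neighbour.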

~~~
mlthoughts2018
I’m curious if it will extend to k q-flats, or other notions of points being
near each other purely by being near to the same subspace, rather than
pointwise nearness.

------
fwilliams
Link (from the article) to the paper with details:
[https://www.ilyaraz.org/static/papers/spectral_gap.pdf](https://www.ilyaraz.org/static/papers/spectral_gap.pdf)

~~~
DoctorOetker
The article announces the algorithm, which isn't published yet. This first
paper contains the proof that the result is possible, not yet the efficient
algorithm.

~~~
fizx
One assumes it's an extension of their previous work
([https://arxiv.org/pdf/1501.01062.pdf](https://arxiv.org/pdf/1501.01062.pdf)),
which was only valid for Euclidean and Hamming spaces?

Edit: The author says it's
[https://ilyaraz.org/static/papers/daher.pdf](https://ilyaraz.org/static/papers/daher.pdf),
but he got marked dead by HN.

~~~
novia
I vouched for the author's comment. Any clue why people are
downvoting/flagging it?

~~~
yorwba
New accounts posting comments with links are killed by the spam filter before
anyone even gets the chance to downvote/flag.

------
crb002
Curious to see Timothy Chan develop some screaming fast C code using their
method.

------
usgroup
Click bait. Not universal. Nearest neighbour algo that works well for a family
of distance measures.

