Hacker News new | past | comments | ask | show | jobs | submit login
Universal Method to Sort Complex Information Found (quantamagazine.org)
275 points by digital55 on Aug 13, 2018 | hide | past | web | favorite | 62 comments



The subtitle claims that a “universal way” was found to solve the nearest-neighbour-search problem for any kind of data, but actually the result is restricted to the (rather huge, of course) set of normed spaces, i.e. spaces whose distance measures obey the triangle inequality.


Incidentally, the zone system for the Danish public transit system is not a metric space. (The ticket you need from A to B is not necessarily the same as from B to A.)

That's a kind of practical graph where you want efficient shortest path algorithms for.


The rail network in the UK is also strange. On some routes it is cheaper to split your journey across multiple tickets. On others it is cheaper to buy a ticket for a longer journey. However, this is illegal and the train companies do attempt prosecutions.



To be clearer, splitting tickets is legal, and while the second ticket may become invalid if the first train runs late, train companies now honour the second ticket if the reason you miss your train is because the incoming train has run late.

However, what can be in breach of the terms and conditions is to buy a ticket for a longer journey than desired and then get off at an intermediate station. If you read the comment to which you're replying, it's the "overlength ticket" that he is saying can be illegal.


Fair enough for split ticketing, but they prosecuted someone last year for overlong ticketing. It only failed because they got the facts (as opposed to the law) wrong.


Why is it illegal and how do they enforce it?

I know that something similar happens with airplane tickets but they get you with the luggage and with terms & conditions AFAIK.

How is it similar/different with train tickets?


It is illegal because the railway byelaws make it illegal. The railway doesn't just have its own law, it also has its own police force, The British Transport Police, though as far as I know they don't do the ticket enforcement.

You are not allowed to break a journey either, so a ticket inspector can spot a ticket for a longer journey at one end of the other. The multiple tickets ruse is more of a grey area.


BREAK OF JOURNEY

Rules vary by train company and route. In general you cannot break your outward journey (except for necessary changes of train) but you can break your return journey.

Source: https://www.splityourticket.co.uk/info/fare-types.aspx

I've already dealt with the legality of split ticketing in another post.


The National Conditions of Carriage actually permit breaking an outbound journey, provided the ticket remains valid for the continuation. Outbound tickets are usually only valid on the day, so in that case you can't continue the journey the next day. But a regular standard (or first-class) ticket on a train that's not the last train of the day can be broken on the outbound journey.

For most tickets.

Some tickets, such as Advance and some Super-Off-Peak, have additional conditions. However, I do have a letter from a train company saying that they were wrong to have charged for an extra ticket, and wrong to have tried to start proceedings.

So breaking an outbound journey can be legal, and in such cases buying an over-length ticket is possible. It's circumstances like these that led in part to some of the specifics of the recent shake-up of the fare structures.


Huh. I wonder if it would be cheaper to do a ticket like

start -> actual destination (stay on train) -> overshoot destination (get off) (get back on) -> (turn around; begin return trip) -> actual destination (get off) (get back on) -> start

?

Seems utterly unintuitive, and it probably is, but I am curious.


That can be possible, and I have done it.

Part of the reason for all this is that the train system has been broken up into a collection of franchises. Some companies will then incentivise some journeys, and so the fare doesn't always reflect the time or distance of the journey being made. As a result you can find and exploit anomalies, but only if you're willing to spend the (sometimes considerable) time and effort.


And how do they enforce it? Chain you to the train?


No, they fine at exit.


Really?? Is that true only when there are multiple paths between A & B or is there something else involved?


For busses, the B stop on the way out may be in another zone as the B stop on the way back because they're geographically displaced. For busses and trains, depending on your type of ticket (2-8 zones, 8+ zones, commuter card) and whether the zone intersection is on or between stops/stations, you may have to pay for that zone one way but not the other. Those are edge cases.

But the most confusing part is the ring topology (here shown with the zone-ring center being the actual city center):

https://passagerpulsen.taenk.dk/sites/default/files/styles/f...

When buying 2-8 zones you don't pay for individual zones, but rather rings from your origin zone. Going one way, the zone rings look one way. Going back they look different, because the origin zone is different.

A practical example of this peculiarity:

You go from Hellerup (zone 2) to Friheden (zone 33) via the central station (zone 1). Your ring 0 is zone 2, and zone 1 and 33 are both in ring 1, so you only pay for two zone(-ring)s: https://dinoffentligetransport.dk/media/1413/svanemollen-fri...

You go back from Friheden to Hellerup via the central station: Your ring 0 is zone 33, zone 2 is in ring 1, but zone 1 is in ring 2 this time, so you pay for three zone(-ring)s: https://dinoffentligetransport.dk/media/1411/friheden-svanem...

Your origin zone on the way out was adjacent to both other zones needed (needing only 1 ring), but your origin zone on the way back was not (needing 2 rings).

So they're not just different prices: The graph that spans train stations via zones in the ring topology is not metric. So getting a return ticket is not the inverse of getting the ticket out. And while you have a ticket that's valid for the timespan of your return, it may not be valid for your return path.


What the... that was painful to read and comprehend. Was this kind of complexity really necessary? Did someone get a promotion for coming up with this?


I copied and translated a quite limited subset of the official traveller guidelines.

There are several reasons why it ended up like this, and I don't know all of them.

But two reasons I can think of:

1) With the introduction of RFID-based commuter cards, the zone system was revised for traditional ticket types as well. It appears convenient that a price can always be determined by knowing the check-in point, the check-out point and the shortest path in the graph. Unfortunately, with the old ticket types, you have to perform the calculation instead.

2) It is the odd shape of zone 2 that causes some tickets to not work on the way back, even though the trip goes in a straight line. But for a large part of all trips, you only need zone 1 and 2, or zone 2 and zone (30, 31, 32, 33). Only the ones that start in zone 2, go through zone 1, go back into zone 2 and farther into another zone (or vice versa) have the anomaly. (Trips that actually bend can have this too, but it is slightly more intuitive.)


Other possibilities: Trains going in different directions don't make the same stops (express vs local), and time-based fares applying to a specific direction based on commuting patterns.

In the US, on Amtrak, a ticket is specific to an exact train, like buying a plane ticket. There's no generalized "I want a ticket from DC to NYC."


Some of the shorter ones have unreserved coach class tickets that aren't restricted to a specific train. So, while there is no generalized, "I want a ticket from DC to NYC", there is a generalized, "I want a ticket from Chicago to Milwaukee."


Another possibility is that there is more demand in one direction.

You might have a major city, A, a center of finance and law and full of corporate headquarters, and a smaller city, B, where many of the companies whose headquarters are in A have their engineering and manufacturing facilities.

Someone making a deal with one of those companies may need to visit both engineering to work out technical details first and then visit headquarters to work out business, financial, and legal details. They will travel from home to B, and then travel to A, and then when done travel directly home from A. This may happen much more than the other direction--people needing to visit headquarters first and then going to visit engineering and manufacturing, without needing to go back to headquarters.

Net result: more passengers needing B to A than A to B. This can result in A to B trips having empty seats. It might even result in some trips with no passengers at all! They can't simply cancel those trips, because they need to get the vehicles to B to handle the B to A traffic, so they sell seats cheap.


Chicago has a minor version of this, even without zone fares: The entry fee to get on a train is $2.50 everywhere except O'Hare Airport, where it's $5.


Interesting, thanks!


The tourist fee.


Nah. Most the people I see on that train are people commuting to/from work, and locals who live along the blue line corridor. Some tourists, but I think most of them are opting to pay a driver 8x as much to take even longer to get them downtown.


They use a zone system, and you don’t have to pay for a zone if you get off at the first stop after entering it.

This creates the asymmetry


thanks


The same goes for airfares.

A round-trip can be cheaper depending on direction of travel.

One-way trip can be significantly cheaper if you are flying through a major hub like FRA and starting your journey from a less wealthy country.


> spaces whose distance measures obey the triangle inequality

That would be a metric space, and a distance obey the triangle inequality by definition. The element of the space are otherwise quite arbitrary.

Unfortunately, Normed spaces are a subset of metric spaces which are much less general (in particular they have to be vector spaces).


You are right, I wantonly omitted some of the important properties. Cf. the response of one of the authors in another comment on how their method tackles general metric spaces.


The method gives a data structure for any metric space, and the quality of the data structure is controlled by how well expanders embed in the metric space. This implies interesting results for metrics which are not norms, like geodesics on certain manifolds. The surprise to us was how well this approach works for any norm, however. (I am one of the authors.)

Also, a math correction: spaces which obey the triangle inequality are called metric spaces. The class of normed spaces is more restricted, but still very large.


I don't think the general property of expander graphs is needed, perhaps a relaxed property, such as the following definition:

A graph G is a distance(k)-expander if #edges(U, V-U) > c min(#U,#(V-U)) where U is a set of diameter <=k. Now study the embeding.

I am also curious about what happens if one use SVM to classify vertices of an expander graph: Let G be an expander graph and let u1,u2 be vertices of G such that d(u1,u2) = diameter of G = an even number 2k. We classify the vertices of G like this: f(u) = 0 if d(u,u1)<=k, f(u)=1 if d(u,u1)>=1, then using SVM applied to the vertices encoded by the rows of the adjancecy matrix. I would expect the SVM algorith to have some special property but I don't know what to expect, the best or worst classification, or the support vectors can gives us some information about symmetries in the graph.


But these more general results are kind of small print, right? The abstracts to both linked articles talk specifically about normed spaces.

(Thanks for the correction. I just gave one of the properties, but should’ve been more precise.)


It's a rather large set, but it's got some noteworthy holes. Cosine distance, for example.


my gut feeling tells me - ln((1+Similarity_cosine(A,B))/2) would satisfy the triangle inequality.


It produces negative distances, which disqualifies it as a metric.


For a Riemannian metric yes, but a lot of physics gets done in Lorentzian (psuedo-)metric spaces.


how does it produce negative distances?

cosine similarity is [1,-1], so add one [2,0], now halve [1,0] now take logarithm [0,-inf] finally negate [0,+inf]

?


Ah, I mistook the - at the beginning for a separator.

It does fail the triangle inequality, though, as the following python snippit demonstrates:

  import numpy as np

  def randomvec():
      return np.random.rand(4)*2-1

  def cossim(a, b):
      return np.dot(a,b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

  def metric(a, b):
      return -np.log((1+cossim(a, b))/2)

  def main():
      for i in range(20):
          a = randomvec()
          b = randomvec()
          c = randomvec()
          a2b = metric(a, b)
          b2c = metric(b, c)
          a2c = metric(a, c)
          print a2b, b2c, a2c
          if a2b + b2c < a2c:
              print '  *** fails triangle inequality'
   
  main()


good catch! upvoted


I guess the best conversion of cosine similiarity to distance would of course be d(A,B) = arcCos(Cosine_similiarity(A,B)), i.e. the angle between the 2 directions, which would have the property that subdividing a geodesic arc and calculating the sum of distances along the subdivision results in the same distance as the start and end point.


> For example, “Manhattan” distance forces you to make 90-degree turns, as if you were walking on a street grid. Using Manhattan distance, a point 5 miles away as the crow flies might require you to go across town for 3 miles and then uptown another 4 miles.

It's interesting that this is called Manhattan distance because it's only relevant in a town where everyone jaywalks... Like Manhattan. It's far from true anywhere where jaywalking is frowned upon.

Because of (the lack of) crosswalk synchronization, it's a lot faster to walk to a place 2 blocks over and 2 blocks up than it is to walk to a place 4 blocks in one direction. Because at the first two lights you have the option of crossing in either direction, at which point you may only have to wait a few moments before crossing in the other direction.


The measure is concerned only with the distance traveled, not the time.


That's quite a narrow view. You can certainly model time in a metric space, i.e. it you go one block west and one new-york minute forward.


Yes, this is the value of optionality, as N.N.Taleb would say.

I had exactly the same thought about crossing streets when reading one of his books. Always choose a route where you have the choice to keep walking in some direction towards your destination, with a delayed jaywalk to cross to the side you want during a gap in traffic, rather than wait on a corner when a crossing is blocked. The reason for blockage can be a crosswalk light (if you obey them) or traffic itself - the rule applies even without any signal-controlled junctions.


Ah that's so cool. I've been reading about expander graphs lately. Their interesting properties make them pertinent to lots of questions. In particular, you can use expander graphs to cook up error correcting codes and prove the PCP theorem. There's a class of expander graphs called Ramanujan graphs which are characterized by an analogue to the Riemann hypothesis involving a zeta function that counts the prime cycles in a graph.


I really hope this leads to databases with good support for fast nearest neighbour search. This would be especially useful in word2vec cases, where you have millions of word vectors, and want to find "words similar to this one" without either having all of them in memory, or going through all of them to find out.


I wonder what this can mean for fuzzing, optimization, learning or any kind of task that has to do with tip-toeing into potentially high dimensional spaces?


There are a lot of places in practical neural nets with attention where you want softmax(queryvector · memorymatrix), where memory can be quite large. If you have a decent ANN implementation, you can approximate by only calculating the dot product for the vectors of memory that neighbor the query.

There are currently a ton of mediocre ways to do this because nothing really works very well in high dimensions, and calculating this can easily be the bottleneck in training and evaluation.


I'm curious to if this result will extend to the k-nearest neighbors (k-NN) algorithms.

Two problems that have to do with k-NN are

1. It's a non-parametric method: the number of parameters grow linearly with the size of the training set since the distance function must be calculated for all training points and the test point.

2. The curse of dimensionality: distance metrics like the Euclidean distance do not perform well in higher dimensions; points which seem "close" in 2D may be far in 3D, 4D, etc. As a result, we would need an exponential amount of more training data for every additional dimension. Locality sensitive hashing tries to combat this by reducing the dimensionality of the data.


I’m curious if it will extend to k q-flats, or other notions of points being near each other purely by being near to the same subspace, rather than pointwise nearness.


Link (from the article) to the paper with details: https://www.ilyaraz.org/static/papers/spectral_gap.pdf


The article announces the algorithm, which isn't published yet. This first paper contains the proof that the result is possible, not yet the efficient algoithm.


One assumes its an extension of their previous work (https://arxiv.org/pdf/1501.01062.pdf), which was only valid for Euclidian and Hamming spaces?

Edit: The author says its https://ilyaraz.org/static/papers/daher.pdf, but he got marked dead by HN.


I vouched for the author's comment. Any clue why people are downvoting/flagging it?


New accounts posting comments with links are killed by the spam filter before anyone even gets the chance to downvote/flag.


I await this paper with eager anticipation.


This is the second paper, where we present an actual fast algorithm for general normed spaces: https://ilyaraz.org/static/papers/daher.pdf . Enjoy!


Does any of this have implications for libraries like your FALCONN https://falconn-lib.org? Have not read the paper yet.


thank you very much!


Curious to see Timothy Chan develop some screaming fast C code using their method.


Click bait. Not universal. Nearest neighbour algo that works well for a family of distant measures.




Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: