The $25B eigenvector (2006) [pdf] (rose-hulman.edu)
342 points by onecooldev24 on Jan 7, 2018 | 68 comments

An oldie but a goodie. IMHO, this paper, along with Larry and Sergey's original work[0], should be part of every Linear Algebra course. Few things get students to pay attention like the mention of ungodly amounts of money!

[0] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

Can't speak for all linear algebra courses, but my linear algebra course in college actually spent at least two lectures and a discussion section on that paper back when I took it.

I'm also pretty sure a simplified version of this was used for French entrance exams a couple of years back. I remember taking a practice exam with this as the topic, at least.

Unasked for, arrogant rant:

Do you mean in school or in university?

Because if somebody is taking linear algebra in university and still didn't understand that this is an extremely important topic, then he should maybe go back to school or take a gap year or something to improve his general education.

From a university lecture in linear algebra, I expect that it carries me through as many important topics as possible while giving me the tools to form a good formal and intuitive understanding.

The motivation part should be solved by that time.

Ya know, I get what you're saying and I am somewhat sympathetic to this position. But I think you're being a bit harsh. It's easy to talk about "understand(ing) that this is an extremely important topic" when you already know. But how many high-school kids really know anything about Linear Algebra or understand much at all about how/where it's used? Heck, rewind time a bit and ask if you really understood that when you were first starting college. And if you did, realize that you are probably the exception.

From a university lecture in linear algebra, I expect that it carries me through as many important topics as possible while giving me the tools to form a good formal and intuitive understanding.

Sure and unless your goal is to do pure mathematics research, part of that is providing some motivation in terms of understanding applications. There's nothing wrong with a class on LA (or any other topic as far as that goes) spending some time motivating study of the topic by showing how it can be useful... even explaining how it might used to make millions or billions of dollars.

When I took math in college there were no motivating examples for nearly anything. I had to complete my CS courses to understand even a quarter of the actual applications of what I learned in my math classes. The only reason I know more applications now is that I am trying to teach myself SMT solving and ML. Why can't we have a math book with motivating examples applicable to everyone? Or, if that is too hard, then at least one for Computer Science?

Right now, nearly every math book bought by a college or high school student motivates the material with physics. So a person could go through HS, get their undergraduate degree in CS, and never figure out how math could be useful in their job at all. When students ask "when am I going to use this?", you have already failed to teach them how to apply it.

Once they have reached this point, either they will have a really bad impression of math and just give up, OR they will believe whatever you say afterwards and get rewarded ten years down the line.

I don't think that the GP is advocating that we motivate students to study linear algebra by telling them it will make them rich. They were merely saying that this is a nice way to grab students' attention.

I really liked it when we went over PageRank in one of my university lectures. It made me think "Wow, I can actually have an impact using the things I've learned here".

You are wrongly assuming most courses aren't taught by overworked professors or TAs who just speed-run through problems with no background on the subject and more or less teach you to pass the exams. Not everyone gets Ivy League lectures.

I don't see how adding another example why systems of linear equations are important would solve the speeding problem.

> Because if somebody is taking linear algebra in university and still didn't understand that this is an extremely important topic, then he should maybe go back to school or take a gap year or something to improve his general education.

Because not every professor romanticizes the subject or discusses its importance; many teach nothing beyond surface details about the problems.

> if somebody is taking linear algebra in university and still didn't understand that this is an extremely important topic, then he should maybe go back to school or take a gap year or something to improve his general education.

raises hand

I was that kid.

I had to take differential equations, linear algebra, and discrete systems for my CS undergrad. I loved math, and I did my best but I eventually got bogged down and lost interest.

Since then I’ve come to appreciate more why you would want that advanced math as a programmer, but my experience at the time was that I could brute force almost anything with enough for-loops.

It just didn’t occur to me that there were interesting problems that were too complex to brute force, or that brute force would lead to such tangled code in some cases that it wouldn’t be debuggable.

Frankly, I doubt more liberal arts would’ve convinced me. I needed to get knee deep in big bad code and domain specific problems before I’d realize what I missed.

I am curious to know what major innovations in search engines happened since the page rank algorithm, or were there only incremental improvements?

Also is search considered a solved problem?

Definitely not solved. I wouldn't even say search is a well defined problem.

In any case, PageRank is a method for estimating the quality of a page based on the number of inbound links, not a solution to all of search.
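For readers new to the idea, the method described above can be sketched as a power iteration. Everything below (the four-page graph, the iteration cutoff) is an illustrative assumption; only the 0.85 damping factor is taken from the original paper:

```python
# Minimal power-iteration sketch of PageRank on a hypothetical
# four-page link graph. A page's score is the stationary probability
# that a "random surfer" is on it: with probability d they follow a
# random outbound link, otherwise they jump to a random page.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.85, iters=100):
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iters):
        # Start each round with the "random jump" mass, then add the
        # mass each page passes along its outbound links.
        new = {p: (1.0 - d) / n for p in links}
        for page, outs in links.items():
            share = d * rank[page] / len(outs)
            for target in outs:
                new[target] += share
        rank = new
    return rank

scores = pagerank(links)
# "C" collects links from A, B, and D, so it comes out on top.
print(max(scores, key=scores.get))
```

The result is exactly the dominant eigenvector the paper's title refers to: iterating the (damped) link matrix converges to its stationary distribution.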

But it's a property of the web at the time, not something universal to the search problem, e.g. it's not a statistic that exists if you want to search books.

I think the work being done on question answering (given a question and a document that answers the question, provide a concise answer) is a place where a lot of interesting work is being done, both in academia and at Google with the snippets of web pages it provides.

In particular, PageRank can't be used for corpora whose documents do not link to each other explicitly somehow, and explicit linking, outside of hypermedia, exists almost nowhere.

Academic papers, patents, textbooks: anything with citations also works.

The problem with these kinds of citations is that they are time-determined; that is to say, paper X will never come to reference paper Y because paper X came first. Thus resources that exist first will accrue more rank under a non-hypermedia citation system.

In order to rank papers, I think you would instead have to rank people, so that the people who wrote Paper X can, in a later Paper Z, reference the authors of Paper Y.

But of course it would need a CiteRank of equivalent quality to PageRank to be at all useful.

This is sort of true, but the graph of citations is basically a DAG - very unlike the graph of hyperlinks on the web. From what I've seen it's not obvious that PageRank on a DAG tells you anything super interesting.

That makes me wonder if there are actually any cycles in paper citations. I can imagine this scenario:

A paper X has been updated to reference a response or later work Y, which itself referenced X from the start in such a way as to make the 'version' of X in the reference unknown. Citation-trawling software might bite hard on a loop like that :P

Anyway, I also wonder why having cycles makes PageRank useful and lacking them makes it less so -- you can still count inbound links and such with a DAG, and huge amounts of the content of the web would exist in DAG-equivalent subtrees, wouldn't they? I could have this pretty wrong; I haven't looked at the paper in years and should go do so!

PageRank is overkill if you don't have cycles, since you can just trivially count the DAG.
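One way to read "trivially count the DAG", sketched on a hypothetical four-paper citation graph (the papers X, Y, Z, W and the `transitive_citations` helper are made up for illustration):

```python
from collections import defaultdict

# Hypothetical citation DAG, listed oldest paper first; papers can only
# cite papers that already exist, so there are no cycles.
cites = {
    "X": [],
    "Y": ["X"],
    "Z": ["X", "Y"],
    "W": ["Y", "Z"],
}

def transitive_citations(cites):
    """Count, for each paper, how many later papers cite it directly or
    indirectly. One backward pass suffices on a DAG; no PageRank-style
    fixed-point iteration is needed."""
    citers = defaultdict(set)
    # Walk newest-first so a paper's own set of citers is complete
    # before being propagated onto everything it cites.
    for paper in reversed(list(cites)):
        for cited in cites[paper]:
            citers[cited].add(paper)
            citers[cited] |= citers[paper]
    return {p: len(citers[p]) for p in cites}

print(transitive_citations(cites))
```

On this toy graph the oldest paper X is reachable from all three later papers, which is exactly the "earlier resources accrue more rank" effect discussed above.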

Interesting, how explicit is explicit? So much of literature is implied references. Assuming that later works are always “linking to” earlier works, could you not use page rank for, say, rap lyrics?

My impression is that "search" isn't so much of a solved problem, but the question is changing.

The next (very unsolved) problem is being able to "understand" natural language queries and "understand" source materials such that a user can ask for something and get it.

"Understand" is in quotes because it means something rather specific.

Is this what is missing to get something like Gmail search working at the same level as web search?

> Also is search considered a solved problem?

Ha, not only is search not a solved problem, I would posit that search is getting WORSE.

Computer knowledge is a particularly good example for how search is degrading with time.

Try to figure out how to do X on the Beaglebone Black (I presume the Raspberry Pi has a similar problem, but it's not something I'm that familiar with).

The problem is that the Linux implementation for the Beaglebone went from weird distribution (Angstrom) to mainline Debian Linux kernel 3.8 -> 4.4 -> 4.14 in a VERY short time so the number of links to new stuff stayed flat.

Consequently, the old Angstrom stuff almost always fills the initial search positions for quite a ways even though it's completely useless.

This is occurring in other things, as well. Stack Overflow, for example, has no way to mark an answer as "This was correct 5 years ago but is now wrong."

Effectively, the web is becoming sclerotic and search engines are following it.

I REALLY miss old AltaVista's feature where it would give you a graphical representation of the clusters in your search so you could drill down into a less popular grouping. The fact that nobody has recreated this makes me wonder ...

> Stack Overflow, for example, has no way to mark an answer as "This was correct 5 years ago but is now wrong."

Not counting comments? What more could you ask for?

Candide would be proud of you...

What more could I ask for?

Which comment is the correct one? There are always multiple "No, that isn't correct, this is the one true way" comments. One posted 5 years after the flurry is unlikely to get many votes.

How about bad information not showing up in my search at all?

How about ageing out votes so it makes sense to come back to a topic and revote?

And this doesn't even account for the information that is simply wrong but nobody cares enough (or has enough karma) to fix.

Curation isn't always bad.

> One posted 5 years after the flurry is unlikely to get many votes.

Interesting point - perhaps a hybrid scheme to decide the ordering of the answers, that balances upvotes and submission date.

It would need to be carefully tuned though - for some questions, answers will age badly ("What's the best way to do parallel stream processing in Java?"), but for others, they essentially won't age at all ("Why is there a small numerical error in this floating-point calculation?").

Perhaps it could be tuned by tag, as a means to estimate how the answers will age.
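A sketch of what such tag-tuned decay might look like; `answer_score` and all the constants below are illustrative assumptions, not anything Stack Overflow actually does:

```python
# Illustrative time-decayed answer score, in the spirit of the widely
# discussed "votes over age^gravity" ranking idea. A per-tag "gravity"
# exponent controls how fast answers go stale: high for fast-moving
# tags, near zero for timeless ones.

def answer_score(votes, age_years, gravity):
    # Higher gravity makes old answers fade faster; gravity near zero
    # makes the ranking behave like a plain vote count.
    return votes / (1.0 + age_years) ** gravity

# Under a fast-aging tag, a fresh answer with modest votes overtakes
# an old, heavily upvoted one...
assert answer_score(15, 0.5, gravity=2.0) > answer_score(200, 6, gravity=2.0)
# ...while under a "timeless" tag, the old answer keeps its lead.
assert answer_score(200, 6, gravity=0.1) > answer_score(15, 0.5, gravity=0.1)
```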

> How about bad information not showing up in my search at all?

You don't want the system to be overly sensitive to undeserved downvotes.

> Curation isn't always bad.

Of course, but traditional curation isn't on the cards simply because of scale - StackOverflow isn't like an academic journal - and we're whining about a system that works incredibly well.

Look at YouTube comments, or Yahoo Answers, and you see what a shitshow it can be when the Internet tries to have a conversation. It's a small miracle that intellectually worthwhile forums like this one can ever work. StackOverflow does a lot right.

> I am curious to know what major innovations in search engines happened since the page rank algorithm, or were there only incremental improvements?

a ton has happened! since PageRank, there's been a ton of advances around NLP that have changed the way queries are processed prior to information retrieval. for example, Google's RankBrain seems to do a lot of the heavy lifting around word similarity.

> Also is search considered a solved problem?

I certainly wouldn't, since I still encounter things that I know are on the internet but Google can't find. It's possible that the next advance won't be in indexing the web but in figuring out what the user wants rather than what they requested.

> what the user wants rather than what they requested.

Amusing anecdote regarding this issue.

  - I teach an introductory online chemistry class. 
  - If the students are determined enough, they can/do cheat on their quizzes.
  - In one of my quizzes, I give the students a formula for a pretend material and ask them to compute its molar mass.
  - If you perform the calculation, the molar mass works out to something like 108 grams / mole.
  - If you try to Google the answer, Google is smart enough to know that my compound is unstable. 
  - Instead, Google provides the molar mass for a _related_ material (86 grams / mole)
  - Each semester, I find a handful of students who dutifully tell me the answer is 86 g / mole.


Reminds me of my metalwork craft classes in school. We were making our own wrenches from scratch, and that requires a bit of geometry and drafting to make the blueprint (and the template). A couple of guys decided to cheat by pressing an existing wrench against the paper and using a pencil to copy the shape. Sounds like a fine idea in theory, but the result is obviously fake, and also distorted enough to be unusable (the pencil-to-surface angle and the pencil's thickness won't let you take exact measurements, and the inability to maintain a stable angle distorts the shape). Didn't end well for them; they nearly got kicked out.

Google does this to some extent. Recently I've found that Google uses my past queries to make cogent suggestions for my next search (essentially to predict what I want next based on prior info). Eg, if I've just searched for "sully", then a short time after type "t" into Google, the first suggestion is Tom Hanks. I've only noticed this in the last few months.

My favorite example of this is the Bullet physics library.

Google eventually learned to give me documentation in response to stuff like 'bullet collision', but for awhile it was big on youtube links and gun ranges.

I’ve experienced this when learning new programming languages and other domains. When you start, Google results are mostly garbage with a few useful bits. A few months later, it feels like the web is awash in materials to answer your every question and you can always find an answer for your problem in the first 5 results.

Part of it is learning which phrases to use when searching, but I’m sure a big part of it is also Google figuring out what you want.

Google's monetization strategy skews the results. My guess would be we need either a way to search beyond Google or a way to force it to give results that aren't manipulated somehow.

There's Google Panda, which was meant to combat content farms, but unfortunately generates a lot of false positives: https://en.wikipedia.org/wiki/Google_Panda

Well...pretty much everything important turns out to be an eigenvector.

That has less to do with some mysterious fact about eigenvectors, and more to do with the fact that linear algebra is incredibly well studied and understood, with octopus arms sticking into every special case of applied mathematics imaginable. Give it a few decades and "everything" will be some other (today obscure) mathematical idea.

At least one part of the universe really does seem to prefer linear algebra: quantum mechanics. Every quantum state is a linear combination of eigenvectors, and that's unlikely to change within a few decades (it would be a revolution analogous to the discovery of quantum mechanics itself).
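Spelled out in standard Dirac notation, the claim is the spectral expansion plus the Born rule:

```latex
A \lvert a_n \rangle = a_n \lvert a_n \rangle, \qquad
\lvert \psi \rangle = \sum_n c_n \lvert a_n \rangle, \qquad
P(a_n) = \lvert c_n \rvert^2 .
```

Measuring the observable A returns the eigenvalue a_n with probability |c_n|^2, which is why the eigenvectors carry the physical content of the theory.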

I would argue that the reason QM is expressed in linear algebra is also not a mysterious property of LA, but because that is what was part of the physicists' curriculum when the field was developed.

Physicists would have preferred to do it with calculus - indeed, intro QM is usually taught with calculus, not with linear algebra. But you can't go very far without talking about operators and eigenstates. Linear algebra fits the underlying physical phenomenon. People have tried to develop nonlinear versions of quantum mechanics, with no success so far. In particular, if quantum mechanics were not exactly linear, then a quantum computer would be able to solve NP-complete problems in polynomial time!

To get an intuition of why quantum mechanics looks the way it does, Scott Aaronson is extremely helpful: https://www.scottaaronson.com/democritus/lec9.html

I think the main reason for the initial focus on calculus in intro QM is more to do with making sure you have a solid grasp of the ugly, complicated, hard way of doing things before diving into the more elegant, easy way (once armed with an appreciation of all the whacky machinery under the simplified proverbial hood). I recall my intro classical mechanics had a similar approach on a few topics, and my professor explicitly said this was the reason for it. That and the elegant way is generally more abstract, and it's important to have the foundation the abstraction was built on first.

The link you posted looks interesting and fun to read, I'm looking forward to going through it after work. Thank you!

I'm not a fan of the teaching method where you do something the hard way first so you'll appreciate it, but I was a pretty lazy student so that's understandable. Maybe it works for most students.

You can also express QM/QFT in terms of matrix algebra (S-Matrix formalism).

That's a subset of the linear algebra used in quantum mechanics, no? Any transformation between QM states can be expressed as a unitary matrix - this is not a particular characteristic of scattering phenomena with S-matrices. The quantum states are still vectors, as in the rest of QM.

On second thoughts, I expressed myself very poorly... it goes without saying that matrices and linear algebra are identical; what I meant was that one can compute quantum interactions with the S-Matrix alone and not need to 'explicitly' use linear algebra concepts to deal with superposition of states et cetera. The focus on the S-matrix as a fundamental formalism was spearheaded by Wheeler, Heisenberg, and Landau (amongst others), if my memory serves.

P.S. Even this statement is formally quite weak, but hopefully I have clarified myself enough to transfer my intended meaning. Apologies.

You're right, it's not some mysterious property of LA, but you're wrong about it being the curriculum. Linear algebra just works and is the best tool we've found.

You’re right: it’s a fairly circular argument.

We actually know that QM (or rather QFT) cannot be the whole story because it cannot be renormalised and reconciled with general relativity, so quite possibly the definitive theory will not be expressed in terms of linear algebra and thus maybe the Universe is not based on linear algebra.

I agree that this is possible, but as I said, I don't expect such a revolution within the next few decades. I would be happy to be proved wrong!

A category maybe?

Wait a sec... everything already is a category.

And a set.

I thought it was known that not everything is a set?

There’s more than one way to define what a set is, and many of these definitions always allow you to postulate there exist further collections that cannot meet that definition of a set ad infinitum. Most importantly ZFC (which is at least in theory the formal basis for the vast majority of mathematics) has this property, and so do many extensions like NBG and TG.

of course - it's the empty set :)

Sadly, this seems to be the last thing we ever heard about Google's search algorithm. It says nothing about topics like modern natural-language processing (which defines a different, non-global ordering of search results, based on the query). How much of the basic algorithm is still in effect?

From everything that is publicly known, assume PageRank still exists as one input feature to the RankBrain algorithm.


Although there haven't been any major announcements, from daily anecdotal evidence I can confirm it's still a major factor to get you into the front pages.

I'd say the major changes since deprecating PageRank in practice are:

1. Much higher dependence on CTR and bounce rate once you start showing up in the SERPs.

2. Much higher influence of a notion of "trust" on links (not just the quantity, but mainly the quality counts now; too much quantity without quality can actually hurt).

3. PageRank much more disconnected from the domain level. A few years ago, you could rank with pretty much anything if your DomainPop was high enough. By now, Google has got much smarter about folders that don't have anything to do with what it's ranking your main site for, and it's harder to get them ranking. On the bright side, it also got smarter about the negative SEO influence of subdomains or domain changeovers, which won't cost you as dearly.

So TL;DR: PageRank still exists, but it's been reduced to an input vector.

For example: "2 or 3 years ago links was 80-90% of what it took to get something ranked, Panda has changed that in an insane way. Here's the test example. Go to Google and search for 'best time to visit Tahiti'. You'll find my little site, VisualItineraries up there, #1 for that, ahead of TripAdvisor, Lonely Planet, USA Today, all these other sites. These other sites have between 10,000 and 250,000 domains linking to them. My site has under 100. I rank #1 for that. Now in case you think it's internal link, anchor text, or page title match. Here's the other proof. Do a google search for 'when should I go to French Polynesia'. The only word in that that matches the page title or any anchor text is the word 'to' - (and it's a) stop word - that's not going to count. VisualItineraries.com is #1 for that too."

source: https://www.youtube.com/watch?v=7RvRUt9ejkw

I've found this too. My tiny little site sits at number one (for most people in English) for most combinations of 'why was Margaret Thatcher hated?'

It outranks national newspapers and all sorts of huge sites, as it should really; the info in the post is pretty comprehensive and sourced.

Somehow Google worked that out for itself based on way more than the old page rank back link weighted algorithm.

Damn, that's some good SEO. Interesting backlinks; I'm surprised Google hasn't caught on to this.

The resources already linked from this thread are surely superior but, as a shameless plug, I wrote a blog article recently on this very topic as it was presented in my linear algebra course at CU Boulder: https://thought.place/articles/2017/7/22/pagerank/

Clickbait title. You can say the same about sums and multiplication: they are at the basis of most trillion-dollar technologies :P

This isn't about eigenvectors in general; it's about a single eigenvector that Google built. The title is accurate: their search product is giving you access to query the eigenvector.

> You can say the same about sums and multiplication

Yes indeed. In particular how multiplication distributes over addition. It is powerful technology.

This post has happened before, it'll happen again: https://news.ycombinator.com/from?site=rose-hulman.edu

This article is from 2006. It's interesting but I'd hazard a guess that it's hardly relevant to what Google is doing now.

Google has always spoken about adjustments to pagerank and they have been visible but I assume the core concept still applies. Maybe someone with more inside knowledge can comment?

Nice try, Bing.

But in all seriousness, Google is notoriously tight-lipped about how PageRank works, so I doubt we're going to get any more information than this. I'd love to proven wrong, though!

Ok, we'll add the year above.
