Show HN: I made TV Sort, a web-based game for ranking TV show episodes (tvsort.com)
69 points by pocketarc 9 months ago | 67 comments
Over this Christmas break, while discussing the best episodes of Frasier with my mother (as we tend to do when I get to see her), I thought about coming up with something that's less arbitrary than 1-10 ratings.

The result is TV Sort. It just uses a sorting algorithm, but... it's human powered. When the algorithm needs to compare two items, it asks you to compare them, and with that you end up with a full, thoroughly sorted episode list.
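To make the idea concrete, here is a minimal Python sketch of a "human-powered" sort. It is an illustration only, not the site's actual code: `human_sort` and `ask` are hypothetical names, and `ask(a, b)` stands in for the UI that shows two episodes and returns True when the first is preferred.

```python
from functools import cmp_to_key


def human_sort(items, ask):
    """Return items ordered best-first.

    ask(a, b) is a hypothetical callback: it shows the pair to a person
    and returns True when a is preferred over b.
    """
    def compare(a, b):
        # A negative value tells the sort that a belongs before b.
        return -1 if ask(a, b) else 1

    return sorted(items, key=cmp_to_key(compare))
```

Every comparison the sort needs turns into one question for the user, so the number of questions is whatever the underlying sorting algorithm requires.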

It uses TMDB, IMDB, and Wikipedia to extract episode information for any show, to help jog your memory when making episode comparisons.

It was a fun little experiment. And finally, I know -exactly- what I think the best and worst episodes are.[0]

Would love to hear your feedback, this is my first Show HN. ;)

Edit: I wrote a whole blog post about what went into making it, if anyone wants to read more of the technical detail behind it.[1]

[0]: https://tvsort.com/show/3452/matrix_01hjtxz2e1ewkrh44ja3mz0s...

[1]: https://pocketarc.com/posts/tv-sort-engineering-the-ultimate...




You can do much better than just sorting. The simplest is to use Bradley-Terry. It's a very simple algorithm and will let you combine results from multiple users and gives an actual rating rather than just a ranking.

It also handles the probabilistic nature of sorting better. Traditional sorting algorithms rely on comparisons being sensible (a>b and b>c implies a>c) but you probably won't get that if you use people.

I explained it here:

https://stats.stackexchange.com/a/131270/60526
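For anyone who wants to see the shape of it, here is a rough Python sketch of fitting Bradley-Terry strengths, where the model says P(i beats j) = p_i / (p_i + p_j). It uses the standard iterative (minorise-maximise) update. The regularisation is an assumption made for this sketch, not necessarily what the linked answer does: each item gets `prior` virtual wins and `prior` virtual losses against a fixed opponent of strength 1. The function and parameter names are made up for illustration.

```python
def bradley_terry(items, results, prior=0.5, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    prior: virtual wins and virtual losses each item gets against a
    fixed opponent of strength 1 (an assumed, simple regulariser).
    """
    wins = {item: 0.0 for item in items}
    games = {}  # games[(a, b)] = number of times a and b were compared
    for winner, loser in results:
        wins[winner] += 1
        games[(winner, loser)] = games.get((winner, loser), 0) + 1
        games[(loser, winner)] = games.get((loser, winner), 0) + 1

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Contribution of the 2 * prior virtual games.
            denom = 2 * prior / (strength[i] + 1.0)
            for j in items:
                n = games.get((i, j), 0)
                if j != i and n:
                    denom += n / (strength[i] + strength[j])
            updated[i] = (wins[i] + prior) / denom
        strength = updated
    return strength
```

Results from several users can simply be concatenated into `results`, which is how the votes get combined. With no votes at all, every item stays at strength 1.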

Quite closely related to matchmaking in computer games.

I remember there was a website a while ago that used pairwise comparison to rank programming languages and I think whiskey. Does anyone remember this? I could never find it again.


I did something similar, in fact, the math may be the same thing and just expressed differently. But when I've had to rank non-transitive things, I use Elo (https://en.wikipedia.org/wiki/Elo_rating_system)

Many years ago when I was a mid-level developer at a dysfunctional company, I was senior enough to be invited to some "strategy" meetings, but junior enough that no one ever listened to me. We (engineering, sales, marketing, etc.) spent nearly an entire summer bickering over what "important" features we were going to schedule next. I finally got fed up, took everything out of the ticketing system, made random pairings, and had people vote on them. Then, just like a chess match, I updated each feature's Elo score based on the outcome. Then I had anyone who cared play match-ups for as long as they wanted. We ended up getting a decent ordering of features and finally ended the summer of hell meetings.

I don't know if the order was the correct order, I didn't stay around long enough to see. I was just happy that sales and marketing folks thought that I had some magic math that solved their problem, and I was happy to be back developing and not sitting in useless meetings.

What I like about this is that you don't have to be self-consistent: as long as on average you pick the best, it will bubble to the top. And you can mix the results of other voters and see what the "true" winner is. (Of course, to be fair, you have to give each person the same number of match-ups. In my case, I just served match-ups to anyone who wanted to sit at the terminal and vote, so someone could have wasted an entire day and overwhelmed the system. I didn't care at the time.)
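The per-vote update is tiny. Here is a sketch in Python using the standard Elo formulas; the K-factor of 32 and the starting rating of 1500 are conventional defaults assumed for the example, not values from the original system, and the function names are made up.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Adjust both ratings in place after one vote (k is an assumed default)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical feature names, all starting from the same rating.
ratings = {"dark mode": 1500.0, "export to CSV": 1500.0}
update_elo(ratings, winner="dark mode", loser="export to CSV")
```

An upset (a low-rated item beating a high-rated one) moves the ratings more than an expected result, which is why occasional inconsistent votes wash out over many match-ups.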


Wow, this is extremely helpful, I had no idea this existed and will have to read up on it properly.

I think my main concern would be: What would it be like for the first user to try to rank a show (as was the case for everyone today)? All probabilities would be 50-50, no? But if it's a show that's already been ranked at least once, then this could help immensely, if I understand correctly.


Due to the regularisation yeah they all start at the same rating. But you don't need many votes to start getting good ratings.

I introduced this method to Dyson for objectively calculating very subjective measurements (e.g. "how frizzy does this hair look?"). We basically crowd sourced it to other engineers.

I did a load of studies on different methods by ranking something that's sort of hard to rank but you know the answer to - I used 10 grey squares that only differed by 2/255 and you had to pick the brighter one.

Some other things:

1. I don't remember the exact details but there's a slight extension of the method where you give each user a "how good are you" coefficient that you simultaneously solve for. This helps eliminate people that vote randomly, and also inverts the votes of people that deliberately pick the wrong answer (as long as they're consistently wrong).

2. You can put confidence limits on the values very easily too since it's a MAP estimate. Actually I showed curves for each item - basically how does the model probability vary as you sweep one rating up and down a bit. People didn't understand it at all though.

3. You can calculate the rankings incrementally very quickly (details in the answer) which means you can show users comparisons that give the most information. This usually means you end up showing users endless difficult choices which can frustrate them, especially if it's a forced choice.

4. I never found a principled way to incorporate a "they look the same" option. I tried some ad hoc methods and IIRC a "much better, slightly better, can't tell, slightly worse, much worse" scale gave the fastest convergence, but it was pretty unsatisfying that I just used some ad hoc method to add the results.
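To illustrate point 3, here is a rough Python sketch of choosing the next comparison. It assumes Bradley-Terry style strengths are already available and uses a simple proxy for "most information": the pair whose predicted outcome is closest to 50-50. That proxy is an assumption for this sketch, not the exact criterion from the linked answer, and the names are made up.

```python
from itertools import combinations


def win_probability(strength_a, strength_b):
    """Bradley-Terry probability that A is preferred over B."""
    return strength_a / (strength_a + strength_b)


def most_uncertain_pair(strengths):
    """Return the pair whose predicted outcome is closest to a coin flip."""
    def distance_from_even(pair):
        a, b = pair
        return abs(win_probability(strengths[a], strengths[b]) - 0.5)

    return min(combinations(sorted(strengths), 2), key=distance_from_even)
```

This also shows why users get frustrated: the pairs closest to 50-50 are exactly the ones that are hardest to decide.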

It was all closed source and I haven't worked there for years so the code is lost to the wind unfortunately.


This is honestly very interesting, thank you so much for elaborating! To be fair, after today, there are now nearly 400 TV shows with votes, so I can start seriously looking into this very soon!


Interesting! We could use this algorithm to rank websites for every term in search engines. Just need a good UI to collect the data.


This is so cool. I always wanted a way to do this


I've tried with two different shows, both times the first selection starts at S01E01, the second selection was I'd guess a high rated episode from a later series.

If I say selection one is better, selection one becomes the next episode, S01E02, while selection two stays the same episode. If I say either or selection two is better, selection one is picked at random and selection two stays the same episode. And then if I go back to selection one being better, selection one moves back to the next episode, i.e. S01E03 now.

Is that the intended behavior? As I quickly got bored of saying series one was better than this one episode in series three.


This is my one complaint as well. I really like the idea of this website, and it seems to be made well! I tried out Star Trek TNG, and regardless of which I picked, I was always comparing against the same episode on the right side (Season 4 Episode 15, which I think is the exact middle episode of its run?). I know all the comparisons must be made eventually, but it would be nice to swap different episodes in so that I'm not comparing every single episode against one at a time.


You're right, it is the exact middle episode. I picked shows with variable series lengths, so it was hard to tell.


Had this exact same experience. Picked the Simpsons just to try it out and in the first pair up was S1 stuff vs some episode from S18 I've never seen.

So I picked The S1 episode. Then it was another S1 episode. Repeat. Then an S2 episode.

My second option never changed from that first episode from S18 so I never picked it. Perhaps I could have gone through 12-17 seasons of episodes always picking option one until two episodes I've never seen went head to head.

IMO, both options need to randomize each time until first pairings are exhausted.


> Is that the intended behavior?

Yeah, I'm afraid so. At the start, selection two would always be the same as the algorithm doesn't know anything about the standing of any episode; it's trying to decide where in the array to place it (above or below selection two).

It might be worth seeing if I can randomise the episodes displayed, if only so it doesn't feel so repetitive.


I've done something similar.

You're essentially doing A/B comparisons across the entire set.

It looks like you have it so you're basically setting where "B" is before moving on to the next item.

This isn't strictly necessary. You could just generate a novel pair every time and ask the user to choose between them. The thing is that you'd need a way to track a user. So you can make sure that user hasn't seen a certain pair already.

Once you've exhausted all the pairs, you'll know exactly how to sort the array. You'll have an idea before.

You might have an issue with circular lists though. People are fickle. You could have someone who says that A > B > C > A.

In this case, I'd allow for repeat pairings after a certain amount of time. To allow the person to reevaluate essentially.

You could also take the comparisons across all users and compile a general sort of "best of" ranking.
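Spotting those circular answers is straightforward once the choices are stored. A small Python sketch, assuming each answer is kept as a `(better, worse)` tuple (an assumed representation; the function name is made up):

```python
from itertools import permutations


def find_cycles(prefers):
    """Return every trio of items whose answers form A > B > C > A.

    prefers: set of (better, worse) tuples recorded from one user.
    """
    items = {item for pair in prefers for item in pair}
    cycles = set()
    for a, b, c in permutations(items, 3):
        if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
            cycles.add(frozenset((a, b, c)))
    return cycles
```

Any trio returned here is a good candidate for being shown again later. This brute-force check is cubic in the number of items, which is fine for a single show but would need something smarter for very large lists.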


> The thing is that you'd need a way to track a user.

The page you're on already knows what comparisons you've made (otherwise how would it move forward?), so this is entirely possible!

I've come up with a way to randomise it (by just picking a random element from an array of comparisons that haven't yet been made), that's the next step. I deployed it earlier, but there was a bug with it so I've had to rollback until I can look into it.
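Roughly, the idea looks like the Python sketch below. This is a simplified illustration rather than the deployed code: it treats every pair of episodes as a candidate, which is an assumption (a real sort would only offer the comparisons it currently needs), and `next_pair` and `done` are made-up names.

```python
import random
from itertools import combinations


def next_pair(episodes, done, rng=random):
    """Pick a random pair that has not been compared yet, or None if finished.

    done: set of frozensets, one per pair the user has already answered.
    """
    remaining = [
        pair for pair in combinations(episodes, 2)
        if frozenset(pair) not in done
    ]
    if not remaining:
        return None
    return rng.choice(remaining)
```

After each answer the pair is added to `done`, so the same question is never asked twice.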


Why does selection one get randomized if you say selection two is better, but then jump back to the original sequential if you say selection one is better again? That behavior felt bizarre.


Selection one jumps between the start and the end of the show repeatedly until you make it to the middle (selection two), after which point it'll move on to getting you to rank the best episodes, and then the worst episodes.

Definitely looking into seeing if I can come up with something better though! The problem is making sure that whatever algorithm is picked remains as close to O(n log n) as possible. Randomising options in a way that makes it require a lot more comparisons would be far worse.


> Randomising options in a way that makes it require a lot more comparisons would be far worse.

For the algorithm, but not for the people taking time to do the rankings. Which do you want to prioritize?


I was prioritising for taking less time total, but you’re right, that doesn’t matter if the person gets bored and leaves. I’m tinkering with it now and I think I have a good solution to the problem. I’ll be deploying it soon!


Maybe you can pull in episode rankings from another site to seed a first ranking and then make an algorithm to find where you disagree with the norm.


Make it slightly fancier and assign a “similarity” between users.

Start with uniform similarity, and as preferences are made, adjust them.

Then you have personalization.


I'm not sure what you mean by this - are you saying that the episodes that get displayed would be based on what is likely that other users would've picked for the same show?


I’m saying that instead of one objective ordering, you have a subjective per user ordering.

The decisions made by other users get a weight assigned to them, which is individual to each logged-in user. So every user's viewpoint is personalized.
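One possible reading of this, as a Python sketch. Everything here is an assumption made for illustration: votes are stored as a dict from a pair to the preferred item, similarity is the share of commonly answered pairs on which two users agree, and users with no overlap get a uniform default weight.

```python
def similarity(votes_a, votes_b, default=0.5):
    """Share of commonly answered pairs on which two users agree."""
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return default  # uniform starting similarity
    agreements = sum(votes_a[pair] == votes_b[pair] for pair in shared)
    return agreements / len(shared)


def personalised_score(my_votes, other_users, pair):
    """Weighted share of other users preferring pair[0], from my viewpoint."""
    total = in_favour = 0.0
    for votes in other_users:
        if pair in votes:
            weight = similarity(my_votes, votes)
            total += weight
            if votes[pair] == pair[0]:
                in_favour += weight
    return in_favour / total if total else 0.5
```

The same set of votes then produces a different ordering for each user, because each user weights everyone else differently.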

(Apologies for the short explanation, I’m on mobile. If you want to ask more you can email me.)


I'm both a casual TV viewer and someone with a short attention span. My experience was: I tried comparing a few Seinfeld episodes, found that I am unable to recall the episodes from the descriptions, then gave up.


Hah, I have the opposite problem. As soon as I know the episode title, I lose interest because I speedrun the episode in my head and don’t want to watch it anymore.


Yeah, this thing requires a decent time commitment as it stands.

One of the ideas floating around is to make it so you do this by season instead, to make it a bit more "quick casual fun", rather than having to rank the entire show all at once. Scoring 180 Seinfeld episodes as a casual TV viewer isn't going to be a great experience.

Honestly, HN feedback has been immensely helpful, I couldn't be more thankful.


It keeps giving me the same episode vs a different episode every time. Feels like I'm manually doing bubble sort :)

Maybe randomising the selection would keep me going longer.


I agree. I think using the transitive property to place episodes relative to others would help a lot. Also something like "pick your favorite of these 3-5" might go faster and make it feel more fun
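For the transitive part, a small Python sketch of the idea: before asking the user, check whether the answer already follows from earlier answers. The `(better, worse)` tuple representation and the function name are assumptions for the example.

```python
def implied_better(prefers, a, b):
    """True if a > b follows from known (better, worse) answers by transitivity."""
    stack, seen = [a], {a}
    while stack:
        current = stack.pop()
        for better, worse in prefers:
            if better == current and worse not in seen:
                if worse == b:
                    return True
                seen.add(worse)
                stack.append(worse)
    return False
```

If the user has said A > B and B > C, then `implied_better(prefers, "A", "C")` is true and that question can be skipped. This only helps when the answers are consistent; with circular answers the inference can point both ways.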


I have been thinking about showing more than 2 options to help it go faster. On mobile I guess that would be quite difficult, but for people on bigger screens, yes, let them run through episodes as fast as they can.


It’s always comparing to the same episode. Not sure if that’s by design, but it made it feel stale, or not fun. You also need to show a progress bar if you are gonna compare shows across seasons, but I’d stick to comparing shows in a single season instead.


You're right about doing it by season, that would definitely make it a lot easier to just jump in and start, without making a big commitment to ranking a whole show.


This has been a great thread for reading suggestions for algorithm improvements, etc.

If you're interested in gathering a lot of data from a lot of people as they watch TV you might want to look at integrating this in as a JellyFin plugin in some manner or otherwise hooking into the home media centre crowd.

Presenting results back to the users would also be of interest, particularly (IMHO) if cluster groups of preferences are teased out (a lot of people really like these episodes vs. this other distinct group that seems to prefer these ones).


The website isn't really my thing (maybe I'm weird, but I already know my favorite 3-4 episodes of every show I've watched) -- but the writeup is stellar and has just the right level of detail. Using LLMs to generate episode summaries and having a fallback plan is really going above and beyond to get a great UX. Great stuff.


> I already know my favorite 3-4 episodes of every show I've watched

I did try the site, but the 2nd "static episode" of the algo was a fantastic finale (Venom of the Red Lotus), and it made me realize I'm kinda like this too. I already knew the best few episodes.

I don't remember every show, and not always by name, but I think I remember the best episodes of shows I would bother discussing with friends.


Thank you for the kind words, and for helping me get an idea of how my writing is coming across, I appreciate it a lot!


This is super interesting, as over the break, I was thinking of something exactly like this, but for video games.

Feel free to steal the following, if anyone likes:

Take a scrape of video game data from multiple sources, such as steam, amazon, game sales, forum sizes on reddit, as an example, then rank the games based on these metrics - but then have people vote for "hall of fame"

Include as much historic data on game sales for all time as possible, since so many games were introduced during our formative years and thus have a deeper, more memorable impact.

--

I see the problem others state. Perhaps have a random button to just give you a new selection.

Great work though.


The problem with multi user title voting is that it becomes a popularity contest, not an ostensible quality ranking. Steam itself has spent years trying to address this.

TV episodes don't have that problem because each user has viewed all (or at least a sequence of) the episodes.


Good point, but I guess the real issue with video games vs shows is that shows are passive and games are active, so your experience with a game is going to be way different than a show. You don't have to have had eye coordination for Seinfeld.

:-)


I'd assume this doesn't work too well with shows that aren't episodic in nature. Especially if you binge watch, the line between when one episode ends and the next starts is usually blurred.


Highly serial shows definitely have standout episodes, strong/weak seasons and such.

But I can see how this could be hard to remember.


This is pretty neat! For quite some time I’ve yearned for a tool like this to be able to rank my favorite songs of an artist (embedded Spotify 30 second clips?) and other custom media like comic strips.


Time for me to snatch musicsort.com or something. ;-)

That would honestly be fun, great call!


There should be a way to exclude seasons (e.g. Simpsons)


Congrats on the launch!

Curious: how are you getting the data from e.g. TMDB? Was it a one time download or are you refreshing it?

Feedback: Sometimes I didn’t watch the latest season of a tv show; at the moment I’m being asked to rate episodes from all seasons. I’d like to rate episodes of a single season or up to a certain season. Alternatively: an option to skip an episode, or flag that I haven’t seen it.


Thank you! The data from TMDB is cached when someone starts ranking a show for the first time. At the moment there is no system for refreshing it, but that's on the to-do list, so that if there are improvements to the data, they'll be fetched.

Thanks for the feedback - I love the idea of a "I didn't watch it", that's super important. Maybe that could drop it out of the list entirely (since it can't count for anything).

The "rate a single season" idea is one of the main things to come out of today, and it's where I'm going to take this next. When you land on a show, instead of a single "start ranking" button you'll have a list of all the seasons in the show, and be able to rank them individually. And since all these comparisons are stored in your browser, I can make it count toward your "full show" ranking automatically, so that if you ever get to that, you'll already have it in progress.
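The reuse part could work roughly like the sketch below. It is in Python purely for illustration (the real thing lives in the browser), and `ComparisonStore` and `ask` are made-up names: answers are remembered by pair, so a full-show ranking only asks about pairs that no season ranking has covered yet.

```python
class ComparisonStore:
    """Remembers answers so the same pair is never asked twice."""

    def __init__(self, ask):
        self.ask = ask      # hypothetical callback: ask(a, b) -> True if a is better
        self.answers = {}   # frozenset({a, b}) -> the preferred item

    def better(self, a, b):
        key = frozenset((a, b))
        if key not in self.answers:
            self.answers[key] = a if self.ask(a, b) else b
        return self.answers[key] == a
```

Passing `store.better` as the comparison function for both the per-season sort and the full-show sort means earlier answers carry over automatically.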


I’m curious what your picks for best Frasier episodes are?

Two of mine are when they become illegal caviar dealers and when Niles wants to try weed. The episodes with Lilith tend to also be very good. People love the Valentine’s Day one, but I'm never a fan of that style of episode. I can appreciate the genius in it though


“Roe to perdition” is the caviar one, and it’s fantastic as well (#20 for me). Niles trying weed was #1 for me, “high holidays”. For me, #2 and #3 were “the doctor is out” (Frasier getting involved with Patrick Stewart) and “out with dad” (where Martin and Frasier go to the opera). The misunderstandings are what does it for me!

Edit: But I'd honestly say that my whole Top 50 or so is great episodes, there's not a big gap between the top and any of them. It was hard ranking them all.

Also, my list is at: https://tvsort.com/show/3452/matrix_01hjtxz2e1ewkrh44ja3mz0s...


All very great choices! “Dog army? What do you think that means?” is an inside joke with me and my best friend from “High Holidays”.

I also agree that the ranking is hard. A remarkable thing about the show is how it’s consistent throughout its run and doesn’t really fall off in quality despite going for over a decade.


Frasier's "dear god" when he first sees goth Freddy as well is just terrific.

"Well, thank you Lilith, for mentioning this little development!"


I remember cackling so loudly when I first watched that episode and he opened the door to the image of goth Freddy


I like pubmeeple for ranking board games. You can import your BGG collection and it simply does pairwise comparison and presto, you get a top X list sorted. Simple but works very well. I think they also support TV shows and the like but maybe you have to input your own list.


I've just deployed a change that randomises selections - hopefully it helps address the biggest concern raised here so far. I'm curious to find out how people feel about it compared to the previous way it was doing things.


Trying to rank the IT Crowd right now and the episodes keep refreshing before I click anything - https://tvsort.com/show/2490/matrix_01hk2y2csmeg1tdczs12wbz6...

Am using Firefox with uBlock Origin if that affects anything


Thanks for the heads up! I shouldn't have tried to rush this while this is under heavy use - I've reverted the change, and it should be OK now!


This seems to be somehow broken for me. It auto-skips everything in about one second. If there's some point about a time limit being required it should be 5 seconds at the very minimum.


That's what I get for trying to rush a deployment during this HN period. ;) I've reverted the change, and it should be OK now!


Fyi, on my first try, one of the two episodes was always the same, on my second try, it kept reloading two new episodes before I could make a selection in an infinite loop.


Sorry about that! Had to do with the latest deployment. It’s okay now!


This is neat. See also https://brickelo.com for the same thing but with LEGO minifigures


This reminded me of adaptive comparative judgement. I'd be interested in your algorithm on how you decide how to pair up items.


Thank you for that! Adaptive comparative judgment gives a name to something I've always believed, but never really quite put my finger on; that comparing things one to another is more reliable than random 1-10 ratings.

As for the algorithm, it's a basic Quicksort, building on the work of Leonid Shevtsov[0].

[0]: https://leonid.shevtsov.me/post/a-human-driven-sort-algorith...


I think merge sort would provide a better experience.

Quicksort can be great for human-comparison sorting if you let the user pick the pivot, and if you have a direct-manipulation interface for dividing a big pile into two smaller ones. Humans are great at scanning large numbers of objects, and can split piles much faster than operating one by one.


You are quite right. I had already been thinking about merge sort because it’s guaranteed to lead to fewer comparisons, but what you said about piles would work great when combined with showing more episodes at once, asking the user “which of these 5 episodes is better” and getting those comparisons out of the way all at once.
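For reference, a merge sort driven by a human comparison is only a few lines. This Python sketch is illustrative rather than the site's code; `better(a, b)` is a hypothetical callback that returns True when the user prefers `a`.

```python
def merge_sort(items, better):
    """Return items ordered best-first, asking better(a, b) for each comparison."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], better)
    right = merge_sort(items[mid:], better)

    merged = []
    i = j = 0
    # Repeatedly compare the current best of each sorted half.
    while i < len(left) and j < len(right):
        if better(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Each question compares the current heads of two already sorted runs, so both sides of the comparison change as the merge progresses instead of one episode staying fixed as a pivot.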


Very cool!

How are you getting/using IMDB data? I thought the API was locked down (?)


I wrote a blog post about it[0], but basically, I use the TMDB API to get all shows and episodes, which is free and has a generous rate limit.

For the episode descriptions, I grab the plot summaries from IMDB and Wikipedia, with just HTML scraping, no APIs, and feed them to an LLM to get the 3 main plot points, so you don't have to read a bunch of rambling text when trying to quickly assess the episode you're looking at.

[0]: https://pocketarc.com/posts/tv-sort-engineering-the-ultimate...


I think it's much better to do this for show vs show imo


Wanted to say the same. I watch a lot of TV, but can barely recall particular episodes by name or even some screengrabs. But I'd be curious to see how my taste in shows themselves compares to other people's


Definitely an idea for the future, to expand this beyond just TV episodes!



