Technical Feasibility of Building Hitchhiker's Guide to the Galaxy (memkite.com)
54 points by amund on Apr 1, 2014 | 35 comments

All of Netflix, in HD, on 24TB.

The thought of putting most movies in my pocket for a few hundred bucks is ... stunning.

Netflix could solve the bandwidth problem by just mailing the whole library to customers. "Never underestimate the bandwidth of a station wagon full of magnetic tapes."

From my quick look around, somewhere close to $1000 US for the drives. But it's easy to imagine there would be a day doing what you say would be totally feasible. They should already be offering an option to purchase movies at the store on USB drives as it is.

I'm hopeful in that I'm looking forward to the day that shipping 24TB of data by package delivery would be considered too slow.

It's pretty amazing how much better things could be, if we could pry the content industry away from outdated business models.

Well, consider what went into assembling that 24TB of content, and follow it from there.

Each movie cost something like, well, let's just ballpark $10M average (the >$200M blockbusters are relatively rare, balanced by old & cheap filler content made down around $1M each). Lead article notes Netflix has 8900 movies. Round both up sensibly, and we're looking at $100B to create the content.

Netflix has some 44M subscribers. Assuming all that content was made for & paid by Netflix subscribers, that's $2272 per subscriber. 8-O

Talmand SWAGged the storage hardware cost at $1000. Storage prices being what they are, that will be $300 in a year or so.

Would you pay $2500 to own, in a box the size of a pack of cards, the entire Netflix library? With updates (package deal, you get 'em all as they're made to cover production costs) of every movie added thereafter for $0.25 each? Could we persuade the entire 44M Netflix subscribers to sign up?

Not sure what to do with those numbers, but they're fascinating.
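For anyone who wants to poke at those numbers, here is a rough sketch of the arithmetic above. The inputs are the ballpark figures from the comment; rounding the total up to the next $100B is my assumption about what "round both up sensibly" meant.

```python
# Ballpark inputs taken from the comment above.
AVG_COST_PER_MOVIE = 10_000_000   # "$10M average"
MOVIE_COUNT = 8_900               # titles cited from the lead article
SUBSCRIBERS = 44_000_000          # "some 44M subscribers"


def round_up_to(value: int, step: int) -> int:
    """Round value up to the next multiple of step."""
    return -(-value // step) * step


raw_total = AVG_COST_PER_MOVIE * MOVIE_COUNT
# Assumption: "round up sensibly" means up to the next $100B.
rounded_total = round_up_to(raw_total, 100_000_000_000)
per_subscriber = rounded_total / SUBSCRIBERS

print(f"raw total:      ${raw_total:,}")
print(f"rounded total:  ${rounded_total:,}")
print(f"per subscriber: ${per_subscriber:,.2f}")
```

Changing any one input (say, a $5M average) moves the per-subscriber figure proportionally, which is about all this kind of estimate can tell you.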

That's only tempting because of the high resolution. I'm not sure if it's Comcast being intentionally awful or just incompetent, but I can't seem to stream HD anything on Netflix. I turn on manual buffering and HD options just aren't there. If that problem didn't exist, there would be basically no benefit in paying the large up-front cost.

Now, if production costs were completely covered by subscribers, it would be interesting to see what sort of movies get made.

Pointless factoid: that works out to about $65 per movie frame.
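A quick check of that factoid, as a sketch. The $100B total and 8900 movies come from the comment upthread; the 2 hours per movie and 24 frames per second are assumptions I picked because they reproduce the figure.

```python
TOTAL_COST = 100_000_000_000   # "$100B to create the content"
MOVIE_COUNT = 8_900            # movies cited upthread
HOURS_PER_MOVIE = 2            # assumption: average running time
FRAMES_PER_SECOND = 24         # assumption: standard film frame rate

frames_per_movie = HOURS_PER_MOVIE * 3600 * FRAMES_PER_SECOND
total_frames = MOVIE_COUNT * frames_per_movie
cost_per_frame = TOTAL_COST / total_frames

print(f"frames per movie: {frames_per_movie:,}")
print(f"total frames:     {total_frames:,}")
print(f"cost per frame:   ${cost_per_frame:.2f}")
```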

People would be concerned about piracy. One solution would be to encrypt it and then use the internet connection to send decryption codes. And really you don't even need to do that - it's not like pirated copies of the movies don't already exist.

They did send the whole library to their CDN nodes all over the world.

Some information has value that remains relatively constant over time: historical records, literature. Some information has value that decays slowly: basic materials in the sciences, large bodies of well-established measurements.

Some information has value that decays fairly quickly: current scientific progress that invalidates older measurements or has better predictions than older theories, current events, prices of common items.

And some information has value that decays extremely quickly: weather, financial markets, prices for things that you want to buy or sell now, casual interactions with other people.

If you build a system that can connect fast enough to the Internet in terms of latency and throughput, you don't need much local caching. That's what we have in current smartphones.

Without a Sub-Etha network, having a large local cache becomes increasingly important the farther away you are. But without a mission profile to plan for, estimating things in terms of current storage technology is as useless as coming up with a security plan without a threat model.

How about the mission to Mars? What repository of data should they take with them and what should they pull across space at analogue modem speed?

If you plan on using it just on Earth, this becomes relevant: https://xkcd.com/548/ ;).

The Netflix numbers are definitely wrong. Of the 10434 titles on there 3687 are TV Shows, or more specifically seasons of TV shows.

Assuming 2 hours/movie and 15 hours/TV show season, the number rolls around to 72,486 hours of content. Let's round that to 75,000 hours for convenience's sake.

Assuming an average bitrate of 4Mbps, it comes to about 135 TB.

Which is why the Netflix Open Connect box works so well; Netflix can dump their entire catalog onto 2 of those boxes (well, it will take a few more since Netflix will be caching multiple versions of the files at different bitrates, but I wouldn't be surprised if the entire catalog in all bitrates can fit into 1 rack).

Source for numbers: http://instantwatcher.com/
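The size estimate above can be reproduced with a few lines. This is only a sketch: it uses decimal units (1 TB = 10^12 bytes) and treats the 4 Mbps average as constant. Under one reading of the stated assumptions (everything that is not a TV season counts as a 2-hour movie) the hour count comes out nearer 68,800 than the quoted 72,486, so the original tally may have counted titles differently; the rounded 75,000-hour figure does give 135 TB.

```python
def catalog_size_tb(hours: float, bitrate_mbps: float) -> float:
    """Size in decimal terabytes of `hours` of video at a constant bitrate."""
    bits = hours * 3600 * bitrate_mbps * 1_000_000
    return bits / 8 / 1e12


TOTAL_TITLES = 10_434
TV_SEASONS = 3_687
# Assumption: every title that is not a TV season is a movie.
MOVIES = TOTAL_TITLES - TV_SEASONS

hours_from_assumptions = MOVIES * 2 + TV_SEASONS * 15
size_from_assumptions = catalog_size_tb(hours_from_assumptions, 4)
size_rounded = catalog_size_tb(75_000, 4)

print(f"hours from stated assumptions: {hours_from_assumptions:,}")
print(f"size at 4 Mbps:                {size_from_assumptions:.1f} TB")
print(f"size for 75,000 hours:         {size_rounded:.1f} TB")
```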

The creators of 'Elite' had this problem way back in the 1980s when they had to squeeze 8 galaxies complete with planet data into 32K of RAM. They solved the problem by 'procedurally generating' all of the required data from a seed number. Therefore, it isn't a question of cutting and pasting from Wikipedia etc., it is more a matter of getting the right 'seed number' for Earth, updating the procedure for generating content and that should be it.

Anyway, here is something cut 'n' pasted from Wikipedia:

The Elite universe contains eight galaxies, each with 256 planets to explore. Due to the limited capabilities of 8-bit computers, these worlds are procedurally generated. A single seed number is run through a fixed algorithm the appropriate number of times and creates a sequence of numbers determining each planet's complete composition (position in the galaxy, prices of commodities, and even name and local details— text strings are chosen numerically from a lookup table and assembled to produce unique descriptions for each planet). This means that no extra memory is needed to store the characteristics of each planet, yet each is unique and has fixed properties. Each galaxy is also procedurally generated from the first.

However, the use of procedural generation created a few problems. There are a number of poorly located systems that can be reached only by galactic hyperspace— these are more than 7 light years from their nearest neighbour, thus trapping the traveller. Braben and Bell also checked that none of the system names were profane - removing an entire galaxy after finding a planet named "Arse".[9]

[9] Procedurally generated by unicorns.
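To make the seed idea concrete, here is a toy sketch of procedural generation. To be clear, this is not Elite's actual algorithm: the linear congruential generator, the syllable table and the `Planet` fields are all made up for illustration. The only point is that the same seed always regenerates the same 256 planets, so nothing but the seed needs to be stored.

```python
from typing import Iterator, List, NamedTuple

# Hypothetical lookup table; names are assembled from these fragments.
SYLLABLES = ["la", "ve", "ti", "re", "di", "so", "an", "us",
             "ce", "on", "qu", "ar", "is", "ge", "ra", "en"]


class Planet(NamedTuple):
    name: str
    x: int
    y: int
    economy: int


def lcg(seed: int) -> Iterator[int]:
    """Yield an endless, deterministic stream of 16-bit values from a seed."""
    state = seed & 0xFFFFFFFF
    while True:
        state = (state * 1664525 + 1013904223) & 0xFFFFFFFF
        yield state >> 16


def generate_galaxy(seed: int, planet_count: int = 256) -> List[Planet]:
    """Rebuild every planet in a galaxy from nothing but the seed."""
    stream = lcg(seed)
    planets = []
    for _ in range(planet_count):
        syllable_count = 2 + next(stream) % 3  # names have 2 to 4 syllables
        name = "".join(SYLLABLES[next(stream) % len(SYLLABLES)]
                       for _ in range(syllable_count))
        x = next(stream) % 256
        y = next(stream) % 256
        economy = next(stream) % 8
        planets.append(Planet(name.capitalize(), x, y, economy))
    return planets


galaxy = generate_galaxy(42)
print(len(galaxy), "planets; the first three:")
for planet in galaxy[:3]:
    print(planet)
```

Calling `generate_galaxy(42)` twice returns identical lists, while a different seed gives a different galaxy. As the quoted passage notes, the catch is that you get whatever the procedure produces, rude names included.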

It's not possible to compress more than a certain amount. E.g. you can't store every possible combination of 12 bits with only 8 bits. It's trivial to expand 8 bits into 20 bits following some procedure, but it's literally impossible to do the reverse.
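A small sketch of why that is (the pigeonhole principle): there are 4096 possible 12-bit values and only 256 possible 8-bit values, so any mapping from the former to the latter must send at least 16 inputs to the same output. The two "compressors" below are arbitrary examples I made up; no choice of function escapes the bound.

```python
from collections import Counter
from typing import Callable


def worst_bucket(compress: Callable[[int], int], input_bits: int = 12) -> int:
    """Return how many distinct inputs share the most crowded output value."""
    counts = Counter(compress(value) for value in range(2 ** input_bits))
    return max(counts.values())


def keep_low_8_bits(value: int) -> int:
    """Example 12-to-8-bit mapping: throw away the top four bits."""
    return value & 0xFF


def xor_fold(value: int) -> int:
    """Example 12-to-8-bit mapping: fold the high bits into the low ones."""
    return (value ^ (value >> 4)) & 0xFF


lower_bound = 2 ** 12 // 2 ** 8  # at least this many inputs must collide
for mapping in (keep_low_8_bits, xor_fold):
    print(mapping.__name__, "worst bucket:", worst_bucket(mapping),
          "lower bound:", lower_bound)
```

Procedural generation does not contradict this: a seed can only ever expand into as many distinct worlds as there are seeds.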

So much of what makes the internet useful is the collaborative aspect. Having a copy of StackOverflow at a static point in time is better than nothing, but losing the ability to ask or answer new questions or stay up to date is a significant loss.

If you are a lone (or part of a small group) traveler, the thumb drive Hitchhiker's Guide is probably as good as it gets. If we had a colony on Mars (or more distant) a better solution would be to have a planetary WAN (the MWW?) Central servers could update news, movies, apps and other timely but read-only content over a wide-bandwidth, high latency channel to the WWW (Renamed in 2250 to the Earth Wide Web.) Community sites (like HN and SO) would have to be redesigned for multiple non-interactive WANs. Perhaps it would default to your home WAN for voting and commenting, but you could click a tab to view a read-only, slightly older version of different planets' contributions.

That's all the information it would ideally have on Earth, but then again, there are approximately 8.8 BILLION habitable planets in our galaxy alone, and the more advanced ones oughta have a lot more information to store.

However, about our planet, the HHGTTG had only this to say: "Mostly Harmless". It does have an entry for human beings though, so we actually don't know how much information about us it contains.

Humans: A species that still thinks digital timepieces are a pretty neat idea.

That's probably it.

It is already apparent that the guide is curated from two enormous office buildings, has a host of low-paid field researchers, and culls any information deemed not relevant to the hoopiest, most towel-aware set of galactic travelers.

Which is to say that when people are talking about Netflix, they probably actually mean an extensive catalog of high-definition movies that cater to the prurient interest. And when they say Wikipedia, they mean expert tips on how to find temporary companionship and cocktails from the native populations of unfamiliar planets.

That's a lot harder to put together than a simple catalog of facts and knowledge.

A while back I had some thoughts about a user-generated travel guide, in the spirit of Hitchhikers, built like a wiki or the old IMDB. Has anyone heard of a project like that?

Maybe the problem is that it'd fill up with so much spam from local businesses that it'd be useless. Maybe a good reputation and rating system could make it work, if there were enough participation from neutral parties.

Well, there's Wikivoyage:


A large portion of the information is relatively objective (knowing where the hotels are and their quoted room rates is a big step up from 0, even if it doesn't reveal the bad places).

Here's a guide I saw on /r/openstreetmap:


> "Maybe the problem is that it'd fill up with so much spam from local businesses that it'd be useless."

I think that the problem of spam, in this case, is a problem of search technology, presentation, and identity.

Ideally small business owners should be able to write entire tomes about their shops without degrading user experience. That spam would be available to those who want to read it, and not presented to those who do not. Deletion or culling of spam would therefore be unnecessary.

1. The Wikipedia estimate probably only includes text. All of the media is distributed separately, and was ~200GB as of a few years ago. I can only imagine it's grown since then. Some articles make more sense with images (e.g. a photo of an animal to go with the description).

2. Is the entirety of Twitter really useful as part of a Hitchhiker's Guide to the Galaxy?

As a thought experiment, packing up the whole of a website is a good upper bound. If we were doing this for real, we could further aggressively trim, taking advantage of the fact that content utility tends to follow a power law rather than a uniform distribution. For instance, take Stack Overflow; plink everything with a negative point score. Plink everything with no replies and no votes. Plink everything that has a reply with more than, say, 5x the points (i.e., if there's a 100 point reply and a 4 point reply, plink the four-point reply). Plink entire sites/tags that are unlikely to be useful (no Python software on the mission? No Python interpreter shipped? Remove all Python questions). For Wikipedia, remove all information about localities that have either less than 5KB of text or fewer than 10,000 people. Spend some time hitting the "random" button on Wikipedia and you can get a sense of how much of Wikipedia does not need to be sent into space. Remove vast swathes of the catalog of each species on Earth... I doubt a Martian settler is going to have a compelling reason to suddenly look up a cricket particular to northern Utah. And so on.

You could probably cut all those estimates down by 90%, with very little loss observable to the users. It wouldn't be zero loss, but use the savings to stuff other useful stuff in there instead, such as redundant copies to ensure we don't lose anything.
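The Stack Overflow rules above could be sketched roughly like this. The `Question` and `Answer` classes are hypothetical stand-ins (a real data dump has a different schema), and the 5x threshold is the "say, 5x" from the comment.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Answer:
    score: int


@dataclass
class Question:
    title: str
    score: int
    vote_count: int
    tags: Set[str]
    answers: List[Answer] = field(default_factory=list)


def prune(questions: List[Question], excluded_tags: Set[str],
          dominance: int = 5) -> List[Question]:
    """Apply the trimming rules and return the surviving questions."""
    kept = []
    for q in questions:
        if q.score < 0:
            continue  # negative point score
        if not q.answers and q.vote_count == 0:
            continue  # no replies and no votes
        if q.tags & excluded_tags:
            continue  # tag unlikely to be useful on the mission
        best = max((a.score for a in q.answers), default=0)
        # Drop negative answers and any answer that a sibling beats by
        # more than `dominance` times its score.
        answers = [a for a in q.answers
                   if a.score >= 0 and a.score * dominance >= best]
        kept.append(Question(q.title, q.score, q.vote_count, q.tags, answers))
    return kept


sample = [
    Question("Good", 10, 12, {"rust"}, [Answer(100), Answer(4), Answer(30)]),
    Question("Downvoted", -3, 5, {"rust"}, [Answer(2)]),
    Question("Ignored", 0, 0, {"rust"}, []),
    Question("Python only", 8, 9, {"python"}, [Answer(5)]),
]
kept = prune(sample, excluded_tags={"python"})
for q in kept:
    print(q.title, [a.score for a in q.answers])
```

On the made-up sample only "Good" survives, and its 4-point answer is dropped because the 100-point answer has more than 5x its score.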

I'm making my own offline wiki (fun side project). The EN wiki dump is ~10GB, or 8-9GB if you preprocess the XML, dropping useless tags and wikimarkup:

  falink = {b'Link FA|', b'Link GA|', b'link FA|', b'link GA|', b'Link FL|', b'link FL|'}

Different language links/references: every article has a list of all the language version links (ex: [[af:April]] [[als:April]] ..)

  langcolon = {b'aa:', b'ab:', b'ace:', b'af:', b'ak:', b'aln:', b'als:', b'am:', ................-my:', b'zh-sg:', b'zh-tw:', b'zh-yue:', b'zu:'}

I have a feeling changing wikimarkup to something more size optimized would help even more. Swapping the compression algorithm can squeeze out an additional 1-3GB of savings without sacrificing decompression speed.

All in all, with a lot of tricks you can go from 10GB to 5GB. Doesn't feel worth it. The bigger problem is searching through this data on a 2-4 core ~1GHz ARM processor with 1-2GB of available RAM. Indexing titles is simple and fast, but quite useless. Indexing all useful words would eat at least a couple of gigabytes. Searching for a simple combination of three words would take a couple of seconds every time. It's not like you can haul a Hadoop cluster in your pocket.
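For reference, the kind of word index being discussed is an inverted index: each word maps to the set of articles containing it, and a multi-word query intersects those sets. The sketch below is a minimal in-memory version with made-up sample articles; it says nothing about how the project above stores or compresses its index, and a real one for a full dump would have to live on disk.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of article titles that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title, body in articles.items():
        for word in tokenize(title + " " + body):
            index[word].add(title)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return titles containing every query word (an AND query)."""
    words = tokenize(query)
    if not words:
        return []
    # Intersect the smallest posting sets first to keep the work small.
    postings = sorted((index.get(word, set()) for word in words), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break
    return sorted(result)


# Made-up sample data for illustration.
articles = {
    "Towel": "A towel is the most useful thing an interstellar hitchhiker can carry",
    "Babel fish": "The Babel fish is small, yellow and useful for an interstellar traveller",
    "Earth": "Mostly harmless",
}
index = build_index(articles)
print(search(index, "useful interstellar hitchhiker"))
print(search(index, "useful interstellar"))
```

The memory cost mentioned above comes from the posting sets: common words appear in a large share of articles, which is why dropping stop words and compressing the postings matter on a small device.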

I'm probably not clear enough in the blog post: The intention is to use Twitter as a news source, i.e. crawl and index top URIs (which can be any type of news, blog and other content). The underlying idea is that URIs on tweets give a good sample of _all_ overall knowledge production per day.

There are also, of course, some benefits of having all the interaction happen between you and your device that I haven't talked about in the blog post, e.g. increased privacy (no data collection) and lower latency: a disk seek on a mobile or tablet SSD (about 100 microseconds) is roughly 1000 times faster than accessing 3G or 4G can be (up to hundreds of milliseconds).

One post on Twitter: 140 bytes max. One HD frame on Netflix: 6220800 bytes max.

If throwing the near-totality of human cinematography on H2G2 is a no-brainer due to its manageably small size (a mere 24TB), then may as well throw Twitter on there as well.

It's probably easier to index and search Netflix than it is to index and search all of Twitter on a small device.

Don't forget Unicode. 140 Characters =/= 140 bytes.
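A quick illustration, assuming tweets are stored as UTF-8 (how Twitter actually stores them is not something this sketch claims to know): the same 140 characters can take anywhere from 140 to 560 bytes.

```python
def utf8_size(text: str) -> int:
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


samples = {
    "ASCII letter": "a" * 140,            # 1 byte per character
    "accented e": "\u00e9" * 140,         # 2 bytes per character
    "CJK ideograph": "\u9280" * 140,      # 3 bytes per character
    "rocket emoji": "\U0001F680" * 140,   # 4 bytes per character
}

for label, tweet in samples.items():
    print(f"{label}: {len(tweet)} characters, {utf8_size(tweet)} bytes")
```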

Don't forget compression too. And don't forget most tweets don't use close to the full available tweet data space. A billion tweets takes up a lot less than 140GB.

I'm just looking at rough comparisons. A single 1/24th of a second of Netflix video approximates 44,000 tweets' worth of data. Anticipating growth, let's assume a billion tweets per day (was half-billion last October). 171 days of tweets takes up the same space as the complete Netflix library.
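The rough comparison above, spelled out. The 6,220,800-byte frame matches an uncompressed 1920x1080 frame at 3 bytes per pixel, which is my reading of where that number comes from; the 140 bytes per tweet and 24TB library are the raw figures used in this thread, ignoring the Unicode and compression caveats already raised.

```python
FRAME_BYTES = 1920 * 1080 * 3          # one uncompressed HD frame
TWEET_BYTES = 140                      # raw upper bound used in the thread
TWEETS_PER_DAY = 1_000_000_000         # "let's assume a billion tweets per day"
NETFLIX_LIBRARY_BYTES = 24 * 10 ** 12  # the 24TB figure, decimal terabytes

tweets_per_frame = FRAME_BYTES // TWEET_BYTES
daily_tweet_bytes = TWEETS_PER_DAY * TWEET_BYTES
days_to_match_library = NETFLIX_LIBRARY_BYTES / daily_tweet_bytes

print(f"bytes per frame:       {FRAME_BYTES:,}")
print(f"tweets per frame:      {tweets_per_frame:,}")
print(f"tweet bytes per day:   {daily_tweet_bytes:,}")
print(f"days to fill 24TB:     {days_to_match_library:.1f}")
```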

They throw Netflix in there, so I guess the difference between 10 terabytes and 20 terabytes isn't going to stop the show.

To your second point, I totally agree that an offline internet isn't what I think of as far as a guide.

They separated it into text (20GB) and media (200GB).

But wiki is VERY wasteful when it comes to media. A 2MB SVG file to represent 3 atoms' worth of molecule? Yep, that's Wikipedia for you.

A Kindle with a copy of Wikipedia on it seems close enough to me.

But what about the legal side of things? Last time I checked, Netflix wasn't particularly excited about me trying to scrape all of their video data and making money off of it.

DRM (in concept, but not in practice) means shipping you an impenetrable safe and selling the key separately.

The proposed idea means Netflix would be partnered with the hypothetical Guide and in order to access the encrypted movies and shows you would need to buy a key from Netflix.

Perhaps we should start with an interstellar propulsion system?
