Finding the real first post on Instagram (birdeatsbug.com)
210 points by xwenf 10 days ago | 63 comments





Instagram uses basic base62 with a very standard alphabet

  ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_
They started using hashing functions from Facebook after they got acquired [0]. This function is used across all of FB, and its main benefit is that it's time-sensitive: the resulting IDs sort by creation time.

The first few million IDs are just autoincrements; after that they switched to the custom scheme. If you are bored, you can dig through who used Instagram in the early days.

BTW base(62/63/64) is a very popular way to encode IDs. Some sites use a much more custom alphabet. My favorite was Vine, which used

  BuzaW7ZmKAqbhMOei5J1nvr6gXHwdpDjITtFUPxQ20E9VY3Ll

[0] https://instagram-engineering.com/sharding-ids-at-instagram-...

> Instagram uses basic base62 with a very standard alphabet

> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_

This is base64. Base62 would be without the "-_" (26 upper case letters + 26 lower case letters + 10 digits = 62, if you include two more characters then it's base64, regardless if the characters are url-safe like -_ or /+ like in the standard base64 encoding).

EDIT:

Also worth noting that with things like base64/base62/etc. you can either encode a byte array into a string of known characters, or if you deal with integer values only then you can compress it much more by converting from a base10 number to a base62 (or similar) number.

For example, "1000" base64 encoded yields "MTAwMA==", but if you convert the base10 number 1000 into a base62 number then you get "G8". It's like converting between base2 (binary) and base10 (dec), or base16 (hex) and base10 (dec).
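
The two approaches can be compared in a few lines of Python. This is only a sketch: the digit order `0-9A-Za-z` is an assumption (base62 has no single standard alphabet), chosen because it reproduces the "G8" result quoted above, and `int_to_base62` is a name made up for the example.

```python
import base64
import string

# Assumed digit order (0-9, A-Z, a-z); other base62 alphabets exist.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def int_to_base62(n: int) -> str:
    """Convert a non-negative integer to a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


# Encoding the *text* "1000" as bytes versus converting the *number* 1000.
as_bytes = base64.b64encode(b"1000").decode("ascii")
as_number = int_to_base62(1000)
print(as_bytes, as_number)
```

The byte encoding grows with the number of decimal characters, while the integer conversion grows with the magnitude of the number, which is why it ends up shorter for IDs.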


My shameful confession is that I never counted the characters. I was mentally looking for = and + and subtracted them from the number.

Thank you for correcting me


It's "/", not "=", right? I think "=" is just for padding at the end

Correct. I meant that since I didn't see it in the alphabet, I decided it wasn't base64. Don't follow my path of foolishness.

protip: browser console + "string".length :)

What do you like about the Vine ID alphabet? I don't see the appeal, it cuts out some characters yet still includes potentially confusing ones (I/l/1 and O/0, which can look identical or nearly identical in some fonts). If not human readability, why did they cut out standard characters thus making IDs longer than necessary? I suspect there is a perfectly reasonable explanation, but I'm missing it.

My guess is that they used this as security through obscurity [0]. They hoped that nobody would be able to reverse the alphabet without getting their hands on real IDs. Because we [1] crawl every video/audio site, we had all the links for Vine. I just took 1,000 consecutive IDs (based on creation time) and then worked backwards to the alphabet. Took maybe 10 minutes to write the code.

[0] https://en.wikipedia.org/wiki/Security_through_obscurity

[1] https://pex.com
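
The commenter's code isn't shown, so the following is only a sketch of how such a recovery could work. It assumes the IDs are plain positional encodings of strictly consecutive integers, which may not match what Vine actually did. The idea: the last character steps through the alphabet in order, and the moment the leading characters change (a carry) marks the zero digit. The helper names `encode` and `recover_alphabet` are hypothetical.

```python
# The alphabet quoted earlier in the thread, used here only to build test data.
VINE_ALPHABET = "BuzaW7ZmKAqbhMOei5J1nvr6gXHwdpDjITtFUPxQ20E9VY3Ll"


def encode(n: int, alphabet: str) -> str:
    """Positional encoding of a non-negative integer in a custom alphabet."""
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    out = []
    while n:
        n, rem = divmod(n, base)
        out.append(alphabet[rem])
    return "".join(reversed(out))


def recover_alphabet(ids: list[str]) -> str:
    """Recover the digit order from encodings of consecutive integers.

    Returns a partial alphabet if fewer than one full cycle follows the
    first observed carry.
    """
    start = None
    for i in range(1, len(ids)):
        # A change in everything but the last character means a carry,
        # so the last character at that point is the zero digit.
        if ids[i][:-1] != ids[i - 1][:-1]:
            start = i
            break
    if start is None:
        raise ValueError("no carry observed; need more consecutive IDs")
    recovered = []
    for encoded in ids[start:]:
        last = encoded[-1]
        if recovered and last == recovered[0]:
            break  # wrapped around: one full cycle has been collected
        recovered.append(last)
    return "".join(recovered)


# Simulate 1,000 consecutive IDs and work backwards to the alphabet.
sample = [encode(n, VINE_ALPHABET) for n in range(5000, 6000)]
print(recover_alphabet(sample))
```

With gaps in the sequence (deleted or unseen IDs) the ordering would have to be inferred from many overlapping runs instead of one clean cycle.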


Since it's not listed on the site, what are the costs for the services pex offers?

Depends on the service. We are currently closed to only certain customers, although our attribution engine [0] is completely free to _all_ rightsholders.

[0] https://blog.pex.com/introducing-the-attribution-engine-19ec...


To me the randomized characters are a clever way of hiding "autoincrement" from users, so it would be non-trivial to calculate the number being produced daily.

Picking base 49 (a seemingly random number) seems to be perhaps a similar security-through-obscurity thing?

I've written code to produce random (non-sequential) IDs for storage in a database, but it's surprisingly non-trivial to produce a performant INSERT that will work without ever producing collisions.

Relying on the database's AUTOINCREMENT but obfuscating the value through a baseXX encoding seems actually much easier, when the only "security" you're trying to provide is against reporters and business analysts trying to estimate the site's usage/popularity, and they're quite unlikely to bother to reverse-engineer your encoding scheme.


Can vouch for the non-triviality of collisions. We have around 500k inserts a second and wanted to have time sensitivity between them (so essentially basing it on a timestamp) and fit it into a 64-bit unsigned integer.

We managed to get a safe space of around 50M per second, so we could safely use shard_id to prevent collisions.


>Relying on the database's AUTOINCREMENT but obfuscating the value through a baseXX encoding seems actually much easier

I don't buy this explanation. If you really cared about security, you'd encrypt the number (preferably through a block cipher). Using a weird encoding system only provides marginal security, as evidenced by this post.
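
To make the block-cipher idea concrete: a keyed permutation over 64-bit integers maps sequential IDs to values that look unrelated but still decode uniquely, so there are no collisions to handle. The sketch below is a toy Feistel network built on HMAC-SHA256 from the standard library. It is for illustration only; the key, the round count, and the function names are assumptions, and a real system should use a vetted cipher or format-preserving encryption scheme instead.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: 32 bits in, 32 bits out."""
    msg = bytes([round_no]) + half.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def encrypt_id(n: int, key: bytes, rounds: int = 4) -> int:
    """Map a 64-bit integer to another 64-bit integer (a permutation)."""
    if not 0 <= n < 1 << 64:
        raise ValueError("n must fit in 64 bits")
    left, right = n >> 32, n & MASK32
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right)
    return (left << 32) | right


def decrypt_id(c: int, key: bytes, rounds: int = 4) -> int:
    """Invert encrypt_id by running the rounds backwards."""
    left, right = c >> 32, c & MASK32
    for r in reversed(range(rounds)):
        left, right = right ^ _round(key, r, left), left
    return (left << 32) | right


key = b"example-key-not-for-production"  # hypothetical key
public_ids = [encrypt_id(n, key) for n in range(1, 6)]
print(public_ids)
```

The encrypted integer can then be passed through any baseXX encoding to get a short URL-safe string.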


"This encoding may be referred to as "base64url"."

https://tools.ietf.org/rfc/rfc4648.txt


The ID generation code doesn’t use hashing and has nothing to do with Facebook. The blog post was published afterwards but the sharding code predated the acquisition.

Maybe I wasn’t clear. The integers that enter the base62 used to be autoincrements. After time they became what was published in the blog post.

If “timestamp” function is not from FB or not inspired by, then it must be a massive coincidence because FB’s IDs are almost exactly the same. They just don’t hash them with base62.
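
For reference, a sketch of the layout in the linked Instagram post [0], as far as I recall it: 41 bits of milliseconds since a custom epoch, 13 bits of logical shard ID, and 10 bits of a per-shard sequence, packed into one 64-bit integer. No hashing is involved, just bit packing. The epoch constant and the function names below are assumptions for illustration.

```python
# Bit widths as recalled from the blog post; treat them as assumptions.
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10

# Custom epoch in milliseconds (assumed value, used only for illustration).
EPOCH_MS = 1_314_220_021_721


def make_id(now_ms: int, shard_id: int, seq: int) -> int:
    """Pack timestamp, shard, and sequence into a time-sortable 64-bit ID."""
    elapsed = now_ms - EPOCH_MS
    if not 0 <= elapsed < 1 << TIME_BITS:
        raise ValueError("timestamp outside the representable range")
    shard = shard_id % (1 << SHARD_BITS)
    sequence = seq % (1 << SEQ_BITS)
    return (elapsed << (SHARD_BITS + SEQ_BITS)) | (shard << SEQ_BITS) | sequence


def split_id(id_: int) -> tuple[int, int, int]:
    """Unpack an ID into (timestamp in ms, shard, sequence)."""
    sequence = id_ & ((1 << SEQ_BITS) - 1)
    shard = (id_ >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = id_ >> (SHARD_BITS + SEQ_BITS)
    return elapsed + EPOCH_MS, shard, sequence


earlier = make_id(EPOCH_MS + 1000, shard_id=5, seq=7)
later = make_id(EPOCH_MS + 1001, shard_id=1, seq=0)
print(earlier < later)  # the timestamp occupies the high bits, so IDs sort by time
```

Because the timestamp sits in the most significant bits, numeric order matches creation order across shards to millisecond precision.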


FBIDs are complicated because the number space includes legacy IDs, but the modern space has a 24-bit shard ID prefix and a 38-bit autoincrement identifier. Two of the bits are “thrown away.” Timestamps aren’t involved.

EDIT: ugh. Still early for us slackers on the West coast to do basic math.


Fair enough. They do appear to be very similar in structure and they do follow some logic in correlation to timestamp. However you seem to be more knowledgeable about it and so I will add this to my mental knowledge book.

Thank you


Those timestamps are likely from the image metadata; there's no reason to believe that the images are out of order from their numeric/shortcode ids.

What the author refers to as the "real post #1" has a taken_at_timestamp less than the first picture. It's still id #6 in the database versus #2 (the dog picture). Numeric ids are easily visible in the source of the page or by using this not-so-secret query string: https://www.instagram.com/p/C/?__a=1
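
The mapping between a shortcode and a numeric id can be sketched as follows, assuming the shortcode is a plain positional encoding in the 64-character alphabet quoted at the top of the thread. Under that assumption `C` decodes to 2 and `G` to 6. The function names are made up for the example.

```python
# Alphabet as quoted earlier in the thread (64 characters, URL-safe).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"


def shortcode_to_id(code: str) -> int:
    """Decode a shortcode to an integer, most significant character first."""
    n = 0
    for ch in code:
        n = n * 64 + ALPHABET.index(ch)  # raises ValueError on unknown characters
    return n


def id_to_shortcode(n: int) -> str:
    """Encode a non-negative integer as a shortcode."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 64)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


print(shortcode_to_id("C"), shortcode_to_id("G"), id_to_shortcode(64))
```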


Seems far more likely the timestamp is incorrect.

When writing a feed-based app, timestamps are one of the first things you inevitably screw up (read it from local vs. the server, pull it from exif and then decide to pull it from upload date, forget a timezone, fix a local time on the server that was out of sync, update the timezone on edit vs on creation, etc etc etc).


I'm not so sure about that; it's also fairly common for a database migration in the early days of a product to shuffle row IDs.

This is the reason my reddit user number is lower than spez and kn0thing's.


True, but considering the Instagram team has claimed that the dog photo was the first photo uploaded [1], I'm just not convinced the timestamp proves otherwise on its own because of how easy it is to make timestamp errors in the early days.

We may never know for sure...

[1] https://web.archive.org/web/20180202095552/http://blog.insta....


I have my doubts about the timestamps on these first posts. How did Kevin get from Pier 38, where Mike supposedly photographed him at exactly noon, to Todos Santos at ~2:30pm, when it's 28 hours away?

A couple of things:

- the 'noon' photo is taken at night

- the '10:26am' photo shows shadows on the windowsill from the south and west - Pier 38 runs almost west->east from shore. (something on the shore reflecting light? I'm assuming the stronger shadow is from the sun). Popping the date and time into a shadow calculator shows the sun would have been positioned above the south east corner of Pier 40 from that window - which couldn't have cast either shadow. I used http://shadowcalculator.eu/#/lat/37.78196555892351/lng/-122.... to check. The 10:26 photo does look more like it was taken around noon.

So, as others have said, these are upload times not when the photos were taken.


If it uses the timestamp of the upload rather than the timestamp of the picture, then it is possible. However, I don't remember if Instagram could use pictures from the camera roll in the first version.

I don't think this was possible at the time. I thought instagram was restricted to only posting right then and there, and camera roll access came at a wayyy later date.

The timestamps are from when the photos were posted, not taken.

I heard they were also working on a teleportation prototype at the time Instagram took off.

Sadly, the teleportation prototype actually duplicated the user, forcing the originals to have to kill themselves every time the machine was used to avoid any confusion.

Tesla tried to warn them that exact science is not always an exact science.

Wait a second, is the "copy" the same person, or a whole new person being created?

I'm not sure there's an objective answer to that, so I suggest consulting with your local philosopher.

Any programmer will tell you the copy is just a second instance of the first person's type. It exists separately in spacetime (analogous to memory space). :)

Dark! And isn't there a Netflix show about that concept?

Not sure about a Netflix show, but it's the basic plot of The Prestige.

MAJOR SPOILER ALERT

.

.

.

.

.

.

.

.

It's from a fantastic movie that would be completely ruined if I told you the name; so refer to the comment below if you don't mind the huge spoiler...


Living with Yourself.

Facebook kept it a lot simpler: you can just iterate through integers. 1 - 3 have been deleted but if you navigate to https://www.facebook.com/4 it redirects to https://www.facebook.com/zuck and so on. All the original founders are sub 10.

'reverse-engineered' is a bit grandiose isn't it?

the definition on Wikipedia fits here. reverse engineering isn't always some clandestine thing

> Reverse engineering, also called back engineering, is the process by which a man-made object is deconstructed to reveal its designs, architecture, or to extract knowledge from the object; similar to scientific research, the only difference being that scientific research is about a natural phenomenon


Sure, it fits, but I just think it's over-selling it some.

As an aside, I just reverse-engineered HN and found the first ever submission:

https://news.ycombinator.com/item?id=1


Yeah, you basically reverse-engineered something there. Good job!

For the average programmer, it might not seem like much. But for a normal user, seeing something and figuring out how it works, without knowing the internals beforehand, can be said to be reverse-engineering.

I guess we should start rating reverse-engineering so people can know what to expect.


I think he just demonstrated an insecure direct object reference and didn't reverse-engineer anything. That goes for both the parent comment and the post.

I'm working on a bug right now where we have some timestamps that are wrong. Just saying.

Can you tell us more about the bug?

Turned out our backend team just had a graphql type def that was wrong. That resulted in a nullified value, and the server logic following that treats a null value for the timestamp to mean the task should be completed now. None of this was reflected in the UI however.

So all in all, my bug wasn't that exciting, but if you think about the early chaotic days of a startup, there are often nuances in dealing with date and time that don't get thought through, particularly with newer developers being pushed or pushing themselves to get things launched. For example, if you are doing all of your testing locally, you might not find out until you get a user or customer in another time zone that, hey, we should have thought about that.

Bottom line is, and I think this is really what I was thinking this morning -- I would never sink so much time into drawing conclusions with the early data of a startup like this. Life is too short. If I am going to spend that much time on something it's going to need to be a less risky investment.


Not OP, but I work on an app that collects network data from various customer systems, and have run into lots of 'fun' time-related bugs. An instance of the app runs on a machine on the customer network (that we don't control), and this is very often "the one server" that particular customer has.

One thing I've been shocked by is how wrong the clocks often are -- to the point that our software tracks the offset from real time (our server) and adjusts all collected timestamps. It's often a minute or two off (which means the customer is not using NTP sync), but many times it's been several days or even years off. One of the things that led us to adding the time adjustment was a bug report that was initially something like "The UI says 'Data last updated in 5 years' but it was really a few minutes ago" -- that was a result of accepting data as-is from a server with a clock set 5 years in the future.
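
The adjustment described here amounts to measuring the difference between the two clocks once and adding it to every collected timestamp. A minimal sketch, ignoring network latency and assuming the offset stays constant between measurements (the names are hypothetical, not taken from the commenter's software):

```python
from datetime import datetime, timedelta, timezone


def clock_offset(server_now: datetime, client_now: datetime) -> timedelta:
    """How far the client clock is from the trusted server clock."""
    return server_now - client_now


def adjust(collected_at: datetime, offset: timedelta) -> datetime:
    """Shift a client-side timestamp onto the server's timeline."""
    return collected_at + offset


# A client whose clock is set roughly five years in the future.
server_now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
client_now = server_now + timedelta(days=5 * 365)

offset = clock_offset(server_now, client_now)
event_on_client = client_now - timedelta(minutes=3)
print(adjust(event_on_client, offset))  # three minutes before server_now
```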

Another fun bug that sticks out in my head was caused by a system that was sending time strings using custom formatting, where the original developer either accidentally specified hours as 12-hour (instead of 24-hour) format or forgot the "AM/PM" (I am not sure which). On the receiving end, a fairly forgiving parsing method was being used and because there was no "AM/PM" it was being read as if it was 24-hour format, so what was really "7PM" was being parsed as "7AM". Worse, this wasn't even that obvious as a problem, because data naturally followed business hours (e.g., a <12-hour window of active time per day, usually without overlapping) and was being collected from many time zones. It was only visible if you really dug into the data, knew what was expected from the source data, and checked using data collected in the afternoon of the client timezone.
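
This class of bug is easy to reproduce with Python's `strftime`/`strptime`. The format strings below are stand-ins for whatever the original system used, so treat this as an illustration of the failure mode, not a reconstruction of that code.

```python
from datetime import datetime

sent_at = datetime(2020, 1, 1, 19, 30)  # 7:30 PM

# Sender bug: %I is the 12-hour clock, and no AM/PM marker is emitted.
wire = sent_at.strftime("%I:%M")

# Receiver: happily parses the same text as 24-hour time.
parsed = datetime.strptime(wire, "%H:%M")

# One fix on the sending side: use the 24-hour directive instead.
wire_fixed = sent_at.strftime("%H:%M")
parsed_fixed = datetime.strptime(wire_fixed, "%H:%M")

print(wire, parsed.hour, wire_fixed, parsed_fixed.hour)
```

Nothing raises an error in the buggy path, which is why it only shows up when someone compares afternoon data against the source.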


> The users of the first app didn't really get what it was for.

Whenever I'm venturing on an idea, I have a pretty clear goal and a pretty clear intent of use from a user's perspective. However despite that, I'm always asked by people (who, by the way, I often figure would be the target audience for this new thing I'm building), "why?" or "what's the point?" or just blatantly "no one's going to use this over [facebook/twitter/etc/etc/etc]"

I remember when first diving into twitter 11 years ago. I remember diving into Facebook when it cracked open to the general public. I remember telling people about these platforms, and the response always is "why?" or "I don't understand what this is for or why I would use it"

I keep that in mind every time I'm asked the same question about my own projects. Not everything is going to be a massive hit, of course. But you never know when one of these wild ideas becomes successful, and it will usually be something that people didn't know they wanted until you showed it to them.


> you will be shocked how poorly they look and how far we have come in terms of the quality of Instagram content.

pics of photoshopped faces and body is quality content of instagram


and not just instagram

More interesting: Why did Instagram instantly get a million users upon public release in the AppStore? How did people discover it?

We didn't? The initial Android release did well. It wasn't released until April 2012, iOS launched October 2010.

The photos were square, and there were filters. Those features made things look cute.

Also getting a photo from your phone camera lens to the internet was often a multi-step process at that time. Instagram was two clicks and on your feed.

And the sharing model (all photos pushed to every follower) made people try a little harder on the images. That made it popular among amateur photographers. The profile model (grid of photos) also encouraged some level of vanity.

People discovered it by being invited by friends who like cataloging pretty things.


There simply wasn't a way to quickly share photos from your phone. Previously the best option was to mobile upload to facebook which was pretty common. Instagram made that process easier and also added the ability to quickly add "retro" looking filters/frames. It quickly became a way to see what your network was up to at that instant.

> Post #2. Mike captures how Kevin's working on Instagram. The place is the same - Dogpatch labs, which later relocated from San Francisco to Dublin.

Dogpatch Labs was not later relocated to Dublin. Both Dublin and San Francisco were open during the same time (as well as NYC). San Francisco and NYC were closed.


> But what happens if after "/p/" you put A, B, C and so on? It appears that you can find the first 28 posts of the social network.

Maybe I misunderstood the phrase, but how can you get 28 posts from 26 letters? the article doesn't mention lowercase letters


Maybe after Z it goes to 0 or 1?

The problem is that he says A to Z, not A to Z, 0, 1

Looking at the first post Instagram has tremendously benefited/taken advantage of improved mobile photo resolution during their run

My takeaway is that IG was always built to be a performance-based social media app. As in, the user puts on a performance for people to see.

Compare that to HN/Reddit/Blogs before and you can basically pinpoint when and how social media changed the way we connect.


I mean, aren't all social interactions performative at some level?

are you performing when you socialize with people?

Kudos ;)


