Nano ID: A tiny, secure URL-friendly unique string ID generator for JavaScript (github.com)
78 points by iskin 5 months ago | 78 comments



For years I've been using UUID v4 (generated from a good random source), base64url encoded, resulting in a 22-character identifier.

The nanoid implementation does not really bring anything new to the table except having fewer SLOC (which I personally don't find important). It is also incompatible with https://tools.ietf.org/html/rfc4648#section-5 for no good reason. It is noticeable again and again that devs publishing to npm are not up to snuff; I wish I knew why, as this does not happen with other dynamic languages/their code repos.


> It is noticeable again and again that devs publishing to npm are not up to snuff, I wish I knew why, as this does not happen with other dynamic languages/their code repos.

It absolutely does happen with other languages. It just happens in Node with greater volume because of the greater volume of programmers, low barriers to entry for JavaScript, lack of a rich standard library, the convergence of server focused coders and frontend coders, and lack of consensus around best practices and frameworks for applications.

Kind of the way it's been possible to write decent code in PHP since version 5. Yet the overall ecosystem is still mostly bad.

Lots of need to fashion very basic tools + high volume of inexperience + democracy = A package repo full of nonsense where it's difficult to find the diamonds in the rough.


Agreed. It's usually better to use a standard if there is one. To be specific, it's at:

https://tools.ietf.org/html/rfc4648#page-7

Note that it excludes the use of ~ because it has special meaning in some file systems. Nano ID includes ~. It's a good demonstration of why sometimes a standards committee is better than a lone developer: a group of people and reviewers is more likely to think of everything.


Base64 isn't URL safe; the +, / and padding = characters need to be escaped.


We're talking about base64url, which is URL safe. Links are in the post you were replying to and its parent.
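For the record, the two alphabets differ only in the last two characters and the padding; a quick Node illustration (the `'base64url'` encoding name needs Node 15+):

```javascript
// Bytes chosen so the output hits the characters that differ between encodings.
const buf = Buffer.from([0xfb, 0xef, 0xff]);
console.log(buf.toString('base64'));    // "++//" -- '+' and '/' are unsafe in URLs
console.log(buf.toString('base64url')); // "--__" -- RFC 4648 section 5 alphabet
```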


Wait, why would you base64-encode UUID? You can encode 16 random bytes and not have a few bits spoiled by UUID version number.

But then, there are people who'd like to use a 62-character alphabet, avoiding any special characters. This package allows them to generate such strings.

> devs publishing to npm are not up to snuff

Indeed, there are tons of packages that use Math.random or take modulo without accounting for bias. This is why I'm glad there are new packages like nanoid appearing, created by people who know what they are doing, and this is why your negativity is unfounded.


> why would you base64-encode UUID? You can encode 16 random bytes

Because it's a standard data type. It's supported by the database and the data abstraction layers. I can use other implementations for creating, storing and retrieving UUIDs because they're backed by the standard and are interoperable. I don't get these advantages with your ad-hoc approach. The loss of a few bits does not hurt me.

> this is why your negativity is unfounded

I was criticising the needless deviation from an encoding standard. You talk about how it's good that nanoid picked the correct random algorithm. That just does not fit together.


OK, so you use a non-standard representation of UUIDs because you can convert it to the standard representation for compatibility reasons. Fair enough, it's a good reason to do this if that's what you want. But then you complain that there's a different implementation of generating random identifier strings "not up to snuff", as if we all should settle on old crappy standards and stop creating something new. (One thing I don't understand with many developers is that they insist on doing something one way, thinking that their way is the best one and there should not exist other ways.)


Not because he can, but because I and others can decode and store these values in our far-from-npm environments that have never even heard of nanoid. All we need to know is "base64url uuidv4". IRL, when someone tells you their data is nanoid-encoded, it's just a pain in the ass, not an improvement in any way.

Btw, if you'd remove the emotional "old crappy" from the sentence, it would sound pretty sarcastic.


What do you need to decode it for? It's an identifier that you store; there is no need to decode it. Or do you mean you want to optimize storage by compressing those identifiers into as few bits as possible? Then just store random bytes and use any encoding you like for the outside representation, just like you would do when optimizing UUID storage (except with UUID you're losing a few bits to the version indicator, which should not have existed at all if not for the mess of combining a method of creating pseudorandom bytes with an encoding format, which is what the various UUID versions are).


I do create, use (compare/lookup) and store UUIDs, by design. Base64url is just one possible encoding method for, as you may have noticed, passing them through a URL. Other channels may have other [incompatible] limitations or none at all, so there is no need to stick to a specific escaping format.

Your long story short is: get rid of the uuidv4 tag and use a smelly format unreadable by virtually anyone outside of your code. Or generate random bytes environment-agnostically by yourself. No, thanks.


use smelly format unreadable by virtually anyone outside of your code

Aaarrgh. It is readable by everyone! It's an identifier; you don't need to convert it to something else: its conversion is the identity function.


Good points but this comment reeks of arrogance. It absolutely does occur in other languages, why would you assume it doesn't?


To be fair, most other high level languages have functions in their stdlib to do common stuff like hashing, generating unique id, getting data from the OS random generator, etc.

The JS stdlib has always been lacking in almost every possible aspect, and hence people have to reinvent the wheel again and again.


Can you cite some examples of this in other languages? It seems to occur time and time again in JS-land, but I rarely see other languages' bad libraries getting 2000+ GH stars and onto HN with this kind of garbage.

FWIW, I have also always done the exact Base64 UUID op suggested when I need a shortened unique identifier...


I would still use UUID v4. Databases support this data type and reduce its storage needs. You can also compact it if it's too long for your use case or if you'd like to save traffic.


But that goes against the modern principle of reimplement everything.


How are we supposed to learn anything if we never try to do it ourselves?


If you want to learn how UUIDs work, implement the standard.


This isn't about UUIDs. I think the project creator in the OP was just learning lower level things like working with randomness and bytes... I don't see the problem. Experimenting is fun. Leave them alone.


The problem is pollution. That's extremely unfun.

Experiments should stay in the developer's personal git repo; they should not be uploaded to the common code repos as a package. (Maybe into the Acme namespace¹ so that it's clear this code is not serious/can be safely ignored, but unfortunately npm does not support that.)

¹ https://softwareengineering.stackexchange.com/a/221954


A bad excuse for a bad library. See this time and time again in NodeJS though.


This is a good library.


However pretty, I wouldn't use something non-standard for generating IDs unless I was doing something embedded which required careful memory management under hardware constraints, where I most definitely wouldn't be using JS.


What? It's just a random string of letters. Don't be afraid of using random strings of letters! You don't need a standard for that.


@dchest So you frequently use varchar for your table index?


Why do you need VARCHAR? If you use SQL, you can use a fixed-size CHAR type to store fixed-length random strings as IDs. If you use an SQL database which supports a native UUID type, sure, go ahead and use it. But the world is not limited to SQL databases, or even any databases.

In three projects I'm working on right now, two identify objects by content HMAC (one of them uses an SQL database), and the third uses a mix of incrementing integers and random strings for different use cases.


> two identify objects by content HMAC (one of them uses an SQL database), the other uses a mix of incrementing integers and a random string for different use cases.

If those two are using SHA-1, like git, you are still using a standard. The third one however sounds like a potential mess.


If those two are using SHA-1, like git, you are still using a standard

Lovely, so I can officially claim I use a standard that is required by many commenters here if I just hash random bytes! Should I use a NIST-approved standard or maybe GOST hash function will work? Can I use the output of ChaCha20? If so, I'll be happy to read 16 bytes from /dev/urandom.

The third one however sounds like a potential mess.

Thanks for your analysis!


> Lovely, so I can officially claim I use a standard that is required by many commenters here if I just hash random bytes!

Stop pretending that you don't understand what a cryptographic hash function is, or that using a strong crypto function is the same as rolling your own, or that running the hash function on your input is the same as running it on random bytes.

> Thanks for your analysis!

Thanks for the anecdotal evidence.


I experiment all the time in my spare time. But I don't write blog posts and publish to npm.


> `random % alphabet` is a popular mistake to make when coding an ID generator. The spread will not be even; there will be a lower chance for some symbols to appear compared to others—so it will reduce the number of tries when brute-forcing.

I don't understand: is the default implementation of random in node.js flawed? Surely it should produce a uniform distribution…

edit nevermind, found the link to https://gist.github.com/joepie91/7105003c3b26e65efcea63f3db8... just above that paragraph


    0 % 3 => 0
    1 % 3 => 1
    2 % 3 => 2
    3 % 3 => 0
    4 % 3 => 1

So, a uniform input will produce more 0s and 1s than it will 2s.
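The same bias shows up with realistic sizes, e.g. mapping a uniform byte onto a 62-symbol alphabet (a quick illustration, not code from any package mentioned here):

```javascript
// Count how often each of 62 symbols is hit when reducing a byte with `%`.
const counts = new Array(62).fill(0);
for (let b = 0; b < 256; b++) counts[b % 62]++;
// 256 = 4 * 62 + 8, so symbols 0..7 are hit 5 times and symbols 8..61 only 4.
console.log(counts[0], counts[8]); // 5 4
```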


Only if the alphabet size is not a power of two.


More generally, only if the size of your input range is not an integer multiple of your alphabet size.

However, there are methods of converting uniform distributions that will preserve uniformity. I'll have to find a good reference.


Well, it's obvious to me.

The issue is that the root reply seemed to have taken it as a JS implementation issue (due to Math.random), since it's quite obvious why you should not take the modulus (or when you can) if you know the underlying maths.

So did I, as a matter of fact.

You, on the other hand, gave the answer to the wrong confusion.


> Issue is that the root reply seemed to have taken it as a JS implementation issue

I'm the root reply. jimktrains2's explanation exactly addressed my confusion.

To get a uniform distribution over an alphabet from random, I would do:

    var bucket_size = (random.MAX_VALUE - random.MIN_VALUE) / count(alphabet)
    return alphabet[random/bucket_size]


Actually, that doesn't work either

max = 4, min = 0, count = 3

bucket_size = 4 / 3

0,1,2,3,4

0 / (4/3) = 0

1 / (4/3) = 3/4 = 0

2 / (4/3) = 6/4 = 1

3 / (4/3) = 9/4 = 2

4 / (4/3) = 12/4 = 3 = invalid index

but even if you adjust this to count - 1

bucket_size = 4 / 2 = 2

0 = 0/2 = 0

1 = 1/2 = 0

2 = 2/2 = 1

3 = 3/2 = 1

4 = 4/2 = 2

So, 0s and 1s are still more likely than 2 to appear.

One of the easiest, and slowest, ways is just to discard any trial in which your uniform RNG generates a number above the largest index and try again.

For a larger alphabet, you essentially just collect multiple trials, treat them as a single trial, and do similar tactics.


> bucket_size = 4 / 3

I have an off-by-one error… your random function has a range of 5, so it should be max - min + 1:

    bucket_size = 5 / 3
then

    0 / (5/3) = 0
    1 / (5/3) = 3/5 = 0
    2 / (5/3) = 6/5 = 1
    3 / (5/3) = 9/5 = 1
    4 / (5/3) = 12/5 = 2
> So, 0s and 1s are still more likely than 2 to appear.

Yeah, I planned for 65535 randoms and a 52-character alphabet. It would still be mildly biased toward the first (range mod alphabet_size) elements.


They could have responded "no, random is uniform" (the answer to the written question) and that wouldn't have answered anything!


I like that this generates the ID directly by choosing random symbols.

But I don't like that it's equivalent to UUID v4 but different. The difference doesn't buy anything significant over the standard.

With a pretty small change this could directly generate a base64-encoded UUID v4 (without going through the two-step process of generating a UUID and then encoding it): most characters could be generated randomly from the base64 alphabet, while the characters at certain indexes would need to use a restricted set of base64 characters to set certain bits to the values specified by UUID v4.


I really like the ulid library. Does what I want, compact representation, and lexicographically sorting ulids in time order is great. The character subset is URL friendly, so this is my go-to when I'm making nonces if I am not worried about super-double-secure prngs.

https://github.com/alizain/ulid


For a toy project, I'm currently using a 64-bit integer of which the first bit is a sign bit (I'm using Postgres' bigint), the next 32 bits are a UNIX epoch timestamp in seconds, and the remaining 31 bits are a random number. This way, the ID is very easily sortable by time, although it's a bit of a gimmick.

The scheme allows for 2 billion possible records per second, although if you consider birthday collisions, it comes down to a lot less (but I could re-insert on duplicates). I could later also use some of the random bits to inscribe extra data into the ID.

The bits from the two halves are then interspersed to make sure the first half looks random, and transformed to an 11-character id, for example /^[0-9A-Za-z]{11}$/. I can reject some IDs beforehand if they match a list of bad strings. The result is an id such as "Vxe61yU5Cci" or "WgOUBEO88gb".


What happens in 2038? I hope you're never representing those 32 bits of UNIX timestamp as a signed integer anywhere. If you're making sure to always keep it unsigned, what happens in year 2106?

EDIT: I'm not criticizing you, and a valid counter argument would be that you make absolutely sure to always store the timestamp unsigned, that it doesn't need to work until 2106 because it's a toy project, or that an overflow wouldn't really matter because you make sure to keep it unique even after an overflow, but I get scared any time storing a UNIX timestamp in 32 bits is mentioned, and am wondering if you've considered it.


The literal Elixir code is:

  def epoched_random_uid, do: epoched_random_uid(&:erlang.system_time/1, &:crypto.strong_rand_bytes/1)
  def epoched_random_uid(system_time, strong_rand_bytes) do
    time = system_time.(:second)
    randint = strong_rand_bytes.(4) |> :crypto.bytes_to_integer

    bor(time <<< 31, randint >>> 1)
  end
So it uses – or at least I hope it does; you may file a bug report – a full 32 bits for the time, but it leaves the sign bit to be zero. I didn't want to deal with negative uids in PostgreSQL (bigint), but they actually would be perfectly valid. The sign bit could be used to expand the timestamp another bit.

You could argue that the "2038 problem" (31 effective bits) in the context of a uid scheme of an application that is bootstrapped today, is actually the "2084 problem", because the wraparound to timestamps in 1970 could be construed as being from 2018.


> but I get scared any time storing a UNIX timestamp in 32 bits is mentioned

You're scared because an average app hasn't considered a problem that's 21 years off? Seems like a premature worry


> You're scared because an average app hasn't considered a problem that's 21 years off? Seems like a premature worry

'Those who cannot remember the past are condemned to repeat it.' --George Santayana


Yeah? Did people need to prepare for the y2k bug in 1979?


Well yeah, weren't a lot of the broken systems big COBOL systems from that era?

You never know what software ends up being still used for critical systems 20 years down the road, so it's stupid to knowingly make a new system which is guaranteed to break in 20 years.

That said, it sounds like this guy has thought about it, and is making sure to use all 32 bits to store the timestamp.


Wait, so `%` is bad but `& 63` is fine? Ain't you doing a mod as well??? It's just that you have a power of two (64 in this case) that you are trying to exploit here... anyone who knows the basics of computer science can tell you it's the same! If it's about reducing code size, I can demonstrate smaller code:

  module.exports = function (size) {
    return random(size || 22)
      .map(r => url[r & 63])
      .join('');
  };

I am always disappointed to see these kinds of packages, as mentioned somewhere in the comments above

> It is noticeable again and again that devs publishing to npm are not up to snuff, I wish I knew why, as this does not happen with other dynamic languages/their code repos.

And I totally agree!


Check the code below:

      var masks = [15, 31, 63, 127, 255]
      ...
      var mask = masks.find(function (i) {
        return i >= alphabet.length - 1
      })
      ...

      var byte = bytes[i] & mask
      if (alphabet[byte]) {
        id += alphabet[byte]
      }

Masking gives a uniform random number over a power-of-two range at least as large as the alphabet length, and indexes greater than or equal to the alphabet length are then rejected, which is a common way of avoiding modulo bias.

I am always disappointed to see these kinds of packages, as mentioned somewhere in the comments above

Because you didn't understand how it worked and assumed that it's bad?


I believe the criticism here is about index.js, which is what all the other claims are about.

  var bytes = random(size)
  for (var i = 0; i < size; i++) {
    id += url[bytes[i] & 63]
  }
This part of the code just uses an alphabet size of 64, which makes the approach actually used in the exported function identical to modding the byte by the alphabet size.


Yes, with a power-of-two alphabet length there's no bias. The poster seems to be confused about modulo bias claims, which apply to non-power-of-two alphabets, not a particular way of calculating modulus.



(replying to myself as I can no longer edit:) My favorite feature in it is .entropy(n) method: when you want some fixed amount of unpredictable bits, but don't want to calculate how long the random string should be in your encoding to contain that many bits.


I designed something similar (in terms of URL-friendliness), but for distributed systems:

https://github.com/AltusAero/bronze

It's designed for use cases where collision-resistance (both time-based [UUID1] and random [UUID4]) is valued more than the size of the id. Can be used as a module (Node.js & browser) and via CLI.


Silly question: why doesn't JS provide a method to generate a UUID without a library? I mean, this is 2017. One method without any input params won't hurt anybody.


UUID is a misguided standard that should have been killed long ago if not for compatibility reasons, so I'm glad it's not in JS.


I wrote a Python version of this six years ago; it's been working beautifully and I use it for IDs in all my APIs:

https://github.com/skorokithakis/shortuuid

That way, elements are neither enumerable nor guessable.


Tilde (~) doesn't look like a wise choice to me, as it may have special meaning in certain contexts; fortunately the alphabet is redefinable. I would also prefer if at least some part of the nanoid were machine dependent. Anyway, it works :)


If you need a function that returns integers in some range [0..n) equiprobably, just write that. Why conflate a bunch of concepts and have weird limitations?

Oh, and by the way, fuck HN for deleting comments.



This is a perfect example of what NOT to do. They reimplemented a seeded linear congruential generator for their random number generator:

https://github.com/dylang/shortid/blob/master/lib/random/ran...

But since LCGs are easy to solve, after seeing a few values you can solve for the seed and generate all past and future values.

But you don't even need to do anything fancy for that. Since their implementation of LCG only has a state size of 233280 different values, you can just brute force it. (also means that their rng could only ever generate 233280 different numbers to begin with)
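To make the point concrete: assuming the classic JS snippet with this modulus, seed = (seed * 9301 + 49297) % 233280 (the constants are my assumption, taken from the well-known generator with a 233280-element state, not verified against shortid's source), recovering the internal state is a trivial loop:

```javascript
// The suspect LCG: only 233280 possible states in total.
function lcg(seed) { return (seed * 9301 + 49297) % 233280; }

// Brute-force the state from a single observed output.
function recoverSeed(observed) {
  for (let s = 0; s < 233280; s++) {
    if (lcg(s) === observed) return s;
  }
  return -1;
}
```

Since gcd(9301, 233280) = 1 the map is a bijection, so one output pins down the state, and with it every past and future value.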

Why the fuck do they have 2.5k stars!?

EDIT: They already have an open issue for it, but the project seems unmaintained: https://github.com/dylang/shortid/issues/70. Stay far, far away.


You're absolutely right, and this is huge...


How does it compare to hashids (http://hashids.org/)?


Hashids are reversible mappings to/from integers. Nano IDs are random strings.


A string generated using the default parameters has a ~0.1% chance of containing a four-letter English profanity.


I'd consider that a feature not a bug :).


What about UUID, a well-researched and established standard?


> Compact. It uses more symbols than UUID (A-Za-z0-9_~) and has the same number of unique options in just 22 symbols instead of 36.


This is basically UUIDv4, encoded with a url-safe variant of base64 (except UUIDv4 reserves a few bits for saying "this is a UUID, specifically version 4").


UUIDs are not secure unless they are version 4 (random) and generated from a secure randomness source.


What do you mean by secure? Unpredictable?


The same applies to any unique IDs in general.


Yes, but the holy UUID standard specifies multiple ways of generating them from sticks and stones, invented before people knew how to properly generate random bytes.


Why wouldn't this be a one-liner?


It would be a pretty long line, containing a function wrapping crypto.randomBytes, window.crypto.getRandomValues, or window.msCrypto.getRandomValues selected at runtime, and then a few lines to select characters while avoiding modulo bias. Try?


That sounds like a pretty interesting code golf-ish challenge actually, would attempt if I had more than 3% battery on my laptop atm :(



