
Show HN: Shorten URIs using Huffman Coding, not a database - 19eightyfour
https://urizip-dot-populace-soho.appspot.com/
======
tghw
I tried:

    
    
        https://github.com/dosaygo-coder-0/urizip/commit/cb2bfd2e04e8b814943e42a5bf87eeff31a77126
    

Got:

    
    
        kl9oo67XTkQCETxduLJSW5iUfSNh5pW6iDuhqKmzprOoJCaUYmMo5M7kiMaYxrBRjTQsWzqDks9BlBQWEQKsrKImrQU
    

Which is longer.

Also "OK" isn't a great commit message for almost every commit:

[https://github.com/dosaygo-coder-0/urizip/commits/master](https://github.com/dosaygo-coder-0/urizip/commits/master)

~~~
adm_hn
[https://github.com/dosaygo-coder-0/urizip/blob/master/build#...](https://github.com/dosaygo-coder-0/urizip/blob/master/build#L48)

------
ageitgey
The url for this post is 45 characters:

[https://news.ycombinator.com/item?id=14245119](https://news.ycombinator.com/item?id=14245119)

If you encode it with this shortener, you get this 35 character string:

mNb:w9iIp7u8di:AKB2xrPUVYUFhfRUWHwA

Assuming you were using this string as a key in a shortening service like
this:

[https://short.url/mNb:w9iIp7u8di:AKB2xrPUVYUFhfRUWHwA](https://short.url/mNb:w9iIp7u8di:AKB2xrPUVYUFhfRUWHwA)

... you'd end up with a url longer than the original url! So it's not
_technically_ a url shortener :)

~~~
zzzcpan
The idea is still interesting. I imagine using some dictionary-based
compression, with crawled data to build the dictionary, could get us somewhere.

~~~
grey-area
Unfortunately the longest urls, the ones you want to shorten the most, are the
worst candidates for compression (with long query strings full of things like
uuids).

~~~
19eightyfour
I have an idea to identify the uuid and other identifier type parts and encode
them up as numbers. The file radix_coder.js is working toward this. So we
cover a few formats like guid base64 digits base36, to get a little bit more
gain.

But my initial experiments suggest it's just a little gain. Digits go to 68%,
base36 to 90%, and then we also have to add in the prefix that indicates we
are switching encodings.
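
Roughly, the digits case would look like this (a sketch, not the actual
radix_coder.js code): re-pack a run of decimal digits as raw bytes, so each
digit costs about 3.3 bits instead of a whole output character.

    
    
        // pack a decimal run into bytes via BigInt; a real coder must also
        // record the run length (to keep leading zeros) plus the
        // mode-switch prefix mentioned above
        function packDigits(run) {
          let n = BigInt(run);
          const bytes = [];
          while (n > 0n) {
            bytes.unshift(Number(n & 0xffn));
            n >>= 8n;
          }
          return Uint8Array.from(bytes);
        }
        
        packDigits("20170502"); // 8 chars -> 4 bytes, ~6 base64 chars
    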

------
no_gravity
Often urls can be most effectively shortened manually. I wish everybody who
sends me something like this:

    
    
        https://www.booking.com/hotel/fr/hotelwestminster.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaDuIAQGYATHCAQN4MTHIAQzYAQHoAQH4AQKSAgF5qAID;sid=9b4fe19e9de68a3cb71714046bf9d64a;checkin=2017-05-12;checkout=2017-05-13;ucfs=1;highlighted_blocks=5190001_91458119_0_2_0;all_sr_blocks=5190001_91458119_0_2_0;room1=A;hpos=1;dest_type=city;dest_id=-1456928;srfid=49082c78468185e093631018c71495e7e11775c0X1;from=searchresults;highlight_room=#hotelTmpl
    

Would just take a look at the url and see that only this part is needed:

    
    
        https://www.booking.com/hotel/fr/hotelwestminster.html
    

Or that this:

    
    
        https://www.reddit.com/r/AskReddit/comments/68sgew/you_awake_one_morning_to_find_you_have_10_skill/
    

Is just a sugarcoated version of this:

    
    
        https://www.reddit.com/r/AskReddit/comments/68sgew
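
For the common case the trim is mechanical. A rough sketch (it assumes the
page still resolves once the query string and fragment are dropped; cases
like the reddit slug above would need their own rules):

    
    
        function trimUrl(href) {
          // keep scheme, host and path; drop query string and fragment
          const u = new URL(href);
          u.search = '';
          u.hash = '';
          return u.toString();
        }
    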

~~~
Benjamin_Dobell
Oh, but distributing links with tracking tokens attached is a really great way
to mess with sites that track users:

 _Well, we've had one user who really liked the new content. They viewed it
15,323 times... waaaaaait, damn it!_

~~~
a3n
I habitually remove tracking and other non-essential cruft from shared URLs,
for readability and for politeness to my correspondents. I don't think it's
polite to associate their IP with my tracked activity.

~~~
paulryanrogers
Do you have any tools to automate that process?

~~~
a3n
Not really. I use the Uppity extension in Firefox.

So if Uppity shows me:

    
    
      http://fortune.com/2017/05/02/donald-trump-wall-street-elite/?xid=gn_editorspicks&google_editors_picks=true
      http://fortune.com/2017/05/02/donald-trump-wall-street-elite/
      http://fortune.com/2017/05/02/
      http://fortune.com/2017/05/
      http://fortune.com/2017/
      http://fortune.com/
      http://www.fortune.com/
    

Then I select the second one from the top, let it load to be sure I got a
usable URL, then use that in my email. This is probably ideal, since anything
more automated would probably get it wrong often enough that I'd go back to
doing just this.

Firefox: Uppity [https://addons.mozilla.org/en-US/firefox/addon/uppity/](https://addons.mozilla.org/en-US/firefox/addon/uppity/)

Firefox: Navigate Up WE [https://addons.mozilla.org/en-US/firefox/addon/navigate-up-w...](https://addons.mozilla.org/en-US/firefox/addon/navigate-up-we/)

Chrome: Up
[https://chrome.google.com/webstore/detail/up/iohgglcbddjknne...](https://chrome.google.com/webstore/detail/up/iohgglcbddjknnemakghbjadinmopafl)

(And what is _up_ with those chrome store URLs?)

------
grey-area
I've got a suggestion for a better format called hurl; I'm working on the RFC
as I write this.

It's _constant length_, doesn't require database storage, avoids collisions,
and compresses longer urls far better than the huffman coding method.

Here is the complete implementation (in go):

    
    
        import "crypto/sha256"
        import "encoding/base64"
        
        func hurl(s string) string {
        	sum := sha256.Sum256([]byte(s))
        	return base64.RawURLEncoding.EncodeToString(sum[:])
        }
    

[https://play.golang.org/p/JNYzKhLHs8](https://play.golang.org/p/JNYzKhLHs8)

Here is a comparison of output:

    
    
        // For a short url *that doesn't really need shortening* huffman coding is better:
        // https://urizip-dot-populace-soho.appspot.com/
        // hurl:Y3h44nnzQH74AMEf-S2PAhgiP_CUH-4tZmkh78qXwdQ
        // huff:gtUrgo-kiv:hqL-ZG2:N1d-1j:bFFH:VAA
    
        // For longer urls, hurl wins hands down:
        // https://blogs.windows.com/devices/2017/05/02/introducing-surface-laptop-powered-by-windows-10-s/#xkgEy2SH0VVEG2dt.97
        // hurl:r0WGHoHvVmv4I51qW9FxCAIxX8NSfYlds1Pi-Of92ZI
        // huff:kNbXwsPYj:-GvsMIK65xSimsSIsqLFFqLEmqI4EUYj2GPtwpiHSjdyfABAm4yRudKbmciNxW3IomKbkRibopEQAx5BwcCRE1GdWJxcYGFTUhDTkxZgE
    
        // https://www.booking.com/hotel/fr/hotelwestminster.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaDuIAQGYATHCAQN4MTHIAQzYAQHoAQH4AQKSAgF5qAID;sid=9b4fe19e9de68a3cb71714046bf9d64a;checkin=2017-05-12;checkout=2017-05-13;ucfs=1;highlighted_blocks=5190001_91458119_0_2_0;all_sr_blocks=5190001_91458119_0_2_0;room1=A;hpos=1;dest_type=city;dest_id=-1456928;srfid=49082c78468185e093631018c71495e7e11775c0X1;from=searchresults;highlight_room=#hotelTmpl
        // hurl:-jf411UaJbZ1Pwl53dTLUxui31YB60GrwyX-n3RpkJU
        // huff 109% size
    

More seriously: since urls that need shortening are typically over 140 chars
and contain random data, huffman coding isn't efficient enough. I do think
some sort of predictable shortening algorithm would be much better than the
current system of link databases that die, so this idea has some merit, but
I'm not sure trying to compress like this will work on the urls you need to
shorten the most.

~~~
tonmoy
How do you decode the shortened URL to get the original back??!!??

~~~
19eightyfour
Proof of work. Hahaha. URLCoin: a cryptocurrency based on reversing hashed
URLs to their originals. Involves a lot of web crawling to verify, as an added
bonus. Proof of crawl.

------
captn3m0
A few ideas:

1. Unicode URLs. Throw some emojis in there. A side project plan of mine has
been to run an emoji link shortener service.

2. Another way would be to store it on the blockchain for a publicly
verifiable lookup.

~~~
edent
Re (1) - that's linkmoji - [http://www.xn--vi8hiv.ws/](http://www.xn--vi8hiv.ws/)

So the URL for this discussion becomes
[http://linkmoji.co/⭕](http://linkmoji.co/⭕) or
[http://xn--k7i.ws](http://⭕.ws)

~~~
19eightyfour
I agree Unicode emojis in URLs it's a really cool idea. I'm pretty sure that
would still require database. So using a database is not the direction I would
like to go in with this.

And if it was using compression to produce an encoding using unicode the first
trouble is I'm not really sure how to do that encoding since Unicode is not
straight up translating any sequence of bits into characters there are some
restrictions and there's all kinds of rules I think and I don't really know
how that works and I would like to keep it simple. And the second point is I
want to keep it so it's very easy to transport which pretty much means we need
to use base64 in my opinion.

------
19eightyfour
Hello, OP here. I went to bed with three points after 4 hours, thinking this
idea would never take off here, and a few hours after I woke up I checked my
account: 54 points and so many comments. I'll try to make my way through them.

------
a3n
[http://fnord.com](http://fnord.com) => gR30sacA

but [http://gR30sacA](http://gR30sacA) !=>
[http://fnord.com](http://fnord.com)

gR30sacA => [http://fnord.com](http://fnord.com)

There's nothing about gR30sacA all by itself that tells me that this is a
shortened URI.

~~~
19eightyfour
I'm not sure I understand this one.

But I think you might mean that it doesn't have a human-readable prefix
identifying what kind of encoding it is.

That's still something I'm considering. I haven't thought that much about it,
but some ideas I've glanced at are data:, urizip:, and
[https://un.co](https://un.co) as prefixes.

------
vanderZwan
If you use React with react-router[0] (or a comparable JS-based solution), you
can just apply lz-string[1] to a stringified JS object and make the resulting
string match a path. Then all you need is some code to reverse the operation
(lz-string -> JSON -> JS object). That's what I'm doing to create shareable
state in an app I'm building for a research group.

Done naively, it will have the same problem others mentioned here: it barely
shortens the URL. However, in the case of my specific app - a browser for data
sets for single cell RNA sequencing - the state is largely dependent on the
data set being viewed, and the data set is unchanging. That means we can
"externalise" the data being referred to, pre-transforming my JS object
(which is an object tree) into nested arrays of integers (a kind of "array
tree") that use the data set as a look-up to reverse the operation. The latter
is already a lot shorter when stringified, but also more compressible: the
character set is limited to ten digits, commas and square brackets, and
different values may be transformed into the same numbers, being distinguished
by their _position_ in the array tree.

To make this operation easier, I created a few helper functions to declare a
"schema" that creates a recursive function for transforming the original JS
object to said array of arrays, and vice-versa:

[https://gist.github.com/JobLeonard/a47692a1f77bebc06c2518f32...](https://gist.github.com/JobLeonard/a47692a1f77bebc06c2518f321fa7efc)

As you can see in the gist, that transformation shrunk a URL of 2466
characters down to an (admittedly still crazy) 638 characters.
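
A minimal sketch of the pattern, with hypothetical keys rather than my real
schema:

    
    
        // fix the key order with a "schema" so the object tree flattens to
        // an array tree, which stringifies smaller and compresses better
        const schema = ['dataset', 'genes', 'zoom'];
        const toArrayTree = state => schema.map(k => state[k]);
        const fromArrayTree = arr =>
          Object.fromEntries(schema.map((k, i) => [k, arr[i]]));
        
        // lz-string then turns the JSON into a URI-safe path segment
        const LZString = require('lz-string');
        const toPath = state =>
          LZString.compressToEncodedURIComponent(JSON.stringify(toArrayTree(state)));
        const fromPath = path =>
          fromArrayTree(JSON.parse(LZString.decompressFromEncodedURIComponent(path)));
    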

I was thinking of putting it on NPM, but it looked so specific in its
applicability that I haven't bothered turning it into a proper package yet.

[0] [https://github.com/ReactTraining/react-router](https://github.com/ReactTraining/react-router)

[1] [https://github.com/pieroxy/lz-string](https://github.com/pieroxy/lz-string)

~~~
19eightyfour
Interesting. Thanks for sharing. It sounds like a highly technical project
you're optimizing there, and you're having some success. Keep up the good
work! I might take a look.

edit: I looked at your gist and this is a totally great idea. I see what
you're doing, converting it to make it more compressible. Very clever. If you
have some ideas about how to improve my compression in this binary URI
encoder, please do submit an issue or PR! Thanks for commenting!

------
fsiefken
Neat but SMAZ has better performance :-)
[https://github.com/antirez/smaz/tree/master](https://github.com/antirez/smaz/tree/master)

~~~
eriknstr
Performance in terms of compression ratio or in terms of speed or in terms of
both or in terms of something else like for example memory usage?

Is SMAZ encoded data URL safe?

~~~
fsiefken
I meant compression ratio, but I overlooked the rather important fact that
it's not URL safe.

------
larrik
I don't really understand the motivation here; care to expand on that?

~~~
jmkb
A url shortener like tinyurl maintains a database of original urls linked to
the shortened versions. Only they can translate between them, so 1) there's a
risk the links will die if the service dies and 2) there are tracking
implications.

This shortening scheme simply uses lossless data compression, like a zip. As
long as the decompression algorithm is available, the link can be translated
by anyone.
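
A toy sketch of the idea (not the actual urizip codec): both sides share a
fixed frequency table for URL characters, build the same Huffman tree from
it, and therefore need no lookup database.

    
    
        // hypothetical static frequencies; a real table needs an entry
        // for every character a URL may contain
        const FREQ = { '/': 12, 't': 10, '.': 9, 'h': 8, 's': 7, 'p': 6,
                       'e': 6, 'w': 5, 'o': 5, ':': 4, 'm': 4, 'c': 4 };
        
        function buildCodes() {
          let nodes = Object.entries(FREQ).map(([ch, w]) => ({ ch, w }));
          while (nodes.length > 1) {
            nodes.sort((a, b) => a.w - b.w);     // take the two lightest...
            const [l, r] = nodes.splice(0, 2);
            nodes.push({ l, r, w: l.w + r.w });  // ...and merge them
          }
          const codes = {};
          (function walk(n, bits) {
            if (n.ch !== undefined) { codes[n.ch] = bits || '0'; return; }
            walk(n.l, bits + '0');
            walk(n.r, bits + '1');
          })(nodes[0], '');
          return codes;
        }
        
        // encode yields a bit string to pack into base64 for transport;
        // decoding walks the same tree, so anyone can reverse it
        const codes = buildCodes();
        const encode = url => [...url].map(c => codes[c]).join('');
    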

~~~
peeters
You can click on a tinyurl link. That's all that's required from the user.

I'm finding it really difficult to understand where this would be useful. Do
we expect everyone to keep the handy decoder available and know to use it when
they see random base64 encoded strings?

~~~
mytherin
The end-game of the idea is that the decoder becomes embedded into browsers,
so URLs can be shortened without requiring users to go through a third party
service that tracks you. This method would be both faster than current URL
shorteners (because it all happens client side - no extra round trip required)
and much more suitable for archiving purposes (no single-point-of-failure that
takes down millions of shortened URLs).

~~~
peeters
Hmm, IMO it should be a valid URL with a protocol then. Something like
zurl:Ma0t7asf...0a==. Something needs to identify how to handle it.
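
For instance (a sketch, not a standard; browsers today only allow custom
schemes registered with a web+ prefix), a decoder page could claim such a
scheme itself:

    
    
        // hypothetical decoder page registering for "web+zurl:" links;
        // %s receives the whole clicked link for client-side decoding
        navigator.registerProtocolHandler(
          'web+zurl',
          'https://example.com/decode#%s'
        );
    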

I'm really not sure it's handling what URL shorteners are for, though. URL
shorteners can take uber-long URLs, like 150 characters, down to 10-20
characters. Huffman coding might only be able to get that down to 120
characters, given how much of many long URLs is already random data that is
largely incompressible (past the reduced character set).

~~~
dafrankenstein2
An analysis can be done on that.

~~~
peeters
You're right. I don't have the resources for a full analysis, but I did
"shorten" three Google Maps links (IMO a good candidate for link shortening).

The "compressed" output for the three ranged between 114% and 117% of the
original length. E.g.:

[https://www.google.ca/maps/place/Rockefeller+Center/@40.7586...](https://www.google.ca/maps/place/Rockefeller+Center/@40.7586887,-73.9788843,92m/data=!3m1!1e3!4m5!3m4!1s0x0:0x33d224a0d5dacca2!8m2!3d40.758741!4d-73.978672)

Encoded:

    
    
        k:Yr5fL7YTHQz1AlBxKJrphUT0qAgf6ouSZ3Ze1yEQCNiTlmLclpycllFbllYTkxZcnJyY1hFTE0dUfkwV5SlgdREpEKLCUxjsWlLA6xpSJFMQx7EksQx6wsJCamsYViSIvoymS01Kch1NSlhIY2JOWYtyWWNESmNIbllYTkxZclpZRt

~~~
19eightyfour
This URI has a lot of hex sections and numbers in it, and I'd like to point
out the result is still 85% of the Base64-encoded version of that URI, even
though it's 114% of the original.

I think we could get this size down below parity by encoding the numbers and
hex sections.

Thanks for the comment! If you have some ideas how to improve the compression,
I invite you to please submit an issue or PR, thank you very much.

------
Hansi
Doesn't support anchor links, it seems:

Encoder:
[https://calendar.google.com/calendar/render#main_7](https://calendar.google.com/calendar/render#main_7)
-> kF9DKUm9hMdDPUOCFu-QohzySQDx6pLN

Decoder: kF9DKUm9hMdDPUOCFu-QohzySQDx6pLN ->
[https://calendar.google.com/calendar/renddabD](https://calendar.google.com/calendar/renddabD)

~~~
19eightyfour
Oh, that's a bug! Thank you. Fragment links ought to work.

I'll open an issue.

Issue: [https://github.com/dosaygo-coder-0/urizip/issues/1](https://github.com/dosaygo-coder-0/urizip/issues/1)

~~~
19eightyfour
Okay, fixed now! [https://github.com/dosaygo-coder-0/urizip/compare/4b5be28......](https://github.com/dosaygo-coder-0/urizip/compare/4b5be28...master)

Thanks for the report :)

------
anon1253
Here's a silly trick we used for a similar problem: use zlib with a custom
compression dictionary. Our application had tons of interlinks (think linked
data) that we wanted to expose to the end user, but putting urls in urls is
kinda ugly. So we ended up building our own custom compression dictionary and
pushing the urls through zlib. Works like a charm.
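
Roughly, in Node.js (a sketch; the dictionary below is made up, a real one
would be tuned to your corpus of links):

    
    
        const zlib = require('zlib');
        
        // zlib prefers matches near the *end* of the dictionary,
        // so put the most common substrings last
        const dictionary = Buffer.from(
          '.html?utm_source=&utm_medium=www.https://http://'
        );
        
        // 'base64url' encoding needs a recent Node (>= 15)
        const shorten = url =>
          zlib.deflateRawSync(url, { dictionary }).toString('base64url');
        const expand = token =>
          zlib.inflateRawSync(Buffer.from(token, 'base64url'), { dictionary })
              .toString();
    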

~~~
19eightyfour
Okay I will try this, thank you! I'll open an issue.

Issue: [https://github.com/dosaygo-
coder-0/urizip/issues/3](https://github.com/dosaygo-coder-0/urizip/issues/3)

------
stevekemp
I'm seeing errors:

    
    
        ReferenceError: urizip is not defined
           decoder.onsubmit()
    

Shame. I did consider using a DHT for storing shortened URLs once upon a time,
to avoid the single point of failure. These days I guess there is already
somebody trying to sell a solution using a blockchain!

~~~
19eightyfour
Thanks for the report. I'm sorry about that. I forgot to test on Edge and
Firefox before posting. Pretty stupid oversight, actually. I'll open an issue.

Issue: [https://github.com/dosaygo-
coder-0/urizip/issues/4](https://github.com/dosaygo-coder-0/urizip/issues/4)

------
pmiller2
I came up with a variant of this idea in an interview once. It did not go over
too well. :)

~~~
asimpletune
What was the feedback they gave you?

~~~
proksoup
I was asking for a function that converted a number (an auto-incremented id)
to a string, such that all possible shortest urls would be iterated through.

E.g. given character set a-z:

    
    
        0: a
        1: b
        ...
        26: aa
        27: ab
    
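
A sketch of the sort of function I was asking for (bijective base-26):

    
    
        // 0 -> a, 25 -> z, 26 -> aa, 27 -> ab, ...
        function idToSlug(n) {
          let slug = '';
          do {
            slug = String.fromCharCode(97 + (n % 26)) + slug;
            n = Math.floor(n / 26) - 1;
          } while (n >= 0);
          return slug;
        }
    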

I probably had no idea what the interviewee was suggesting and tried and
failed to explain my question any better.

------
gumby
Nice! By making it algorithmic and serverless the client could even expand it
in situ when the mouse hovers over the link.

~~~
19eightyfour
True, that's a nice idea, I hadn't thought of that, thanks! If you have some
more ideas for how to improve it, I invite you to make an issue or PR!

------
ravenstine
Only got it to work once and now the tab just crashes every time. Neat idea,
though.

~~~
19eightyfour
Yeah there's a pretty big JavaScript load. And actually I did not test it on
Firefox and Edge. I am sorry about that. Thanks for the report! I'll open an
issue.

Issue: [https://github.com/dosaygo-
coder-0/urizip/issues/2](https://github.com/dosaygo-coder-0/urizip/issues/2)

