Nowadays, on macOS, "dict://Internet" will open the Dictionary app with the query "Internet". (Probably behind a security prompt.) Not sure if there's similar functionality on other operating systems.
> firefox came up with a prompt confirming I wanted to open 'internet' in 'Dictionary'
My Ffx 131.0b9 wasn't so adept. It gave me:
The address wasn’t understood
Firefox doesn’t know how to open this address, because one of the following protocols (dict) isn’t associated with any program or is not allowed in this context.
You might need to install other software to open this address.
The behavior may also vary depending on whether it's an actual link in a document or direct input to the location bar. (I get different prompts on Firefox, while both will forward to the built-in system dictionary.)
And, of course, any further configurations may alter this, e.g., another service may have been registered for this protocol.
I admire these old protocols that are intentionally built to be usable both by machines and humans. Like the combination of a response code, and a human readable explanation. A help command right in the protocol.
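To make the "response code plus human-readable explanation" idea concrete, here is a minimal sketch of splitting such a status line into its machine part and its human part. It assumes the three-digit, first-digit-is-the-class convention that RFC 2229 shares with SMTP and NNTP; the names `Status` and `parse_status` are made up for illustration.

```python
from typing import NamedTuple

# First-digit classes, following the convention DICT shares with SMTP/NNTP.
CLASSES = {
    "1": "positive preliminary",
    "2": "positive completion",
    "3": "positive intermediate",
    "4": "transient negative",
    "5": "permanent negative",
}


class Status(NamedTuple):
    code: int   # the part a program switches on
    kind: str   # class derived from the first digit
    text: str   # the part a human reads


def parse_status(line: str) -> Status:
    """Split a status line such as '552 no match' into code, class, and text."""
    line = line.rstrip("\r\n")
    code, _, text = line.partition(" ")
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"not a status line: {line!r}")
    return Status(int(code), CLASSES.get(code[0], "unknown"), text)


if __name__ == "__main__":
    for sample in ("250 ok", "552 no match"):
        print(parse_status(sample))
```

A client only ever needs `code`; the `text` is there for whoever is typing into telnet.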
Makes me think it's a shame that making a json based protocol is so much easier to whip up than a textual ebnf-specified protocol.
Like imagine if in python there was some library that you gave an ebnf spec maybe with some extra features similar to regex (named groups?) and you could compile it to a state machine and use it to parse documents, getting a Dict out.
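The closest thing in the standard library is probably `re` with named groups and `groupdict()`. It is not EBNF and it cannot handle recursive grammars, so treat this as a rough sketch of the "spec in, Dict out" idea only. The sample line imitates a DICT "151" definition header; the pattern and the name `parse_line` are my own assumptions, not taken from any spec.

```python
import re

# One "rule" for a line shaped like: 151 "word" database "description"
LINE = re.compile(
    r'''
    (?P<code>\d{3})\s+          # three-digit response code
    "(?P<word>[^"]*)"\s+        # quoted headword
    (?P<database>\S+)\s+        # short database name
    "(?P<description>[^"]*)"    # quoted database description
    ''',
    re.VERBOSE,
)


def parse_line(line: str) -> dict[str, str]:
    """Return the named groups of a matching line as a plain dict."""
    match = LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"line does not match the rule: {line!r}")
    return match.groupdict()


if __name__ == "__main__":
    sample = '151 "glisten" gcide "The Collaborative International Dictionary of English v.0.48"'
    print(parse_line(sample))
```

For anything nested you would need a real parser generator rather than a regex, which is exactly the gap being described.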
> Makes me think it's a shame that making a json based protocol is so much…
Maybe I'm not the human you are thinking of, being a techie, but I find a well structured JSON response, as long as it isn't overly verbose and is presented in open form rather than minified, to be a good compromise of human readable and easy to digest programmatically.
The legibility is probably one of the main reasons JSON got adopted. XML can be made to not look too bad, but in SOAP it must be unreadable, so everybody was looking into fixing this.
XML has a sweet advantage though. It can be styled in the browser. For example a sitemap that works for Google, OpenAI, &c. and is human readable looking like a web page.
Yeah, but JSON has the vulnerability that it will tend to become like an overgrown garden over time, because it can. The type of TCP <code> <response> <text> protocols the GP talks about have the benefit of their restrictions and are definitely a better balance (human / machine), and more stable.
Unfortunately, in practice they are a nightmare. Look at the WHOIS protocol for an example.
Humans don’t look at responses very much, so you should optimise for machines. If you want a human-readable view, then turn the JSON response into something readable.
Whois protocol has no grammar to speak of (the response body is not defined at all, just a text blob) which makes it a nightmare to parse. Having a proper response format would solve this.
The next logical step is to use a machine-friendly format instead; that is a binary protocol.
Even HTML and XML, which were designed for readability and manual writing, eventually became "not usable enough" ("became" because I think part of it is that their success exposed them to less technical populations), and now we have markdown everywhere, which most of the time is converted to HTML.
So if you are going to use a tool more sophisticated than Ed/Edlin to read and write (rich) text in a certain format, it could be more efficient to focus on making the job of the machine - and of the programmer, easier.
If you look at a binary protocol such as NTP, the binary format leaves very little room for Postel's principle [1], so it is straightforward to make a program that queries a server and display the result.
maybe we could have a format that was more human-readable than json (or especially xml) but still reliably emittable and parseable? yaml, maybe, or toml, although i'm not that enthusiastic about them. another proposal for such a thing was ogdl (https://ogdl.org/), a notation for what i think are called rose trees
> OGDL: Ordered Graph Data Language
> A simple and readable data format, for humans and machines alike.
> OGDL is a structured textual format that represents information in the form of graphs, where the nodes are strings and the arcs or edges are spaces or indentation.
their example:
network
  eth0
    ip   192.168.0.10
    mask 255.255.255.0
    gw   192.168.0.1
hostname crispin
Formats like TOML are horrible for heavily nested data (even XML does a better job here) and the last time I checked, TOML didn’t support arrays at the top level.
YAML is nicer than JSON to write, but I wouldn’t say it’s any nicer to read.
If you want something that’s less punctuation heavy, then I’d prefer we go full Wirth and have something more akin to Pascal.
arrays at the top level are probably a bad idea for protocols that need to evolve in a backward-compatible way
what do you mean about heavily nested data? do the other formats i linked do a better job there?
i'm not sure it's possible to come up with a data format that will work well for such a wide range of use cases, but it sure would be nice to have. json is pretty great in terms of being able to load it into the browser, or visidata, or python, or js, or whatever
> arrays at the top level are probably a bad idea for protocols that need to evolve in a backward-compatible way
Depends on the protocol. It might be preferable to version the end point. Or if it’s a specific function, e.g. “list-synonyms”, then having a dictionary just to reference an array could be argued as unnecessary protocol bloat. Particularly given the aim of this exercise is readability.
> what do you mean about heavily nested data? do the other formats i linked do a better job there?
I mean a tree like structure.
JSON and YAML are probably the best in class here. XML, for all of its warts, is good at handling nested data in a readable way too.
TOML was more based around a flatter structure.
> i'm not sure it's possible to come up with a data format that will work well for such a wide range of use cases,
It’s not. The moment that happens, that format then becomes unwieldy and people then feel the urge to invent yet another new format to simplify things. It’s a vicious circle that happens over and over again in the tech sector.
That suggests things are getting worse but personally I’ve seen the opposite trend.
These days developers are rallying around a subset of established standards rather than inventing new protocols and grammar for each new service.
Take a look at the old protocols out there: finger, DNS, Gopher, HTTP, FTP, SMTP, Dict, etc. They all have their own grammar and in many cases, even that grammar is very loosely defined or subject to dozens of different standards. Whereas these days it’s mostly JSON or XML over HTTPS. Or ProtoBuf if you need something more compact.
There’s definitely still room for improvement. For example the shift towards proprietary messaging protocols like Slack, Discord, etc. But that’s another topic entirely.
yeah, i appreciate the move to html, http, and json. although http/2 and http/3 arguably aren't really http, and scraping data out of html is ridiculously time-wasting. the shift toward cloudflare and secret criteria for blocking users whose sessions act "atypical" are also huge problems, but that's sort of what you'd expect from using software running on remote servers you can't control
In my department's (we were formerly our own company) internal framework throwing .html on the end of any JSON response outputs it in nested HTML tables. I personally find it very helpful.
At that point you might as well drop JSON altogether and use an XHTML subset so your rendered output is also valid XML (instead of having two different and incompatible markups merged together)
I’m not generally a fan of XML either but what you posted there is just factually incorrect in just about every conceivable way.
1. There’s plenty of XML parsers already available for most languages. Yeah there have been high profile exploits based on XML but given the scale of XML’s usage, it’s fair to say those exploits are atypical usage where XML can be user supplied. And as long as you’re not allowing users to upload their own XML, then you get to control the schema so there isn’t any risks in using XML.
2. XML’s entire purpose is a data store. I’m not someone who likes to blame the developers for using their tools wrong but honestly, if you can’t unmarshal an XML schema you have control over then you’re not going to succeed with JSON either.
3. It is. But it’s also highly compressible because of its repetitive tags. So for HTTP endpoints, it actually doesn’t work out any different to JSON.
> Leave the markup languages for intended purpose: text markup. Don't force them to carry data.
You do realise the entire point of XML is to carry data? It might have fallen out of favour in recent years but those of us old enough to remember a time before JSON will talk about how JSON is just a simplified reimplementation of XML. And with things like JSON schemas, JSON is continuing to copy XML features.
Yes, XML is a better structured format than the things before it (random non-standardized text formats). But saying "it's just the same as JSON" misses how many dangerous parts and footguns it has.
There was a post less than two weeks ago on defusedxml [0] - XML parser with protection from various XML "features" - small files causing exponential blow-up, remote access from just trying to parse the xml... It's not related to "scale of XMLs usage", and those are not security bugs. It's "working as designed", because XML is full of weird features that maybe sounded great in 1998 but now just add vulnerabilities. JSON has eliminated the entire class of those.
(and before you say: "it's just python!", check out the "other languages" section. It's also Perl, Ruby, PHP, .NET.)
And I am not going to write anything about ambiguities - if you want to serialize an array of points with ("x", "y", "color") properties, and give it to few different XML programmers, each one of them will come up with a different schema. This does not make interoperability easier at all. Compare to JSON, where this can have only one canonical encoding, and the worst you might have to deal with would be some uppercased letters.
JSON is a simplified version of XML, and JSON has copied good XML features (it's not getting namespaces or external DTDs, and good riddance!). So there is no reason to stick to over-complicated technology whose security story is "don't parse user-supplied data".
> But saying "it's just the same as JSON" misses how many dangerous parts and footguns it has.
I didn't say it's just the same as JSON. I said:
"And as long as you’re not allowing users to upload their own XML, then you get to control the schema so there isn’t any risks in using XML."
Every example you've given requires untrusted 3rd parties to craft the XML. But that wasn't what I was advocating here. I was talking specifically about the API returning XML.
> and give it to few different XML programmers, each one of them will come up with a different schema.
But again, the API is controlling the schema so this isn't an issue for the use case I discussed.
> Compare to JSON, where this can have only one canonical encoding, and the worst you might have to deal with would be some uppercased letters.
You've clearly not worked with enough JSON if that's all you think the issue with JSON is. I've written JSON parsers and used a fair few open source ones too. And there's a lot of places things can go wrong:
1. You have number serialization bugs between different JSON parsers.
2. No standard for dates. Causing everyone to do things slightly differently
3. Inconsistencies with top level arrays, some parsers require top level arrays to be `{[ ... ]}` while others are happy just with `[ ... ]`
4. Parsers don't all agree on how to represent non-alpha / numeric ASCII characters. And we're not just talking about unicode, even some ASCII characters like `>` can be handled differently by different JSON libraries
5. Lots of different JSON supersets (because JSON itself doesn't support half the stuff that people need from it), like jsonlines, concatenated json, newline delimited json, JSON with date fields (as seen in popular JSON libraries in .NET), JSON schema, etc.
6. Even your key name example has numerous other inconsistencies you haven't touched on. Like UPPER, lower, dot.notation, hyphenated-keys, underscored_keys, UpperCamelCased, lowerCamelCased...and so many variations in between.
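Points 1 and 2 are easy to reproduce with nothing but Python's standard `json` module. This is only a small illustration, not a survey of parsers: the `parse_int=float` call stands in for any decoder that stores every number as a 64-bit float, and the two date encodings are just two common conventions I picked as examples.

```python
import json
from datetime import datetime, timezone

# 1. Numbers: 2**53 + 1 is a valid JSON number but not an exact float64.
big = 2**53 + 1
text = json.dumps({"id": big})

as_int = json.loads(text)["id"]                     # Python keeps arbitrary-precision ints
as_float = json.loads(text, parse_int=float)["id"]  # imitates a float-only decoder

print("round-trips as int:  ", as_int == big)
print("round-trips as float:", int(as_float) == big)

# 2. Dates: there is no date type, so the encoder refuses and every API
#    has to pick its own convention.
stamp = datetime(2024, 9, 22, 9, 47, tzinfo=timezone.utc)
try:
    json.dumps({"t": stamp})
    date_error = None
except TypeError as exc:
    date_error = str(exc)

iso_style = json.dumps({"t": stamp.isoformat()})         # ISO 8601 string
epoch_style = json.dumps({"t": int(stamp.timestamp())})  # seconds since the epoch

print(date_error)
print(iso_style)
print(epoch_style)
```

Both encodings are "valid JSON", and a consumer cannot tell which one it is getting without out-of-band documentation.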
Ignoring JSON supersets, then I agree that JSON has fewer places for exploits in user generated documents. But the specification is also only 5 (FIVE!!!) pages long and thus it allows for a lot of undefined behaviour. And that's a problem for something whose entire purpose is a database.
This is why XML is so complex -- precisely because it's intended to solve these problems. But it was also intended to be served from trusted identities. Which is where the vulnerabilities lie.
> JSON is a simplified version of XML, and JSON has copied good XML features (it's not getting namespaces or external DTDs, and good riddance!).
To go back to my earlier point: literally no-one is going to argue that XML doesn't have its warts. But what you need to understand is that in the specific example that started this conversation, the API provider is the one defining the schema and crafting the XML. So literally none of your examples apply whatsoever. In fact, this falls squarely under the correct usage of XML.
Context matters. User supplied XML is bad but that's not what is being proposed here. And that's why you're being called out for stating what you believed to be pretty obvious advice.
Protocols that have a response code with an explanation are helpful. A help command is also helpful. So, I wrote an NNTP server that does that, and the IRC and NNTP client software I use can display them.
> Makes me think it's a shame that making a json based protocol is so much easier to whip up ...
I personally don't; I find I can easily work with text-based protocols if the format is easy enough.
I think there are problems with JSON. Some of the problems are: it requires parsing escapes and keys/values, does not properly support character sets other than Unicode, cannot work with binary data unless it is encoded using base64 or hex or something else (which makes it inefficient), etc. There are other problems too.
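The binary-data inefficiency is easy to put a number on: base64 turns every 3 bytes into 4 characters, so the payload grows by a third before JSON adds its own quoting. A small sketch (the 3072-byte payload is arbitrary, chosen only because it is a multiple of 3):

```python
import base64
import json

payload = bytes(range(256)) * 12                  # 3072 bytes of binary data
encoded = base64.b64encode(payload).decode("ascii")
document = json.dumps({"data": encoded})

print("raw bytes:     ", len(payload))
print("base64 chars:  ", len(encoded))
print("JSON document: ", len(document))
print("overhead ratio:", len(encoded) / len(payload))

# Decoding on the other side is one more pass over the data.
restored = base64.b64decode(json.loads(document)["data"])
```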
> Like imagine if in python there was some library that you gave an ebnf spec ...
Maybe it is possible to add such a library in Python, if there is not already such things.
REST (REpresentational State Transfer) as a concept is very human orientated. The idea was a sort of academic abstraction of html. but it can be boiled down to: when you send a response, also send the entire application needed to handle that response. It is unfortunate that collectively we had a sort of brain fart and said "ok, REST == http, got it" and lost the rest of the interesting discussion about what it means to send the representational state of the process.
> in an age of low-size disk drives and expensive software, looking up data over a dedicated protocol seems like a nifty idea.
Then disk size exploded, databases became cheap, and search engines made it easy to look up words.
I love this particular part of history, about how protocols and applications got built around restrictions and then evolved as things improved. Similar examples exist everywhere in computer history. Projecting the same with LLMs, we will have AIs running locally on mobile devices or perhaps AIs replacing OS of mobile devices and router protocols and servers.
In the future, HN people will be looking at code and feeling nostalgic about writing it.
I've been working with other programmers for over 30 years, and the impact of this scarcity mindset has very real implications in behavior today.
In much the same way that someone who grew up with food insecurity views food now, even if the food is now plentiful.
For example, memory and disk space were expensive. So every database field, every variable, was scrutinized for size. Do you need 30 chars for a name? Would 25 do?
In C especially all strings are malloced at run time, predefined strings with max length are supported but not common.
Arguments (today) about the size of the primary key field and so on. Endless angst about "bloat".
I understand that there are cases where size matters. But we've gone from a world where it mattered to everything to a world where it matters in a few edge cases.
Given that all these optimizations come with their own problems it can be hard to break old habits.
It is, until it isn't. This stuff very much matters for games.
There's also a general argument about resource usage, but I think the AI and crypto people have largely won the argument that it's OK to use as much electricity as you want as long as you're making money somehow.
On the basis of GANs or diffusion models turning default skeletons into whatever style you want including photorealism, and that those skeletons can be animated by other AI, and that plots in most* games and porn are worse than the extremely basic stuff the original ChatGPT could provide, I think that's nearly at an end.
When they run in 240 fps on your phone, that's the end. I've seen 5 fps on unknown hardware, so give it a decade at most.
* Exceptions exist, but there aren't enough hours in a year to support even just 23 games like Baldur's Gate 3 even if you're unemployed, and there's more coming out each year than that.
I think they just have the good sense to run away, because even if they fixed things, they would be hated and mocked relentlessly for it. See for example every contemporary blog rediscovering “wow, makefiles just work, and this tech is 50 years old” and the number of contrarian reactions to it in discussion threads. For every dev that values ubiquitous, simple, predictable, and stable solutions.. there’s 3 who want to use something that’s much more complex, and has a shelf life of just a few years before it’s replaced with the new hotness.
There's only so much one can do. Convincing other people that these things matter is hard. Even when you have good evidence (which isn't always easy to come by), people often won't put in the work to keep things small.
I think because of the levels of abstraction in software/platforms, the top layers are bound to have « bloat ». Unless someone (or something) changes things radically across the stack, I think bloat will just increase with each layer.
But on the other hand, for some applications, disk requirements exploded as well and require dedicated protocols and servers for it; for example Google's monorepo, or the latest Flight Simulator, the 2024 version will take up about 23 GB as the base install and stream everything else - including the world, planes, landmarks, live real-world ship and plane locations, etc - off the internet. Because the whole world just won't fit on any regular hard drive.
>Because the whole world just won't fit on any regular hard drive
Except it did. FS2020 had a base level of terrain quality that was installed such that you could play it even if you never connected to the internet. It wasn't Bing Earth quality sure, but it was way better than what you got in FSX thirteen years earlier. It was 200gb.
Which is conveniently about the same size as a single copy of whatever Call of Duty game is currently in vogue.
> Projecting the same with LLMs, we will have AIs running locally on mobile devices
That’s not much of a projection. That’s been announced for months as coming to iPhones. Sure, they’re not the biggest models, but no one doubts more will be possible.
> or perhaps AIs replacing OS of mobile devices and router protocols and servers.
Holy shit, please no. There’s no sane reason for that to happen. Why would you replace stable nimble systems which depend on being predictable and low power with a system that’s statistical and consumes tons of resources? That’s unfettered hype-chasing. Let’s please not turn our brains off just yet.
What is an AI? Just an LLM? What do you call Siri, Cortana, and Google Assistant? I've already received a Gemini app on Android, and they're promoting the premium version too. Runs locally, yes?
30 years ago, my supervisor wrote, from scratch, an "AI" running on the company web server (HP/UX PA-RISC; 32-64MB RAM), that would heuristically detect and block suspected credit-card fraud. Remember that "AI" is a perennial buzzword with fluid definitions, both an achievable goal right now, and a holy grail. ¡Viva Eliza!
> I've already received a Gemini app on Android, and they're promoting the premium version too. Runs locally, yes?
Are you asking me? I don’t know nor do I care, I don’t use Android or Gemini in any capacity.
> Remember that "AI" is a perennial buzzword with fluid definitions
You don’t need to tell me that. I didn’t use the term “AI”, I just quoted the other post and responded in what I understood to be their terms. I don’t think LLMs are intelligent, thus not AI. You’re nitpicking the wrong person.
You are going with assumption that llms will remain same forever. There are attempts everywhere to make them smaller, efficient and more focused.
Would you have called « transferring mails online « in the 90s is hype-chasing because our postal system was working great? Probably not a great analogy but you get the point
> You are going with assumption that llms will remain same forever.
I definitely am not and that’s stated directly in my post. I specifically said “but no one doubts more will be possible”.
> There are attempts everywhere to make them smaller, efficient and more focused.
That doesn’t matter when the issue is inherent. An LLM, by definition, needs large amounts of data and acts on them probabilistically. If you change that, it’s no longer an LLM. Any system that requires predictability needs to be programmed with rules we can understand, test, and reproduce reliably. LLMs ain’t it, and won’t ever be. Something else by a different name, maybe.
> Would you have called « transferring mails online « in the 90s is hype-chasing because our postal system was working great? Probably not a great analogy but you get the point
I understand analogies are never perfect, but that’s a particularly bad one. At least stay within the same realm of physicality. You made a bad comparison then ascribed a bad argument to be against it. That’s a straw man. I don’t even believe our current digital system is “working great”, so the analogy fails on multiple levels.
There are bad programmers and bad technology everywhere, the world is barely held by proverbial spit and bubblegum. Yet that doesn’t mean any crap that’s invented afterwards, be it “web3” or LLMs are immediately the solution.
You are missing the whole point of the comment. It's a prediction/speculation. I agree that it may seem bad with the current state of things, but no one can say with certainty it WILL be bad. There might be hybrid models which can do both probabilistic AND heuristic work within the scope, you never know. That’s my whole point.
Imagine if dict://internet was renamed to agent://source, then agentic calls to model sources could interconnect with ease. With HTTP/3, one stream could be the streaming browser and the other streams could be multi-agent sessions.
I was thinking along the same lines, but improvements toward running local models are right around the corner. Unless someone has a specific use case or proprietary models, this may not make sense. But in the current situation, that would be a great way to standardise communication between various LLMs and clients.
Not right now, but we are seeing accelerating trends in generated code. How long would it take for regular use cases to be completely AI generated on the go? Edge cases exist everywhere.
It seems grossly inefficient for coding to go from NL prompt -> LLM -> formal language -> compile/execute. I'd envision AIs that would construct apps directly from the prompts, without the fragile middleware.
In fact, if it is optimal coding to interface existing libraries and frameworks, with minimal novel code, then just go full LEGO and pull in those dependencies, minimizing the error-prone originality.
I recently began testing my own public `dictd` server. The main goal was to make the OED (the full and proper one) available outside of a university proxy. I figured I would add the Webster's 1913 one too.
Unfortunately the vast majority of dictionary files are in "stardict" format and the conversion to "dict" has yielded mixed results. I was hoping to host _every_ dictionary, good and bad, but will walk that back now. A free VPS could at least run the OED.
No, it's not possible to do in a legal fashion I don't think. The existing methods require a browser-based portal that requires you to be logged in with your university proxy.
As an alumnus I could do this by showing up in person to my university and accessing that way. But I'm not going to.
what's the stardict format? which edition of the oed are you hosting? i scanned the first edition decades ago but i don't think there's a reasonable plain-text version of it yet
StarDict (a program/file format) is easily googlable. A bit of a rabbit hole is that it's been chased around hosting providers because its site (used to) offer downloads of copyrighted dictionaries, including the OED 2nd edition. I don't know how these files were originally obtained or produced. See: https://web.archive.org/web/20230718140437/http://download.h...
Edit to add: Also, "i scanned the first edition decades ago" sounds like quite a story. 13 volumes? What project were you doing?
At least one of the illegal copies was supposedly converted using a version of "oed2dict": https://njw.name/oed2dict/ which creates various formats.
That takes HTML files as input, and I don't know where those are found. ISOs of CD-ROM editions from 1996 and 2009 are online, but it looks like an adventure to install the software and/or extract the data.
The trouble with piracy is that provenance is so shaky, with plenty of chance for bugs or alterations...
Wow, either I've forgotten this existed, or had no clue, I was around for this era, and I remember Veronica, Archie, WAIS, Gopher, etc, but never recall reading about a Dict protocol, nice find!
I've been aware of dict for a while since I wrapped up an esperanto to english dictionary for KOReader in a format KOReader could understand. What I'd really have liked is a format like this:
Oh yes, I remember dictionary servers. Also many other protocols.
What happened to all of those other protocols? Everything got squished onto http(s) for various reasons. As mentioned in this thread, corporate firewalls blocking every other port except 80 and 443. Around the time of the invention of http, protocols were proliferating for all kinds of new ideas. Today "innovation" happens on top of http, which devolves into some new kind of format to push back and forth.
I wouldn't place all the blame on corporate IT for low level protocols dying out. A lot of corporate IT filtering was a reaction to malicious traffic originating from inside their networks.
I think filtering on university networks killed more protocols than corporate filtering. Corporate networks were rarely the place where someone stuck a server in the corner with a public IP hosting a bunch of random services. That however was very common in university networks.
When university networks (early 00s or so) started putting NAT on ResNets and filtering faculty networks is when a lot of random Internet servers started drying up. Universities had huge IPv4 blocks and would hand out their addresses to every machine on their networks. More than a few Web 1.0 companies started life on a random Sun machine in dorm rooms or the corner of a university computer lab.
When publicly routed IPs dried up so did random FTPs and small IRC servers. At the same time residential broadband was taking off but so were the sales of home routers with NAT. Hosting random raw socket protocols stopped being practical for a lot of people. By the time low cost VPSes became available a lot of old protocols had already died out.
dict is a must for me on any daily driver. Removing the friction of opening a web browser or specialized app and just looking up text from the terminal is just so nice. Just like bc, you don't miss it when you don't know it's there, but once you get used to using it you can't live without. Making custom dictionaries is not very well documented though.
I love dict/dictd but I had an issue using it in hostile networks that block the port/protocol.
I've been tempted to revamp dict/dictd to shovel the dict protocol over websockets so I can use it over the web. Just one of those ideas in the pipeline that I haven't revisited because I'm no longer dealing with that hostile network.
The dict protocol really shows its age, notably the stateful connection part. Having a new protocol based on HTTP and JSON, similar to LSP, would be nice, but there is no real interest. (I made and used my own nonetheless in a research project. It may even be deployed but deactivated in another one.)
The biggest issue isn't technical; it's the fact that organizations holding dictionary data don't want third parties to interact with it without paid licensing.
I hate the fact that corporate IT collectively decided to block every port except 80 and 443, making it necessary to base new protocols on HTTP instead of TCP/IP.
Doesn't HTTP require binary data to be converted to base64 encoding, thereby increasing its size on the wire?
That seems suboptimal for a lot of use cases
dict and the relevant dictionaries are things i pretty much always install on every new laptop. gcide in particular includes most of the famous 1913 webster dictionary with its sparkling prose:
: ~; dict glisten
2 definitions found
From The Collaborative International Dictionary of English v.0.48 [gcide]:
Glisten \Glis"ten\ (gl[i^]s"'n), v. i. [imp. & p. p.
{Glistened}; p. pr. & vb. n. {Glistening}.] [OE. glistnian,
akin to glisnen, glisien, AS. glisian, glisnian, akin to E.
glitter. See {Glitter}, v. i., and cf. {Glister}, v. i.]
To sparkle or shine; especially, to shine with a mild,
subdued, and fitful luster; to emit a soft, scintillating
light; to gleam; as, the glistening stars.
Syn: See {Flash}.
[1913 Webster]
it's interesting to think about how you would implement this service efficiently under the constraints of mid-01990s computers, where a gigabyte was still a lot of disk space and multiuser unix servers commonly had about 100 mips (https://netlib.org/performance/html/dhrystone.data.col0.html)
totally by coincidence i was looking at the dictzip man page this morning; it produces gzip-compatible files that support random seeks so you can keep the database for your dictd server compressed. (as far as i know, rik faith's dictd is still the only server implementation of the dict protocol, which is incidentally not a very good protocol.) you can see that the penalty for seekability is about 6% in this case:
nowadays computers are fast enough that it probably isn't a big win to gzip in such small chunks (dictzip has a chunk limit of 64k) and you might as well use a zipfile, all implementations of which support random access:
: ~; mkdir jargsplit
: ~; cd jargsplit
: jargsplit; gzip -dc /usr/share/dictd/jargon.dict.dz|split -b256K
: jargsplit; zip jargon.zip xaa xab xac xad xae xaf
adding: xaa (deflated 60%)
adding: xab (deflated 59%)
adding: xac (deflated 59%)
adding: xad (deflated 61%)
adding: xae (deflated 62%)
adding: xaf (deflated 58%)
: jargsplit; ls -l jargon.zip
-rw-r--r-- 1 user user 565968 Sep 22 09:47 jargon.zip
: jargsplit; time unzip -o jargon.zip xad
Archive: jargon.zip
inflating: xad
real 0m0.011s
user 0m0.000s
sys 0m0.011s
so you see 256-kibibyte chunks have submillisecond decompression time (more like 2 milliseconds on my cellphone) and only about a 1.8% size penalty for seekability:
: jargsplit; units -t 565968/556102 %
101.77413
and, unlike the dictzip format (which lists the chunks in an extra backward-compatible file header), zip also supports efficient appending
even in python (3.11.2) it's only about a millisecond:
In [12]: import zipfile
In [13]: z = zipfile.ZipFile('jargon.zip')
In [14]: [f.filename for f in z.infolist()]
Out[14]: ['xaa', 'xab', 'xac', 'xad', 'xae', 'xaf']
In [15]: %timeit z.open('xab').read()
1.13 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
this kind of performance means that any algorithm that would be efficient reading data stored on a conventional spinning-rust disk will be efficient reading compressed data if you put the data into a zipfile in "files" of around a meg each. (writing is another matter; zstd may help here, with its order-of-magnitude faster compression, but info-zip zip and unzip don't support zstd yet.)
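Here is a rough sketch of that "zipfile of chunks" layout using only the standard library. The 256 KiB chunk size mirrors the `split -b256K` run above; the six-digit member names and the function names `build_chunked_zip` and `read_range` are my own choices for illustration, not anything dictd or dictzip does.

```python
import io
import zipfile

CHUNK = 256 * 1024  # same chunk size as the split -b256K example


def build_chunked_zip(data: bytes, chunk: int = CHUNK) -> bytes:
    """Store data as numbered, individually deflated members of a zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for i in range(0, len(data), chunk):
            z.writestr(f"{i // chunk:06d}", data[i:i + chunk])
    return buf.getvalue()


def read_range(archive: bytes, offset: int, length: int, chunk: int = CHUNK) -> bytes:
    """Read length bytes at offset, inflating only the chunks that overlap the range."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    out = bytearray()
    with zipfile.ZipFile(io.BytesIO(archive)) as z:
        for n in range(first, last + 1):
            out += z.read(f"{n:06d}")
    start = offset - first * chunk
    return bytes(out[start:start + length])


if __name__ == "__main__":
    data = b"the quick brown fox jumps over the lazy dog\n" * 16000
    archive = build_chunked_zip(data)
    # This range straddles the boundary between the first and second chunk.
    print(read_range(archive, CHUNK - 4, 10))
```

A read touches at most the chunks its range overlaps, so the cost of a lookup is bounded by the chunk size rather than the size of the whole dictionary.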
dictd keeps an index file in tsv format which uses what looks like base64 to locate the desired chunk and offset in the chunk:
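As a sketch of how such an index line could be decoded: this assumes each line is headword, offset, and length separated by tabs, with the two numbers written as base64 digits (standard alphabet, most significant digit first) rather than as base64-encoded bytes. That is my understanding of the format and should be checked against the dictd sources; the sample line in the demo is made up.

```python
import string

# Digit alphabet assumed to be the standard base64 ordering: A=0 ... /=63.
B64_DIGITS = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
VALUE = {ch: i for i, ch in enumerate(B64_DIGITS)}


def b64_number(text: str) -> int:
    """Interpret text as a base-64 positional number, most significant digit first."""
    n = 0
    for ch in text:
        n = n * 64 + VALUE[ch]
    return n


def parse_index_line(line: str) -> tuple[str, int, int]:
    """Return (headword, offset, length) from one tab-separated index line."""
    fields = line.rstrip("\n").split("\t")
    headword, offset, length = fields[:3]
    return headword, b64_number(offset), b64_number(length)


if __name__ == "__main__":
    # Hypothetical line, not copied from a real index file.
    print(parse_index_line("glisten\tBA\tQ\n"))
```

Under the same assumption, the chunk to inflate would be the decoded offset divided by the dictzip chunk size, with the remainder as the position inside that chunk.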
strfile from the fortune package works on a similar principle but uses a binary data file and no keys, just offsets:
: ~; wget -nv canonical.org/~kragen/quotes.txt
2024-09-22 10:44:50 URL:http://canonical.org/~kragen/quotes.txt [49884/49884] -> "quotes.txt" [1]
: ~; strfile quotes.txt
"quotes.txt.dat" created
There were 87 strings
Longest string: 1625 bytes
Shortest string: 92 bytes
: ~; fortune quotes.txt
Get enough beyond FUM [Fuck You Money], and it's merely Nice To Have
Money.
-- Dave Long, <dl@silcom.com>, on FoRK, around 2000-08-16, in
Message-ID <200008162000.NAA10898@maltesecat>
: ~; od -i --endian=big quotes.txt.dat
0000000 2 87 1625 92
0000020 0 620756992 0 933
0000040 1460 2307 2546 3793
0000060 3887 4149 5160 5471
0000100 5661 6185 6616 7000
of course if you were using a zipfile you could keep the index in the zipfile itself, and then there's no point in using base64 for the file offsets, or limiting them to 32 bits
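For reference, the `od` dump of `quotes.txt.dat` above can be read with `struct`. This sketch assumes the layout the dump suggests: six big-endian 32-bit header fields (version, string count, longest, shortest, flags, and a delimiter byte padded to four bytes) followed by a table of 32-bit offsets. The field names are mine, and other strfile builds may use different field widths.

```python
import struct

# version, count, longest, shortest, flags, then delimiter byte plus 3 pad bytes
HEADER = struct.Struct(">5I4s")


def parse_strfile(dat: bytes) -> dict:
    """Decode the header and offset table of a strfile-style .dat file."""
    version, count, longest, shortest, flags, delim = HEADER.unpack_from(dat, 0)
    n_offsets = (len(dat) - HEADER.size) // 4
    offsets = struct.unpack_from(f">{n_offsets}I", dat, HEADER.size)
    return {
        "version": version,
        "count": count,
        "longest": longest,
        "shortest": shortest,
        "flags": flags,
        "delimiter": delim[:1],
        "offsets": offsets,
    }


if __name__ == "__main__":
    # Rebuild the first 40 bytes shown in the od output and decode them.
    sample = struct.pack(">6I", 2, 87, 1625, 92, 0, 620756992)
    sample += struct.pack(">4I", 0, 933, 1460, 2307)
    print(parse_strfile(sample))
```

Read this way, the odd-looking 620756992 is just the `%` delimiter (0x25) in the top byte of a 32-bit word, and 87, 1625 and 92 match the counts strfile printed.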
There was also a translation server called Babylon that used a similar raw text protocol (like WHOIS, and DICT here) in 1998. I remember adding it to my IRC script, but it must have stopped working at some point, because I replaced it with "babelfish.altavista.com" :)
Manual build with an explicit `--disable-dict` perhaps? Because it's not Debian, Fedora, Homebrew, Nix, Alpine, Arch, or Gentoo, judging by their package definitions.
Ah, it appears that curl-minimal became the default curl for Fedora recently. curl-full has to be installed for full functionality. I initially ignored it because I assumed the default was curl-full.
2. Try to type 'dict' into a terminal, on the off chance there's a command-line tool with the same name (would you do this for https:// and expect the same outcome?)
3. Be running a distribution that modifies the user's shell environment to suggest packages related to unknown commands
4. Actually install and run that command
5. Be running tcpdump or wireshark at the same time to notice that the `dict` command is reaching out to the network, as opposed to doing some sort of local lookup in /usr/share/dict
6. Figure out from the network traffic that the tool is using a dictionary-specific protocol as opposed to just making an HTTP request to dictionary.com or whatever.
--
Nah, the only way someone would know (or even suspect!) that dict:// is somehow related to an ancient Unix command-line tool is prior knowledge, and it's unreasonable to expect the article author to have somehow intuited such an idea.
This is not old code. Old code did not have dialer (tel:) urls. Given the timing, the dict it refers to is also not the original one, but the Safari link scheme.