HTTP: , FTP:, and Dict:? (shkspr.mobi)
425 points by edent 60 days ago | 124 comments



Nowadays, on macOS, "dict://Internet" will open the Dictionary app with the query "Internet". (Probably behind a security prompt.) Not sure if there's similar functionality on other operating systems.


What do you mean by "behind a security prompt"?


Browsers generally throw up a prompt before opening an external app


I tried it and firefox came up with a prompt confirming I wanted to open 'internet' in 'Dictionary'


> firefox came up with a prompt confirming I wanted to open 'internet' in 'Dictionary'

My Ffx 131.0b9 wasn't so adept. It gave me:

   The address wasn’t understood

    Firefox doesn’t know how to open this address, because one of the following protocols (dict) isn’t associated with any program or is not allowed in this context.

    You might need to install other software to open this address.


Mine just opened the default search engine with the results, maybe it's how I have the address bar configured?


The behavior may also vary depending on whether it's an actual link in a document or direct input to the location bar. (I get different prompts on Firefox, while both will forward to the built-in system dictionary.)

And, of course, any further configurations may alter this, e.g., another service may have been registered for this protocol.


Perhaps it only does that on Macs?


I admire these old protocols that are intentionally built to be usable both by machines and humans. Like the combination of a response code, and a human readable explanation. A help command right in the protocol.

Makes me think it's a shame that making a json based protocol is so much easier to whip up than a textual ebnf-specified protocol.

Like imagine if in python there was some library that you gave an ebnf spec, maybe with some extra features similar to regex (named groups?), and you could compile it to a state machine and use it to parse documents, getting a Dict out.
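
Something close to this already exists as third-party libraries. Here's a rough sketch with Lark (mentioned further down the thread), using a made-up EBNF-ish grammar for a DICT-style status line; the grammar and response text are illustrative, not the real protocol:

    from lark import Lark

    # made-up grammar: a three-digit code, a space, then free text
    parser = Lark(r"""
        response: CODE " " TEXT
        CODE: /[0-9]{3}/
        TEXT: /[^\r\n]+/
    """, start="response")

    tree = parser.parse("220 dict.dict.org dictd ready")
    code, text = tree.children
    print({"code": int(code), "text": str(text)})
    # -> {'code': 220, 'text': 'dict.dict.org dictd ready'}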


> Makes me think it's a shame that making a json based protocol is so much…

Maybe I'm not the human you are thinking of, being a techie, but I find a well-structured JSON response, as long as it isn't overly verbose and is presented in open form rather than minified, to be a good compromise between human readable and easy to digest programmatically.


Legibility is probably one of the main reasons JSON got adopted. XML can be made to not look too bad, but SOAP made it downright unreadable, so everybody was looking for a way to fix this.


XML has a sweet advantage though. It can be styled in the browser. For example a sitemap that works for Google, OpenAI, &c. and is human readable looking like a web page.

Example:

https://www.wpbeginner.com/sitemap.xml


This is fun, but in reality, not all that useful.

Except you can impress other nerds when they "view source" and there's not an HTML tag in sight.


XML is a language designed for markup (i.e. text formatting), and fitting structured data into it creates an impedance mismatch.

Dictionary definitions may be considered marked-up documents, so that part may work. The overall structure of the dictionary is not.


Yeah, but JSON has the vulnerability that it will tend to become like an overgrown garden over time, because it can. The kind of TCP <code> <response> <text> protocols the GP talks about benefit from their restrictions and are definitely a better balance (human / machine), and more stable.


Unfortunately, in practice they are a nightmare. Look at the WHOIS protocol for an example.

Humans don’t look at responses very much, so you should optimise for machines. If you want a human-readable view, then turn the JSON response into something readable.


The WHOIS protocol has no grammar to speak of (the response body is not defined at all, just a text blob), which makes it a nightmare to parse. Having a proper response format would solve this.

Though I agree, I prefer my responses in JSON.


If you want WHOIS with a better defined format we have a protocol for that! It's called RDAP https://en.wikipedia.org/wiki/Registration_Data_Access_Proto...
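
For example, a lookup through the rdap.org bootstrap redirector comes back as structured JSON over HTTPS. A rough sketch (the field names are standard RDAP, but treat the details as illustrative):

    import json, urllib.request

    with urllib.request.urlopen("https://rdap.org/domain/example.com") as resp:
        data = json.load(resp)

    # a couple of the structured fields RDAP defines
    print(data.get("ldhName"))
    print([e.get("eventAction") for e in data.get("events", [])])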


The next logical step is to use a machine-friendly format instead; that is, a binary protocol.

Even HTML and XML, which were designed for readability and manual writing, eventually became "not usable enough" ("became" because I think part of it is that their success exposed them to less technical populations), and now we have Markdown everywhere, which most of the time is converted to HTML.

So if you are going to use a tool more sophisticated than Ed/Edlin to read and write (rich) text in a certain format, it could be more efficient to focus on making the job of the machine (and of the programmer) easier.

If you look at a binary protocol such as NTP, the binary format leaves very little room for Postel's principle [1], so it is straightforward to make a program that queries a server and displays the result.

[1] https://en.wikipedia.org/wiki/Robustness_principle
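
To make that concrete, here's a rough SNTP client sketch in Python: the request is a fixed 48-byte packet and the reply's fields sit at fixed offsets, so there's almost nothing to be liberal about (server name and details are just for illustration):

    import socket, struct, time

    NTP_EPOCH_OFFSET = 2208988800      # seconds between 1900-01-01 and 1970-01-01

    packet = b"\x1b" + 47 * b"\0"      # LI=0, VN=3, Mode=3 (client); rest zeroed
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(5)
        s.sendto(packet, ("pool.ntp.org", 123))
        data, _ = s.recvfrom(48)

    seconds = struct.unpack("!I", data[40:44])[0]   # Transmit Timestamp, seconds part
    print(time.ctime(seconds - NTP_EPOCH_OFFSET))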


maybe we could have a format that was more human-readable than json (or especially xml) but still reliably emittable and parseable? yaml, maybe, or toml, although i'm not that enthusiastic about them. another proposal for such a thing was ogdl (https://ogdl.org/), a notation for what i think are called rose trees

> OGDL: Ordered Graph Data Language

> A simple and readable data format, for humans and machines alike.

> OGDL is a structured textual format that represents information in the form of graphs, where the nodes are strings and the arcs or edges are spaces or indentation.

their example:

    network
      eth0
        ip   192.168.0.10
        mask 255.255.255.0
        gw   192.168.0.1

    hostname crispin
another possibility is jevko; https://jevko.org/ describes it and http://canonical.org/~kragen/rose/ are some of my notes about the possibilities of similar rose-tree data formats


Formats like TOML are horrible for heavily nested data (even XML does a better job here) and the last time I checked, TOML didn’t support arrays at the top level.

YAML is nicer than JSON to write, but I wouldn’t say it’s any nicer to read.

If you want something that’s less punctuation heavy, then I’d prefer we go full Wirth and have something more akin to Pascal.


arrays at the top level are probably a bad idea for protocols that need to evolve in a backward-compatible way

what do you mean about heavily nested data? do the other formats i linked do a better job there?

i'm not sure it's possible to come up with a data format that will work well for such a wide range of use cases, but it sure would be nice to have. json is pretty great in terms of being able to load it into the browser, or visidata, or python, or js, or whatever


> arrays at the top level are probably a bad idea for protocols that need to evolve in a backward-compatible way

Depends on the protocol. It might be preferable to version the endpoint. Or if it’s a specific function, e.g. “list-synonyms”, then having a dictionary just to reference an array could be argued as unnecessary protocol bloat. Particularly given the aim of this exercise is readability.

> what do you mean about heavily nested data? do the other formats i linked do a better job there?

I mean a tree like structure.

JSON and YAML are probably the best in class here. XML, for all of its warts, is good at handling nested data in a readable way too.

TOML was more based around a flatter structure.

> i'm not sure it's possible to come up with a data format that will work well for such a wide range of use cases,

It’s not. The moment that happens, that format then becomes unwieldy and people then feel the urge to invent yet another new format to simplify things. It’s a vicious circle that happens over and over again in the tech sector.


it's a pretty big problem that connecting existing software together so often requires writing new parsers


That suggests things are getting worse but personally I’ve seen the opposite trend.

These days developers are rallying around a subset of established standards rather than inventing new protocols and grammar for each new service.

Take a look at the old protocols out there: finger, DNS, Gopher, HTTP, FTP, SMTP, DICT, etc. They all have their own grammar and in many cases, even that grammar is very loosely defined or subject to dozens of different standards. Whereas these days it’s mostly JSON or XML over HTTPS. Or ProtoBuf if you need something more compact.

There’s definitely still room for improvement. For example the shift towards proprietary messaging protocols like Slack, Discord, etc. But that’s another topic entirely.


yeah, i appreciate the move to html, http, and json. although http/2 and http/3 arguably aren't really http, and scraping data out of html is ridiculously time-wasting. the shift toward cloudflare and secret criteria for blocking users whose sessions act "atypical" are also huge problems, but that's sort of what you'd expect from using software running on remote servers you can't control


In my department's internal framework (we were formerly our own company), throwing .html on the end of any JSON response outputs it as nested HTML tables. I personally find it very helpful.
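
The general idea is simple enough to sketch (this isn't their framework, just an illustration of rendering arbitrary JSON as nested tables):

    import json
    from html import escape

    def to_html(value):
        # objects and arrays become tables; scalars are printed as JSON literals
        if isinstance(value, dict):
            rows = "".join(f"<tr><th>{escape(str(k))}</th><td>{to_html(v)}</td></tr>"
                           for k, v in value.items())
            return f"<table>{rows}</table>"
        if isinstance(value, list):
            rows = "".join(f"<tr><td>{to_html(v)}</td></tr>" for v in value)
            return f"<table>{rows}</table>"
        return escape(json.dumps(value))

    print(to_html({"word": "glisten", "senses": ["to sparkle", "to gleam"]}))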


At that point you might as well drop JSON altogether and use an XHTML subset so your rendered output is also valid XML (instead of having two different and incompatible markups merged together)


I’m assuming they use the .html trick for human reading of the data by developers rather than it being used in production


That's exactly what it's for


Ahhh that makes more sense.


That sounds like a bad idea. Unlike JSON, XML is (1) non-trivial to parse safely, (2) hard to reliably extract information from, and (3) verbose.

Leave the markup languages for intended purpose: text markup. Don't force them to carry data.


I’m not generally a fan of XML either but what you posted there is just factually incorrect in just about every conceivable way.

1. There’s plenty of XML parsers already available for most languages. Yeah, there have been high-profile exploits stemming from XML, but given the scale of XML’s usage, it’s fair to say those exploits are atypical usage where XML can be user supplied. And as long as you’re not allowing users to upload their own XML, then you get to control the schema, so there aren’t any risks in using XML.

2. XML’s entire purpose is to be a data store. I’m not someone who likes to blame the developers for using their tools wrong but honestly, if you can’t unmarshal an XML schema you have control over then you’re not going to succeed with JSON either.

3. It is. But it’s also highly compressible because of its repetitive tags. So for HTTP endpoints, it actually doesn’t work out any different to JSON.

> Leave the markup languages for intended purpose: text markup. Don't force them to carry data.

You do realise the entire point of XML is to carry data? It might have fallen out of favour in recent years but those of us old enough to remember a time before JSON will talk about how JSON is just a simplified reimplementation of XML. And with things like JSON schemas, JSON is continuing to copy XML features.


Yes, XML is a better structured format than the things before it (random non-standardized text formats). But saying "it's just the same as JSON" misses how many dangerous parts and footguns it has.

There was a post less than two weeks ago on defusedxml [0] - an XML parser with protection from various XML "features" - small files causing exponential blow-up, remote access from just trying to parse the XML... It's not related to the "scale of XML's usage", and those are not security bugs. It's "working as designed", because XML is full of weird features that maybe sounded great in 1998 but now just add vulnerabilities. JSON has eliminated that entire class of problems. (And before you say "it's just Python!", check out the "other languages" section. It's also Perl, Ruby, PHP, .NET.)

And I am not going to write anything about ambiguities - if you want to serialize an array of points with ("x", "y", "color") properties and give it to a few different XML programmers, each one of them will come up with a different schema. That does not make interoperability easier at all. Compare to JSON, where this can have only one canonical encoding, and the worst you might have to deal with would be some uppercased letters.

JSON is a simplified version of XML, and JSON has copied good XML features (it's not getting namespaces or external DTDs, and good riddance!). So there is no reason to stick to over-complicated technology whose security story is "don't parse user-supplied data".

[0] https://news.ycombinator.com/item?id=41523098


> But saying "it's just the same as JSON" misses how many dangerous parts and footguns it has.

I didn't say it's just the same as JSON. I said:

"And as long as you’re not allowing users to upload their own XML, then you get to control the schema so there isn’t any risks in using XML."

Every example you've given requires untrusted 3rd parties to craft the XML. But that wasn't what I was advocating here. I was talking specifically about the API returning XML.

> and give it to few different XML programmers, each one of them will come up with a different schema.

But again, the API is controlling the schema so this isn't an issue for the use case I discussed.

> Compare to JSON, where this can have only one canonical encoding, and the worst you might have to deal with would be some uppercased letters.

You've clearly not worked with enough JSON if that's all you think the issue with JSON is. I've written JSON parsers and used a fair few open source ones too. And there's a lot of places things can go wrong:

1. You have number serialization bugs between different JSON parsers.

2. No standard for dates, causing everyone to do things slightly differently.

3. Inconsistencies with top level arrays, some parsers require top level arrays to be `{[ ... ]}` whiles others are happy just with `[ ... ]`

4. Parsers don't all agree on how to represent non-alphanumeric ASCII characters. And we're not just talking about Unicode: even some ASCII characters like `>` can be handled differently by different JSON libraries.

5. Lots of different JSON supersets (because JSON itself doesn't support half the stuff that people need from it), like jsonlines, concatenated json, newline delimited json, JSON with date fields (as seen in popular JSON libraries in .NET), JSON schema, etc.

6. Even your key name example has numerous other inconsistencies you haven't touched on. Like UPPER, lower, dot.notation, hyphenated-keys, underscored_keys, UpperCamelCased, lowerCamelCased...and so many variations in between.

Ignoring JSON supersets, then I agree that JSON has fewer places for exploits in user-generated documents. But the specification is also only 5 (FIVE!!!) pages long and thus it allows for a lot of undefined behaviour. And that's a problem for something whose entire purpose is to be a data store.

This is why XML is so complex -- precisely because it's intended to solve these problems. But it was also intended to be served from trusted identities. Which is where the vulnerabilities lie.

> JSON is a simplified version of XML, and JSON has copied good XML features (it's not getting namespaces or external DTDs, and good riddance!).

I wouldn't be so sure about that: https://json-schema.org/specification

---

To go back to my earlier point: literally no-one is going to argue that XML doesn't have its warts. But what you need to understand is that in the specific example that started this conversation, the API provider is the one defining the schema and crafting the XML. So literally none of your examples apply whatsoever. In fact, this falls squarely under the correct usage of XML.

Context matters. User supplied XML is bad but that's not what is being proposed here. And that's why you're being called out for stating what you believed to be pretty obvious advice.


> I admire these old protocols ...

Protocols that have a response code with an explanation are helpful. A help command is also helpful. So, I had written an NNTP server that does that, and the IRC and NNTP client software I use can display them.

> Makes me think it's a shame that making a json based protocol is so much easier to whip up ...

I personally don't; I find I can easily work with text-based protocols if the format is easy enough.

I think there are problems with JSON. Some of the problems are: it requires parsing escapes and keys/values, does not properly support character sets other than Unicode, cannot work with binary data unless it is encoded using base64 or hex or something else (which makes it inefficient), etc. There are other problems too.
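
The binary-data overhead is easy to see, roughly a third extra on the wire (quick illustration):

    import base64, json, os

    blob = os.urandom(3000)                        # some arbitrary binary data
    wrapped = json.dumps({"data": base64.b64encode(blob).decode("ascii")})
    print(len(blob), len(wrapped))                 # 3000 vs. roughly 4000-odd bytes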

> Like imagine if in python there was some library that you gave an ebnf spec ...

Maybe it is possible to add such a library in Python, if such a thing does not already exist.


REST (REpresentational State Transfer) as a concept is very human orientated. The idea was a sort of academic abstraction of HTML, but it can be boiled down to: when you send a response, also send the entire application needed to handle that response. It is unfortunate that collectively we had a sort of brain fart and said "ok, REST == HTTP, got it" and lost the rest of the interesting discussion about what it means to send the representational state of the process.


May I humbly submit “parsing expression grammars”[1] for your consideration?

Fairly simple and somewhat fun. Python's own parser is PEG-based these days, and there are also the pyparsing and parsimonious modules.

I have built EDI X12 parsers and toy languages with this.

[1] https://en.wikipedia.org/wiki/Parsing_expression_grammar
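
A tiny taste of parsimonious, with a made-up grammar (illustrative only):

    from parsimonious.grammar import Grammar

    grammar = Grammar(r"""
        entry  = word "=" number
        word   = ~"[a-z]+"
        number = ~"[0-9]+"
    """)
    print(grammar.parse("answer=42"))   # prints the matched parse tree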


Also lark in Python too


textual ebnf-specified protocol > json


> in an age of low-size disk drives and expensive software, looking up data over a dedicated protocol seems like a nifty2 idea. Then disk size exploded, databases became cheap, and search engines made it easy to look up words.

I love this particular part of history, about how protocols and applications got built based on restrictions and then evolved as things improved. Similar examples exist everywhere in computer history. Projecting the same with LLMs, we will have AIs running locally on mobile devices, or perhaps AIs replacing the OS of mobile devices and router protocols and servers.

In the future, HN people will look at code and feel nostalgic about writing it.


I've been working with other programmers for over 30 years, and this scarcity mindset has very real implications for behavior today.

In much the same way that someone who grew up with food insecurity views food now, even if the food is now plentiful.

For example, memory and disk space were expensive. So every database field, every variable, was scrutinized for size. Do you need 30 chars for a name? Would 25 do?

In C especially, all strings are malloc'ed at run time; predefined strings with a max length are supported but not common.

Arguments (today) about the size of the primary key field and so on. Endless angst about "bloat".

I understand that there are cases where size matters. But we've gone from a world where it mattered to everything to a world where it matters in a few edge cases.

Given that all these optimizations come with their own problems it can be hard to break old habits.


It is, until it isn't. This stuff very much matters for games.

There's also a general argument about resource usage, but I think the AI and crypto people have largely won the argument that it's OK to use as much electricity as you want as long as you're making money somehow.


Gaming and porn are the main reasons for a lot of advancements in tech, actually. They are always on the cutting edge and always need more resources.


Given that GANs or diffusion models can turn default skeletons into whatever style you want, including photorealism, that those skeletons can be animated by other AI, and that plots in most* games and porn are worse than the extremely basic stuff the original ChatGPT could provide, I think that's nearly at an end.

When they run in 240 fps on your phone, that's the end. I've seen 5 fps on unknown hardware, so give it a decade at most.

* Exceptions exist, but there aren't enough hours in a year to support even just 23 games like Baldur's Gate 3 even if you're unemployed, and there are more coming out each year than that.


I wish some of those old timers who care about bloat could work their magic on the web development ecosystem


I think they just have the good sense to run away, because even if they fixed things, they would be hated and mocked relentlessly for it. See for example every contemporary blog rediscovering “wow, makefiles just work, and this tech is 50 years old” and the number of contrarian reactions to it in discussion threads. For every dev that values ubiquitous, simple, predictable, and stable solutions, there are 3 who want to use something that’s much more complex and has a shelf life of just a few years before it’s replaced with the new hotness.


There's only so much one can do. Convincing other people that these things matter is hard. Even when you have good evidence (which isn't always easy to come by), people often won't put in the work to keep things small.


I think because of the levels of abstraction of software/platforms, top layers are bound to have « bloat ». Unless someone (or something) changes things radically across the stack, I think bloat will just increase for each layer.


But on the other hand, for some applications, disk requirements exploded as well and need dedicated protocols and servers; for example Google's monorepo, or the latest Flight Simulator: the 2024 version will take up about 23 GB as the base install and stream everything else - including the world, planes, landmarks, live real-world ship and plane locations, etc - off the internet. Because the whole world just won't fit on any regular hard drive.


>Because the whole world just won't fit on any regular hard drive

Except it did. FS2020 had a base level of terrain quality that was installed such that you could play it even if you never connected to the internet. It wasn't Bing Earth quality, sure, but it was way better than what you got in FSX thirteen years earlier. It was 200 GB.

Which is conveniently about the same size as a single copy of whatever Call of Duty game is currently in vogue.


> Projecting the same with LLMs, we will have AIs running locally on mobile devices

That’s not much of a projection. That’s been announced for months as coming to iPhones. Sure, they’re not the biggest models, but no one doubts more will be possible.

> or perhaps AIs replacing OS of mobile devices and router protocols and servers.

Holy shit, please no. There’s no sane reason for that to happen. Why would you replace stable nimble systems which depend on being predictable and low power with a system that’s statistical and consumes tons of resources? That’s unfettered hype-chasing. Let’s please not turn our brains off just yet.


What is an AI? Just an LLM? What do you call Siri, Cortana, and Google Assistant? I've already received a Gemini app on Android, and they're promoting the premium version too. Runs locally, yes?

30 years ago, my supervisor wrote, from scratch, an "AI" running on the company web server (HP/UX PA-RISC; 32-64MB RAM), that would heuristically detect and block suspected credit-card fraud. Remember that "AI" is a perennial buzzword with fluid definitions, both an achievable goal right now, and a holy grail. ¡Viva Eliza!


> I've already received a Gemini app on Android, and they're promoting the premium version too. Runs locally, yes?

Are you asking me? I don’t know nor do I care, I don’t use Android or Gemini in any capacity.

> Remember that "AI" is a perennial buzzword with fluid definitions

You don’t need to tell me that. I didn’t use the term “AI”, I just quoted the other post and responded in what I understood to be their terms. I don’t think LLMs are intelligent, thus not AI. You’re nitpicking the wrong person.


You are going with the assumption that LLMs will remain the same forever. There are attempts everywhere to make them smaller, more efficient and more focused.

Would you have called « transferring mail online » in the 90s hype-chasing because our postal system was working great? Probably not a great analogy but you get the point.


> You are going with assumption that llms will remain same forever.

I definitely am not and that’s stated directly in my post. I specifically said “but no one doubts more will be possible”.

> There are attempts everywhere to make them smaller, efficient and more focused.

That doesn’t matter when the issue is inherent. An LLM, by definition, needs large amounts of data and acts on them probabilistically. If you change that, it’s no longer an LLM. Any system that requires predictability needs to be programmed with rules we can understand, test, and reproduce reliably. LLMs ain’t it, and won’t ever be. Something else by a different name, maybe.

> Would you have called « transferring mails online « in the 90s is hype-chasing because our postal system was working great? Probably not a great analogy but you get the point

I understand analogies are never perfect, but that’s a particularly bad one. At least stay within the same realm of physicality. You made a bad comparison then ascribed a bad argument to be against it. That’s a straw man. I don’t even believe our current digital system is “working great”, so the analogy fails on multiple levels.

There are bad programmers and bad technology everywhere, the world is barely held by proverbial spit and bubblegum. Yet that doesn’t mean any crap that’s invented afterwards, be it “web3” or LLMs are immediately the solution.


You are missing the whole point of the comment. It's a prediction/speculation. I agree that it may seem bad with the current state of things, but no one can say with certainty it WILL be bad. There might be hybrid models which can do both probabilistic AND heuristic work within scope, you never know. That’s my whole point.

If you see my comment, I said AI, not LLMs.


Imagine if dict://internet was renamed to agent://source, then agentic calls to model sources could interconnect with ease. With HTTP/3, one stream could be the streaming browser and the other streams could be multi-agent sessions.


I was thinking along the same lines, but improvements toward running local models are right around the corner. Unless someone has a specific use case or proprietary models, this may not make sense. But in the current situation, that would be a great thing to standardise communication between various LLMs and clients.


Given that most current AI generated code is dogshit, I would say we are well off from that.


Not right now, but we are seeing accelerating trends in generated code. How long would it take for regular use cases to be completely AI-generated on the go? Edge cases exist everywhere.


It seems grossly inefficient for coding to go from NL prompt -> LLM -> formal language -> compile/execute. I'd envision AIs that would construct apps directly from the prompts, without the fragile middleware.

In fact, if it is optimal coding to interface existing libraries and frameworks, with minimal novel code, then just go full LEGO and pull in those dependencies, minimizing the error-prone originality.


I recently began testing my own public `dictd` server. The main goal was to make the OED (the full and proper one) available outside of a university proxy. I figured I would add the Webster's 1913 one too.

Unfortunately the vast majority of dictionary files are in "stardict" format and the conversion to "dict" has yielded mixed results. I was hoping to host _every_ dictionary, good and bad, but will walk that back now. A free VPS could at least run the OED.


> to make the OED (the full and proper one) available outside of a university proxy.

Was the plan to do this in a legal fashion? If so, how?


No, it's not possible to do in a legal fashion I don't think. The existing methods require a browser-based portal that requires you to be logged in with your university proxy.

As an alumnus I could do this by showing up in person to my university and accessing that way. But I'm not going to.


what's the stardict format? which edition of the oed are you hosting? i scanned the first edition decades ago but i don't think there's a reasonable plain-text version of it yet


StarDict (a program/file format) is easily googlable. A bit of a rabbit hole is that it's been chased around hosting providers because its site (used to) offer downloads of copyrighted dictionaries, including the OED 2nd edition. I don't know how these files were originally obtained or produced. See: https://web.archive.org/web/20230718140437/http://download.h...

Edit to add: Also, "i scanned the first edition decades ago" sounds like quite a story. 13 volumes? What project were you doing?


oh, i just thought it would be good for the public-domain dictionary to be available to the public: https://www.mail-archive.com/kragen-tol@canonical.org/msg001...


It will be the 2nd edition which is freely available on the internet with all of the usual copyright concerns.

And it's already in 'dict' format so I didn't need to convert.


At least one of the illegal copies was supposedly converted using a version of "oed2dict": https://njw.name/oed2dict/ which creates various formats.

That takes HTML files as input, and I don't know where those are found. ISOs of CD-ROM editions from 1996 and 2009 are online, but it looks like an adventure to install the software and/or extract the data.

The trouble with piracy is that provenance is so shaky, with plenty of chance for bugs or alterations...


Wow, either I've forgotten this existed or I had no clue. I was around for this era, and I remember Veronica, Archie, WAIS, Gopher, etc., but never recall reading about a Dict protocol. Nice find!


I've been aware of dict for a while, since I wrapped up an Esperanto-to-English dictionary for KOReader in a format KOReader could understand. What I'd really have liked is a format like this:

dict://<server>/<origin language>/<definition language>/<word>

Still, it is pretty cool that dict servers exist at all, so no complaints here.


Oh yes, I remember dictionary servers. Also many other protocols.

What happened to all of those other protocols? Everything got squished onto http(s) for various reasons. As mentioned in this thread, corporate firewalls blocking every other port except 80 and 443. Around the time of the invention of http, protocols were proliferating for all kinds of new ideas. Today "innovation" happens on top of http, which devolves into some new kind of format to push back and forth.


I wouldn't place all the blame on corporate IT for low level protocols dying out. A lot of corporate IT filtering was a reaction to malicious traffic originating from inside their networks.

I think filtering on university networks killed more protocols than corporate filtering. Corporate networks were rarely the place where someone stuck a server in the corner with a public IP hosting a bunch of random services. That however was very common in university networks.

When university networks (early 00s or so) started putting NAT on ResNets and filtering faculty networks is when a lot of random Internet servers started drying up. Universities had huge IPv4 blocks and would hand out their addresses to every machine on their networks. More than a few Web 1.0 companies started life on a random Sun machine in dorm rooms or the corner of a university computer lab.

When publicly routed IPs dried up so did random FTPs and small IRC servers. At the same time residential broadband was taking off but so were the sales of home routers with NAT. Hosting random raw socket protocols stopped being practical for a lot of people. By the time low cost VPSes became available a lot of old protocols had already died out.


Nice find, didn't know the protocol either. The site lists all available dictionaries here: https://dict.org/bin/Dict?Form=Dict4

I'll then be writing a Java server for DICT. Likely add more recent types of dictionaries and acronyms to help keep it alive.


dict is a must for me on any daily driver. Removing the friction of opening a web browser or specialized app and just looking up text from the terminal is just so nice. Just like bc, you don't miss it when you don't know it's there, but once you get used to using it you can't live without it. Making custom dictionaries is not very well documented though.


I love dict/dictd but I had an issue using it in hostile networks that block the port/protocol.

I've been tempted to revamp dict/dictd to shovel the dict protocol over websockets so I can use it over the web. Just one of those ideas in the pipeline that I haven't revisited because I'm no longer dealing with that hostile network.


The dict protocol really shows its age, notably the stateful connection part. Having a new protocol based on HTTP and JSON, similar to LSP, would be nice, but there is no real interest. (I made and used my own nonetheless in a research project. It may even be deployed but deactivated in another one.)

The biggest issue isn't technical; it's the fact that organizations holding dictionary data don't want third parties to interact with it without paid licensing.


I hate the fact that corporate IT collectively decided to block every port except 80 and 443, making it necessary to base new protocols on HTTP instead of TCP/IP.


In my experience HTTP is a better foundation for novel protocols in most cases.


Doesn't HTTP require binary data to be converted to base64 encoding, thereby increasing its size on the wire? That seems suboptimal for a lot of use cases


It does not - you are perhaps thinking of GET queries: URL data often has to be encoded (percent- or base64-encoded) because URLs can only carry a limited set of characters.

HTTP bodies can be made up of any data in any encoding you wish.
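
A minimal sketch, using httpbin.org purely as a convenient echo endpoint; the body goes over the wire as raw bytes, no base64 involved:

    import http.client

    payload = bytes(range(256))                    # arbitrary binary data
    conn = http.client.HTTPSConnection("httpbin.org")
    conn.request("POST", "/post", body=payload,
                 headers={"Content-Type": "application/octet-stream"})
    print(conn.getresponse().status)               # 200; bytes sent unmodified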


no, it does not, neither in requests nor in replies. possibly you are thinking of smtp


> possibly you are thinking of smtp

You're right, I definitely was


There is always Wiktionary; I would assume they have an API of some sort. That would cover the HTTP & JSON bit!

https://wiktionary.org


There's not a specific Wiktionary API, just the general Mediawiki stuff

https://en.wiktionary.org/wiki/User:Amgine/Wiktionary_data_%...


> Having a new protocol based on HTTP and JSON

This is just a HTTP/REST api? These exist already.


Might be something LLM-based for an organization's knowledge base.


Regarding OP’s unanswered question:

> 00. Are there any other Dictionary Servers still available on the Internet?

There are a number of other dict: servers including ones for different languages:

https://servers.freedict.org/


dict and the relevant dictionaries are things i pretty much always install on every new laptop. gcide in particular includes most of the famous 1913 webster dictionary with its sparkling prose:

    : ~; dict glisten
    2 definitions found

    From The Collaborative International Dictionary of English v.0.48 [gcide]:

      Glisten \Glis"ten\ (gl[i^]s"'n), v. i. [imp. & p. p.
         {Glistened}; p. pr. & vb. n. {Glistening}.] [OE. glistnian,
         akin to glisnen, glisien, AS. glisian, glisnian, akin to E.
         glitter. See {Glitter}, v. i., and cf. {Glister}, v. i.]
         To sparkle or shine; especially, to shine with a mild,
         subdued, and fitful luster; to emit a soft, scintillating
         light; to gleam; as, the glistening stars.

         Syn: See {Flash}.
              [1913 Webster]
it's interesting to think about how you would implement this service efficiently under the constraints of mid-01990s computers, where a gigabyte was still a lot of disk space and multiuser unix servers commonly had about 100 mips (https://netlib.org/performance/html/dhrystone.data.col0.html)

totally by coincidence i was looking at the dictzip man page this morning; it produces gzip-compatible files that support random seeks so you can keep the database for your dictd server compressed. (as far as i know, rik faith's dictd is still the only server implementation of the dict protocol, which is incidentally not a very good protocol.) you can see that the penalty for seekability is about 6% in this case:

    : ~; ls -l /usr/share/dictd/jargon.dict.dz
    -rw-r--r-- 1 root root 587377 Jan  1  2021 /usr/share/dictd/jargon.dict.dz
    : ~; \time gzip -dc /usr/share/dictd/jargon.dict.dz|wc -c
    0.01user 0.00system 0:00.01elapsed 100%CPU (0avgtext+0avgdata 1624maxresident)k
    0inputs+0outputs (0major+160minor)pagefaults 0swaps
    1418350
    : ~; gzip -dc /usr/share/dictd/jargon.dict.dz|gzip -9c|wc -c
    556102
    : ~; units -t 587377/556102 %
    105.62397
nowadays computers are fast enough that it probably isn't a big win to gzip in such small chunks (dictzip has a chunk limit of 64k) and you might as well use a zipfile, all implementations of which support random access:

    : ~; mkdir jargsplit
    : ~; cd jargsplit
    : jargsplit; gzip -dc /usr/share/dictd/jargon.dict.dz|split -b256K
    : jargsplit; zip jargon.zip xaa xab xac xad xae xaf 
      adding: xaa (deflated 60%)
      adding: xab (deflated 59%)
      adding: xac (deflated 59%)
      adding: xad (deflated 61%)
      adding: xae (deflated 62%)
      adding: xaf (deflated 58%)
    : jargsplit; ls -l jargon.zip 
    -rw-r--r-- 1 user user 565968 Sep 22 09:47 jargon.zip
    : jargsplit; time unzip -o jargon.zip xad
    Archive:  jargon.zip
      inflating: xad                     

    real    0m0.011s
    user    0m0.000s
    sys     0m0.011s
so you see 256-kibibyte chunks have submillisecond decompression time (more like 2 milliseconds on my cellphone) and only about a 1.8% size penalty for seekability:

    : jargsplit; units -t 565968/556102 %
    101.77413
and, unlike the dictzip format (which lists the chunks in an extra backward-compatible file header), zip also supports efficient appending

even in python (3.11.2) it's only about a millisecond:

    In [13]: z = zipfile.ZipFile('jargon.zip')

    In [14]: [f.filename for f in z.infolist()]
    Out[14]: ['xaa', 'xab', 'xac', 'xad', 'xae', 'xaf']

    In [15]: %timeit z.open('xab').read()
    1.13 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
this kind of performance means that any algorithm that would be efficient reading data stored on a conventional spinning-rust disk will be efficient reading compressed data if you put the data into a zipfile in "files" of around a meg each. (writing is another matter; zstd may help here, with its order-of-magnitude faster compression, but info-zip zip and unzip don't support zstd yet.)

dictd keeps an index file in tsv format which uses what looks like base64 to locate the desired chunk and offset in the chunk:

    : jargsplit; < /usr/share/dictd/jargon.index shuf -n 4 | LANG=C sort | cat -vte
    fossil^IB9xE^IL8$
    frednet^IB+q5^IDD$
    upload^IE/t5^IJ1$
    warez d00dz^IFLif^In0$
this is very similar to the index format used by eric raymond's volks-hypertext https://www.ibiblio.org/pub/Linux/apps/doctools/vh-1.8.tar.g... or vi ctags or emacs etags, but it supports random access into the file
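
for concreteness, here's a rough sketch of decoding one of those entries, assuming the usual dictd convention of base64-alphabet digits (A=0 ... /=63), most significant digit first:

    B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def b64_number(s):
        n = 0
        for ch in s:
            n = n * 64 + B64.index(ch)
        return n

    headword, offset, length = "fossil\tB9xE\tL8".split("\t")
    print(headword, b64_number(offset), b64_number(length))
    # -> fossil 515140 764  (byte offset and length into jargon.dict)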

strfile from the fortune package works on a similar principle but uses a binary data file and no keys, just offsets:

    : ~; wget -nv canonical.org/~kragen/quotes.txt
    2024-09-22 10:44:50 URL:http://canonical.org/~kragen/quotes.txt [49884/49884] -> "quotes.txt" [1]
    : ~; strfile quotes.txt
    "quotes.txt.dat" created
    There were 87 strings
    Longest string: 1625 bytes
    Shortest string: 92 bytes
    : ~; fortune quotes.txt
      Get enough beyond FUM [Fuck You Money], and it's merely Nice To Have
        Money.

            -- Dave Long, <dl@silcom.com>, on FoRK, around 2000-08-16, in
               Message-ID <200008162000.NAA10898@maltesecat>
    : ~; od -i --endian=big quotes.txt.dat 
    0000000           2          87        1625          92
    0000020           0   620756992           0         933
    0000040        1460        2307        2546        3793
    0000060        3887        4149        5160        5471
    0000100        5661        6185        6616        7000
of course if you were using a zipfile you could keep the index in the zipfile itself, and then there's no point in using base64 for the file offsets, or limiting them to 32 bits


So, can I somehow use the 1913 Webster dictionary on MacOS? It's not in the list of configurable ones.

(If not possible, Terminal would work too.)


Yep, someone posted this to HN:

https://github.com/cmod/websters-1913


is gcide available? debian only offers web1913 as part of gcide


Emacs includes a browsable client for this protocol; you can use it with `M-x dictionary`.


The DICT Development Group also provides a dedicated dict: client:

  sudo apt install dict

  brew install dict
Which allows you to query dict://dict.org/ directly:

  dict foo


There was also a translation server called Babylon that used a similar raw text protocol (like WHOIS, and DICT here) in 1998. I remember adding it to my IRC script, but it must have stopped working at some point, so I replaced it with "babelfish.altavista.com" :)


           echo "define * hacker " | nc dict.org 2628 | less


super fascinating and potentially useful for future projects with or w/o AI. obviously makes me want to maintain my own dict service. love this


hmmm

  $>curl dict://dict.org/d:Internet
  curl: (1) Protocol "dict" not supported


Works for me. I bet your OS ships a crippled version of curl.

  $ curl --version
  curl 8.7.1 (x86_64-pc-linux-gnu) [...]

  $ curl dict://dict.org/d:Internet
  220 dict.dict.org dictd 1.12.1/rf on Linux 4.19.0-10-amd64 <auth.mime> <370202891.28105.1727009645@dict.dict.org>
  250 ok
  150 1 definitions retrieved
  [...]


Possibly Fedora. I'm using Fedora 40 and its curl reports thus:

  curl 8.6.0 (x86_64-redhat-linux-gnu) libcurl/8.6.0 OpenSSL/3.2.2 zlib/1.3.1.zlib-ng libidn2/2.3.7 nghttp2/1.59.0
  Release-Date: 2024-01-31
  Protocols: file ftp ftps http https ipfs ipns
  Features: alt-svc AsynchDNS GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz SPNEGO SSL threadsafe UnixSockets
And the dict protocol is indeed unsupported by system curl.

EDIT: https://fedoraproject.org/wiki/Changes/CurlMinimal_as_Defaul...

EDIT2: To change from libcurl-minimal to libcurl, run:

  dnf swap libcurl-minimal libcurl
  dnf swap curl-minimal curl
The second step there may not be needed, at least my system had curl paired with libcurl-minimal so your situation may not match mine.

EDIT3: This is the output of my curl now:

  curl 8.6.0 (x86_64-redhat-linux-gnu) libcurl/8.6.0 OpenSSL/3.2.2 zlib/1.3.1.zlib-ng brotli/1.1.0 libidn2/2.3.7 libpsl/0.21.5 libssh/0.10.6/openssl/zlib nghttp2/1.59.0 OpenLDAP/2.6.7
  Release-Date: 2024-01-31
  Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp ws wss
  Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM PSL SPNEGO SSL threadsafe TLS-SRP UnixSockets


Manual build with an explicit `--disable-dict` perhaps? Because it's not Debian, Fedora, Homebrew, Nix, Alpine, Arch, or Gentoo, judging by their package definitions.


I am on Fedora Silverblue

  $>curl --version
  curl 8.6.0 (x86_64-redhat-linux-gnu) libcurl/8.6.0 OpenSSL/3.2.2 zlib/1.3.1.zlib-ng libidn2/2.3.7 nghttp2/1.59.0
  Release-Date: 2024-01-31
  Protocols: file ftp ftps http https ipfs ipns
  Features: alt-svc AsynchDNS GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz SPNEGO SSL threadsafe UnixSockets
I am not sure I understand you correctly. Should it work on Fedora?


The description on this page at least lists the dict protocol

https://src.fedoraproject.org/rpms/curl/

Only the minimal build disables the dict protocol; maybe you have the curl-minimal package installed?

https://src.fedoraproject.org/rpms/curl/blob/rawhide/f/curl....


Ah, it appears that curl-minimal became the default curl for Fedora recently. curl-full has to be installed for full functionality. I initially ignored it because I assumed the default was curl-full.

https://fedoraproject.org/wiki/Changes/CurlMinimal_as_Defaul...

Curl devs are predictably not too happy about this change.

https://daniel.haxx.se/blog/2022/03/16/fedora-and-curl-minim...


You mentioned Homebrew but missed the standard macOS package manager, MacPorts.


works for me too. but it takes about 6 seconds so curl dict://localhost/d:Internet is vastly preferable


Isn't .mobi an ebook format?


.COM is also a file format.


I dont think dict is secure enough. We need a new version called dick. K is for encryption key. /s /rant (at sftp).


Never heard of the dict command?

The author went through the trouble of figuring out the protocol but never bothered to just run dict. Okay.


What do you think the post would have contained if he had run dict?

Here's a hint:

  macbook% dict
  zsh: command not found: dict
  desktop$ dict
  bash: dict: command not found
You'd have to be pretty into retro-computing before you'll find an OS that ships /usr/bin/dict .


Not really... FWIW, addressing the 'hint' you're giving specifically, this is on Ubuntu.

    $ dict
    Command 'dict' not found, but can be installed with:
    sudo apt install dict
After installing,

    $ dict example
    6 definitions found
    ...


So the causal chain would be:

1. Notice a URL scheme dict://

2. Try to type 'dict' into a terminal, on the off chance there's a command-line tool with the same name (would you do this for https:// and expect the same outcome?)

3. Be running a distribution that modifies the user's shell environment to suggest packages related to unknown commands

4. Actually install and run that command

5. Be running tcpdump or wireshark at the same time to notice that the `dict` command is reaching out to the network, as opposed to doing some sort of local lookup in /usr/share/dict

6. Figure out from the network traffic that the tool is using a dictionary-specific protocol as opposed to just making an HTTP request to dictionary.com or whatever.

--

Nah, the only way someone would know (or even suspect!) that dict:// is somehow related to an ancient Unix command-line tool is prior knowledge, and it's unreasonable to expect the article author to have somehow intuited such an idea.


HTTP is the weird protocol here. Lots of protocols were named after the program (or vice versa).

finger, ftp, ssh, talk, telnet, tftp, maybe whois too?


also gopher. mail and dns are two other exceptions tho


right, the causal chain would be that the author already used the dict command and had at some point read the man page, which begins

    DICT(1)                                                                DICT(1)

    NAME
           dict - DICT Protocol Client

    SYNOPSIS
           (...)

    DESCRIPTION
           dict  is  a  client  for  the  Dictionary Server Protocol (DICT), (...)
but, yeah, not everybody has that background

which is fine! nobody is born knowing all the unix commands


The “dict” string was included in a regex of protocols. So they wanted to learn more about the protocol.

It’s entirely possible they were already aware of other software that supports dictionary lookups.


dict isn't other software, it's a client for the protocol being discussed


You can still be aware that ‘dict’ client exists without realising that it wasn’t just another HTTP user agent.


true


This is a fun post about an obscure internet protocol, not a how-to.


The point of the article wasn't how to define a word, it was answering why old code mentioned the dict protocol in the same regex as http and ftp.

I for one had never heard of the dict:// protocol, so I was curious about it.


This is not old code. Old code did not have dialer (tel:) urls. Given the timing, the dict it refers to is also not the original one, but the Safari link scheme.



