
Inside the OED: can the world’s biggest dictionary survive the internet? - f_allwein
https://www.theguardian.com/news/2018/feb/23/oxford-english-dictionary-can-worlds-biggest-dictionary-survive-internet
======
kragen
The OED can not only survive the internet; it will flourish on the internet.
We can incorporate the entire thing into Wiktionary, which, as
[https://news.ycombinator.com/item?id=16461656](https://news.ycombinator.com/item?id=16461656)
points out, is already in some ways more comprehensive and reliable.
Unfortunately, for copyright reasons, we have to go back to its first edition,
and even the last volume of the first edition might still be in copyright.

I announced the project at [https://www.mail-archive.com/kragen-
tol@canonical.org/msg001...](https://www.mail-archive.com/kragen-
tol@canonical.org/msg00130.html), bought a copy of the dictionary, and spent a
bunch of nights at the archive scanning its volumes.
[https://old.datahub.io/dataset/oed](https://old.datahub.io/dataset/oed) talks
a bit about the available data halfway through the project. I did eventually
scan the whole thing, and since then other people have contributed other
scans. So far nobody has OCRed it and imported it wholesale into Wiktionary.

At some point I hacked together a web service that would show you the page
image for a given word (perhaps after a few tries), but I don't have it
running now.
[https://archive.org/details/oed11_201407](https://archive.org/details/oed11_201407)
seems to be a downloadable program that does something similar.
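
For illustration only, here is a minimal sketch of how such a word-to-page lookup could work. It is not the actual service. It assumes a hypothetical index holding the first headword of each scanned page, and the names `FIRST_HEADWORDS` and `page_for` are made up. Plain string comparison only approximates dictionary collation, which may be one reason a lookup takes a few tries.

```python
import bisect

# Hypothetical index: the first headword on each scanned page, in page order.
# Real data would have to come from OCR or manual tagging; these are made up.
FIRST_HEADWORDS = ["a", "abandon", "abbey", "abide", "able", "about"]


def page_for(word, first_headwords=FIRST_HEADWORDS):
    """Return the 0-based index of the page that should contain `word`."""
    key = word.lower()
    # bisect_right counts the pages whose first headword is <= key;
    # the word belongs on the last of those pages.
    i = bisect.bisect_right(first_headwords, key)
    return max(i - 1, 0)


print(page_for("abbot"))
```

With a sorted index like this, each lookup is a binary search, so even a 15,000-page scan needs only about 14 comparisons.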

It's really unfortunate that the past century of work at the Oxford University
Press will be lost and have to be redone. Such is the price of copyright.

~~~
ComputerGuru
OCR has come a _long_ way since 2005. I was an early adopter around 2000, and
the results scared me away from OCR for the better part of two decades. I
recently revisited it with no special approach, just trying Adobe Acrobat Pro’s
OCR module, and was _very_ pleasantly surprised with the progress that has been
made.

Adobe (and I’m sure many others) now preserve the image but place text behind
it in a hidden layer or else use a pseudo ML process to create a font from the
scan and fit it to the text with high accuracy (and low binary sizes).

Might be worth revisiting.

Edit: just saw the full mailing list url (on phone, hard to see anything) and
realized who I am replying to. Hi! I have missed your writings! I recommend
“only a constant factor worse than optimal” all the time to people.

~~~
kragen
I'm flattered! I'll come back to publishing soon, possibly pseudonymously.

------
eschevarria
“In February 2009, a Twitter user called @popelizbet issued an apparently
historic challenge to someone called Colin: she asked if he could ‘mansplain’
a concept to her. History has not recorded if he did, indeed, proceed to
mansplain. But the lexicographer Bernadette Paton, who excavated this exchange
last summer, believed it was the first time anyone had used the word in
recorded form. ‘It’s been deleted since, but we caught it,’ Paton told me,
with quiet satisfaction.

[…]

A few days ago, I emailed to see if ‘mansplain’ had finally reached the OED.
It had, but there was a snag – further research had pushed the word back a
crucial six months, from February 2009 to August 2008. Then, no sooner had
Paton’s entry gone live in January than someone emailed to point out that even
this was inaccurate: they had spotted ‘mansplain’ on a May 2008 blog post,
just a month after the writer Rebecca Solnit had published her influential
essay Men Explain Things to Me. The updated definition, Proffitt assured me,
will be available as soon as possible.”

One Wiktionary contributor did a better job[1] in 2012 by immediately finding
the use[2] from May 2008. The OED is more Prestigious and Respectable and
Authoritative, but the Wiktionary is more comprehensive, informative,
reliable, convenient, _useful_, and also cheaper.

[1]:
[https://en.wiktionary.org/w/index.php?title=Citations:manspl...](https://en.wiktionary.org/w/index.php?title=Citations:mansplain&oldid=16992030)

[2]:
[https://web.archive.org/web/20130518221612/http://www.journa...](https://web.archive.org/web/20130518221612/http://www.journalfen.net/community/fandom_wank/1156737.html?thread=179210113#t179210113)

~~~
forapurpose
> One Wiktionary contributor did a better job

If we define "better" by speed, then every HN comment is better than every
book ever published, and all the blog posts on climate change are better than
the scientific research. I find it's the opposite: The things that take longer
to publish are usually better.

------
tardygrad
I found The Surgeon of Crowthorne by Simon Winchester to be a fascinating read
about James Murray and how he came to write the OED.

It is ironic that the definitive book on the English language was compiled by
an American, and that he made it while he was incarcerated in an asylum for
murder just makes the tale more interesting.

Simon Winchester is an excellent storyteller and the Surgeon of Crowthorne is
an entertaining and insightful book - highly recommended!

~~~
s3r3nity
If you enjoyed that, I highly recommend Winchester's "The Meaning of
Everything," which goes into even more detail on all the historical editors of
the OED, and their respective process in approaching the massive project. One
of my favorite books I've ever read.

------
drb91
I’d pay hundreds of dollars for a version of OED that shipped in a well-
defined sqlite database. I’d even pay for updates.

~~~
SomeHacker44
I agree. I would love a standalone app (Mac, iOS) which did not require
online access, even if it was multiple gigabytes. There was an old OED 2e app
for the Mac with the full content.

~~~
walterbell
The standalone offline concise version includes updates, has 1/3 of the OED
and is 128MB on iOS.

------
Myrmornis
I miss the OED. I consulted it in my parents’ house as a teenager, and ever
since I have been too lazy or cheap to acquire a copy, whether paper or
otherwise. Basically I want an electronic copy, but $295 annual subscription
seems steep compared to the price of paper copies ($50 it looks like). It
hardly needs saying that the dictionaries freely available online are not
adequate replacements.

~~~
lukev
To be fair what you get with the $295 subscription is the _full_ OED,
including all the expanded etymologies, which is 20 volumes and seems to sell
for $1000.

~~~
Nition
I used to use it a bit when I was at university and they had a subscription.
Having etymologies for all the words is really interesting. And it's certainly
a lot more thorough than the free online dictionaries.

Unfortunately it's one of those things where the free version is "good
enough", so most people won't pay for the improved version. Especially at
$295/yr.

------
forapurpose
There is nothing that compares to the OED if you are serious about knowledge.
Knowledge obviously depends heavily on language, but of course language is
nebulously defined: The same word means different things to different people
and in different places and times.

The OED uniquely solves this problem through an unbelievable amount of
scholarship into each word's range of meanings, providing incredible breadth
and depth, back to the known beginnings of its usage. If you are serious about
knowing what something specifically means, it's essential. As one simple
example, I find it to be the best source for really grokking mathematical
terms by far, and sometimes what a 19th-century mathematician meant by a term
differs from the usage in 2018.

~~~
Symbiote
> grokking

1\. Understand (something) intuitively or by empathy.

Origin: 1960s: a word invented by Robert Heinlein (1907–88), American author.

I knew the word, but I'm pleased to discover the origin.

[https://en.oxforddictionaries.com/definition/grok](https://en.oxforddictionaries.com/definition/grok)

------
ghostcluster
I find Urban Dictionary to be the most fascinating dictionary project in the
era of the Internet, and much more contemporaneously useful.

By the time a term enters the OED, it's already fairly dry, or the editors are
jumping on a bandwagon in an attempt to meme a word into existence for
political reasons. Witness the Oxford English Dictionary's 'Word of the Year
2017': youthquake. [https://en.oxforddictionaries.com/word-of-the-year/word-
of-t...](https://en.oxforddictionaries.com/word-of-the-year/word-of-the-
year-2017)

Good luck trying to hold onto relevance with stunts like that.

~~~
nasredin
I am so quietly proud of myself.

I managed to not learn the definition of that "word of the year". Through all
the news articles about their decision, the outrage, I remain in blissful
ignorance, content with my superiority over the Steve Buscemis at OED.

"How do you do, fellow kids?"

Personally I enjoy and visit UD more often than OED.

------
dogma1138
The OED is a standard for most if not all English as a Second Language or
English as a Foreign Language classes worldwide, from as early as the first or
second grade (depending on the country). This alone pretty much keeps it in
circulation, since it’s essentially a companion textbook.

~~~
nkurz
Either you come from somewhere with extraordinarily talented elementary school
students, or you are confusing the OED with a different similarly named work,
perhaps something from here
[https://www.oxfordlearnersdictionaries.com/](https://www.oxfordlearnersdictionaries.com/)?

The OED that's being referred to here is a 20+ volume set that costs about a
thousand dollars: [https://global.oup.com/academic/product/the-oxford-
english-d...](https://global.oup.com/academic/product/the-oxford-english-
dictionary-9780198611868?cc=us&lang=en). University libraries will probably
have a copy or two, and I suppose some high school (9th-12th grade) libraries
might have a single rarely consulted set, but I'm doubtful any elementary
schools in the US have one.

The best evidence I can give for its exclusiveness might be the link on the
front page of the official OED site [http://www.oed.com](http://www.oed.com)
to the information about the print edition: [http://public.oed.com/about/the-
oed-in-print-and-on-cd-rom/](http://public.oed.com/about/the-oed-in-print-and-
on-cd-rom/). It goes to a page not found. Rather than being a standard
companion book, apparently so few copies of this are sold that no one has
bothered to fix the broken link!

~~~
dogma1138
I didn’t realize that this is only about the full “Encyclopedia Britannica”
version of the OED, but Oxford has multiple editions of the OED, like the OED
for schools and the OED for advanced learners, which most students around the
world use.

[http://www.oed.com/staticfiles/oedtoday2/oxford_and_the_dict...](http://www.oed.com/staticfiles/oedtoday2/oxford_and_the_dictionary.pdf)

------
walterbell
Free Oxford English Dictionary (1888), 15000 searchable page scans in a
Windows app,
[https://archive.org/details/oed11_201407](https://archive.org/details/oed11_201407)

The mobile app for "Shorter OED" has 600K words and costs $30:
[https://www.mobisystems.com/shorter-oxford-
dictionary/](https://www.mobisystems.com/shorter-oxford-dictionary/)

~~~
f_allwein
Just to be clear: The Oxford English Dictionary has 21,728 pages in 20
volumes, covering around 300,000 entries. It's quite fun to browse (despite
the shortcomings mentioned in the article), so give it a try if you have the
chance.

------
mmjaa
I'd rather like to have a service that allows me to construct my own
dictionaries. There have got to be some standard OpenSource/GNU-like tools
that give a budding lexicographer the things they need to construct
dictionaries - does anyone know what these tools are, and how effective they
are at creating custom dictionaries?

~~~
Nition
One thing I always wanted was a paper dictionary that _excluded_ the really
common words. It'd be a lot faster to look up _thaumaturge_ if I didn't have
to sift through the likes of _that_ and _the_. Take say the 5000 most common
words in English and leave them out.

Sure on rare occasions you'd look up something that was left out, but that
already happens in the other direction (towards rare words) because _that_ and
_the_ are taking up space where more difficult words could have been included.
Smaller dictionaries often don't have the word I wanted, but do have a ton of
words I'd never need to look up.
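
As a rough sketch of the idea, assuming a hypothetical dictionary held as a headword-to-definition mapping plus a hypothetical word list ranked from most to least frequent (the function name and sample data are made up):

```python
def drop_common(entries, ranked_words, n_common=5000):
    """Return a copy of `entries` without the n_common most frequent headwords.

    entries: dict mapping headword -> definition (hypothetical format).
    ranked_words: words ordered from most to least frequent (hypothetical).
    """
    common = {w.lower() for w in ranked_words[:n_common]}
    return {w: d for w, d in entries.items() if w.lower() not in common}


# Tiny placeholder data, just to show the shape of the inputs.
entries = {"that": "...", "the": "...", "thaumaturge": "a worker of wonders"}
ranked = ["the", "that", "of", "and"]
print(sorted(drop_common(entries, ranked, n_common=2)))
```

The cutoff `n_common` is the only real design decision here. Set it too high and you start dropping words people actually look up.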

~~~
my_first_acct
According to someone on Stack Exchange, quoting the NY Times, quoting the
Chief Editor of the OED [1], 'there are for the verb-form alone of “run” no
fewer than 645 meanings'. Other common words with huge numbers of meanings
include "put" and "set".

[1] [https://english.stackexchange.com/questions/42480/words-
with...](https://english.stackexchange.com/questions/42480/words-with-most-
meanings)

~~~
Nition
And I'm sure some of those are obscure meanings, but the likelihood is I'm
never going to look them up. Exclude those many definitions of "run" and
either make the dictionary a bit smaller (easier to search) or include several
rare word definitions in the same space.

~~~
dragonwriter
Removing a handful of thousands of common words won't make the OED noticeably
shorter, and it's not space constrained anyway.

And I don't think a rare use of a common word is any less likely to be looked
up than a rare word.

~~~
Nition
I don't mean the full OED. Personally I have a Pocket Oxford and a Concise
Oxford. Both are space constrained. The Pocket Oxford in particular has around
60,000 words so leaving out the most common 5-10K would make a noticeable
difference. Since it's the complex words that get left out in small
dictionaries, as you decrease dictionary size, the size saving that you'd get
from removing common words proportionally increases. But you might be right
that it wouldn't make enough of a difference, since small dictionaries can
also fit fewer words in total. It might never be enough to matter much.
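
To put rough numbers on that, here is the back-of-envelope arithmetic using the figures above. It counts entries only. Page space is a separate question, since common words tend to have longer entries, and the helper name is made up.

```python
def fraction_removed(total_entries, removed_entries):
    """Share of headwords dropped, by entry count (not by page space)."""
    return removed_entries / total_entries


# 60,000 entries in the Pocket Oxford, leaving out the most common 5-10K.
for removed in (5_000, 10_000):
    share = fraction_removed(60_000, removed)
    print(f"{removed} of 60000 entries = {share:.1%}")
```

So dropping 5-10K entries removes roughly 8% to 17% of the headwords.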

------
qplex
Looking up the definitions of obscure words is one of the tasks where I hardly
ever go any further than the very first page of the search engine results.

That is, I don't even click on the links: the condensed summaries usually
contain the definitions (from multiple dictionaries).

I'm not sure how this works out for the actual content providers.

------
zie
Not about the OED really, but my favorite dictionary resources:
[https://www.waywordradio.org/resources/#dictionaries](https://www.waywordradio.org/resources/#dictionaries)

and the double-tongued dictionary is fun:
[https://www.waywordradio.org/dictionary-
listing/](https://www.waywordradio.org/dictionary-listing/)

------
forapurpose
There's a well-known narrative that before the Internet, for all of human
history, information was scarce and humans found ways to adapt; and since the
Internet, information has been overwhelming and humans must find different
ways to adapt.

What is disappointing to me is that, with so many more options, people have
not adapted by utilizing only the best information. Anyone can read the best
sources (with a major exception; see below), but instead they choose more and
more crap and now even actively delegitimize the better sources in favor of
propaganda. In other words, why hasn't the vastly increased competition in the
marketplace of ideas yielded far superior knowledge? _Epistemology_ should be
one of the hottest words and hottest subjects of the day.

I hypothesize that it's a failure of the intellectual elite. Instead of
spreading their incredible wealth to the world over this new medium, they kept
it to themselves behind high paywalls (science journals, OED, JSTOR, etc.).
And instead of defending the values of and passion for knowledge and
intellect, of the Enlightenment, scholarship, and reason, many I see and talk
to adopt the trendy anti-intellectualism, bizarrely undermining their own
reason for being.

I know many people will say that they can get a dictionary for free, why use
the OED. The problem is that those who know better are not standing up to
assert why it is superior, necessary, and incredibly valuable to the world.

~~~
thaumasiotes
> I know many people will say that they can get a dictionary for free, why use
> the OED. The problem is that those who know better are not standing up to
> assert why it is superior, necessary, and incredibly valuable to the world.

There are a couple of problems with this.

First, the OED is primarily a work of historical scholarship. Anyone in the
market for a dictionary _so that they can look up the meanings of words_ would
be better off getting something else, like Merriam-Webster or really anything.
The OED is not a good choice for this purpose.

Second, a story: I spent a semester studying Chinese in a foreign student
program at a Chinese school. Like everybody else, I used Pleco for my
dictionary needs.

My friends in the class quickly noticed that I was getting much better
dictionary value from my installation of Pleco than they were getting from
theirs. When one of them asked how I was doing it, I responded "I bought for-
pay dictionaries" and they immediately lost interest.

There was no need for me to assert that the dictionaries I was using were
superior and valuable -- my friends had already come to that conclusion
themselves by watching me. They just weren't willing to pay $20 for that
additional value.

~~~
shullbitt0r
Because that's cheating! I'm half kidding but you are arguing about price,
mainly, not about why the cheaper (free?) content is lacking. You could also
just pay for a personal teacher or, taken to the extreme, a personal translator
so you wouldn't need to learn the language at all, which would be cheating.
Except that you might have a personal desire to speak the language. What's the
problem with Chinese that free offerings are inferior?

It is just chaotic is all I could infer from a first look. I mean English can
be pretty messy already and maybe a specific Chinese dialect will be more
regular than the bigger picture of the whole language. Perhaps it's not that
your classmates were being cheap; it could just be disappointment that
something so basic costs anything at all, and relief that it's not their
personal shortcoming but just an externalized advantage. You seem to say not
even $20 was low enough, have I got that right at least?

~~~
KMag
> Maybe a specific Chinese dialect will be more regular than the bigger
> picture of the whole language.

Standard written Chinese is based on the grammar of spoken Mandarin. If
someone mentions a Chinese dictionary without mentioning a particular dialect,
they're almost certainly referring to Standard Written Chinese (essentially
Mandarin).

Also, Chinese is less a single language than the Romance languages are a
single language. The more far-flung dialects have less in common than, say,
Italian and Spanish. If Portugal were a province of Spain, Portuguese would
probably be considered a dialect of Spanish. As my Linguistics professor used
to say, "A language is a dialect with an army." One rarely deals with "the
whole Chinese language", if by that you mean the union of all of the dialects.
That would be like dealing simultaneously with Romanian, Portuguese, Romansh,
French, Italian and Spanish as a single entity.

