The OED can not only survive the internet; it will flourish on the internet. We can incorporate the entire thing into Wiktionary, which, as https://news.ycombinator.com/item?id=16461656 points out, is already in some ways more comprehensive and reliable. Unfortunately, for copyright reasons, we have to go back to its first edition, and even the last volume of the first edition might still be in copyright.
I announced the project at https://www.mail-archive.com/kragen-tol@canonical.org/msg001..., bought a copy of the dictionary, and spent a bunch of nights at the archive scanning its volumes. https://old.datahub.io/dataset/oed talks a bit about the available data halfway through the project. I did eventually scan the whole thing, and since then other people have contributed other scans. So far nobody has OCRed it and imported it wholesale into Wiktionary.
At some point I hacked together a web service that would show you the page image for a given word (perhaps after a few tries), but I don't have it running now. https://archive.org/details/oed11_201407 seems to be a downloadable program that does something similar.
It's really unfortunate that the past century of work at the Oxford University Press will be lost and have to be redone. Such is the price of copyright.
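For anyone wanting to rebuild a page-lookup service like the one mentioned above, here is a minimal sketch. It is not the original service's code. It assumes you already have an index of the first headword on each scanned page (the `FIRST_HEADWORDS` list below is made-up sample data), and it assumes plain lowercase string comparison is close enough to the dictionary's alphabetical order, which will not hold for hyphenated or accented headwords.

```python
import bisect

# Hypothetical index: the first headword printed on each scanned page,
# in page order. Real data would have to come from OCR or manual keying.
FIRST_HEADWORDS = ["a", "abandon", "abbey", "abide", "able", "about"]

def page_for(word, first_headwords=FIRST_HEADWORDS):
    """Return the 0-based index of the page that should contain `word`."""
    key = word.casefold()
    # bisect_right gives the first page that starts strictly after the word,
    # so the page just before it is the one whose range covers the word.
    i = bisect.bisect_right(first_headwords, key) - 1
    # Words sorting before the very first headword fall back to page 0.
    return max(i, 0)

print(page_for("abbot"))   # between "abbey" and "abide"
```

With an index like this the lookup lands on the right page in one step. Without one, the "few tries" approach amounts to guessing a page, reading its headword, and bisecting by hand.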
OCR has come a _long_ way since 2005. I was an early adopter around 2000, and the results scared me away from OCR for the better part of two decades. I recently revisited it with no special approach, just trying Adobe Acrobat Pro’s OCR module, and was very pleasantly surprised with the progress that has been made.
Adobe (and I’m sure many others) now preserve the image but place text behind it in a hidden layer or else use a pseudo ML process to create a font from the scan and fit it to the text with high accuracy (and low binary sizes).
Might be worth revisiting.
Edit: just saw the full mailing list url (on phone, hard to see anything) and realized who I am replying to. Hi! I have missed your writings! I recommend “only a constant factor worse than optimal” all the time to people.
“In February 2009, a Twitter user called @popelizbet issued an apparently historic challenge to someone called Colin: she asked if he could ‘mansplain’ a concept to her. History has not recorded if he did, indeed, proceed to mansplain. But the lexicographer Bernadette Paton, who excavated this exchange last summer, believed it was the first time anyone had used the word in recorded form. ‘It’s been deleted since, but we caught it,’ Paton told me, with quiet satisfaction.
[…]
A few days ago, I emailed to see if ‘mansplain’ had finally reached the OED. It had, but there was a snag – further research had pushed the word back a crucial six months, from February 2009 to August 2008. Then, no sooner had Paton’s entry gone live in January than someone emailed to point out that even this was inaccurate: they had spotted ‘mansplain’ on a May 2008 blog post, just a month after the writer Rebecca Solnit had published her influential essay Men Explain Things to Me. The updated definition, Proffitt assured me, will be available as soon as possible.”
One Wiktionary contributor did a better job[1] in 2012 by immediately finding the use[2] from May 2008. The OED is more Prestigious and Respectable and Authoritative, but the Wiktionary is more comprehensive, informative, reliable, convenient, useful, also cheaper.
If we define "better" by speed, then every HN comment is better than every book ever published, and all the blog posts on climate change are better than the scientific research. I find it's the opposite: The things that take longer to publish are usually better.
I found The Surgeon of Crowthorne by Simon Winchester to be a fascinating read about James Murray and how he came to write the OED.
It is ironic that the definitive book on the English language was compiled by an American, and that he made it while he was incarcerated in an asylum for murder just makes the tale more interesting.
Simon Winchester is an excellent storyteller and the Surgeon of Crowthorne is an entertaining and insightful book - highly recommended!
If you enjoyed that, I highly recommend Winchester's "The Meaning of Everything," which goes into even more detail on all the historical editors of the OED, and their respective process in approaching the massive project. One of my favorite books I've ever read.
I agree. I would love a standalone app (Mac, iOS) which did not require online access, even if it was multiple gigabytes. There was an old OED 2e app for the Mac with the full content.
I just published a repository that de-obfuscates the archive from OED's Windows CD-ROM version. I don't know anything about the markup language it uses and I don't know anything about JSON either, but I attempt to parse it and convert it to JSON.
Sadly, the archive only contains 300,000 words IIRC.
I miss the OED. I consulted it in my parents’ house as a teenager, and ever since I have been too lazy or cheap to acquire a copy, whether paper or otherwise. Basically I want an electronic copy, but $295 annual subscription seems steep compared to the price of paper copies ($50 it looks like). It hardly needs saying that the dictionaries freely available online are not adequate replacements.
To be fair what you get with the $295 subscription is the _full_ OED, including all the expanded etymologies, which is 20 volumes and seems to sell for $1000.
I used to use it a bit when I was at University and they had a subscription. Having etymologies for all the words is really interesting. And it's certainly a lot more thorough than the free online dictionaries.
Unfortunately it's one of those things where the free version is "good enough", so most people won't pay for the improved version. Especially at $295/yr.
I have the full 20-ish volume print edition, which is fun to peruse sometimes if you have the shelf space for it. I got mine used for somewhere in the neighborhood of $100 on ebay, so you might check there.
I've heard about the existence of the full volume, but to date I don't think I've ever seen nor laid hands on it (if I've seen it, I think I would have remembered browsing it). I have a feeling that most people who speak English as their primary language won't even be aware of its existence.
The Encyclopædia Britannica is the only thing I've (slightly) leafed through before I started really getting online. I remember its thin pages and dry reading compared to less formal encyclopedias. I even remember pining over advertisements of the CD/DVD set and thinking about whether it'd be better to connect to the internet at home, or get the DVD of Britannica.
Yes. Most UK libraries have a setup so that their members can use the online OED from anywhere on the internet (you just have to enter your library card number). It uses cookies so you don't have to log in every time you want to look up a word, so the user-experience can be very convenient. I have a firefox keyword search shortcut for it.
I'm fortunate in that my (US) University has a subscription, but unlucky in that they have a poorly implemented 2-factor authentication for EVERYthing connected to the University, whose cookies disappear after 12 hours.
There is nothing that compares to the OED if you are serious about knowledge. Knowledge obviously depends heavily on language, but of course language is nebulously defined: The same word means different things to different people and in different places and times.
The OED uniquely solves this problem through an unbelievable amount of scholarship into each word's range of meanings, providing incredible breadth and depth, back to the known beginnings of its usage. If you are serious about knowing what something specifically means, it's essential. As one simple example, I find it to be by far the best source for really grokking mathematical terms, and sometimes a 19th-century mathematician's meaning differs from the usage in 2018.
I find Urban Dictionary to be the most fascinating dictionary project in the era of the Internet, and much more contemporaneously useful.
By the time a term enters the OED, it's already fairly dry, or the editors are jumping on a bandwagon in an attempt to meme a word into existence for political reasons. Witness the Oxford English Dictionary's 'Word of the Year 2017': youthquake. https://en.oxforddictionaries.com/word-of-the-year/word-of-t...
Good luck trying to hold onto relevance with stunts like that.
I managed to not learn the definition of that "word of the year". Through all the news articles about their decision, the outrage, I remain in blissful ignorance, content with my superiority over the Steve Buscemis at OED.
"How do you do, fellow kids?"
Personally I enjoy and visit UD more often than OED.
The OED is a standard for most if not all English as a Second Language or English as a Foreign Language classes, from as early as the first or second grade, worldwide (depending on the country). This alone pretty much keeps it in circulation, since it’s essentially a companion textbook.
Either you come from somewhere with extraordinarily talented elementary school students, or you are confusing the OED with a different similarly named work, perhaps something from here https://www.oxfordlearnersdictionaries.com/?
The OED that's being referred to here is a 20+ volume set that costs about a thousand dollars: https://global.oup.com/academic/product/the-oxford-english-d.... University libraries will probably have a copy or two, and I suppose some high school (9th-12th grade) libraries might have a single rarely consulted set, but I'm doubtful any elementary schools in the US have one.
The best evidence I can give for its exclusiveness might be the link on the front page of the official OED site http://www.oed.com to the information about the print edition: http://public.oed.com/about/the-oed-in-print-and-on-cd-rom/. It goes to a page not found. Rather than being a standard companion book, apparently so few copies of this are sold that no one has bothered to fix the broken link!
I didn’t realize that this is only about the full “Encyclopædia Britannica” version of the OED, but Oxford has multiple editions of the OED, like the OED for schools and the OED for advanced learners, which most students around the world use.
Just to be clear: The Oxford English Dictionary has 21,728 pages in 20 volumes, covering around 300,000 entries. It's quite fun to browse (despite the shortcomings mentioned in the article), so give it a try if you have the chance.
I'd rather like to have a service that allows me to construct my own dictionaries. There have got to be some standard OpenSource/GNU-like tools that give a budding lexicographer the things they need to construct dictionaries - does anyone know what these tools are, and how effective they are at creating custom dictionaries?
One thing I always wanted was a paper dictionary that excluded the really common words. It'd be a lot faster to look up thaumaturge if I didn't have to sift through the likes of that and the. Take say the 5000 most common words in English and leave them out.
Sure on rare occasions you'd look up something that was left out, but that already happens in the other direction (towards rare words) because that and the are taking up space where more difficult words could have been included. Smaller dictionaries often don't have the word I wanted, but do have a ton of words I'd never need to look up.
According to someone on Stack Exchange, quoting the NY Times, quoting the Chief Editor of the OED [1], 'there are for the verb-form alone of “run” no fewer than 645 meanings'. Other common words with huge numbers of meanings include "put" and "set".
And I'm sure some of those are obscure meanings, but the likelihood is I'm never going to look them up. Exclude those many definitions of "run" and either make the dictionary a bit smaller (easier to search) or include several rare word definitions in the same space.
I don't mean the full OED. Personally I have a Pocket Oxford and a Concise Oxford. Both are space constrained. The Pocket Oxford in particular has around 60,000 words so leaving out the most common 5-10K would make a noticeable difference. Since it's the complex words that get left out in small dictionaries, as you decrease dictionary size, the size saving that you'd get from removing common words proportionally increases. But you might be right that it wouldn't make enough of a difference, since small dictionaries can also fit fewer words in total. It might never be enough to matter much.
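To put rough numbers on the common-word exclusion idea, here is a small sketch. The function names, the toy entries, and the frequency list are all hypothetical. The 60,000 and 5-10K figures are the ones quoted above. It counts entries, not pages, so it says nothing about how much space long entries like "run" take up.

```python
def drop_common(entries, frequency_ranked, cutoff):
    """Return the entries whose headword is not among the `cutoff` most common words."""
    common = {w.casefold() for w in frequency_ranked[:cutoff]}
    return {head: text for head, text in entries.items()
            if head.casefold() not in common}

def share_removed(total_entries, removed):
    """Fraction of headwords freed up by removing `removed` of them."""
    return removed / total_entries

# Toy data standing in for a real dictionary and a real frequency list.
entries = {
    "the": "definite article",
    "that": "demonstrative",
    "thaumaturge": "a worker of wonders",
}
ranked = ["the", "of", "and", "that"]  # most common first
slim = drop_common(entries, ranked, cutoff=4)

for removed in (5_000, 10_000):
    print(removed, f"{share_removed(60_000, removed):.1%}")
```

By headword count that is roughly 8% to 17% of a 60,000-word dictionary. The saving in pages could be larger if the common words really do have the longest entries, but that would need measuring on real data.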
I want exactly the opposite - to be able to construct dictionaries where there are no words used in definitions that don't also have their own entries. To me this is a key feature of a dictionary - definition completeness.
And it's one of the reasons I'd like to know about the toolset used to construct dictionaries - I'd give anything to be able to throw a word set at these tools, have a custom dictionary constructed from it, and repeat until unity...
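I don't know what professional lexicographers use for this, but the completeness check itself is easy to sketch. The following assumes the dictionary is just a mapping from headword to definition text, and that a "word" is a run of ASCII letters. There is no handling of inflections, so "animals" would not match an entry for "animal". All names here are made up for illustration.

```python
import re

WORD_RE = re.compile(r"[a-z]+")

def undefined_words(dictionary):
    """Return words used in definitions that have no entry of their own."""
    heads = {h.casefold() for h in dictionary}
    used = set()
    for definition in dictionary.values():
        used.update(WORD_RE.findall(definition.casefold()))
    return used - heads

toy = {
    "cat": "a small animal",
    "a": "one",
    "small": "not large",
    "not": "a word of denial",
}
print(sorted(undefined_words(toy)))

# A two-word loop is "complete" by this test, even though it explains nothing.
loop = {"big": "large", "large": "big"}
print(undefined_words(loop))
```

"Repeat until unity" would then mean adding entries for whatever this returns and running it again until the result is empty.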
I actually assumed that was already a general rule, because any word I've ever cross referenced in a dictionary has always been there. In fact, in the diminutive "Little" Oxford[1] there are few enough words available for definitions that you can get stuck in a loop between two words! Although there's always at least one other synonym word in the definitions.
I'd really love to know the technology behind lexicography, as a state of the art. If anyone works in this field, any pointers you might provide are highly appreciated.
I understand that unity is a difficult thing to attain in dictionaries, but perhaps there are tools and methods used by lexicographers that I need to learn about, as an interested party.
Your Brain? I know it deteriorates if not used frequently and especially if you can rely on look up tables, but really, there is no replacement. Only supplements. Collections of cards in shoe cartons are used by libraries.
What else do you need? A hypergraph of word-vertexes and relation-edges animated in webgl, layered by categories and streamed from an elastic back end? That's your brain.
I'd really prefer something not web-based, but if you know of any tool like this, I'm all ears anyway. What I'd really want is to be able to construct lexicographic sets and save them in SQLite databases, for use in apps of course - but that may be a tall order. You know of tools like this?
I don't. I do have a background in dictionaries, though: I created The Online Slang Dictionary (http://onlineslangdictionary.com/) which is the eldest slang dictionary and thesaurus on the web. I've given some minutes of thought to turning the underlying code into a product...
What sort of data would you like to capture in the DB, and what would apps do with it?
I would like to have a kind of API that allows me to add a new word entry, add its definitions, describe the derivations (if any) of the word, and have a field for examples of the word's usage. Some sort of system that would allow me to take a corpus, deconstruct the terms and words used, create a definition for each word, and so on.
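As a starting point for that kind of API, here is a minimal sketch using Python's built-in sqlite3 module. The schema and function names are my own invention, not any lexicography standard: one table for entries (headword plus free-text derivation), one for definitions, one for usage examples.

```python
import sqlite3

# Hypothetical schema: entry -> sense (definition) -> example.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entry (
    id INTEGER PRIMARY KEY,
    headword TEXT NOT NULL UNIQUE,
    derivation TEXT
);
CREATE TABLE IF NOT EXISTS sense (
    id INTEGER PRIMARY KEY,
    entry_id INTEGER NOT NULL REFERENCES entry(id),
    definition TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS example (
    id INTEGER PRIMARY KEY,
    sense_id INTEGER NOT NULL REFERENCES sense(id),
    quotation TEXT NOT NULL
);
"""

def open_db(path=":memory:"):
    """Open (or create) a dictionary database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_entry(conn, headword, derivation=None):
    """Insert a headword and return its row id."""
    cur = conn.execute(
        "INSERT INTO entry (headword, derivation) VALUES (?, ?)",
        (headword, derivation),
    )
    return cur.lastrowid

def add_sense(conn, entry_id, definition, examples=()):
    """Insert one definition for an entry, plus any usage examples."""
    cur = conn.execute(
        "INSERT INTO sense (entry_id, definition) VALUES (?, ?)",
        (entry_id, definition),
    )
    sense_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO example (sense_id, quotation) VALUES (?, ?)",
        [(sense_id, q) for q in examples],
    )
    return sense_id

def lookup(conn, headword):
    """Return all definitions for a headword, in insertion order."""
    rows = conn.execute(
        "SELECT sense.definition FROM sense "
        "JOIN entry ON entry.id = sense.entry_id "
        "WHERE entry.headword = ? ORDER BY sense.id",
        (headword,),
    )
    return [r[0] for r in rows]

conn = open_db()
eid = add_entry(conn, "thaumaturge", derivation="(etymology text goes here)")
add_sense(conn, eid, "a worker of wonders", examples=["An example sentence."])
conn.commit()
print(lookup(conn, "thaumaturge"))
```

Passing a file path to `open_db` instead of the default gives a single .sqlite file that an app can ship and query read-only.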
The reason is that I'm fascinated with dictionaries - the physical kind - in general, and would like to incorporate some sort of dictionary/glossary system in most of my apps as a means of documenting the technical terms for the subject those apps address.
Another thing is that I'm a complete newbie to the subject of lexicography, and I'd love to learn more from like-minded individuals, on what kind of tooling and methodology is out there. I suppose I could go and look at the sources for such things as the OpenDict, sdcv (https://wiki.archlinux.org/index.php/Sdcv) and so on .. but surely there must be standards for the ways Websters and so on construct their lexicography. I'm just not familiar with these tools, so I'd hoped to have some insight from other like-minded individuals in this community, before I launch into a google-trail of my own investigation.
With your site (wonderful, by the way) - did you invent your own schema for the database, or are you using some standard tooling on the backend?
My ideal tooling would give me the ability to pipe a text file into it, construct a list of all un-defined terms/words/symbols, give the means of importing definitions from known sources for each word, and so on. Ultimately I'd love to be able to construct custom dictionaries for any given corpus, such that no term in that corpus is undefined; and to be even more meta about it, no term in the dictionary itself would be undefined. Obviously this is a highly iterative thing, as the undefined terms list grows with every definition added - which is why I think I have to understand the tooling better before I start constructing my own processes for this task.
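The iterative part of that can be sketched as a worklist. This is only an illustration under loud assumptions: `source` is a hypothetical word-to-definition mapping standing in for "known sources", words are runs of ASCII letters, and nothing is lemmatized. It keeps pulling in definitions, queues the words those definitions use, and stops when nothing new turns up.

```python
import re
from collections import deque

WORD_RE = re.compile(r"[a-z]+")

def tokenize(text):
    """Split text into lowercase words (ASCII letters only)."""
    return WORD_RE.findall(text.casefold())

def close_over(corpus_text, source):
    """Build a dictionary covering the corpus and its own definitions.

    Returns (dictionary, missing), where `missing` holds every term that
    the hypothetical `source` could not define.
    """
    dictionary = {}
    missing = set()
    seen = set()
    queue = deque(tokenize(corpus_text))
    while queue:
        word = queue.popleft()
        if word in seen:
            continue
        seen.add(word)
        definition = source.get(word)
        if definition is None:
            missing.add(word)
            continue
        dictionary[word] = definition
        # Each new definition may introduce more words that need entries.
        queue.extend(tokenize(definition))
    return dictionary, missing

source = {
    "cat": "small animal",
    "small": "not big",
    "animal": "living thing",
    "not": "not",
    "big": "not small",
}
built, todo = close_over("Cat", source)
print(sorted(built), sorted(todo))
```

To pipe a text file in, the corpus argument could be `sys.stdin.read()`. The `missing` set is the list of terms still to be written by hand, and the loop always terminates because each word is processed at most once.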
Looking up the definitions of obscure words is one of the tasks where I hardly ever go any further than the very first page of the search engine results.
That is, I don't even click on the links: the condensed summaries usually contain the definitions (from multiple dictionaries).
I'm not sure how this works out for the actual content providers.
There's a well-known narrative that before the Internet, for all of human history, information was scarce and humans found ways to adapt; and since the Internet, information has been overwhelming and humans must find different ways to adapt.
What is disappointing to me is that, with so many more options, people have not adapted by utilizing only the best information. Anyone can read the best sources (with a major exception; see below), but instead they choose more and more crap and now even actively delegitimize the better sources in favor of propaganda. In other words, why hasn't the vastly increased competition in the marketplace of ideas yielded far superior knowledge? Epistemology should be one of the hottest words and hottest subjects of the day.
I hypothesize that it's a failure of the intellectual elite. Instead of spreading their incredible wealth to the world over this new medium, they kept it to themselves behind high paywalls (science journals, OED, JSTOR, etc.). And instead of defending the values of and passion for knowledge and intellect, of the Enlightenment, scholarship, and reason, many I see and talk to adopt the trendy anti-intellectualism, bizarrely undermining their own reason for being.
I know many people will say that they can get a dictionary for free, why use the OED. The problem is that those who know better are not standing up to assert why it is superior, necessary, and incredibly valuable to the world.
> I know many people will say that they can get a dictionary for free, why use the OED. The problem is that those who know better are not standing up to assert why it is superior, necessary, and incredibly valuable to the world.
There are a couple of problems with this.
First, the OED is primarily a work of historical scholarship. Anyone in the market for a dictionary so that they can look up the meanings of words would be better off getting something else, like Merriam-Webster or really anything. The OED is not a good choice for this purpose.
Second, a story: I spent a semester studying Chinese in a foreign student program at a Chinese school. Like everybody else, I used Pleco for my dictionary needs.
My friends in the class quickly noticed that I was getting much better dictionary value from my installation of Pleco than they were getting from theirs. When one of them asked how I was doing it, I responded "I bought for-pay dictionaries" and they immediately lost interest.
There was no need for me to assert that the dictionaries I was using were superior and valuable -- my friends had already come to that conclusion themselves by watching me. They just weren't willing to pay $20 for that additional value.
> First, the OED is primarily a work of historical scholarship. Anyone in the market for a dictionary so that they can look up the meanings of words would be better off getting something else, like Merriam-Webster or really anything. The OED is not a good choice for this purpose
The OED is excellent for this purpose, due to its completeness, and other, even “unabridged” dictionaries fall far short of it in this use.
Of course, you have to have pretty particular needs in the words you look up to be frequently hitting the space where the OED is needed for this.
The OED may be excellent for this purpose in the same sense that a Juicero press was excellent for compressing a bag of vegetable mash, but, much like the Juicero press, that doesn't make it a good choice in the usual case.
I personally find the OED just as useful as many of the people commenting here to defend it. I am interested in the content and it's something I would be happy to own and occasionally consult. (Though not to purchase.)
But unlike most everyone here, I'm acknowledging that the value I get from the OED is purely entertainment value, not utility.
I'll note also that it obviously cannot be true that "many, many people" find the OED useful, as the total number of people who access the OED more than a couple of times a year is less than "many, many".
> I'll note also that it obviously cannot be true that "many, many people" find the OED useful, as the total number of people who access the OED more than a couple of times a year is less than "many, many".
I don't access an optometrist more than once a year.
I find optometry very, very useful to me. Frequency of access and value derived from a thing have no necessary relationship.
> the OED is primarily a work of historical scholarship. Anyone in the market for a dictionary so that they can look up the meanings of words would be better off getting something else
I couldn't disagree more strongly. I use it to learn the meanings of words and there is no better source:
If you want to know the meaning of words hundreds of years in the past, the OED is a good choice. That question is relevant to almost no one. If you want the current meaning, the OED is a poor choice.
I'd disagree with "almost no one". Lots of people still read books and texts that were composed in the past - it's even compulsory at school - and lots of people have a hobbyist interest in etymology. If I look a word up in a smaller dictionary than the OED I usually find it doesn't tell me anything I didn't already know, or it just raises questions that send me to the OED anyway. Of course you wouldn't use the OED to look up a technical term. Wikipedia would be better for that.
What I really love about the OED is how it keeps getting better even in its coverage of old words. There are lots of words which ten or twenty years ago had the etymology given as "unknown" but which now have a detailed explanation, thanks to recent research. (If you want an example, I think the word "koala" is one.)
It depends on how fast the meaning of that particular word has changed.
Tangentially, is there a better use case for moving to a database than 20+ volumes of books? To put it another way, if someone had a database and said, 'hey, let's print this thing out - it will be around 20 large books - and sell it!' ...
Because that's cheating! I'm half kidding but you are arguing about price, mainly. Not why the cheaper (free?) content is lacking. You could also just afford a personal teacher or, taken to the extreme, a personal translator so you wouldn't need to learn the language at all, which would be cheating. Except that you might have a personal desire to speak the language. What's the problem with Chinese that free offerings are inferior?
It is just chaotic is all I could infer from a first look. I mean English can be pretty messy already and maybe a specific Chinese dialect will be more regular than the bigger picture of the whole language. It's not that your coeds were being cheap, perhaps, it could just be disappointment for something as basic to cost anything at all, and relief that it's not their personal shortcoming, but just an externalized advantage. You seem to say not even $20 was low enough, have I got that right at least?
> Maybe a specific Chinese dialect will be more regular than the bigger picture of the whole language.
Standard written Chinese is based on the grammar of spoken Mandarin. If someone mentions a Chinese dictionary without mentioning a particular dialect, they're almost certainly referring to Standard Written Chinese (essentially Mandarin).
Also, Chinese is less a single language than the Romance languages are a single language. The more far-flung dialects share less in common than, say, Italian and Spanish. If Portugal were a province of Spain, Portuguese would probably be considered a dialect of Spanish. As my Linguistics professor used to say, "A language is a dialect with an army." One rarely deals with "the whole Chinese language", if by that you mean the union of all of the dialects. That would be like dealing simultaneously with Romanian, Portuguese, Romansh, French, Italian and Spanish as a single entity.
> It's not that your coeds were being cheap, perhaps, it could just be disappointment for something as basic to cost anything at all
Dictionaries are not basic in the slightest. I'm constantly frustrated that the state of the art in Chinese/English dictionaries is still not very good.
Here's an example. The Mandarin word for "protect" is 保护 bǎohù. As a matter of semantics, protecting involves three roles: (1) the protector; (2) the beneficiary; (3) a danger to be warded off. In English, (1) is marked by being the subject of the verb protect, (2) is marked by being the object, and (3) is optionally marked by being the object of a complementary prepositional phrase headed by from: in
1E. I will protect her from going hungry.
role (1) is "I", role (2) is "her", and role (3) is "going hungry".
A quality dictionary will include all of that information if you look up protect. But the state of the art in Chinese/English dictionaries is to note that 保护 is a verb, that it means "protect", and to provide a few example sentences, none of which feature role (3) at all. I had to ask a Chinese person how to indicate the danger involved in protecting, which it turns out is marked by an entire complementary clause:
1Z: 我保护她免挨饿 wǒ bǎohù tā miǎn ái'è
Translating the syntax directly into English, this is something like "I protect her to avoid going hungry". There is no possible way of learning the correct usage of 保护 from a Chinese/English dictionary at the moment, as none of them saw fit to include this information. You're certainly not going to get there by analogy to English.