Machine learning’s crumbling foundations (pluralistic.net)
208 points by Anon84 on Aug 22, 2021 | 101 comments



It's a structural issue caused by the way wealth creation works for the majority of people in tech: job hopping, trendy frameworks on the CV, "high-impact" projects done ASAP, etc.

No one wants to do boring, slow-paced work with lots of planning, reflection and introspection. And why would they? These kinds of jobs are usually the worst paid. We, the practitioners, have every economic incentive to go the other route.

The problem goes far wider in tech than just ML. And unless society collectively learns to appreciate patience and long-term thinking as virtues above all else, it won't go away any time soon. What can be done is to discourage the use of ML systems wherever an explainable, deterministic system can be used instead (even one developed in a rush). Credit scoring, for example: rules are good while a black-box artificial neural network isn't, even if the NN is some percentage more accurate. If the rules turn out to be bad, they can be amended, and in special cases customer support can override them based on (hopefully unbiased) human judgement.

The COVID-19 detection from radiology scans mentioned in the article is an example of a system that needs ANNs, because image processing is a very difficult problem for rule-based AI. While techniques such as SHAP can help, a radiologist still needs to check the output, because ANNs very often learn useless noise and the prediction can be nonsensical. Here it would be best to use PCR tests, serology or any more traditional, "boring" tool, because those work. Luckily that is the case, and shit CNN models start and end their lives in some useless paper.


I saw a large organization that was the epitome of this -- Executive Directors would propose ambitious ML projects, Directors would create plans and teams, Managers would execute on budgets and create more detailed plans, and then... someone actually needed to do the work.

Because of the length of the effort, the annual compensation had already been handed out: the EDs, Directors, and Managers had already "extracted" their compensation for the project, and usually none was left for the workers who eventually needed to do the actual work.

Not unexpectedly, a rough job was somehow jammed thru with understaffed, underpaid, and unmotivated low-level workers to actually "deliver" on the "AI" projects -- so victory could be declared at the top level...and new projects could begin.

This isn't an ML problem; I'm sure the whole cycle has been repeated with the technology-of-the-day, generation after generation. It has more to do with governance and the organizational maturity to measure real impacts.


That sounds truly awful. Not necessarily surprising -- but could you give us some clues as to which company this was, so that we can avoid working there?


This is a common enough occurrence that you probably want to learn how to spot bad setups, wherever they might be. I think the key is to discern Value vs. Vanity. You want to be on value-add projects (those producing revenue, reducing risk, increasing speed, or reducing cost), not on Vanity projects.

The trouble is that differentiating Vanity vs. Innovation is hard. You can discern them, though, in two ways I think:

1. By the level of motivation of the low-level workers: true innovation is exciting, while underfunded vanity projects are soul-crushing.

2. By the seeming intentions of senior management -- are they more focused on the stated goal or on press/buzz?

I don't have a sufficient n to come up with hard and fast rules, but I'd love to hear others' thoughts.


I've been in a few over the decades, and it sounds like every tech company after the glory/startup years.

Was even in one "hype" startup that began this way almost immediately.


Not grandparent but that orgchart sounds like.. a (possibly government related) tech org somewhere in the Commonwealth.


Exactly.

Machine learning, and "data driven" business leadership, is being treated as a get-rich-quick scheme, as if it's low-hanging fruit that's easy to do.

When in fact it's been known under different names for a very long time: quantitative management.

And the reason it wasn't popular before is that it's very tricky to pull off.


I had the same comment in a post about

Simple Systems Have Less Downtime (2020)

https://news.ycombinator.com/item?id=28061998


Why would anyone care to fix things? The way they are is perfectly amenable to the blame- and conclusion-laundering many ML clients seek.


Sadly PCR tests for COVID also test positive for flu and half a dozen other causes. That's why CDC/FDA are seeking proposals for a new test that actually works!

https://www.cdc.gov/csels/dls/locs/2021/07-21-2021-lab-alert...


You've fallen for the internet. Please restart and try again.

https://www.reuters.com/article/factcheck-covid19-pcr-test-i...


Sigh. The deniers and antivaxers will have made up their minds already, and this will just be perceived as part of the mass media coverup. It's hopeless.


You've been repeating this, but it doesn't seem to be true...

https://news.ycombinator.com/item?id=28262833


That is simply false on a basic level. That notice from the CDC says the EXACT OPPOSITE of your comment. They're recommending that labs switch to a multiplex test that can screen for both flu and COVID at the same time, because PCR only detects SARS-CoV-2.


Note: the "multiplex test" is most likely still a PCR test (just 'multiplex PCR' instead of 'single-probe PCR'), so where you say "PCR only detects SARS-CoV-2" it should say "the currently used PCR test only detects SARS-CoV-2".


My favourite example of bad data in for machine learning is the tragic tale of Scots Wikipedia: https://www.theguardian.com/uk-news/2020/aug/26/shock-an-aw-...

It turned out an enthusiastic but misguided US teenager who didn't actually know the Scots language was responsible for most of the entries on it... and a bunch of natural language machine learning models had already been trained on it.



Scots pretty much is a dialect of English that is phonetically spelt out - it's not surprising that a US teenager could write it.


>"Scots pretty much is a dialect of English that is phonetically spelt out - it's not surprising that a US teenager could write it."

No, there are a number of distinct linguistic features of Scots, and it has its own regional dialects, e.g. Doric, Orcadian, Shetland (which is also in part based on the extinct Norn language). See e.g. https://dsl.ac.uk/about-scots/history-of-scots/ (and sub-pages such as https://dsl.ac.uk/about-scots/history-of-scots/grammar/ ) for further information. Simply doing a dictionary-lookup word-replacement completely misses all of this nuance.


> One common failure mode? Treating data that was known to be of poor quality as if it was reliable because good data was not available… they also use the poor quality data to assess the resulting models.

This drives me nuts. Spend $10k getting high quality data and throw a simple model at it? Nah, let’s spend a month of time from someone making $400k/yr for less trustworthy results. And on the blogosphere it’s even worse. ‘This is the best data available so here goes’ justifies so much worse-than-worthless BS.

And don’t even get me started on the ‘better than human’ headlines that result.


I feel like it's even worse than just resorting to bad data when good data isn't available: the field of deep learning has cultivated the perception that robustness to bad data is one of its hallmarks.

That is, you can pump relatively raw data into it and it will self-select features and then self-regulate their use, so most of the initial steps of data cleaning, feature selection, etc. are supposedly unnecessary, or require less expertise. This is now spilling over into general ML, so when quacks assert that their model just magically overcomes these things, people actually believe it.


The irony is that there are techniques for dealing with noisy/mislabeled/bad data (e.g. gold loss correction [0], errors-in-variables models [1]), but that stuff isn't "sexy" and not enough practitioners know about it.

0: https://arxiv.org/abs/1802.05300

1: https://en.m.wikipedia.org/wiki/Errors-in-variables_models
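
For anyone curious, the errors-in-variables idea can be surprisingly small in code. A minimal numpy sketch (illustrative, not taken from either reference) of total least squares, the simplest errors-in-variables estimator, showing how ordinary least squares gets biased when the inputs are noisy too:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x_true = rng.uniform(0, 10, n)
    y_true = 2.0 * x_true + 1.0
    x_obs = x_true + rng.normal(0, 1.0, n)   # noise in the predictor too
    y_obs = y_true + rng.normal(0, 1.0, n)   # noise in the response

    # OLS ignores the noise in x and attenuates the slope toward zero.
    slope_ols = np.polyfit(x_obs, y_obs, 1)[0]

    # Total least squares: minimize orthogonal distances, via the smallest
    # right singular vector of the centered data matrix.
    X = np.column_stack([x_obs - x_obs.mean(), y_obs - y_obs.mean()])
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    a, b = vt[-1]                 # normal vector of the fitted line
    slope_tls = -a / b

    print(f"true slope 2.0, OLS {slope_ols:.2f}, TLS {slope_tls:.2f}")

With roughly equal noise in both variables, TLS recovers approximately the true slope while OLS underestimates it.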


This.

The thing is that some of the techniques commonly applied when training NNs are often "good enough" to deal with the presence of corrupted data (e.g. using SGD to optimize a model while applying weight decay and dropout adds a regularization effect that somewhat replicates the effect of assuming errors-in-variables), as long as the input data is not total trash. That deters people from applying more formalized robust approaches.

As long as "things kind of work", it is difficult to convince other people to adopt robust methods, particularly due to the existence of a "robustness vs. efficiency" trade-off (which can make robust methods seem additionally "unsexy").
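
For concreteness, the "good enough" recipe described above is basically this (a minimal PyTorch sketch; the hyperparameters are illustrative, not recommendations):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(64, 128),
        nn.ReLU(),
        nn.Dropout(p=0.2),       # randomly zeroes activations during training
        nn.Linear(128, 10),
    )

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.9,
        weight_decay=1e-4,       # L2 penalty on the weights
    )
    loss_fn = nn.CrossEntropyLoss()

    def train_step(x, y):
        model.train()            # enables dropout
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()

None of that models the label noise explicitly; it just regularizes hard enough that moderate corruption often doesn't sink the model.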


Is this any more than fancy outlier detection? (Genuinely asking, this is not my field.)

i.e. if the majority of data fed to a system is bad, will it work?


They do different things. Both are useful.

Outlier detection detects data points that look different from your existing data in some way; that "lie outside" what is usual. Sometimes the assumption is that outliers are generated by an entirely different process from the rest of the data. It's important not to conflate "outliers" with legitimate data points that happened to fall at the tail ends of the data distribution.

The techniques I listed attempt to actually compensate for some known or estimable level of badness in the data. For example, gold loss correction (GLC) essentially estimates the probability of misclassifying a data point, and uses that to adjust the output of the model.
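
To make the GLC mechanics concrete, here's a rough numpy sketch of the correction idea (my paraphrase, not the paper's exact recipe; names are illustrative): estimate a corruption matrix from a small trusted "gold" set, then score the model's true-class predictions against the noisy labels only after pushing them through that matrix.

    import numpy as np

    def estimate_corruption_matrix(p_model_on_gold, gold_labels, k):
        """Average a noisy-trained model's predicted distributions over gold
        examples of each true class to fill in one row of C per class."""
        C = np.zeros((k, k))
        for i in range(k):
            mask = gold_labels == i
            C[i] = p_model_on_gold[mask].mean(axis=0) if mask.any() else np.eye(k)[i]
        return C / C.sum(axis=1, keepdims=True)

    def corrected_cross_entropy(p_true, noisy_labels, C, eps=1e-12):
        """p_true: (n, k) model outputs over *true* classes.
        noisy_labels: (n,) observed, possibly corrupted labels.
        C: (k, k) estimated corruption matrix; C[i, j] ~ P(noisy j | true i)."""
        p_noisy = p_true @ C      # implied distribution over the noisy labels
        picked = p_noisy[np.arange(len(noisy_labels)), noisy_labels]
        return -np.mean(np.log(picked + eps))

The model is trained on the cheap noisy labels, but the loss "knows" roughly how the labels were corrupted, which is a different job from flagging points that merely look unusual.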


You could use this article's underlying thesis to explain why a lot of tech companies fail as well.

Google Health is a good example of failing to appreciate specialization and domain-expertise. Trying to draw value from broad generic data collection when IRL it requires vertical-focused domain-oriented collection and analysis to really draw value.

Funnelling everything into a giant pool of data only had so much value - reducing it to just a proprietary API integrations platform in exchange for valuable data.

This AI analogy extends to healthcare in real life: The job of any generalist doctor is largely just triaging you to the specialists. You reach the limits of their care pretty quickly for anything serious.

AI is much the same way: the generic multipurpose tools tend to lose value quickly past the surface-level stuff, before requiring heavy specialization. Google's search engine is full of custom vertical categorization, where simple PageRank wasn't enough.

This is why startups can be very useful to society: they get forced to focus on smaller issues early on, out of pure practicality, or quickly die off if they try to bite off a bigger problem than they can chew.

Almost every major multi-faceted business started off with a few 'whales' on which they built their business.

Most of the biggest startup flops have been the ones that took VC really early before doing the dirty hard work of truly finding and understanding the problems they are trying to solve.


> Google Health is a good example of failing to appreciate specialization and domain-expertise. Trying to draw value from broad generic data collection when IRL it requires vertical-focused domain-oriented collection and analysis to really draw value.

I'm not sure I agree with this statement. From what I've heard, Google Health employed a huge team of doctors and they were included through the entire feature development lifecycle, similar to how the product org functions in other software companies.


Hiring a broad set of domain experts != a domain/vertical focused business. ‘Doctors’ can cover a massive disparate field of study.

My point is they did it backwards, they should have found real world healthcare problems to solve then built the common ground between them. Building a generic API platform or cloud database turned out to not be the problem anyone needed help solving. Most companies who did the integration to Health did it for marketing, not because it was essential to any business value.

How many companies have done “AI” merely for marketing too?

Google Search ranked websites better than anyone; they zeroed in on that one problem and removed all the cruft, while Yahoo and others were jamming as much crap into their 'portals' as possible. Google seems to have forgotten that lesson.

Waymo fell for this too. They built an entirely new type of car and gambled on a whole new taxi service (among other promises) that would entirely disrupt transportation -- as the starting point. Innovation rarely jumps ten steps ahead like that. They chose to solve a thousand problems at once, while the rest of the world, with actual delivered products, is still struggling to solve even assisted highway driving in high-end luxury cars... cars people were going to buy anyway.


Algorithm:

1) Decide to take over Domain X.

2) Hire a bunch of people from Domain X. Don't hire anyone who doesn't agree that you can take over X.

3) Make them report to the people whose idea it was in the first place. If they say "Hey, maybe this wasn't such a great idea" then push them out, as an example to the rest.

4) FAIL.

Note that #1 is the key. The decision to do it precedes the hiring.


It sounds like cherry picking bad examples to me. Likewise you could say "programming's foundations are crumbling" by citing all sorts of programming projects that use bad or faulty code.

Meanwhile, speech recognition seems to work extremely well by now (I am a little bit older, so I remember when it didn't work so well).

I am also not aware of any real world cases of AI being used to detect Corona, so that seems to be an example in favour of AI. People tried to use AI, but it didn't work out. So it isn't being used for that purpose.


> Meanwhile, speech recognition seems to work extremely well by now (I am a little bit older, so I remember when it didn't work so well).

*provided you speak English or Mandarin, the former preferably of a continental US variety

It's astonishing how bad things get again once you mix in an accent, local dialect (e.g. Swiss German) or a less frequently spoken language (like Croatian).


Nevertheless, the huge jump is from "does not work at all" to "it works". It seems likely that the technology that worked for English will also work for many other languages.

As for Chinese, it is also pretty amazing that you can visit a Chinese website, click "translate" in your browser's menu bar, and get a reasonably readable translated version.

I wonder if people just take too many things for granted.

Or internet search - they say the quality of Google searches has been declining; nevertheless we've had a pretty good run for the past 20 years or so of being able to find information on the internet. That is AI as well.


> It seems likely that the technology that worked for English will also work for many other languages.

It won't for the foreseeable future. Not for technical reasons; it's just that other languages are usually not handled correctly, because most companies think they can use the exact same approach as in English and be done.

Until they realise that non-English native speakers also use English words and abbreviations to some degree, both in IT-related contexts and in everyday life. Now the system doesn't just need to handle that one language, but also English with an accent. If they're lucky it'll work reasonably well in most cases, despite variations depending on the region.

Right now even keyboard completion suggestions struggle with mixing languages and become completely useless in some cases. As English words may be mixed in at any location (and in wildly different frequencies depending on the user) the software now has to guess the language for every single word. The results are not great.

> they say the quality of Google searches has been declining, nevertheless we had a pretty good run for the past 20 years or so

As long as Google continues with blunders like showing the wrong pictures of people in infoboxes, they'll keep failing hard. Their amazing AI shows wrong pictures for serial killers, rape victims and more, which has already led to consequences for those people. What makes it much worse is that when someone complains about such a case, Google will just replace the picture with another wrong portrait -- if they react at all. It would be helpful if those big tech companies would for once trust human intelligence instead of throwing larger models at the problem.


Maybe people using Google should apply some common sense and not take everything at face value. And the examples you cite are extremes that affect only a few people. Would you rather have no internet search engines at all, so that those problems could be avoided?

Isn't that a bit like saying cars are crap because people die in accidents? Maybe there are just upsides and downsides to most new technologies, and if the upsides outweigh the downsides by far, people will go for it?

As for human intelligence, I am not convinced humans would necessarily fare better at such tasks. I mean they fall for the "same name, same person" fallacy.


“Google should not use badly trained beta ML to guess which person in the world with this name is a serial killer” is not “there should be no search engines.”

Google ran a very successful search engine for a long time before it started in on the former.


So they went overboard with that feature. But I really doubt humans would do much better. I think if they encounter somebody with the name of a known serial killer, most people would at least pause.


Why would you expect dialects with vastly fewer training examples to be on par with the most widely spoken languages? It's a simple matter of available data, and the state of the art architectures operate on a paradigm that scales quality of the model to quantity of training data.

If you want better speech recognition for Swiss German, then record and transcribe hundreds of thousands of hours, or whatever it takes to reach the level of parity you want.

It's not "astonishing" at all. Models won't generalize well unless they have sufficient data, so to achieve multi-accent functionality we need lots more high-quality data. Or we need better architectures, so identifying where models fail and engineering a better architecture could be a breakthrough. The shortcomings are not surprising or mysterious at all; it's simply a function of the nature of these algorithms.


> it's simply a function of the nature of these algorithms

Addendum: don't overlook the incentives and biases of the people building said algorithms.


I think you may be conflating the manner in which a tool is used with the tool itself. Incentives and biases are irrelevant to the scaling paradigm of the transformer architecture, for example.

I don't think there's a single valid example of a biased or racist architecture as such. An algorithm can be seen as a particular use of an architecture, and like every human endeavor can be done well or badly.

The infamous tank detector neural network was biased towards clouds. Microsoft Tay was biased towards troll induced garbage. Neither bias says anything about the architectures underlying the implementation except that the tool was used poorly.

I think we should leave the discussions of incentive and bias at the level of particular implementation, as abstract architectures can't be generically adjusted or affected by biases or ethics or moral considerations. The selection of training data and intended functionality and particulars of a project are where biases and other considerations arrive on the scene. The ideas of transformer or other neural network models don't have any aspects where you can add in ethical considerations - they're fundamentally amoral abstractions.


> programming's foundations are crumbling

That's also correct, and has been for some time (it has gotten worse with each tech boom). This may just be a special case of that.


What do you mean? At least from the point of view of the end user, apps seem to become better over time.


I was tempted to just downvote this, but I thought I'd reply instead:

No, they do not. An existing version of an app may get better over time, but unfortunately it then gets replaced with a different version, which starts from the position of extreme bugginess.

In the case of Microsoft Office apps, for instance, one could easily argue that they are steadily getting worse as more and more features are added.

Google Chrome is pretty clearly getting worse in terms of the amount of memory it uses. I could go on.


So why not go back to some old version of it? I don't think "memory consumption" is necessarily a good indicator, because sometimes using more memory is a sign of good optimization.

Also how is the memory consumption if you turn off all modern features?


> So why not go back to some old version of it?

Because the old version doesn't work due to DRM/it depending on a remote API version that's no longer available/it's just flat out unavailable/etc...

> Also how is the memory consumption if you turn off all modern features?

It's cute you think you /can/ turn off the modern features in a lot of today's garbage.


Pretty sure you can turn off a lot of things in modern browsers, if you find the hidden settings menu. For sure you can turn off things like JavaScript or video.


Pro tip: In fact, turning off JavaScript on a specific site is often a good way to get past their paywall.

On other sites, it just makes it not work. YMMV.


I hadn't used Word/Excel for many years, and was f*cking appalled to discover that they have TWO levels of menus now.

This comes from generations of PMs and engineers who need to add features to justify their existence. No one has any incentive to keep things simple.


For some features, like autosaving and cloud sync, sure. Others like doing things in bulk, macros, plugins, not seeing ads, control over updates, all that stuff is vanishing.


The infosec meltdown sure seems to indicate programming's foundations have crumbled. All of the unsafe C library code underlying nearly every modern system is unsafe at any speed.


I've seen this in a corporate setting: a machine learning model trained to automatically apply categories to new content, based on user-selected categories for existing content... which failed to take into account that the category list itself was poorly chosen, so the user-selected categories didn't have a particularly strong relationship to the content they were classifying.


> The disdain for the qualitative expertise of domain experts who produce data is a well-understood guilty secret within ML circles, embodied in Frederick Jelinek’s ironic talk, "Every time I fire a linguist, the performance of the speech recognizer goes up."

This reminds me of how the chimp sign language studies got much better results from hearing evaluators than from deaf ones.


Doctorow seems to be missing the meaning of that quote (which is also not the title of a talk, ironic or otherwise). It was specifically a comment on the usefulness of computer language models created manually based on linguistic theories of grammar versus ones in the same model family created automatically from real-world data -- the latter tended to work better. These days I usually hear it quoted more broadly as a warning about the danger of encoding too much possibly-wrong domain knowledge in an ML system when a more generic model and the training data are sufficient to learn the useful parts on their own.

Neither of those translates into disdain for qualitative understanding of the underlying reality behind the data set, which is one of those things that everyone knows is important. The problem is that such understanding is actually hard, and easy to mess up even when you're trying.


The article isn’t very clear about when harm has been done. It’s unclear which machine learning models reached production and whether human safety was on the line, like it would be for a bridge or a driverless car.

For example:

> Hundreds of ML teams built models to automate covid detection, and every single one was useless or worse.

That’s bad, but it doesn’t seem to mean there were hundreds that made it to production use? Drilling down, there is this bit from Technology Review [1]

> That hasn’t stopped some of these tools from being rushed into clinical practice. Wynants says it isn’t clear which ones are being used or how. Hospitals will sometimes say that they are using a tool only for research purposes, which makes it hard to assess how much doctors are relying on them. “There’s a lot of secrecy,” she says.

So clearly there was a lot of research that wasn’t immediately useful, but in the end it’s not clear how much reached production, whether it was critical to any health decisions, or whether people were harmed by it.

[1] https://www.technologyreview.com/2021/07/30/1030329/machine-...


“Everyone wants to do the model work, not the data work”


I don't know how it is in other fields; I'm a linguist who made the transition to computational linguistics back when you had to be a linguist to be a computational linguist (the 1980s). Slow forward to statistical (and now neural) comp ling: I find it incredibly boring. But the data work still needs to be done, and there are still linguists. And even more than me, they find computational linguistics (of whatever type) less interesting than "real" linguistics. So they will do data work, and willingly.


Which is sad because data work can lead to real domain knowledge, while fitting a grab bag of generic models teaches you nothing by itself (wooo, this thing has 0.0003 higher AUC than that thing!)

Fitting generic data science predictive models is such a rote task these days that there's a crowd of start-ups begging you to pay them to automate it for you.


The sad part is that while there are still people who prefer the data work in some fields, it's not valued, since the model people have decided the data is a commodity!

Result: they too have to move to model work!


I ask the following as someone who builds and tests models and also annotates data as a domain expert. Is labeling really undervalued by society? Or just by VCs?

I mean, if society depends more on the labeler (e.g. radiologists) why should society reward people for trying to replace the radiologists, regardless of the data quality?

From a societal perspective where human factors scientists tell us that we need people to actually be employed to achieve a sense of self-worth and happiness, shouldn’t we punish labelers who might otherwise only enrich the capitalists and undermine the health of the nation’s workforces, and thus the wellbeing of the nation as a whole? Did we learn nothing from the underemployed, disaffected, demoralized, suicidally depressed Trump electorate?

The Trump presidency may be a hot mess from which the country may never recover, but are these not the lessons that we ostensibly learned, that were the topic of millions of gallons of ink between 2016 and 2018?


I work in a company where ML has made a considerable difference to our bottom line (the search component of an e-commerce site). When I joined the company, search was so bad it was easier to just use Google with 'inurl:' to actually find products on the site. Now, years later, the built-in search gives you what you're looking for better than Google does. (This is important because if you can't find something quickly, you're more likely to shop elsewhere.)

If you'd seen "ML done right", you wouldn't use the word "crumbling".

That said - I won't deny ML is over-hyped. It works for very specific problems and in many cases the best solution is a non-ML one. Knowing when NOT to use ML is just as important as knowing when to use it.


Does ML really work for you here vs elastic search or other full text search?


I'm not 100% familiar with the details - there's a ML component that calculates some stuff to narrow the ES search down (like categories), then at the end there's another ML system that re-ranks things. (ES alone doesn't give the best results)
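
For what it's worth, that kind of setup usually looks something like this (my guess at the shape, not the poster's actual system; `search_es`, `score_model` and the feature names are hypothetical):

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Candidate:
        product_id: str
        text_score: float       # score from the full-text engine
        features: List[float]   # e.g. category match, popularity, price band

    def rerank(query: str,
               search_es: Callable[[str, int], List[Candidate]],
               score_model: Callable[[str, Candidate], float],
               top_k: int = 200) -> List[Candidate]:
        # Stage 1: cheap, recall-oriented retrieval (ES/BM25 or similar).
        candidates = search_es(query, top_k)
        # Stage 2: precision-oriented re-ranking with the learned model.
        return sorted(candidates,
                      key=lambda c: score_model(query, c),
                      reverse=True)

The full-text engine keeps recall cheap; the learned re-ranker only has to order a few hundred candidates rather than the whole catalogue.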


> Ethnic groups whose surnames were assigned in recent history for tax-collection purposes (Ashkenazi Jews, Han Chinese, Koreans, etc) have a relatively small pool of surnames and a slightly larger pool of first names.

This is... not accurate. The reason the Chinese have a small pool of surnames is that their surnames are much less recent than ours, not more recent.

And I don't think the Ashkenazi surnames are particularly more recent than the surnames of the Europeans they lived among. Rather, they have concentrated surnames mostly because they were occupationally concentrated.


Also, the purpose of Republican voter purges is not, particularly, to find people who have double-registered. It is more useful to the GOP to have a ton of false positives. Having a huge headline number allows them to claim that voter fraud is rampant ("Over a million people double registered!!!"). It also allows them to challenge the registration of many voters that the GOP doesn't like. Whether they've actually succeeded or not in finding double registration, these challenges raise the bar of voting difficulty for the other side.


Most Chinese surnames are ancient, but their romanizations are not:

* Names with the same pronunciation (homonyms) will probably end up with the same romanization. No romanization system has ever completely solved this problem, except by assigning arbitrary spelling differences.

* People will probably tell authorities the pronunciation of their name in their native language (Mandarin/northern dialects, Cantonese, Hakka etc.), which creates different-sounding versions of the same name. Worse, these romanizations are far from unified and don't correspond to the standardized romanization systems. Familiar example: Lee for the name 李 (Pinyin: Lǐ) and its homonyms.

* There are multiple romanization systems in use, which also yield different versions of a name, even if they all sound the same. A familiar example is Mao Zedong, whose name is Mao Tse-tung according to the Wade-Giles system, from before Pinyin romanizations of names became commonplace. For Cantonese names, a dizzying amount of romanization systems exist.

All of these render any survey of Chinese names in the diaspora extremely difficult, and most statistics complete garbage. For some families it might be impossible to recover the actual surname.


None of that is relevant here - every effect you list (well, not the first one) tends to increase the perceived variety of Chinese surnames, while the observation we're explaining is a lack of variety. That lack of observed variety is due to an actual lack of variety which your effects have failed to mask. And that actual lack of variety is due to the age of the system.


If your point is the age: most of these names are ancient and can be traced back earlier than the first millennium BC. That is long enough for names to actually start dying out. Also, family names carry great significance in East Asian cultures*. They can carry great prestige, but also infamy. People often changed family names to become less associated with disgraced people. Emperors awarded their surname to loyal and meritorious commoners, who in turn gave it to their followers. Sometimes whole populations adopted them. This happened to the Li (李), Chen (陳) and Wang (王) surnames.

It's true, though, that there are not that many to begin with. It's just a quite restricted set of words, and because of the writing system there is no variety from spelling differences. Most surnames are only one character long, and the really long ones are mostly transliterations of non-Han surnames. Also, many non-Han populations were assigned a common surname when they became sinicized.

*: Western family names are mostly rooted in patronymics, place names, professions or adjectives.


Yeah, that was an odd claim about Han Chinese surnames. Many have been around for thousands of years (https://www.chinadaily.com.cn/ezine/2007-07/20/content_54412...) and almost all are single-character surnames based on a limited set of possible sounds (~400 in Mandarin, IIRC)


There are still double-character surnames, though not as many as there used to be.

That's evolution for you. Some surnames are big winners, some go extinct.

(It's also worth noting that Chinese surnames are highly concentrated in a sampling sense -- there are just a few surnames which cover large chunks of the population -- but if you made a list of names, as opposed to a list of people, the pool of names would look much larger.)


I don't know anything about Chinese surnames, but their paucity cannot be due to a limited set of possible sounds. First, I would interpret "sounds" as phonemes (including tones), and there are far fewer of those than 400. More likely what you mean is the number of combinations of phonemes into valid Chinese Mandarin monosyllables, of which I cannot imagine there being only 400. In any case, there are (from what little I've heard) lots of bisyllabic Chinese words. Can't they be represented by single (or double) characters? There are thousands of commonly used Chinese characters, and tens of thousands more uncommonly used characters.


> More likely what you mean is the number of combinations of phonemes into valid Chinese Mandarin monosyllables, of which I cannot imagine there being only 400.

That's your problem, not ilamont's. The limited syllable inventory of Mandarin Chinese is very well known. No need to stretch your imagination over it.

That said, surnames are not limited by the number of syllables for the obvious reason that the spelling is part of the surname.


No, the "only 400 syllables" refers to the syllables without taking into account tones. But tones are as much a part of Mandarin syllables as coda consonants; taking the tones into account, there are over 1200 distinct syllables.


You are onto something here by mentioning characters. Indeed there are a lot of distinct Chinese surnames written in the Chinese script that become identical after romanization, especially romanization in the West, where tone differences are also ignored.

Wikipedia has a nice list of common Chinese surnames at https://en.wikipedia.org/wiki/List_of_common_Chinese_surname... and one can easily find examples: like 许 and 徐 both become Xu after romanization.


This effect is reduced by the fact that there are multiple romanization systems in use.

Even taking that into account, the diversity is quite low. There are a lot of Chinese characters, but very few of them are actually used as surnames. Fewer than ~500 names are shared by more than 95% of the population, and most current surveys arrive at far fewer than 10,000 in active use. There are two- and three-character names, but because of their low frequency they are even more vulnerable to the various processes that reduce the diversity of family names.

Chinese family names carry great significance. They were often assigned as rewards, and people adopted different ones for various reasons. This makes their distribution and future development subject to more than random chance.


https://www.familysearch.org/wiki/en/Jewish_Personal_Names has more information about compulsory adoption of surnames amongst European Jews for taxation purposes in the 18th century.


I think the assertion was more that that is when everyone was forced to take surnames?


From https://en.wikipedia.org/wiki/Surname#History :

> By 1400, most English and some Scottish people used surnames, but many Scottish and Welsh people did not adopt surnames until the 17th century, or later.

> During the modern era, many cultures around the world adopted family names, particularly for administrative reasons, especially during the age of European expansion and particularly since 1600. Notable examples include the Netherlands (1795–1811), Japan (1870s), Thailand (1920), and Turkey (1934).

So that would put Ashkenazic surnames at healthily older than e.g. Dutch surnames.


Hot take: there is no "bad" data.

It's a term we often hear, that implies there is "good" and "bad" data.

A dataset can have errors in labeling, be very small, be unbalanced, but all that can be managed with the proper methods.

THE biggest problem is when your training data does not correspond to the production use case.

It's not that the dataset is "bad"; it's that the problem your ML algorithm solves when trained on that data does not correspond to the problem you're actually trying to solve.

The most "perfect" ML algorithm trained on the most "perfect" self-driving dataset (for detection, segmentation of objects or whatever) made in the US will have problems when the cars drive in another country. Your MNIST-trained NN will have problems in a country where numbers are written slightly differently. Some people will put pictures of cats into your car-model classification software. Pictures taken on a smartphone by your users will be different from your dataset scraped off the web.

There is no bad data, just badly used data. And most of the work (and the most interesting part IMO) in ML is to identify, quantify and neutralize biases in models and differences between the data you have and the data the production system will work with.
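
Quantifying that train/production mismatch doesn't have to be fancy, either. A minimal sketch, assuming you can pull matching feature samples from both sides (feature names and thresholds are illustrative):

    import numpy as np
    from scipy.stats import ks_2samp

    def drift_report(train, production, alpha=0.01):
        """train/production: dicts mapping feature name -> 1-D numpy array."""
        report = {}
        for name in train:
            res = ks_2samp(train[name], production[name])
            report[name] = {"ks_stat": res.statistic,
                            "p_value": res.pvalue,
                            "drifted": res.pvalue < alpha}
        return report

    rng = np.random.default_rng(1)
    train = {"price": rng.normal(50, 10, 5000)}
    prod = {"price": rng.normal(65, 10, 5000)}   # production distribution shifted
    print(drift_report(train, prod))

A per-feature two-sample test won't catch every interaction effect, but it's often enough to show that the data the model sees in production is not the data it was trained on.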


I've always thought it was very sad and unfortunate that core data classes like sampling design and experimental design have fallen out of academic fashion.


Really depends on the domain and the engine. OpenAI code generation is staggering (https://www.youtube.com/watch?v=SGUCcjHTmGY&t=1214s), its summarization and classification is still very much a work in progress.


I agree with this - when there's an economic incentive to get clean data, you get clean data.

For instance, there's a lot of manual clean-up work put into things like training datasets for speech recognition, because there has been a lot of investment there. Same with self-driving, I assume, because so much $$$ got invested there.

Radiology scans or cough-based COVID detectors or medical claims, on the other hand? I wouldn't expect it. It's just researchers trying to get a quick paper out without adequate funding.


Most ML requires collecting, cleaning, and transforming datasets into something a model can train on for a specific domain. Codex and Copilot aren't good examples of this because they train on terabytes of public code repos, meaning there is no code-cleaning step; they rely on the sheer volume of data being processed to filter the 'unclean' data (think buggy code written by a human) out of the model.

These are really the exception rather than the rule when it comes to collecting data for ML/AI applications.


TLDR: Many ML models in production are terrible because they were trained on terrible data. These bad models are being used in high-stakes situations, such as COVID-19 detection. ML engineers need a professional ethos/regulation, analogous to how civil engineers seeking to build a bridge don't screw around.

My take: Yep, if the model is used in a high-stakes situation, this is absolutely the case. The model should be required to undergo rigorous testing / peer review before it's released into the wild. In high-stakes situations we have to ensure that a model is good before people get their hands on it, because people can reliably be depended on to treat the model as an oracle.

The metaphor of a "crumbling foundation" is a bad one, though. It's just unregulated; models aren't leaning on one another, and there isn't a risk of wholesale collapse.


Seems to me that the foundations are not crumbling, but there should be a way to formally determine how good a model is going to be in the wild before it is used, especially in certain industries. Which I think is where research is focused these days? White-box models, Bayesian methods, etc.?


This seems like a re-hashing of Michael Jordan's essay on the subject: https://medium.com/@mijordan3/artificial-intelligence-the-re...


Discussed here:

Artificial Intelligence – The Revolution Hasn’t Happened Yet (2018) - https://news.ycombinator.com/item?id=25530178 - Dec 2020 (120 comments)

The AI Revolution Hasn’t Happened Yet - https://news.ycombinator.com/item?id=16873778 - April 2018 (161 comments)


There's a nice substack I found that is precisely about this problem and wider variations of it, that is, the problem of figuring out numbers that actually tell you something about the universe:

https://desystemize.substack.com


> In the early 2000s, there was a movement to produce tools and training that would let domain experts produce their own tools – rather than delivering "requirements" to a programmer, a bookstore clerk or nurse or librarian could just make their own tools using Visual Basic.

This is something interesting that I hadn't noticed. "RAD" tools like VB that I remember from when I was a teenager seem to have ceased to exist - replaced with either complex IDEs and languages that almost require a CS degree or with dumbed-down "glue some stuff together" automation like IFTTT/Shortcuts. The last bastion of application development for "normies" is probably Excel?


It goes through cycles. In the 90s, it was called 4GLs ("fourth generation programming language"). Then RAD tools. (Or maybe RAD then 4GL? I don't remember.) The latest evolution of this cycle seems to be low-code/no-code.


It's interesting that tools like VB were derided by many and their practitioners looked down on. Meanwhile they were productive tools; they did live up to their name.


This is why we've been trying to encourage people to think about lightweight data logging as a mitigation for data quality problems. Similar to how we monitor applications with Prometheus, we should approach ML monitoring with the same rigor.

Disclaimer: I'm one of the authors. We've spent a lot of effort building a standard for data logging here: https://github.com/whylabs/whylogs. It's meant to be a lightweight, open standard for collecting statistical signatures of your data without having to run SQL or expensive analysis.
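
For anyone wondering what that looks like in practice, the basic flow is roughly this (API names from memory of whylogs v1 -- check the README for the current calls):

    import pandas as pd
    import whylogs as why

    batch = pd.DataFrame({
        "price": [19.99, 5.00, 250.0],
        "category": ["toys", "toys", "electronics"],
    })

    results = why.log(batch)          # builds a lightweight statistical profile
    profile_view = results.view()
    print(profile_view.to_pandas())   # per-column counts, types, distribution stats

You keep and compare the profiles over time instead of storing or re-querying the raw data, which is what makes it cheap enough to run on every batch.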


Academia is rife with intentionally misleading results and findings that aren't actually innovative or informative. Worse yet, providing source code and a way to repeat the results is rare, and what good is a research paper with results that aren't repeatable? ML has a bright future, but it's very hard for me to take the academic research seriously. It seems like few care about anything beyond making new models and bolstering their resumes by fudging the results.


"Some of the data-cleaning workers are atomized pieceworkers, such as those who work for Amazon's Mechanical Turk, who lack both the context in which the data was gathered and the context for how it will be used."

The knowledge of the goal makes for yet another bias. "Need to know basis" in the intelligence "community" at the same time protects someone and tries to alleviate sources' biases.


Labeled data is just much more expensive than both the computing time and the model builders' time. Only a handful of rich corporations can afford to hire all the human labelers needed -- or use billions of labelers for free, as with Google's captchas. All the other teams are trying to make do with whatever data crumbs are available.


This can explain why pollsters get things wrong so often.



The garbage-in-garbage-out cascading failure generally seems to crash pretty fast. Given that the U.S. is a capitalistic society, the companies and institutions that do this and don't achieve their goals through data science should be apparent, and should then fail accordingly.

Am I missing something here?


The trail of devastation left by this process, in financial and human terms, when medical systems go awry or vendors to state judicial systems wrongly convict innocent people.


I agree about your latter example, but about your first example: isn't it the case that these faulty AI systems for medical diagnosis have been rejected? Doctors don't like them because they don't want to be replaced or one-upped, and because they just don't trust them (rightly so, as it turns out). So the systems, which were put out for use on a trial basis, don't get used.


URL should be changed to https://pluralistic.net/2021/08/19/failure-cascades/ - same content on the author's site, without having to navigate around the Medium paywall.



This hits so many of the classic tricks for speaking to emotion and playing on existing feelings rather than following the data.

It starts off with an appeal to so-called technical debt, a nebulous concept that plays more on debt being bad than it shows anything to actually do.

It then moves to comparing to other engineering, with the implicit idea that they have it together in ways that we don't.

Oh, and I skipped the part about statistical abuse. Because, what? It turns out special cases abound in data-driven efforts. Instead of looking for ways out, we are looking to blame those that tried? That... doesn't seem productive.

I also don't buy some of the argument. Focusing on voter purging as if it were a data science problem seems willfully ignorant. That is a blatant power grab that is just hiding behind data jargon.



