GPT-3 implicitly favors text written by authors from powerful social positions (arxiv.org)
34 points by amrrs on Jan 26, 2022 | 9 comments



That's a thorny one. I had a knee-jerk reaction, but upon reflection they're right about something, if perhaps for different reasons. The crux of it is that all great writing is really great editing. So-called "privileged" writing in journals, newspapers, and paragons of English style like the Economist and the FT, along with literary fiction or even crappy genre fiction, will have had the benefit of an editor.

Their argument seems predicated on the idea that either the author is the only writer and the text leapt from his head fully formed like Athena (like these comments of mine, surely), or their entire sample set of student newspapers all had equally competent sub-editors. I'd say their argument that privileging any text is a "language ideology" is weak, because the perceived quality of the writing should be attributed to the additional work that went into its editing, whereas they're saying it's due to the author's social status based on zip code. Chances are, the smaller school papers are just some yahoo publishing their own copy.

Too many holes. It seems to just elevate the same critical theory as a pretext for asserting a qualification to govern GPT model training, using the same problematizations it uses on everything else (e.g. call it x-ist until you control it, invent an unsolvable problem only you can manage, dilute and destabilize consensus with exogenous concerns, etc.). I'd agree a lot of good stuff is probably not making it into language models because it's not edited (or, ironically, not gatekept), but I'm not sure the authors are really sincere about improving language models. To me, they're using a very narrow interpretation of quality writing to assert that GPT models require governance and political accountability.


“We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.”

So better grammar and use of language?


Right, invoking "ideology" seems a bit unnecessary and politically motivated. "More highly educated people produce better content." In other news, poor neighborhoods have more violent crime than rich neighborhoods...

> We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim.

So the filter can't check whether something is scientifically accurate or artistically appealing. That makes sense. Of course it can't.


I’m kind of confused by the literary acclaim bit. If wealthy populations really do write better, why isn't the best-written content of all, the pinnacle of literary acclaim (often acclaimed by that same wealthy population), within this cohort?

Intuitively this seems non obvious to me. What am I missing?


It's just measuring adherence to a particular standard of written English. To use it to determine whether one work is "better" or not, you need to make a value judgement about which written standard or dialect you want to consider the target and compare against.

Choosing that target is the ideology mentioned in an earlier comment: holding one dialect as superior or more fit to a purpose is ideological in a pure technical sense.

You could also choose a different standard and get different results; for example, imagine telling it that this is the goal for "best": https://www.bbc.com/pidgin

Anyway, since the paper mentioned using Wikipedia as a source of "good" English, it makes sense to me to intuit it that way. Prose on Wikipedia is detailed and held to a very specific standard, which is legible to the filter. Prose on Wikipedia is also not usually going to win literary awards, though, since that depends on more than careful adherence to a standard.
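To make that concrete, here's a rough sketch of the kind of filter being discussed: a linear classifier that scores how much a document resembles a chosen "reference" corpus (as I recall, the GPT-3 paper describes something in this spirit, a classifier trained to distinguish curated text from raw Common Crawl). The toy corpora and threshold below are invented purely for illustration; this is not the actual OpenAI filter.

  # Minimal sketch of a corpus "quality" filter, not the real GPT-3 pipeline.
  # The positive/negative example texts and the cutoff are made up.
  from sklearn.feature_extraction.text import HashingVectorizer
  from sklearn.linear_model import LogisticRegression

  positives = [  # stand-in for curated "reference" prose (e.g. Wikipedia-like)
      "The committee reviewed the findings and published a detailed report.",
      "Researchers at the university released the dataset under an open license.",
  ]
  negatives = [  # stand-in for raw, unfiltered web text
      "click here 2 win FREE $$$ limited time offer!!!",
      "lol idk it was like sooo random u had 2 b there",
  ]

  vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
  X = vec.transform(positives + negatives)
  y = [1] * len(positives) + [0] * len(negatives)
  clf = LogisticRegression(max_iter=1000).fit(X, y)

  def quality_score(doc: str) -> float:
      """Probability that a document 'looks like' the chosen reference corpus."""
      return clf.predict_proba(vec.transform([doc]))[0, 1]

  # Documents scoring below some threshold would be dropped from the training set.
  keep = quality_score("Some candidate web page text.") > 0.5

The point the thread is making falls straight out of the sketch: whatever you put in `positives` defines "quality", and that choice is exactly the value judgement (or "ideology") being debated.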


"Better" is just presupposing the conclusion: that there is something inherently superior about certain styles. That thinking is precisely the "language ideology" the authors are concerned about!

Better for who? For what purpose?


Writing shaped by dialect variation, English-as-a-second-language speakers, etc. may be classified as better or worse (though this is controversial, and I'm not entirely convinced in the dialect variation case), but that doesn't mean those writers don't have something important to say. Some of the most productive conversations I've had have been with people with far less mastery of English than native speakers, even people whose grammar and spelling are poor.


"A well-funded publication, reviewed by multiple paid teachers or admins and backed by money to invest in good software, has higher quality, and GPT detects that, therefore GPT is full of bias" seems less about model bias and more about societal failings.


But we want to blame the black box, not our own enlightened view of what's "good"

/s Sarcasm aside, I think GPT "did well" here in terms of picking up an average of what society deems good. This is not something comfortable, but I also don't think it is inaccurate. Hopefully more of these AI-enabled "revelations" (which back what some critical theorists have been saying for decades) will help us unpack and understand the collection of biases we each hold. Yes, the failings are societal; can it be a point of reflection? Or do we keep refiguring the model to obscure the issue?



