
This is part of a more general problem; they had to rename a gene to stop Excel auto-completing it into a date.

https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Edit: Apparently Excel has its own Norway Problem ... https://answers.microsoft.com/en-us/msoffice/forum/msoffice_...




> This is part of more general problem

The more general problem basically being sentinel values (which these sorts of inferences can be treated as) in stringly-typed contexts: if everything is a string and you match some of those for special consideration, you will eventually match them in a context where that's wholly incorrect, and break something.
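To make the failure mode concrete, here's a minimal Python sketch (a hypothetical inferrer, not PyYAML's or any other library's actual code) of stringly-typed sentinel matching going wrong:

```python
# Everything arrives as a string, and a few "special" strings are
# matched as sentinels for other types -- the pattern behind the
# Norway problem.
def guess_scalar(raw: str):
    if raw.lower() in ("yes", "no", "true", "false", "on", "off"):
        return raw.lower() in ("yes", "true", "on")
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        return raw  # fall back to plain text

# Fine for config flags, catastrophic for country codes:
print(guess_scalar("DE"))  # 'DE'
print(guess_scalar("NO"))  # False -- Norway just disappeared
```

The only robust fix is to stop guessing at this layer and only coerce where a schema says the field really is a boolean.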


edit: fixed formatting problem

> sentinel values

Using in-band signaling always involves the risk of misinterpreting types.

> This is part of more general problem

DWIM ("Do What I Mean") was a terrible way to handle typos and spelling errors when Warren Teitelman tried it at Xerox PARC[1] over 50 years ago. From[2]:

>> In one notorious incident, Warren added a DWIM feature to the command interpreter used at Xerox PARC. One day another hacker there typed

    delete *$
>> to free up some disk space. (The editor there named backup files by appending $ to the original file name, so he was trying to delete any backup files left over from old editing sessions.) It happened that there weren't any editor backup files, so DWIM helpfully reported

    *$ not found, assuming you meant 'delete *'
>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete *$' twice.

Trying to "automagically" interpret or fix input is always a terrible idea because you cannot discover the actual intent of an author from the text they wrote. In literary criticism they call this problem "Death of the Author"[3].

[1] https://en.wikipedia.org/wiki/DWIM

[2] http://www.catb.org/jargon/html/D/DWIM.html

[3] https://tvtropes.org/pmwiki/pmwiki.php/Main/DeathOfTheAuthor


>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete $' twice.

Ironically, this did not render the way you intended because HN interpreted the asterisk as an emphasis marker in this line.

It works here:

    ... type 'delete *$' twice.
because the line is indented and so renders as code, but not here:

> ... type 'delete $' twice.

because the subsequent line has emphasized text*. So the scoping of the asterisks is all screwed up.


Eh. "Death of the Author" is a reaction to the text not being dispositive as to what the author meant. It's deciding you don't care what the author meant, no longer considering it a problem that the text doesn't reveal that. Instead the text means whatever you can argue it means.

Which can be a fun game, but is ultimately pointless.


It gets more complicated when the author themselves changes their mind about that.


That’s a shrewd observation. Static types help with this somewhat. E.g. in Inflex, if I import some CSV and the string “00.10” gets parsed as 0.1, then later when you try to do work on it like

x == “00.10”

You’ll get a type error that x is a decimal and the string literal is a string. So then you know you have to reimport it in the right way. So the type system told you that an assumption was violated.

This won’t always happen, though. E.g. “sort by this field” will happily do a decimal sort instead of a string sort on “00.10”.

The best approach is to ask the user at import time “here is my guess, feel free to correct me”. Excel/Inflex have this opportunity, but YAML doesn’t.

That is, aside from explicit schemas. Mostly, we don’t have a schema.
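For illustration, here's roughly how the silent failure plays out, using Python's Decimal as a stand-in for Inflex's decimal type (Python being dynamic, you get a wrong answer rather than the type error a static checker would raise):

```python
from decimal import Decimal

# If the importer silently guessed "decimal", a later string
# comparison quietly returns False instead of erroring:
x = Decimal("00.10")   # importer's guess
print(x == "00.10")    # False -- not a type error, just silently wrong

# Imported as text per an explicit schema, the round trip stays honest:
y = "00.10"
print(y == "00.10")    # True
```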


If we're talking about general problems, then I don't think we can be satisfied with "sometimes it's a problem with types and sometimes it's a UI bug." That's not general.


> E.g. sort by this field will happily do a decimal sort instead of the string 00.10.

So that system is not consistent with type checking? How is this not considered a bug?


I mean if the value is imported as a decimal, then a sort by that field will sort as decimal. This might not be obvious if a system imports 23.53, 53.98, etc. - a user would think it looks good. It only becomes clear that it was an error to import as a decimal when we consider cases like “00.10”. E.g., package versions: 10.10 is a newer version than 10.1.

Types only help if you pick the right ones.
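The version-number trap is easy to demonstrate in plain Python (illustrative data):

```python
# Version numbers look like decimals but don't order like them.
versions = ["10.10", "10.2", "10.1"]

# Treated as decimals, 10.10 collapses to the same number as 10.1,
# so the newest release sorts below 10.2:
print(sorted(versions, key=float))
# ['10.10', '10.1', '10.2']

# Treated as dotted integer components, the order is correct:
print(sorted(versions, key=lambda v: tuple(map(int, v.split(".")))))
# ['10.1', '10.2', '10.10']
```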


Sure. In most static type systems though, you would be importing the data into structures that you defined, with defined types. So you wouldn’t suddenly get a Decimal in place of a String just because the data was different. You’d get a type error on import.


And of course the plague that is CSV when your decimal delimiter is ,


Basically, autoimmune disease, but for software.


I suppose this is a cliched thought, but the more general problem is kind of emblematic of current "smart" features... and their expected successors.

On one hand, this is a typically human problem. We have a system. It's partly designed, partly evolved^. It's true enough to serve well in the contexts we use it in on most days. There are bugs in places (like Norway, lol) that we didn't think of initially, and haven't encountered often enough to evolve around.

In code, we call it bugs. In bureaucracy, we just call it bureaucracy. Agency A needs institution B's document X, in a way that has bugs.

Obviously, it's also a typical machine problem. @hitchdev wants to tell pyyaml that Norway exists, and pyyaml doesn't understand. A user wants to enter "MARCH1" as text (or the name of a gene), and excel doesn't understand.

Even the most rigid bureaucracy is made of people and has fairly advanced comprehension ability though. If Agency A, institution B or document X are so rigid that "NO" or "MARCH1" break them... it probably means that there's a machine bug behind the human one.

Meanwhile... a human reading this blog (even if they don't program) can understand just fine from context and assumptions of intent.

IDK... maybe I'm losing my edge, but natural language programming is starting to seem like a possibility to me.

^I feel like we need a new word for these: versioned, maybe?


"The computer won't let me" is a particularly maddening "excuse" from bureaucrats...


I don't understand why those support agents for Microsoft just threw their hands up in the air and asked customers to go through some special process for reporting the bug in Excel. Why are they not empowered/able to report the issue on behalf of customers? It's so clearly a bug in Excel that even they are able to reproduce it with 100% reliability.


It looks like it is intended behavior in Excel.


Yes. Excel cells are set to a "General" format that, by default, tries to guess the type of data the cell should contain from its content. A date-looking entry gets converted to a date type, a number-looking string to a number (so 5.80 --> 5.8, very annoying since I believe in significant digits). When you import CSV data, for example, the default import format is "General", so date-looking strings will be changed to a date format. This can be avoided by importing the file and choosing to import the data as "Text". People having these data corruption problems forgot to do that.

It's "user error" except that there is no way to set the default import to import as "Text" (as far as I know), so one has to remember to do the three step "Text" import every time instead of the default one step "General" import.
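By contrast, a reader that never guesses has no such trap. A small Python sketch (illustrative file contents) showing the csv module returning every field as text:

```python
import csv
import io

# Python's csv module never infers types: every field comes back
# as a string, so "MARCH1", "NO" and "5.80" survive untouched.
data = io.StringIO("gene,code,price\nMARCH1,NO,5.80\n")
rows = list(csv.DictReader(data))
print(rows[0])  # {'gene': 'MARCH1', 'code': 'NO', 'price': '5.80'}
```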


Excel doesn't support CSV files; anyone who believes it does has never really used Excel. [0] You're supposed to work in spreadsheet files directly. Programs that have Excel export features should always export xlsx files directly.

[0] The only thing you can safely do with CSV files is to interpret every value as a text cell. CSV files always require out-of-band negotiation on everything: delimiters, quotation, escape characters, the data type of each column.


However....

Users BELIEVE Excel supports CSV files. That's the reality on the ground. Fighting against that is a losing battle.


I'd say the more general problem is a bad type system! In any language with a half decent type system where you can define `type country = Argentina | ... | Zambia` this would be correctly handled at compile-time, instead of having strange dynamic weak typing rules (?) which throw runtime errors in production (???).
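For illustration, a rough Python analogue of that sum type (Python checks at runtime rather than at compile time, and the country list here is obviously abbreviated):

```python
from enum import Enum

# Parse the code into a closed enum once, at the boundary, so "NO"
# can only ever mean Norway downstream -- never a boolean.
class Country(Enum):
    ARGENTINA = "AR"
    NORWAY = "NO"
    ZAMBIA = "ZM"

def parse_country(code: str) -> Country:
    return Country(code)  # raises ValueError for unknown codes

print(parse_country("NO"))  # Country.NORWAY
```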


I would like to see how your solution handles the case of new countries or countries changing name. Recompile and push an update? If the environment is governmental this can take a very very very long time.

The proper solution, in my opinion, is a lookup table stored in the database. It can be updated, it can be cached, it can be extended.

And for transfer of data, use formats to which you can attach a schema. This way type data is not lost on export. XML did this but everyone hates XML. And everyone hates XSD (the schema format) even more. However, if you use the proper tools with it, it is just wonderful.


An even more general problem is that we as humans use pattern-matching as a cerebral tool to navigate our environment, and sometimes the patterns aren't what they appear to be. The Norway problem is the programming equivalent of an optical illusion.


Good language design involves deliberately adding redundancy which acts like a parity bit in that errors are more likely to be detected.


That's an interesting statement to apply to natural languages.

Consider this headline in English: "Man attacks boy with knife". This can be read two ways, either the man is using a knife to attack the boy, or the boy had the knife and thus was being attacked.

The same sentence in Polish would make use of either genitive or instrumental case to disambiguate (although barely). However, a naive translation would only differ in the placement of a `z` (with) and so errors could still slip through. At least in this case the error would not introduce ambiguity, simply incorrectness.

Similar to language design we can also consider: does the inclusion/requirement of parity features reduce the expressivity of the language?


> does the inclusion/requirement of parity features reduce the expressivity of the language?

This was a real eye-opener for me when learning Latin in school: stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.


> stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.

Eh.... some things are easy and some things are hard in any language. The specifics differ, and so do the details of what kinds of things you're looking for in poetry. Traditional Germanic verse focuses on alliteration. Modern English verse focuses on rhyme. Latin verse focuses on neither. [1]

English divides poetically strong syllables from poetically weak syllables according to stress. It also has mechanisms for promoting weak syllables to strong ones if they're surrounded by other weak syllables.

In contrast, Latin divides strong syllables from weak syllables by length. Stress is irrelevant. But while stress can be changed easily, you're much more restricted when it comes to syllable length -- and so Publius Ovidius Naso is invariably referred to by cognomen in verse, because it isn't possible to fit his nomen, Ovidius, into a Latin metrical scheme. That's not a problem English has.

[1] I am aware of one exceptional Latin verse:

> O Tite, tute, Tati, tibi tanta, tyranne, tulisti.


The real problem here is that people use Excel to maintain data. Excel is terrible at that. But the fact that it may change data without the user being aware of it, is absolutely the biggest failing here.


The problem is more that it's insanely overpowered while aiming for convenience out of the box. An "Excel Pro" version which takes away all the convenience and gives users the power to configure the tool pinpointed to their task might be a better solution. The funny part is, most of those things are already configurable now, but users are not educated enough about their tools to actually do it.


Excel allows people to maintain data all over the place, from golf league data to comparing actual job costs against estimates, and so much more. And Excel is accessible enough that tens of millions (or maybe more) of people do it.


The one I’ve seen was a client who wanted to store credit card numbers in an Excel sheet (yes, I know this is a bad idea, but it was 15 years ago and they were a scummy debt collection call center). Excel stores numbers as floating point with about 15 significant digits, which a 16-digit credit card number exceeds, so the last digit gets silently corrupted.

Now, you and I know this problem is solved by prepending ‘ to the number and it will be treated as a string, but your average Excel user has no understanding of types or why they might matter. Many engineers will also look past this when generating Excel reports.


And CUSIPs, which are strings, get converted to scientific notation.

https://social.msdn.microsoft.com/Forums/vstudio/en-US/92e0a...


Easiest solution is just to rename Norway.


"Renaming it to Xorway resulted in untold damages from computer bugs..." - Narrator


Norway Orway Xorway Nandway Andway

Yes, yes, I see... This could be problematic, indeed. If only there were a logical solution.


So basically they renamed a gene because they had employees who were too stupid to use excel?


Regarding Excel: It also happens with Somalia, which makes this issue even stranger. Apparently because of "SOM".


There’s a really simple solution to this problem, which has been around since the 70’s: schemas.


> they had to rename a gene to stop excel auto-completing it into a date.

No one in their right mind uses a spreadsheet for data analysis: good for working out your ideas, but not in a production environment. I figure Excel was chosen as it was the utility the scientists were most familiar with.

The proper tool for the job would be a database. I recall reading about a utility, a highly customized database with an interface that looks just like a spreadsheet.


The analysis itself isn’t (usually) happening in Excel.

A lot of tools operate on CSV files. People use Excel to peek at the results or prepare input for other tools, and that’s how the date coercion slips in.

Sometimes, people do use it to collate the results of small manual experiments, where a database might be overkill. Even so, the data is usually analyzed elsewhere (R, GraphPad, etc.).


>A lot of tools operate on CSV files.

The mistake was to believe that Excel can operate on CSV files. It doesn't support them in any meaningful way. It supports them in a "I can sort of pretend that I support CSV files" way.


What is a good alternative to Excel for working with CSV files? Excel sure isn't ideal, but it's always there as part of the MS Office suite, so I've never looked for anything else.


And yet, we are still being taught to use an Excel (2003) spreadsheet for data analysis... (Because that's what most businesses are still using !)


> they had to rename a gene to stop excel auto-completing

I can just about understand that "No" might cause a problem, but “Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.


>, but “Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.

No, that's not what's happening. To clarify...

If you type the 41-character string "Membrane Associated Ring-CH-Type Finger 1" into a cell, Excel will not convert it to a date of MAR-1.

On the other hand, if you type the 6-character abbreviation "MARCH1", which looks like a realistic date, Excel converts it to MAR-1.



