A File Format Led to a Crossword Scandal (2019) [video] (youtube.com)
171 points by luu 4 months ago | 39 comments

Hi, this is Saul. If you like this kind of simplicity-first data-exploration approach, you might want to check out another project of mine, VisiData (visidata.org). It's specifically for lightweight data exploration and analysis and it runs directly in the terminal.

Good job selling yourself this time!

Hey, just wanted to say that this looks like a really cool and useful project. I work with a few medical databases and sometimes I just need a very quick breakdown of specific data, and while I usually just write a short script, the utility and portability of this code looks fantastic to me. Which brings me to a question: how well does this program handle moderately large databases (~100GB-1TB) in your (or anyone else's) experience? In other words, does it try to load everything into memory, or does it query as needed when given a database?

It loads everything into memory, so it works well with <1GB datasets. The architecture could be changed to allow for larger datasets like yours, but that would likely be a large undertaking and would probably be a paid feature.

Hey Saul, your talk was great and engaging! Great work!

> Despite Parker’s denial, many in the crossword world see willful plagiarism in Parker’s puzzles, and they see the database that revealed the repetition as a tool of justice. “It’s like a murder mystery solved 50 years later with DNA evidence,” Matt Gaffney, a professional crossword constructor, told me.

There's a response to postmodernism that says "reality is that which, when you ignore it, doesn't go away".

I have a hard time seeing this as a "scandal"; it's firmly in the class of things that aren't a problem unless someone tells you you have a problem. A murder has victims. But if you're unhappy that the crossword puzzle you solved last week was secretly a rerun from 10 years ago, it's not obvious whether the blame for your unhappiness should go to the guy who reran the puzzle, or the guy who told you it was a rerun.

I don't think it's the rerunning that's the problem, but the misattribution, claiming others' work as their own or denying them credit (and presumably royalties).

If the originality of the crossword has no value, why would it matter whether someone's claim that it is original is true or false? The most logical basis for attributing value to the claim of originality is that there is value in the originality that bleeds through to the claim.

Compare e.g. someone being jailed for resisting arrest when there was no justification for arresting him in the first place.

The law tends to disagree: crossword puzzles are copyrightable material just like any other published text is, so their value comes from the material that they help sell, whether that's a newspaper, or a crossword puzzle book, or a website, or any other published, in the legal sense, work.

But misattribution is not a problem at all in that analysis. It's just as illegal to violate a copyright with proper attribution as it is if I claim the work is my own.

The law doesn't care whether you claim a copyrighted work is yours or not. It cares whether, if you copy a copyrighted work, you have the license to do so.

Sorry... what? Why would you say the originality of the crossword has no value? And what on earth does that have to do with resisting arrest?

I wouldn't really see the players as victims, but A) crossword constructors are potentially having their work ripped off and/or receiving less work and B) Uclick/USA Today are paying someone to do something when they could have just rerun old puzzles and got a similar result. A comparison to a murder investigation is maybe a bit much, but I can see where people are coming from.

It should not be summed up as a comparison to a murder investigation, but rather as a comparison to DNA evidence.

> "reality is that which, when you ignore it, doesn't go away".

Reality is that which, when you stop believing in it, doesn't go away. Philip K. Dick, I Hope I Shall Arrive Soon

If you’re unhappy after someone told you you worked a rerun crossword puzzle, maybe blame yourself? The only thing changing is your interpretation of your own experiences.

I was confused by this response, because it appears to just repeat the content of my original comment.

Now I'm more confused that my comment was upvoted and this was downvoted.

I'm guessing it's because you said it in an indirect way, and I said it directly. And people don't like being told that their gut feelings and outrage are only in their own head. I'm never really sure which is the right way to approach people -- the indirect approach goes over some people's heads sometimes (like me, a little this time) but the direct approach often gets outright rejected from confirmation bias. Teaching is hard, man.

Here's the FiveThirtyEight article about this mentioned a few times in the video [1].

[1] https://fivethirtyeight.com/features/a-plagiarism-scandal-is...

There's a footnote about Saul's interesting name, which leads to:

  Pawanson is a bit quirky — his unusual
  last name is the product of a decision
  he made years ago.
  "I was born Paul Swanson," he said.
  "But I thought, 'there are lots of Paul
  Swansons out there.' So I changed it."

As someone with a name completely unique in the history of the world (so far as I have found), there are certain advantages! I wouldn't be surprised if people do this more often in the future. It is pretty nice that if you Google my name, you only get results about me.

Interesting. I relish the fact that when you google my name you get pages and pages of a semi-famous figure that, honestly, most people haven't heard of.

I cherish the anonymity.

I have a teacher friend named Mike Pence.

He tells me that it's impossible for students to snoop on him because he is "google-proofed."

Nice! I too was intrigued by his unusual name, and went on a quest to see if he had it changed from the more common "Paul Swanson".

It's a very interesting choice to just swap the letters like that instead of going for a completely different name.

It would be funny if his name gets included in a crossword puzzle, and people second guess the spelling.

I don't know anything about him or his decision to change from Paul to Saul, but Paul/Saul is one of the most important Christian apostles. As both Jew and Roman citizen, his Jewish name was Saul (from the Jewish king Saul in the Old Testament maybe?) and his Roman name was Paul. So just changing the first letter might not be completely random. :-)


The question that directly pops up is why not Pwanson?

Actually it is Pwanson, indeed. A couple slides in the video confirm it.

Pawanson is a typo.

Saul Pawnson would be a cute hacker alias, however.

This is an awesome story. I especially like the speaker's organization of the narrative, taking us along for the ride. Maybe this will be the push I need to do a better job learning Unix tools.

The delivery was very engaging and a good example to other engineers on how crafting a compelling narrative can help drive home the importance of your work.

> "It's not hoarding if it's organized."

Oh, that's getting thugged.

If it's organized, now it's archiving.

The author's biography is quite fascinating. If there's a museum of visualization, it'd be an exhibit: https://www.saul.pw/biograph/

It's interesting that while sophisticated plagiarism-detection software is commonplace for writing submissions at newspapers, book publishers, universities, etc., they don't bother doing the same with crosswords (and probably other puzzles like Sudoku).

I didn't realize unix tools were this powerful. That's an amazing story.

I think the point is that they're not, usually, this powerful. Saul made a deliberate choice to create a file format that would be extremely amenable to these tools.

Yeah, Unix utilities and the whole text processing paradigm can do some amazing things if you know how to design for it. I’ve been doing some CloudFormation work recently, and it’s so easy just to throw together dashboards to watch the progress and outcome of a deploy.

Were the misattributed authors aware that they were being given credit for puzzles they didn't write? I'm assuming they must have been.

The misattributed authors are fake names, admitted by Timothy Parker himself. "Tim Burr" is one mentioned in the talk.

Data is beautiful
