
A File Format Led to a Crossword Scandal (2019) [video] - luu
https://www.youtube.com/watch?v=9aHfK8EUIzg
======
rabidrat
Hi, this is Saul. If you like this kind of simplicity-first data-exploration
approach, you might want to check out another project of mine, VisiData
(visidata.org). It's specifically for lightweight data exploration and
analysis and it runs directly in the terminal.

~~~
manjalyc
Hey, just wanted to say that this looks like a really cool and useful project.
I work with a few medical databases and sometimes I just need a very quick
breakdown of specific data and while I usually just write a short script, the
utility and portability of this code look fantastic to me. Which brings me to
a question: how well does this program handle moderately large databases
(~100GB-1TB) in your (or anyone else's) experience? In other words, does it
try to load everything into memory, or does it query as needed when given a
database?

~~~
rabidrat
It loads everything into memory, so it works well with <1GB datasets. The
architecture could be changed to allow for larger datasets like yours, but
that would likely be a large undertaking and would probably be a paid feature.

------
thaumasiotes
> Despite Parker’s denial, many in the crossword world see willful plagiarism
> in Parker’s puzzles, and they see the database that revealed the repetition
> as a tool of justice. “It’s like a murder mystery solved 50 years later with
> DNA evidence,” Matt Gaffney, a professional crossword constructor, told me.

There's a response to postmodernism that says "reality is that which, when you
ignore it, doesn't go away".

I have a hard time seeing this as a "scandal"; it's firmly in the class of
things that aren't a problem unless someone tells you you have a problem. A
murder has victims. But if you're unhappy that the crossword puzzle you solved
last week was secretly a rerun from 10 years ago, it's not obvious whether the
blame for your unhappiness should go to the guy who reran the puzzle, or the
guy who told you it was a rerun.

~~~
mkl
I don't think it's the rerunning that's the problem, but the misattribution,
claiming others' work as their own or denying them credit (and presumably
royalties).

~~~
thaumasiotes
If the originality of the crossword has no value, why would it matter whether
someone's claim that it is original is true or false? The most logical basis
for attributing value to the claim of originality is that there is value in
the originality that bleeds through to the claim.

Compare e.g. someone being jailed for resisting arrest when there was no
justification for arresting him in the first place.

~~~
TheRealPomax
The law tends to disagree: crossword puzzles are copyrightable material just
like any other published text is, so their value comes from the material that
they help sell, whether that's a newspaper, or a crossword puzzle book, or a
website, or any other published, in the legal sense, work.

~~~
thaumasiotes
But misattribution is not a problem at all in that analysis. It's just as
illegal to violate a copyright with proper attribution as it is if I claim the
work is my own.

The law doesn't care whether you claim a copyrighted work is yours or not. It
cares whether, if you copy a copyrighted work, you have the license to do so.

------
tzs
Here's the FiveThirtyEight article about this mentioned a few times in the
video [1].

[1] [https://fivethirtyeight.com/features/a-plagiarism-scandal-is-unfolding-in-the-crossword-world/](https://fivethirtyeight.com/features/a-plagiarism-scandal-is-unfolding-in-the-crossword-world/)

~~~
ireflect
There's a footnote about Saul's interesting name, which leads to:

      Pawanson is a bit quirky — his unusual
      last name is the product of a decision
      he made years ago.
      
      "I was born Paul Swanson," he said.
      "But I thought, 'there are lots of Paul
      Swansons out there.' So I changed it."
    

Amazing!

~~~
servercobra
As someone with a name completely unique in the history of the world (so far
as I have found), there are certain advantages! I wouldn't be surprised if
people do this more often in the future. It is pretty nice that if you Google
my name, you only get results about me.

~~~
matt-attack
Interesting. I relish the fact that when you google my name you get pages and
pages of a semi-famous figure that, honestly, most people haven't heard of.

I cherish the anonymity.

~~~
busyant
I have a teacher friend named Mike Pence.

He tells me that it's impossible for students to snoop on him because he is
"google-proofed."

------
abalaji
This is an awesome story. I especially like how the speaker organizes the
narrative, taking us along for the ride. Maybe this will be the push I need to
do a better job learning Unix tools.

~~~
colmvp
The delivery was very engaging and a good example for other engineers of how
crafting a compelling narrative can help drive home the importance of your
work.

------
smitty1e
> "It's not hoarding if it's organized."

Oh, that's getting thugged.

~~~
TheRealPomax
If it's organized, now it's archiving.

------
Erwin
The author's biography is quite fascinating. If there's a museum of
visualization, it'd be an exhibit:
[https://www.saul.pw/biograph/](https://www.saul.pw/biograph/)

------
paxys
It's interesting that while sophisticated plagiarism-detection software is
commonplace for writing submissions at newspapers, book publishers,
universities, etc., they don't bother doing the same with crosswords (and
probably other puzzles like Sudoku).

------
xmprt
I didn't realize Unix tools were this powerful. That's an amazing story.

~~~
smabie
I think the point is that they're not, usually, this powerful. Saul made a
deliberate choice to create a file format that would be extremely amenable to
these tools.

------
ericsoderstrom
Were the misattributed authors aware that they were being given credit for
puzzles they didn't write? I'm assuming they must have been.

~~~
rabidrat
The misattributed author names are fake, as Timothy Parker himself admitted.
"Tim Burr" is one mentioned in the talk.

------
rafaelturk
Data is beautiful

