
The Extraordinary Growing Impact of the History of Science - joubert
https://medium.com/the-physics-arxiv-blog/the-extraordinary-growing-impact-of-the-history-of-science-642022a39d67
======
dalke
This came out last year, and was discussed on HN.

One of the things I said then, which I'll repeat now, concerns these lines:

> But many of the subject areas that experienced a decline in older citations
> were part of two broader areas: chemical and materials sciences, and
> engineering.

> Consequently, these broad disciplines show almost no increase in old
> citations. Just why citation trends should differ in these disciplines isn’t
> clear.

If you look at the early history of information retrieval you'll see that
chemical documentation was tightly connected with it. Several of the IR
pioneers (Luhn, Mooers, Taube) worked with the American Chemical Society or
presented at ACS conferences. I like to point out that "information retrieval"
as a term was first presented at an ACS meeting.

By the 1960s, CAS (the Chemical Abstracts Service of the ACS), the Institute
for Scientific Information, and other organizations provided pretty complete
indexing of the chemical literature.

While it's improved since then, I believe the older literature for chemistry
has been searchable longer than for other fields, in large part because the
chemical structure (usually) gives a uniquely indexable key for the
literature.

------
shalmanese
This is an essay on the ease of accessing historical archives of science, not
on the discipline of studying science through a historical lens. Those are two
very different things that scientists routinely conflate.

~~~
agumonkey
Similarly, I heard about a historian with a strongly scientific view of
history; he even coined a name for his field, but unfortunately I can't
remember it for the life of me.

~~~
benbreen
Cliometrics? At any rate, this is my favorite take on the concept of
scientific history, courtesy of Isaiah Berlin:

[http://berlin.wolf.ox.ac.uk/published_works/cc/scihist.pdf](http://berlin.wolf.ox.ac.uk/published_works/cc/scihist.pdf)

The parent comment makes a good point here. Although I did appreciate the nod
at the end to the larger archive of historical sources that digitization has
"presentized" or whatever we want to call the process discussed here. I just
wish there were more dialogue between practicing scientists and historians of
science - in my experience there is virtually no professional interaction.

~~~
dalke
I really enjoyed [http://cacm.acm.org/magazines/2015/1/181633-the-tears-of-
don...](http://cacm.acm.org/magazines/2015/1/181633-the-tears-of-donald-
knuth/fulltext) ("The Tears of Donald Knuth"), which describes Knuth's reasons
to study the history of a field, and also describes some of the disciplinary
and institutional difficulties involved.

~~~
agumonkey
I'm fascinated by the drift of teaching and its limited span: how in 20 years
people will have forgotten about fringe or non-mainstream ideas and will
probably rediscover them, thinking they're something new to dive into.

~~~
dalke
Here's one I've been researching for the last year or so (of on-and-mostly-off
research) - superimposed coding.

Remember about 5-10 years ago, when Bloom filters popped up often on HN and
other tech sites? They are one type of superimposed coding. The basic idea is
that a given word or descriptor gets $k$ bits out of $N$ in a bitset. In a
Bloom filter, $k$ is constant, but in the general case it can vary.
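
To make that concrete, here's a minimal Python sketch of a superimposed code
over an $N$-bit set where each descriptor can contribute a different number of
bits. The hash-based bit selection and the per-descriptor $k$ values are my own
illustrative choices, not Mooers' actual codes:

    import hashlib

    N = 256  # bits in the superimposed code (illustrative size)

    def descriptor_bits(descriptor, k):
        # The k bit positions assigned to one descriptor. A Bloom filter
        # uses the same k everywhere; a general superimposed code lets
        # k vary per descriptor.
        positions, counter = set(), 0
        while len(positions) < k:
            h = hashlib.sha256(f"{descriptor}:{counter}".encode()).digest()
            positions.add(int.from_bytes(h[:4], "big") % N)
            counter += 1
        return positions

    def encode(descriptors_with_k):
        # Superimpose (OR together) the bits of all descriptors on one record.
        code = 0
        for descriptor, k in descriptors_with_k:
            for pos in descriptor_bits(descriptor, k):
                code |= 1 << pos
        return code

    def might_contain(code, descriptor, k):
        # True if all of the descriptor's bits are set; may be a false drop.
        return all(code >> pos & 1 for pos in descriptor_bits(descriptor, k))

    card = encode([("contains carbon", 1), ("contains nitrogen", 3), ("aromatic ring", 4)])
    print(might_contain(card, "contains nitrogen", 3))  # True
    print(might_contain(card, "contains sulfur", 3))    # almost certainly False

With a constant $k$ this degenerates into an ordinary Bloom filter; a query
where unrelated descriptors happen to cover all of another descriptor's bits
is a "false drop".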

In the 1940s, Calvin Mooers worked out one system of choosing $k$ based on the
information content of the descriptors. He called it Zatocoding. He applied
this to selecting punched cards, including punched cards with chemical data;
that being the field I work in. Knuth has a bit about the approach in The Art
of Computer Programming, v 3, with "false drop cookies."

This technique was never mainstream, in part I think because one has to trust
in random numbers, but also in practice it was applied to manual punched card
searches, and it's difficult to memorize random punch codes.

As computers became machines instead of people, this approach disappeared.
There's a chemistry paper in about 1975 which considers it, with the phrases
"old technique" and "this may appear to be a throwback", then a paper in 1976
which gives a much needed improvement for how to apply Zatocoding to
hierarchical data.

I think that 1976 paper is amazing. It helped me understand the problem. But
it was never part of mainstream understanding. That is to say, very few modern
practitioners know of the concept, it's not in any of the textbooks (which
give only shallow coverage of the topic), and the few papers that have
referenced the 1976 article don't seem to have understood it, but cite it for
completeness' sake.

That said, it doesn't solve the problem. An alternate approach was developed
in the late 1980s that's more directly related to the Bloom filter sense of
using $k$ hash functions. As a result, it's detuned for the specific
information in the data set, but it's able to reject new descriptors that
weren't part of the training set.

Based on the documentation for the method (it was never published in the
scientific literature; everyone references the manual page), it comes across
like a Bloom filter. However, in talking with people in the know, I learned a
specific detail the documentation doesn't make clear: $k$ is a function of the
size of the chemical feature used as the descriptor. This makes it a proper
superimposed coding system.

No one has looked at this information-theoretical approach in 25 years. Almost
no one knows that this can even be a useful approach.

~~~
toddkaufmann
Very interesting. So Zatocoding [1] allows a card to have multiple index
entries, and then any of those can be used to retrieve it.

For completeness, the 1951 paper is here [1]. Apparently it has been covered
in undergraduate algorithms at U. of I. [2] (alongside Bloom filters), so it
may be getting more exposure.

Is the "needed improvement" of the 1976 paper due to better methods available
(e.g. proof rigor, or understanding of sorting algorithms) or a better
explanation of the methods (perhaps because of more widespread knowledge of
information theory, and a better defined terminology).

I thought edge-notched cards [3] had been used for a long time (and they have:
since 1896). I remember reading about them being used for fingerprint card
retrieval (indexed by feature, on each finger). That didn't use superimposed
coding, but instead was a type of content-addressable memory: the time to
retrieve all cards with, e.g., a whorl on the thumb is O(1).

Apparently the cards are still used in some places; see Kevin Kelly's site [4]
for some images and interesting comments.

Finally found a fingerprint filing reference: "For example, as early as 1934
the FBI tried a punchcard and sorting system for searching fingerprints, but
the technology at that time could not handle the large number of records in
the Ident files." [5]

    
    
        1. https://courses.engr.illinois.edu/cs473/fa2013/misc/zatocoding.pdf
           (or buy from Wiley for $38)
        2. https://courses.engr.illinois.edu/cs473/fa2013/lectures.html
        3. http://en.wikipedia.org/wiki/Edge-notched_card
        4. http://kk.org/thetechnium/one-dead-media/
        5. Chapter 3: Evolution to Computerized Criminal History Records
           https://www.princeton.edu/~ota/disk3/1982/8203/820306.PDF

~~~
dalke
Mooers' original work was presented at the ACS in September 1947. Knuth
cites that, as well as the Am. Doc. (1951) citation you gave, in TAOCP v3
p571. His 1948 Master's thesis from MIT, which goes through the derivation, is
at
[http://dspace.mit.edu/handle/1721.1/12664](http://dspace.mit.edu/handle/1721.1/12664)
. This is why I say it's from the 1940s, not 1950s. I also think the chapter
"Mathematical Analysis of Coding Systems" by Carl Wise, from "Punched cards;
their applications to science and industry" (2nd ed. 1958), at
[http://babel.hathitrust.org/cgi/pt?id=uc1.b3958636;view=1up;...](http://babel.hathitrust.org/cgi/pt?id=uc1.b3958636;view=1up;seq=456)
gives an excellent treatment of the topic.

Nice find with the U. of I. course. Interestingly, if you listen to the
presentation, the speaker says the topic won't be on the test, perhaps because
the rigor behind the method isn't as good as desired for the class, but it's
something the students ought to know exists. Plus the speaker "just found out
about" the topic (at 08:19), cites a 1950s date (like you), and describes it as
using a fixed number of bits per category ... which is why he says that
Zatocoding "reappears" years later as a Bloom filter. They are both
superimposed codes, but not the same thing.

I find the mention of the two problems with superimposed codes interesting:
1) researchers don't like false drops and instead expect perfect matches (in
Mooers' chapter in 'Punched Cards' he spins them as serendipitous matches;
Knuth does something similar in TAOCP with 'false drop cookies'), and 2)
librarians love to make hierarchical categorizations, which aren't needed with
Zatocoding.

(BTW, from what I read, American Documentation was _the_ journal for
information theory in the 1950s and covered many topics. I interpret Mooers'
paper more as an advertisement to a wider audience, because it doesn't go into
the technical details of his method. He was trying to drum up work for his
consulting business.)

The "needed improvement" comes with chemical descriptors. Suppose one of your
descriptors is "contains a carbon", another is "contains 3 carbons in a row"
and a third is "contain 6 carbons in a row". I'll write this as 1C, 3C, and
6C. Zatocoding treats all descriptors as independent, though there's a paper
where he describes a correction for when there are correlations. But in this
case whenever 6C exists then both 3C and 1C also exist. These are hierarchical
descriptors.

I was wrong about the date. The improvement paper is Feldman and Hodes, JCICS
1975, 15 (3), pp. 147-152, not 1976. In the original Zatocoding, the number of
bits $k$ is given as -log2(descriptor frequency). In Feldman and Hodes, $k$
for a root descriptor is given the same way, but $k$ for a child descriptor is
given as log2(parent frequency / child frequency). It's possible for a
fragment to have multiple parents (consider that "CC" and "CO" are both
parents of "CCO"). In that case, use the least frequent parent.

In addition, ensure that the bits selected for the child do not overlap with
the bits set for the parent.

This ends up giving a first-order correction to Zatocoding for hierarchical
data.
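
A rough Python sketch of that bit-budget rule, with made-up fragment
frequencies and without the extra constraint that a child's bits avoid its
parent's bits (see the Feldman and Hodes paper for the full scheme):

    import math

    # Hypothetical fraction of records containing each fragment.
    freq = {"C": 0.95, "CC": 0.80, "CO": 0.40, "CCO": 0.25}
    # Hierarchy: each child fragment lists its parents.
    parents = {"CC": ["C"], "CO": ["C"], "CCO": ["CC", "CO"]}

    def bits_for(fragment):
        # Root fragments get k = -log2(freq); a child gets
        # k = log2(parent freq / child freq), using its least frequent parent.
        if fragment not in parents:
            k = -math.log2(freq[fragment])
        else:
            least_frequent_parent = min(parents[fragment], key=lambda p: freq[p])
            k = math.log2(freq[least_frequent_parent] / freq[fragment])
        return max(1, round(k))

    for f in ("C", "CC", "CO", "CCO"):
        print(f, bits_for(f))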

The "needed improvement" therefore is the ability to handle hierarchical
descriptors, which are frequently found in chemical substructure screens.

While you say "still used some places", Kelly's site concerns dead media. I
spent some time trying to find modern edge-notched cards, including, through a
friend, asking on a mailing list of people interested in old computing tech.
No success. There were a couple of used ones on eBay, but I wanted 500 unused
ones so I could make a data set myself. Instead, this being the 21st century,
I think I can use a paper die-cut machine to make them for me, with pre-cut
holes even.

You mention "sorting" several times. My use is for selection, not sorting.

For punched cards the selection time will be O(N). The only way to get O(1) is
with an inverted index of just the descriptor you're looking for. Mooers in
1951 proposed 'Doken' (see
[http://www.historyofinformation.com/expanded.php?id=4243](http://www.historyofinformation.com/expanded.php?id=4243))
as a way to search '100 million items in about 2 minutes'.
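
For contrast, a tiny Python sketch of the two access patterns, with
hypothetical card data; this only illustrates the O(N) scan versus the O(1)
inverted-index lookup, nothing from Mooers' actual systems:

    from collections import defaultdict

    # Hypothetical cards, each tagged with its descriptors.
    cards = {1: {"1C", "3C"}, 2: {"1C"}, 3: {"1C", "3C", "6C"}}

    # Selection by linear scan: touch every card, O(N).
    scan_hits = [cid for cid, descs in cards.items() if "3C" in descs]

    # Inverted index: descriptor -> card ids, built once; each lookup is O(1)
    # plus the size of the result set.
    index = defaultdict(set)
    for cid, descs in cards.items():
        for d in descs:
            index[d].add(cid)
    index_hits = sorted(index["3C"])

    print(scan_hits, index_hits)  # [1, 3] [1, 3]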

------
toddkaufmann
There are tools and data available at the Science of Science (Sci^2) site [1],
part of the Cyberinfrastructure for Network Science Center, funded by NSF.

I haven't looked closely in a couple of years, but my impression is that NSF
hopes such tools can help show the effects (and "ROI") of research grant
money, its connections with PIs and institutions, and its impact on both
publication citations and the economy (development of technology, etc.).

Ideally this could also be used to measure growth in technical fields to
determine whether more (or less) funding is required to answer bigger
questions in basic science (which may not have economic incentives yet),
methodologies, public policy, and education (will there be enough Ph.D.s in
the pipeline to meet demands for fields that will exist in ten years?).

Scientometrics [2] (the journal) has been around for nearly 40 years, and I
assume people were thinking about such issues then. Sci^2 looks to me like a
more "big data" approach to not only understanding this but also seeing
whether it is possible to "push" the frontiers (though I admit I don't know
anything about what goes on at NSF or how their decision-making process
works).

Another tool, Publish or Perish [3], is aimed at individual academics who want
to understand their (or another's) impact in terms of the citation metrics
used in the game of academic (and other) hiring.

I stumbled on Sci^2 when trying to learn some new fields (computer vision, HPC
/ parallel computing, network science, sensemaking) and wanted to quickly find
seminal papers (i.e. highly cited ones, or literature reviews) to get a broad
overview. Not having the patience or time to read a lot, playing with
interesting tools and trying to extract data from Google Scholar and the like
was more attractive.

Being impatient, I wanted a way to process knowledge like data. To measure
something like growth of a field, it seems something like scientometrics with
some natural language processing and ontology engineering is needed.

The Google paper seems to be more of an analysis of Google Scholar data and
what can be gleaned there. Maybe an update of Google Scholar Metrics is
coming? I am surprised there is no reference to scientometrics in the arXiv
paper (maybe they aren't familiar with the literature?).

1. [https://sci2.cns.iu.edu/user/index.php](https://sci2.cns.iu.edu/user/index.php)
2. [http://en.wikipedia.org/wiki/Scientometrics_%28journal%29](http://en.wikipedia.org/wiki/Scientometrics_%28journal%29)
3. [http://www.harzing.com/pop.htm](http://www.harzing.com/pop.htm)

