
Document Structure Analysis Algorithms: A Literature Survey [pdf] - tokai
https://lhncbc.nlm.nih.gov/files/archive/pub2003015.pdf
======
roflc0ptic
Funny. I ran across this survey when I had to write a document structure
analysis step for a document analysis pipeline recently. I thought of this
paper in particular while watching
[https://vimeo.com/9270320](https://vimeo.com/9270320) ("What We Actually Know
About Software Development"), which I ran across on the recent thread about
tech talks.

I'm not a computer scientist (barely a scientist at all!), so I don't know
whether I'm being unreasonable, but I was a little dismayed by the literature
here. Is this quality of publication common? Is it getting better? The lack of
experimental methods and reproducibility seems abominable.

~~~
matt4077
This is specifically a survey, where it's uncommon to carry out your own
"experiments". You use the numbers from the original papers and hope there's
some overlap in the metrics they use.

Reproducibility is a problem, especially in the sense that source code has
traditionally not been published along with the paper. There are a few reasons
for that, e.g.:

- The "publish" in "publish or perish" isn't talking about GitHub

- The grad student wrote the code and it's 3000 lines of FORTRAN

- The professor wrote the code and it's 6000 lines of ALGOL 60

- The university owns the copyright but there's no process for OSS releases

- The authors own the copyright and can't wait to turn this into a commercial
spinoff

- It's 30 lines of Python, inextricably linked to an internal domain-specific
toolset that's 25 GB including multiple copies of test data and a lot of
external source code with unclear licenses

- The proof-of-concept is only about 20% of the work of a polished library
that's fit to release

It's getting better now, with some fields establishing standards for data
analysis pipelines, and some large companies with product experience doing a
lot of academic work (e.g. facebook/tensorflow). The AI/ML community is
probably a shining example, considering there are polished, ready-to-use
implementations available for almost all publications.

~~~
roflc0ptic
>> This is specifically a survey, where it's uncommon to carry out your own
"experiments". You use the numbers from the original papers and hope there's
some overlap in the metrics they use.

My comment was unclear. I wasn't criticizing the lit review; I was criticizing
the literature it reviews. The literature surveyed does a poor job of even
defining metrics.

The rest is very interesting, and heartening. Thanks for the thoughtful
response.

