
Show HN: Parsing horse racing charts with Apache PDFBox - robinhowlett
https://github.com/robinhowlett/chart-parser
======
joosters
Very interesting! I had never heard of Apache PDFBox before, I must give it a
try. I have a similar program that parses horse racing PDFs from sites such as
www.racehorserunner.com - which are of a much simpler format, but cause
endless problems for me when the PDFs have layout problems. For example,
issues like one column being too long and overlapping with another, e.g. the
last race on
[http://www.racehorserunner.com/Archives/ELP/ELP170702.pdf](http://www.racehorserunner.com/Archives/ELP/ELP170702.pdf)

All PDF parsers that I have tried cope very badly with these kinds of
situations, and often try to be 'too clever' in that they value the final
layout of the text over and above the individual strings.

Have you experienced similar problems with PDFBox, or does it handle
formatting and layout fairly reliably?

~~~
jahewson
PDFBox committer here, if you want even lower-level access to the page content
stream, without anything 'clever' at all, check out the
PDFGraphicsStreamEngine class, which is a superclass of the text extraction
and rendering classes. Gives you access to the raw glyphs. You can override
PageRenderer too, for visual debugging, e.g. render glyph bounding boxes. We
have an interactive Swing PDFDebugger which does just that.

[https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/ex...](https://github.com/apache/pdfbox/blob/6f18d7c4bef4d23a22d/examples/src/main/java/org/apache/pdfbox/examples/rendering/CustomGraphicsStreamEngine.java)

~~~
joosters
Thanks for the guidance, I'll take a look.

------
maxxxxx
I still don't understand how PDF could become one of the standards for
publishing documents. Well structured content gets converted into PDF which
loses most of that structure. And then a lot of work is done to guess that
structure from PDF and convert it back to a better file format. It just shows
that successful solutions don't have to be technically good.

~~~
userbinator
The keyword is "publishing" --- as in, producing human-readable physical
copies, not electronic ones. It just so happens that the format was relatively
suitable for the latter too (because it actually looks like a printed document
rendered on the screen --- unlike HTML or other formats around at the time),
which is why that use-case became popular. PDF is basically a descendant of
PostScript, which was designed to control printers.

(Its PostScript origins may also explain the bizarre mix of text and binary
that constitutes the file format. For example, page contents are in a
relatively free-form PostScript-ish RPN-like textual language, but are found
in "content streams" which may be compressed or encoded into a binary format.
Data "object" structures include things like '<<'-delimited dictionaries, '['
arrays ']', textual "/Names", and even provisions for comments(!?).

Then there are things like the cross-reference table of all objects in the
file, which is an array of fixed-width _textual numbers_ representing file
offsets, e.g. "0000001056 00000 n" refers to something 1056 bytes from the
start of the file. Reactions of _WTF!?_ from those working with the format for
the first time are not uncommon.)
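
As a rough illustration of the cross-reference entry quoted above, here is a minimal Python sketch that decodes one such fixed-width line. The names `XrefEntry` and `parse_xref_entry` are made up for this example, and it handles only a single classic table entry (10-digit offset, 5-digit generation number, then `n` for in-use or `f` for free); it does not locate the table in a file or deal with cross-reference streams.

```python
import re
from typing import NamedTuple


class XrefEntry(NamedTuple):
    """One line of a classic PDF cross-reference table (hypothetical helper type)."""
    offset: int      # byte offset from the start of the file (for in-use entries)
    generation: int  # generation number of the object
    in_use: bool     # True for 'n' entries, False for 'f' (free) entries


# 10 digits, space, 5 digits, space, 'n' or 'f', optional trailing whitespace.
_ENTRY = re.compile(r"^(\d{10}) (\d{5}) ([nf])\s*$")


def parse_xref_entry(line: str) -> XrefEntry:
    """Decode one textual cross-reference entry into numbers."""
    match = _ENTRY.match(line)
    if match is None:
        raise ValueError(f"not a cross-reference entry: {line!r}")
    offset, generation, kind = match.groups()
    return XrefEntry(int(offset), int(generation), kind == "n")


entry = parse_xref_entry("0000001056 00000 n")
print(entry.offset, entry.generation, entry.in_use)
```

The zero padding is what makes every entry the same width, so a reader can seek directly to the Nth entry without scanning the ones before it.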

~~~
amenghra
Minimal PDF explained:
[https://brendanzagaeski.appspot.com/0004.html](https://brendanzagaeski.appspot.com/0004.html)

------
beager
Very neat, and gets me curious about PDFBox, but every time I see something
that converts a consistent-layout PDF back to structured data, I just bemoan
the fact that this would all be trivial with an API for these kinds of things.

------
0x445442
Great job!

I was just looking at collecting race information and historical results data
a month or two ago and was struck by the lack of available structured data.
Heck, I couldn't easily find any for-pay options either.

------
Cyph0n
Firstly, what an interesting library. Secondly, this is among the best TLDR
readmes I've ever seen! I lack exposure to this area, so I'm actually quite
impressed with the complexity of it.

Keep up the great work.

------
richiverse
As a Python programmer, I found R's pdftools to be indispensable for messy
text-based PDFs. I couldn't find a Python lib that worked as consistently
across variously different formats.

~~~
tunaoftheland
I came across
[https://github.com/pdfminer/pdfminer.six](https://github.com/pdfminer/pdfminer.six)
recently and was impressed with what it could get done. The documentation can
be challenging to parse, so I relied on a code sample from a StackOverflow
answer. Have you had a chance to try it out? Curious about how/if it works
well across platforms.

------
hbcondo714
Impressive! Seems like you can't just use PDFBox out of the box (no pun
intended) and need to write some custom code specific to the PDF itself per
the chart-parser commits[1]

[1] [https://github.com/robinhowlett/chart-parser/tree/master/src...](https://github.com/robinhowlett/chart-parser/tree/master/src/main/java/com/robinhowlett/chartparser)

~~~
robinhowlett
Author here; well, PDFBox is good for simple text stripping. If I wanted to
print all the text on the PDF, that would be very straightforward and not much
code. However, the PDF chart here is in essence a representation of structured
data. I wanted to get the content in that format so that I could both
serialize to JSON plus have an SDK to boot.

------
JabavuAdams
Crazy! I was just looking into this topic a few weeks ago, for a friend.
Thanks!

------
vbuwivbiu
what I would love is an app that would reformat portrait PDFs as 2-column
landscape for reading on my screen

~~~
mpweiher
Most PDF viewers I am aware of (including my own, PostView) have 2-up modes;
are those not sufficient?

~~~
vbuwivbiu
that's still presenting the pages in portrait orientation. I want the pages to
be landscape and for the text to flow in at least 2 columns.

~~~
sk5t
PDF is used primarily for pre-paginated media and does not reflow text; if the
PDF author wanted the pages in landscape orientation, or using some other
paper dimensions, he would have specified that. Same goes for margins and the
like.

~~~
grogenaut
Doesn't mean the author isn't wrong for my reading situation... Or isn't doing
it just cause everyone else is.

~~~
userbinator
I think what the parent is trying to say is that PDF is not like HTML or text
or other formats where the viewer is primarily responsible for a lot of the
formatting --- in fact, a PDF page contains not much more than primitive
instructions of the form "move to X, Y"; "set font to F"; "draw text 'Some
text here'" (some pathological cases issue individual moves and draws _for
each character_ ) --- so expecting all PDF viewers to be able to somehow
"reverse-engineer" or "decompile" that set of low-level drawing instructions
into more semantic entities like lines of text or even words in order to
reformat the text is a little too much.

Anyone who has tried selecting text from a two-column PDF page will also
quickly realise the nature of the problem.
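
To make the "move to X, Y; set font; draw text" description concrete, here is a small Python sketch that walks a hand-written, uncompressed content stream and records where each string is drawn. `BT`, `ET`, `Tf`, `Td` and `Tj` are real PDF text operators, but everything else is an assumption for illustration: the sample stream, the `positioned_text` helper, and the tokenizer, which ignores string escapes, hex strings, arrays, `TJ`, `Tm`, the graphics state and stream compression.

```python
import re

# Simplified tokenizer: literal strings, /Names, numbers, operator keywords.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|/\w+|-?\d+(?:\.\d+)?|[A-Za-z]+")

# Hypothetical two-column page: the stream draws across the page, row by row.
SAMPLE = """
BT
/F1 12 Tf
72 700 Td
(Left column, line 1) Tj
250 0 Td
(Right column, line 1) Tj
-250 -14 Td
(Left column, line 2) Tj
250 0 Td
(Right column, line 2) Tj
ET
"""


def positioned_text(stream):
    """Return (x, y, font, text) tuples in the order the stream draws them."""
    operands, drawn = [], []
    x = y = 0.0
    font = None
    for tok in TOKEN.findall(stream):
        if tok[0] == "(":
            operands.append(tok[1:-1])      # string operand, parentheses stripped
        elif tok[0] == "/":
            operands.append(tok[1:])        # name operand, e.g. a font resource
        elif tok[0] == "-" or tok[0].isdigit():
            operands.append(float(tok))     # numeric operand
        else:                               # operator: consumes pending operands
            if tok == "BT":
                x = y = 0.0                 # a text object starts at the origin
            elif tok == "Tf":
                font = operands[0]          # operands are: font name, size
            elif tok == "Td":
                x += operands[0]            # offset relative to the line start
                y += operands[1]
            elif tok == "Tj":
                drawn.append((x, y, font, operands[0]))
            operands.clear()
    return drawn


runs = positioned_text(SAMPLE)
for run in runs:
    print(run)

# Guessing reading order is a heuristic: here, left-to-right by x, then top-down.
reading_order = sorted(runs, key=lambda r: (r[0], -r[1]))
```

Nothing in the stream says which strings belong to the same column or paragraph; the sort at the end only works because the sample has two clean x positions, which is exactly the kind of guess that breaks when columns overlap.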

~~~
grogenaut
I totally get that; however, since these days a standard sheet of paper is no
longer the main reading mechanism... I'm not sure it's the best layout for
reading. My brother is a PhD and reads papers all day. He hates the two column
format and paid for a reflowing reader. PDF is terrible on screens with
different ratios than paper... e.g. good computers and mobile. As I get older I
use my plethora of giant screens to crank the fonts way up and sit back
relaxed. PDF is terrible in that situation.

Just a thought... Maybe highly formatted PDFs for paper print shouldn't be the
standard anymore. E.g. my original point.

------
ocrimgproc
Can it be used for invoices?

