
Logging in large mathematical models - pablobaz
https://www.ahl.com/logging-in-large-mathematical-models
======
meuk
For my master's thesis, I implemented a new and fancy algorithm. The code seemed
to be fine and dandy for the usual, simple test cases. Once I moved on to more
elaborate test cases, I found cases that didn't work well.

After contacting the author, who indicated he didn't have such problems, I
literally spent months trying to debug the code. When I finally gave up,
rewrote my implementation basically from scratch, and found the same problems,
I contacted the author again. He then indicated that there was indeed a
problem with the method for these cases, that he understood the problem, and
that he had found a way to fix it. In hindsight, the problem was not hard to
understand (but still, the claims in the paper were unwarranted IMO).

Conclusion? I wish I were a math prodigy; then I would have spotted the problem
instantly. Also, be wary of claims made in papers.

~~~
soVeryTired
Most papers are wrong. Not necessarily wrong in an essential way, but most of
them overstate their claims.

There's a rule of thumb in the machine learning community: if you want an
algorithm that does X well, find a paper on doing X, and implement the method
that the paper _compares its proposal against_. That way you're near state of
the art, but you're also using a method that many people have used and tested.

~~~
Q6T46nT668w6i3m
This is new to me, but I think it’s fantastic advice!

------
petercooper
I know it's a bit of a tangent, but proactive, large-scale logging of models
like this (such as those used in machine learning) may become desirable to
meet the requirements of GDPR. If you have to be able to explain how an
algorithm made a decision, you need to be able to pull up data like this
somehow.
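A minimal sketch of what "pulling up data like this" could look like (all names here are hypothetical, not from the thread or from GDPR itself): log every decision as one JSON line carrying the exact inputs and model version that produced it, so any individual decision can be retrieved and explained later.

```python
import json
import os
import tempfile
import time

def log_decision(path, model_version, inputs, decision):
    """Append one auditable decision record to a JSON-lines log."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "inputs": inputs,
        "decision": decision,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def explain(path, predicate):
    """Pull up every logged record that matches a predicate."""
    matches = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                matches.append(record)
    return matches

# Demo: log two decisions, then retrieve the data behind the "deny".
log_path = os.path.join(tempfile.mkdtemp(), "decisions.jsonl")
log_decision(log_path, "v1", {"income": 100}, "approve")
log_decision(log_path, "v1", {"income": 10}, "deny")
denied = explain(log_path, lambda r: r["decision"] == "deny")
```

An append-only log like this is deliberately separate from the application database: it only ever grows, and the audit question "why did the system decide X for this person?" becomes a simple filter over the records.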

~~~
77pt77
Do you actually believe people in authority have the ability to understand
what's at stake in this?

If you do, you need more life experience.

~~~
petercooper
I am a pragmatist looking at ways to work within the law, not an idealist
looking to change the law, so it doesn't matter what I believe about people
with power. They make decisions and we work out ways to operate within or
around them.

And, yes, we all need more life experience :-)

------
arethuza
A while back I was working on a system doing fairly complex engineering
calculations and I implemented detailed logging of both the values used and
the actual calculations performed.

This allowed me to generate a spreadsheet (with the values and calculations in
place) that could show a non-developer exactly how the outputs had been
calculated (you could use Excel's features to add visual annotations of
precedents and dependents).

I was pretty pleased with that approach.
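A stdlib-only sketch of the idea (not arethuza's actual code; the labels and formulas are invented for illustration): write both the input values and the calculations into a sheet, with each result cell being a live formula over the input cells. Excel treats a cell that starts with "=" as a formula even when the file is a CSV, so its trace-precedents/dependents arrows work on the results.

```python
import csv

# Each step: a label, two input values, and a formula template over
# columns B and C of that row (hypothetical engineering calculations).
steps = [
    ("factored load (kN)", 12.0, 1.5, "=B{row}*C{row}"),  # load * safety factor
    ("stress (kN/m2)", 18.0, 0.25, "=B{row}/C{row}"),     # force / area
]

with open("calculations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["step", "input 1", "input 2", "result"])
    # Data starts on sheet row 2, below the header row.
    for row, (label, a, b, formula) in enumerate(steps, start=2):
        writer.writerow([label, a, b, formula.format(row=row)])
```

Opening the resulting file in Excel shows the computed numbers, while clicking a result cell reveals exactly which inputs it was derived from.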

------
no_identd
If you want IMMENSELY powerful logging, take a look at how the trace logging
of Racket's Medic Debugger works, an absolutely ingenious solution:

[https://docs.racket-lang.org/medic/index.html](https://docs.racket-lang.org/medic/index.html)

Interestingly, albeit a bit off topic, the authors of the paper that Medic
originated from recently took this technique and cranked it to 11:

[https://conf.researchr.org/event/sle-2017/sle-2017-papers-de...](https://conf.researchr.org/event/sle-2017/sle-2017-papers-debugging-with-domain-specific-events)

…which won them a distinguished paper award!

~~~
soegaard
Thanks for the link - it looks great.

------
dmichulke
This looks to me like a standard logging toolchain where you just have
programmatic access to the logs.

It's like claiming you save and load a JSON object (instead of its
serialization) in some hashmap/DB for fast lookup.

Am I missing something?

~~~
posterboy
This might seem normal, but perhaps not in mathematics? There's the story
about homotopy type theory, which first had to be invented, along with a
supporting programming language, to formalize and automatically error-check
proofs. That is still a niche topic in maths (because it seems faster to just
think, I gather. Edit: and because of the problem of bootstrapping such a
system, and Not Invented Here syndrome).
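For a concrete taste of what "automatically error-checked proofs" means, here is a tiny illustrative sketch in Lean 4 (the theorem names are invented; `Nat.add_comm` is a standard library lemma): the compiler itself verifies each proof and rejects the file if any step is missing or wrong.

```lean
-- Accepted: `n + 0` reduces to `n` by definition, so `rfl` closes the goal.
theorem n_plus_zero (n : Nat) : n + 0 = n := rfl

-- Accepted: justified by a library lemma. Deleting or misnaming the
-- justification would be a compile-time error, not a silent gap.
theorem comm_example (m n : Nat) : m + n = n + m := Nat.add_comm m n
```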

------
vog
That's a great approach to logging/debugging complex models on large datasets.

I'm pretty sure this can be applied outside of math too, e.g. to systems with
complex business rules over large datasets.

------
trextrex
I implemented a library that does exactly this, in case anyone finds it
useful: [https://github.com/IGITUGraz/SimRecorder](https://github.com/IGITUGraz/SimRecorder).
It supports storing data in both HDF5 and Redis (although I wouldn't recommend
Redis for storing large numpy arrays).
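The core pattern, stripped of any particular backend, is small. This is a hypothetical stand-in, not SimRecorder's actual API: record named intermediate arrays per step during a run, then reload them for inspection. SimRecorder backs this kind of recording with HDF5 or Redis; numpy's `.npz` container is used here as the simplest file-based equivalent.

```python
import numpy as np

class ArrayRecorder:
    """Collect named intermediate arrays from a run, keyed by step."""

    def __init__(self):
        self._data = {}

    def record(self, name, step, array):
        # Key each array by variable name and step number.
        self._data[f"{name}_{step}"] = np.asarray(array)

    def save(self, path):
        # Write every recorded array into one .npz archive.
        np.savez(path, **self._data)

# Demo: record a (fake) weight matrix at each of three steps.
rec = ArrayRecorder()
for step in range(3):
    rec.record("weights", step, np.full((2, 2), step, dtype=float))
rec.save("run.npz")

# Later, reload the run and inspect any step.
loaded = np.load("run.npz")
```

Swapping the `save` method for an HDF5 or Redis writer changes the storage characteristics (concurrent readers, network access) without touching the recording code.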

------
cocoablazing
What’s the advantage of the file system/repo/bespoke diag database over
storing the numpy arrays in the existing database infrastructure?

Doesn’t implementing this system with HDF5 cause headaches for concurrency in
either direction?

~~~
joshmarlow
I don't know the author's use-case specifically, but that could fall over in
two ways:

1) if you've got models that are re-generated periodically based on new
inputs/algorithm tweaks, then you can potentially end up with quite a few of
these as you scale.

2) if you want to track the details of, or debug the reason why, your
production system made a given decision, you need to log not just your model
but all of the parameters that went into that decision. If that type of
decision happens many times a day, you can end up with some pretty massive
logs to go through.

In either case, storing that historical data in your transactional database
can be a bit of a load, so it's ideal to keep it separate if you get any kind
of volume. I've actually bumped into 2) at one job.

