
The Source Code for NYC's Forensic DNA Statistical Analysis Tool - foob
https://github.com/propublica/nyc-dna-software
======
Alex3917
For context, this is getting released just after the city council committee on
technology had a public hearing about a bill that would require any software
used by the government to prosecute folks for crimes to be open source:

[https://www.fwdeveryone.com/t/FrtKpHJ2T0mCic3j_FSzPQ/nyc-
sof...](https://www.fwdeveryone.com/t/FrtKpHJ2T0mCic3j_FSzPQ/nyc-software-
transparency-bill-1696)

This bill would set the standard for the entire country, so the testimony is
actually pretty interesting and worth watching, especially the opening speech
by councilman Vacca.

[https://councilnyc.viebit.com/player.php?hash=xHL4m7vQXCM9](https://councilnyc.viebit.com/player.php?hash=xHL4m7vQXCM9)

I also testify at 56:00, though I hadn't prepared remarks in advance, so I
wasn't as polished as some of the other folks.

~~~
slackingoff2017
This is excellent news and I'm incredibly grateful the code used to convict
people will be publicly scrutinized.

That said, other C# devs and I have already found numerous signs that at
least parts of the codebase were written by someone clueless or inexperienced
with C#. And there are no tests.

If someone spends the time to look into the codebase, I'm predicting a
basically 100% chance that parts of this code will be shown to produce
inaccurate or completely wrong results. Innocent people will probably go free
because of this.

~~~
jeffshek
The fact that there are no tests makes me feel incredibly sad - there are
probably plenty of people who were unfairly sentenced because of this.

------
devillighter
Code in "Comparison.cs" on line 1307

    
    
    // if we don't have frequencies, use the default (this should never happen, but here it is)
    if (tblFreq == null || tblFreq.Rows.Count == 0)
    {
        a = (float)0.02;
        b = (float)0.02;
    }
    

is kind of fishy. Why not bomb out instead of failing silently on things that
are never supposed to happen? Not saying this is inherently wrong - I'm
certainly not capable of actually understanding the details of the analysis.

~~~
slackingoff2017
The guy who wrote this is a C# noob. He should just use a null conditional
for the if, Epsilon for tiny but non-zero floats, and should very likely be
using doubles instead of floats, since they're basically the same speed for
double the precision.

Bonus points for not using the shorthand float literal syntax.

I would expect to see this many goofs only from someone who's been using C#
for a month or so. I only used it for a couple of years.
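
For what it's worth, a sketch of the snippet rewritten along those lines (the
0.02 default is from the original code; the method and constant names here are
made up for illustration):

```csharp
using System;
using System.Data;

class FrequencyDefaults
{
    // 0.02 comes from the original snippet; a plain double literal needs
    // neither a cast nor a suffix. "DefaultFreq" is an illustrative name.
    const double DefaultFreq = 0.02;

    // Null-conditional plus null-coalescing collapses the original
    // two-clause null/empty guard into a single expression.
    static (double a, double b) PickFrequencies(DataTable tblFreq)
    {
        if ((tblFreq?.Rows.Count ?? 0) == 0)
        {
            return (DefaultFreq, DefaultFreq);
        }
        // A real implementation would read the frequencies out of the
        // table here; that part is out of scope for this sketch.
        throw new NotImplementedException();
    }

    static void Main()
    {
        var (a, b) = PickFrequencies(null);  // no table: falls back to defaults
        Console.WriteLine(a == DefaultFreq && b == DefaultFreq);
    }
}
```

Whether 0.02 is even a sane fallback is a separate (statistical) question; this
only shows the language-level cleanup.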

~~~
TwoBit
Your criticism seems a bit overboard. You are making assumptions about things
you don't know about, and are judging a personal choice of floating-point
representation.

~~~
slackingoff2017
I can overlook a lot of coding issues, but seeing so many in a few lines is a
sign somebody doesn't know what they're doing. Just using a linter would
probably find some of these automatically, so the fact that they're there is a
strong signal that the dev didn't even know/care to use one.

I'm making assumptions based on experience with C# codebases and I think my
argument on double vs float is solid. I don't think I've ever seen floats used
in a production C# codebase because there's no point.

Overboard maybe :) but I still think whoever wrote that is very inexperienced.

~~~
taspeotis
Looks pretty bad to me. Null conditional doesn’t really matter because the
code likely predates its existence. But...

[https://github.com/propublica/nyc-dna-
software/blob/master/F...](https://github.com/propublica/nyc-dna-
software/blob/master/FST.Common/Database.cs#L122)

1. Knows how to use "using" but doesn't use it for myConnection.

2. myConnection.CreateCommand would be more idiomatic than new SqlCommand ...
cmd.Connection = myConnection;

3. Doesn't know how to use "using"! using (DataTable myDt ... then goes on to
"return myDT"

4. Also just noticed the oddity of myAdapter.SelectCommand = cmd, although cmd
is immediately disposed after that, then Fill is called later.

That's one function, although the rest seem to repeat similar sins (CTRL+C,
CTRL+V).
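
For comparison, a sketch of what the idiomatic version of a function like that
might look like (names are illustrative, not from the repo; this assumes the
classic System.Data.SqlClient API the code is written against):

```csharp
using System.Data;
using System.Data.SqlClient;

static class Db
{
    // Connection, command, and adapter are all disposed via "using"; the
    // command comes from CreateCommand rather than new SqlCommand + manual
    // wiring; and the DataTable is the return value, so it is deliberately
    // NOT wrapped in a "using" here.
    public static DataTable RunQuery(string connectionString, string sql)
    {
        var table = new DataTable();
        using (var connection = new SqlConnection(connectionString))
        using (var command = connection.CreateCommand())
        {
            command.CommandText = sql;
            using (var adapter = new SqlDataAdapter(command))
            {
                adapter.Fill(table);  // Fill manages opening/closing the connection
            }
        }
        return table;
    }
}
```

One helper along these lines could then replace the repeated copies, instead of
the copy-paste pattern noted above.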

~~~
slackingoff2017
Nice! I can see more too.

Using out parameters; these are highly discouraged. C# lets you do terrible
things (including goto), but the community is pretty aligned that it's bad
practice. In my years of doing C# I've never once used one.

Not using var anywhere. This is C# type inference 101.

Throwing "Exception" in a catch with only a message. This is horrific because
it throws away the original context and any stack trace. You can just say
"throw" (just the one word) or trivially rethrow with the original exception
as an InnerException. This is unforgivable; an experienced C# developer would
never do this.

Extraneous uses of "this" when it's the default.

Initializing variables to null.

Not using shorthand object initializers (but this is only in newer versions of
C#).

Using class variables instead of properties.

Using Parse instead of TryParse... Like Java, exception handling is an
expensive stack unwind on failure, not supposed to be used for control flow.

The worst, though, is that they didn't make a method to reuse the SQL
connection and DataTable code, and instead copy-pasted it around 20 times.
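
To make the stack-trace point concrete, a minimal self-contained example
(nothing here is from the repo; names and messages are invented):

```csharp
using System;

class RethrowDemo
{
    static void Inner() => throw new FormatException("bad input");

    static void Outer()
    {
        try
        {
            Inner();
        }
        catch (FormatException ex)
        {
            // "throw new Exception(message)" alone would discard ex and its
            // stack trace entirely. The two non-destructive options are:
            //   throw;                            // rethrow, trace intact
            //   throw new Exception("...", ex);   // wrap, original kept as InnerException
            throw new Exception("lookup failed", ex);
        }
    }

    static void Main()
    {
        try { Outer(); }
        catch (Exception ex)
        {
            // The original failure survives as the InnerException.
            Console.WriteLine(ex.InnerException?.GetType().Name);
        }
    }
}
```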

~~~
MichaelGG
If the input being parsed is supposed to be well formed, or the code isn't in
a high-performance area, it's totally fine to use Parse rather than the sorta
ugly TryParse (which requires using the out params you just said were bad).

Especially if all they're going to do is throw.
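
Both call styles side by side, as a tiny sketch (inputs are made up):

```csharp
using System;

class ParseDemo
{
    static void Main()
    {
        // TryParse: no exception on malformed input, at the cost of the
        // out parameter criticized upthread.
        if (!int.TryParse("12x", out var n))
        {
            Console.WriteLine("not a number");
        }

        // Parse: cleaner call site when bad input is genuinely exceptional
        // and the caller would only turn a false return into a throw anyway.
        Console.WriteLine(int.Parse("42"));
    }
}
```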

------
foob
I haven't had a chance to dig into the analysis part of the code yet, but
here's an amusing variable assignment that jumped out at me in
`FST.Common/ComparisonData.cs`.

    
    
        compareMethodIDSerializationBackupDoNotMessWithThisVariableBecauseItsAnUglyHackAroundACrappyLimitiationOfTheFCLsCrappyXMLSerializer = comparisonID;

~~~
slackingoff2017
I love this type of thing in C# because it's such a terse and well designed
language. It's easy to make something stand out.

A cult favorite of devs at my previous employer was to (ab)use the power of
LINQ to try to create the longest single-line methods possible. I held the
unofficial record for a while with a LINQ statement over 1200 chars. Innocent
enough, since ReSharper can unwrap them into sane code, but lots of fun.

~~~
sixothree
I love/hate it when someone writes a linq statement and ends with "look, one
line".

------
tekmaven
On first glance it uses very old frameworks (for example, ASP.NET WebForms)
and there are 0 unit tests.

A concerning thing is that a dev database username and password were committed
to multiple files, like this: [https://github.com/propublica/nyc-dna-
software/blob/master/F...](https://github.com/propublica/nyc-dna-
software/blob/master/FSTServiceConsoleHost/app.config#L17).

~~~
bas
For this kind of project, given the stakes involved (e.g. a defendant's
freedom), why would you want to use the brand-new framework-of-the-month that
the cool kids are using?

~~~
michaelmrose
Possibly they shouldn't be using bug-ridden trash either; it seems like a
compromise ought to be possible.

~~~
sk5t
Webforms are not trash, just very annoying to use as a developer.

------
finnn
It seems that the source code for any computer program whose output is used as
evidence in a court of law should be available for inspection.

------
mcescalante
The accompanying article/blog post linked in the Readme
([https://www.propublica.org/article/federal-judge-unseals-
new...](https://www.propublica.org/article/federal-judge-unseals-new-york-
crime-labs-software-for-analyzing-dna-evidence)) has good background &
information about the history of this, as well as potential impact moving
forward.

------
inetknght
I work in DNA analysis and stuff like this makes me happy specifically because
I know just how easy it is to have flawed analysis techniques written in
software.

~~~
CyberDildonics
How long does this stuff typically take to run on a dataset?

~~~
inetknght
That depends on the specific kind of analysis being done and on what size of
dataset you want to run it against. Some analysis techniques filter data down
to just a few MiB; others need as much as a few hundred GiB for a single
individual. Sometimes you do population analysis on that data, which can end
up multiplying the individual sample size by the size of the population.

~~~
CyberDildonics
OK, what kind of ballpark time does each of those entail? You didn't give a
single hard number.

~~~
AlexCoventry
I've done work in statistical genetics, but not in forensic analysis
specifically, so take these rough estimates with a grain of salt. That said,
forensic analysis generally involves identifying the origin of a tissue
sample. If the sample is substantial enough for reliable genotyping, once the
genotyping is done the actual identification step should take less than a
second in a well-optimized system, even to pick out the origin from every
human genome in history (assuming such a database were available). It's just a
matter of finding who has matching allelic variants, and ensuring that you
match over enough common alleles with low linkage disequilibrium
(shared-haplotype correlation) that the risk of a coincidental match is
infinitesimal, even among individuals with high consanguinity. If you
organized the genomes in a tree where each node split on a single allele, you
could search in logarithmic time.
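
Under the (strong) simplifying assumption of clean biallelic genotype calls,
that search idea can be sketched with nothing fancier than sorted fixed-length
allele strings; all data here is invented:

```csharp
using System;

class GenotypeSearch
{
    static void Main()
    {
        // Each string encodes one genome's alleles at four biallelic loci
        // (0/1 per locus). A sorted array plus binary search gives the same
        // logarithmic lookup as the allele-splitting tree described above.
        string[] genomes = { "0010", "0101", "0110", "1011", "1101" };
        Array.Sort(genomes);

        int idx = Array.BinarySearch(genomes, "0110");
        Console.WriteLine(idx >= 0 ? "match at " + idx : "no match");
    }
}
```

Real forensic data is messier (mixtures, dropout, measurement error), which is
exactly why the probabilistic models discussed elsewhere in this thread exist.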

~~~
woodson
The situation is more difficult if you consider the possibility of having
multiple contributors to a sample (mixtures), allele dropout, etc. I’ve heard
some models/implementations take a day for MCMC sampling to calculate the
probabilities. (I’m talking about forensic DNA specifically)

------
gravypod
Is there a large dataset of anonymized DNA samples that we could try this tool
against? Does 23andme do something like that? I hope they don't but it would
be very useful for just this one case!

~~~
AlexCoventry
Amazingly, it looks like they still use Sanger sequencing, and a lot of the
uncertainty will probably come from genotyping errors.

There is a lot of anonymized genotype data in the 1000 Genomes Project
([http://www.internationalgenome.org/data](http://www.internationalgenome.org/data)),
but the devil is really in the genotyping details with this kind of thing. It
looks like in the validation project
([https://www.documentcloud.org/documents/4113877-1-19-17-Exhi...](https://www.documentcloud.org/documents/4113877-1-19-17-Exhibit-
I-FST-Executive-Summary.html)), they went from biological samples to sequence
data to analysis results, which is the right approach.

With this kind of low-throughput genotyping, so much depends on the accuracy
of the genotyping method, which in turn depends on the lab, the protocol, the
technicians, and sometimes even the weather. The software is not where I would
start in worrying about forensic evidence based on this method, though it
definitely could be a source of errors.

~~~
dekhn
I'm not sure what you mean by "amazingly, it looks like they still use Sanger
sequencing". Sanger is considered the gold standard -- all HTS genome methods
are validated against it.

~~~
AlexCoventry
For this task, the low throughput harms performance. There are genotyping
methods which will measure tens of thousands of common variants for a couple
of bucks. And the more genotypes you have, the surer you can be of a match,
even given a slightly higher genotyping error rate.

~~~
dekhn
I think you may be misunderstanding my point (it's not clear if you are an
expert in this area; I have more than a passing knowledge of DNA). In every
area I have worked in, Sanger sequencing is considered gold standard and is
used whenever the results are being used for medical purposes. That includes
genotyping as well as verifying the results of WGS.

It's considered a critical QC step.

Throughput doesn't matter if you have false-positive or false negative errors
that cause erroneous medical decisions.

~~~
AlexCoventry
I'm very familiar with Sanger sequencing. Base for base, it's one of the most
accurate methods, but it's by no means worthy of being considered a gold
standard; it's just somewhat better. And the drawback is that, dollar for
dollar, it's pretty inefficient. You can genotype all the common variants
using an Affymetrix gene chip, and base your inference on vastly more data.
For forensic analysis, where the goal is to discriminate between genomes, that
would be much more accurate. Maybe go back and verify the most
discriminatory/confirmatory loci with Sanger sequencing after that.

------
jasminz
I just love the way they have the connection string along with what seems to
be a real database URI and accompanying username and password.
------
tmsam
Nested ternaries, ungodly amounts of repetition, ZERO tests, hardcoded
sensitive info... this code wouldn't pass code review anywhere. I don't care
when it was written; any day it is used to send people to jail, it should be
the best possible code humans can write. And it is so far from that.

