The Source Code for NYC's Forensic DNA Statistical Analysis Tool (github.com)
134 points by foob on Oct 20, 2017 | 45 comments

For context, this is getting released just after the city council committee on technology had a public hearing about a bill that would require any software used by the government to prosecute folks for crimes to be open source:


This bill would set the standard for the entire country, so the testimony is actually pretty interesting and worth watching, especially the opening speech by Councilman Vacca.


I also testify at 56:00, though I hadn't prepared remarks in advance, so I wasn't as polished as some of the other folks.

This is excellent news and I'm incredibly grateful the code used to convict people will be publicly scrutinized.

That said, other C# devs and I have already found numerous signs that at least parts of the codebase were written by someone clueless about or inexperienced with C#. And there are no tests.

If someone spends the time to look into the codebase, I'm predicting a basically 100% chance that parts of this code will be shown to produce inaccurate or completely wrong results. Innocent people will probably go free because of this.

The fact that there are no tests makes me feel incredibly sad - there are probably plenty of people who were unfairly sentenced because of this.

You make a very valid and telling point early on: in NYC, a job painting people's nails (Nail Technician/Cosmetologist) requires around 1,000 hours of training, yet a software developer designing software that can put people in jail (or worse) requires no formal training or qualification at all!

Disclaimer - I mean no disrespect to Nail Technicians.

Code in "Comparison.cs" on line 1307

    // if we don't have frequencies, use the default (this should never happen, but here it is)
    if (tblFreq == null || tblFreq.Rows.Count == 0)
        a = (float)0.02;
        b = (float)0.02;
is kind of fishy. Why not bomb out instead of failing silently for things that are never supposed to happen? Not saying this is inherently wrong - I'm certainly not capable of actually understanding the details of the analysis.

The only thing that should follow a "This should never happen" is an assert or exception.
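A minimal sketch of what that could look like here (variable names taken from the quoted snippet; the exception type and message are my own):

```csharp
// Sketch: fail loudly instead of silently substituting a hard-coded default.
// tblFreq, a, and b stand in for the variables in the quoted snippet.
if (tblFreq == null || tblFreq.Rows.Count == 0)
{
    // Surfacing the problem forces a human to investigate the missing
    // frequency table rather than quietly producing a likelihood from
    // a made-up default value.
    throw new InvalidOperationException(
        "Allele frequency table is missing or empty; cannot compute likelihoods.");
}
```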

Do we know who wrote this or at least which city agency? Or was it done by consultants or acquired?

Unless it’s a uniform prior... this is bad.

The guy who wrote this is a C# noob. He should just use a null-conditional for the if, float.Epsilon for tiny but non-zero floats, and should very likely be using doubles instead of floats, since they're basically the same speed for double the precision.

Bonus points for not using the shorthand float literal syntax.
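For anyone not steeped in C#, a small illustration of the two floating-point points above (not from the codebase; the variable names are made up):

```csharp
// The codebase's style: cast a double literal down to float.
float aVerbose   = (float)0.02;
// Idiomatic: the shorthand float literal suffix.
float aShorthand = 0.02f;

// double is C#'s default floating-point type, with roughly 15-16
// significant decimal digits vs. about 7 for float, at essentially
// the same speed on modern hardware:
double frequency = 0.02;  // no suffix or cast needed
```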

I would expect to see this many goofs only from someone who's been using C# for a month or so, and I only used it for a couple of years.

Your criticism seems a bit overboard. You are making assumptions about things you don't know about, and are judging a personal choice of floating-point representation.

I can overlook a lot of coding issues, but seeing so many in a few lines is a sign somebody doesn't know what they're doing. Just using a linter would probably find some of these automatically, so the fact that they're there is a strong signal that the dev didn't even know/care to use one.

I'm making assumptions based on experience with C# codebases and I think my argument on double vs float is solid. I don't think I've ever seen floats used in a production C# codebase because there's no point.

Overboard maybe :) but I still think whoever wrote that is very inexperienced.

Looks pretty bad to me. Null conditional doesn’t really matter because the code likely predates its existence. But...


1. Knows how to use "using" but doesn't use it for myConnection.

2. myConnection.CreateCommand would be more idiomatic than new SqlCommand ... cmd.Connection = myConnection;

3. Doesn't know how to use "using"! using (DataTable myDt ... then goes on to "return myDT"

4. Also just noticed the oddity of myAdapter.SelectCommand = cmd although cmd is immediately disposed after that, then Fill is called later.

That's one function although the rest seem to repeat similar sins (CTRL+C, CTRL+V).
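Roughly, the idiomatic version of that pattern would look something like this (a sketch assuming System.Data.SqlClient; the method name, connection string, and query are placeholders, not from the codebase):

```csharp
// Sketch: every IDisposable is wrapped in a using block, and the command
// comes from the connection itself via CreateCommand.
DataTable GetFrequencies(string connectionString)
{
    var table = new DataTable();
    using (var connection = new SqlConnection(connectionString))
    using (var command = connection.CreateCommand())
    {
        command.CommandText = "SELECT Locus, Allele, Frequency FROM Frequencies";
        connection.Open();
        using (var adapter = new SqlDataAdapter(command))
        {
            adapter.Fill(table);
        }
    }
    // The DataTable is what the caller receives, so it is NOT wrapped in a
    // using block here -- disposing it before returning (as the original
    // code apparently does) hands the caller a disposed object.
    return table;
}
```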

Nice! I can see more too.

Using out parameters; these are highly discouraged. C# lets you do terrible things (including goto), but the community is pretty aligned that it's bad practice. In my years of doing C# I've never once used one.

Not using var anywhere. This is C# type inference 101.

Throwing "Exception" in a catch with only a message. This is horrific because it throws away the original context and any stack trace. You can just say "throw" (just the one word) or trivially rethrow wrapped in a new exception carrying the original as InnerException. This is unforgivable; an experienced C# developer would never do this.
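The difference is a line or two (a sketch; `ProcessSample` is a hypothetical method):

```csharp
try
{
    ProcessSample();
}
catch (Exception ex)
{
    // BAD: discards the original exception type and its stack trace.
    // throw new Exception("Processing failed");

    // Better: wrap it, keeping the original as InnerException...
    throw new Exception("Processing failed", ex);

    // ...or simply rethrow, preserving the stack trace exactly:
    // throw;
}
```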

Extraneous uses of "this" when it's the default.

Initializing variables to null

Not using shorthand object initializers (but this is only in newer versions of C#)

Using class variables instead of properties

Using Parse instead of TryParse... Like in Java, exception handling is an expensive stack unwind on failure and isn't supposed to be used for control flow.
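For example (a toy snippet; `input` is made up):

```csharp
string input = "12.5";

// double.Parse throws FormatException on bad input -- an expensive
// stack unwind if you are using it to validate untrusted data:
// double value = double.Parse(input);

// double.TryParse reports failure through its return value instead:
if (double.TryParse(input, out double value))
{
    Console.WriteLine(value);
}
else
{
    Console.WriteLine("not a number");
}
```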

The worst, though, is that they didn't make a method to reuse the SQL connection and data table code, and instead copy-pasted it around 20 times.

If the input being parsed is supposed to be well formed, or the code isn't in a high-performance area, it's totally fine to use Parse rather than the sorta ugly TryParse (which requires the out params you just said were bad).

Especially if all they're going to do is throw.

I started using C# during the preview. I've never used CreateCommand.

I haven't had a chance to dig into the analysis part of the code yet, but here's an amusing variable assignment that jumped out at me in `FST.Common/ComparisonData.cs`.

    compareMethodIDSerializationBackupDoNotMessWithThisVariableBecauseItsAnUglyHackAroundACrappyLimitiationOfTheFCLsCrappyXMLSerializer = comparisonID;

I love this type of thing in C# because it's such a terse and well designed language. It's easy to make something stand out.

A cult favorite of devs at my previous employer was to (ab)use the power of LINQ to try to create the longest single-line methods possible. I held the unofficial record for a while with a LINQ statement over 1200 chars. Innocent enough, since ReSharper can unwrap them into sane code, but lots of fun.

I love/hate it when someone writes a LINQ statement and ends with "look, one line".
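The kind of thing being joked about (a toy example; `data` and its properties are invented, nothing like this is in the codebase):

```csharp
// A "look, one line" LINQ statement: terse, but opaque in a code review.
var result = data.Where(x => x.IsValid).GroupBy(x => x.Locus).Select(g => new { Locus = g.Key, Mean = g.Average(x => x.Frequency) }).OrderByDescending(r => r.Mean).Take(10).ToList();

// ...which a tool like ReSharper can mechanically unwrap into a readable,
// multi-line query with intermediate variables.
```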

On first glance it uses very old frameworks (for example, ASP.NET WebForms) and there are 0 unit tests.

A concerning thing is that a dev database username and password was committed to multiple files, like this: https://github.com/propublica/nyc-dna-software/blob/master/F....

For this kind of project, given the stakes involved (e.g. a defendant's freedom), why would you want to use the brand-new framework-of-the-month that the cool kids are using?

I'm all for using the right tool for the job. WebForms code is impossible, or at least extremely difficult, to unit test.

The actual codebase itself may be pretty old so WebForms isn't a surprise - the SLN file indicates it was saved with Visual Studio 2010 (https://github.com/propublica/nyc-dna-software/blob/338a1b86...).

Possibly they shouldn't be using bug-ridden trash either; seems like a compromise ought to be possible.

Webforms are not trash, just very annoying to use as a developer.

For something contracted this is the norm. We used to send out estimates for writing tests and clients never wanted to pay for them. I've found projects just as buggy with tests and now I only write them for code that's getting heavily reused.

WebForms is likely a client request as well, but also keep in mind that as shitty as WebForms is, it lets you throw functionality together faster than almost anything else. Again, probably contractors.

The funny comments others mentioned are another clue that this was not developed internally. I miss that about being at a consulting company... The client is usually hiring you because they have no idea what they're doing, so rest assured nobody there will ever look at the code.

It seems that the source code for any computer program whose output is used as evidence in a court of law should be available for inspection.

The accompanying article/blog post linked in the Readme (https://www.propublica.org/article/federal-judge-unseals-new...) has good background & information about the history of this, as well as potential impact moving forward.

I work in DNA analysis and stuff like this makes me happy specifically because I know just how easy it is to have flawed analysis techniques written in software.

How long does this stuff typically take to run on a dataset?

That depends on the specific kind of analysis being done and on what size of dataset you want to run the analysis. Some analysis techniques filter data down to just a few MiB; others need as much as a few hundred GiB, for a single individual. Sometimes you do population analysis on that data, which can end up multiplying individual sample size by the size of the population.

Ok, what kind of ballpark time do each of those entail? You didn't give a single hard number.

I've done work in statistical genetics, but not in forensic analysis specifically, so take these rough estimates with a grain of salt. That said, forensic analysis generally involves identifying the origin of a tissue sample. If the sample is substantial enough for reliable genotyping, then once the genotyping is done the actual identification step should take less than a second in a well-optimized system, even to pick out the origin from every human genome in history (assuming such a database were available). It's just a matter of finding who has matching allelic variants, and ensuring that you match over enough common alleles with low linkage disequilibrium (shared-haplotype correlation) that the risk of a coincidental match is infinitesimal, even among individuals with high consanguinity. If you organized the genomes in a tree where each node split on a single allele, you could search in logarithmic time.
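The core arithmetic of that identification step is cheap. A toy random-match-probability calculation over independent loci might look like this (illustrative only; the frequencies are made up, and real forensic tools apply population-genetics corrections this sketch omits):

```csharp
using System;
using System.Linq;

class MatchProbability
{
    static void Main()
    {
        // Made-up genotype frequencies at a few loci assumed to be
        // independent (i.e., low linkage disequilibrium).
        double[] genotypeFrequencies = { 0.05, 0.08, 0.11, 0.06, 0.09 };

        // Under independence, the random match probability is simply
        // the product of the per-locus genotype frequencies -- which is
        // why more loci make a coincidental match vanishingly unlikely.
        double rmp = genotypeFrequencies.Aggregate(1.0, (p, f) => p * f);

        Console.WriteLine($"Random match probability: {rmp:E2}");
        Console.WriteLine($"Roughly 1 in {1.0 / rmp:N0}");
    }
}
```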

The situation is more difficult if you consider the possibility of having multiple contributors to a sample (mixtures), allele dropout, etc. I’ve heard some models/implementations take a day for MCMC sampling to calculate the probabilities. (I’m talking about forensic DNA specifically)

Are you talking from DNA extraction to customer-has-their-results? Or just specific analysis?

For a very high priority sample:

- DNA extraction can take a couple of days
- sequencing prep can take a few hours to a couple of days
- sequencing can take a couple of hours to a couple of days, depending on how much data you need
- quality control can take a few hours to a few days
- first-stage analysis software (just calling what DNA exists and giving it quality scores) can take a few hours to a couple of days, also depending on how much data you're analyzing

So from the start of extraction until you have DNA data you can analyze for a specific product is a week if you babysit it, up to four weeks if you worry more about cost efficiency. After that, it depends on what type of customer-end analysis you want to do and with what software. Industry-standard, university-level software can take anywhere from a few hours to a few months, again depending on how much data it's looking at.

Typically the turn-around time (we receive DNA to we've published results to the customer) is around 6 to 8 weeks for most of the analysis products my company sells.

> You didn't give a single hard number.

Considering end-to-end could take a few days (super high priority sample being baby-sat at every step of the way by experienced technicians) to a few months (statistical analysis on populations)... it's pretty hard to give a single number. It really depends on what analysis you want done.

There's a lot of room for improvement in this field of software!

> There's a lot of room for improvement in this field of software!

That's what I was thinking. If it takes a lot of CPU time, it seems that there is likely a huge amount of optimization that would be useful to the people using it.

> it's pretty hard to give a single number.

I wasn't looking for a single number that encapsulates everything; I was looking for at least one number, to see if the run time of these processes is a hindrance. What you have here is very interesting, thanks!

For the 6 to 8 week analysis, how many cores is that typically running on?

The 6-to-8-week analysis goes through many steps on many machines, so "how many cores" isn't directly answerable. All of our sold products target 6 to 8 weeks.

Think of it like an assembly line. How many robot arms does the assembly line need to produce something in 6 to 8 weeks? It depends on the type of object being produced; more intricate details, larger sizes, etc. might need to go through more robots, more arms. An assembly line for a complex item, such as a vehicle, doesn't put all of the materials in front of a single robot and give the robot more arms. It instead goes through many different robots with specialized purposes.

So if your factory produces many different things, some of those specialized robots can be generalized again for multiple products and shared among the things being produced: you might have one robot that spends 20% of its time producing a thousand units of product A, 75% of its time producing a hundred units of product B, and the other 5% of its time in maintenance.

It's the same for DNA analysis; it goes through many different software steps, each with varying requirements of I/O throughput, CPU speeds, memory size, etc. Each individual step could be (and generally is) run on different machines with different resources suited to a particular step. Sometimes those steps can be shared among different products. But different products will take different amounts of time to work through the analysis.

The machines running our largest analysis steps have 88 cores and 512GiB of RAM. But number of cores is not indicative of why it takes 6 to 8 weeks to process.

Is there a large dataset of anonymized DNA samples that we could try this tool against? Does 23andme do something like that? I hope they don't but it would be very useful for just this one case!

Amazingly, it looks like they still use Sanger sequencing, and a lot of the uncertainty will probably come from genotyping errors.

There is a lot of anonymized genotype data in the 1000 Genomes Project (http://www.internationalgenome.org/data), but the devil is really in the genotyping details with this kind of thing. It looks like in the validation project (https://www.documentcloud.org/documents/4113877-1-19-17-Exhi...), they went from biological samples to sequence data to analysis results, which is the right approach.

With this kind of low-throughput genotyping, so much depends on the accuracy of the genotyping method, which in turn depends on the lab, the protocol, the technicians, and sometimes even the weather. The software is not where I would start in worrying about forensic evidence based on this method, though it definitely could be a source of errors.

I'm not sure what you mean by "amazingly, it looks like they still use Sanger sequencing". Sanger is considered gold standard-- all HTS genome methods are validated against it.

For this task, the low throughput harms performance. There are genotyping methods which will measure tens of thousands of common variants for a couple of bucks. And the more genotypes you have, the surer you can be of a match, even given a slightly higher genotyping error rate.

I think you may be misunderstanding my point (it's not clear if you are an expert in this area; I have more than a passing knowledge of DNA). In every area I have worked in, Sanger sequencing is considered the gold standard and is used whenever the results are being used for medical purposes. That includes genotyping as well as verifying the results of WGS.

It's considered a critical QC step.

Throughput doesn't matter if you have false-positive or false negative errors that cause erroneous medical decisions.

I'm very familiar with Sanger sequencing. Base for base, it's one of the most accurate methods. It is by no means worthy of being considered a gold standard; it's just somewhat better. And the drawback is that dollar for dollar, it's pretty inefficient. You can genotype all the common variants using an Affymetrix gene chip, and base your inference on vastly more data. For forensic analysis, where the goal is to discriminate genomes, that would be much more accurate. Maybe go back and verify the most discriminatory/confirmatory loci with Sanger sequencing after that.

23andMe use a genotyping array, so they probably don't type STRs, only SNPs.

I just love the way they have a connection string with what seems to be a real database URI and an accompanying username and password.

Nested ternaries, ungodly amounts of repetition, ZERO tests, hardcoded sensitive info... this code wouldn't pass code review anywhere. I don't care when it was written; as long as it is used to send people to jail, it should be the best possible code humans can write. And it is so far from that.
