
When Data Do Not Conform to Rows and Columns: Diagnosing Disease by T-Cells - jostmey
https://github.com/jostmey/dkm
======
jostmey
Datasets of T-cell receptors cannot be arranged as numbers in rows and
columns; we call data like this non-conforming features. Forcing ourselves to
cope with non-conforming features led us to develop Dynamic Kernel Matching, a
new approach for classifying complex, non-conforming data. We think this
approach may be useful for other problems.

~~~
inetknght
In your README, you state:

> _we implement the Needleman-Wunsch algorithm in TensorFlow._

What sort of performance do you get with your implementation on TensorFlow vs
other implementations of Needleman-Wunsch?

What scores did you use for gap opens and extensions?

~~~
jostmey
> What sort of performance do you get with TensorFlow vs other implementations
> of Needleman-Wunsch?

TensorFlow runs > 10,000 instances of the Needleman-Wunsch algorithm in
parallel, but I believe each instance of the algorithm runs serially. This is
very efficient if all the sequences are roughly the same length, allowing for
millions of sequences to be processed each second.

Here's my implementation of the Needleman-Wunsch algorithm:
[https://github.com/jostmey/dkm/blob/master/antigen-classific...](https://github.com/jostmey/dkm/blob/master/antigen-classification-problem/model/alignment_score.py)

> What scores did you use for gap opens and extensions?

Another great question! There are two gap penalties: one for skipping weights
and one for skipping features. If I know there are always more features than
weights, then I set the penalty for skipping a feature to zero. This does not
penalize the model for having more features than weights. I then set the
penalty for skipping weights to a value approximating negative infinity. This
ensures every weight is used and is not wasted.
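
To make the penalty scheme concrete, here is a minimal sketch in plain Python. It is not the code from the repository: the function name `alignment_score`, the dot-product similarity, and the constant `NEG_INF` are assumptions chosen for illustration. Penalties are written as values added to the score, so a penalty of "negative infinity" is a very large negative number.

```python
# Illustrative sketch only; names and the similarity function are assumptions.
NEG_INF = -1e9  # stand-in for "approximately negative infinity"


def alignment_score(features, weights, skip_feature=0.0, skip_weight=NEG_INF):
    """Needleman-Wunsch-style maximum alignment of features to weights.

    features: list of feature vectors (one per sequence position)
    weights:  list of weight vectors (one per kernel position)
    """
    n, m = len(features), len(weights)
    # score[i][j] = best score aligning the first i features
    # with the first j weights.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + skip_feature
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + skip_weight
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Similarity of feature i and weight j (assumed: dot product).
            match = sum(f * w for f, w in zip(features[i - 1], weights[j - 1]))
            score[i][j] = max(
                score[i - 1][j - 1] + match,     # pair feature i with weight j
                score[i - 1][j] + skip_feature,  # skip feature i (free)
                score[i][j - 1] + skip_weight,   # skip weight j (prohibitive)
            )
    return score[n][m]


# Four features, two weights: the best in-order pairing uses 1.0 and 3.0.
features = [[1.0], [-2.0], [3.0], [0.5]]
weights = [[1.0], [1.0]]
best = alignment_score(features, weights)
```

With these settings, extra features cost nothing, while any path that leaves a weight unused scores around -1e9 and never wins the maximum. Setting `skip_weight=0.0` instead would let the model drop a weight whenever using it lowers the score.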

------
lmeyerov
How would this compare to common types of graph neural nets?

Afaict many use random walks (=traces) to learn their model, except they are
typically then used for a classifier over individual nodes/edges instead of
paths/graphs. I'm not sure of the natural formulation to reuse them here.
Likewise, it is unclear if node reuse on a path here is meant to be
meaningful, which would also seem to change natural encodings.

~~~
jostmey
I've only just started looking into classifying graphs. I'm not really
qualified to answer this question, but I will try anyway. Let's assume we are
talking about undirected graphs using deep learning jargon.

DKM would not be a random walk through the graph. Features would be assigned
to weights by graph alignment, which is an NP-hard problem. There are
probably multiple ways to define graph alignment, so you have to pick an
alignment strategy that makes sense given the problem. For example, we may
want to reuse the same weights whenever the graph forks, ensuring each branch
is treated the same way. Once we have assigned features to weights, we can
classify the graph with only a single neuron. We can also use a deep model, if
we choose.

Some graph neural networks reduce the graph in a feed-forward manner. These
models, which can only represent directed interactions between nodes, are
inappropriate for undirected graphs.

Some graph neural networks require multiple neurons to represent the graph, in
contrast to DKM, which can use only a single neuron.

There are types of graph neural networks that I have yet to understand, so I
cannot compare these models to DKM.

Feel free to email
([https://news.ycombinator.com/user?id=jostmey](https://news.ycombinator.com/user?id=jostmey))
if you want to discuss this more. I would be happy to provide insight if you
want to implement a graph-DKM model.

------
spenvo
If I understand this, this model (dkm, "dynamic kernel matching") is to
sequences as convolutional networks are to images?

~~~
jostmey
Yeah, that's right. Just as a convolutional network picks a patch from an
image using a max-pooling operation, dynamic kernel matching picks symbols in
a sequence using a max-alignment operation.

------
jacques_chester
> _Datasets of T-cell receptors cannot be arranged as numbers into rows and
> columns_

Is this true? The title seems to imply that T-cell receptor data can't be
represented with sets or manipulated with relational algebra. I'm a little
skeptical.

~~~
jostmey
EDIT: I think I see the confusion. This project is about statistical
classification, which generally assumes the data is in rows and columns. It is
about statistical classification _when_ data is not in rows and columns. Of
course the data can be represented as a set of sequences. The challenge is how
to build statistical classifiers for that, which is the scope of this project.

> The title seems to imply that T-cell receptor data can't be represented with
> sets

There are two datasets, one containing labelled sequences and the other
containing labelled sets of sequences. The problem is that the number of
features is irregular, resulting in some rows with more columns than others.
Also, the information in each column does not line up. It's like you have
patient age represented by a column, but because the number of features is
irregular, suddenly that column switches to the patient's weight.
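
A small sketch of the misalignment, using made-up sequences rather than real receptor data (the sequences and the motif `SSLG` are hypothetical): padding ragged rows into a table gives every row the same width, but the same motif still lands in a different column in each row.

```python
# Made-up sequences for illustration; not taken from the datasets.
sequences = ["CASSLGQF", "CAASSLGQYF", "SSLGQF"]
motif = "SSLG"

# Rows have different lengths, so a table needs padding.
row_lengths = [len(s) for s in sequences]
width = max(row_lengths)
table = [list(s.ljust(width, "-")) for s in sequences]

# The shared motif starts in a different column of each padded row,
# so no single column means the same thing across rows.
motif_columns = [s.find(motif) for s in sequences]

for row, col in zip(table, motif_columns):
    print("".join(row), "motif starts at column", col)
```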

> manipulated with relational algebra

Can you be more specific? I'm not sure what you mean.

~~~
Retric
/Pedantic

That’s a common issue for many datasets. You can map that just fine in a
relational DB either using Nulls on some columns or breaking that column data
into a separate table. EX: (CustomerID: 50, ColumnID : 6852, Value : “Mike”)

~~~
tomnipotent
Except this isn't how statistical methods on vectors and matrices work.

In an ML data set, the value "Mike" may actually be one-hot encoded to one of
500 columns (one column value is 1, everything else is 0) - because you have
500 different names in your dataset, so each VALUE gets its own column.
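
As a rough sketch of that encoding (the helper `one_hot` and the sample names are hypothetical, and a real dataset would have hundreds of columns rather than three):

```python
def one_hot(values):
    """Give each distinct value its own column; mark it with a 1."""
    vocabulary = sorted(set(values))
    column = {value: index for index, value in enumerate(vocabulary)}
    rows = []
    for value in values:
        row = [0] * len(vocabulary)
        row[column[value]] = 1
        rows.append(row)
    return vocabulary, rows


# Made-up sample data: three distinct names give three columns.
names = ["Mike", "Ana", "Mike", "Zoe"]
vocabulary, encoded = one_hot(names)
```

Each row has exactly one 1, and its position depends on which names exist in the whole dataset, not on a fixed schema.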

It's a very different problem/solution than NULLs in databases.

~~~
Retric
I was referring to jacques_chester’s comment. You’re bringing up a related and
still well-explored topic, as this situation is very common.
[https://link.springer.com/article/10.1007%2Fs00180-013-0468-...](https://link.springer.com/article/10.1007%2Fs00180-013-0468-8)

