
Machine Learning-Based End-To-End CRISPR/Cas9 Guide Design - indescions_2018
https://www.crispr.ml/
======
j7ake
This is one situation where a "black box" method is pointless. When you
design guide RNAs, you most definitely want a transparent method that you can
reason about and debug.

~~~
nfusi
I'm one of the authors. Predictive models like ours are useful for sorting
through thousands of potential guides/sites and scoring them. In the paper
associated with the service
[https://www.nature.com/articles/s41551-017-0178-6](https://www.nature.com/articles/s41551-017-0178-6)
(pre-print here
[https://www.biorxiv.org/content/early/2016/10/05/078253](https://www.biorxiv.org/content/early/2016/10/05/078253))
we look closely at what the models are picking up and explain our findings in
the context of the current biological knowledge.

~~~
nonbel
>"We found that adding these same features from the CFD model further boosted
performance and so also included these. The final deployed model was trained
only on the Avana data (combining with Gecko did not increase cross-validation
performance)."
[https://www.biorxiv.org/content/early/2016/10/05/078253](https://www.biorxiv.org/content/early/2016/10/05/078253)

Sounds like you leaked info from the training data into validation/test data,
which will make you overfit and thus overstate the accuracy. I may have missed
it, but did you evaluate the skill of this model on a holdout dataset?

EDIT:

This link doesn't appear to work:

>"All source code and a front-end website for the cloud service will be made
available from
[http://research.microsoft.com/en-us/projects/crispr](http://research.microsoft.com/en-us/projects/crispr)
upon publication."

~~~
nfusi
No, there was no leakage. We trained on one dataset and evaluated on a
completely different one, then did the reverse to show that the model
generalized well irrespective of the training data (Figure 2). The decision of
which model to deploy was based on cross-validation over the Avana data. We
would have loved to have even more data, but generating data from this kind of
experiment is expensive and labor-intensive.
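
As a rough illustration of that train-on-one, test-on-the-other check, here is a minimal sketch. It uses synthetic stand-in data and a toy one-variable linear model, not the paper's data, features, or code; `dataset_a`, `dataset_b`, and every helper name are hypothetical.

```python
import random
from statistics import mean

def make_dataset(n, noise_sd, rng):
    """Synthetic (feature, activity) pairs sharing one trend: y = 2x + noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

def cross_dataset_score(train, test):
    """Fit on `train` only, then score predictions on the untouched `test` set."""
    slope, intercept = fit_line(*train)
    preds = [slope * x + intercept for x in test[0]]
    return pearson(preds, test[1])

rng = random.Random(1)
dataset_a = make_dataset(300, 0.5, rng)  # assumed: cleaner experiment
dataset_b = make_dataset(300, 0.8, rng)  # assumed: noisier experiment

a_to_b = cross_dataset_score(dataset_a, dataset_b)
b_to_a = cross_dataset_score(dataset_b, dataset_a)
print(f"train A -> test B: r = {a_to_b:.2f}")
print(f"train B -> test A: r = {b_to_a:.2f}")
```

If both directions score well, the fit is less likely to be an artifact of one particular dataset.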

EDIT: we will update the link, thanks. The correct link is
[https://www.microsoft.com/en-us/research/project/crispr/](https://www.microsoft.com/en-us/research/project/crispr/)

~~~
nonbel
If you run CV on a dataset, then change the features (or hyperparameters) and
run CV again, picking the best model, then you will overfit to the CV. This is
data leakage, and it will make you overly optimistic about your model's
performance on unseen data.

This is well known, and honestly it only takes working once with a real
holdout set (no cheating) to learn it for life. E.g.:
[https://datascience.stackexchange.com/questions/17288/why-k-...](https://datascience.stackexchange.com/questions/17288/why-k-fold-cross-validation-cv-overfits-or-why-discrepancy-occurs-between-cv)
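
The effect is easy to reproduce on pure noise. The sketch below is a toy setup, not anything from the paper: features and labels are independent coin flips, each candidate "model" uses a single feature, and all names are made up for illustration. Every candidate gets an honest 5-fold CV score, but picking the best of 200 of them makes the winning score look much better than chance, while a fresh holdout set tells the truth.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Features and labels are independent coin flips: nothing is predictive."""
    X = [[rng.choice((-1, 1)) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.choice((-1, 1)) for _ in range(n_samples)]
    return X, y

def fit_sign(X, y, j, idx):
    """'Train' a one-feature model: pick the sign that agrees with y most on idx."""
    agreement = sum(X[i][j] * y[i] for i in idx)
    return 1 if agreement >= 0 else -1

def accuracy(X, y, j, sign, idx):
    """Fraction of rows in idx where sign * feature j matches the label."""
    return sum(1 for i in idx if sign * X[i][j] == y[i]) / len(idx)

def cv_accuracy(X, y, j, k=5):
    """Plain k-fold cross-validation accuracy for the model built on feature j."""
    n = len(y)
    folds = [list(range(f, n, k)) for f in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        sign = fit_sign(X, y, j, train_idx)
        scores.append(accuracy(X, y, j, sign, test_idx))
    return sum(scores) / k

rng = random.Random(0)
n_features = 200
X_dev, y_dev = make_noise_data(100, n_features, rng)     # used for CV and selection
X_hold, y_hold = make_noise_data(1000, n_features, rng)  # never seen during selection

# Model selection: score every candidate by CV, keep the winner.
cv_scores = [cv_accuracy(X_dev, y_dev, j) for j in range(n_features)]
best_j = max(range(n_features), key=lambda j: cv_scores[j])
best_cv = cv_scores[best_j]
mean_cv = sum(cv_scores) / n_features

# Final check: refit the winner on all dev data, score once on the holdout.
final_sign = fit_sign(X_dev, y_dev, best_j, range(len(y_dev)))
holdout_acc = accuracy(X_hold, y_hold, best_j, final_sign, range(len(y_hold)))

print(f"best CV accuracy after selection: {best_cv:.2f}")
print(f"mean CV accuracy over candidates: {mean_cv:.2f}")
print(f"holdout accuracy of chosen model: {holdout_acc:.2f}")
```

Since the data is noise, the true accuracy of every candidate is 0.5. The gap between the selected CV score and the holdout score is the optimism introduced by the selection step alone.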

~~~
michaelhoffman
The final performance evaluation does not use cross-validation, but uses
totally held out validation data not used during model selection.

~~~
nonbel
Thanks, this is not at all clear from the pre-print. From the final paper it
does seem you are right, but the datasets and their usage could probably be a
bit clearer (e.g., include a table with that info).

------
egl2019
[https://www.microsoft.com/en-us/research/project/crispr/](https://www.microsoft.com/en-us/research/project/crispr/)

------
josephpmay
It would be nice if there were an explanation of what this is.

~~~
Kodix
Sounds like it's a machine-learning-based way to design the guide RNA for the
CRISPR protein used in gene editing, making it easier to target it the way you
want to.

But that's basically all I know. It really would be nice to have someone
actually knowledgeable describe this better.

~~~
nfusi
Yes, this is correct. We developed a series of ML models to predict 1) whether
a given guide RNA is likely to result in the knockdown of a gene, and 2)
whether a guide RNA is likely to produce unintended effects somewhere else in
the genome.

Also see
[https://blogs.microsoft.com/ai/crispr-gene-editing/](https://blogs.microsoft.com/ai/crispr-gene-editing/)

------
fullstackwebdev
more info? how do I use this at home? also is it broken?

