
Ask HN: Would you pay for a code similarity detection tool? - pka
I've been working on a proof-of-concept code similarity detection tool.

The tool is based on matching semantically equivalent code fragments, i.e. it detects similarities across inlined or lifted functions, reordered but equivalent expressions and so on.

Check out the attached screenshot [1].

The idea is to mine the publicly available npm repos and provide a paid service for detecting similar code fragments already implemented in npm libraries -- basically an open REST service and one or two plugins for the most popular editors (vim, Sublime, Atom?)

If this proves successful, I would extend the service to other languages and services, where applicable.

Would you pay for such a service? What kind of features would you expect?

[1] https://dl.dropboxusercontent.com/u/30225560/ase.png
======
delinka
Commercial consumption of this idea is around verifying licensing. My employer
runs a tool against our internal repositories looking for code it's aware of
on the internet. When it flags a match, a human looks at our use of the
publicly available code and verifies our compliance with the license that came
with the code.

I, a software developer, wouldn't buy such a thing, but there's certainly
enterprise/corporate demand for it.

~~~
pka
I haven't thought about enterprise; it may be a direction worth pursuing,
thanks!

------
chei0aiV
Some exist already:

[http://www.harukizaemon.com/simian/](http://www.harukizaemon.com/simian/)
[http://dickgrune.com/Programs/similarity_tester/](http://dickgrune.com/Programs/similarity_tester/)
[https://github.com/silviocesare/Clonewise](https://github.com/silviocesare/Clonewise)

~~~
elij
I agree -- here's another:
[https://www.blackducksoftware.com/products/black-duck-suite/...](https://www.blackducksoftware.com/products/black-duck-suite/protex)

------
dogma1138
Well, there are plagiarism detection tools that educational institutions use
to detect cheating in CS classes.

There are several machine learning solutions that can look for your IP in
either source code or binary form even on an abstract (algorithm) level.

Many tools like that exist; the real question is: what exactly is your target
audience?

If you are scanning public repos, what service does the tool actually
provide? And no, detecting code similarity isn't the answer here.

Selling something isn't really an issue as long as it has a clear purpose,
which I don't think your idea has at this moment in time.

~~~
pka
Many of the plagiarism detection tools don't work on a semantic level. There
are some that do, but I haven't found an easy, it-just-works online solution.

Could you share links to the IP detection ML tools?

And to answer your question, please read my answer here [1].

[1]
[https://news.ycombinator.com/item?id=9954879](https://news.ycombinator.com/item?id=9954879)

~~~
dogma1138
I've been shown demos of those solutions by some consultancy firms; I'll have
to dig them up.

But still what is the use case?

I mean, if I wrote code which is functioning, why would I replace it with
someone else's code?

Taking in raw code, even (especially) from OSS repos, is a huge can of
worms.

Say you are developing a product: if you use an OSS library which is
distributed as is, you can bundle it with your product under most licenses,
unless you modify it, without having to worry about anything.

If you copy-paste code from that library into your own code base, well, then
what is it? A copyright violation? A derivative work? I can easily bundle
OpenSSL with their Apache license and just make a remark about it somewhere
during the installation; I don't have to distribute my software or source code
under that license or under any other OSS license.

But if I take the raw source code, say their ASN.1 parser, and implement it in
my own program? What now? I'm not an IP lawyer, but I'm pretty sure this either
violates the license outright or my software has now become a derivative
work, which means the terms (or some of them) of the original license now
apply.

Even if my software was meant to be OSS it's still an issue: maybe I don't want
to use Apache or BSD, maybe I want GPLv2 or v3 or my own license or whatever.

The other issue that stands out to me is that I already wrote functioning
code. It works, it's mine, I know it, I understand it, I can maintain it. Why
should I on-board someone else's code that I won't know, won't understand, and
won't be able to maintain as easily? Where is the benefit in that if I've
already written all or most of my code by the time your similarity score
triggers a suggestion?

------
laumars
Personally no.

Sometimes I deliberately choose not to abstract away simple logic (as in your
screenshot) for performance reasons, or just to reduce a complex dependency
chain (I'm all for code reuse, but I do also believe in a balance when writing
portable code). And the instances where the required logic is more complex,
I'd know to be looking for a module before writing my own code.

Because of these reasons, I couldn't even see myself using this tool if it was
free.

However, this does sound like an interesting project and I think you should
still proceed with it regardless of my feedback as, even if it doesn't become
a profitable exercise, I could see this becoming a future must-have feature
for IDEs, e.g. for code refactoring. In fact, maybe you could extend this tool
to analyse repeated code within a project and suggest abstracting that out to
a function (that's a tool I probably _would_ use on larger code bases!)

------
sakopov
Would you pay for it yourself? I'm struggling to see what a normal developer
would use it for. I think your main customer will be high schools, colleges
and universities. Not so much professional developers.

~~~
pka
I would, yes - it's a form of scratching my own itch.

I don't know about you, but I find myself writing boilerplate code very often.
Converting between timezones, reading config files, making HTTP requests
(+error handling), UI idioms (click & disable), etc.

Often the code handling such things is spread over a bigger function, like
open a file on line 5, loop over lines and read into array on lines 24, 25,
26, close file in finally {} clause.

What I want is a tool telling me "hey, you can replace these lines by function
X in open-source package Y." Maybe I overestimate its universal usefulness
though :)

------
tptacek
There's a YC company doing something very much like this, but against compiled
binaries.

~~~
pka
Would you mind sharing which, or haven't they gone public yet?

~~~
NateLawson
I'm the founder of SourceDNA, which tptacek is referring to.

[https://sourcedna.com/](https://sourcedna.com/)

~~~
JoachimSchipper
You've got YC investment now? Congratulations!

------
peterjmag
Mirror, in case the OP's Dropbox account hits its bandwidth limit:
[http://i.imgur.com/ciEUQth.png](http://i.imgur.com/ciEUQth.png)

------
chvid
Your left-hand example works on "i" as a global variable - it has additional
effects compared to the example on the right.

Tools like this are common in the Java world, as Java and similar languages
are easier to do this kind of analysis on.

IntelliJ IDEA is the long-reigning champ in this area. (Which is a product I
would pay for.)

~~~
pka
You are right - sorry about that. The example was put together in 5mins and
was meant to just convey the general idea.

I'll have to check out IDEA again, thanks!

------
danpalmer
(I work mostly on Python, so this is my opinion based on that)

Most of the code that I write that would be duplicating the functionality in a
library would be doing so because we don't need the extra functionality. For
example, I need to pluralise a small set of words, so I write a 3-4 clause
if-statement and append some "s" characters because I don't see the need to use
python-inflection. The latter is massively more complex, so unlikely to be
detected as the same thing.

Sure, there might be a few matches, but I suspect they will mostly be helper
functions within libraries, rather than the public API of libraries.

I would prefer not to send code to a web service for detection, although not
totally against it. In many companies this would not be allowed, either
through policy, firewalls, exfiltration detection systems, or lack of internet
on development machines (I know people who work, or have worked myself, in all
of these situations).

Something I think would be far more valuable, and possibly more realistic as
well, is local detection that highlights possibly duplicated code in a
codebase. I find little snippets (1-2 lines) that have been duplicated on a
fairly regular basis, and if I could identify those to be extracted out into
re-usable methods, that would be amazing.
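
As a rough idea of what that local detection could look like, here is a minimal Python sketch. It is only an illustration under simple assumptions: it matches exact text after stripping indentation, and the function name `find_duplicate_snippets` and the sample code are made up for this example, not taken from any existing tool.

```python
from collections import defaultdict


def find_duplicate_snippets(source, window=2):
    """Map each run of `window` consecutive non-blank lines that occurs more
    than once to the 1-based line numbers where each occurrence starts."""
    # Keep (line number, stripped text) pairs, skipping blank lines.
    lines = [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if line.strip()
    ]
    seen = defaultdict(list)
    for i in range(len(lines) - window + 1):
        chunk = lines[i:i + window]
        snippet = "\n".join(text for _, text in chunk)
        seen[snippet].append(chunk[0][0])
    # Only snippets that start in more than one place are duplicates.
    return {snippet: starts for snippet, starts in seen.items() if len(starts) > 1}


# Hypothetical sample: two helpers sharing the same two-line idiom.
code = "\n".join([
    "def load_users(path):",
    "    with open(path) as f:",
    "        data = json.load(f)",
    '    return data["users"]',
    "",
    "def load_orders(path):",
    "    with open(path) as f:",
    "        data = json.load(f)",
    '    return data["orders"]',
])

duplicates = find_duplicate_snippets(code)
for snippet, starts in duplicates.items():
    print(starts, repr(snippet))
```

Something this naive would miss duplicates that differ only in variable names, which is where a smarter comparison would earn its keep.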

Would I pay money for it? Probably not, it's not that much of a problem, and
all those sorts of tools are usually open source anyway. Unfortunately that's
my expectation now.

~~~
pka
All valid concerns.

I guess the best way to address the usefulness issue is to run it against
popular frameworks, like React or Angular, and see what happens.

If it proves useful (like say 10% of code could be replaced by existing
functions), would you change your mind?

------
lukaslalinsky
Tools are very hard to sell. Especially very specialized tools. And tools for
programmers often come with the end-user price of 0, so justifying anything
above that is hard.

For me personally, I don't see the need to even use such a tool, let alone
paying for it. Many programmers seem to focus on code, but in my experience,
that's usually not the problem you have when things go bad.

------
taspeotis
Your example looks similar to what ReSharper does:
[http://blog.jetbrains.com/dotnet/2009/12/11/resharper-50-pre...](http://blog.jetbrains.com/dotnet/2009/12/11/resharper-50-preview-loops-2-linq/)

~~~
deejbee
I think Visual Studio does it too with its Analyse Solution for Code Clones.
It's an Ultimate edition feature though.

------
pka
Clickable link for the image:
[https://dl.dropboxusercontent.com/u/30225560/ase.png](https://dl.dropboxusercontent.com/u/30225560/ase.png)

------
morenoh149
I was just reading about plagiarism detection recently
[http://theory.stanford.edu/~aiken/moss/](http://theory.stanford.edu/~aiken/moss/)

How do your techniques compare to Winnowing?
[http://theory.stanford.edu/~aiken/publications/papers/sigmod...](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf)

~~~
pka
It works on a semantic level (i.e. what the code actually means), rather than
fingerprinting strings. This means that reordering code segments, renaming
variables, inlining or lifting functions wouldn't affect a match, if the code
is semantically equivalent.
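
To make the contrast concrete, here is a simplified Python sketch of text fingerprinting in the spirit of winnowing. It is not the tool described here and not MOSS itself: the parameters `k` and `w`, the helper names, and the sample snippets are assumptions chosen for illustration, and real tools may normalise tokens before hashing.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every k-character substring of the whitespace-stripped text."""
    s = "".join(text.split())
    return [
        int(hashlib.md5(s[i:i + k].encode()).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Keep the minimum hash from each window of w consecutive k-gram hashes."""
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for i in range(max(len(hashes) - w + 1, 1)):
        window = hashes[i:i + w]
        if window:
            fingerprints.add(min(window))
    return fingerprints


def jaccard(a, b):
    """Overlap between two fingerprint sets (1.0 means identical sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0


original = "total = 0\nfor x in items:\n    total = total + x"
reformatted = "total=0\nfor x in items:\n        total = total+x"
renamed = "acc = 0\nfor item in values:\n    acc = acc + item"

# Whitespace changes leave the fingerprints untouched...
print(jaccard(winnow(original), winnow(reformatted)))
# ...but renaming variables changes the text and therefore the fingerprints.
print(jaccard(winnow(original), winnow(renamed)))
```

A comparison that works on meaning rather than on characters would treat `original` and `renamed` as the same fragment.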

~~~
morenoh149
cool. Sounds more robust.

------
zamalek
> basically an open REST service

Be careful with that: it restricts your customers to within a certain
light-millisecond distance of your server, if the plugin is intended to be
real-time.

~~~
cat9
If it turns out to be at all useful with regard to maintaining revenue-
generating code in production, $25 per year is a laughably low ask, even if
it's only useful some of the time for some of the people.

Figuring out which people and when is, generically, a matter of doing your
homework with customer development and targeted marketing, and then further
improving your marketing surface as you learn more about who is a customer and
how to find them.

Going from "this might be useful" to a working business model is, of course, a
deeply nontrivial problem...but it's one that many people have solved before,
and it's a heck of a lot easier if you don't start with "maybe $25 per year."

~~~
zamalek
For some reason my brain added in an estimated cost to the OP. You're right:
it's outside the scope of what is being asked. Removed the pricing stuff.

------
ganarajpr
There is a project called jsinspect which does what you are talking about. I
don't know if yours would be a bit more robust than this one, but I just
wanted to provide this so you know what you are competing with.

[https://github.com/danielstjules/jsinspect](https://github.com/danielstjules/jsinspect)

~~~
pka
Thanks, I've seen jsinspect. It works by comparing ASTs, which is probably
better than matching strings, but I guess it wouldn't be able to deal with
inlined functions or reordered code segments etc.
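
A small sketch of that limitation, using Python's standard `ast` module as a stand-in. This is not how jsinspect is implemented (it works on JavaScript); the `shape` helper and the sample expressions are hypothetical and only meant to show that a plain structural comparison survives renaming but not reordering.

```python
import ast


class _RenameVariables(ast.NodeTransformer):
    """Replace every variable name with a placeholder numbered by first use."""

    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        placeholder = self.names.setdefault(node.id, f"v{len(self.names)}")
        return ast.copy_location(ast.Name(id=placeholder, ctx=node.ctx), node)


def shape(source):
    """Return a string describing the AST of `source` with names normalised."""
    tree = _RenameVariables().visit(ast.parse(source))
    return ast.dump(tree)


a = "total = price * qty + tax"
b = "t = p * q + x"              # same structure, different names
c = "total = tax + price * qty"  # equivalent for numbers, but reordered

print(shape(a) == shape(b))  # renaming is absorbed by the normalisation
print(shape(a) == shape(c))  # reordering produces a different tree
```

Catching the third case needs some notion of equivalence beyond tree shape, for example canonicalising commutative operations before comparing.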

------
benjamincburns
Would love to see this integrated into an IDE such that it is capable of
detecting whether or not what you're typing (or something close to it) already
exists. For companies with large codebases this is a major enough problem that
a decent solution would have a compelling enough ROI to shell out some cash.

------
self_awareness
I'm not sure if I would buy it, but it could be a very nice way of introducing
someone to a new language/framework. When learning a new environment, one
doesn't know all the functions and it's very easy to write code that is
already implemented.

So, maybe try asking someone who is into programming training programs.

------
crispy2000
McCabe software has a tool that's supposed to find duplicate code within a
code base (e.g. cut-n-pasted functions), using path analysis. They have been
around for ages, but I've never used them since they're rather expensive.

------
alvatar
As a professor, I always missed an easy-to-use and modern tool for that. I
didn't research this topic much, but not finding anything easy and ready to
use is probably a market opportunity (if there _is_ a market).

------
voberoi
Have you looked at Code Climate?

[https://codeclimate.com/](https://codeclimate.com/)

~~~
pka
No, thanks for the link!

Code Climate looks more like a linter though?

------
gull
How did you choose to work on this idea?

~~~
pka
Basically by wanting to make my life easier when working on bigger, hard-to-
maintain codebases :)

------
mkolosick
How do you determine semantically equivalent code fragments? Is it a dynamic
solution or a static solution?

------
esusatyo
Personally, no. Even if it was an IDE feature, I would rather just learn it if
I use it frequently.

------
angvp
I would not; stuff like that exists for free, and I rarely use that kind of
tool.

------
willvarfar
Generally, no. There is no money to be made making tools.

