
Show HN: What parts of your code are not original - nunobrito
Hello,

Are you curious to discover which snippets of your code were copied from Stackoverflow?

Where else on the Internet are those icons that you copied a long time ago?

Or simply to discover which licenses apply to the open source in your code?

There's an app for that: http://triplecheck.net/quantum/

Development of this tooling took over two years, we archived over 630Tb of open source data around the web. Some sources of data have gone offline in the meanwhile but we kept a copy for posterity.

Things to consider:
   - Stackoverflow snippet detection is limited to Java at this moment
   - However, snippet detection works for mainstream languages in other repositories (sourceforge, github, googlecode, etc)
   - app is command line based (our UX skills suck), you need java installed
   - please let me know if pricing is too high or too low. We are bootstrapped, since there is no VC then pricing == survival
   - bugs will happen. Early edition, my apologies in advance for any bugs that surface
   - privacy NOT guaranteed. I don't store your code, only fingerprints are sent to the server and these are NOT stored after scan is concluded. However, your data will be captured by network providers. Please don't scan critical code, there's a secure offline app. Details at http://triplecheck.net/what-we-do.html
   - more than 300 open source licenses are detected

If the tool helped you: please retweet, upvote or just share your feedback and tips on how to make this grow from here. From one engineer to another: My personal thanks, I mean it.

-- Nuno
======
osivertsson
Cool!

And thanks for being up front about "privacy NOT guaranteed". Such a statement
builds trust for me as an engineer.

The low pricing puts this in range for me if I would do independent software
development and outsource part of the work. I would want to make sure I don't
pay for development and somebody just copy-pastes some GPL code. That could
also spell trouble the day I might want to be acquired by some big player and
the problem surfaces in their audit.

I'm passionate about software freedom. I can really see the usefulness of a
tool such as this being part of continuous integration to keep developers
honest, especially if some part of development is outsourced to big software
factories.

But I'm having a hard time convincing project managers etc. about the
importance of license compliance. If you can help me with this I might be able
to sell in a tool like this.

~~~
nunobrito
Indeed. License compliance is often kept very secret; old-style management
does not really want to hear that they are using GPL. When the company gets
acquired, it is a shock to discover otherwise. It would be better to
understand what GPL is all about, rather than hiding. Last year we had a
company in Germany whose acquisition failed because they were not respecting
the copyleft licenses.

Would appreciate your help. How can we get in contact? There is a contact form
on our website if you wish. Thank you! :-)

~~~
osivertsson
What makes me really sad is organizations where management sort of
understands, but argues things like:

    
    
      * we have never cared about this since we started way back and it has worked fine!
      * it will cost us time and money, let's ignore it.
      * how could someone ever find out?
    

That said, having a tool that can clearly point to problems could be a big
help when management changes and someone sympathetic to uncovering and fixing
license issues comes in.

(Contact details are in my profile.)

------
pollen23
This gives me severe flashbacks to clueless clients/PMs. We had projects we
had to run through Black Duck -- "No open source code."

The reference implementation of the Mersenne Twister was once GPL, although it
wasn't anymore at that time. Still, there are only so many ways you can
implement a Mersenne Twister. So my implementation got flagged.

~~~
sdevlin
Why did you need to write an implementation of Mersenne Twister?

~~~
bpicolo
"No open source code."

~~~
Kluny
Yeah, we got that. But why did you need a Mersenne Twister?

~~~
tbabb
My understanding is that Mersenne Twister is kind of a gold standard for non-
cryptographic PRNGs. Linear congruential generators are known to be rather
poor, and only ideal where speed is critical and the quality doesn't matter.
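
A small Python sketch of the weakness mentioned above, assuming a textbook linear congruential generator with a power-of-two modulus (the constants and the `lcg` and `low_bits` helpers are illustrative choices, not taken from any project in this thread). With an odd multiplier and an odd increment the lowest bit just alternates; Python's `random` module, which is documented to use Mersenne Twister as its core generator, is shown next to it for comparison.

```python
import random


def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Yield an endless stream of values from a linear congruential generator."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state


def low_bits(values, count):
    """Collect the lowest bit of the first `count` values of an iterator."""
    return [next(values) & 1 for _ in range(count)]


# Odd multiplier, odd increment, power-of-two modulus:
# the lowest bit flips on every step, so it has period 2.
lcg_bits = low_bits(lcg(seed=42), 16)

# Python's random module uses Mersenne Twister as its core generator.
rng = random.Random(42)
mt_bits = [rng.getrandbits(32) & 1 for _ in range(16)]

print("LCG low bits:", lcg_bits)
print("MT  low bits:", mt_bits)
```

Real LCG-based library functions often return only the high bits to hide this pattern, so the sketch shows the raw generator rather than any particular `rand()` implementation.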

------
Too
Could you make an example report for some well-known public project taken
from github? Your current example report just includes a screenshot of some
super generic graphs that don't mean anything.

Look at what viva64 is doing to promote their static analysis tool, they
analyze open source repos all the time and write about the results:
[http://www.viva64.com/en/b/0366/](http://www.viva64.com/en/b/0366/) That
gives me confidence that the product can actually find issues that might have
relevance for me.

~~~
nunobrito
Thank you Too, very helpful feedback. Will get into it. I've added an example
on the page, direct download link is
[http://triplecheck.net/download/example.zip](http://triplecheck.net/download/example.zip)

------
FigBug
Ok. I've downloaded it. Double clicking doesn't seem to run it, but if I open
the jar from the command line, it will run. (OS X Mavericks)

Then I go to get an API key, so I sign up for mashape.com? Is this service
somehow related to triplecheck.net? I need an API key, but nowhere do I see
one.

I think your startup process needs to be simplified or better documented.

Edit: Ok, found the key. Now I need to put my code in the application bundle?
Why can't I just select a folder?

Edit2: Apparently that wasn't the key or the app couldn't access the internet.
Maybe no NSAppTransportSecurity in the .plist

~~~
nunobrito
Thank you for the feedback, very useful to see how a first time user runs the
app.

> so I sign up for mashape.com? | Answer: Yes. Sorry about that. I think it
> makes sense as next step to run our own API management.

> put my code in the application bundle? Why can't I just select a folder? |
> Answer: Look for "settings.xml", there you can change to another folder

> Apparently that wasn't the key or the app couldn't access the internet. |
> Answer: Can you try running the jar from command line? The mac version needs
> to be fixed, the jar edition should work:
> [http://triplecheck.net/download/quantum.zip](http://triplecheck.net/download/quantum.zip)

To see the UI: java -jar quantum.jar

To run from command line: java -jar quantum.jar scan

The API key should be inside settings.xml too. If you typed the wrong key, you
can replace it there or just delete settings.xml to reset the app. Good luck,
please let me know if this worked. Thanks.

------
tyingq
"Everything is a remix" is an interesting take on this sort of thing[1]

[1] [http://www.npr.org/2014/06/27/322910178/is-everything-a-
remi...](http://www.npr.org/2014/06/27/322910178/is-everything-a-remix)

While it's specific to music, the concept certainly applies to anything
creative.

~~~
kuschku
Actually, there are three more parts to it, discussing movies, computers, etc.

------
teamhappy
I've always wondered what the chances of two people writing the same chunk of
code are.

~~~
nunobrito
This would need some investigation work to get hard data. From personal
experience, I would say there are many cases where a given piece of code can
only be written in one way. In other cases, people simply copied the code many
years ago and no longer know where it came from.

For me, more relevant is to compare the variable names. If both code snippets
have very similar variable names, then one of them is likely a copy.
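
A minimal sketch of that variable-name comparison, assuming Python snippets and a plain Jaccard overlap of identifier sets. The `identifiers` and `name_overlap` helpers are hypothetical names made up for this example, not part of triplecheck.

```python
import keyword
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifiers(source):
    """Return the set of non-keyword identifiers found in a Python snippet.

    This is a crude regex scan, so words inside strings and comments count too.
    """
    return {name for name in IDENTIFIER.findall(source)
            if not keyword.iskeyword(name)}


def name_overlap(left, right):
    """Jaccard similarity of the identifier sets of two snippets (0.0 to 1.0)."""
    a, b = identifiers(left), identifiers(right)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "def total_price(items):\n    return sum(item.cost for item in items)\n"
copied = "def total_price(items):\n    return sum([item.cost for item in items])\n"
rewritten = "def add_up(xs):\n    return sum(x.value for x in xs)\n"

print(name_overlap(original, copied))     # same names, slightly different code
print(name_overlap(original, rewritten))  # same logic, different names
```

A high overlap on its own proves nothing for very short snippets, since names like `i` or `result` show up everywhere; it is more of a hint for where to look by hand.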

~~~
osivertsson
I've seen cases where names were changed, but comments retained with the same
typos as in the GPL package...

~~~
jgh
I've found a couple of Chinese companies offering compiled versions of my open
source code before...They would change the class names and stuff but keep all
the interface methods the same so it was pretty easy to figure out what they
had done.

------
nickpsecurity
"what parts of your code are not original?"

Let's start with the syntax, constructs, compilers, linkers, and use of
assembler. Very little that's original any more. That's fine as most science
and tech that's worth a shit is an increment on some prior development.
Evolutionary, not revolutionary.

------
amelius
But what happens if somebody uploads parts of your code to stackoverflow?

~~~
hueving
Stackoverflow now owns it and you will have to comment your code with a link
to the stackoverflow post. You will also have to send stackoverflow 1% of your
salary.

~~~
746F7475
I hope this is a joke

~~~
fgandiya
The part about 1% salary is.

[http://meta.stackexchange.com/questions/272956/a-new-code-
li...](http://meta.stackexchange.com/questions/272956/a-new-code-license-the-
mit-this-time-with-attribution-required)

~~~
hellbanTHIS
I'm just going to pretend I didn't read that.

------
2ion
The API key dispensary mechanism you are using does reject throwaway,
anonymous email addresses from anonbox.net at registration time. Is that
intentional?

~~~
nunobrito
Not intentional at all, that comes by default on Mashape.com

Sorry about the hassle

------
sytelus
Do you detect only exact similarity? What if variable names or formatting are
changed? What if the code has been refactored quite a bit? Can you give more
details on the exact algorithm you use?

~~~
nunobrito
Different algorithms are used.

1) binary comparison. Without knowing what type of file we are matching, we
compare to other files and evaluate if the binary contents are similar (or
preferably 100% equal)

2) snippet matching. For mainstream languages (C, Java, Javascript, Python,
etc) we transform the code into anonymized blocks that don't care about
variable names, formatting or comment blocks. Then the code is compared for
similarity. A similarity as low as 80% still qualifies as a match.

To provide context, we have the concept of code diversity, meaning that a
given match needs to present a relatively high number of different logical
instructions in order to qualify as a match. For example, multiple IF
statements will not qualify unless they contain other code within. If you
change the order or add/remove code, we are still robust enough to detect it.
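
As a rough illustration of the idea (not the actual triplecheck algorithm, which is not published in this thread), the sketch below anonymizes Python source with the standard `tokenize` module, compares the token sequences with `difflib`, and applies a simple diversity check. The function names, the 0.8 threshold and the `min_distinct_tokens` value are all assumptions made for the example.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry no logic: comments, layout and end-of-file markers.
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def anonymize(source):
    """Reduce Python source to tokens that ignore names, literals, comments and layout."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")        # any variable or function name
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)  # keywords and operators are kept as written
    return tokens


def similarity(left, right):
    """Similarity ratio (0.0 to 1.0) between two anonymized token sequences."""
    matcher = difflib.SequenceMatcher(None, anonymize(left), anonymize(right),
                                      autojunk=False)
    return matcher.ratio()


def is_match(left, right, threshold=0.8, min_distinct_tokens=8):
    """Report a match only if the code is diverse enough and similar enough."""
    diverse = len(set(anonymize(left))) >= min_distinct_tokens
    return diverse and similarity(left, right) >= threshold


original = (
    "def mean(values):\n"
    "    # average of a list\n"
    "    total = 0\n"
    "    for v in values:\n"
    "        total += v\n"
    "    return total / len(values)\n"
)
renamed = (
    "def avg(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x   # accumulate\n"
    "    return s / len(xs)\n"
)
trivial = "if a:\n    pass\nif b:\n    pass\n"

print(is_match(original, renamed))  # renamed copy of the same function
print(is_match(trivial, trivial))   # identical, but only bare IF statements
```

Counting distinct token kinds is only one cheap stand-in for "code diversity"; a production tool would presumably work on larger blocks and on several languages, which this sketch does not attempt.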

For special cases where there is a known malicious intention of hiding the
code, I cross-match different algorithms and specifically look at variable
names and comments inside the code. In such cases, a manual inspection is done
by an expert, and it becomes truly difficult for a developer to escape the
detection of non-original code.

In fact, if someone is indeed able to hide code from triplecheck, then it has
reached a level of sophistication where no normal third-party developer would
be capable of (easily) detecting the plagiarism. In our experience, there have
been rare cases where only with new techniques did we notice that a given
company had managed to hide non-original code from our tooling.

In either case, we live and learn from such examples, and with each new
iteration of the tooling it gets more difficult to evade (non-)originality
detection.

