
DéjàVu: a map of code duplicates on GitHub - devilcius
https://blog.acolyer.org/2017/11/20/dejavu-a-map-of-code-duplicates-on-github/
======
hawski
I always like to see how some API is used in real projects. Sadly, GitHub
search is mostly useless for this because of the number of duplicates. Google
Code Search was great; it even supported regexps. Then there was koders.com,
and now there's also something from ohloh, which is better than GitHub AFAIR.

EDIT: ohloh became openhub and now the code search is discontinued. So there
is the nonfunctional GitHub search and an open niche for other projects...

~~~
sdesol
Disclaimer: I'm the founder of GitSense
([https://gitsense.com](https://gitsense.com)) that indexes code and Git
history.

Indexing and retrieving code at scale is actually a really challenging
problem, because there is a lot of code, on a lot of branches, in a lot of
forked repos. GitSense doesn't even try to determine the authoritative source
(repo/branch), since I personally think that's a lost cause given current AI
technology.

With GitSense, everything is context driven, which is how you can reasonably
remove duplication. To search, you have to define which branches/repositories
to consider, which can be a few or a few thousand. Note that once a search
context has been defined it can be reused, so it isn't something you have to
create every time you want to search.

I sort of envision a Yahoo-type approach (the first incarnation) to searching
for code. The basic idea is to provide a curated search experience, where
domain experts can share what they believe to be the relevant branches to
consider for a given problem.

Without some human intervention, I think duplication is a given and, as you
point out, can lead to useless results.
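To make the idea concrete, here's a rough sketch of context-driven, dedup'd
search (my own illustration, not GitSense's actual implementation; the data
shapes are hypothetical):

```python
import hashlib

def file_key(content: bytes) -> str:
    """Content-addressed key: identical files collapse to one key."""
    return hashlib.sha256(content).hexdigest()

def search(context, query):
    """Return one hit per unique file content within the context.

    `context` maps (repo, branch) -> {path: bytes}; a stand-in for
    whatever index a real code-search engine would use.
    """
    seen = set()
    hits = []
    for (repo, branch), files in context.items():
        for path, content in files.items():
            if query in content:
                key = file_key(content)
                if key in seen:  # same content already reported
                    continue
                seen.add(key)
                hits.append((repo, branch, path))
    return hits

context = {
    ("alice/app", "master"): {"util.js": b"function leftPad(s, n) {}"},
    ("bob/fork", "master"):  {"util.js": b"function leftPad(s, n) {}"},  # copy
}
print(len(search(context, b"leftPad")))  # 1: the fork's copy is suppressed
```

The point is that the dedup happens inside a user-chosen context, not
globally, so picking the "authoritative" repo is sidestepped.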

------
coding123
What really sucks is people committing node_modules, that's just plain wrong.

~~~
linkregister
... until LeftPad 2.0 [1] happens, then everyone will be mocking those who
_didn't_ commit node_modules.

1. [http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm](http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm)

~~~
keiyakins
What's wrong is that the package manager and repo manager aren't talking to
one another. "Keeping a copy of the dependency with your code" and "copy and
pasting the code" should not be the same thing.

~~~
linkregister
I agree; ideally artifacts should be stored and cached, e.g. in a container
image or as a tarball.

Other than a local npm mirror, what do you think is the best way to store
artifacts for new builds? Vendoring code is not a bad solution if it permits
new builds to occur even in the event of a package manager outage. I see
vendoring code as the simplest solution for small companies who don't want the
management overhead of running a local mirror of package managers.

That said, I'm very much open to learning a new technique for handling this
problem.
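For what it's worth, whichever storage you pick, verifying vendored tarballs
against recorded checksums catches tampering and drift. A minimal sketch in
the spirit of npm's lockfile `integrity` field ("sha512-" plus the base64
SHA-512 digest); the file names and data layout here are made up:

```python
import base64
import hashlib

def integrity(data: bytes) -> str:
    """Subresource-integrity-style string, as npm lockfiles record:
    "sha512-" + base64 of the SHA-512 digest of the tarball bytes."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")

def verify_vendored(tarballs: dict, lockfile: dict) -> list:
    """Return names of vendored tarballs whose checksum no longer
    matches the lockfile (hypothetical name -> bytes / name -> integrity
    mappings)."""
    return [name for name, data in tarballs.items()
            if integrity(data) != lockfile.get(name)]

data = b"fake tarball bytes"
lock = {"left-pad-1.0.0.tgz": integrity(data)}
print(verify_vendored({"left-pad-1.0.0.tgz": data}, lock))         # prints []
print(verify_vendored({"left-pad-1.0.0.tgz": b"tampered"}, lock))  # flags it
```

With something like this in CI, vendored code keeps the "builds survive a
registry outage" property without silently accepting modified dependencies.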

------
zbentley
Wow, GitHub could save a lot of storage space if they dedup'd across
projects/files explicitly, rather than storing Git repos, which is what I'm
assuming they do.

Even with a good deduping/compressing filesystem, the way git history is
stored means that they're probably missing out on a ton of savings here. Eh,
it's probably not worth the complexity/deviation from standard Git tooling.

~~~
dfox
GitHub uses their own storage backend which, I believe, shares objects across
all projects regardless of whether they are explicit forks or not.

~~~
zbentley
That's really neat. Is there any documentation/discussion available on that
technology? It sounds like something that would be fascinating to learn about.

~~~
Edmond
I am not sure if the parent's claim is true, i.e. that GitHub is storing
objects and sharing them across forked repos.

If they are, then it's likely just a direct implementation of Git the
technology. You can see how Git stores data here: [https://git-scm.com/book/en/v2/Git-Internals-Git-Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects)
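The key property is that Git addresses content by hash, so identical file
contents map to the same object id no matter which repo, branch, or path they
appear in. A small sketch of how a blob id is computed, per the Git internals
chapter linked above:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the id Git gives a file's content: SHA-1 over a
    "blob <size>\\0" header followed by the raw bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical content gets the same object id everywhere -- the basis
# for sharing objects across forks instead of storing copies.
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

That matches what `echo hello | git hash-object --stdin` prints, so any
backend keyed on these ids gets cross-repo dedup of file contents for free.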

~~~
aidos
Not an expert, but that would only work if there was a single repo, right?

~~~
Edmond
Not an expert either :)

There is the concept of submodules, which allows for multiple repos while
maintaining the checksum mechanics that allow sharing the same bit of
information between branches and across commits: [https://git-scm.com/book/en/v2/Git-Tools-Submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules)

The trick is that Git maintains an abstract file system (i.e. a graph) across
commits. The graph consists of pointers to content, without having to clone
the actual content for every new version of said graph. It gets a little
dizzying to explain, but it is really not too complicated :)
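A toy model of that graph (assuming nothing about GitHub's backend): a tree is
just a map from paths to content hashes, so a new version only introduces new
objects for the files that actually changed:

```python
import hashlib

def oid(data: bytes) -> str:
    # Toy content address; real Git prepends a typed header before hashing.
    return hashlib.sha1(data).hexdigest()

def make_tree(files: dict) -> dict:
    """A 'tree' is just path -> blob id. Commits store these pointers,
    not copies of the file contents themselves."""
    return {path: oid(content) for path, content in files.items()}

v1 = make_tree({"README": b"hello", "main.c": b"int main(){}"})
v2 = make_tree({"README": b"hello", "main.c": b"int main(){return 0;}"})

# Only the edited file gets a new blob; README's pointer is unchanged,
# so both versions share the same underlying object.
print(v1["README"] == v2["README"])  # True
print(v1["main.c"] == v2["main.c"])  # False
```

Every new "version of the graph" is mostly pointers into objects that already
exist, which is why history stays cheap.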

------
neurotrace
This is very interesting. I would have liked to see the results for JavaScript
when you ignore the node_modules folder. If that's going to count for code
duplication then pip dependencies should be included as well.

This should definitely be taken as a lesson though: JS needs a better
deployment solution. That, or better education on the current solution(s).

~~~
k__
Do people check in their node_modules?!

~~~
neurotrace
Apparently some people do. They really shouldn't. I can only imagine it's
ignorance in some cases and, in others, fear of another left-pad scenario.

~~~
k__
Then shouldn't that article account for it?

I would have thought that JS has fewer dupes because of NPM.

~~~
neurotrace
That's exactly what I'm saying. The author even states that the node_modules
folder makes up 70% of the files in the JS section. Seems like a poor way to
measure.
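As an illustration, a duplication measure that skips vendored directories
would look something like this (the input shape, a map from file paths to
content hashes, is hypothetical):

```python
def duplication_rate(path_hashes: dict, ignore=("node_modules/",)) -> float:
    """Share of files whose content hash was already seen, optionally
    skipping vendored directories."""
    seen, dupes, total = set(), 0, 0
    for path, h in path_hashes.items():
        if any(part in path for part in ignore):
            continue  # vendored file: not the author's own duplication
        total += 1
        if h in seen:
            dupes += 1
        seen.add(h)
    return dupes / total if total else 0.0

files = {
    "src/a.js": "h1",
    "src/b.js": "h1",                          # copy of a.js
    "node_modules/left-pad/index.js": "h1",    # vendored copy
}
print(duplication_rate(files))             # 0.5: one dupe in two own files
print(duplication_rate(files, ignore=()))  # higher: vendored copy counts too
```

Reporting both numbers side by side would show how much of the "70%" is
really just npm doing its job.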

------
hultner
Would love to see a follow-up showing how much duplication remained if we
controlled for common dependencies and autogenerated code, together with data
on how many repositories are full clones (i.e. all code is near identical to
another repository).

------
az0
Very interesting from a security perspective. So much potentially dangerous
code is copy-pasted, and most of it is probably never updated, either. I've
personally found C vulnerabilities in code that I could easily find used in
many projects by Googling the vulnerable line... and there's usually not much
to be done about it.

~~~
jlarocco
Trying to frame this as a security problem is a stretch, IMO.

My impression is that most public projects on GitHub are only of interest to
the author and maybe a small handful of people. I, for example, have over 100
non-forked public repos and, except for 3 or 4 projects, nobody even looks at
them, much less clones and uses them. Even for the ~4 that do get attention,
it's usually not because people are using the code itself; it's because
they're doing something similar and want to see how I did it.

On the other hand, I only have anecdotal evidence to back up that claim, so
who knows.

------
Tommakx
Would be more interesting to see an analysis of almost-equal files, to detect
reimplementations of the same thing.

------
inetknght
I'm now predicting automated software that scans for duplicated code, flags
it for violating license agreements, and sues for money.

Welcome to the future of copyright trolls.

~~~
ddavis
Burden of proof would hopefully save the day here.

~~~
inetknght
...tell that to existing copyright / piracy trolls

------
nihonium
In order to prevent code duplication on a global scale, we need more
frameworks, like leftpad. :sarc:

~~~
hinkley
You’re being sarcastic, but sitting down, looking at which subject areas
appear most frequently, and talking about _why_ would be useful for any
language.

Are they even getting it right, or do they all have the same bugs? Are there
no existing libraries? Are the downsides worse? Can we fix that? Should this
functionality live in the core language (did we miss a feature)?

