
Show HN: CVPR 2020 unofficial statistics, with better search functionality - ck_one
http://cvpr20.janruettinger.com/
======
ck_one
CVPR 2020 statistics (unofficial) + better search functionality

tl;dr: I have created a dataset about CVPR 2020 papers consisting of the title,
author(s), affiliated institution(s) and the abstract of each paper and put it
behind Elasticsearch to make it more accessible. Happy searching!

Initially, I wanted to find out which research institution is involved in
which papers. To my surprise, this information wasn't readily available. I had
to go through each of the 1500 papers to extract the information. I used a
script to get the title, author(s) and the abstract of each paper and worked
with a freelancer ($100/~30h) to get the institution of every author. Then I
used locality-sensitive hashing to clean institution names and put the whole
dataset behind Elasticsearch. A simple idea turned out to be a good learning
project since it was my first time working with a freelancer and also my first
time using Elasticsearch.
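
For anyone curious what a search over such a dataset could look like, here is
a minimal sketch of an Elasticsearch request body. It assumes one document per
paper with the fields `title`, `authors`, `institutions` and `abstract`; those
field names, the `title^2` boost and the helper `build_search_body` are
assumptions for illustration, not the site's actual mapping or code.

```python
import json


def build_search_body(text, size=10):
    """Build a multi_match query over the assumed paper fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # "title^2" weights title matches twice as much (assumed choice).
                "fields": ["title^2", "authors", "institutions", "abstract"],
            }
        },
    }


# The resulting JSON could be sent to an index's _search endpoint.
print(json.dumps(build_search_body("Technical University of Munich"), indent=2))
```

A single `multi_match` keeps the search box simple: one string is matched
against every field, so "Oxford" and "3D Reconstruction" both work without the
user picking a field.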

Quick summary about CVPR 2020 statistics: Google is still the number one in
terms of number of publications. China is gaining momentum quicker than I
expected.

~~~
mkl
The blank page without context gives no clues as to what words might make
interesting searches. Maybe put a list or word cloud of key words off to the
side to suggest ideas?

~~~
ck_one
Good point! Here is a list of keywords I have used: ["Oxford", "Technical
University of Munich", "Self Supervised", "3D Reconstruction"]. I will add a
list of keywords to the site after dinner.

------
tziki
Wow, I'd noticed the increase in Chinese-sounding author names during the past
few years but had no idea China was such a major player.

------
sbielmeier
Glad to see TUM being represented in this with a couple of papers as well...

~~~
ck_one
TUM == Technical University of Munich. TUM has a strong Computer Vision
department with Niessner, Leal-Taixé and Cremers.

------
goodmattg
Great project! Obviously a stretch, but would be curious to see a breakdown by
workshop, general conference, orals, etc. The number of publications is not the
same as their quality.

Also, do you disambiguate corporate affiliations in academia? E.g. many papers
will be +1's for a university and a corporate research group (Stanford +
Google)

~~~
ck_one
That would be interesting indeed. The way I do it right now is that every
institution listed on a paper gets +1 submission for that paper. That's not
always fair since sometimes corporate labs only provide resources and no other
contributions. But I couldn't come up with a good method to resolve this
issue. I think the numbers are a good approximation of reality.
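
The counting rule described above can be sketched roughly as follows. The
paper records and both function names are made up for illustration. The second
function shows fractional counting, one common alternative in which each paper
is split evenly among its institutions; it is not what the site does.

```python
from collections import Counter


def count_full(papers):
    """Every institution listed on a paper gets +1 for that paper."""
    counts = Counter()
    for paper in papers:
        # set() so an institution shared by several authors counts once per paper.
        counts.update(set(paper["institutions"]))
    return counts


def count_fractional(papers):
    """Alternative: each paper contributes 1/n to each of its n institutions."""
    counts = Counter()
    for paper in papers:
        institutions = set(paper["institutions"])
        for name in institutions:
            counts[name] += 1 / len(institutions)
    return counts


# Hypothetical records, not taken from the real dataset.
papers = [
    {"title": "Paper A", "institutions": ["Stanford", "Google", "Stanford"]},
    {"title": "Paper B", "institutions": ["Google"]},
]
print(count_full(papers))
print(count_fractional(papers))
```

Fractional counting does not solve the "lab only provided resources" problem
either, but it keeps the total equal to the number of papers, which makes
shares easier to compare.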

------
eunos
Thanks for the great work!

Btw, if I may ask, when you compiled the institutions, did you apply some
disambiguation? For example, consider that UCLA === University of California,
Los Angeles

~~~
ck_one
Great question! A combination of techniques was involved in cleaning up
institution names. First, I deleted all department names so that only the
university/lab name was left. Second, I used locality-sensitive hashing (wiki:
[https://en.wikipedia.org/wiki/Locality-sensitive_hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing))
to find similar strings. LSH maps similar strings to the same bucket with high
probability. This resolved some ambiguity like: UC Berkeley/University of
California Berkeley etc. But there was also a good chunk of manual clean up
necessary at the end.
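
To illustrate the idea, here is a minimal sketch of one LSH variant: MinHash
over character trigrams with banding, followed by an exact Jaccard check. The
variant, the parameters (`bands`, `rows`, `threshold`) and all function names
are assumptions chosen for the example, not necessarily what was used for the
site.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def shingles(name, k=3):
    """Set of character k-grams of a lowercased, punctuation-free name."""
    text = re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes):
    """Minimum hash value per seeded hash function (seed is a byte prefix)."""
    signature = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + s.encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def similar_pairs(names, bands=16, rows=2, threshold=0.5):
    """Bucket names by signature band, then keep pairs above the threshold."""
    sets = {name: shingles(name) for name in names}
    buckets = defaultdict(set)
    for name, shingle_set in sets.items():
        sig = minhash_signature(shingle_set, bands * rows)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(sorted(members), 2))
    # LSH only proposes candidates; the exact check removes false positives.
    return sorted(p for p in candidates if jaccard(sets[p[0]], sets[p[1]]) >= threshold)


names = [
    "Technical University of Munich",
    "technical university of munich.",
    "Stanford University",
]
print(similar_pairs(names))
```

Note that trigram similarity only catches spelling variants. An abbreviation
like "UC Berkeley" shares few trigrams with the full name, which is one reason
a manual pass or an alias table is still needed.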

------
ck_one
This is the traffic I got following the HN post:

- total number of search requests: 3500

- ~20 search requests/min (at the moment)

I have never had this much traffic on a private project. Thanks for your
interest!

------
andrew_eit
Great initiative! How long did it take you to parse through all the papers etc
and build this tool?

~~~
ck_one
The coding part of the project (extracting information, setting up
Elasticsearch, building the website) took me around two days. But some
information had to be extracted manually. I hired a freelancer on Upwork for
that who spent ~30 hours on it. It would be cool if conferences collected this
data more carefully at submission time so that we could more easily analyse
which country/lab is getting stronger and why.

