

Stanford PhD Dissertation Browser - abhaga
http://nlp.stanford.edu/projects/dissertations/

======
dramage
Thanks for the feedback, folks. I'm actually a bit surprised this hit Hacker
News without any of us authors posting it, but heck, I'll jump in.

Think of this as an experiment in exploring a document collection at a higher
level than search. Specifically, what you're seeing is Stanford's
dissertations through the lens of a text model that tries to distill high-
level patterns in the data. It doesn't always succeed, but it often hits the
mark. There are plenty of ways that the visualization and the underlying text
model could be improved.

For the curious, I'll tell you a bit more on how the numbers are computed: we
build a unigram language model of the contents of every Stanford department
based on their dissertations. Then, we posit that every dissertation comes
from a mixture of those department models (using a supervised topic model,
Labeled LDA). This lets us infer, for every dissertation, a weighted mixture
of departments that best characterizes that abstract. So, say, dissertation X
is 60% computer science, 20% physics, and so on. These scores are aggregated
to compute the average similarities between departments, and are sliced to
give the view over time.
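
To make the idea above concrete, here is a minimal sketch. It is not the Labeled LDA code behind the site: it swaps in plain EM over fixed, smoothed unigram models, and the toy vocabulary, the department names, and the helper names `unigram_model` and `mixture_weights` are all made up for illustration.

```python
from collections import Counter

def unigram_model(docs, vocab, alpha=0.01):
    """Smoothed unigram distribution over vocab for one department (assumed smoothing)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_weights(doc, models, iters=50):
    """Estimate per-department mixture weights for one document with EM.

    A simplified stand-in for Labeled LDA inference: the department word
    distributions stay fixed and only the document's weights are updated.
    """
    depts = list(models)
    weights = {d: 1.0 / len(depts) for d in depts}
    for _ in range(iters):
        resp_totals = {d: 0.0 for d in depts}
        for w in doc:
            # E-step: how much each department "explains" this word.
            scores = {d: weights[d] * models[d][w] for d in depts}
            z = sum(scores.values())
            for d in depts:
                resp_totals[d] += scores[d] / z
        # M-step: new weights are the average responsibilities.
        weights = {d: resp_totals[d] / len(doc) for d in depts}
    return weights

# Hypothetical toy corpus: two departments, a handful of words each.
dept_docs = {
    "cs": [["algorithm", "network", "graph", "algorithm"],
           ["compiler", "graph", "network"]],
    "physics": [["quantum", "particle", "energy"],
                ["energy", "field", "quantum"]],
}
vocab = sorted({w for docs in dept_docs.values() for doc in docs for w in doc})
models = {d: unigram_model(docs, vocab) for d, docs in dept_docs.items()}

# An abstract that borrows words from both departments.
abstract = ["algorithm", "graph", "network", "quantum", "energy"]
weights = mixture_weights(abstract, models)
print(weights)
```

Averaging these per-dissertation weights within each home department would give a department-to-department table in the spirit of the one described above.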

So what you're looking at is, essentially, a visualization of word overlap
between departments, measured by letting the dissertations in one department
borrow words from another department. Which departments borrow the most
words from which others?

When you zoom in two levels (click on a department twice), individual
dissertations are plotted on a line between each dissertation's home
department and its next-highest-scoring department. So the relative position
of two dissertations near each other is not meaningful unless they are on the
same radial line. Dissertations from other departments that have a high score
for the central, focused department are also shown.
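
A rough sketch of that placement rule, assuming each dissertation already has per-department scores. Picking the next-highest-scoring department follows the description above; the `fraction` formula for how far along the line the dot sits is purely a guess for illustration, and `radial_placement` is a hypothetical name.

```python
def radial_placement(scores, home):
    """Return (target department, fraction of the way from home to target).

    The target is the highest-scoring department other than the home one.
    The fraction is an assumed formula, not the one used by the site.
    """
    others = {d: s for d, s in scores.items() if d != home}
    target = max(others, key=others.get)
    denom = scores[home] + scores[target]
    fraction = scores[target] / denom if denom > 0 else 0.0
    return target, fraction

# Hypothetical scores for one CS dissertation.
scores = {"cs": 0.60, "linguistics": 0.25, "physics": 0.15}
target, fraction = radial_placement(scores, home="cs")
print(target, round(fraction, 3))
```

Under this sketch two dots are only comparable when they share the same `target`, which matches the caveat about radial lines.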

For instance, take a look at Computer Science in 2005. You'll see three
dissertations along the radial line to Linguistics - those are the three
students that graduated from the Stanford NLP group that year. There are
plenty of other places you find similar things that work, and also places
where things don't work as nicely as you'd expect.

The visualization Jason built was really interesting from the text modeling
perspective, because it let us experiment with many model variations (LDA,
tf-idf, etc.) to see how well each matched our intuitions. This model, though
still wanting, was by far the best. Good enough, even, for us to put online
for the world to play with, and for Hacker News to pick apart ;)

~~~
maxogden
why'd you take the data down? I was writing a scraper in order to make your
underlying data into happy open and accessible data:
<http://scraperwiki.com/scrapers/stanford-dissertations/edit/>

~~~
pbh
I can't speak for Dan, but it's probably best if you don't do this! I think
most of the data itself is UMI data from ProQuest, which seems to be licensed
pretty strictly.

~~~
maxogden
boo proprietary data!

------
torme
Neat? Yes.

Usable? No.

I'm not sure if there's some piece of information that's trying to be portrayed
here (other than quantity of dissertations per major) but as a browser, it's
pretty useless. It employs mystery meat nav heavily, mostly because you have
to hover over a dissertation to view any info about it. Imagine finding a
paper you liked, and then going back 15 minutes later to try and find it.

If there is some data that's trying to be shown, it's not clear to me what it is.
Some of the inner circles are in columns which seem to indicate correlation,
but I can't figure out if that's accurate or not. For instance, if I click on
philosophy, I see a dot in the direction of Food Research, and when I hover
over that dot I see a thesis about "Practical reasoning and the varieties of
agency".

What is trying to be portrayed here?

~~~
abi
Yep, this is a really bad way to present the information. Having a collapsible
columnar list for each department would be orders of magnitude more useful.
Plus it would be in HTML, and you could use ctrl/cmd+f to search for
dissertations by keyword.

Perfect example of _bad_ data viz.

~~~
adulau
Yes and no. I found the way it shows the relationships between topics quite
useful and visual. For example, you can see the close relationship between
biology and psychology. I'm not sure a large table could do the job.

~~~
torme
I see almost no correlation between bio and psych. In fact, I only see 2
articles that link those 2 topics at all.

------
Jun8
Oh, for a second there I thought Stanford made all their dissertations
available online. On second thought, why don't universities do this instead
of trying to sell access? I've heard of people who put a $20 bill in the
library copy of their dissertation and found it intact years later.

~~~
kleiba
At least for C.S. I would guess that most people will happily send you a PDF
copy of their theses (if they're not available from their websites already).
Just send them a friendly email.

~~~
Jun8
You're right, but contacting everyone like that is not very practical. In most
cases you're not even aware that a dissertation on the topic you're
interested in exists.

IDEA BOLT: What if one creates a central website that offers free storage and
search capabilities and asks people to upload their dissertations and theses?

What do you think about that idea?

~~~
RK
<http://arXiv.org>

~~~
Jun8
Yep, but arXiv's content is very limited, e.g. very little EE research and
economics, and nonexistent social sciences.

~~~
RK
I don't know where to find the stats, but the number of categories has
expanded greatly. I assume that trend will continue.

The hard part is convincing people to upload their work. Even in physics the
use of the arXiv is not uniform. In some areas, almost every published and
unpublished paper is posted, while in other areas of physics hardly any papers
are. For example, compare the category Quantum Physics with Space Physics.

------
baddspellar
I did my PhD in Computer and Systems Engineering (in the EE department of my
school), and my thesis involved the use of Computer Vision and AI in the analysis
of microscope images of human cells, so I did a lot of work with MDs and
Biologists. I thought it would be interesting to see which theses overlapped
cell biology and Electrical Engineering.

The browser showed two overlaps: "Low-Power dynamic amplifiers for pipelined
A/D conversion" and "Precision clock synthesis using direct modulation of
front end multiplexers/demultiplexers in high speed serial link transceivers"

The first of these mentioned "cell phones" in the abstract. There was no
evidence of any cell biology link in the second.

The visualization may be interesting, but I'm not so confident in the quality
of the data.

------
Groxx
Interesting setup, but it seems pretty wildly incorrect at times.

For instance: Comp Sci -> Ethics: "Designing interactions that combine pen,
paper, and computer". Comp Sci -> Radiology: "Securing untrustworthy software
using information flow control". Comp Sci 98 -> Physiology: "Consistent
overhead byte stuffing".

Could be a heck of a lot better. Especially given the long almost-locked-up
pauses, and the inability to keep a block of text up when you move the mouse
away. All the little things add up, making me doubt the creator used it
themselves at all, aside from making sure it functioned.

------
abeppu
I'm a little annoyed that they used the phrase "topic distance" but it looks
like they're using something which is in some cases very asymmetric (the
'distance' between CS and EE is not the same as between EE and CS). As a
visualization, it's not meaningful because they don't explain what it means
for two topics to be 'close'.

I'm guessing they're using something like a KL divergence between the
distributions over words (smoothed?) given each topic, but that could be way
off.
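
For what it's worth, the asymmetry is easy to see with KL divergence itself. This is only a sketch of the guess above, not a claim about what the site computes; the two word distributions are made-up numbers.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same words.

    Assumes q is nonzero wherever p is nonzero (e.g. after smoothing).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical word distributions for two departments over a 3-word vocabulary.
cs = [0.7, 0.2, 0.1]
ee = [0.4, 0.4, 0.2]

cs_to_ee = kl_divergence(cs, ee)
ee_to_cs = kl_divergence(ee, cs)
print(cs_to_ee, ee_to_cs)  # the two directions differ
```

Since the two directions disagree, anything built on it is not a distance in the metric sense unless it is symmetrized, for example by averaging both directions.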

------
abi
Is it possible to obtain the data for this somehow? I'd be interested in making a
different, more straightforward columnar text-only view.

------
j2d2j2d2
In flash? C'mon!

