
Internet Search Tips - telotortium
https://www.gwern.net/Search
======
mrspeaker
These are some interesting tips. And as a bonus, the page is giving me serious
flashbacks to Fravia's Web Searchlores!
[http://search.lores.eu/indexo.htm](http://search.lores.eu/indexo.htm)

~~~
gwern
Hopefully my tips are a little easier to read and learn from than Fravia's.
And much more up to date...

~~~
sitkack
Thanks for writing this up.

I see no mention of the wayback machine.

Nor digging into raw git and mercurial repositories.

I have also had luck emailing either folks who inquired about the document or
the last good email address for the author.

Universities often have blanket access to journals if you use their wifi. SH
has mostly supplanted the need for this, but not always. Universities are
still great for interlibrary loans. If you are friendly with the library
staff, they might let you ILL a book and read it in the library even if they
don't check books out to citizens.

Last one, on mining a local search database: as you said, determine the
boolean operators, get familiar with the FTS backends (SQLServer, Postgres,
Elastic Search, etc) and use some lightweight, unobtrusive scripting (in the
browser JS console).

For example, this snippet extracts the PDF and doi.org URLs from a page. `copy`
is a Chrome-specific helper. The snippet is great for extracting results that
are lazily loaded in, something wget cannot do for you.

    
    
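        // Collect the href of every link that ends in ".pdf" or contains "doi.org".
        function gpdfs() {
            var aref = document.getElementsByTagName("a");
            var result = [];
            for (var i = 0; i < aref.length; i++) {
                var t = aref[i];
                if (t.href.endsWith(".pdf") || t.href.includes("doi.org")) {
                    result.push(t.href);
                }
            }
            return result;
        }
    
        // Copy the collected links to the clipboard via the console's copy() helper.
        function pc() {
            copy(gpdfs());
        }
    
        pc();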
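
A hedged variant of the snippet above, as a sketch rather than a drop-in replacement: the filtering is pulled out into a pure function so it can be tried outside the browser, and duplicate links are dropped. `filterPaperLinks` is a hypothetical name; the matching rule (a ".pdf" suffix, or "doi.org" anywhere in the URL) is the same one used above.

```javascript
// Keep only links that look like PDFs or DOI resolver URLs, dropping duplicates.
// Input is a plain array of URL strings, so this also runs outside a browser.
function filterPaperLinks(hrefs) {
    const seen = new Set();
    const result = [];
    for (const href of hrefs) {
        const isPaper = href.endsWith(".pdf") || href.includes("doi.org");
        if (isPaper && !seen.has(href)) {
            seen.add(href);
            result.push(href);
        }
    }
    return result;
}
```

In the browser console this could be used as `copy(filterPaperLinks(Array.from(document.links, a => a.href)).join("\n"))`, assuming the same `copy` console helper mentioned above is available.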

~~~
gwern
> I see no mention of the wayback machine.

The IA is covered in multiple sections. The Wayback Machine is merely one area
of the IA.

> Nor digging into raw git and mercurial repositories.

I have never had to do that for any research task I've undertaken, so it
would be too obscure to mention, and I wouldn't know anything about it anyway.

> Universities often have blanket access to journals if you use their wifi.

A good point. My proxy superseded this for me but I _used_ to do this, and
simply forgot about it. I'll add that. (Another good trick I know is using the
old Google Reader RSS exports hosted on IA to get fulltext of webpages. I'll
have to add that too:
[https://www.gwern.net/Search#searching-the-google-reader-arc...](https://www.gwern.net/Search#searching-the-google-reader-archives) )

> get familiar with the FTS backends (SQLServer, Postgres, Elastic Search,
> etc)

I was really just thinking of grep and some other CLI utilities there. :)

~~~
greglindahl
BTW "pip install warcio" is the latest hotness for processing warc files.
Also, you can add a header to wget to download a byte range. Unfortunately
archival tools are mostly targeted at crawling and then playback, not so much
special collections like this one.

~~~
gwern
How do you use wget? I know you can specify a start position, intended for
resuming downloads, but you need to specify an end position as well (to
extract just the single compressed file), and I don't see any way to specify
an end in the wget manpage.

~~~
greglindahl
Hm, come to think of it, I guess I was programming in python at the time. The
internets say that wget doesn't support range requests and while curl's
manpage says --range works, it doesn't work for me.
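
For reference, the byte-range request being discussed can be sketched in JavaScript. This is only an illustration under assumptions, not anything taken from the article: it assumes the server honors the HTTP `Range` header and that a global `fetch` is available (modern browsers, recent Node.js). The names `byteRangeHeader` and `fetchRange` are hypothetical.

```javascript
// Build the value of an HTTP Range header for `length` bytes starting at `offset`.
// Range end positions are inclusive, so the last byte is offset + length - 1.
function byteRangeHeader(offset, length) {
    if (!Number.isInteger(offset) || offset < 0) {
        throw new RangeError("offset must be a non-negative integer");
    }
    if (!Number.isInteger(length) || length <= 0) {
        throw new RangeError("length must be a positive integer");
    }
    return "bytes=" + offset + "-" + (offset + length - 1);
}

// Fetch only the requested slice of a remote file.
// A server that honors the header answers 206 Partial Content;
// a plain 200 means it ignored the range and sent the whole file.
async function fetchRange(url, offset, length) {
    const response = await fetch(url, {
        headers: { Range: byteRangeHeader(offset, length) },
    });
    if (response.status !== 206) {
        throw new Error("expected 206 Partial Content, got " + response.status);
    }
    return new Uint8Array(await response.arrayBuffer());
}
```

Both a start and an end position go into the one header, which is what makes it possible to pull a single compressed record out of the middle of a large archive file, provided its offset and length are known from an index.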

------
guidovranken
> master/PhD theses: sorry. It’s probably hopeless if it’s pre-2000. If you
> have a university proxy, you may be able to get a copy off ProQuest.
> Otherwise, you need full university ILL services, and even that might not be
> enough (a surprising number of universities appear to restrict access only
> to the university students/faculty, with the complicating factor of most
> theses being stored on microfilm).

At least in the Netherlands there is [1] which indexes many papers/theses
including very old ones, and many are freely accessible.

[1] [https://www.narcis.nl/](https://www.narcis.nl/)

~~~
btrettel
I've used my (US) university's interlibrary loan (ILL) service many times.

In my experience theses and dissertations are treated like any other book in
interlibrary loan. Local public libraries also often have interlibrary loan,
perhaps for a fee. So I can't see why theses would be inaccessible via ILL
outside of a university.

I have run into some theses and dissertations being restricted to people
affiliated with the particular university which holds the book, but that's
rare in my experience. More common is the case of rare books (few copies
available, like in the case of theses/dissertations) being marked as "library
use only", so you can't take the book outside of the library. But I have
received many of these via ILL, and my university simply respects the wishes
of the book's owner by not allowing me to take the book outside of the library
I requested the book from.

~~~
gwern
As a kid, I made heavy use of my public library's ILL (sorry, taxpayers), but
I never heard any whisper that university-level ILL of any documents was
possible, and the forms my librarians filled out strongly implied that only
books & CDs were ever contemplated.

> In my experience theses and dissertations are treated like any other book in
> interlibrary loan.

Yes, but you have to be _at_ a university in the first place. Life is grand if
you're a student with full privileges - ILL is, IMO, one of the most
underrated fringe benefits of being a student - but once you're out, you're
out in the cold. I've discussed this with any number of people, including
people at think-tanks without university affiliations, and no one's come up
with a good way to get back into the ILL system short of hiring
students to do requests for you.

> I have run into some theses and dissertations being restricted to people
> affiliated with the particular university which holds the book, but that's
> rare in my experience.

Yes, fairly rare, but it still happens. Unfortunately, I don't know what
happens when you ILL them because I didn't run into any examples until after I
graduated. I'm also puzzled when I run into online open access theses which
are, however, embargoed for a year... (I simply schedule a reminder to go
back, but I'm perplexed as to what could possibly be the point of that.)

~~~
btrettel
Have you tried requesting a thesis or dissertation via a public library's
interlibrary loan service? I am fairly confident that most universities would
lend a thesis or dissertation to a public library. I don't see any clear
reason why they would not other than snobbery. (Edit: And I can recall a few
instances where my university got a book via ILL from a public library. So the
converse does happen, at least.)

While I have not used public library ILL, my impression is that the
difference is more in scale than access. (Though access surely is reduced.) I
recall reading public library ILL policies that charged for requests and only
allowed one request at a time. That would be a lot worse than what I have
right now, but much better than nothing. (Also, I have used ILL at two
government labs and I would say the difference again was more in scale than
access, and that these labs were somewhere between public and university
libraries.)

There are many paid document delivery services. Here's my university's one:
[https://www.lib.utexas.edu/find-borrow-request/interlibrary-...](https://www.lib.utexas.edu/find-borrow-request/interlibrary-loan/outside-orgs-individuals)

I've used a few paid document delivery services before. They weren't cheap,
but I was able to get some things that my university's ILL wasn't able to.

Personally, I'd prefer some sort of scan request barter system. I scan X pages
for you, you scan X pages for me. I've done similar things informally.
r/scholar hasn't worked out well for anything not already digitized. I've
thought before that someone should make a scan request website that gives you
credits for fulfilling a request. Hopefully the lawyers would stay away from
this as these items can't really be obtained any other way; if they were
available for sale, people would buy them.

As for embargoed theses, my impression is that usually the author wants to get
a patent on something discussed in the thesis. In US law, you can only get a
patent within one year after publicly disclosing the invention. So the embargo
gives them more time. There likely are other reasons as well, but this is the
only one I've encountered.

~~~
gwern
> I don't see any clear reason why they would not other than snobbery.

Snobbery is a pretty good reason for anything, I've found. In any case, it
could be many things: expense (as my university regularly reminded us, each
ILL cost like $20 on net), low demand from patrons (even if self-fulfilling,
still a valid reason), lack of trust, not being plugged into the right ILL
system/software... I don't recall any of the books I ILLed from my public
library being indicated as being from universities, though there were several
close to us.

I suppose I should try my public library when I have some spare time. I'm
almost certain they'd be unable to get either books or theses or papers, but
it'd be interesting to know the specific reason why not.

> There are many paid document delivery services. Here's my university's one:

Yes, actually, I know someone who used that one last month. (Scan quality
could've been much better, IMO.) It is, unfortunately, rare to have a
straightforward 'pay us $X and we'll give you a copy of any thesis' service
linked or mentioned on the library website, and they are only for _that_
library's holdings ("scans of book chapters and articles from the UT Libraries
collection" ie not anyone else's). I would complain a lot less if most
universities had it! I've sometimes wondered if more university libraries have
it than I think they do, and they just hide it.

> As for embargoed theses, my impression is that usually the author wants to
> get a patent on something discussed in the thesis.

That would be reasonable, but in the subjects I usually research, that would
make little sense. I think the last embargoed student thesis I ran into was a
study on the stimulant effect of nicotine on cognitive performance; hard to
see any patent on that being possible, much less profitable to apply for.

~~~
btrettel
I'd push back if a public library refused to do ILL. (Though I can be fairly
stubborn.) If cost is a concern, offer to pay. If a librarian says that they
don't offer that service due to low demand, ask if they could make an
exception. Maybe ask another librarian who might be more open to the idea. I
don't know what to do about a lack of trust.

Lacking the right software is not a valid excuse. UT will send "ALA requests"
every once in a while if the lending library isn't in their software. As far
as I can tell an ALA request is just this form mailed or emailed:
[http://www.ala.org/rusa/sites/ala.org.rusa/files/content/sec...](http://www.ala.org/rusa/sites/ala.org.rusa/files/content/sections/stars/resources/ALA_ILL_Request_Form.pdf)

I might suggest contacting various universities about getting copies of theses
from them. It's possible that they'd be perfectly willing to scan them for
you, even for free. On this note, I've been surprised by the extent some
corporations have gone to provide me with copies of proprietary technical
reports they produced. One time I called the number on the website of a large
corporation and my call was forwarded to their staff librarian, as I recall. A
few weeks later I got the requested report. It had to go through some release
process, but they were happy to share their work. Another time I emailed a
generic address at a Shell research lab, and a few weeks later I got a copy of
the report I wanted. There have been failures as well, but I was surprised by
how frequent the successes were.

------
throwaway4790
Worth adding in my opinion:

booksc/bookfi/bookzz

IRC bot rooms (#bookz, #ebooks and others)

Private trackers (Bibliotik et al.)

DHT search engines, eMule, DC++

In my country you can also just walk into a public library, get a free
membership card and start browsing

------
8bitsrule
Most important in my experience are the terms that you choose for a search.
Sometimes it helps to think about how most people would phrase a question.

Often results can be too broad ... so use them to choose more specific terms
to add to your query to 'focus in' on the result you need. (E.g. adding years
(even specific dates) can add focus -and quality- to your results.)

It helps to keep a growing list of bookmarks for high-quality 'specialist'
sites with -a lot- of content.

------
saagarjha
Great article, but I’m left disappointed after looking at the “obstacles” that
were presented. For example, paywalls: why is it ok for legal documents to
cost money to access? This should be free for everyone–if they need money for
“technology” (somehow RECAP apparently doesn’t, or can figure out a way to
cover their expenses without charging people) they should roll it into taxes.
Same goes for university research, especially if it’s publicly funded. So much
effort wasted for information that should be freely available.

------
yusuf-lan3
some good tips but the site is super hard to read (content structure wise)

~~~
gwern
I've reorganized it a bit.

------
etaioinshrdlu
I am honestly concerned for Gwern's mental health. So much personal
optimization taken to such an extreme, doing blatantly illegal things and
blogging about it. Some are probably quite unhealthy.

I am all for being the best person we all can be -- but Gwern seems to have
made personal optimization the end goal itself.

I think we should aim to be happy, well-adjusted people, not work ourselves to
death, not drug ourselves to our physical limit.

There were times in my life when I behaved similarly and it was rooted in deep
dissatisfaction with life.

To be clear, this is not an ad hominem attack. The article is amazing.

~~~
vonseel
Gwern is amazing and what a fascinating website. I hear you and understand
where you are coming from, but some people thrive under extreme organization
like this. These are the kind of people I want writing technical documentation
on my teams.

------
gigama
"If you can’t use it while chatting without the other person noting your
pauses, it’s not fast enough."

Still takes time to read, comprehend and reply.

~~~
glomph
I think they might mean chatting online.

~~~
yorwba
Gwern has now replaced "chatting" with "IRCing".

