
How F5Bot Slurps All of Reddit - foob
https://intoli.com/blog/f5bot/
======
jwilk
> The other 95% of it is just wasted bandwidth.

You can save a lot of bandwidth by requesting compressed responses:

    
    
      $ curl -s --user-agent moo/1 -H 'Accept-Encoding: gzip' "$pretty_long_url" > test.gz
      $ wc -c < test.gz 
      63507
      $ gzip -d < test.gz | wc -c
      426941
    

(OK, that's 85% saved, not 95%, but hey.)

~~~
dev_dull
Trading CPU for bandwidth, so optimize for what you want.

~~~
koolba
Decompression is usually cheaper (CPU-wise) than compression so technically
you’re trading _their_ CPU for less of your bandwidth and CPU.

If you do it right you can even keep the content stored compressed without re-
compressing by saving the compressed byte stream directly.
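
As a rough sketch (not from the article), storing the compressed stream could
look something like this in PHP's curl bindings: send the Accept-Encoding
header yourself so libcurl doesn't decompress, write the raw bytes to disk,
and only gunzip when you actually need the JSON:

      <?php
      // Example listing URL; the point is the header and the raw storage.
      $ch = curl_init('https://www.reddit.com/r/all/new/.json?limit=100');
      curl_setopt_array($ch, [
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_USERAGENT      => 'moo/1',
          // Setting the header directly (instead of CURLOPT_ENCODING) means
          // libcurl will NOT transparently decompress the response.
          CURLOPT_HTTPHEADER     => ['Accept-Encoding: gzip'],
      ]);
      $compressed = curl_exec($ch);
      curl_close($ch);

      file_put_contents('listing.json.gz', $compressed); // stored compressed as-is
      $json = json_decode(gzdecode($compressed), true);  // decompress only on demand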

~~~
tgtweak
Also use pigz instead of gzip so it can use multiple threads when decoding
(although you're probably limited by the sender's bandwidth anyway).

~~~
jwilk
pigz doesn't help much with decompression:

[https://github.com/madler/pigz/issues/36#issuecomment-249041...](https://github.com/madler/pigz/issues/36#issuecomment-249041503)

 _Decompression can’t be parallelized, at least not without specially prepared
deflate streams for that purpose. As a result, pigz uses a single thread (the
main thread) for decompression, but will create three other threads for
reading, writing, and check calculation, which can speed up decompression
under some circumstances._

------
333c
> I mean do you really want subreddit name and subreddit_name_prefixed?
> They’re the same, one just has an “r/” in front of it.

This is (unfortunately) not quite true. Since Reddit introduced "profile
posts," there can be a post where the subreddit name is something like
"u_Shitty_Watercolour" but the subreddit_name_prefixed is actually
"u/Shitty_Watercolour", rather than "r/u_Shitty_Watercolour".

Example:
[https://www.reddit.com/user/Shitty_Watercolour/comments/84nh...](https://www.reddit.com/user/Shitty_Watercolour/comments/84nhwi/here_is_my_patreon_if_you_would_like_to_support/.json)

~~~
jshap70
I'm not sure that's true. I think they both work, see:
[https://www.reddit.com/r/u_Shitty_Watercolour/](https://www.reddit.com/r/u_Shitty_Watercolour/)

Maybe one is just an alias though? I wonder if you can make a
r/u_$unused_username and then later register $unused_username

edit: nope, you can't make a sub that starts with "u_"

~~~
333c
They do both work, you're correct.

However, the point of subreddit_name_prefixed (I assume) is to display
something in a user-facing way. For this purpose, r/u_something is correct but
not proper.

------
saurik
It is difficult for me to describe just how angry it makes me that reddit
doesn't provide a way for users to even do basic things like "see all of my
own comments" or "see all of the posts made to the subreddit I moderate". They
keep nerfing the search APIs and claim it is so they could make the indexes
more efficient, but while that might make sense for a full-text search
interface, that is entirely unreasonable for basic functionality like "I'm
scrolling back through time on my own user page" (where the efficient index is
pretty obvious). Both "see all of the content I posted" and "see all of the
content I'm supposedly responsible for" seem like basic, if not required,
functionality for any website.

[https://www.reddit.com/r/changelog/comments/7tus5f/update_to...](https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/)

[https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_re...](https://www.reddit.com/r/redditdev/comments/7qpn0h/how_to_retrieve_all_removed_posts_via_api/)

[https://www.reddit.com/r/help/comments/1u0scj/get_full_post_...](https://www.reddit.com/r/help/comments/1u0scj/get_full_post_history/)

~~~
mustacheemperor
Reddit doesn't even allow users to save more than 1000 posts and, worse, does
not visibly document this or provide any kind of warning that the limit has
been exceeded. Anecdotally, I've read users say that revisiting the saved
pages will still show an "unsave" button so the information is recorded
somewhere. But once a user exceeds 1000 entries on their "saved" page, adding
new ones will silently vaporize old ones.

[https://www.reddit.com/r/help/comments/6nxqjm/maximum_of_100...](https://www.reddit.com/r/help/comments/6nxqjm/maximum_of_1000_posts_is_it_true/)

~~~
Deimorz
It's a bit weirder than that. It actually _does_ save all the posts, but the
"saved" page (like almost every other page on the site) will only show you
1000 items, so there's just no way to access all the older items once they've
been "pushed off the end".

I posted some more information about it a while ago here:
[https://www.reddit.com/r/help/comments/7en0uu/my_saved_posts...](https://www.reddit.com/r/help/comments/7en0uu/my_saved_posts_end_after_a_year/dq69oz3/?context=3)

~~~
lucb1e
If you unsave a newer one, wouldn't the older one pop up again because it
updates the index with limit 1000 again? If the data is there, it should find
all posts and then truncate, in that order, if I'm understanding this
correctly.

~~~
Deimorz
Nope, because the old one's already been pushed off the end of the list and it
doesn't re-generate the list when you unsave, just removes the item from the
list.

Imagine I have a list of max length 5, when I initially fill it up it looks
like [5, 4, 3, 2, 1]. If I save one more thing, it adds 6 at the front, then
truncates the list and removes the 1 from the end, so now you have [6, 5, 4,
3, 2]. At that point, if I unsave #3, it just removes it from the list so
you'd have [6, 5, 4, 2]. #1 is still saved, but nothing happens to pull it
back into the list.
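
A toy model of that behaviour (my own sketch, not Reddit's code), using a cap
of 5 as in the example above:

      <?php
      // Saving puts the new item at the front and truncates to the cap;
      // unsaving just removes the item without rebuilding the list.
      function save(array $list, int $id, int $max = 5): array {
          array_unshift($list, $id);
          return array_slice($list, 0, $max);
      }

      function unsave(array $list, int $id): array {
          return array_values(array_diff($list, [$id]));
      }

      $list = [5, 4, 3, 2, 1];
      $list = save($list, 6);    // [6, 5, 4, 3, 2] -- item 1 falls off the end
      $list = unsave($list, 3);  // [6, 5, 4, 2]    -- item 1 does not come back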

~~~
ralfd
What if you unsave 2 items and save 1? Is the list remade then?

~~~
Deimorz
No, the list is never remade. The new item just gets inserted into the list,
exactly the same as if it had never reached the limit in the first place.

------
wolco
"You may think PHP is slow"

Why would we think PHP is slow? PHP is blazing fast; certain applications
(looking at you, SugarCRM) make a mockery of that by rewriting queries and
loading unnecessary data into each page request.

Nice to see a PHP-related show and tell.

~~~
bufferoverflow
PHP used to be very slow; it got better with v7.0.

It's still quite slow compared to C/C++/Rust/Go, more than 10x slower:

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/fasta.html)

~~~
sieabahlpark
An interpreted language is slower than a compiled binary? Color me shocked.

~~~
hk__2
> An interpreted language is slower than a compiled binary? Color me shocked.

That same benchmark shows JS code that’s 6x faster than PHP.

~~~
barrkel
Yes, but JS is faster than Ruby and Python, and indeed most of your go-to
scripting options.

~~~
anonytrary
JS has the advantage of being the thing that makes your website run faster on
the screen of the user whose clicks are earning you ad dollars. JS is in a
good spot.

------
allenz
Related: Jason Baumgartner has maintained a Reddit scraping pipeline for a few
years now, and wrote up some notes about making it robust:
[https://pushshift.io](https://pushshift.io)

~~~
minimaxir
You can also use the Pushshift real-time feed in BigQuery to query for
keywords in submissions in real time (unfortunately, the comments feed broke
last month).

Example query which searches for 'f5bot' in the past day and correctly finds
the corresponding posts on Reddit:

    
    
       #standardSQL
       SELECT title, subreddit, permalink
       FROM `pushshift.rt_reddit.submissions`
       WHERE created_utc > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
       AND REGEXP_CONTAINS(LOWER(title), r'f5bot')

~~~
stuck_in_the_ma
There has been a lot of interest expressed in getting this working and
dependable. It's part of my plan when releasing the new API. There is A LOT of
internal code managing everything. I've got terabytes of indexes alone just to
handle the 5 million API requests I'm currently getting each month to the
Pushshift API (I have around 20 terabytes of SSD / NVMe space and around 512
GB of RAM behind this project).

------
saagarjha
Aho-Corasick is really great. It’s a bit complicated to set up, but once you
have the modified trie set up it’s really fast. By the way,

> Basically I use the selftext, subreddit, permalink, url and title. The other
> 95% of it is just wasted bandwidth.

It’d probably be better for Reddit if they allowed for specifying the fields
we care about rather than just returning the whole thing…
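
Coming back to the Aho-Corasick part: a minimal sketch in PHP (my own
illustration, not F5Bot's actual code) builds a trie of the keywords, wires up
failure links with a breadth-first pass, and then scans the text once,
following failure links whenever a character mismatches:

      <?php
      // Build the automaton: a trie of keywords plus failure links.
      function build_automaton(array $keywords): array {
          // Each node: ['next' => [char => node id], 'fail' => node id, 'out' => [keywords]]
          $nodes = [['next' => [], 'fail' => 0, 'out' => []]];
          foreach ($keywords as $word) {
              $cur = 0;
              foreach (str_split($word) as $ch) {
                  if (!isset($nodes[$cur]['next'][$ch])) {
                      $nodes[] = ['next' => [], 'fail' => 0, 'out' => []];
                      $nodes[$cur]['next'][$ch] = count($nodes) - 1;
                  }
                  $cur = $nodes[$cur]['next'][$ch];
              }
              $nodes[$cur]['out'][] = $word;
          }
          // Breadth-first pass to compute failure links.
          $queue = array_values($nodes[0]['next']); // depth-1 nodes fail to the root
          while ($queue) {
              $u = array_shift($queue);
              foreach ($nodes[$u]['next'] as $ch => $v) {
                  $queue[] = $v;
                  $f = $nodes[$u]['fail'];
                  while ($f !== 0 && !isset($nodes[$f]['next'][$ch])) {
                      $f = $nodes[$f]['fail'];
                  }
                  $nodes[$v]['fail'] = $nodes[$f]['next'][$ch] ?? 0;
                  $nodes[$v]['out']  = array_merge($nodes[$v]['out'],
                                                   $nodes[$nodes[$v]['fail']]['out']);
              }
          }
          return $nodes;
      }

      // Scan the text once; every character advances the automaton one step.
      function find_keywords(array $nodes, string $text): array {
          $hits  = [];
          $state = 0;
          foreach (str_split(strtolower($text)) as $ch) {
              while ($state !== 0 && !isset($nodes[$state]['next'][$ch])) {
                  $state = $nodes[$state]['fail'];
              }
              $state = $nodes[$state]['next'][$ch] ?? 0;
              foreach ($nodes[$state]['out'] as $word) {
                  $hits[$word] = true;
              }
          }
          return array_keys($hits);
      }

      $automaton = build_automaton(['f5bot', 'intoli']);
      print_r(find_keywords($automaton, 'Try F5Bot for keyword alerts from intoli.com'));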

~~~
cxseven
Also, imagine the joules saved worldwide if known key names didn't have to be
sent each time or, better yet, if data were packed in a format optimized for
size and processing speed rather than readability.

I mean, I enjoy the idea of human readability as much as Jon Postel, but at
certain scales you have to wonder about the hidden cost of petabytes of human-
readable data flying over the wire, never to be seen by anything but
computers.

~~~
wvenable
Except that the data stream is, I should hope, compressed so that the data is
actually packed into a format optimized for size.

~~~
yitosda
Not necessarily, and even so not for free.

(Client must specify compression support)

~~~
wvenable
Most clients support it implicitly; you probably have to go out of your way to
get an uncompressed stream. Now, compressing a verbose text string is not
optimal, but given past attempts I'd hesitate to use a pre-packed format;
historically that has not worked out well. Compressing the text format is
ultimately the worse-is-better solution.

~~~
jwilk
They are using libcurl, for which you need to request compression explicitly:

[https://curl.haxx.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html](https://curl.haxx.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html)
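
In PHP's curl bindings that option is exposed as CURLOPT_ENCODING; a minimal
sketch (not F5Bot's actual code) looks like:

      <?php
      $ch = curl_init('https://www.reddit.com/r/all/new/.json?limit=100');
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_USERAGENT, 'moo/1');
      // An empty string advertises every encoding this build of libcurl
      // supports (gzip, deflate, ...) and decompresses transparently.
      curl_setopt($ch, CURLOPT_ENCODING, '');
      $body = curl_exec($ch);   // already decompressed by libcurl
      curl_close($ch);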

------
stevebmark
This is just scraping JSON; I'm surprised it made it to the front page. The
only thing worth noting is that Reddit is able to _serve_ that much JSON.

~~~
NVRM
Yeah, was thinking the same.

------
JeremiahMN
I made something just like this that worked on forums. Basically, you could
subscribe to any forum that was using the Tapatalk plugin (pretty much any
busy forum uses it these days). It doesn't look like this will handle
misspellings of words, or anything like that. I was handling that; however, it
took a LOT of processing power, and I quickly realized that the more people
used it, the less well it was going to scale. Good luck with your project.

~~~
steve19
I run a forum and would be very interested in seeing your code. Can you share?

------
dev_dull
> _So here’s the approach I ended up using, which worked much better: request
> each post by its ID. That’s right, instead of asking for posts in batches of
> 100, we’re going to need to ask for each post individually by its post ID.
> We’ll do the same for comments._

Seems a bit over the top, imho. Maybe a better approach is to ask for 1,000
and look for any that are missing, which you can then grab individually.

I'd be a little annoyed at people not using batch mode and making so many
requests, but that's just me.

~~~
codeplea
Each request still returns 100 posts. It's just that you have to specify the
100 post IDs individually.

Their default listing mode works very poorly. It would certainly be more
requests to use a hybrid system like you're talking about.
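
A rough sketch of what a by-ID batch looks like, assuming Reddit's /api/info
endpoint (which accepts up to 100 comma-separated "fullnames"; the starting ID
below is made up for illustration):

      <?php
      // Post fullnames are "t3_" plus the post ID in base 36.
      $start = 830000000; // hypothetical decimal post ID to start from
      $fullnames = [];
      for ($i = 0; $i < 100; $i++) {
          $fullnames[] = 't3_' . base_convert((string)($start + $i), 10, 36);
      }
      // One HTTP request still returns up to 100 posts, just addressed by ID.
      $url = 'https://www.reddit.com/api/info.json?id=' . implode(',', $fullnames);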

------
visarga
There's a Reddit database dump covering 2005 through May 2018 at:

[https://bigquery.cloud.google.com/table/fh-bigquery:reddit_c...](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05)

------
testplzignore
Which API do most Reddit bots use? Do they use the Reddit APIs directly, or do
they use one of the third-party services (F5Bot, pushshift)? And are there any
other options for getting a firehose of new Reddit posts/comments?

~~~
bicubic
[https://praw.readthedocs.io/en/latest/code_overview/reddit_i...](https://praw.readthedocs.io/en/latest/code_overview/reddit_instance.html#praw.helpers.comment_stream)

It's pretty easy to get a firehose.

------
kierenj
Do the social share buttons literally cover the first few paragraphs of
content for anyone else?

------
wingerlang
For the service itself, I've been using it for a long time and it works really
well.

~~~
codeplea
Thanks! I'm glad it's working well for you.

------
textmode
"Turns out that Reddit [API] has a limit. It'll only show you 100 posts at a
time."

100 sounds like a typical "max-requests" pipelining limit.

He does not mention CURLMOPT_PIPELINING.

Does this mean he makes 100 TCP connections in order to make 100 HTTP
requests?

~~~
meritt
The 100 has nothing to do with HTTP pipelining; it's just a standard REST-style
"&limit=100" hard limit.

~~~
textmode
You might be right.

With the "&limit" parameter he can change how many items he receives per HTTP
request. This has nothing to do with a limit on how many HTTP requests he can
make per TCP connection (pipelining). Maybe that is the "100" he is
complaining about, i.e., 100 items per HTTP request.

However, you failed to answer my question: is he making 100 TCP connections to
make 100 HTTP requests?

Does the Reddit server set a limit on how many HTTP requests he can make per
connection? (100 is a common limit for web servers)

Sometimes the server admins may set a limit of 1 HTTP request per TCP
connection. This prevents users from pipelining outside the browser, e.g.,
with libcurl or some other method.

~~~
meritt
I didn't feel the need to answer your question because it was abundantly clear
in the code that it's not using pipelining. You posted the exact curl option
that he's _not_ using.

~~~
textmode
I apologise if I confused you. I was simply wondering why he is not using
pipelining, which IME can be ideal for the sort of text retrieval he is
performing.

------
RSZC
edit: cool

~~~
lucb1e
Please don't remove your old text when it turns out you're wrong or have an
unpopular opinion. You're not the only person in this thread who missed the
batch aspect of the individual ID requests, and I was initially confused as
well, but removing your post breaks the comment thread, and after 3h you can't
edit anymore, so I'll have to downvote.

------
prolikewh0a
This is mostly why I left Reddit. The API allows far too much control, and I
started questioning what was even real. Being able to quickly find keywords
and then have a network of bots create replies/upvotes/downvotes is a very
disturbing thought to me. I can't even imagine something like that being used
on a large scale to change opinions.

~~~
Bromskloss
How do you change opinions using that? Would it be an illegitimate way to
change someone's opinion?

~~~
jersully72
I think what they're describing are propaganda bots. If so, what you're asking
is how does propaganda work, and is it not legitimate?

~~~
Bromskloss
Yes. Voting patterns, for example, how do those change people's opinions? It
would seem that the most they could do is give an impression of what comments
are and are not well received. Do people base much of their opinions on that?
In which direction?

