

 My popular Twitter Analytics app has reached technical limits. Need help - mittermayr
http://tphq.tumblr.com/post/35052036358/fruji-i-need-help

======
shazow
When I worked on SocialGrapple (similar feature set to Fruji), here's what I
did:

* Technically: I had a well-optimized PostgreSQL database with a few parts: a follower-graph schema with a revision id (which node is following which), cleaned out every N revisions; a delta schema which took the last two revision ids and diff'd them; an aggregate schema which ran a bunch of queries and summarized the results every T interval; and a metadata schema which stored cached information about each node (updated every time that user object was fetched).
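
A rough sketch of how those parts might be laid out (the table shapes here are my guesses, not the actual schemas):

    # Rough sketch of the four parts; table shapes are guesses, not the
    # actual schemas. Runs against a local PostgreSQL via psycopg2.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS follower_graph (  -- who follows whom, per revision
        revision_id INT,                         -- cleaned out every N revisions
        follower_id BIGINT,
        followed_id BIGINT
    );
    CREATE TABLE IF NOT EXISTS graph_delta (     -- diff of the last two revisions
        revision_id INT,
        user_id BIGINT,
        gained BIGINT,
        lost BIGINT
    );
    CREATE TABLE IF NOT EXISTS aggregates (      -- summarized every T interval
        user_id BIGINT,
        computed_at TIMESTAMPTZ,
        follower_count INT
    );
    CREATE TABLE IF NOT EXISTS node_metadata (   -- refreshed on every user fetch
        user_id BIGINT PRIMARY KEY,
        screen_name TEXT,
        fetched_at TIMESTAMPTZ
    );
    """

    with psycopg2.connect("dbname=grapple") as conn:
        conn.cursor().execute(DDL)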

I'm pretty obsessive about query and schema optimization, and I had a
comprehensive benchmarking suite which helped me consistently improve
performance on bulk insertion, aggregate queries, and user-facing queries
alike. Each job was broken down into small, efficient pieces that were
executed in dependency order by my custom task scheduler, Turnip (open
source at <https://github.com/shazow/turnip>).
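
The general shape of that kind of dependency-ordered execution, for
illustration (this is not Turnip's API, just stdlib Python with made-up job
names):

    # Small jobs run in dependency order via the stdlib TopologicalSorter.
    from graphlib import TopologicalSorter

    jobs = {
        "diff_revisions":  {"fetch_followers"},  # diffs need the raw graph
        "aggregate":       {"diff_revisions"},   # summaries need the diffs
        "update_metadata": {"fetch_followers"},
    }

    for job in TopologicalSorter(jobs).static_order():
        print("running", job)  # dispatch to the real worker here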

I don't remember the exact numbers but I was approaching 100M rows on a
single 512MB Linode.

Redis would have worked too but I would have needed much more RAM or more
moving pieces to move things in and out of RAM for processing. None of my
queries were slow enough to worry about this.

* Pricing: As others mentioned, higher prices make it easier to scale. I charged based on the size of the account and how many accounts you wanted to monitor (basically proxies for how many API calls you'd cost me). A small account cost something like $6/mo, 5 medium accounts $14/mo, 25 bigger accounts $50/mo, 100 large accounts (1M+ followers) $125/mo. I had modest revenue but I can't say my pricing scheme was perfect. I was still actively messing with it towards the end.

* I had a legacy whitelisted Twitter account which gave me 10K API hits per hour. This helped me _a lot_. At the same time, I was careful not to become too dependent on that account in case I lost it. I stayed well within the boundary of normal user limits the entire time and only really used the whitelisted account to experiment or backfill new data. I made sure to always make the most efficient API calls to avoid wasting them. I too had issues with timeouts, but they came in waves when Twitter was having infrastructure issues rather than consistently. It wouldn't surprise me if this has gotten worse.

Also, I used and stitched all three Twitter APIs: REST, Search, and Streaming.
It was painful.

* Diversify. Twitter is becoming an increasingly developer-unfriendly platform to build on, and your business should not be dependent on it. I added Facebook support to SocialGrapple, and I was going to add Google+ support too. Today, I'd also add app.net support. That said, the majority of my business was still Twitter, and that sucked. This was a big factor in my decision to sell out and shut it down—I didn't see the developer ecosystem as a place where you can have a sustainable business, let alone a thriving one.

I actually had several conversations/negotiations with Twitter about how
they'd interpret their terms of service wrt my product. It helped to know
people at the company to get a favourable ruling, but I still felt like it
could be reversed—err, "provided with guidance" at any moment.

For what it's worth, I found it more rewarding to build an analytics product
that was super useful for a smaller group of people than a little helpful for
a lot of people (I'd say tweepsect.com is the latter). Think about where on
the spectrum you want to be as this makes decisions, like pricing, easier.

Best of luck! Shoot me an email (in my profile) if you'd like more details.

~~~
mittermayr
andrey, just wanted to say thanks for this comprehensive answer. it seems like
you speak from a lot of experience and while it's scary to read through what
you have had to do to survive, it all makes a lot of sense and helps me in
picking my battles a bit.

again, thanks for this, i am going to go through it again later; right now
i'm responding to over 60 other e-mails with help and support, just
fascinating to see this. if anything, we can hopefully make it clear that
betting on someone else's platform provides tremendous opportunity but also
introduces considerable uncertainty if it takes off.

------
PaulHoule
There's an antipattern here.

Don't build services based on other people's APIs.

It's sad but true. For all the talk about mashups, it's rare to find a demo
that's actually cool and much rarer to find a real application because of
these problems with API limits. Sure, you might be able to build something
that can handle 99% of twitter users, but the interesting and profitable 1%
will blow out the API limit and then you're hosed.

~~~
redacted
So, your solution to the problem (carefully stated and with recompense
offered) is: "Have a time machine"?

I guess it is a good solution, but it would require a lot of fundamental
physics work so it might not match the time-frame he needs.

My point, and I apologise for the snark, is that when people have a problem
and ask for help, they are not asking for judgement on things they should
have done in the past; they are looking for ideas on how to move forward. If
you feel strongly that people who build on APIs do not deserve this help,
then perhaps consider making that point on any of the frequent "X API sucks"
threads that pop up from time to time.

~~~
mittermayr
he is both right and wrong. relying on APIs can suddenly make the world very
complex and high-pressure, but on the other hand, building/owning your
complete eco-system is sometimes not doable when you're starting out (wish
I'd started Twitter, but I haven't...). so his advice, while not helpful,
does have some truth between the lines.

------
troels
Maybe I'm stating the obvious, but you write that:

 _So I scale back up to 100 follower requests for the next call. It goes
through. Next 100, fails. I scale down again …_

So, that sounds like a cache being primed on the first request. If this is a
consistent pattern, you should be able to issue a request for one record,
followed by a request for 100, and get it through most of the time (e.g.
unless you run into a garbage collection cycle). If you code against this
assumption, you should be able to utilise 50% of the theoretical limit of
100/hour.
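
Something like this, to sketch the idea (fetch_followers is a hypothetical
wrapper that raises on timeout):

    def primed_fetch(cursor):
        try:
            fetch_followers(count=1, cursor=cursor)       # prime Twitter's cache
        except TimeoutError:
            pass                                          # the primer may itself time out
        return fetch_followers(count=100, cursor=cursor)  # usually warm by now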

Is there something I'm missing?

~~~
mittermayr
sometimes, it follows a clear pattern. request 100, fail. request the same
100, success. repeat. --- that's likely because they load it into the cache
once they have the results. it halves my success rate, basically.

but sometimes it's just random (depending on twitter's overall traffic, I'd
assume).

plus, some records are faulty and cannot be fetched. this causes other
issues as well and wastes the API call.

------
hboon
A couple of items come to mind:

1. Can you modify your system to return results based on a subset of their
followers? If this is still of value to the user, it looks like the way to
go: provide results based on a subset, followed by another set of "final"
results based on the full follower graph.

2. If you have enough celebrity-scale accounts using your service, are you
able to share their followers' details and cut down on the time used to pull
them?

3. On the business side, the service sounds dangerously like something that
would be killed by Twitter if it becomes successful (by some definition of
successful), because you are pulling out their follower graphs. Look at
Tumblr and Instagram. While it is good to milk it while you can and build it
with ambition, it is also wise to look further ahead and prepare for the day
it gets shut down and revenue goes to zero. If you have nothing to lose, go
ahead, but be aware.

~~~
mittermayr
Great, great points. Here are my responses: 1) Yes, that seems to be the
only way out right now. But my core features (most popular followers / most
valuable followers) depend on a full set and won't be accurate. Growth data
works fine, though. The main reason my service got traction was the most
popular / most valuable followers metrics, so that sucks. But it seems to be
my only resort right now.

2) I can get IDs faster, and then theoretically use my own DB to see whether
I need details for that ID (i.e. follower details) from Twitter or already
have them stored. Problem is: I am duplicating Twitter's database. Also, I
had this before, and the database grew to an immense size and kept crashing
despite my efforts to avoid it. And celebrities have a wide distribution of
followers, so it's unlikely to save much time by checking whose details I
already got elsewhere.

3) Yes, although Twitter recently announced their quadrants of services that
will be supported, and one of them is social analytics, which is what I do.
So that should be good. But you have a point. I can't touch any money for up
to a year in case Twitter shuts it down, since I'd want to pay back the
remaining unused time to my users. So it's frozen money.

~~~
JoachimSchipper
Focusing on 1): if "most popular follower" means "follower with the most
followers", you could try one of three strategies.

First, just check which of the top 100 (1000) most-followed users follow the
celebrity - this will likely catch the heaviest hitters immediately.

For accounts with a normal number of followers, do as you already do.

For the in-betweeners, something like the following (a Python-ish sketch;
get_followers, get_following and follower_count stand in for your API
wrappers):

    import random
    from collections import Counter

    followers = get_followers(celebrity)  # pick your celebrity
    sample = random.sample(followers, max(1, len(followers) // 100))  # ~1% (.1%)

    seen = Counter()
    for follower in sample:
        # count how often each account is followed by the sampled followers
        for followed in get_following(follower):
            seen[followed] += 1

    # only now spend API calls on exact counts, for the top slice of the table
    top = [user for user, _ in seen.most_common(max(1, len(seen) // 100))]
    best = max(top, key=follower_count)
    print(best)

This is, of course, based on the idea that often-followed users who follow a
celebrity will also have many followers among the followers of the celebrity.
Results will become more accurate as you poll more followers, of course.

(I don't use Twitter or their API, and the above may be completely wrong.)

~~~
mittermayr
see, this is what I came for. I believe there's room for optimization within
the limits I have to adhere to right now. your approach seems like a good
stab at it, and I'll think it through again in a bit to see if it makes
sense. but you're hitting the nail on the head in terms of where my problem
is: i can't change what twitter does, i can display partial results but it's
suboptimal in many ways, so how can i make these partial results the best
possible quality? this is where your answer seems to make sense - thanks.
i'll think about this one for sure.

------
PanMan
First, you could try Datasift or Gnip, who both sell Twitter data and thus
have no API limit. Not sure if you can afford it, as it does have a cost.

Second, maybe you could use the streaming API: you could get part of the
data that way and gain more credits. If users follow back, you could use the
sitestream, although it's quite different to work with than the REST API.

Thirdly, if I read correctly: if Alice and Bob are both your clients, and
Fred follows both, you currently collect Fred's data twice, right? I would
put a cache in between: Riak, a cluster of Redis, or even S3 or DynamoDB. If
I can help more, send me an email (in profile).

Lastly, if you have twitter investors in your userbase, ask for an intro to
talk to twitter. They see the value of your service.

~~~
mittermayr
Hey! Datasift+Gnip both seem to supply conversational data, which I
currently don't track. My focus is on user data, which neither seems to
provide. And yes, they're expensive, wow.

The streaming API idea is a good one to get the most out of all of Twitter's
data sources... someone else suggested this as well. Seems like it's worth a
try.

Your Alice/Bob/Fred assumption is correct. I had a cache of sorts, through a
large MySQL table, which grew way past a couple of gigabytes, kept crashing
all the time, and took half a day to restore.

The twitter investors haven't had a chance to see any of the service since
they are still waiting for results :) tough one to ask for intros ;)

~~~
PanMan
The first thing I would try is rebuilding the cache. Since you only need to
do key-value lookups, there are many (easy) approaches that can scale better
than MySQL. You could do JSON files on a filesystem (with some nesting, as
you don't want to put 20 million files in one dir). Or Redis: 20 million x
1 KB is 20 GB; you will need more with overhead, but a few machines would
work. Or Riak (we went past 1 billion items with Riak). But even MySQL
should be able to handle this: we had over 100 million records in MySQL (on
SSDs) before switching to something else.
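
For the JSON-files variant, a minimal sketch (the nesting scheme is just an
example):

    import json, os

    CACHE_ROOT = "cache"  # e.g. cache/12/34/1234567890.json

    def _path(user_id):
        s = str(user_id)
        return os.path.join(CACHE_ROOT, s[:2], s[2:4] or "00", s + ".json")

    def put(user_id, record):
        path = _path(user_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(record, f)

    def get(user_id):
        try:
            with open(_path(user_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None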

That seems to be the quickest win to reduce the number of requests, but how
much it helps will depend on the overlap between your user sets.

------
adrinavarro
I'm surprised nobody has mentioned Redis yet.

I'd build it in two stages. First, a cache that holds user information (sort
of replicating profiles), that is, follower count etc. This would be shared
among all users, and can go into a dedicated Redis instance (and why not,
also replicated to MySQL-InnoDB for convenience).

Then, the "graph" DB (follower list), that I'd put into Redis. With some
scripting and Redis magic, you can keep automatically sorted (server-side)
users by their follower count. You'll just need a lot of RAM (get a dedicated
server, look at ovh or others, cloud is usually more expensive and less
reliable when it comes to RAM).
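
A minimal sketch of the two stages with redis-py (key names made up):

    import redis

    r = redis.Redis()

    def store_profile(user):
        # stage 1: shared profile cache
        r.hset("profile:%d" % user["id"], mapping={
            "screen_name": user["screen_name"],
            "followers": user["followers_count"],
        })
        # stage 2: keep everyone sorted by follower count, server-side
        r.zadd("by_followers", {str(user["id"]): user["followers_count"]})

    # the ten most-followed users seen so far
    top = r.zrevrange("by_followers", 0, 9, withscores=True)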

You can collect profile information before they move on to 1.1 (which forces
auth), to populate the global DB. Then you'd only have to fetch users'
follower IDs (using 1.1's followers/ids), which I believe is way more
reliable, and progressively populate the profile database by pooling queries
into batches of 250 or 500 users, using the follower lists with user
details.

This means that data can be queried dynamically without killing the server (or
the servers, there should be more than one), therefore allowing for "partial
results" (1M followers -> info about the first 10,000 just after signing up,
for example).

~~~
mittermayr
you do have a sensible approach here. originally, without redis, i wanted to
use mysql to cache all user data and insert/refresh details in it over time.
the table quickly grew to 20M records (with meta-data taking at least 1K per
record, if not more), and the database grew to multiple gigabytes. twitter
has up to 140M accounts or more now, so i'd need headroom here, although i'd
likely never touch a large share of twitter users.

the system also started making sense after a while, once I had user ids
already cached (you are correct, I get the IDs through followers/ids, which
is a much more well-thought-out function in terms of limits).

but then mysql constantly crashed, and repairing/backing up a
multi-gigabyte table exceeded my technical abilities, so i gave up. i split
everything up into per-user sqlite databases that i back up to S3. i lose
the ability to access a shared cache of users though, since i can't query
other users' sqlite databases in a sane way to see if they have meta data
for a given user id.
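
roughly this layout (the bucket name is made up):

    import sqlite3
    import boto3

    def open_user_db(user_id):
        return sqlite3.connect("users/%d.sqlite" % user_id)

    def backup_user_db(user_id):
        boto3.client("s3").upload_file(
            "users/%d.sqlite" % user_id,  # local per-user database
            "fruji-backups",              # made-up bucket name
            "%d.sqlite" % user_id)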

major problem is that I believe twitter will eventually shut me down if I
duplicate/replicate their user database (and I constantly need to refresh
since user data will eventually be outdated).

~~~
adrinavarro
They don't need to know you replicate! ;)

By the way, which fields (nickname, avatar, follower count, following count…?)
are you storing for follower representation?

~~~
mittermayr
well, they'd find out eventually, that's my worry. it's tough to speculate on
something like this.

this is the culprit call:
<https://dev.twitter.com/docs/api/1.1/get/users/lookup>

from that response, i store anything that might provide sensible statistics
later on (so no profile colors or photos).

~~~
adrinavarro
Well, it's intended to work with followers/ids and the other calls. Twitter
might go after you, but that would be disregarding how you make your calls…
If they pull the plug, it will be because of features, not because of how
you use the API :(

------
laxk
I am working on a project where I also have to scrape a lot of information
about users (posts/tweets/statuses, photos, friends, etc.) on the social
networks (G+, Twitter, Facebook, YouTube, etc.). All these limitations are
really annoying. Errors are expected on almost every API call (timeouts, 5xx
errors, host not found, new fields in data structures, etc.). The scraper
has to be smart enough to handle all these caveats.

What I do not understand is why these major players do not want to introduce
a paid API. I am ready to pay for it, and I know a lot of people who are
also ready to pay for it. But please, remove these limitations and make your
APIs more stable.

~~~
mittermayr
i agree, a pricing model would be very interesting. and probably a good way
for twitter to monetize their platform.

------
mittermayr
I have quite a few celebrities signed up, without advertising to them,
including certain inventors of Twitter itself, TV personalities, major
investors and others.

And all of them (1M+ followers) have to wait up to 60 days or more to get
past the login page, due to a bug and Twitter's limits.

I feel there's a smart way to work around this and I have always managed to do
so in the past, but now, I've hit my technical limits and need help.

I am willing to split upcoming PRO account payments 50/50 with anyone able to
help me code moving forward / solve this issue.

~~~
huhtenberg
For this sort of user demographics your pricing appears to be off by an order
of magnitude. Just sayin' :)

~~~
mittermayr
i know, but as said, it started as a funny experiment, became a toy, and
then I felt bad charging more for things I cannot influence. I'll definitely
be able to charge $19.99 a month, up to $99.00 a month for corporations;
there's a lot of features I can add. but right now, it'd put me under even
more pressure to charge that much. it's a messed up, weird situation.

~~~
a3camero
Consider increasing the highest price level. There are people who build
businesses around Twitter. They will pay more than $1200/yr for something
that helps them make more money than that.

~~~
mittermayr
yeah well, i instantly would and I have the features/service to back it up
with quality data, but as long as I can't deliver any sort of service to
larger accounts (1M+ followers), it doesn't make sense to charge like that.
long term, no problem. right now, twitter limits me too much.

------
sycren
Could you not think of this as a networking problem? Say you have 10 users
with more than 1 million followers and there are 100 million Twitter users;
what is the probability that some of your other users are following them?

Perhaps for some of the accounts that do not use much credit to search for
followers, you could also search for those whom they follow. Then, on your
backend, you can check whether this subset exists within a previously
fetched list of followers of a big user.
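
In code, the check is just set arithmetic (with follower ids as Python
sets):

    def split_known(follower_ids, cached_ids):
        known = follower_ids & cached_ids    # details already on hand
        missing = follower_ids - cached_ids  # these still cost API calls
        return known, missing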

------
countessa
don't know the twitter api very well, so....maybe instead of scaling back and
doggedly trying to get the current record set, you simply mark the 1st 1000 as
having an error somewhere, proceed to the 2nd 1000....come back to the ones
giving you problems later. I know eventually you have to come back and get the
broken ones, but if you manage to process 80% of someone's millions of
followers, then you can start digging into the other 20% a bit at a time and
at least provide some value for your customers in the meantime......
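
Something like this, roughly (fetch_batch standing in for your crawler
call):

    def crawl(batches, max_rounds=5):
        results, pending = [], list(batches)
        for _ in range(max_rounds):
            failed = []
            for batch in pending:
                try:
                    results.extend(fetch_batch(batch))
                except TimeoutError:
                    failed.append(batch)  # park it, keep moving
            if not failed:
                break
            pending = failed              # later passes only touch the broken ones
        return results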

I haven't actually checked out your service because i'm not really a twitter
person, so maybe you do this already, but could you provide statistics based
on the amount of data you've got so far? i know they will be inaccurate, but
you could present them as a "moving target", based on the x followers
scanned so far. That way the user gets a little bit of value right away.

~~~
mittermayr
i tried this too and it helped a bit: for every 100 user-details requests I
pull a random set of 100 ids from different places in a user's follower list
to minimize getting stuck. but the main problem is still the time-outs.

------
irfan
I had a similar problem when using SSL connections. I'm not sure about your
data, but in my case the data was pretty much public and there was no harm
in using a plain HTTP connection. This significantly improved the speed, and
the very frequent timeouts went away.

Also try enabling/disabling gzip compression for API calls.

~~~
mittermayr
the bottleneck is twitter's api limits; data-wise and http-connection-wise I
have headroom, lots of it.

simple http connections to parse/spider the follower records from public pages
is a no-go since twitter blocks the IP then, and scaling this out will
eventually not end in a good way.

~~~
irfan
I was talking about using plain HTTP for the API instead of HTTPS.

------
kernel_sanders
This may be against twitter's ToS, but can you create a bunch of accounts with
different API keys to access the api concurrently and get around the rate
limits?

I'm sure twitter must account for this, but how do they? You don't need to
provide much information to get an API key.

~~~
mittermayr
yeah, no. they explicitly don't allow this, and as soon as this thing grew
beyond weekend-project scale, i had to adhere to whatever rules they push
out. too high a risk of being shut down and locked out completely.

~~~
YousefED
What about re-using keys of your other registered users to handle the API
requests of users with a lot of followers in parallel?

------
dhruvbird
Interesting practical problem. I've mentioned a few solutions you might try
here: <http://dhruvbird.blogspot.com/2012/11/twitter-api-y-u-no-respond-fast-why.html>

~~~
mittermayr
i responded on your blog. i am using solution 3 already, it helps a bit but
doesn't help with the core problem.

~~~
dhruvbird
@mittermayr I responded to your comment too :) There isn't much you can do
about the latency at their end except try and get the most out of your API
quota by making sure you don't issue a call when you know you will time out
(discussed in the reply).

I would be interested in knowing other solutions that work for you.

------
michaelmior
Not sure if this is against Twitter's ToS, but since follower data is public,
can you make use of unused API requests from accounts with fewer followers?

~~~
mittermayr
Nah, unfortunately not. All calls originate from my company account or on
behalf of a user. As for using other people's account tokens to scan: that
wouldn't work either, as they are only authorized to scan their own
followers.

------
mittermayr
just wanted to say thanks real quick for all this help. i've received over 30
e-mails already from random people helping me out with advice. i know it's
often looked down upon in comments sections to thank the community since it
provides no value, but i don't give a shit right now, i just need to say
thanks, so much everyone :)

------
pbrumm
If one of your requests for followers fails at 100, does it fail for a
different user as well? Instead of backing down on a user, you could try
delaying it for a time period and switching to another user.

also keeping track of when the failures occur is important, and potentially
valuable information.

~~~
mittermayr
i run up to 16 crawlers at the same time. twitter rate-limits me only per
user, for the calls i issue through that user's credentials. so i can go
parallel easily. not much of a bottleneck on my side.

~~~
pbrumm
Does it fail for an individual user, or for all users in that time period?
Instead of reducing the number of followers, you may just need to wait 5-10
minutes and ask for that user again with the full 100.

~~~
mittermayr
pretty much what I do. my local sqlite storage per user slows things down a
bit (but that's good, since Twitter is even slower). so between requests, I
often give Twitter enough time to finish the request for the previous 100
and store them in cache, so that when I re-request the same 100 (i always
try twice), they are often there. but not always. it's a mix of overall
twitter load, plus where/how deep down these 100 followers are stored, plus
whether a follower record is damaged (happens frequently), plus other
time-out factors. that's what I am complaining about; it's so hard to work
around this.
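
in code, the retry part is basically this (lookup_users stands in for the
users/lookup call):

    import time

    def fetch_twice(ids):
        for attempt in range(2):          # i always try twice
            try:
                return lookup_users(ids)  # stand-in for the users/lookup call
            except TimeoutError:
                time.sleep(5)             # give twitter time to cache the result
        return None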

------
mosselman
Maybe it is a good idea to ease the load on the API by scraping Twitter's
HTML pages. The follower count is right there.

This solution is only partially ugly if you are only after the follower count
of a username.

~~~
9mit3t2m9h9a
Actually, loading the follower list via Twitter's HTML pages could also be a
viable idea - more viable than playing by the rules, at least.

------
OoTheNigerian
Ironically, this made me sign up, thereby adding to your load. Sorry.

I'm sure someone here will help you out before the end of today. Better
still, it might help to contact a few peeps directly.

~~~
mittermayr
ha, thanks! that wasn't my intention, but well, it helps to check whether it
can withstand the load.

i feel sort of bad asking for advice here, since many have better things to do
and i am the one making money with this, but i've reached a point where I
don't know how to continue. and this is very odd.

------
OoTheNigerian
Quick one.

I am getting inaccurate results. I can confirm I have a verified account and
more than 0 retweets weekly.

I am guessing this inaccuracy is a result of the challenge you are having
with indexing.

~~~
mittermayr
the verified account will be correct tomorrow. this is a timing bug that
some users encounter, where one procedure finishes faster than another (it's
rendered before all the data is there, somehow). the retweets number is
funky because Twitter is migrating to a new API version, which I've mostly
adopted. but the retweets feature will be gone in the new api version; they
don't offer that data anymore. so it might not be reliable right now.

------
nodemaker
    if exception.message == "Capacity Error":
        sleep(x); x = x * 2; continue
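
Spelled out, with a retry cap (CapacityError stands in for however your
client surfaces Twitter's over-capacity errors):

    import time

    class CapacityError(Exception):
        pass  # stand-in for your client's over-capacity error

    def with_backoff(call, max_tries=6):
        delay = 1
        for _ in range(max_tries):
            try:
                return call()
            except CapacityError:
                time.sleep(delay)
                delay *= 2  # back off exponentially
        raise RuntimeError("still over capacity after %d tries" % max_tries)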

~~~
mittermayr
well, I do back off for a bit once I hit rate limits or the request times out.
but regardless, I lost an API call credit.

~~~
bonzoesc
You don't lose a credit when you get rate limited.

~~~
mittermayr
yeah, but i don't even hit the rate limit (i check before every call). i
stop 2 or 3 calls shy of it, so that doesn't limit me. but if I send an API
request and that request times out and does not return data, it still costs
me a precious call (and brings me one closer to the rate limit).
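
in code it's just this (remaining comes from twitter's rate-limit response
headers):

    HEADROOM = 3  # stop 2 or 3 calls shy of the limit

    def should_call(remaining):
        # 'remaining' is read from twitter's rate-limit response headers
        return remaining > HEADROOM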

------
batgaijin
Switch to dedicated hosting and put the saved money into RAM.

~~~
mittermayr
i'm on dedicated. 16gb of ram. that's not the issue; the bottleneck is on
twitter's side.

