
Twint – Twitter scraping tool written in Python - usernameno
https://github.com/twintproject/twint
======
zedeus
Inspired by Twint, I created an alternative web client that relies almost
entirely on web scraping. The parser is more complete than Twint's due to the
different goals, supporting things like polls and videos properly. It's still
a work in progress, but very usable, with features like RSS and convenient
profile search. [https://github.com/zedeus/nitter](https://github.com/zedeus/nitter)
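
Nitter serves an RSS feed per profile (at `/<username>/rss`), which makes it easy to follow accounts without any Twitter credentials. A minimal Python sketch for consuming such a feed with only the standard library — the instance URL and username below are placeholders, and instance availability varies:

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 feed document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

def fetch_profile_feed(instance, username):
    """Fetch a profile's feed from a Nitter instance (network call)."""
    url = f"{instance}/{username}/rss"
    with urllib.request.urlopen(url) as resp:
        return parse_rss(resp.read())

# Example (placeholder instance and username):
# for title, link in fetch_profile_feed("https://nitter.example.com", "jack"):
#     print(title, "->", link)
```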

~~~
scalableUnicon
I don't use the Twitter Android application, and the browser version on mobile
is extremely slow. This made Twitter usable in my browser again. Thanks! PS: It
would be even more useful if someone hosted it on a domain like
<someshortcode>twitter.com; then I could just add the shortcode to the URL and
escape from the slow official one.

~~~
mav3rick
Use a PWA instead. Go to Twitter on Chrome and do "Add to Home Screen". Really
fast.

------
pknopf
I created a tool that uses Twitter's public facing (but private) API.

[https://github.com/pauldotknopf/twitter-dump/](https://github.com/pauldotknopf/twitter-dump/)

You can use it to download every tweet from a user, not just the most recent
~3000 that their API supports. It uses the same query syntax as the web
search.

~~~
arkanciscan
Does it work on suspended accounts? Mine got reported by a bunch of Trump
supporters so I lost 17 years of tweets and Twitter won't listen. I can still
see my own tweets, but only a page at a time. I'm thinking I might need to use
a scraper like Twint instead of the API.

~~~
pknopf
If you can see it while you are logged in, then yes.

Check `twitter-dump auth` for instructions on how to use your web cookies
with the command.

When the command queries Twitter, it will look as if it is coming from you.

~~~
arkanciscan
Nice! I followed the instructions but after I paste the curl command I get
this error:

    Unhandled exception. System.ArgumentOutOfRangeException: Length cannot be less than zero. (Parameter 'length')
       at System.String.Substring(Int32 startIndex, Int32 length)
       at TwitterDump.Program.ParseCurlCommand(String curlCommand, Dictionary`2& headers) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 150
       at TwitterDump.Program.Auth(AuthOptions options) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 113
       at TwitterDump.Program.<>c.<Main>b__0_1(AuthOptions opts) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 33
       at CommandLine.ParserResultExtensions.MapResult[T1,T2,TResult](ParserResult`1 result, Func`2 parsedFunc1, Func`2 parsedFunc2, Func`2 notParsedFunc)
       at TwitterDump.Program.Main(String[] args) in /home/pknopf/git/twitter-dump/src/TwitterDump/Program.cs:line 30

What am I doing wrong?

~~~
arkanciscan
NM, I figured it out. Chrome has several "copy as curl..." commands now. I was
using "Copy as Curl(cmd)" but "Copy as Curl (bash)" is what worked.

------
Stammon
JFYI: Twint is already the name of a popular payment app in Switzerland

[https://www.twint.ch/en/](https://www.twint.ch/en/)

~~~
sschueller
"Popular" is a stretch; it's what a few banks are trying to get people to use
because they couldn't get NFC working on Apple devices with their apps. Until
recently they had been giving money away like crazy to get people to use it.

It requires extra hardware at the cash register (if using Bluetooth) or a
modification to the vendor's terminal so it can display a QR code.

Sadly the app is extremely slow and cumbersome to use at a cash register
compared to other options such as NFC payments on android or just straight tap
and pay credit cards.

It would be a great system for online payments, as scanning a code is much
easier than entering all your CC information, but because the fees are so high
(compared to credit cards), the largest online electronics retailer in
Switzerland (digitec) dropped them after the intro rates expired.

~~~
netsharc
It's probably an attempt to implement what WeChat/Alipay have. In China the
majority of POS terminals have this device
[https://www.alibaba.com/product-detail/Specialized-Alipay-an...](https://www.alibaba.com/product-detail/Specialized-Alipay-and-Wechat-Payment-Desktop_60787993760.html),
which is a fancy housing for a camera that scans the wallet barcode on your
phone (optimized for the perfect angle and to prevent glare, I'm guessing). It
seems preferable and faster than Bluetooth...

------
throwaway4585
A strong mark of quality is that Twint's Twitter account got suspended -
evidently Twitter doesn't like a third-party tool scraping tweets better than
their own API allows.

~~~
fnord123
Their API is absolute garbage. Search is incomplete and streaming consumes
GBs/hour even when you're getting no results.

------
dmuth
Oh hey, glad to see a project I contributed to get a mention on HN.

I wrote the Splunk integration with Twint for crawling Twitter timelines:
[https://github.com/twintproject/twint-splunk](https://github.com/twintproject/twint-splunk)

Feel free to hit me up if there are any questions about that part of Twint.

~~~
skylarchunk
Thanks for your hard work! The tool has been extremely helpful in my political
science research. I have noticed that an error occurs when I try to scrape too
many tweets from too many accounts at once (which I do by referencing a txt
file with the account names I want to scrape tweets from). I've fixed this by
just making shorter lists of account names. Is there another workaround I'm
unaware of?
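
There's no official fix I know of, but one workaround is to batch the username file yourself and pause between batches rather than running the whole list at once. A sketch using Twint's documented `twint.Config` / `twint.run.Search` entry points — the batch size and pause length are guesses you'd tune, not recommended values:

```python
import time

def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def scrape_in_batches(usernames_file, batch_size=10, pause_s=60):
    """Scrape accounts a few at a time instead of all at once."""
    import twint  # pip install twint
    with open(usernames_file) as f:
        usernames = [line.strip() for line in f if line.strip()]
    for batch in batches(usernames, batch_size):
        for username in batch:
            c = twint.Config()
            c.Username = username
            c.Store_csv = True            # write results to CSV
            c.Output = f"{username}.csv"  # one file per account
            twint.run.Search(c)
        time.sleep(pause_s)  # back off between batches
```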

------
lorey
Looks like a decent tool, but I personally would not use it in production. It
took a few minutes to look through the code: they basically use the HTML pages
instead of the APIs. What puts me off is that the code is missing tests
altogether and has quite a few separation-of-concerns issues: HTML extraction
is everywhere, and storage is embedded into the package. A cool proof of
concept, but not ready for production, I think.

Minor stuff: printing instead of logging. I'd prefer a package that only does
the retrieval and nothing else. Hardcoded SQL(ite?) storage.

~~~
karlicoss
Fair objections perhaps, but regarding the use of HTML pages -- scraping is
the whole point, because getting API tokens these days is hard.

~~~
silverdrake11
I was teaching a hands-on workshop (meetup) on how to use the Twitter API a
few months ago, and the hardest part for everybody was getting the API keys:
clicking through several pages, checking boxes, having them emailed to you.

Then it turned out Twitter refused half of the attendees an API key... (maybe
they thought it was spam, coming from the same wifi at the same time).

So then I gave out my own API key to the rest of the class, and within a few
minutes it was blocked...

For a service that has a history of empowering users to protest and to spread
news in crisis situations, it's a shame their API is so locked down.

~~~
karlicoss
Yep. At least it was a realistic experience :)

The hardest part when working with data is often not manipulating data per se,
but spending time on crap like this.

------
anthonyaykut
I'd be interested to hear whether this tool could be used to scrape malware
hashes (or links containing malware hashes) from Twitter? This or any other
tool, really... it appears my brain has turned to mush today and I cannot get
anything to work :(

~~~
thomasdub
What’s your use case for this? I scrape a ton of pastebin links and other
sources of hashes posted by folks on Twitter about Emotet, TrickBot, etc., and
I’d be happy to point them to a webhook or get them to you another way. Happy
to talk through how we do it too.

~~~
anthonyaykut
Interesting! I’ll reach out to you via LinkedIn and/or Keybase to explain my
use case.

------
tandav
nice job, twitter api is terrible

------
_____smurf_____
I noticed that you don't use tweepy
([https://github.com/twintproject/twint#requirements](https://github.com/twintproject/twint#requirements)).
Can you highlight the difference?

~~~
detaro
Tweepy is for API access, this is a scraper.

~~~
faizshah
And the main advantage is you don’t need to authenticate and you aren’t rate
limited.
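
The distinction is worth spelling out: an API client exchanges keys for structured JSON, while a scraper just issues plain HTTP requests and parses whatever HTML comes back. A toy sketch of the scraping side using only the standard library — the `tweet-text` class name is illustrative, not Twitter's actual markup, and the real pages are far messier:

```python
from html.parser import HTMLParser

class TweetTextParser(HTMLParser):
    """Collect text inside elements whose class contains a marker string.

    The "tweet-text" class name is a placeholder, not Twitter's real markup.
    """
    def __init__(self, marker="tweet-text"):
        super().__init__()
        self.marker = marker
        self._depth = 0   # nesting depth inside a matched element
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if self._depth:
            self._depth += 1          # nested tag inside a matched element
        elif self.marker in classes:
            self._depth = 1
            self.tweets.append("")    # start collecting a new tweet

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.tweets[-1] += data

def extract_tweets(html):
    parser = TweetTextParser()
    parser.feed(html)
    return [t.strip() for t in parser.tweets]

# No API keys or rate limits involved -- just an HTTP GET plus parsing:
# import urllib.request
# html = urllib.request.urlopen("https://example.com/someuser").read().decode()
# print(extract_tweets(html))
```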

~~~
_____smurf_____
A question I've asked before, but I keep getting different answers: what are
the -legal- limitations of scraping data when there is a limited-access API?

~~~
faizshah
The problem is that there isn't a straight answer to this, see this recent
thread:
[https://news.ycombinator.com/item?id=22180559](https://news.ycombinator.com/item?id=22180559)

It kind of comes down to how well you can defend yourself from it being called
a DoS attack (follow politeness standards and robots.txt), from violating
their copyright (generally not problematic if you don't distribute the data),
and from violating their terms of service (this is key in the case of Twitter
and Reddit; read their TOS carefully).

However, scraping public information, as in the case of tweets or Reddit
posts, is the less problematic part. It's when you distribute the data, or
aggregations of it, that using scraped public information can become
problematic.
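
The politeness part, at least, is mechanical to implement. A minimal sketch using the standard library's robots.txt parser plus a fixed delay between requests — the user agent and delay here are placeholder choices, not recommendations:

```python
import time
import urllib.request
import urllib.robotparser

def allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules given as a list of lines."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)

def polite_fetch(urls, robots_url, user_agent="example-bot", delay_s=2.0):
    """Fetch only what robots.txt permits, with a delay between requests."""
    with urllib.request.urlopen(robots_url) as resp:
        robots_lines = resp.read().decode().splitlines()
    pages = {}
    for url in urls:
        if not allowed(robots_lines, user_agent, url):
            continue  # disallowed: skip rather than fetch anyway
        with urllib.request.urlopen(url) as resp:
            pages[url] = resp.read()
        time.sleep(delay_s)  # politeness delay between requests
    return pages
```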

------
ornornor
TWINT is also the name of the product for mobile payments in Switzerland
(twint.ch). I hope the reject won’t run into a fight over the name.

~~~
ornornor
s/reject/project

------
drej
I've used my fair share of (painful) APIs over the years and I have one simple
plea: can we please stop being a-holes? Can we stop scraping websites that
have APIs? They already offer machine-readable data, and maybe they have a
reason for not providing everything. Scraping their sites circumvents the API
and not only abuses their systems but also makes your code super brittle -
any change to their site breaks your code, if your code wasn't already broken
because the provider banned you.

~~~
simonw
If people are resorting to scraping then clearly the official API isn't fit
for their purposes.

As an example, here are two key features that are missing from the Twitter API
at the moment:

- Bookmarks. You can privately bookmark tweets on the Twitter website and
apps. There is no way to access the list of tweets you have bookmarked via
the API.

- Threads. The concept of threads - where replies from the same author get
special display treatment - is key to how Twitter is used today. The official
API doesn't support them: there is no way to look at a tweet and see that a
threaded reply exists.

There is no good commercial reason for excluding either of these features from
the public API, other than that Twitter have made a strategic decision not to
invest resources in expanding the API to keep up with new features they are
adding to the platform.

Given that, is it any surprise that people are resorting to scraping?

~~~
simonw
You say "maybe they have a reason for not providing everything" - I cannot
think of a reason not to give me API access to my own private bookmarks other
than "we decided to invest our engineering resources elsewhere".

Which isn't a bad reason! But it's not a good argument for people not to
scrape their own data.

~~~
dewey
> But it's not a good argument for people not to scrape their own data.

Why would you want to scrape your own data if you can already request all your
data and get a whole archive?
[https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive](https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive)

~~~
simonw
Because then you have to trigger and download a GB+ file every time you want
to programmatically access your latest bookmarks.

