
Sending 1.2M Tweets - edent
https://shkspr.mobi/blog/2019/07/sending-1-2-million-tweets/
======
SimeVidas
I remember the days when your ISP would give you 5 MB of hosting space.

If each of the 1.2 million tweets includes a ~150 KB image, that’s 180 GB of
images hosted on Twitter for free.
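
The estimate is easy to check. A minimal sketch, assuming decimal units (1 GB = 1,000,000 KB) and treating the ~150 KB average as a guess from the comment rather than a measured value; the function name is hypothetical:

```python
def total_storage_gb(num_items: int, avg_size_kb: float) -> float:
    """Total size in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return num_items * avg_size_kb / 1_000_000


# 1.2 million tweets, each with an assumed ~150 KB image
print(total_storage_gb(1_200_000, 150))  # 180.0
```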

~~~
craz8
A while ago, this guy found that you could store files in the DNS servers
belonging to other people, and created DNSFS, which he documented here:
https://blog.benjojo.co.uk/post/dns-filesystem-true-cloud-storage-dnsfs

It looks like there might be about 250 MB available for all to share across
the internet using this system.

PingFS is even more out there!

The internet is full of weird corners to exploit in fun ways

------
azhenley
Does this mean that I can make a Twitter account to back up all my photos and
then use the "Download your data" feature [1] to download all of them?

[1] https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive

~~~
bscphil
Only if you don't mind them being recompressed and the metadata being deleted.
Of course, if you don't, surely Google Photos would be a better choice, since
it's specifically designed for this purpose, has unlimited storage for lossy-
compressed photos, and photos under a certain size are left alone.

~~~
petepete
All it needs now is a sensible uploader. Google Photos is so close to perfect
in every way other than that.

~~~
StavrosK
Also the fact that there is zero privacy.

------
goblin89
I’ve been wondering how Creative Commons applies in ‘big data’-ish use cases.
Can a dataset distributed under CC BY-SA be analyzed, or used as part of the
training input for an ML model? What if a product is built on top of a model
that learned from a CC-licensed dataset? Products are rarely distributed
under CC; how far do ShareAlike & Attribution reach, by letter and by spirit?

Should there be (or does there exist) a type of license for _data_, different
from the ones typically used for software source code (MIT, GPL) and the ones
typically used for creative work (CC), that encourages innovation but gives
something back to the dataset's creator or maintainer?

~~~
edent
Those are reasonable questions. At work, we release lots of data under OGL
(Open Government Licence) which is CC compatible.

For my personal stuff, if you'd like a different license, I'm happy for you to
pay me for a more restrictive one. But if you build an ML model using my open
data, I expect that model to be released under a similar licence.

~~~
goblin89
Didn’t know about OGL, it does look suitable for this purpose.

To (partially) answer myself: contrary to what I implied, CC-BY does cover
this base if (for example) the creator of the dataset accepts a note in the
product’s “About” documentation as sufficient attribution.

------
savant_penguin
It looks like a great dataset for associating power generation with pictures
of the sky. Perhaps it could help decide the best location to place solar
panels? One big picture of the sky and you would get the power-generation
estimate for each location based solely on the image. Perhaps taking several
large pictures over the year would help decide the best location on average.
Or the location with the best worst-case scenario. Hmmm

~~~
grenoire
Camera exposure and light sensitivity are likely inconsistent. As long as this
data is not in the photographs, they are as good as nothin'.

~~~
smmnyc
I think a machine learning algorithm wouldn't care about that, because with a
large enough training data set it would start to account for that and be able
to accurately predict energy output based on the image alone.

~~~
grenoire
Regardless of how big the dataset is, the image recognition algorithm is bound
to get confused by the large differences in colour that exposure and
sensitivity result in. It will likely look for the overall gray-to-blue
gradient and estimate results from that; on the gray end of things alone,
the camera settings make a very, very big difference. You can't _really_ tell
the algorithm to ignore these and only determine the level of 'cloudiness.'

Another issue with this dataset is the overlay changing over time in text
content, font, and colour. The algorithm might overfit and think e.g. yellow
font presence means higher output simply because the output was higher during
that period. You could strip away the text, but then you're introducing
potential errors into the dataset yourself.

------
Nican
Rendering all the pictures at 30 frames per second would give about an
11-hour video.
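
A quick check of the frame arithmetic, as a sketch that assumes one image per tweet and exactly 1.2 million tweets (the function name is hypothetical). It comes out a little over 11 hours:

```python
def video_duration_hours(num_frames: int, fps: float) -> float:
    """Playback length in hours when each image is shown for one frame."""
    return num_frames / fps / 3600


# 1.2 million images at 30 frames per second = 40,000 seconds
print(round(video_duration_hours(1_200_000, 30), 1))  # 11.1
```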

~~~
manuw
Someone should do this :)

------
ijafri
While I applaud the author, it didn't entirely sit well with me... I guess it
still amounts to misuse of someone's resources, in this case Twitter's.

~~~
ascales
This comment reminds me of getting flamed on forums for hotlinking images from
some guy's website... Times have changed.

