Anna’s Archive approaching 1 petabyte (annas-archive.org)
87 points by popcalc 58 days ago | 16 comments



This project is all the more interesting given what we now know about LLM scaling laws. It will take trillions of high-quality, factual tokens to train the next generation of LLMs.

Copyright also appears to be beside the point. A petabyte of copyrighted data could be run through an LLM and rephrased/summarized in a way that retains the essential knowledge but creates a public-domain dataset for further training.
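Mechanically, the idea is simple enough to sketch. The `paraphrase()` helper below is hypothetical, standing in for whatever LLM call you would actually use; this is an illustration of the idea, not anyone's actual pipeline.

    # Rough sketch of the rephrasing idea above. `paraphrase()` is a
    # hypothetical stand-in for a real LLM call.
    def paraphrase(text: str) -> str:
        """Hypothetical LLM call that restates `text` in new words."""
        raise NotImplementedError

    def rephrase_corpus(documents, chunk_chars: int = 4000):
        """Yield paraphrased chunks that aim to keep the facts but not the original expression."""
        for doc in documents:
            for start in range(0, len(doc), chunk_chars):
                yield paraphrase(doc[start:start + chunk_chars])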


Anna here. In some sense copyright does appear to be beside the point, but for that exact reason copyright holders are becoming more defensive and trying to expand the scope of copyright. That's one of the reasons why this is such a critical window for preservation!


You probably hear this a lot, but thank you for fighting the good fight and working towards advancing humanity. Your work is incredibly important.


I am immensely curious about everything AA is doing, but I guess all my questions had better remain unanswered for the sake of the project. But perhaps, if you are interested, you might choose to answer some of them.

How are you holding up? Do you notice the persecutorial pressure on a day-to-day basis?

How is the project being financed? How is the financing being protected from the legal pressures? Just having one petabyte of data somewhere is expensive. Is KYC an issue for you?

I would think that any large enough law enforcement would be able to take down not just your domains but also your servers. Why is it that enforcement seems to focus on domains not on server seizures?

What protective measures and backup plans do you have in place? Any opsec tips to share for small-time pirates?


Anna's cousin, Amy ZLib here.

Anna is doing well. I cannot explicitly "out" her identity, but let's say she does well for herself and lives in a quite safe and stable country. Financing, again, is a topic where she is strongly positioned, but she generally asks that people respect her privacy. She does work with a small team of engineers on distributed data storage, and is very open to people who want to be mirrors. A petabyte is nothing to her, perhaps a few dozen drives. The underlying system is known as "AAFLOW"; I think of it as the next evolution of BitTorrent.

KYC is not an issue for Anna; she prefers Chick-fil-A or Popeyes.

Law enforcement typically focuses on domain seizure and not server seizures because a domain may trivially be pointed to a backup server, whereas it is non-trivial to point a million end-users to a new domain.

Anna doesn't wish to speak in great detail regarding protective measures and backup plans, but let's say there is person after person, system after system, plan within plan, at the ready if anything should happen.

With regards to opsec tips for small-time pirates: don't disclose anything personal on any social channels, or if you do, make it up. Generally it's best to have a laptop you purchased with cash in a non-traceable way, with the camera, microphone, Bluetooth, and Wi-Fi physically removed; keep it in a Faraday bag generally, and when you need internet access, find a public access point. Anna usually runs a Pringles-can antenna a ways from a coffee shop in an old concrete structure, with a modded USB Wi-Fi dongle, but she tries to be random in terms of timing and location when she hits the internet. This type of opsec has served her well.


Thank you for all that you do.

In the fine article, "10,000" is mistakenly written as "1,0000".


> It will take trillions of high quality, factual tokens to train the next generation of LLMs.

This is like saying it will take trillions of man-hours to chisel through a mountain with a spoon.

You might want to think outside of that box.


May want to change its title to the original: The critical window of shadow libraries


Yes please. The content has vastly larger scope than the current title implies.


Agreed


Other interesting news from the link: they expect improvements in OCR to make it possible to apply it to their entire library in the coming years. This would liberate an enormous amount of knowledge for easier access.


They are talking about treating OCR as lossy. I wonder about making a lossless compression algorithm for text scans based on OCR; in effect, use the OCR to predict which text will show up and how, and then encode the pixel-level differences on top of that.
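A minimal sketch of that scheme, assuming a hypothetical OCR engine (`run_ocr`) and a deterministic renderer (`render_prediction`) that encoder and decoder share; the residual is XORed on top of the prediction so reconstruction is bit-exact:

    # Sketch only: use OCR as a predictor and store the pixel-level residual,
    # so decoding is lossless. `run_ocr` and `render_prediction` are
    # hypothetical stand-ins for a real OCR engine and a shared renderer.
    import zlib
    import numpy as np

    def run_ocr(page: np.ndarray) -> str:
        """Hypothetical: return recognized text (plus layout) for a scanned page."""
        raise NotImplementedError

    def render_prediction(text: str, shape: tuple[int, int]) -> np.ndarray:
        """Hypothetical: deterministically re-render the OCR output as a bitmap."""
        raise NotImplementedError

    def compress_page(page: np.ndarray) -> tuple[bytes, bytes]:
        text = run_ocr(page)
        predicted = render_prediction(text, page.shape)
        residual = np.bitwise_xor(page, predicted)   # mostly zeros if OCR is good
        return zlib.compress(text.encode()), zlib.compress(residual.tobytes())

    def decompress_page(text_blob: bytes, residual_blob: bytes,
                        shape: tuple[int, int]) -> np.ndarray:
        text = zlib.decompress(text_blob).decode()
        predicted = render_prediction(text, shape)   # same renderer as encoder
        residual = np.frombuffer(zlib.decompress(residual_blob),
                                 dtype=predicted.dtype).reshape(shape)
        return np.bitwise_xor(predicted, residual)   # bit-exact original page

The better the OCR prediction, the sparser (and more compressible) the residual; a bad prediction only costs compression ratio, never fidelity.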


DjVu does this to some extent, identifying identical glyph bitmaps and reusing them for compression. See https://en.m.wikipedia.org/wiki/DjVu#Compression
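Roughly, the glyph-reuse trick looks like this illustrative sketch (not DjVu's actual JB2 encoder): store each distinct glyph bitmap once and reference it by index wherever it recurs.

    # Illustration of reusing identical glyph bitmaps for compression.
    import hashlib

    def encode_glyphs(glyphs):
        """glyphs: iterable of (bitmap_bytes, x, y) tuples from a segmented page."""
        shapes = {}          # bitmap hash -> index into the shape dictionary
        dictionary = []      # unique bitmaps, stored once
        placements = []      # (shape_index, x, y) per glyph occurrence
        for bitmap, x, y in glyphs:
            key = hashlib.sha1(bitmap).hexdigest()
            if key not in shapes:
                shapes[key] = len(dictionary)
                dictionary.append(bitmap)
            placements.append((shapes[key], x, y))
        return dictionary, placements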


I am assuming this will be solved this year.


A lot of their content is duplicated.

De-duping similar uploads of the same version of the same book would cut that petabyte down quite a bit.
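A minimal sketch of what exact de-duplication by content hash might look like (the directory layout is made up); note that it only catches byte-identical copies, which connects to the caveat in the reply below.

    # Sketch: group files by SHA-256 of their contents to find exact duplicates.
    import hashlib
    from pathlib import Path

    def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root: Path) -> dict[str, list[Path]]:
        seen: dict[str, list[Path]] = {}
        for path in root.rglob("*"):
            if path.is_file():
                seen.setdefault(content_hash(path), []).append(path)
        return {digest: paths for digest, paths in seen.items() if len(paths) > 1}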


A lot of seemingly duplicate content isn't actually. There are often multiple book publications/revisions that are hard to tell apart, as the name/cover stays the same. You only find out by reading the small print at the start where the publisher info/ISBN is.



