

What is the estimated size, in GB, of the text data on wikipedia.com? - frankydp


======
christophe971
27 GB with markup: <http://en.wikipedia.org/wiki/Wikipedia:Database_download>

Probably 1/10th of that is real text data, so no more than 3 GB.

~~~
alok-g
90% sounds like too high an overhead. At 435 words per article [1] and
3,553,578 articles (English only) [2], it has ~1.5 billion words. At 5.1 letters
per word plus one for the space, and one byte per character, it comes out to
about 9.4 GB. Against 27 GB uncompressed, even that sounds like too much
overhead for markup, but maybe.

[1] <http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons>

[2] <http://en.wikipedia.org/wiki/Special:Statistics>
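
For anyone who wants to rerun the estimate above with fresher numbers, here is a minimal Python sketch. The inputs are the figures quoted in this comment; the function name, the one-byte-per-character assumption (plain ASCII text), and the 1 GB = 10^9 bytes convention are my own assumptions, not anything published by Wikipedia.

```python
# Back-of-envelope estimate of English Wikipedia's plain-text size.
# All inputs are the figures quoted in the comment above.
WORDS_PER_ARTICLE = 435      # average words per article, from [1]
ARTICLES = 3_553_578         # English-only article count, from [2]
LETTERS_PER_WORD = 5.1       # average word length
DUMP_GB = 27                 # uncompressed dump size, markup included


def estimate_text_gb(words_per_article, articles, letters_per_word):
    """Return (total_words, size_in_gb) for plain text.

    Assumption: one byte per character, plus one space after every word,
    and 1 GB = 10**9 bytes.
    """
    total_words = words_per_article * articles
    total_bytes = total_words * (letters_per_word + 1)
    return total_words, total_bytes / 1e9


total_words, text_gb = estimate_text_gb(WORDS_PER_ARTICLE, ARTICLES, LETTERS_PER_WORD)
text_share = text_gb / DUMP_GB  # fraction of the dump that would be plain text

print(f"words: {total_words:,}")
print(f"plain text: {text_gb:.1f} GB")
print(f"share of the {DUMP_GB} GB dump: {text_share:.0%}")
```

With these inputs the text comes to roughly a third of the 27 GB dump, so the implied markup overhead is about two thirds rather than 90%.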

------
octopus
The English version is approximately 14 GB archived; see this link for more
information:

<http://dumps.wikimedia.org/>

~~~
frankydp
Thanks for this link. I always thought Wikipedia should sell annual Wikipedia
DVDs, something like a Britannica.

EDIT: typo

