Download GeoWAC:
Population-balanced Gigaword Corpora by Language
Click here to download language-specific corpora.
| Language | Size in Words |
| --- | --- |
| (ara) Arabic | 606,436,833 |
| (aze) Azerbaijani | 204,182,099 |
| (bel) Belarusian | 71,138,153 |
| (bul) Bulgarian | 1,100,000,098 |
| (cat) Catalan | 103,364,780 |
| (ces) Czech | 1,100,000,137 |
| (dan) Danish | 1,100,000,378 |
| (deu) German | 1,100,000,981 |
| (ell) Greek | 1,100,000,667 |
| (eng) English (Inner Circle) | 2,100,000,587 |
| (eng) English (Outer Circle) | 2,100,000,533 |
| (eng) English (Expanding Circle) | 2,100,000,515 |
| (est) Estonian | 492,028,896 |
| (fas) Farsi | 1,100,000,846 |
| (fin) Finnish | 1,100,000,463 |
| (fra) French | 2,100,000,209 |
| (gle) Irish | 21,379,703 |
| (hbs) Serbo-Croatian (Cover) | 1,100,000,899 |
| (hin) Hindi | 230,899,644 |
| (hun) Hungarian | 1,100,000,263 |
| (ind) Indonesian | 437,494,108 |
| (isl) Icelandic | 179,855,817 |
| (ita) Italian | 1,100,000,922 |
| (jpn) Japanese | 1,100,000,422 |
| (kat) Georgian | 136,947,018 |
| (kaz) Kazakh | 94,954,270 |
| (kor) Korean | 291,998,107 |
| (lav) Latvian | 282,206,379 |
| (lit) Lithuanian | 1,100,000,820 |
| (mkd) Macedonian | 119,179,940 |
| (mon) Mongolian | 120,779,559 |
| (nld) Dutch | 1,100,000,946 |
| (nor) Norwegian | 1,100,000,000 |
| (pol) Polish | 1,100,000,937 |
| (por) Portuguese | 2,097,432,740 |
| (ron) Romanian | 1,100,000,969 |
| (rus) Russian | 2,100,000,182 |
| (slk) Slovak | 1,100,000,005 |
| (slv) Slovenian | 490,154,703 |
| (spa) Spanish | 2,100,000,227 |
| (sqi) Albanian | 26,048,420 |
| (swe) Swedish | 1,100,000,853 |
| (tam) Tamil | 85,828,037 |
| (tgl) Tagalog | 27,887,410 |
| (tur) Turkish | 141,977,388 |
| (ukr) Ukrainian | 515,907,405 |
| (urd) Urdu | 45,012,456 |
| (uzb) Uzbek | 39,007,861 |
| (vie) Vietnamese | 1,100,000,214 |
| (zho) Chinese | 2,100,000,702 |
| TOTAL | 42.46 billion words |
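
The TOTAL row can be checked against the per-language counts with a few lines of Python. The following is only a sketch: the function name `parse_word_counts` is hypothetical, and `sample` holds just three rows copied from the table above, so you would paste in the full table to reproduce the overall figure.

```python
def parse_word_counts(table_text):
    """Map each language label to its word count, skipping non-numeric rows."""
    counts = {}
    for line in table_text.splitlines():
        # Split a pipe-delimited row into its cells, dropping the outer pipes.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        label, size = cells
        digits = size.replace(",", "")
        if digits.isdigit():  # ignores the header row and the TOTAL row
            counts[label] = int(digits)
    return counts


# Three rows copied from the table, plus the header and TOTAL rows (illustration only).
sample = """\
| Language | Size in Words |
| (gle) Irish | 21,379,703 |
| (sqi) Albanian | 26,048,420 |
| (tgl) Tagalog | 27,887,410 |
| TOTAL | 42.46 billion words |
"""

counts = parse_word_counts(sample)
total = sum(counts.values())
print(f"{len(counts)} rows, {total:,} words ({total / 1e9:.2f} billion)")
```

Rows whose second cell is not a plain integer are skipped, so the header and the TOTAL row never enter the sum.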