Download GeoWAC:
Population-balanced Gigaword Corpora by Language
Click here to download language-specific corpora.
Language | Size in Words |
(ara) Arabic | 606,436,833 |
(aze) Azerbaijani | 204,182,099 |
(bel) Belarusian | 71,138,153 |
(bul) Bulgarian | 1,100,000,098 |
(cat) Catalan | 103,364,780 |
(ces) Czech | 1,100,000,137 |
(dan) Danish | 1,100,000,378 |
(deu) German | 1,100,000,981 |
(ell) Greek | 1,100,000,667 |
(eng) English (Inner Circle) | 2,100,000,587 |
(eng) English (Outer Circle) | 2,100,000,533 |
(eng) English (Expanding Circle) | 2,100,000,515 |
(est) Estonian | 492,028,896 |
(fas) Farsi | 1,100,000,846 |
(fin) Finnish | 1,100,000,463 |
(fra) French | 2,100,000,209 |
(gle) Irish | 21,379,703 |
(hbs) Serbo-Croatian (Cover) | 1,100,000,899 |
(hin) Hindi | 230,899,644 |
(hun) Hungarian | 1,100,000,263 |
(ind) Indonesian | 437,494,108 |
(isl) Icelandic | 179,855,817 |
(ita) Italian | 1,100,000,922 |
(jpn) Japanese | 1,100,000,422 |
(kat) Georgian | 136,947,018 |
(kaz) Kazakh | 94,954,270 |
(kor) Korean | 291,998,107 |
(lav) Latvian | 282,206,379 |
(lit) Lithuanian | 1,100,000,820 |
(mkd) Macedonian | 119,179,940 |
(mon) Mongolian | 120,779,559 |
(nld) Dutch | 1,100,000,946 |
(nor) Norwegian | 1,100,000,000 |
(pol) Polish | 1,100,000,937 |
(por) Portuguese | 2,097,432,740 |
(ron) Romanian | 1,100,000,969 |
(rus) Russian | 2,100,000,182 |
(slk) Slovak | 1,100,000,005 |
(slv) Slovenian | 490,154,703 |
(spa) Spanish | 2,100,000,227 |
(sqi) Albanian | 26,048,420 |
(swe) Swedish | 1,100,000,853 |
(tam) Tamil | 85,828,037 |
(tgl) Tagalog | 27,887,410 |
(tur) Turkish | 141,977,388 |
(ukr) Ukrainian | 515,907,405 |
(urd) Urdu | 45,012,456 |
(uzb) Uzbek | 39,007,861 |
(vie) Vietnamese | 1,100,000,214 |
(zho) Chinese | 2,100,000,702 |
TOTAL | 42.46 billion words |
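
The per-language sizes above can be tallied to check the stated total. The following is a minimal sketch, assuming the rows are available as plain text in the same `(code) Language | words |` layout as the table; the names `SAMPLE_ROWS`, `parse_rows`, and `total_words` are hypothetical and are not part of GeoWAC.

```python
import re

# Three rows copied from the table above. Paste in the remaining rows
# to reproduce the full total.
SAMPLE_ROWS = """\
(gle) Irish | 21,379,703 |
(sqi) Albanian | 26,048,420 |
(tgl) Tagalog | 27,887,410 |
"""

# Matches "(code) Name | 1,234,567 |": a three-letter code in parentheses,
# a language name, and a comma-grouped word count.
ROW = re.compile(
    r"^\((?P<code>[a-z]{3})\)\s+(?P<name>[^|]+?)\s*\|\s*(?P<words>[\d,]+)\s*\|?\s*$"
)


def parse_rows(text):
    """Return (code, name, word_count) tuples for every data row in text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # skips the header, the TOTAL row, and blank lines
        words = int(match["words"].replace(",", ""))
        rows.append((match["code"], match["name"], words))
    return rows


def total_words(rows):
    """Sum the word counts of the parsed rows."""
    return sum(words for _, _, words in rows)


if __name__ == "__main__":
    rows = parse_rows(SAMPLE_ROWS)
    print(f"{len(rows)} corpora, {total_words(rows):,} words")
```

Because the pattern requires a parenthesized code at the start of the line, the header and `TOTAL` rows are ignored, while names that contain parentheses, such as "English (Inner Circle)", are kept whole.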