Download GeoWAC:
Population-balanced Gigaword Corpora by Language

Click here to download language-specific corpora.

Language Size in Words
(ara) Arabic 606,436,833
(aze) Azerbaijani 204,182,099
(bel) Belarusian 71,138,153
(bul) Bulgarian 1,100,000,098
(cat) Catalan 103,364,780
(ces) Czech 1,100,000,137
(dan) Danish 1,100,000,378
(deu) German 1,100,000,981
(ell) Greek 1,100,000,667
(eng) English (Inner Circle) 2,100,000,587
(eng) English (Outer Circle) 2,100,000,533
(eng) English (Expanding Circle) 2,100,000,515
(est) Estonian 492,028,896
(fas) Farsi 1,100,000,846
(fin) Finnish 1,100,000,463
(fra) French 2,100,000,209
(gle) Irish 21,379,703
(hbs) Serbo-Croatian (Cover) 1,100,000,899
(hin) Hindi 230,899,644
(hun) Hungarian 1,100,000,263
(ind) Indonesian 437,494,108
(isl) Icelandic 179,855,817
(ita) Italian 1,100,000,922
(jpn) Japanese 1,100,000,422
(kat) Georgian 136,947,018
(kaz) Kazakh 94,954,270
(kor) Korean 291,998,107
(lav) Latvian 282,206,379
(lit) Lithuanian 1,100,000,820
(mkd) Macedonian 119,179,940
(mon) Mongolian 120,779,559
(nld) Dutch 1,100,000,946
(nor) Norwegian 1,100,000,000
(pol) Polish 1,100,000,937
(por) Portuguese 2,097,432,740
(ron) Romanian 1,100,000,969
(rus) Russian 2,100,000,182
(slk) Slovak 1,100,000,005
(slv) Slovenian 490,154,703
(spa) Spanish 2,100,000,227
(sqi) Albanian 26,048,420
(swe) Swedish 1,100,000,853
(tam) Tamil 85,828,037
(tgl) Tagalog 27,887,410
(tur) Turkish 141,977,388
(ukr) Ukrainian 515,907,405
(urd) Urdu 45,012,456
(uzb) Uzbek 39,007,861
(vie) Vietnamese 1,100,000,214
(zho) Chinese 2,100,000,702
TOTAL 42.46 billion words
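
The per-language sizes above can be summed to check them against the stated total. Below is a minimal sketch, assuming each row is a single plain-text line of the form "(code) Language name word-count", as in this listing. The names ROW, parse_row, and total_words are hypothetical helpers introduced for illustration and are not part of GeoWAC. Only a few rows are copied here, so the printed sum covers that sample, not the whole table.

```python
import re

# Pattern for one table row: "(code) Language name 1,234,567".
# The name is matched lazily so that parenthesised qualifiers such as
# "(Inner Circle)" stay in the name and only the final number is the count.
ROW = re.compile(r"^\((?P<code>[a-z]{3})\)\s+(?P<name>.+?)\s+(?P<words>[\d,]+)$")


def parse_row(line):
    """Return (code, name, word_count) for one row, or None if it is not a row."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return m["code"], m["name"], int(m["words"].replace(",", ""))


def total_words(lines):
    """Sum the word counts of every line that parses as a table row."""
    return sum(row[2] for row in map(parse_row, lines) if row is not None)


# A few rows copied from the table above (not the full list).
# The TOTAL line is included to show that non-row lines are skipped.
sample = [
    "(gle) Irish 21,379,703",
    "(eng) English (Inner Circle) 2,100,000,587",
    "(hbs) Serbo-Croatian (Cover) 1,100,000,899",
    "TOTAL 42.46 billion words",
]

print(f"{total_words(sample):,} words in the sample rows")
```

Passing all rows of the listing to total_words and dividing by 1e9 gives a figure that can be compared with the TOTAL line.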