earthLings.io

Data-Driven Language and Dialect Mapping


Contact
Jonathan Dunn



Corpus of Global Language Use (CGLU) (v4.2)*

Download the Full Corpus, By Region and Country


Region            Link to Data         Size in Words
Africa, North     Download Location    1,223,532,842
Africa, Southern  Download Location       26,868,810
Africa, Sub       Download Location    5,938,870,966
America, Brazil   Download Location    2,265,386,107
America, Central  Download Location    8,877,634,300
America, North    Download Location   51,921,657,887
America, South    Download Location   22,441,384,853
Asia, Central     Download Location   17,069,517,255
Asia, East        Download Location   68,777,517,769
Asia, South       Download Location   15,147,872,671
Asia, Southeast   Download Location   21,386,781,131
Europe, East      Download Location   65,413,609,201
Europe, Russia    Download Location   15,363,644,903
Europe, West      Download Location  143,748,386,801
Middle East       Download Location    1,721,856,657
Oceania           Download Location    1,743,571,262
TOTAL             933 GB               443.06 billion words

* v4.2 includes character segmentation for Chinese and Japanese.





GeoWAC: Population-balanced Gigaword Corpora

Download the Full Corpus Below, By Language

Country-specific corpora are also available for download.


Language                          Link to Data       Size in Words
(ara) Arabic                      Download Location    606,436,833
(aze) Azerbaijani                 Download Location    204,182,099
(bel) Belarusian                  Download Location     71,138,153
(bul) Bulgarian                   Download Location  1,100,000,098
(cat) Catalan                     Download Location    103,364,780
(ces) Czech                       Download Location  1,100,000,137
(dan) Danish                      Download Location  1,100,000,378
(deu) German                      Download Location  1,100,000,981
(ell) Greek                       Download Location  1,100,000,667
(eng) English (Inner Circle)      Download Location  2,100,000,587
(eng) English (Outer Circle)      Download Location  2,100,000,533
(eng) English (Expanding Circle)  Download Location  2,100,000,515
(est) Estonian                    Download Location    492,028,896
(fas) Farsi                       Download Location  1,100,000,846
(fin) Finnish                     Download Location  1,100,000,463
(fra) French                      Download Location  2,100,000,209
(gle) Irish                       Download Location     21,379,703
(hbs) Serbo-Croatian (Cover)      Download Location  1,100,000,899
(hin) Hindi                       Download Location    230,899,644
(hun) Hungarian                   Download Location  1,100,000,263
(ind) Indonesian                  Download Location    437,494,108
(isl) Icelandic                   Download Location    179,855,817
(ita) Italian                     Download Location  1,100,000,922
(jpn) Japanese                    Download Location  1,100,000,422
(kat) Georgian                    Download Location    136,947,018
(kaz) Kazakh                      Download Location     94,954,270
(kor) Korean                      Download Location    291,998,107
(lav) Latvian                     Download Location    282,206,379
(lit) Lithuanian                  Download Location  1,100,000,820
(mkd) Macedonian                  Download Location    119,179,940
(mon) Mongolian                   Download Location    120,779,559
(nld) Dutch                       Download Location  1,100,000,946
(nor) Norwegian                   Download Location  1,100,000,000
(pol) Polish                      Download Location  1,100,000,937
(por) Portuguese                  Download Location  2,097,432,740
(ron) Romanian                    Download Location  1,100,000,969
(rus) Russian                     Download Location  2,100,000,182
(slk) Slovak                      Download Location  1,100,000,005
(slv) Slovenian                   Download Location    490,154,703
(spa) Spanish                     Download Location  2,100,000,227
(sqi) Albanian                    Download Location     26,048,420
(swe) Swedish                     Download Location  1,100,000,853
(tam) Tamil                       Download Location     85,828,037
(tgl) Tagalog                     Download Location     27,887,410
(tur) Turkish                     Download Location    141,977,388
(ukr) Ukrainian                   Download Location    515,907,405
(urd) Urdu                        Download Location     45,012,456
(uzb) Uzbek                       Download Location     39,007,861
(vie) Vietnamese                  Download Location  1,100,000,214
(zho) Chinese                     Download Location  2,100,000,702
TOTAL                                                42.46 billion words
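
For readers who want to tally sizes for a subset of languages, the sketch below parses rows copied from the listing above and sums their word counts. It assumes only that the word count is the last whitespace-separated field of each row and that the language code is the first. The helper name `parse_size` and the three sample rows are chosen for illustration.

```python
def parse_size(row: str) -> int:
    """Return the word count held in the last field of a listing row."""
    return int(row.split()[-1].replace(",", ""))


# Three rows copied from the listing; any subset of rows works the same way.
rows = [
    "(gle) Irish Download Location 21,379,703",
    "(sqi) Albanian Download Location 26,048,420",
    "(tgl) Tagalog Download Location 27,887,410",
]

# Map each language code (first field) to its size in words.
sizes = {row.split()[0]: parse_size(row) for row in rows}
total = sum(sizes.values())

if __name__ == "__main__":
    print(f"{total:,} words ({total / 1e9:.2f} billion)")
```

Because only the first and last fields are read, the parser is indifferent to how many spaces separate the columns.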




Supported by the University of Canterbury and the New Zealand Institute for Language, Brain and Behaviour