earthLings.io

Data-Driven Language and Dialect Mapping

Contact: Jonathan Dunn



Corpus of Global Language Use (CGLU)

Download the Full Corpus, By Region and Country


Region | Link to Data | Size in Words
Africa, North | Download Location | 1,193,000,000
Africa, Southern | Download Location | 26,868,000
Africa, Sub | Download Location | 5,847,670,000
America, Brazil | Download Location | 2,265,007,000
America, Central | Download Location | 8,764,686,000
America, North | Download Location | 51,858,773,000
America, South | Download Location | 21,991,710,000
Asia, Central | Download Location | 26,940,518,000
Asia, East | Download Location | 19,229,783,000
Asia, South | Download Location | 15,087,017,000
Asia, Southeast | Download Location | 21,100,289,000
Europe, East | Download Location | 65,366,344,000
Europe, Russia | Download Location | 15,363,648,000
Europe, West | Download Location | 143,566,058,000
Middle East | Download Location | 1,721,259,000
Oceania | Download Location | 1,580,569,000
TOTAL | 928 GB | 401.90 billion words
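
The regional sizes can be checked against the stated total with a short script. This is only a sketch: it assumes the rows have been copied from the table as plain text, with the link text "Download Location" sitting between the region name and the word count. The names `ROWS`, `parse_row`, and `total_words` are illustrative and are not part of any earthLings.io tooling.

```python
import re

# Rows copied from the CGLU table above, in plain-text form.
ROWS = """\
Africa, North Download Location 1,193,000,000
Africa, Southern Download Location 26,868,000
Africa, Sub Download Location 5,847,670,000
America, Brazil Download Location 2,265,007,000
America, Central Download Location 8,764,686,000
America, North Download Location 51,858,773,000
America, South Download Location 21,991,710,000
Asia, Central Download Location 26,940,518,000
Asia, East Download Location 19,229,783,000
Asia, South Download Location 15,087,017,000
Asia, Southeast Download Location 21,100,289,000
Europe, East Download Location 65,366,344,000
Europe, Russia Download Location 15,363,648,000
Europe, West Download Location 143,566,058,000
Middle East Download Location 1,721,259,000
Oceania Download Location 1,580,569,000
"""

# Assumed row shape: name, the link text "Download Location", then a count.
# Whitespace or "|" separators between the fields are both accepted.
ROW_PATTERN = re.compile(r"^(.*?)[\s|]+Download Location[\s|]+([\d,]+)\s*$")


def parse_row(line):
    """Return (region, word_count) for one row, or None if it does not match."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group(1).strip(), int(match.group(2).replace(",", ""))


def total_words(text):
    """Sum the word counts of every row that parses."""
    parsed = (parse_row(line) for line in text.splitlines())
    return sum(row[1] for row in parsed if row is not None)


if __name__ == "__main__":
    total = total_words(ROWS)
    print(f"{total:,} words ({total / 1e9:.2f} billion)")
```

Run as a script, this sums the sixteen regional rows to 401,903,199,000 words, which agrees with the 401.90 billion words given in the TOTAL row.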




GeoWAC: Population-balanced Gigaword Corpora

Download the Full Corpus, By Language


Language | Link to Data | Size in Words
(ara) Arabic | Download Location | 606,436,833
(aze) Azerbaijani | Download Location | 204,182,099
(bel) Belarusian | Download Location | 71,138,153
(bul) Bulgarian | Download Location | 1,100,000,098
(cat) Catalan | Download Location | 103,364,780
(ces) Czech | Download Location | 1,100,000,137
(dan) Danish | Download Location | 1,100,000,378
(deu) German | Download Location | 1,100,000,981
(ell) Greek | Download Location | 1,100,000,667
(eng) English (Inner Circle) | Download Location | 2,100,000,587
(eng) English (Outer Circle) | Download Location | 2,100,000,533
(eng) English (Expanding Circle) | Download Location | 2,100,000,515
(est) Estonian | Download Location | 492,028,896
(fas) Farsi | Download Location | 1,100,000,846
(fin) Finnish | Download Location | 1,100,000,463
(fra) French | Download Location | 2,100,000,209
(gle) Irish | Download Location | 21,379,703
(hbs) Serbo-Croatian (Cover) | Download Location | 1,100,000,899
(hin) Hindi | Download Location | 230,899,644
(hun) Hungarian | Download Location | 1,100,000,263
(ind) Indonesian | Download Location | 437,494,108
(isl) Icelandic | Download Location | 179,855,817
(ita) Italian | Download Location | 1,100,000,922
(jpn) Japanese | Download Location | 1,100,000,422
(kat) Georgian | Download Location | 136,947,018
(kaz) Kazakh | Download Location | 94,954,270
(kor) Korean | Download Location | 291,998,107
(lav) Latvian | Download Location | 282,206,379
(lit) Lithuanian | Download Location | 1,100,000,820
(mkd) Macedonian | Download Location | 119,179,940
(mon) Mongolian | Download Location | 120,779,559
(nld) Dutch | Download Location | 1,100,000,946
(nor) Norwegian | Download Location | 1,100,000,000
(pol) Polish | Download Location | 1,100,000,937
(por) Portuguese | Download Location | 2,097,432,740
(ron) Romanian | Download Location | 1,100,000,969
(rus) Russian | Download Location | 2,100,000,182
(slk) Slovak | Download Location | 1,100,000,005
(slv) Slovenian | Download Location | 490,154,703
(spa) Spanish | Download Location | 2,100,000,227
(sqi) Albanian | Download Location | 26,048,420
(swe) Swedish | Download Location | 1,100,000,853
(tam) Tamil | Download Location | 85,828,037
(tgl) Tagalog | Download Location | 27,887,410
(tur) Turkish | Download Location | 141,977,388
(ukr) Ukrainian | Download Location | 515,907,405
(urd) Urdu | Download Location | 45,012,456
(uzb) Uzbek | Download Location | 39,007,861
(vie) Vietnamese | Download Location | 1,100,000,214
(zho) Chinese | Download Location | 2,100,000,702
TOTAL | 42.46 billion words




Supported by the University of Canterbury and the New Zealand Institute for Language, Brain and Behaviour