earthLings.io

Data-Driven Language and Dialect Mapping


Contact: Jonathan Dunn



Where does the data come from?


Region | Twitter Corpus (Size in Words) | Web Corpus (Size in Words)
Africa, North | 203,867,000 | 1,193,000,000
Africa, Southern | 159,807,000 | 26,868,000
Africa, Sub | 571,644,000 | 5,847,670,000
America, Brazil | 156,705,000 | 2,265,007,000
America, Central | 852,793,000 | 8,764,686,000
America, North | 452,263,000 | 51,858,773,000
America, South | 824,502,000 | 21,991,710,000
Asia, Central | 220,106,000 | 26,940,518,000
Asia, East | 198,177,000 | 19,229,783,000
Asia, South | 580,221,000 | 15,087,017,000
Asia, Southeast | 443,258,000 | 21,100,289,000
Europe, East | 748,654,000 | 65,366,344,000
Europe, Russia | 135,778,000 | 15,363,648,000
Europe, West | 1,703,436,000 | 143,566,058,000
Middle East | 421,926,000 | 1,721,259,000
Oceania | 372,623,000 | 1,580,569,000
TOTAL | 8.04 billion words | 401.90 billion words
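
The TOTAL row can be checked against the regional rows with a short script. This is a minimal sketch rather than part of the earthLings.io tooling: the names CORPUS_SIZES and column_totals are illustrative assumptions, and the figures are copied by hand from the table above.

```python
# Regional corpus sizes in words, copied from the table above.
# Mapping: region -> (Twitter corpus, web corpus).
# CORPUS_SIZES and column_totals are hypothetical names used for illustration.
CORPUS_SIZES = {
    "Africa, North": (203_867_000, 1_193_000_000),
    "Africa, Southern": (159_807_000, 26_868_000),
    "Africa, Sub": (571_644_000, 5_847_670_000),
    "America, Brazil": (156_705_000, 2_265_007_000),
    "America, Central": (852_793_000, 8_764_686_000),
    "America, North": (452_263_000, 51_858_773_000),
    "America, South": (824_502_000, 21_991_710_000),
    "Asia, Central": (220_106_000, 26_940_518_000),
    "Asia, East": (198_177_000, 19_229_783_000),
    "Asia, South": (580_221_000, 15_087_017_000),
    "Asia, Southeast": (443_258_000, 21_100_289_000),
    "Europe, East": (748_654_000, 65_366_344_000),
    "Europe, Russia": (135_778_000, 15_363_648_000),
    "Europe, West": (1_703_436_000, 143_566_058_000),
    "Middle East": (421_926_000, 1_721_259_000),
    "Oceania": (372_623_000, 1_580_569_000),
}


def column_totals(sizes):
    """Return (twitter_total, web_total) summed over all regions."""
    twitter_total = sum(twitter for twitter, _ in sizes.values())
    web_total = sum(web for _, web in sizes.values())
    return twitter_total, web_total


twitter_total, web_total = column_totals(CORPUS_SIZES)
print(f"Twitter corpus: {twitter_total:,} words")
print(f"Web corpus:     {web_total:,} words")
```

Summing the rows this way gives roughly 8.046 billion words for the Twitter corpus and 401.90 billion words for the web corpus, so the published Twitter total of 8.04 billion appears to be truncated rather than rounded.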




Supported by the University of Canterbury and the New Zealand Institute for Language, Brain and Behaviour