earthLings.io

Data-Driven Language and Dialect Mapping


Contact: Jonathan Dunn



Where does the data come from?


Region                 Twitter Corpus            Web Corpus
                      (Size in Words)       (Size in Words)
Africa, North             336,430,937         1,193,000,000
Africa, Southern          234,186,686            26,868,000
Africa, Sub               919,630,837         5,847,670,000
America, Brazil           197,062,498         2,265,007,000
America, Central        1,341,757,648         8,764,686,000
America, North            556,578,605        51,858,773,000
America, South          1,267,643,786        21,991,710,000
Asia, Central             361,623,429        26,940,518,000
Asia, East                613,865,646        19,229,783,000
Asia, South               847,288,502        15,087,017,000
Asia, Southeast           678,541,905        21,100,289,000
Europe, East            1,194,631,973        65,366,344,000
Europe, Russia            169,182,881        15,363,648,000
Europe, West            2,641,265,299       143,566,058,000
Middle East               631,435,282         1,721,259,000
Oceania                   555,409,999         1,580,569,000
TOTAL             12.54 billion words  401.90 billion words




Supported by the University of Canterbury and the New Zealand Institute for Language, Brain and Behaviour