earthLings.io

Data-Driven Language and Dialect Mapping


Contact: Jonathan Dunn



Where does the data come from?


Region                Twitter Corpus (v 3.1)    Web Corpus (v 4.2)
                      (Size in Words)           (Size in Words)
Africa, North         346,811,159               1,223,532,842
Africa, Southern      240,075,312               26,868,810
Africa, Sub           946,084,312               5,938,870,966
America, Brazil       200,159,382               2,265,386,107
America, Central      1,380,243,151             8,877,634,300
America, North        564,825,967               51,921,657,887
America, South        1,302,587,984             22,441,384,853
Asia, Central         372,381,129               17,069,517,255
Asia, East            628,248,007               68,777,517,769
Asia, South           868,248,059               15,147,872,671
Asia, Southeast       695,499,194               21,386,781,131
Europe, East          1,230,025,060             65,413,609,201
Europe, Russia        171,797,254               15,363,644,903
Europe, West          2,714,291,611             143,748,386,801
Middle East           648,149,379               1,721,856,657
Oceania               570,137,902               1,743,571,262
TOTAL                 12.87 billion words       443.06 billion words




What languages does the data represent?


Language                            Twitter Corpus (v 3.1)    Web Corpus (v 4.2)
                                    (Size in Words)           (Size in Words)
English (eng)                       4,306,518,416             129,080,463,148
Spanish (spa)                       2,577,870,783             38,712,525,531
French (fra)                        601,281,122               26,279,625,748
Russian (rus)                       271,258,849               25,474,041,929
Chinese (zho)                       29,214,395                24,522,994,601
German (deu)                        224,321,415               20,668,495,454
Vietnamese (vie)                    27,014,727                16,096,360,235
Japanese (jpn)                      573,584,440               15,488,637,703
Italian (ita)                       161,668,060               13,280,735,012
Persian (fas)                       107,555,066               9,829,791,454
Dutch (nld)                         141,114,872               9,551,306,518
Romanian (ron)                      37,167,292                7,788,254,970
Serbo-Croatian (inclusive: hbs)     124,043,485               7,740,431,468
Polish (pol)                        110,508,185               6,489,447,862
Czech (ces)                         63,095,941                6,339,560,010
Portuguese (por)                    514,781,174               6,220,800,794
Swedish (swe)                       83,001,558                6,019,744,909
Danish (dan)                        47,987,703                5,709,437,492
Norwegian (nor)                     8,425,521                 5,604,398,342
Slovak (slk)                        2,332,475                 5,545,879,458
Hungarian (hun)                     24,329,887                5,387,799,046
Finnish (fin)                       66,657,492                4,371,755,461
Greek (ell)                         134,574,855               4,283,905,427
Lithuanian (lit)                    6,969,768                 3,897,044,951
Bulgarian (bul)                     19,165,345                3,299,883,883
Slovenian (slv)                     48,792,642                2,348,371,951
Latvian (lav)                       29,180,409                2,100,022,409
Indonesian (ind)                    287,297,888               2,018,314,894
Estonian (est)                      14,798,095                1,959,164,461
Arabic (ara)                        525,060,430               1,307,133,523
Ukrainian (ukr)                     18,394,388                1,076,838,601




Supported by the University of Canterbury and the New Zealand Institute for Language, Brain and Behaviour