About earthLings

The earthLings project is building a global, multi-lingual computational linguistic atlas. A linguistic atlas maps both languages (who uses what language where) and dialects (who uses which variants where). A computational linguistic atlas does this through automated, reproducible analysis of very large digital corpora (~423 billion words from the web and ~14 billion words from social media). Traditional approaches to language and dialect mapping rely on asking people about their language behaviours; a computational approach instead observes and models those behaviours on a large scale. This is a global atlas because it covers most of the countries in the world, across around 10,000 cities. This is a multi-lingual atlas because it includes 464 different languages. This is a comprehensive atlas because it models variation across entire grammars, whereas previous atlases were restricted to a few dozen pre-selected features.
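As a rough illustration of what "who uses what language where" can mean computationally, the sketch below aggregates word counts by country and language and turns them into per-country shares. It is a minimal example under stated assumptions, not the project's actual pipeline: the record layout, the `language_shares` function, and the sample figures are all hypothetical and invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (country code, language code, word count).
# These values are invented for illustration and are not earthLings data.
samples = [
    ("NZ", "eng", 900),
    ("NZ", "mri", 100),
    ("CH", "deu", 600),
    ("CH", "fra", 300),
    ("CH", "ita", 100),
]


def language_shares(records):
    """Return {country: {language: share of that country's words}}."""
    totals = defaultdict(int)                       # words per country
    counts = defaultdict(lambda: defaultdict(int))  # words per country and language
    for country, language, n_words in records:
        totals[country] += n_words
        counts[country][language] += n_words
    return {
        country: {lang: n / totals[country] for lang, n in langs.items()}
        for country, langs in counts.items()
    }


shares = language_shares(samples)
for country, langs in sorted(shares.items()):
    print(country, {lang: round(share, 2) for lang, share in langs.items()})
```

In a real atlas the language label on each sample would come from an automated language identification step, and the counts would be drawn from the georeferenced corpora rather than typed in by hand.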

Intro Presentation (PDF)

Papers

Dunn, J. (2020). "Mapping Languages: The Corpus of Global Language Use." Language Resources and Evaluation. doi: 10.1007/s10579-020-09489-2

Dunn, J. & Adams, B. (2020). "Geographically-Balanced Gigaword Corpora for 50 Language Varieties." In Proceedings of the Language Resources and Evaluation Conference (LREC 2020).

Dunn, J. & Adams, B. (2019). "Mapping Languages and Demographics with Georeferenced Corpora." In Proceedings of GeoComputation 19. doi: 10.17608/k6.auckland.9869252.v2

Dunn, J. (2019). "Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology." Frontiers in Artificial Intelligence: Language and Computation. doi: 10.3389/frai.2019.00015

Dunn, J. (2019). "Modeling Global Syntactic Variation in English Using Dialect Classification." In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (NAACL 19). doi: 10.18653/v1/W19-1405

Dunn, J. (2018). "Finding Variants for Construction-Based Dialectometry: A Corpus-Based Approach to Regional CxGs." Cognitive Linguistics, 29(2): 275-311. doi: 10.1515/cog-2017-0029


Code

Language Identification (GitHub)

Data Collection and Cleaning (GitHub)

Mapping Site (GitHub)