About earthLings

The earthLings project is building a global and multi-lingual computational linguistic atlas. A linguistic atlas involves mapping both languages (who uses what language where) and dialects (who uses which variants where). A computational linguistic atlas does this with an automated and reproducible analysis of very large digital corpora (~423 billion words from the web and ~20 billion words from social media). While traditional approaches to language and dialect mapping require asking people about their language behaviours, a computational approach observes and models language behaviours on a much larger scale.
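As a rough illustration of the language-mapping side (who uses what language where), the sketch below aggregates language-labelled text samples by country and reports each language's share of the words observed there. It is only a minimal sketch, not the project's pipeline: the records, the country and language codes, and the function name `language_shares` are all hypothetical and invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (country code, language code, word count).
# These values are invented for illustration and are not project data.
SAMPLES = [
    ("NZ", "eng", 900),
    ("NZ", "mri", 100),
    ("CH", "deu", 600),
    ("CH", "fra", 300),
    ("CH", "ita", 100),
]


def language_shares(samples):
    """Return {country: {language: share of that country's words}}."""
    totals = defaultdict(Counter)
    for country, language, n_words in samples:
        totals[country][language] += n_words

    shares = {}
    for country, counts in totals.items():
        country_total = sum(counts.values())
        shares[country] = {
            language: n_words / country_total
            for language, n_words in counts.items()
        }
    return shares


if __name__ == "__main__":
    # Print one line per country with its language distribution.
    for country, distribution in sorted(language_shares(SAMPLES).items()):
        print(country, distribution)
```

A real atlas would first need language identification and geolocation for each sample; here both are assumed to be given.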

This is a global atlas because it covers most of the countries in the world, across 10,000 cities. It is a multi-lingual atlas because it includes 464 different languages. It is a comprehensive atlas because it models variation across entire grammars, whereas previous atlases were restricted to a few dozen pre-selected features.

Papers, Language Mapping

Dunn, J. (2024). “Validating and Exploring Large Geographic Corpora.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024).

Dunn, J. & Edwards-Brown, L. (2024). “Geographically-Informed Language Identification.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024).

Dunn, J. & Nijhof, W. (2022). “Language Identification for Austronesian Languages.” In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022). 6530–6539.

Dunn, J. (2021). "Representations of Language Varieties Are Reliable Given Corpus Similarity Measures." In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (EACL 21).

Dunn, J. (2020). "Mapping Languages: The Corpus of Global Language Use." Language Resources and Evaluation. doi: 10.1007/s10579-020-09489-2

Dunn, J. & Adams, B. (2020). "Geographically-Balanced Gigaword Corpora for 50 Language Varieties." In Proceedings of the Language Resources and Evaluation Conference (LREC 2020).

Dunn, J.; Coupe, T.; & Adams, B. (2020). "Measuring Linguistic Diversity During COVID-19." In Proceedings of the 4th Workshop on NLP and Computational Social Science (NLP+CSS).

Dunn, J. & Adams, B. (2019). "Mapping Languages and Demographics with Georeferenced Corpora." In Proceedings of GeoComputation 19. doi: 10.17608/k6.auckland.9869252.v2

Papers, Dialect Modelling

Dunn, J.; Adams, B.; & Tayyar Madabushi, H. (2024). “Pre-Trained Language Models Represent Some Geographic Populations Better Than Others.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024).

Dunn, J. (2023). “Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System.” Frontiers in Complex Systems.

Dunn, J. (2023). “Variation and Instability in Dialect-Based Embedding Spaces.” In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects.

Dunn, J. & Wong, S. (2022). "Stability of Syntactic Dialect Classification Over Space and Time." In Proceedings of the International Conference on Computational Linguistics (COLING 2022).

Dunn, J. (2019). "Global Syntactic Variation in Seven Languages: Towards a Computational Dialectology." Frontiers in Artificial Intelligence: Language and Computation. doi: 10.3389/frai.2019.00015

Dunn, J. (2019). "Modeling Global Syntactic Variation in English Using Dialect Classification." In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (NAACL 19). doi: 10.18653/v1/W19-1405

Dunn, J. (2018). "Finding Variants for Construction-Based Dialectometry: A Corpus-Based Approach to Regional CxGs." Cognitive Linguistics, 29(2): 275-311. doi: 10.1515/cog-2017-0029

Papers, Grammatical Features

Dunn, J. (2024). Computational Construction Grammar: A Usage-Based Approach. Cambridge University Press. Elements in Cognitive Linguistics.

Dunn, J. (2023). “Exploring the Constructicon: Linguistic Analysis of a Computational CxG.” In Proceedings of the Workshop on CxGs and NLP @ the Georgetown University Round Table on Linguistics / SyntaxFest.

Dunn, J. (2022). “Exposure and Emergence in Usage-Based Grammar: Computational Experiments in 35 Languages.” Cognitive Linguistics.

Dunn, J. & Nini, A. (2021). “Production vs Perception: The Role of Individuality in Usage-Based Grammar Induction.” In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (NAACL 2021). Association for Computational Linguistics. 149-159.

Dunn, J. & Tayyar Madabushi, H. (2021). “Learned Construction Grammars Converge Across Registers Given Increased Exposure.” In Proceedings of the Conference on Computational Natural Language Learning (CoNLL 2021). Association for Computational Linguistics.

Dunn, J. (2019). "Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar." In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics (NAACL 19). doi: 10.18653/v1/W19-2913

Dunn, J. (2018). "Modeling the Complexity and Descriptive Adequacy of Construction Grammars." In Proceedings of the Society for Computation in Linguistics. doi: 10.7275/R59P2ZTB

Dunn, J. (2018). "Multi-Unit Directional Measures of Association: Moving Beyond Pairs of Words." International Journal of Corpus Linguistics, 23(2): 183-215. doi: 10.1075/ijcl.16098.dun

Dunn, J. (2017). "Computational Learning of Construction Grammars." Language and Cognition, 9(2): 254-292. doi: 10.1017/langcog.2016.7


Code

Language Identification (GitHub)

Data Collection and Cleaning (GitHub)

Mapping Site (GitHub)

Computational Construction Grammar (PyPI): pip install c2xg