The earthLings project is building a global and multi-lingual computational linguistic atlas. A linguistic atlas involves mapping both languages (who uses what language where) and dialects (who uses which variants where). A computational linguistic atlas does this with an automated and reproducible analysis of very large digital corpora (~423 billion words from the web and ~20 billion words from social media). While traditional approaches to language and dialect mapping require asking people about their language behaviours, a computational approach observes and models language behaviours on a much larger scale.

This is a global atlas because it covers most of the countries in the world across 10k cities. This is a multi-lingual atlas because it includes 464 different languages. This is a comprehensive atlas because it models variation across entire grammars, whereas previous atlases were restricted to a few dozen pre-selected features.
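As a rough illustration of the language-mapping side (who uses what language where), the sketch below aggregates word counts per language within each country and converts them to shares. It is a minimal sketch under stated assumptions, not the project's actual pipeline: the function name `language_shares`, the `(country, language, n_words)` record layout, and the toy counts are all hypothetical, and it assumes each corpus sample has already been assigned a country and a language label.

```python
from collections import defaultdict


def language_shares(samples):
    """Return {country: {language: share of words}} from labelled samples.

    `samples` is an iterable of (country, language, n_words) tuples.
    This record layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for country, language, n_words in samples:
        totals[country][language] += n_words

    shares = {}
    for country, by_language in totals.items():
        country_total = sum(by_language.values())
        if country_total == 0:
            continue  # skip countries with no observed words
        shares[country] = {
            language: count / country_total
            for language, count in by_language.items()
        }
    return shares


# Toy, made-up counts (not real data from the corpora).
samples = [
    ("NZ", "eng", 900),
    ("NZ", "mri", 100),
    ("CH", "deu", 600),
    ("CH", "fra", 300),
    ("CH", "ita", 100),
]
shares = language_shares(samples)
```

The same aggregation could be keyed by city rather than country; the point is only that each map cell is derived from observed usage rather than from survey responses.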

Papers, Language Mapping

Dunn, J. (2024). “Validating and Exploring Large Geographic Corpora.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024).

Dunn, J. & Edwards-Brown, L. (2024). “Geographically-Informed Language Identification.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024).

Dunn, J. & Nijhof, W. (2022). “Language Identification for Austronesian Languages.” In Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022). 6530–6539.

Dunn, J. (2021). “Representations of Language Varieties Are Reliable Given Corpus Similarity Measures.” In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (EACL 2021).

Dunn, J. (2020). “Mapping Languages: The Corpus of Global Language Use.” Language Resources and Evaluation. doi: 10.1007/s10579-020-09489-2

Dunn, J. & Adams, B. (2020). “Geographically-Balanced Gigaword Corpora for 50 Language Varieties.” In Proceedings of the Language Resources and Evaluation Conference (LREC 2020).

Dunn, J.; Coupe, T.; & Adams, B. (2020). “Measuring Linguistic Diversity During COVID-19.” In Proceedings of the 4th Workshop on NLP and Computational Social Science (NLP+CSS).

Dunn, J. & Adams, B. (2019). “Mapping Languages and Demographics with Georeferenced Corpora.” In Proceedings of GeoComputation 2019. doi: 10.17608/k6.auckland.9869252.v2

Papers, Dialect Modelling

Dunn, J.; Adams, B.; & Tayyar Madabushi, H. (2024). “Pre-Trained Language Models Represent Some Geographic Populations Better Than Others.” In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024).

Dunn, J. (2023). “Syntactic variation across the grammar: modelling a complex adaptive system.” Frontiers in Complex Systems.