Do large language models work equally well across different populations?

This visualizes the average perplexity for otherwise comparable test corpora from different countries. Lower values indicate a better fit. The perplexity is standardized, so that 0 is average and -1 is one standard deviation better than average. Taken from this paper Dunn, et al. (2024). "Pre-Trained Language Models Represent Some Geographic Populations Better Than Others."