Pretraining Language Models for Diachronic Linguistic Change Discovery

[Arxiv](https://arxiv.org/abs/2504.05523) Accepted to _Findings of EACL_. We use compute- and data-efficient methods to pretrain a battery of historically-specific models on a relatively limited token budget. We show that this approach leaks far less temporally-inappropriate information than finetuning an existing LLM, while retaining adequate performance for lexical change detection. Our approach can be applied to any corpus with clear delineations between collections of works. Code is available on [GitHub](https://github.com/comp-int-hum/historical-perspectival-lm); models and data are on [HuggingFace](https://huggingface.co/Hplm).
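The core idea is to partition a corpus into date-delineated slices and pretrain one model per slice. Below is a minimal sketch of that slicing step, assuming a corpus of `(year, text)` records; the slice boundaries, record format, and helper names are illustrative, not the repository's actual API.

```python
# Sketch: group texts into temporal slices, each of which would seed
# one historically-specific pretraining run. Boundaries are hypothetical.
from collections import defaultdict

# Hypothetical corpus: (publication_year, text) pairs.
corpus = [
    (1750, "..."),
    (1805, "..."),
    (1860, "..."),
]

# Illustrative slice boundaries; any delineation between collections works.
SLICES = [(1700, 1800), (1800, 1850), (1850, 1900)]

def slice_corpus(records, slices):
    """Bucket texts by temporal slice; one model is pretrained per bucket."""
    buckets = defaultdict(list)
    for year, text in records:
        for start, end in slices:
            if start <= year < end:
                buckets[(start, end)].append(text)
                break
    return buckets

for (start, end), texts in slice_corpus(corpus, SLICES).items():
    print(f"{start}-{end}: {len(texts)} documents for pretraining")
```

Because each model only ever sees text from its own slice, later-period vocabulary and facts cannot leak into an earlier-period model, which is the advantage over finetuning a modern LLM.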