The intent of the YLMP Special Events is to raise social awareness and encourage cultural sensitivity. This year, Aleksandra Tomaszewska (Institute of Computer Science, Polish Academy of Sciences) will deliver a lecture titled “Corpus Data for Evidence-Based Research and Sovereign AI.”
Abstract: In this lecture, I will present a twofold perspective on language data: first, as a fundamental resource for linguistic research, and second, as a cornerstone of sovereign AI development. I will discuss how recent technological advances have transformed evidence-based language studies, broadening the scope of linguistic inquiry, particularly through corpus methods and NLP tools increasingly accessible to non-programmers. I will then demonstrate how high-quality, locally governed data underpin AI systems tailored to specific cultural and linguistic needs, emphasizing the critical importance of maintaining full control over data quality, composition, and safety during all stages of model development. Drawing on practical insights from building the Polish Large Language Model (PLLuM) ecosystem, I will illustrate a data-centric approach to creating language models, including manual dataset assembly and navigating challenges such as withdrawn consent. Additionally, I will address the issue of biases embedded in language data specific to local contexts. Ultimately, by highlighting both opportunities and challenges, I will argue that attention to resource management, institutional cooperation, data quality, and linguistic biases is essential for achieving truly sovereign AI.

Aleksandra Tomaszewska (Institute of Computer Science, Polish Academy of Sciences)
An expert in corpus linguistics and Natural Language Processing applications, she is a linguist and researcher on the Linguistic Engineering Team at the Institute of Computer Science, Polish Academy of Sciences. She analyzes and co-develops language corpora and NLP tools designed to support individuals without programming experience in exploring phenomena in authentic language use. She has co-created local language models and coordinated the development of a Polish-language dataset for the Polish AI ecosystem, the PLLuM (Polish Large Language Model) family of models. She also serves as a pro bono expert in the Artificial Intelligence Working Group (Data for AI section) at Poland’s Ministry of Digital Affairs. Her research interests include corpus methods and resources, data-centric sovereign AI, lexical innovation, and biases in data and language models. She actively participates in Polish and international research projects and is the author of publications and talks for researchers, students, and general audiences. She is a graduate of the University of Warsaw.