Keynote speakers

Elena Volodina

Learner corpora – overcoming challenges with building and sharing the data

Abstract: With a growing number of people seeking asylum in European countries, second language (L2) teaching and the development of that practice are of great importance to modern societies. Access to learner-produced data is a necessary prerequisite for driving research and didactic development in this area. This is what the SweLL project promised: access to annotated learner essays for new research projects. Now, almost at the end of the project, we face legal challenges with sharing the data openly.

This talk will introduce the main steps that SweLL has gone through – from setting up a platform and developing new tools for L2 annotation to releasing the data and the tools. I will describe the main results and demonstrate the tools and search possibilities. A prominent part of the talk will be devoted to the challenges of (automatic) pseudonymization of learner essays as a prerequisite for being able to share the data with new users.

Bio: Elena Volodina is a researcher at the University of Gothenburg, Sweden. She has been active in the development of resources and applications for language learning; her main areas of expertise are Intelligent Computer-Assisted Language Learning, Learner Corpus Infrastructure, computational linguistic methods and corpus-based text studies. Recently, she has been involved in developing tools for automatic pseudonymization of learner essays and creating lexical and grammatical profiles for Swedish as a second language.

Jan Rybicki

Books and computers, books in computers: stylometry in literary originals and translations

Abstract: Stylometry counts various linguistic and/or stylistic features in (literary) texts, and its results are very often in agreement with those obtained in traditional literary scholarship. Stylometric methods applied to even very simple quantitative data on texts, such as word frequencies, are usually enough to tell one author from another, or to group texts by chronology, genre, or gender. On the other hand, some other results are much less intuitive: translations tend to group by original authors rather than translators; some authors, translators or genres are easy to detect, some not; higher-level textual features such as sentence lengths are much less effective than frequencies of context-free function words. This presentation hopes to serve as an introduction to the strange and wonderful world of quantitative literary studies, illustrated with a number of very old and very recent case studies.

Bio: Jan Rybicki is Associate Professor of English Studies at the Jagiellonian University in Kraków, Poland. With a background in English literary studies, comparative literature and translation studies, he has published on stylometry in translation (“The Great Mystery of the (Almost) Invisible Translator: Stylometry in Translation”), authorship attribution (“Partners in Life, Partners in Crime?”) and gender (“Vive la différence: Tracing the (Authorial) Gender Signal by Multivariate Analysis of Word Frequencies”). His latest papers include attributive work on Harper Lee and Elena Ferrante; he has also published on various aspects of the writing of Sienkiewicz in the original and in English and Italian translation. He helped write the stylo package for R. A literary translator in his previous lifetime, he translated into Polish such authors as Golding, Gordimer, Fitzgerald, Ishiguro and le Carré.

Daniel Zeman

Universal Dependencies: A Search for Harmonized Morphological and Syntactic Annotation

Abstract: For at least two decades, syntactically annotated corpora (treebanks) have been instrumental both in linguistic research and in the development of natural language understanding applications. Even though the application aspect has somewhat diminished with the current surge of neural networks, the classical who-did-what-to-whom type of questions still cannot be answered without understanding the syntax of the sentence. In order to facilitate the usage of treebanks, it is desirable that they capture the same phenomena in the same way, across languages and domains. This is exactly the goal of Universal Dependencies (UD): a community effort to define cross-linguistically applicable annotation guidelines for morphology and syntax, and to provide data annotated following those guidelines. In my talk, I will introduce UD, its main principles and the current state, and I will discuss some of the challenges that harmonization and multilingual annotation present. In the last part of the talk, I will touch upon the latest developments towards Enhanced UD and Deep UD.

Bio: Daniel Zeman is a senior researcher and lecturer at Charles University in Prague, Czechia. He has been active in parser development, machine translation, computational morphology, cross-lingual model transfer and typology. He has been one of the leading figures in the Universal Dependencies project since its beginning in 2014.