DVITAS (Bilingual Automatic Terminology Extraction) is a project carried out by researchers at Vytautas Magnus University and Mykolas Romeris University and funded by the Research Council of Lithuania.

The aim of the project

The aim of the project is to develop a methodology for the automatic extraction of English and Lithuanian terms of a specialised domain from parallel and comparable corpora, and to create a bilingual termbase that is based on empirical data and publicly available on the Internet. Cybersecurity (CS) has been chosen as the specialised domain for the project.

The scientific problem

The scientific problem to be solved during the project is the automatic extraction of terminographic data from bilingual resources, i.e. parallel and comparable corpora, when one of the languages is under-resourced and morphologically rich. The project will develop an innovative methodology which, to our knowledge, has not yet been applied in Lithuania. During the project we will test whether state-of-the-art machine learning algorithms and neural networks can be applied to bilingual term extraction.

Cybersecurity (CS) domain

The CS domain was chosen because of its special relevance for today’s information society. The area is particularly dynamic: new CS documents are constantly being drafted and new concepts developed, but the terminology has not yet been fixed in the Lithuanian language. As a result, new CS concepts are usually expressed by several terms, often by the name used in the original (English) language or by hybrids (combinations of English and Lithuanian lexical items). A CS termbase is therefore particularly relevant to drafters of legal and administrative acts, translators, IT professionals, and the general public.

Resources to be developed

During the project, bilingual (English-Lithuanian) parallel and comparable corpora of the cybersecurity domain will be compiled and made available to the public (e.g. in the CLARIN repository). They will reflect the use of cybersecurity terms in texts of various genres and types in national and international settings. Terminographic material extracted from the corpora will be published in an open bilingual (English-Lithuanian) database of cybersecurity terms. This database could serve as a model for developing termbases in other domains with state-of-the-art technologies that automate the extraction of terminographic material.