Corpus-based methodologies have become a major staple of linguistic research in recent years, covering topics ranging from grammar to lexicology, sociolinguistics, and applied linguistics. Data-driven approaches have had an impact on the field of translation as well, however mainly from a computer science perspective and with regard to predominantly practical problems and the industry. The results of these trajectories include translation memory, terminology management, and (statistical) machine translation systems.
A large piece of the puzzle that still appears to be missing in this picture is data-driven approaches that focus on translation as a complex language operation rather than a data problem to be solved with brute force. To put it succinctly, the main problem to be addressed by the proposed project is a lack of data-driven tools that help make explicit real-world translation phenomena and provide sound theoretical models capable of explaining them. This would benefit not only translation studies as a discipline in the tradition of the humanities, but also other translation-related fields, such as machine translation, computer-aided translation, and the language industry.
A translation is characterised by the highest level of intertextuality to be found in any kind of text, because it is always a reproduction of another text; as such, it is mostly judged and/or examined with regard to the relation it bears to its original. The project aims to provide a tool that helps define this relation with a precise, generally valid, translation-specific set of metadata labels, applied to a corpus that consists solely of translations and their respective original texts. Our main goals are thus:
- The compilation of an open-ended collection of translations and their original texts: a translation bank, which is not a finished sum of texts, but a growing resource for translation-related research
- The definition of a finite set of metadata labels that mainly capture the distinctive features of translations as highly intertextual language material
- Labelling the data in the translation bank using the defined metadata set
- Making the resource available to the research community as well as the general public, in a highly usable and interoperable form
1.3. Related Work
In the following, we discuss a number of notable bi- or multilingual corpora, each of which covers some aspects of translation, but none of which presents a complete picture.
European Parliament Proceedings Parallel Corpus 1996-2011 (EuroParl)
Given its large size and sentence-aligned architecture, EuroParl has been highly influential, especially for statistical machine translation. However, it includes a rather limited variety of text types and provides relatively little translation-specific metadata. In contrast to the EuroParl corpus, our project aims to include a much larger variety of text types and to focus much more on metadata, thus making the corpus structure more flexible and open to a variety of research questions in the field of translation.
MeLLANGE Learner Translator Corpus
This corpus is a notable representative of learner corpora and accordingly highlights didactic aspects; it does not include professional translations, but translations produced by translators-in-training. Compared with a wide range of other corpora, its size is small (about 12,000 words); the focus, however, is on data quality rather than quantity, as the corpus is annotated on various levels. In other words, metadata play a more important role, and translation-specific aspects are specifically addressed.
TEC (Translational English Corpus)
TEC aims to investigate the language of translated English and contains written texts translated into English from a variety of source languages. One restriction is that the translations are always into English, making it a unidirectional corpus. A further limitation is that it does not include the source texts. However, it features a useful set of translator-related metadata (e.g. gender, nationality, and occupation) that shapes translations to a significant extent and is of interest to us as well.
BYU Wikipedia Corpus
Released in early 2015, the monolingual BYU Wikipedia Corpus (at Brigham Young University) contains 1.9 billion words in 4.4 million web pages and can be searched with various queries, including categories such as part of speech, collocates, and frequency. Its most interesting feature is the ability to create “virtual corpora” on the fly for specific subject domains. These can then also be searched through the provided web interface, but cannot be downloaded free of charge.
Dutch Parallel Corpus
The last corpus we discuss here is a balanced, high-quality parallel corpus covering Dutch, English, and French, with translations both from and into Dutch. It features an even richer metadata set than the corpora mentioned previously. Being balanced and non-dynamic, it is useful for a wide range of research.
Further resources of note in the community, which also have a number of features interesting for our research project, include ParFin, ParRu, the Oslo Multilingual Corpus, GENTT (Textual Genres for Translation), and MUST (Multilingual Student Translation), a discussion of which would go beyond the scope of this proposal.
2.1. Data Selection and Collection
TransBank is to impose no restrictions regarding language pairs and directions, text types, subject domains, translator experience and education, time of production, etc. The only criterion for inclusion is that texts must be translations paired with their respective originals. The translation bank is thus, on the one hand, a dynamic text resource similar to a translation memory, which is open-ended by definition, and, on the other hand, similar to a classical bilingual corpus, which collects a large number of texts or text segments to help answer research questions. It goes beyond both of these resource types in so far as it constitutes a kind of meta-collection: the strong emphasis on metadata allows for selective search and sorting operations, making it possible to generate sub-corpora on demand, suitable for a wide range of potentially very specific research questions.
Of course, these criteria alone would admit a limitless, overwhelming quantity of texts, so there must at least be a well-defined starting point. For this purpose we have outlined a data harvesting plan, which includes a pre-selection of data sources and search directions, taking into consideration aspects such as legal and copyright status, availability, and pre-existing metadata. Strategically, the harvesting plan is subdivided into two major components, a retrospective one and a prospective one, both explained in the next sections. It should also be mentioned that, unlike classical corpora, we do not aim for a balanced collection of texts: we consider balance a limiting factor, since language use and the distribution of text types in the real world are themselves unbalanced and dynamic.
The selection of new data is to be carried out by the staff processing them, i.e., two students supervised by a doctoral researcher, who will help with the scientific aspects of the project, such as developing the label set, and write his/her thesis on the development and impact of the project.
2.1.1. Copyright Issues and Data Protection
Whenever large amounts of data are dealt with, especially against the background of an open-access-oriented project, one major concern is the legal status of the material collected. Therefore, a two-pronged approach is to be adopted in this project. Firstly, when pre-selecting the material listed in the harvesting plan, we roughly assessed and took into consideration possible legal problems further downstream.
Secondly, once the project has started, every initial source (i.e., everything pre-selected for the rough data harvesting plan in the first step) and all additionally gathered sources are to be carefully examined in legal terms. During this step, the legal status of and rights to the works are to be cleared in detail. As mentioned, data collection itself is also to be performed with the help of a dual strategy, outlined in the following two sub-sections. Rights clearance is to be outsourced to a lawyer within our academic network who is also a trained translator.
2.1.2. Retrospective Data Collection: Legacy Data
This component consists of enriching existing bi- or multilingual text resources with metadata. The data are to be sourced based on the harvesting plan and any additional resources found downstream, including existing corpora as well as individual texts along with their translations.
2.1.3. Prospective Data Collection: New Data
By new data we mean translations for which the respective translators are prepared to provide metadata corresponding to our label set from the start. In this context, the main issues will therefore not be analytical metadata creation, but negotiations with practitioners and quality control. We already have a reliable network at our disposal, which includes a translation centre, several freelance translators, and, of course, our university department. This approach will also allow us to gather significant quantities of non-literary texts, i.e. text types often underrepresented in translation-related corpora: a very important aspect of real-world translation data, as the bulk of what the industry translates is non-literary.
2.1.4. Data Harvesting Plan
As mentioned, the data harvesting plan is a rough tool that provides a starting point for collecting data. It contains a number of pre-selected, promising resources which are to be processed during the project and are likely to lead to further sources. There are several types of resources, and we include one example of each below.
Bibliographies: Index Translationum
– international bibliography of book translations
– rich index of bibliographical metadata for translated books
– contains at least a few useful metadata labels, such as the source and target language, and subject; but mostly standard bibliographical information
– does not contain the texts themselves
Online text collections: Project Gutenberg
– contains texts themselves and not only metadata
– most material in the public domain – legally unproblematic but most of the works not very recent
– a cross-check with the Index Translationum could yield a substantial number of titles, their translations, and the texts themselves
Existing corpora: Europarl
– very little metadata, but sentence-aligned
– good candidate with large quantities of text to be enriched with metadata
Other online resources: WikiProject Medicine
– Sub-project of Wikipedia (one of the most used bodies of text in NLP), dedicated to translation of medical texts
– Wikipedia user profiles contain many useful details such as biographical data or mother tongue(s)
– cross-referencing profiles and Project Medicine translations with version history of translated articles makes it possible to collect large amounts of original texts and their translations along with valuable metadata
Translation centres: Innsbruck Translation Centre (ITC)
– authors of this proposal affiliated with the ITC
– competence centre for professional translation and interpreting
– spin-off of the Department of Translation Studies at the University of Innsbruck
– operates based on an agency model: significant hub for freelance translators for a broad range of language pairs, subject domains and text types
– ideal partner for gathering real-world translations and metadata: highly valuable metadata following our specifications, gathered before or during the production of the translations
– prerequisite: informed consent of both the client(s) and the translator(s) involved; same mode would apply to translation offices with permanently employed staff, and freelance translators not already part of our network
Education institutions: University
– for a universal, empirical translation bank, learner-generated data must not be left out
– authors’ affiliation with translation studies department of University of Innsbruck grants access to large number of student translations
– student translations could be labelled with metadata before and during translation process
– same mode could be applied to partners in academic network
– informed consent of teacher(s) and student(s) required
2.2. Data Processing
As with data collection, the processing step is to be carried out by the scientific staff, consisting of two students supervised by a doctoral researcher, who will assist with the scientific aspects of the project and write his/her thesis on the development and impact of the project.
Data labelling is one of the core activities of the project. As mentioned, it is to be done by applying a metadata set to be defined during the project. The metadata are to include all aspects relevant to the production of the translation, such as language pair, text type, subject domain, translator experience and education, and time of production. What is decidedly not going to be labelled is translation quality, as this is an issue the scientific community has not yet resolved. The translation bank would provide a valuable, re-usable resource for tackling that very research question; labelling translation quality before there is a reliable, agreed-upon means of assessing it would therefore compromise data quality and harbour the danger of bias. A separate subset of labels will have to be defined for source texts, as they, too, have key features relevant to translation, e.g. their year of publication and whether they are translations themselves (resulting in intermediary translation).
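To make the two label subsets concrete, the following sketch models a possible shape of the metadata records in Python. All field names are our own illustrative assumptions, not the label set to be defined during the project; note the deliberate absence of any quality label.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the metadata label set; the actual labels
# are to be defined during the project.

@dataclass
class SourceTextMetadata:
    """Labels for source texts (a separate subset, as argued above)."""
    language: str
    year_published: Optional[int] = None
    country_published: Optional[str] = None
    is_translation: bool = False  # marks intermediary translation chains

@dataclass
class TranslationMetadata:
    """Labels capturing the production context of a translation."""
    source_language: str
    target_language: str
    text_type: str                                 # e.g. "fictional", "technical"
    subject_domain: Optional[str] = None
    translator_experience: Optional[str] = None    # e.g. "trainee", "professional"
    translator_native_language: Optional[str] = None
    year_produced: Optional[int] = None
    # Deliberately no quality label: see the discussion above.
```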
Data labelling, and this cannot be stressed enough, will be the basis for faceted search, for generating sub-corpora for very specific research questions, and for overcoming the inherent limitations of static text collections.
Alignment is a necessary condition for studying translation phenomena, as the sentence constitutes a manageable, neatly defined translation unit for machines as well as humans. Of course, this does not mean that the sentence is the only possible translation unit, but it is the most convenient one in terms of both data processing and cognition. Even if there is some debate in the scientific community as to the justification of the sentence as a translation unit, for the purposes of this project the decision is a practically motivated step that must be taken at some point.
As a trade-off between data quality on the one hand and processing speed on the other, semi-automated alignment with the help of the free tool Align Assist has been chosen. This allows for dealing with complex alignment situations as well, such as cross-alignment when the sentence orders in the source and target text differ.
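A cross-alignment of the kind mentioned above can be represented as pairs of sentence-index lists, which also covers 1:2 alignments and omissions. The sentences and links below are invented purely for illustration:

```python
# Invented example sentences; real alignments come from Align Assist via TMX.
source = ["S1.", "S2.", "S3."]
target = ["T2.", "T1a.", "T1b."]

# Each link pairs a list of source indices with a list of target indices.
alignment = [
    ([0], [1, 2]),  # 1:2 alignment: one source sentence split in the target
    ([1], [0]),     # cross-alignment: sentence order differs across texts
    ([2], []),      # omission: no target counterpart
]

def aligned_pairs(src, tgt, links):
    """Resolve index links to the actual sentence strings."""
    return [([src[i] for i in s], [tgt[j] for j in t]) for s, t in links]
```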
3. DATA STORAGE AND ACCESS
Data storage is to be provided in the form of TMX files for the aligned text, with METS as a container format for storing the metadata according to the MARC 21 standard. The web-based search and presentation platform is to provide output options for data download, which can be generated via XSLT from the above XML formats: plain text of source and target texts, and TEI-compliant XML files for those who want to label data within the texts as well, as opposed to our metadata about the texts. The TMX/METS files themselves will also be available for download.
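As a minimal illustration of the chosen storage format, the following Python sketch assembles a bare-bones TMX 1.4 document from aligned segment pairs using only the standard library. The header values are placeholder assumptions; the real files will of course carry fuller header and metadata information.

```python
import xml.etree.ElementTree as ET

def build_tmx(pairs, src_lang="de", tgt_lang="en"):
    """Build a minimal TMX 1.4 document from (source, target) segment pairs.
    A sketch only, with placeholder header attributes."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "TransBank", "creationtoolversion": "0.1",
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```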
The platform will allow for faceted search operations, which can be used for compiling and downloading specific sub-corpora. This means that search parameters can be combined instead of only being used in a mutually exclusive manner, as is the case with fixed, separate (sub-)corpora. The combination involves labels not only for one group of texts, but for two: users choose a combination of metadata labels for the source texts on the one hand and for the pertinent target texts on the other. The search mask is therefore two-sided. For example, users could compile a parallel corpus which fulfils the following criteria:
Source texts (included in download [yes] / [no])
[published from (1938) to (1945)], [published in (Austria)], [fictional]
Target texts (included in download [yes] / [no])
[DE-EN language pair], [female translator], [translation into (native language)]
As can be seen, users can also choose whether source texts, target texts, or both are included in the downloadable package; i.e., corpora consisting of only source texts or only target texts are an option too, and neither is necessarily monolingual. This makes it possible to generate comparable corpora as well, e.g. by searching for all original texts from a certain subject domain in various languages, without considering the translations included in the bank. The search engine to be used is the Lucene-based Elasticsearch.
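Assuming a hypothetical index mapping in which each document stores source-side and target-side metadata together, the two-sided search mask above could translate into a single Elasticsearch bool filter. All field names here are illustrative assumptions, not the mapping to be defined in the project:

```python
# Hypothetical Elasticsearch query body for the example search mask above.
# Field names ("source.*", "target.*") are assumptions for illustration.
query = {
    "query": {
        "bool": {
            "filter": [
                # Source-text side of the search mask
                {"range": {"source.year_published": {"gte": 1938, "lte": 1945}}},
                {"term": {"source.country_published": "Austria"}},
                {"term": {"source.text_type": "fictional"}},
                # Target-text side of the search mask
                {"term": {"target.language_pair": "DE-EN"}},
                {"term": {"target.translator_gender": "female"}},
                {"term": {"target.into_native_language": True}},
            ]
        }
    }
}
```

Because the two sides are just further filter clauses over one document, combining them requires no join; the matching documents directly define the downloadable sub-corpus.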
4. EXPECTED OUTCOMES AND OUTLOOK
While sharing a number of features with each of the resources listed under 1.3, the combination of these features, together with a new way of accessing translation data via the envisaged web platform, provides a genuinely new tool for promoting data-driven, empirical translation research. The main innovative feature is the ability to compile and download parallel or comparable sub-corpora on demand, tailored to the requirements of specific research questions. Such studies may address, e.g., diachronic issues such as translation-induced language change, the impact of translation technology on linguistic features of written text, contrastive questions regarding differences between text-type norms and conventions across languages, or cognitive research interests concerning the differences between texts produced by trainees and experienced practitioners.
Moreover, the openness and dynamic nature of the collection, together with the high data quality provided by semi-automated alignment, will yield a quantity and quality of data sufficient for big-data approaches, with significant implications for translation technologies and natural language processing (NLP), including machine translation. The ability to generate comparable sub-corpora (see 3.) is also well in line with a current trend in data-driven linguistic research and NLP. The quantitative goal of the project is to collect approximately ten million words (tokens) within the two-year project duration. A high likelihood of success is ensured by using reliable, tried and tested technologies and standards for a new approach to accessing empirical data for translation research.
The aforementioned openness and dynamic operation are to be provided well beyond the end of the project duration, as TransBank will be maintained and expanded after the project, too. The rate of growth in that phase will depend on the funding that can be attracted by then. At the least, basic maintenance and moderate expansion of the bank will be provided within the framework of the university employment contracts of the two project leaders. We hope to have built up an extensive community of users by that time, who will also contribute text material, the quality of which is, however, always to be checked by project members.
In summary, what we are aiming to provide is re-usable, open, empirical data for translation research.
DATA MANAGEMENT PLAN
Data workflow and storage
- External legal clearance of data listed in the data harvesting plan
- Collection of data by scientific staff (two students and one doctoral researcher, who will also supervise the students)
- Active submission of prospective data (see 2.1.3.) by partners
- Passing-on of newly found data/sources for rights clearance
- Quality control and/or alignment of legally cleared data
- Upload of data into the web application
- Data storage in an XML database
- Transformation of data into output formats and download, both on demand
- Sentence-aligned original texts and their translations
- Metadata corresponding to a comprehensive label set
- Various output possibilities of text data
Methodology: tools, standards
- Semi-automated alignment using the free tool Align Assist
- Storage of aligned data in the TMX XML format
- Metadata labelling with the help of the Oxygen XML editor
- Metadata storage in METS XML files
- Storage of all XML data in an XML database
- Conversion of data for output/download using XSLT
- Output formats: plain text, TEI-compliant XML
- Data ingestion, display, and transformation/output will be provided by a dedicated web platform whose development will be outsourced to a local software engineer, based on a Java application server, a NodeJS server, and a REST API
- A dedicated web domain will be used for the project
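As a rough sketch of the plain-text output option listed above (the production pipeline will use XSLT, as stated), the following standard-library Python function extracts one line per segment and language from a TMX document:

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace on parsing.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_plain_text(tmx_string):
    """Collect segment texts per language from a TMX document.
    Returns a dict mapping language code to newline-joined plain text.
    A sketch of the plain-text output option, not the XSLT-based converter."""
    root = ET.fromstring(tmx_string)
    texts = {}
    for tu in root.iter("tu"):
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")  # "lang" in older TMX
            seg = tuv.find("seg")
            if lang and seg is not None:
                texts.setdefault(lang, []).append(seg.text or "")
    return {lang: "\n".join(lines) for lang, lines in texts.items()}
```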
Volume and type of data provision
- All data (goal of 10 million words) will be made available under a Creative Commons attribution licence (CC BY 3.0 AT) during and after the project
- Access to the data will be open worldwide and free of charge
- Physical data storage and server hosting will be provided by the central IT service of the University of Innsbruck, which operates one of the largest data centres in western Austria
- Backups will be produced regularly with the help of Tivoli Storage Manager (status at the time of writing) or an equivalent backup solution
The TransBank research group is based at the Department of Translation Studies at the University of Innsbruck, Austria. Its members contribute expertise in translation, linguistics, natural language processing and software engineering to the project. Currently, the following members are part of the group:
- Principal Investigator
- Doctoral Researcher
- Principal Investigator
- Software and Database Engineer