This is the third in a series of posts about each of the teams that will be attending SCI 2019, and their projects. This one was submitted by Hugh Cayless.
The Text Encoding Initiative was, from its outset, very much a Western, English-language effort. Its remit, however, is global. Primary source documents written in languages as diverse as Chinese, Mayan, Coptic, Japanese, Arabic, Old Cam, and many others are published in TEI. The Guidelines “are addressed to anyone who works with any kind of textual resources in digital form” (TEI Consortium, About these Guidelines) and they represent a major and long-lived contribution to the infrastructure of digital scholarship.
Although the desire of the TEI community is to produce a globally open and accessible resource, we face many challenges in producing translations of the Guidelines and the specifications they contain. The current processes used in producing translations are outdated for our purposes. The Guidelines are a living, continuously updated document, and translations may quickly become obsolete as the sources are edited. Additionally, integrating translated materials requires a high degree of technical expertise. Where ongoing translation efforts exist, there is no framework for publicizing them. Our goal is to assemble a team with the technical and linguistic competency to conceive and implement workable solutions to these problems. We believe that tackling a large and difficult real-world global scholarly communication problem like this will provide examples to other projects wishing to improve their global outreach.
The editing of the Guidelines is the responsibility of the Technical Council, an 11-member body elected by the TEI membership. Although the membership of the Council is international, and its members collectively have competency in at least half a dozen languages, essentially all business is done in English. The TEI Guidelines documentation has three parts: 1) the prose Guidelines, 2) the technical specifications of TEI components (the actual elements, attributes, classes, etc.), and 3) the examples of usage, which appear in both #1 and #2. Translation efforts to date have tended to focus on #2, the technical specifications, which, since they consist of short definitions and notes, are easiest to translate. Each “spec page” contains documentation in various languages, in parallel.
Efforts to internationalize the TEI’s documentation date back at least to 2005. An initiative led by the late Sebastian Rahtz developed infrastructure to support translations and solicited community efforts to provide them. This effort resulted in partial translations in Japanese, Chinese, Korean, Italian, Spanish, French, and German. Over the years since that initial effort, periodic updates have been made to individual languages, most recently German, Japanese, and Spanish. The system in place for translating spec pages relies on converting them to a spreadsheet form, in which the actual translation is performed, and then integrating these changes back into the sources. There is, as yet, no workflow for automatically integrating translations back into the source documents. And there is no process at all to support translating the prose of the Guidelines. Some attempts have been made to produce French versions of parts of the Guidelines, but because the prose is so complex, translations of it are much more difficult to produce and maintain. Worse, from the point of view of non-English speakers, the prose Guidelines are considered to be the canonical set of instructions on TEI usage and syntax. Consequently, a full understanding of TEI is impossible without reading the English version.
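To make the missing re-integration step concrete, here is a minimal sketch in Python of what merging a translator's spreadsheet back into a spec file might look like. It is not the TEI's actual tooling. The CSV column names (`ident`, `lang`, `versionDate`, `desc`), the function name `merge_translations`, and the sample data are all hypothetical, and the sketch assumes that translated descriptions live in `<desc>` elements distinguished by `xml:lang` and stamped with a `versionDate`.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ELEMENT_SPEC = f"{{{TEI_NS}}}elementSpec"
DESC = f"{{{TEI_NS}}}desc"

# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)

# Illustrative sample data only; the dates and the translation are made up.
SAMPLE_SPEC = (
    '<elementSpec xmlns="http://www.tei-c.org/ns/1.0" ident="p" module="core">'
    '<desc xml:lang="en" versionDate="2019-01-01">marks paragraphs in prose.</desc>'
    "<classes/>"
    "</elementSpec>"
)
SAMPLE_CSV = (
    "ident,lang,versionDate,desc\n"
    "p,it,2019-10-01,indica i paragrafi in prosa.\n"
)


def merge_translations(spec_xml: str, csv_text: str) -> str:
    """Insert or update translated <desc> elements from spreadsheet rows.

    Hypothetical CSV layout: one row per (ident, lang) pair.
    """
    root = ET.fromstring(spec_xml)
    # Index every elementSpec (including the root itself) by its @ident.
    specs = {el.get("ident"): el for el in root.iter(ELEMENT_SPEC)}
    for row in csv.DictReader(io.StringIO(csv_text)):
        spec = specs.get(row["ident"])
        if spec is None:
            continue  # the row refers to a spec that is not in this file
        descs = spec.findall(DESC)
        target = next((d for d in descs if d.get(XML_LANG) == row["lang"]), None)
        if target is None:
            # New language: place it right after the existing <desc> elements.
            target = ET.Element(DESC, {XML_LANG: row["lang"]})
            position = list(spec).index(descs[-1]) + 1 if descs else 0
            spec.insert(position, target)
        target.text = row["desc"]
        target.set("versionDate", row["versionDate"])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(merge_translations(SAMPLE_SPEC, SAMPLE_CSV))
```

Because existing translations are updated in place rather than duplicated, the merge can be re-run safely whenever a spreadsheet is revised. A real workflow would also need to handle glosses, remarks, and other translatable parts of a spec, and to flag translations whose English source has since changed.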
Our group will consider technical approaches to improving the translation workflow for the Guidelines and specifications as well as ways in which we might de-center English as the core and canonical language of the TEI.
We aim to address the following questions:
- Could we prioritize the spec pages as the authoritative documentation, around which documentation in multiple languages could orbit?
- How should we help foster pedagogical initiatives in many languages?
- How do we make it clear that non-English-speakers can and should raise issues on our GitHub repositories (https://github.com/TEIC) and ask questions in their own languages?
- What should we prioritize for internationalization?
- How should the TEI Consortium support and/or initiate translation efforts?
- Are there automated tools (Google Translate or DeepL, for example) that could give translation efforts a head start?
Our planned work and deliverables are as follows:
- Preliminary work will include the evaluation of existing translation toolkits, such as https://translatewiki.net/, and the analysis of lessons learned from previous translation initiatives, such as the recent German and Japanese translations, and from ongoing efforts, like the Spanish Text Technologies Hub.
- The team will produce a set of recommendations for the TEI Consortium, which will be submitted to the Board of Directors and posted on the TEI mailing list.
- We will deliver a follow-up report at the 2020 TEI Annual Meeting and potentially hold a workshop as well.
- Any translation toolkits or workflows we produce will be disseminated under an open license at the TEIC’s GitHub organization, https://github.com/TEIC.
- In addition, we consider it crucial that a well-documented set of procedures for creating translations be developed and shared with the community.
The team we have assembled for Triangle SCI combines linguistic and technical expertise with practical experience teaching TEI in a variety of environments. We have experience working on German, Japanese, and Spanish translations of the TEI specifications, and so have direct knowledge of the limitations and shortcomings of the current system. We have taught TEI in Spanish, Japanese, German, and English. We also possess deep technical knowledge of the TEI itself and its infrastructure. Our group has representatives from both the TEI Technical Council and the Board. We are well-placed, therefore, both to conceive solutions to the TEI’s internationalization problems, and to implement them.
Gimena del Rio Riande is an Associate Researcher at IIBICRIT-CONICET and teaches at the University of Buenos Aires. She is interested in building an Open Digital Humanities community in Argentina. During the last five years she created the first DH Lab in her country, HD CAICYT Lab, and worked on the publication of the first Spanish Digital Humanities OA journal, the Revista de Humanidades Digitales, the organization of the Asociación Argentina de Humanidades Digitales, and the Argentinian OA Repository Project, Acta Académica. She also collaborates with many DH projects and consortia around the world (Force11, TEI, Pelagios Commons, DARIAH). Gimena brings her experience working with different academic communities (Anglophone, Spanish), where she has explored transculturation and decolonization approaches in the Humanities. She also brings her experience with OA policies and the digital humanities scene in Latin America, which will help the team extend this framework to the Spanish-speaking community and to Global South perspectives.
Martina Scholger is a senior scientist and researcher at the Centre for Information Modelling – Austrian Centre for Digital Humanities at the University of Graz. She recently received her PhD in Digital Humanities, teaches data and text modelling with a focus on X-technologies, and is involved in numerous cooperation projects in the field of digital scholarly editing. She has been a member of the Institute for Documentology and Scholarly Editing (IDE) since 2014 and a member of the TEI Technical Council since 2016, where she is currently serving as Chair. In 2016, she was one of the co-organizers of the “TEI2German translatathon” at the annual TEI conference and members' meeting in Vienna. She is therefore familiar with the current translation workflow of the TEI specifications and with the challenges and pitfalls of preparing translations and integrating them into the TEI Guidelines, as well as with the TEI infrastructure.
Helena Bermúdez Sabel is a postdoctoral researcher at the Université de Lausanne (Switzerland). Her current position involves the development of annotation schemes for the study of modality in Latin from a diachronic perspective. In addition, she supervises the technical aspects of the annotation process as well as data management and the dissemination of results. Before this position, Helena Bermúdez Sabel worked at the Laboratorio de Innovación en Humanidades Digitales (Madrid, Spain), an institution particularly concerned with dissemination of and training in Digital Humanities methods within the Spanish-speaking community. Besides being an instructor in various DH courses, many of them focused on TEI and XML technologies, she was one of the researchers on a project focused on enabling the interoperability of poetic resources from all European traditions. Her training as a Romance philologist has given her a working knowledge of multiple Romance languages. This background is relevant not only to the topic of this proposal but also to SCI's overall goals, thanks to her understanding of the cultural heritage of different linguistic communities.
Kiyonori Nagasaki is a Senior Fellow at the International Institute for Digital Humanities in Tokyo and a lecturer in digital humanities, including a TEI class, at the University of Tokyo. He studied Buddhist philosophy and information technology in the graduate school of Tsukuba University (Japan). While he has built many databases for the humanities, he has also worked for over a decade to disseminate TEI among Japanese DH and humanities researchers. In 2016, the East Asian / Japanese special interest group (SIG-EAJ) was established under the auspices of the TEI Consortium at his proposal, in order to accelerate activities that internationalize the TEI Guidelines and their ecosystem. He has also worked on other standardization efforts, such as Unicode and IIIF, and on system development, in order to build a model of integrated research environments for the humanities.
Luis Meneses is a Postdoctoral Fellow and Assistant Director for Technical Development at the Electronic Textual Cultures Lab at the University of Victoria. He is a Fulbright scholar, and currently serves on the Board of the TEI Consortium and on the IEEE Technical Committee on Digital Libraries. His research interests include digital humanities, digital libraries, information retrieval, and human-computer interaction. His current research focuses on the development of tools that facilitate open social scholarship.
Hugh Cayless is a Senior Digital Humanities Research Developer at the Duke Collaboratory for Classics Computing (DC3). Hugh has two decades of TEI experience, having first encountered the Guidelines as a Ph.D. student in Classics. He was a founding member of the EpiDoc Collaborative, which develops a TEI-derived schema, documentation, and tooling for representing ancient documents. He has served on the TEI Technical Council since 2012, and chaired that body from 2015 to 2018. He currently serves as the Treasurer of the TEI Consortium. Hugh has experience supporting TEI projects in many languages, including Greek, Latin, Syriac, Arabic, and English.
[ Photo by Nicola Nuttall used under Unsplash free license. ]