Workshop organized by Felix Ameka, Emmanuel Ngué Um, Sara Petrollino, Mmasibidi Setaka and Daan van Esch

Short description
This proposal is the Lorentz-eScience winner 2021.
Language documentation in Africa has been on the increase over the past 20 years, with many researchers documenting languages for which little or no record existed. Despite this, several bottlenecks slow down the current workflow and hamper the application of cutting-edge AI technologies to these less-resourced languages. As a result, less-resourced languages are left behind and cannot benefit from the recent advances in machine learning and automatic speech recognition that have played an important role in enhancing the visibility and information flow of major world languages.
Typical bottlenecks in the language documentation pipeline are the transcription of audio and video recordings, data annotation (e.g. translation, POS tagging), and the lack of computational experience among language documenters. If these bottlenecks are not addressed, language documentation processes cannot be improved. This limits the availability of documentation output, that is, records of linguistic practices, and widens the digital divide that separates less-resourced languages from major world languages.
The workshop seeks to be a testing ground for data and software carpentry where documentary linguists, language community members, computer scientists and software developers can exchange their knowledge and experience about the application of AI technologies to African languages and the field of documentary linguistics. In this regard, the organisers’ wish is to align the language documentation agenda with the national AI research agenda (AIREA-NL), and contribute to a human-centered AI approach as advocated by The Hybrid Intelligence Centre and Humane AI.
The workshop builds on discussions and lessons learnt from the workshop hosted by the Lorentz Center in 2019: Digital Humanities – the perspective of Africa (https://dhafrica.blog/lorentzworkshop/) and follow-up meetings such as the ones that took place in the context of the UNESCO conference LT4All, and the Global Digital Humanities Symposium at Michigan State University.
The workshop will provide academics and professionals who work with African languages with practical training and technological solutions. The programme will feature two invited talks and hands-on sessions on specific machine learning applications that are now being tested in language documentation (such as ELPIS, OCTRA, and others), followed by knowledge exchange sessions moderated by Masakhane (NLP for African languages) and by Dorothy Gordon, Chair of the UNESCO programme Information for All. Selected participants will be required to have recorded language data at various stages of the documentation pipeline (e.g. transcribed and untranscribed audio/video recordings, annotated texts, etc.), and they will work on their own data under the guidance of computer scientists from both academic and non-academic institutions. The knowledge exchange sessions will consist of interactive discussion groups in which participants will discuss how current technology can be shaped to serve the needs of documentary linguists, and how existing technology can be incorporated into language documentation to enhance the discipline.
The workshop will take place at the end of May 2021, as a satellite event of the 10th World Congress of African Linguistics (WOCAL).
Programme
All times in CEST
DAY 1 – MAY 31
12:15 Walk-in moment
12:45 Welcome speech by Lorentz Center
13:00 Intro speech by TEDAL organisers
13:15 Introduction of workshop participants
TALK 1: Tools for the new linguist – Language technology working for you. Dorothy Gordon (UNESCO programme Information for All)
14:00-14:30 Talk
14:30-14:45 Discussions in break-out rooms
14:45-15:00 Plenary discussion with Dorothy Gordon
15:00-15:30 Coffee break and networking
15:30-16:30 Knowledge exchange session 1: Christoph Draxler (Ludwig-Maximilians-Universität München and CLARIN D), “OCTRA”
16:30-17:30 Knowledge exchange session 2: Vukosi Marivate (University of Pretoria, Data Science for Social Impact, Masakhane), “Recipes for low-resource language research: a practical perspective”
DAY 2 – JUNE 1
TALK 2: From assemblages to corpora and back: How to enable broadest usage of documentary collections – Mandana Seyfeddinipur (ELDP & ELAR)
10:00-10:25 Live talk
10:25-10:45 Discussions in break-out rooms
10:45-11:00 Plenary discussion with Mandana Seyfeddinipur
11:00-11:15 Coffee break and networking time
11:15-12:00 Break (off-line)
12:00-13:00 Knowledge exchange session 3: Antonis Anastasopoulos (George Mason University), “Language Technology for Language Documentation: What it can and cannot do”
13:15-14:15 Knowledge exchange session 4: Martha Yifiru Tachbelie (University of Addis Ababa), “Multilingual ASR for Ethiopian languages”
14:15-14:30 Coffee break (off-line)
14:30-15:30 Knowledge exchange session 5: Moses Ekpenyong (University of Uyo), “SCANNAL – Mining Linguistic Corpora for Intelligent Tone Languages Documentation”
DAY 3 – JUNE 2
09:30-10:15 ELPIS workshop, Daan van Esch and Ben Foley
10:30-11:30 One-to-one Q&A sessions on ELPIS
15:00-15:30 Networking tea-time
15:30-16:30 Afternoon booth with Audace Niyonkuru, founder and CEO of DIGITAL UMUGANDA
DAY 4 – JUNE 3
09:00 Mmasibidi Setaka & Juan Steyn (SADiLaR), Tools and workflow for creating ASR-oriented data
09:00-09:30 Introduction: Speech data collection through web interface
09:00-10:00 Hands-on session
10:00-10:15 Coffee break
10:15-11:00 Introduction: Woefzela and similar approaches
11:00-11:15 Coffee break
11:15-12:40 Hands-on session
12:40-13:00 Conclusion
DAY 5 – JUNE 4
TALK 3: Neural attention for Language Documenters: towards explainable linguistics with deep learning – Stephan Raaijmakers, Prof. of Communicative AI at Leiden University
10:00-10:15 Pitch and sum-up of talk
10:15-10:30 Discussions in break-out rooms:
  Room 1: [Resources] The feasibility of creating hand-annotated resources for African (and in general: low-resource) languages
  Room 2: [Community building] Organizing an NLP/AI community for African languages: existing and emerging structures
  Room 3: [Education/training] Educating/training language documenters for working with AI-methods: needs, possibilities, challenges
  Room 4: [Applications] Self-identified applications of neural attention (and other AI) methods for supporting language documenters
10:30-10:45 Plenary discussion with Stephan Raaijmakers
10:45-11:15 Coffee break and networking
11:15-12:15 Knowledge exchange session 6: Technologies for Sign Languages (Victoria Nyst & Manolis Fragkiadakis, Leiden U Centre for Digital Humanities)
12:15-13:30 (Lunch) break (off-line)
TALK 4: AI for Development, Kathleen Siminyu, regional coordinator at AI4D
13:30-14:00 Live talk
14:00-14:15 Discussions in break-out rooms:
  Room 1: Multi-disciplinary collaboration for AI development
  Room 2: Platforms, tools, and competitive challenges for advancing AI research and innovation
  Room 3: Future of work in Africa powered by language tools
  Room 4: Future of education in Africa powered by language tools
14:15-14:30 Plenary discussion with Kathleen Siminyu
14:30-14:45 Coffee break
14:45-15:30 Roundtable and evaluation of the technology tested during the workshop, results, next steps, group reports. Closing.