Candidate:
Felermino Dário Mário António Ali
Date, Time and Location:
20 February 2026, 14:00, Sala de Atos da Faculdade de Engenharia da Universidade do Porto
President of the Jury:
Pedro Nuno Ferreira da Rosa da Cruz Diniz (PhD), Full Professor, Faculdade de Engenharia da Universidade do Porto
Members:
Maarit Tuulikki Koponen (PhD), Professor at the School of Humanities of the Philosophical Faculty of the University of Eastern Finland (Finland);
Maria Luísa Torres Ribeiro Marques da Silva Coheur (PhD), Associate Professor, Departamento de Engenharia Informática, Instituto Superior Técnico da Universidade de Lisboa;
Sérgio Sobral Nunes (PhD), Associate Professor, Departamento de Engenharia Informática, Faculdade de Engenharia da Universidade do Porto;
Henrique Daniel de Avelar Lopes Cardoso (PhD), Associate Professor, Departamento de Engenharia Informática, Faculdade de Engenharia da Universidade do Porto (Supervisor).
The thesis was co-supervised by Rui Manuel Sousa Silva (PhD), Assistant Professor at Faculdade de Letras da Universidade do Porto.
Abstract:
“This research explores the underrepresentation of low-resource languages in the field of machine translation, with a specific focus on Emakhuwa, the most widely spoken local language in Mozambique. Despite having over 7 million native speakers, Emakhuwa remains underrepresented in both academia and technology due to a lack of digital resources and linguistic tools. To fill this gap, we have developed the first significant machine translation resources for the Portuguese–Emakhuwa language pair. Our contributions include the creation of a parallel corpus through the manual translation of journalistic texts, the digitisation of existing materials, and the translation of established machine translation evaluation benchmarks. We evaluated three central strategies to improve machine translation performance in this low-resource setting: (1) transfer learning using multilingual and Africa-centred models, (2) data augmentation through back-translation, and (3) integration of external linguistic resources such as loan glossaries and bilingual dictionaries. The results show that encoder-decoder models, particularly translation-optimised architectures such as NLLB and M2M-100, perform as well as or better than larger decoder-only models while maintaining computational efficiency. Back-translation offers modest improvements, and the integration of loanwords and dictionary resources, especially in the Portuguese–Emakhuwa direction, significantly improves translation quality, especially with the use of LLMs. This work lays the foundation for future research in NLP for underrepresented languages and demonstrates practical paths for the development of machine translation systems in resource-limited contexts.”









