Candidate:
Gabriel de Jesus
Date, Time and Location:
1 September 2025, 14:30, Sala de Atos, Faculdade de Engenharia da Universidade do Porto
President of the Jury:
Rui Filipe Lima Maranhão de Abreu (PhD), Full Professor, Department of Informatics Engineering, Faculdade de Engenharia da Universidade do Porto
Members:
Arjen P. de Vries (PhD), Full Professor, Institute for Computing and Information Sciences, Radboud Universiteit, Nijmegen, The Netherlands;
Bruno Emanuel da Graça Martins (PhD), Associate Professor, Department of Electrical and Computer Engineering, Instituto Superior Técnico da Universidade de Lisboa;
Henrique Daniel de Avelar Lopes Cardoso (PhD), Associate Professor, Department of Informatics Engineering, Faculdade de Engenharia da Universidade do Porto;
Sérgio Sobral Nunes (PhD), Associate Professor, Department of Informatics Engineering, Faculdade de Engenharia da Universidade do Porto (Supervisor).
Abstract:
Ensuring access to information in all languages is crucial for bridging disparities in communities’ participation in the digital age and fostering a more inclusive and equitable society, particularly for speakers of low-resource languages. However, enabling such access remains a significant challenge for many of these communities. Tetun, a language that transitioned from a dialect to one of Timor-Leste’s official languages when the country restored its independence in 2002, faces similar challenges. According to the 2015 census, Tetun is spoken by approximately 79% of the country’s population of 1.18 million. Despite its official status, Tetun remains underserved in language technology. Specifically, no information retrieval solutions exist for the language, making it challenging to find relevant information through text-based search in Tetun on the internet and digital platforms.
This work tackles these challenges by investigating retrieval strategies for text-based search that can enable the application of information retrieval techniques to develop search solutions for Tetun, with a specific focus on the ad-hoc text retrieval task. Given that language-specific algorithms, tools, and document collections for Tetun were previously unavailable, this work began by creating these foundational resources, which serve as contributions to the information retrieval and natural language processing domains. These resources include a tokenizer, a language identification model, a stemmer, a stopword list, a document collection, a test collection, baselines for the ad-hoc text retrieval task, and a search log dataset. The contributions to information retrieval for low-resource languages include: (1) a data collection pipeline tailored for low-resource languages to streamline the construction of textual data from the web; (2) a human-in-the-loop methodology for annotating, processing, and constructing a dataset well-suited for a variety of information retrieval and natural language processing tasks; (3) a novel network-based approach for stopword detection; (4) methodologies for developing a stemmer designed for a language heavily influenced by loanwords, and for constructing a ground truth set to evaluate stemmer performance; (5) a detailed approach for constructing a test collection to evaluate the effectiveness of retrieval systems; (6) a methodology for establishing a robust baseline for the ad-hoc text retrieval task; and (7) document contextualization and dual-parameter tuning strategies for hybrid text retrieval. The results from this work contribute to the development of technologies associated with the computational processing of Tetun, address gaps in its linguistic resources, and achieve impactful outcomes that elevate Tetun’s status. These advancements open new opportunities for future research and innovation.
Moreover, this work introduces promising methodologies that can be adapted to other languages facing similar challenges, thereby contributing to the broader advancement of information retrieval for low-resource languages.