Extracção de relações semânticas. Recursos, ferramentas e estratégias

Title: Extracção de relações semânticas. Recursos, ferramentas e estratégias (Extraction of semantic relations. Resources, tools and strategies)
Author: Marcos García González
Supervisor: Pablo Gamallo Otero
Type: Doctoral thesis
Defense date: 15/12/2014
Place of defense: Universidade de Santiago de Compostela
Abstract: Relation extraction is a subtask of information extraction that aims at obtaining instances of semantic relations present in texts. This information can be arranged into machine-readable formats, useful for several applications that need structured semantic knowledge. This thesis explores different strategies to automate the extraction of semantic relations from texts in Portuguese, Spanish and Galician. Both machine-learning (distantly supervised and supervised) and rule-based techniques are investigated, and the impact of the different levels of linguistic knowledge is analyzed for the various approaches. Regarding domains, the experiments focus on the extraction of encyclopedic knowledge, by means of the development of biographical relation classifiers (in a closed domain) and the evaluation of open information extraction systems. In order to implement the extraction systems, several natural language processing tools have been built for the three research languages: from sentence splitting and tokenization modules to part-of-speech taggers, named entity recognizers and coreference resolution systems. Furthermore, several lexica and corpora have been compiled and enriched with different levels of linguistic annotation, which are useful for both training and testing probabilistic and rule-based models. As a result of the work carried out in this thesis, new resources and tools are available for automated processing of texts in Portuguese, Spanish and Galician.