BigNLP: Approaching High Performance Computing to Big Data Technologies: Natural Language Processing as Case Study

Processing large amounts of text is a complex task that requires several interrelated linguistic modules organized into subtasks. The main problems of language processing techniques are their high computational cost and limited scalability, which make them impractical for analyzing large volumes (gigabytes or even terabytes) of documents. At the same time, the latest approaches to corpus linguistics follow the "Web as Corpus" research line, which postulates that using more text and data always yields better results.

For this reason, we consider that high performance computing and Big Data-oriented strategies are a natural solution to the limited computational efficiency of language processing modules. Moreover, the relative simplicity of the modular processes and the independence among the linguistic input units (sentences, paragraphs, texts, ...) can facilitate the integration of NLP modules into high performance computing systems using Big Data technologies.
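The independence among input units can be illustrated with a minimal Python sketch. It is not part of the project's tooling: `count_tokens` is a hypothetical stand-in for an NLP module, and a local thread pool stands in for the workers of a real Big Data framework. The only point is that each sentence is processed without reference to any other, so the work can be split freely.

```python
from concurrent.futures import ThreadPoolExecutor


def count_tokens(sentence: str) -> int:
    """Hypothetical stand-in for an NLP module: handles one sentence in isolation."""
    return len(sentence.split())


def process_corpus(sentences, workers=4):
    # Each sentence is independent of the others, so the calls need no
    # coordination and can be distributed across any number of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(count_tokens, sentences))


corpus = [
    "Big Data meets NLP .",
    "Sentences are independent units .",
    "So they parallelize well .",
]
counts = process_corpus(corpus)
```

In a real deployment the thread pool would be replaced by processes or cluster nodes, since Python threads do not speed up CPU-bound work; the structure of the computation stays the same.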

Objectives

The main goal of the project is to develop a new set of tools and software solutions for Big Data processing, which will allow the integration of a set of multilingual modules for natural language processing into a parallel and scalable suite. This suite must process large amounts of text in reduced execution times and, at the same time, make efficient use of the target high performance systems, paying special attention to heterogeneous architectures. In particular, we will consider modules for Multiword extraction, Syntactic parsing, Triple extraction, Coreference resolution and Sentiment analysis. Note that the new NLP modules could be used in more complex, higher-level linguistic applications such as machine translation, information retrieval, question answering, or even new intelligent systems for technological surveillance and monitoring. In addition, the tools resulting from the project will be general purpose, so they could be applied to codes and applications from any research area.