This system lets you search and view lexical change over time for tens of thousands of Castilian (Spanish) words, covering the period 1900-2009. Its data source is a set of semantic representations built from the Spanish Google n-grams (45 billion n-grams).
The user enters a word and a time period, and the system returns the meaning of the word for each year in the target range. The meaning of a word is represented by the set of words most similar to it in semantic and distributional terms. For example, in 1910 the word "cancer" is closely linked with "tuberculosis" and "syphilis", but by 1960 its closest terms are "tumor" and "carcinoma".
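The lookup described above can be sketched roughly as follows. This is only an illustration: the names `NEIGHBOURS` and `meaning_over_time` and the nested-dictionary layout are assumptions, not the demo's actual code, and the toy index contains only the two years and four neighbours mentioned in the example.

```python
from typing import Dict, List

# Hypothetical toy index: word -> year -> nearest neighbours (most similar first).
# Only the two years named in the text are filled in; the real data covers 1900-2009.
NEIGHBOURS: Dict[str, Dict[int, List[str]]] = {
    "cancer": {
        1910: ["tuberculosis", "syphilis"],
        1960: ["tumor", "carcinoma"],
    },
}


def meaning_over_time(word: str, start: int, end: int) -> Dict[int, List[str]]:
    """Return the neighbours of `word` for every year in [start, end] that has data."""
    by_year = NEIGHBOURS.get(word, {})  # unknown words yield an empty result
    return {year: by_year[year] for year in sorted(by_year) if start <= year <= end}


# Query the word "cancer" over the first half of the century.
print(meaning_over_time("cancer", 1900, 1950))
```

Returning a mapping keyed by year keeps the result easy to plot along a time axis, which is one plausible way a front end could consume it.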
The system's input is a data structure in which each word is associated, for each year, with other words and their degree of similarity (cosine). These data were recently generated by the PronLNat@GE team (Pablo Gamallo, Marcos Garcia) using Natural Language Processing techniques and modules. The team semantically processed 45 billion n-grams, which became available after more than 1 million books were scanned for "Google Books". Semantic processing consists of transforming the n-grams into 'word-context' distributional matrices. One matrix was generated per year, in which each word is represented as a vector of its contexts. Finally, the similarity between vectors (words) was calculated and, for each word, the 20 most similar words per year were selected. In total, a data structure of more than 300 MB was generated as the input for this demo.
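The last two steps of this pipeline (cosine similarity between word vectors, then keeping the most similar words) can be illustrated with a minimal sketch. It assumes sparse vectors stored as context-to-weight dictionaries; the function names, the `matrix_1960` variable, and its counts are invented for illustration and do not come from the team's modules or data.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # context -> co-occurrence weight (assumed representation)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(word: str, matrix: Dict[str, Vector], k: int = 20) -> List[Tuple[str, float]]:
    """Rank every other word in one year's matrix by similarity to `word`; keep the top k."""
    target = matrix[word]
    scored = [(other, cosine(target, vec)) for other, vec in matrix.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy word-context matrix for a single year (not real n-gram counts).
matrix_1960: Dict[str, Vector] = {
    "cancer": {"malignant": 3.0, "patient": 2.0},
    "tumor": {"malignant": 2.0, "patient": 1.0},
    "bread": {"bake": 4.0},
}

print(top_k_similar("cancer", matrix_1960, k=2))
```

Running `top_k_similar` once per word and per yearly matrix would yield a structure of the kind described above: for each word and year, a short ranked list of neighbours with their cosine scores.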
Authors
- Researchers
- Pablo Gamallo Otero
- Iván Rodríguez Torres
- Marcos Garcia González