Speaker
Affiliation
Leipzig
Abstract
In the past, language processing has predominantly relied on either explicit rule-based knowledge or implicit knowledge learned from annotations. In contrast, I introduce the Structure Discovery paradigm: a framework for learning structural regularities from large samples of text data and for making these regularities explicit by adding them to the data via self-annotation.
Working in this paradigm means setting up discovery procedures that operate on raw language material and iteratively enrich the data, using the annotations produced by previously applied Structure Discovery processes.
Since graph representations are an intuitive way of encoding linguistic entities and their relations as nodes and edges, I will talk about some graph characteristics typically found in representations of language data. To perform the abstractions and generalisations needed for Structure Discovery, I introduce the Chinese Whispers graph clustering algorithm. This algorithm is very efficient and can partition graphs with millions of nodes in a short time.
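To give a rough idea of how a label-propagation clustering of this kind can work, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation presented in the talk: the function name `chinese_whispers`, its parameters and the toy edge list are hypothetical, and keeping a node's current class on a tie is a simplification chosen for this sketch. Every node starts in its own class, and nodes are then visited in random order, each adopting the class with the highest summed edge weight among its neighbours.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, max_iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples.

    Hypothetical helper for illustration; returns a dict node -> class label.
    """
    rng = random.Random(seed)

    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    neighbours = defaultdict(dict)
    for u, v, weight in edges:
        neighbours[u][v] = weight
        neighbours[v][u] = weight

    # Every node starts in its own class.
    labels = {node: node for node in neighbours}
    nodes = list(neighbours)

    for _ in range(max_iterations):
        rng.shuffle(nodes)  # visit nodes in random order
        changed = False
        for node in nodes:
            # Sum the edge weights per class in the node's neighbourhood.
            scores = defaultdict(float)
            for neighbour, weight in neighbours[node].items():
                scores[labels[neighbour]] += weight
            best = max(scores.values())
            candidates = sorted(l for l, s in scores.items() if s == best)
            # Assumption of this sketch: keep the current class on a tie.
            if labels[node] in candidates:
                continue
            labels[node] = rng.choice(candidates)
            changed = True
        if not changed:  # stop early once no node changes its class
            break
    return labels


# Toy graph: two triangles joined by one weak edge (c, d).
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 0.1),
]
labels = chinese_whispers(edges)
```

On this toy graph the two triangles end up in separate classes, because the weak bridging edge never outweighs the edges inside a triangle. Each pass touches every edge a constant number of times, which is one reason an approach like this can scale to large graphs.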
I will then present some practical applications that follow the Structure Discovery paradigm: a solution for language separation, an unsupervised part-of-speech (PoS) tagger and a word sense induction system.
If time allows, I will talk about possible further work, especially regarding emergent language generation models that reproduce characteristics found by Structure Discovery processes.