We describe the universal nature of language through mathematical models obtained by computationally analyzing big data drawn from large-scale language resources. Using these new models, we create communication software applications.
Computational/Mathematical Modeling of Language
Estimation of Complexity of Language
Mathematical Models Underlying Sentence Structure
Computational Universals of Language
Natural Language Processing
Unsupervised Morphological/Syntactic/Semantic Analysis
Machine Learning Methods for Language Processing
Computer Assisted Language Learning
Mining from Social Network Systems
Information Retrieval and Extraction
Web Document Processing
Algorithms for Large Scale Retrieval/Extraction
Studies of language have leapt into a new era through the increased availability of various language data, in massive quantities. We seek universal properties of language and describe them via mathematical models. The models are investigated using various corpora and language features.
Text Constancy Measures
A constancy measure characterizes a given text by taking an invariant value for any text size above a certain threshold. The study of such measures has a 70-year history dating back to Yule's K, with the original intended application of author identification. We mathematically and empirically examined various measures proposed since Yule and reconsidered the reports made so far, thus providing an overview of the study of constancy measures. Constancy measures are applied to a variety of texts, including texts in different natural languages, programming languages and unknown scripts, and the complexity of natural language is investigated.
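As a minimal sketch of one such measure, Yule's K can be computed from word frequencies alone using its standard definition, K = 10^4 * (sum over m of m^2 * V(m) - N) / N^2, where N is the number of tokens and V(m) is the number of distinct words occurring exactly m times. The function names and the whitespace tokenization in the usage note are assumptions made for illustration, not the code used in the study.

```python
from collections import Counter


def yules_k(tokens):
    """Return Yule's K for a list of tokens (standard definition, C = 10^4)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_freq = Counter(tokens)                 # word -> frequency m
    freq_of_freq = Counter(word_freq.values())  # m -> V(m)
    s2 = sum(m * m * v for m, v in freq_of_freq.items())
    return 1e4 * (s2 - n) / (n * n)


def k_on_prefixes(tokens, sizes):
    """Evaluate K on growing prefixes to check how constant the value stays."""
    return [(s, yules_k(tokens[:s])) for s in sizes if 0 < s <= len(tokens)]
```

For a real text, `k_on_prefixes(text.split(), [1000, 10000, 100000])` gives a rough view of whether K stabilizes as the text grows.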
Statistical Properties of Articulation
"When the complexity of a subsequent token increases, the location is at a context border." This phenomenon was first described by Zellig S. Harris, in his paper, "From phoneme to morpheme," in 1955. We have reformulated his hypothesis from a more information-theoretic viewpoint and verified it in various languages at different levels, from phonemes to morphemes, from morphemes to words, and from words to phrases. Generally, the hypothesis holds quite well in articulating a larger linguistic unit from a sequence formed of smaller units. This property has the potential for application to building unsupervised segmentation software.
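One way to illustrate the information-theoretic reading of this hypothesis is to measure the entropy of the unit that follows a fixed-length context and to place a boundary wherever that entropy rises. The sketch below is a simplified illustration under our own assumptions (a fixed context length `n`, raw counts without smoothing, and hypothetical function names); it is not the procedure used in the verification described above.

```python
import math
from collections import Counter, defaultdict


def successor_counts(corpus, n):
    """Count which unit follows each length-n context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - n):
            counts[seq[i:i + n]][seq[i + n]] += 1
    return counts


def branching_entropy(counts, context):
    """Entropy (in bits) of the unit following `context`; 0.0 if unseen."""
    followers = counts.get(context)
    if not followers:
        return 0.0
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


def segment(text, counts, n=2):
    """Cut `text` at every position where the branching entropy increases."""
    cuts, prev = [], None
    for i in range(n, len(text)):
        h = branching_entropy(counts, text[i - n:i])
        if prev is not None and h > prev:
            cuts.append(i)  # uncertainty about the next unit rose: boundary
        prev = h
    pieces, start = [], 0
    for c in cuts:
        pieces.append(text[start:c])
        start = c
    pieces.append(text[start:])
    return pieces
```

The same functions work on strings of characters or phonemes; segmenting word sequences into phrases would need tuples of words as contexts instead of substrings.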
Log Frequency and Familiarity
The influence of quantity on the cognitive perception of linguistic units is studied by measuring the correlation between the frequency, obtained from various corpora, and the word familiarity, obtained through psychological experiments. In the accompanying figure, the plotted points represent words, with the horizontal axis indicating familiarity and the vertical axis indicating log frequency, as measured for 2 terabytes of data. The log frequency and familiarity correlate well, and a high frequency is a necessary condition for a word to be familiar. Such results show how word familiarity is formed through the Weber-Fechner law. The larger the corpus, the higher the correlation. Also, speech corpus data correlate better with familiarity than do writing corpus data. Currently, these results are applied in statistical readability studies. This work has been conducted with Hiroshi Terada.
Natural Language Processing, Information Retrieval and Extraction
Language technologies such as speaking and writing have long been attributed to humans alone, and the study of language has been considered part of the humanities. Today, elements of these language technologies require processing by machines that can handle immense amounts of language data. We thus study key technologies needed for such language processing.
Detection of Changes Within a Text
Recent texts are often composites of different kinds of text. For example, many wiki documents are multilingual, consisting of a part in one language and another part in English. Another example is plagiarism, where part of a text was previously authored by someone other than the stated author. We seek mathematical methods for detecting such changes in text styles, through the use of information theory and statistical outlier detection. This research has been conducted with Hiroshi Yamaguchi.
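To make the idea of an information-theoretic change detector concrete, the sketch below compares the character n-gram distributions of two adjacent sliding windows using the Jensen-Shannon divergence and reports the position where the divergence peaks. The choice of divergence, the window scheme, and all names are our assumptions for illustration only; they are not the methods developed in this research. The window is assumed to be at least `n` characters long.

```python
import math
from collections import Counter


def ngram_dist(text, n=2):
    """Relative frequencies of character n-grams in `text`."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def most_likely_change_point(text, window=20, n=2):
    """Return (position, score) where adjacent windows differ the most."""
    best_pos, best_score = None, -1.0
    for pos in range(window, len(text) - window + 1):
        left = ngram_dist(text[pos - window:pos], n)
        right = ngram_dist(text[pos:pos + window], n)
        score = js_divergence(left, right)
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, best_score
```

A single peak is only a starting point: deciding whether a peak is a genuine change rather than noise is where an outlier test over the whole sequence of scores would come in.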
Deterministic Tree-based Parsing
Supervised parsing has been extensively studied and forms the basis for semi-supervised/unsupervised methods. Given the contrast between global optimization and deterministic methods, it is interesting to ask whether all (qualitatively different) supervised parsing methods have already been developed. Through such analysis, Kotaroh Kitagawa proposed a common way to enhance previous deterministic parsing methods by changing the unit of processing from a word to a tree. This naturally adds local search to deterministic parsing methods, thus taking advantage of both global optimization and determinism.
Sorting Texts by Readability
For efficient language learning, it is crucial to read texts of the appropriate language learning level. Readability evaluation has a history of more than 50 years, and recent approaches use machine learning. Specifically, there have been two main approaches: regression and categorization. In contrast, we have devised a new method: readability by sorting. Here, machine learning is applied to produce a comparator that judges which of two given texts is relatively more difficult. With this comparator, a set of texts is sorted and the readability of a given text is modeled as its rank among the sorted texts. The same method is applicable to other text scoring problems. This work was conducted with Satoshi Tezuka and Hiroshi Terada.
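The sorting scheme can be sketched in a few lines once a comparator is available. In the sketch below the comparator is a deliberately crude stand-in (mean word length) so that the example runs on its own; in the method described above it would be a model produced by machine learning. All function names are hypothetical.

```python
from functools import cmp_to_key


def mean_word_length(text):
    """Crude difficulty proxy used only to make this sketch runnable."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def compare_difficulty(a, b):
    """Stand-in comparator: negative if `a` is easier, positive if harder."""
    da, db = mean_word_length(a), mean_word_length(b)
    return (da > db) - (da < db)


def sort_by_readability(texts, compare=compare_difficulty):
    """Order texts from easiest to hardest using only pairwise judgments."""
    return sorted(texts, key=cmp_to_key(compare))


def rank_of(text, sorted_texts, compare=compare_difficulty):
    """Binary-search the rank of a new text among already sorted texts."""
    lo, hi = 0, len(sorted_texts)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare(sorted_texts[mid], text) < 0:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Because only pairwise judgments are needed, placing a new text among N sorted texts takes about log2(N) comparator calls, and swapping in a different comparator reuses the same code for other text scoring problems.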
Today, a variety of people use a variety of languages through a variety of devices. There is a strong need for software applications that aid communication, both among humans and between humans and various devices. Focusing on language, we seek useful software applications that can aid human language processing and communication.
Logue: Speech Analysis for Everyone
Logue is a system to help users discover and correct problems in their speech. It is demonstrated as a smartphone application that listens to a user's voice, estimates speech features such as speed and enunciation clarity, and provides real-time graphical feedback.
There are very few people who would claim to have perfect speech. Depending on the speaker, speech can be too loud, too fast, mumbled, and so on. However, it can be difficult to be aware that these problems exist in one's speech, and even then it is difficult to shake these bad habits. Our aim is to create an automated, objective system that can identify these problems, and prompt the user when they emerge.
Logue applies its own set of speech analysis methods. These are lightweight, to allow real-time feedback on resource-limited platforms such as smartphones, and intelligent, to reliably estimate high-level, abstract features such as "enunciation clarity". Evaluations to date have shown that these methods can be effective at identifying speech problems and helping users correct them.
This system was implemented by Daniel Heffernan, graduating in 2013 from the Dept. of Creative Informatics, IST, University of Tokyo. The system will be available via the iOS App Store some time in 2013.
PicoTrans: An Icon-driven User Interface for Machine Translation
PicoTrans is a user interface for travelers, which integrates the popular notion of a picture book with a statistical machine translation system that can translate arbitrary word sequences. The simple paradigm of pointing at pictures is used as the primary method of user input, so the device can be used as if it were a picture book. The result is fed to a module that translates the sentence into another language. We have developed a prototype system that inherits many of the positive features of both approaches, while at the same time mitigating their main weaknesses. PicoTrans is studied with Wei Song and Andrew Finch, along with Eiichiro Sumita of NICT. We won the Best Paper Award at IUI 2011.
Kanji Lookup for Everybody: Kansuke
The Kansuke kanji lookup method is not based on the arbitrary conventions of how ideograms are drawn, but rather, on a code consisting of three variables: the numbers of horizontal, vertical, and other strokes. For example, the code for the ideogram "東" (higashi, meaning east) is four horizontal strokes, three vertical strokes, and two other strokes. With such codes, a non-native learner of Japanese or Chinese can look up ideograms even with no knowledge of the ideographic conventions used by natives. This study has been done with Julian Godon. Our presentation on this software won the Presentation Award at the annual conference of the Association for NLP, Japan, in 2007.
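A lookup by such codes amounts to an index from (horizontal, vertical, other) triples to ideograms. The sketch below shows the idea with a tiny hand-made table. Only the code for "東" comes from the description above; the codes for "三" and "十" are our own assumption, obtained by counting their strokes in the same way, and the names `KANSUKE_CODES`, `build_index`, and `lookup` are hypothetical rather than part of the actual software.

```python
from collections import defaultdict

# Codes are (horizontal, vertical, other) stroke counts.
KANSUKE_CODES = {
    "東": (4, 3, 2),  # given in the text above
    "三": (3, 0, 0),  # assumed for illustration
    "十": (1, 1, 0),  # assumed for illustration
}


def build_index(codes):
    """Invert the kanji -> code table so that a code maps to candidate kanji."""
    index = defaultdict(list)
    for kanji, code in codes.items():
        index[code].append(kanji)
    return index


def lookup(index, horizontal, vertical, other):
    """Return every kanji whose code matches the three stroke counts."""
    return index.get((horizontal, vertical, other), [])
```

With a full dictionary, many ideograms would share one code, so a practical interface would present the matching candidates as a list for the learner to choose from.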