Kumiko Tanaka-Ishii Group

Mathematical Exploration of Dynamics underlying Symbolic Systems
Language, Financial Markets, and Communication
Social activities such as communication and financial market interactions are inherently symbolic. We explore the universal properties underlying dynamics of large-scale real symbolic systems through mathematical models derived by computing with big data obtained from large-scale resources. Using these models, we explore new ways of engineering to aid human social activities.
Analysis of real symbolic dynamics by applying complex systems theory
- Empirical properties behind symbolic systems
- Mathematical models explaining scaling properties
- Methods for measuring long memory and their mathematical models
- Complexity of symbolic systems
Symbolic systems and deep learning/machine learning methods
- Neural networks that reproduce the empirical properties
- Unsupervised and semi-supervised methods
- Symbolic approaches for non-symbolic data
Mathematical informatics of language, financial markets, and communication
- Computational linguistics
- Financial informatics
- Media analysis
- Mathematics of communication
Computational semiotics
- Essential nature of symbols
- Self-similarity underlying symbolic systems
- Semiotics of non-sign representation
Recent studies
Analysis of real symbolic dynamics by applying complex systems theory
Common physical scaling properties are known to hold across various symbolic systems with dynamics. Using real, large-scale data, we study these properties and construct a mathematical model that explains them.
Metrics that characterize kinds of data
We examine which metrics characterize different kinds of data. In natural language, for example, metrics that identify the author, language, or genre have been studied. One such metric is Yule's K, which is equivalent to Rényi's second-order (plug-in) entropy. The value of Yule's K depends not on the size of the data but only on its kind. We explore such metrics among various statistics related to the scaling properties of real data. They enable quantitative comparison across different kinds of data, such as music, programming language sources, and natural language.
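As a rough illustration of the metric named above, the following sketch computes Yule's K and the plug-in second-order Rényi entropy from token counts. The function names and the whitespace tokenization are assumptions made for this example, and the scale constant 10^4 is only the conventional choice. With p_i = f_i / N, the two quantities are related by K = scale * (sum_i p_i^2 - 1/N), where sum_i p_i^2 = 2^(-H2), so for large N the value of K is determined by H2.

```python
from collections import Counter
import math


def yules_k(tokens, scale=1e4):
    """Yule's K = scale * (sum_i f_i^2 - N) / N^2, with f_i the count of type i."""
    counts = Counter(tokens)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one token")
    sum_sq = sum(f * f for f in counts.values())
    return scale * (sum_sq - n) / (n * n)


def renyi2_plugin(tokens):
    """Plug-in second-order Renyi entropy in bits: -log2(sum_i (f_i / N)^2)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -math.log2(sum((f / n) ** 2 for f in counts.values()))


if __name__ == "__main__":
    # Hypothetical toy text; repeating it grows the size but not the "kind".
    text = "the cat sat on the mat and the dog sat on the log".split()
    for repeat in (1, 10, 100):
        sample = text * repeat
        print(len(sample), yules_k(sample), renyi2_plugin(sample))
```

Running the script on increasingly long repetitions of the same toy text shows K settling toward a fixed value as the 1/N term vanishes, which is the size-independence described above.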
Complexity underlying symbolic systems
How complex are symbolic time series such as language, music, and programs? Consider the number of possibilities for a time series of length n, with a parameter h, as $2^{hn}$. For a random binary series in which ones and zeros are equally likely, h=1. For the 26 characters in English, however, the number of possibilities is not $26^n$, because of various constraints, such as "q" being followed only by "u". Shannon computed that h=1.3, but the question of acquiring a true h for human language is difficult to answer and remains unsolved: it is unknown whether h is even positive. Therefore, we study ways to compute the upper bound of h for various kinds of data, including music, programs, and financial data, in addition to natural language.
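One common way to obtain a rough upper estimate of h, where a series of length n has about 2^(hn) possibilities, is to compress the data and divide the compressed size in bits by the number of symbols, since no lossless code can on average beat the entropy rate. The sketch below is an illustration under that assumption, not necessarily the method used in this work; the function name is hypothetical, and the choice of LZMA from Python's standard library is arbitrary. Container overhead inflates the estimate for short inputs.

```python
import lzma
import random


def compression_bits_per_char(text):
    """Compressed size in bits divided by the number of characters."""
    if not text:
        raise ValueError("text must be non-empty")
    compressed = lzma.compress(text.encode("utf-8"), preset=9)
    return 8 * len(compressed) / len(text)


if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # Uniform random letters: no constraints, so close to log2(26) bits each.
    random_text = "".join(rng.choice(alphabet) for _ in range(20000))
    # Strictly periodic text: heavily constrained, so very few bits each.
    periodic_text = "ab" * 10000
    print("random letters:", compression_bits_per_char(random_text))
    print("periodic text: ", compression_bits_per_char(periodic_text))
```

The estimate only bounds h from above: a better compressor, or more data, can lower it further, which is one reason the true value for natural language remains hard to pin down.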
Analysis of long memory underlying symbolic time series
Real instances of symbolic dynamics have a bursty character, meaning that events occur in a clustered manner. For example, the figure on the right shows how rare events occur over time (the first indicates rarer events than the second; the second, rarer than the third). This clustering phenomenon indicates that the sequence has long memory and thus exhibits self-similarity. We study fluctuation analysis methods for symbolic dynamics to quantify the degree of clustering and examine different self-similarity degrees across various systems.
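A minimal sketch of one simple fluctuation measure is given below; it is an illustrative stand-in, not the specific methods studied here. Rare tokens are marked as events, events are counted in non-overlapping windows, and the standard deviation of the counts is fitted against the window width on a log-log scale. For independent events the slope is close to 0.5, and clustering pushes it higher. All function names, the rarity threshold, and the two-regime toy generator are assumptions made for this example.

```python
import math
import random
from collections import Counter


def rare_event_series(tokens, max_count):
    """Binary series: 1 where the token occurs at most max_count times overall."""
    counts = Counter(tokens)
    return [1 if counts[t] <= max_count else 0 for t in tokens]


def window_count_std(events, width):
    """Standard deviation of event counts over non-overlapping windows."""
    counts = [sum(events[i:i + width])
              for i in range(0, len(events) - width + 1, width)]
    mean = sum(counts) / len(counts)
    return math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))


def fluctuation_exponent(events, widths):
    """Least-squares slope of log(std of counts) against log(window width)."""
    points = []
    for w in widths:
        s = window_count_std(events, w)
        if s > 0:
            points.append((math.log(w), math.log(s)))
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


def bursty_series(n, rng, p_quiet=0.01, p_burst=0.3, switch=0.002):
    """Toy clustered series: event probability switches between two regimes."""
    events, in_burst = [], False
    for _ in range(n):
        if rng.random() < switch:
            in_burst = not in_burst
        p = p_burst if in_burst else p_quiet
        events.append(1 if rng.random() < p else 0)
    return events


if __name__ == "__main__":
    widths = [10, 20, 50, 100, 200, 500]
    rng = random.Random(1)
    independent = [1 if rng.random() < 0.05 else 0 for _ in range(100000)]
    clustered = bursty_series(100000, random.Random(2))
    print("independent:", fluctuation_exponent(independent, widths))
    print("clustered:  ", fluctuation_exponent(clustered, widths))
```

For real text, `rare_event_series(tokens, max_count)` would supply the event series. Note that the toy generator has only a finite regime length, so it mimics clustering over the tested window widths rather than true long memory.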


Symbolic systems and deep learning/machine learning methods
We discuss the potential and limitations of deep learning and other machine learning techniques with respect to the nature of symbolic systems, and we study directions for improvement. Moreover, we explore unsupervised and semi-supervised methods for state-of-the-art learning techniques.
Deep learning and scaling laws
Many difficult problems, such as image recognition and machine translation, are now being solved through deep learning techniques. In these cases, which aspects of real systems do deep learners capture or ignore? We investigate whether scaling laws hold for data generated by a deep learner and seek a new way to evaluate machine learning methods. For example, the figure on the right shows how a character-based long short-term memory (LSTM) network fails to reproduce, in the text it generates, the long memory present in the original text it learned from. Similar considerations apply to financial applications based on deep learning.
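As one hedged example of such an evaluation, the sketch below estimates the rank-frequency (Zipf) exponent of a token sequence, so that the value for an original text could be compared with the value for text generated by a model trained on it. Zipf's law is chosen here only for illustration, since the text above does not name specific laws. The function name is hypothetical, and a plain least-squares fit on the log-log plot is a crude estimator.

```python
import math
from collections import Counter


def zipf_exponent(tokens, max_rank=None):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if max_rank is not None:
        freqs = freqs[:max_rank]
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Synthetic corpus whose r-th most frequent word occurs about 1000 / r times.
    corpus = []
    for rank in range(1, 101):
        corpus.extend(["w%d" % rank] * round(1000 / rank))
    print("estimated exponent:", zipf_exponent(corpus))
```

In use, the same function would be applied to the training text and to the generated text, with a large gap between the two exponents suggesting that the generator misses this property.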
Analysis of a generative adversarial network (GAN)
A generative model is a mathematical formulation that generates samples similar to real data, and many such models have been proposed using machine learning methods. Studying a good model serves both to characterize the nature of a system and to understand the potential of machine learning. We are interested in so-called adversarial methods using deep learning, which are implemented by having two networks compete with each other. We study the fundamental potential of such a model and seek to generate samples resembling real financial market data.
Extraction of templates from texts
Multi-word expressions with slots, or templates, such as "Starting at __ on __" or "regard _ as _", appear frequently in text and also in data from sources such as Twitter. Automatic extraction of these template expressions is related to grammar inference and is a challenging problem. We propose to do this by using a binary decision diagram (BDD), which is mathematically equivalent to a minimal deterministic finite-state automaton (DFA). We have studied a basic formulation and currently seek a larger application to extract patterns from social networking service (SNS) data.

Mathematical informatics of language, financial markets, and communication
We study universal properties underlying language, finance, and communication, through computing with various kinds of large-scale data, and we apply our understanding of those properties to engineering. In addition to domain-specific themes, we also explore multi-disciplinary targets. For example, we study financial market analysis by using blogs and other information sources, and we simulate information spread on a large-scale communication network.
Large-scale simulation of communication network
After the 2011 earthquake in the Tohoku region of Japan, Twitter played a crucial role in helping to search for victims and locate resources. To study the mathematical nature underlying information delivery on social media, we crawled the topology of an SNS on a very large scale, with over 100 million nodes. On this gigantic graph, we explore the best mathematical model of communication via simulation, so that simulated macroscopic statistics, such as the speed and bounds of information spread, agree with those of the real data. We also study the best way to visualize such information spread.
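To make the idea of simulating information spread on a graph concrete, the following sketch runs an independent cascade on a small directed follower graph. The independent cascade model is an assumption chosen for illustration; the text above does not say which communication models are compared. The graph, the seed set, and the transmission probability `p` are hypothetical. The returned history records how many nodes are informed after each round, which is the kind of macroscopic statistic (speed and final reach) that could be compared with real data.

```python
import random


def independent_cascade(graph, seeds, p, rng):
    """Spread from seeds; each newly informed node gets one chance per neighbor.

    graph maps a node to the nodes that receive its messages.
    Returns the set of informed nodes and the informed count after each round.
    """
    informed = set(seeds)
    frontier = list(seeds)
    history = [len(informed)]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Transmission along edge u -> v succeeds with probability p.
                if v not in informed and rng.random() < p:
                    informed.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
        history.append(len(informed))
    return informed, history


if __name__ == "__main__":
    # Hypothetical toy graph; a crawled topology would be loaded here instead.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
    informed, history = independent_cascade(graph, ["a"], 0.5, random.Random(0))
    print(sorted(informed), history)
```

With `p=1.0` the cascade reduces to breadth-first reachability from the seeds, which gives a convenient sanity check before fitting `p` against observed spread.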
Bitcoin price and Twitter
The bitcoin price crash at the beginning of 2018 was caused by various social factors. The influence of news wire stories and social media was especially crucial because credible and fake information were mixed together. We accumulate bitcoin data and analyze the relation between Twitter data and the bitcoin price. In particular, we seek to mine tweets that influence the actual price.

Quantification of grammaticality
How grammatically complex are adults' utterances compared with those of children? Or, how is a literary text structurally more complex than a Wikipedia source? One existing, formal way to consider such questions is through the Chomsky hierarchy, which formulates different complexity levels of grammar through constraints put on rewriting rules. While the hierarchy provides qualitative categorization, we investigate a new way to quantify grammatical complexity by using metrics based on scaling properties. Our method could partly answer the questions raised above.

Computational semiotics
By using semiotic methodology, we philosophically investigate symbolic systems, especially aspects that are difficult to describe only through computational or mathematical means.
Semiotics of scent
Human vision can be explained through basic formulations such as RGB, brightness, and saturation. These correspond both to human physical receptors and to basic words such as "red" and "blue". In contrast, scent cannot be broken down into such basic factors: the human olfactory system has many receptors functioning in a complex manner. As a result, words representing basic scents are very limited, and we instead represent the sense of smell through metaphoric expressions such as "smells like an apple" or "scent of lavender." This suggests that an overall picture of olfactory concepts exists in language expression. We collect olfactory expressions from large corpus data and compare cultural differences in the overall space. Moreover, we consider medical applications for testing basic olfactory capacity.
Self-similarity underlying symbolic systems
A sign is essentially a speculative and reflexive object. Various key questions, such as how a sign is introduced, how it is used and acquires meaning, what kinds of signs exist, and what the nature of sign systems is, are fundamental to understanding the basis of symbolic systems. Focusing on reflexivity, we investigate the nature of symbolic systems and seek to understand their self-similar characteristics.