Kumiko Tanaka-Ishii Group

Mathematical Exploration of Social Complex Systems:
Language, Communication, and Financial Markets
We explore the universal properties underlying large-scale social systems through mathematical models derived by computing with big data gathered from large-scale resources. Using these models, we seek new engineering approaches that aid human social activities.
Analysis of social systems by applying complex systems theory
- Empirical properties behind social systems
- Mathematical models explaining scaling properties
- Methods for measuring long memory and fluctuation
- Complexity of social systems
Deep learning/machine learning methods for complex systems
- Deep learning models that reproduce the scaling properties
- Unsupervised and semi-supervised methods
- Symbolic approaches for non-symbolic data
Mathematical informatics across language, financial markets, and communication
- Computational linguistics
- Mathematics of communication networks
- Newswire reports and financial market prices
Recent studies
Analysis of large-scale social systems by applying complex systems theory
Common scaling properties are known to hold across various large-scale social systems. Using real, large-scale data, we study the nature of these properties and construct a mathematical model that explains them.
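A typical example of such a scaling property is Zipf's law for rank-frequency distributions. The following minimal sketch (in Python; the corpus file name is hypothetical) estimates the Zipf exponent by a least-squares fit in log-log space:

```python
# Minimal sketch: estimate the Zipf exponent of a word rank-frequency
# distribution from a plain-text corpus (the file name is hypothetical).
from collections import Counter
import numpy as np

def zipf_exponent(tokens, max_rank=10000):
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    freqs = freqs[:max_rank]
    ranks = np.arange(1, len(freqs) + 1)
    # Linear fit in log-log space: log f ~ -alpha * log r + c
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

with open("corpus.txt", encoding="utf-8") as f:  # hypothetical input file
    tokens = f.read().split()
print("estimated Zipf exponent:", zipf_exponent(tokens))
```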
Metrics that characterize kinds of data
We consider various metrics in terms of whether they characterize different kinds of data. For example, in the case of natural language, metrics that identify the author, language, or genre have been studied. One such metric is Yule's K, which is equivalent to Rényi's second-order entropy under a plug-in estimate. Yule's K takes a value that does not depend on the data size but only on the kind of data. We explore such metrics among various statistics related to the scaling properties of real data and compare different kinds of data, such as music, programming-language sources, and natural language.
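As a concrete illustration, Yule's K can be computed directly from token frequencies. A minimal sketch, with the conventional scaling constant C = 10^4:

```python
# Minimal sketch: Yule's K for a token sequence.
# K = C * (sum_i f_i^2 - N) / N^2, with the conventional constant C = 10^4,
# where f_i are the token frequencies and N is the total number of tokens.
from collections import Counter

def yules_k(tokens, c=1e4):
    counts = Counter(tokens)
    n = sum(counts.values())
    s2 = sum(f * f for f in counts.values())
    return c * (s2 - n) / (n * n)

# Usage (hypothetical inputs): compare K across data kinds such as natural
# language, program source code, and music encoded as a symbol sequence.
# print(yules_k(open("corpus.txt", encoding="utf-8").read().split()))
```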
Complexity underlying human linguistic sequences
How complex are human linguistic time series such as language, music, and programs? Consider the number of possibilities for a time series of length n to be 2^{hn}, with h as a parameter. For a random binary series consisting of half ones and half zeros, h = 1. For the 26 characters of English, however, the number of possibilities is not 26^n, because of various constraints, such as "q" being followed only by "u". Shannon estimated h to be about 1.3, but obtaining the true h for human language is a difficult question that remains unsolved: it is not even known whether h is positive. We therefore study ways to compute upper bounds of h for various kinds of data, including music and programs in addition to natural language.
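A standard way to obtain an upper bound on h is universal compression: asymptotically, the compression rate in bits per character cannot fall below the entropy rate. The following is a crude sketch of such a bound using a general-purpose compressor; it is an illustration only, not the estimator used in our studies:

```python
# Minimal sketch: upper-bound the entropy rate h of a text by the compression
# rate of a general-purpose compressor. The result is in bits per byte, which
# roughly equals bits per character for ASCII text. This is only a crude
# upper bound; tighter estimators exist.
import lzma

def entropy_rate_upper_bound(text: str) -> float:
    data = text.encode("utf-8")
    compressed = lzma.compress(data, preset=9)
    return 8 * len(compressed) / len(data)

# with open("corpus.txt", encoding="utf-8") as f:  # hypothetical input file
#     print(entropy_rate_upper_bound(f.read()))
```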
Analysis of long memory underlying non-numerical time series
Real instances of social systems have a bursty character, meaning that events occur in a clustered manner. For example, the figure on the right shows how rare events occur over time in texts (the first panel shows rarer events than the second, and the second rarer than the third). This clustering phenomenon indicates that the sequence has long memory and thus exhibits self-similarity. We study methods for quantifying the degree of clustering in non-numerical time series and examine how the degree of self-similarity differs across various systems.
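As one simple example of such a quantity (not necessarily the measure used in our studies), the following sketch treats the occurrences of a rare word as an event series and computes the burstiness parameter of Goh and Barabási from its inter-event intervals:

```python
# Minimal sketch: quantify how bursty a word's occurrences are in a text via
# the Goh-Barabasi burstiness parameter B = (sigma - mu) / (sigma + mu),
# computed on inter-event intervals. B is near 0 for a Poisson-like process
# and positive for clustered (bursty) occurrences.
import numpy as np

def burstiness(tokens, target):
    positions = np.array([i for i, t in enumerate(tokens) if t == target])
    if len(positions) < 3:
        return float("nan")
    intervals = np.diff(positions)
    mu, sigma = intervals.mean(), intervals.std()
    return (sigma - mu) / (sigma + mu)

# Usage (hypothetical input): compare a rare content word with a common
# function word.
# tokens = open("corpus.txt", encoding="utf-8").read().split()
# print(burstiness(tokens, "whale"), burstiness(tokens, "the"))
```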


Deep learning/machine learning methods for complex systems
We discuss the potential and limitations of deep learning and other machine learning techniques with respect to the nature of complex systems, and we study directions for improvement. Moreover, we explore unsupervised and semi-supervised approaches built on state-of-the-art learning techniques.
Deep learning and scaling laws
Many difficult problems, such as image recognition and machine translation, are now being solved with deep learning techniques. In these cases, which aspects of real systems do deep learners capture, and which do they ignore? We investigate whether scaling laws hold for data generated by a deep learner and seek a new way to evaluate machine learning methods. For example, the figure on the right shows how a character-based long short-term memory (LSTM) network fails to reproduce the long memory present in the original text on which it was trained. Similar considerations apply to financial applications based on deep learning.
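One way to carry out such a check is to compare a scaling statistic between the original text and text generated by the trained model. The sketch below (with hypothetical file names) estimates a fluctuation-scaling (Taylor) exponent for both texts; it is offered as an example diagnostic, not as the exact procedure used in our work:

```python
# Minimal sketch: compare a fluctuation-scaling (Taylor) exponent between an
# original text and text generated by a trained language model. For each word,
# compute the mean and standard deviation of its counts over fixed-size
# windows; the exponent is the slope of log(std) against log(mean).
from collections import Counter
import numpy as np

def taylor_exponent(tokens, window=1000):
    counters = [Counter(tokens[i:i + window])
                for i in range(0, len(tokens) - window + 1, window)]
    vocab = set(tokens)
    counts = np.array([[c[v] for c in counters] for v in vocab], dtype=float)
    mean, std = counts.mean(axis=1), counts.std(axis=1)
    keep = (mean > 0) & (std > 0)
    slope, _ = np.polyfit(np.log(mean[keep]), np.log(std[keep]), 1)
    return slope

# original = open("original.txt", encoding="utf-8").read().split()        # hypothetical
# generated = open("lstm_generated.txt", encoding="utf-8").read().split() # hypothetical
# print(taylor_exponent(original), taylor_exponent(generated))
```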
Generative models of complex systems
A generative model is a mathematical formulation that generates samples similar to real data. Many such models have been proposed using machine learning methods, including deep learning. Studying a good model serves both to characterize the nature of a system and to understand the potential of machine learning. We study autoencoders and adversarial methods to understand the fundamental potential of generative models to produce samples resembling real data.
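As a toy illustration of this setting (not one of our actual models), a minimal autoencoder in PyTorch trained to reconstruct its input vectors:

```python
# Toy sketch: a small autoencoder trained to reconstruct its input vectors.
# Real inputs would be encoded text, prices, or other system data; random
# vectors are used here only as a placeholder.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim_in=64, dim_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, 32), nn.ReLU(),
                                     nn.Linear(32, dim_latent))
        self.decoder = nn.Sequential(nn.Linear(dim_latent, 32), nn.ReLU(),
                                     nn.Linear(32, dim_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
data = torch.rand(1024, 64)  # placeholder data

for epoch in range(100):
    reconstruction = model(data)
    loss = loss_fn(reconstruction, data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```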
Extraction of templates from texts
Multi-word expressions with slots, or templates, such as "Starting at __ on __" or "regard __ as __", appear frequently in text, including data from sources such as Twitter. Automatic extraction of such template expressions is related to grammar inference and is a challenging problem. We propose to perform it by using a binary decision diagram (BDD), which is mathematically equivalent to a minimal deterministic finite-state automaton (DFA). We have studied a basic formulation and are now pursuing larger applications, such as extracting patterns from social networking service (SNS) data.
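Our formulation is based on BDDs; as a much simpler illustration of the underlying idea, the following sketch merges frequent n-grams that differ in exactly one position into a slotted pattern (the thresholds and input file are hypothetical):

```python
# Much-simplified sketch of slot-template extraction. The actual formulation
# uses binary decision diagrams; this toy version only merges frequent
# n-grams that differ in exactly one position into a pattern with a slot.
from collections import Counter, defaultdict

def extract_templates(tokens, n=4, min_count=5, min_fillers=3):
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    slots = defaultdict(set)
    for gram, count in ngrams.items():
        if count < min_count:
            continue
        for pos in range(n):
            pattern = gram[:pos] + ("__",) + gram[pos + 1:]
            slots[pattern].add(gram[pos])
    # A pattern counts as a template if its slot is filled by many distinct words.
    return {" ".join(p): sorted(fillers)
            for p, fillers in slots.items() if len(fillers) >= min_fillers}

# tokens = open("tweets.txt", encoding="utf-8").read().split()  # hypothetical input
# for template, fillers in extract_templates(tokens).items():
#     print(template, "<-", fillers[:5])
```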

Mathematical informatics across language, financial markets, and communication
We explore common universal properties underlying language, finance, and communication through computing with various kinds of large-scale data, and we apply our understanding of those properties to engineering across domains. For example, we study financial market analysis using blogs and other information sources, and we simulate information spread on a large-scale communication network.
Large-scale simulation of communication networks
After the 2011 earthquake in the Tohoku region of Japan, Twitter played a crucial role in searching for victims and locating resources. To study the mathematical nature of information delivery on social media, we crawled the topology and tweets of an SNS on a very large scale, with over 100 million nodes. On this gigantic graph, we explore, via simulation, the mathematical model of communication whose simulated macroscopic statistics, such as the speed and bounds of information spread, best agree with those of the real data. We also study the best way to visualize such information spread.
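A toy version of such a simulation is sketched below; a small random graph and a plain independent-cascade rule stand in for the crawled SNS graph and the calibrated communication models used in the actual study:

```python
# Toy sketch: information spread on a follower graph under an independent
# cascade model. The real study uses a crawled SNS graph with over 100 million
# nodes and models calibrated against observed spreading data.
import random
from collections import defaultdict

def random_graph(n_nodes=10000, avg_degree=10, seed=0):
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u in range(n_nodes):
        for _ in range(avg_degree):
            adj[u].append(rng.randrange(n_nodes))
    return adj

def independent_cascade(adj, seeds, p=0.05, seed=1):
    rng = random.Random(seed)
    active, frontier, per_step = set(seeds), list(seeds), []
    while frontier:
        new_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in active and rng.random() < p:
                    active.add(v)
                    new_frontier.append(v)
        per_step.append(len(new_frontier))
        frontier = new_frontier
    return active, per_step  # reached nodes and new activations per step

adj = random_graph()
reached, per_step = independent_cascade(adj, seeds=[0])
print("nodes reached:", len(reached), "new activations per step:", per_step)
```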
Bitcoin price and Twitter
The bitcoin price crash at the beginning of 2018 was driven by various social factors. The influence of newswire stories and social media was especially crucial, because credible and fake information circulated together. We accumulate bitcoin data and analyze the relation between Twitter data and the bitcoin price. In particular, we seek to mine tweets that influence the actual price.
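A simple starting point for such an analysis is the lagged cross-correlation between daily tweet volume and bitcoin returns, sketched below (the CSV file and its column names are hypothetical; the actual study also mines the content of individual tweets):

```python
# Minimal sketch: lagged cross-correlation between daily bitcoin log returns
# and the daily change in tweet volume. The file "btc_twitter_daily.csv" and
# its columns "price" and "tweet_count" are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("btc_twitter_daily.csv", parse_dates=["date"]).set_index("date")
returns = np.log(df["price"]).diff().dropna()
tweets = df["tweet_count"].pct_change().dropna()

for lag in range(0, 8):
    # Positive lag: tweet activity from `lag` days earlier vs. today's return.
    corr = returns.corr(tweets.shift(lag))
    print(f"lag {lag} days: correlation = {corr:.3f}")
```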

Quantification of structural complexity underlying human linguistic sequences
How grammatically complex are adults' utterances compared with those of children? How is a literary text structurally more complex than a Wikipedia article? One existing formal way to consider such questions is the Chomsky hierarchy, which formulates different complexity levels of grammar through constraints on rewriting rules. While the hierarchy provides a qualitative categorization, we investigate a new way to quantify structural complexity by using metrics based on scaling properties. We also try to explain such differences from a complex-network perspective.
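As one example of a scaling-based statistic that could serve as such a quantitative measure (the metrics actually under study may differ), the following sketch estimates the Heaps exponent, which describes how fast vocabulary grows with text length, for two texts to be compared:

```python
# Minimal sketch: estimate the Heaps exponent beta, where the vocabulary size
# grows as V(n) ~ n^beta with text length n, as one candidate scaling-based
# measure for comparing the complexity of texts.
import numpy as np

def heaps_exponent(tokens, n_points=50):
    sizes = np.unique(np.geomspace(100, len(tokens), n_points).astype(int))
    seen, vocab_sizes, j = set(), [], 0
    for n in sizes:
        while j < n:
            seen.add(tokens[j])
            j += 1
        vocab_sizes.append(len(seen))
    beta, _ = np.polyfit(np.log(sizes), np.log(vocab_sizes), 1)
    return beta

# child = open("child_utterances.txt", encoding="utf-8").read().split()  # hypothetical
# adult = open("adult_utterances.txt", encoding="utf-8").read().split()  # hypothetical
# print(heaps_exponent(child), heaps_exponent(adult))
```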