Kumiko Tanaka-Ishii Group

Mathematical/Computational Exploration of Social Complex Systems
Language, Communication, and Financial Markets
We explore the universal properties underlying large-scale social systems through mathematical models derived by computing with big data obtained from large-scale resources. Using these models, we explore new ways of engineering to aid human social activities.
Analysis of social systems by applying complex systems theory
- Empirical properties behind social complex systems
- Mathematical models explaining scaling properties
- Methods for measuring long memory and fluctuation
- Quantification of complexity of social systems
Deep learning/machine learning methods for social complex systems
- Deep learning models that reproduce the scaling properties
- Unsupervised and semi-supervised methods
- Symbolic approaches for non-symbolic data
Mathematical informatics across language, financial markets, and communication
- Computational linguistics
- Mathematics of communication networks
- Influence of newswire reports on financial market prices
- Analysis of financial data using deep learning
Recent studies
Analysis of large-scale social systems by applying complex systems theory
Common scaling properties are known to hold across various large-scale social systems. Using real, large-scale data, we study the nature of these properties from viewpoints such as complexity, degree of fluctuation and self-similarity, and construct a mathematical model that explains them.
Metrics that characterize kinds of data
Various metrics are considered in terms of whether they characterize different kinds of data. For example, in the case of natural language, metrics that identify the author, language, or genre have been studied. One such metric is Yule's K, which is equivalent to Rényi's second-order (plug-in) entropy. Yule's K yields a value that depends not on the data size but only on the kind of data. We explore such metrics among various statistics related to the scaling properties of real data and compare different kinds of data such as music, programming language source code, and natural language.
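As an illustration of the metric named above, the following is a minimal sketch of Yule's K using its standard textbook definition, K = 10^4 (S2 - N) / N^2, where N is the sequence length and S2 is the sum of squared type frequencies. The function names `yules_k` and `renyi2_plugin` are hypothetical and chosen for this example only; the group's own implementation may differ.

```python
import math
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (S2 - N) / N^2, with S2 the sum of squared type counts."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty sequence")
    s2 = sum(f * f for f in Counter(tokens).values())
    return 1e4 * (s2 - n) / (n * n)

def renyi2_plugin(tokens):
    """Plug-in Renyi entropy of order 2 (in nats): -log(sum of squared relative frequencies)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty sequence")
    return -math.log(sum((f / n) ** 2 for f in Counter(tokens).values()))
```

The two quantities are tied by K / 10^4 + 1/N = exp(-H2), so for large N the value of K is governed by the sum of squared relative frequencies rather than by N itself.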
Quantification of structural complexity underlying real world time series
How grammatically complex are adults' utterances compared with those of children? How is a literary text structurally more complex than a Wikipedia source? How can this complexity be compared with that of a music performance or of programming language source code? One existing, formal way to consider such questions, proposed in linguistics, is the Chomsky hierarchy, which formulates different complexity levels of grammar through constraints placed on rewriting rules. While the hierarchy provides a qualitative categorization, it cannot be used to compare the structural complexity of time series quantitatively. We investigate a new way to quantify structural complexity by using metrics based on scaling properties.

Analysis of long memory underlying non-numerical time series
Real instances of social systems have a bursty character, meaning that events occur in a clustered manner. For example, the figure on the right shows how rare events occur over time in texts (the first indicates rarer events than the second; the second, rarer than the third). This clustering phenomenon indicates that the sequence has long memory and thus exhibits self-similarity. We study methods for non-numerical time series to quantify the degree of clustering and examine different degrees of self-similarity across various systems.
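One simple way to put a number on such clustering is sketched below. It is not necessarily the method used by the group: it swaps in the Goh-Barabási burstiness coefficient, B = (σ - μ) / (σ + μ), computed over the gaps between successive rare events. The function names and the `max_count` threshold that defines a "rare" symbol are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev

def rare_event_intervals(seq, max_count):
    """Gaps between successive positions whose symbol occurs at most max_count times in seq."""
    counts = Counter(seq)
    positions = [i for i, s in enumerate(seq) if counts[s] <= max_count]
    return [b - a for a, b in zip(positions, positions[1:])]

def burstiness(intervals):
    """Goh-Barabasi coefficient: -1 for periodic events, near 0 for Poisson, toward 1 for bursty."""
    if len(intervals) < 2:
        raise ValueError("need at least two intervals")
    mu = mean(intervals)
    sigma = pstdev(intervals)  # population standard deviation of the gaps
    return (sigma - mu) / (sigma + mu)
```

For a tokenized text `words`, `burstiness(rare_event_intervals(words, max_count=2))` would summarize how strongly the rarest words cluster; comparing against a shuffled copy of `words` gives a baseline without long memory.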


Deep learning/machine learning methods for complex systems
We discuss the potential and limitations of deep learning and other machine learning techniques with respect to the nature of complex systems, and we study directions for improvement. Moreover, we explore unsupervised and semi-supervised methods for state-of-the-art learning techniques.
Deep learning and scaling laws
Many difficult problems, such as image recognition and machine translation, are now being solved through deep learning techniques. In these cases, which aspects of real systems do deep learners capture or ignore? We investigate whether scaling laws hold for data generated by a deep learner and seek a new way to evaluate machine learning methods. For example, the figure on the right shows how a character-based long short-term memory (LSTM) fails to generate text with the long memory present in the original text on which it was trained. Similar considerations apply to financial applications based on deep learning.
Generative models of complex systems
A generative model is a mathematical formulation that generates samples similar to real data. Many such models have been proposed using machine learning methods, including deep learning. Studying a good model serves both to characterize the nature of a system and to understand the potential of machine learning. We study various time series models, including classical Markov models, grammatical models, the Simon process, random walks on networks, neural models, auto-encoders, and adversarial methods. We examine the fundamental properties of these generative models in terms of whether they can generate samples resembling real data.
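To make one of the listed models concrete, here is a minimal sketch of a Simon process under its usual textbook formulation: at each step a brand-new symbol is introduced with probability `new_prob`, and otherwise a uniformly chosen earlier element of the sequence is copied, which favors symbols that are already frequent. The function name, parameter names, and the use of integers as symbols are choices made for this example.

```python
import random

def simon_process(length, new_prob, seed=None):
    """Generate a symbol sequence by a Simon process (rich-get-richer copying)."""
    if length <= 0:
        return []
    rng = random.Random(seed)
    seq = [0]          # the sequence starts with a single symbol
    next_symbol = 1    # label for the next never-seen symbol
    while len(seq) < length:
        if rng.random() < new_prob:
            seq.append(next_symbol)       # introduce a new symbol
            next_symbol += 1
        else:
            seq.append(rng.choice(seq))   # copy a past element, proportional to frequency
    return seq
```

The rank-frequency distribution of a long sample, for example `simon_process(100_000, 0.1, seed=0)`, can then be compared with that of real data to check which scaling properties the model reproduces and which it misses.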
Unsupervised extraction of templates from texts
Multi-word expressions with slots, or templates, such as "Starting at __ on __" or "regard _ as _", appear frequently in text and also in data from sources such as Twitter. Automatic extraction of these template expressions is related to grammar inference and is a challenging problem. We propose to do this by using a binary decision diagram (BDD), which is mathematically equivalent to a minimal deterministic finite-state automaton (DFA). We have studied a basic formulation and are currently seeking a larger application to extract patterns from social networking service (SNS) data with the additional use of deep learning methods.
Mathematical informatics across language, financial markets, and communication
We explore common universal properties underlying language, finance, and communication, through computing with various kinds of large-scale data, and we apply our understanding of those properties to engineering across domains. For example, we study financial market analysis by using blogs and other information sources, and we simulate information spread on a large-scale communication network.
Influence of textual data and communication structure on financial prices
The bitcoin price crash at the beginning of 2018 was caused by various social factors. The influence of newswire stories and social media was especially crucial because credible and fake information were mixed together and amplified on social media. We accumulate financial data, including various stock and bitcoin prices, and analyze the influence of the communication structure and textual data.


Entropy rate of human symbolic sequences
We explore the complexity underlying human symbolic sequences via entropy rate estimation. Consider the number of possibilities for a time series of length $n$ to be $2^{hn}$, with a parameter $h$. For a random binary series in which ones and zeros occur with equal probability, $h=1$. For the 26 characters of English, however, the number of possibilities is not $26^n$, because of various constraints, such as "q" being followed only by "u". Shannon estimated $h \approx 1.3$ bits per character, but the question of acquiring a true $h$ for human language is difficult to answer and remains unsolved: it is unknown whether $h$ is even positive. Therefore, we study ways to compute an upper bound of $h$ for various kinds of data, including music, programs, and market data, in addition to natural language.
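A common, rough way to obtain such an upper-bound estimate is to compress the data with a lossless compressor and divide the compressed size in bits by the sequence length. The sketch below uses Python's standard `zlib` module purely for illustration; it is not claimed to be the estimator used by the group, and the function name `compression_bits_per_char` is hypothetical. Stronger compressors generally give tighter bounds, and the estimate is only meaningful for long inputs.

```python
import zlib

def compression_bits_per_char(text):
    """Compressed size in bits per input character, a crude upper-bound estimate of h."""
    if not text:
        raise ValueError("empty text")
    data = text.encode("utf-8")
    compressed = zlib.compress(data, 9)  # level 9 = strongest zlib compression
    return 8 * len(compressed) / len(text)
```

Applied to increasingly long prefixes of the same source, the value typically decreases with length, which is one reason why the true value of h is hard to pin down from finite data.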