Mathematical Exploration of Dynamics Underlying Symbolic Systems
Language, Financial Markets, and Communication
Social activities such as communication and financial market interactions are inherently symbolic.
We explore the universal properties underlying
the dynamics of large-scale real symbolic systems through mathematical models derived by computing
with big data obtained from large-scale resources. Using these models,
we explore new ways of engineering to aid human social activities.
Analysis of real symbolic dynamics by applying complex systems theory
Empirical properties behind symbolic systems
Mathematical models explaining scaling properties
Methods for measuring long memory and their mathematical models
Complexity of symbolic systems
Symbolic systems and deep learning/machine learning methods
Neural networks that reproduce the empirical properties
Unsupervised and semi-supervised methods
Symbolic approaches for non-symbolic data
Mathematical informatics of language, financial markets, and communication
Computational linguistics
Financial informatics
Media analysis
Mathematics of communication
Computational semiotics
Essential nature of symbols
Self-similarity underlying symbolic systems
Semiotics of non-sign representation
Recent studies
Analysis of real symbolic dynamics by applying complex systems theory
Common physical scaling properties are known to hold across various symbolic systems with dynamics. Using real, large-scale
data, we study these properties and construct a mathematical model that explains them.
Metrics that characterize kinds of data
Various metrics are considered in terms of whether they characterize different kinds of data.
For example, in the case of natural language, metrics that specify the author, language, or
genre have been studied. One such metric is Yule's K, which is
equivalent to Rényi's second-order (plug-in) entropy. Yule's K yields a value
that depends not on the data size but only on the kind of data.
We explore such metrics among various statistics related to scaling properties
of real data. They enable quantitative comparison across different kinds of data,
such as music, programming language sources, and natural language.
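
As a rough illustration of such a metric, the sketch below computes Yule's K with the textbook formula K = 10^4 * (sum_m m^2 V(m) - N) / N^2, where N is the sequence length and V(m) is the number of distinct elements occurring exactly m times. The function name `yules_k` and the toy input are assumptions made for illustration; this is not presented as the laboratory's implementation.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V(m) - N) / N^2 (textbook form)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must be non-empty")
    freqs = Counter(tokens)
    # Summing f^2 over distinct elements equals summing m^2 * V(m) over m.
    sum_sq = sum(f * f for f in freqs.values())
    return 1e4 * (sum_sq - n) / (n * n)


# Toy input (hypothetical): any sequence of hashable symbols works,
# e.g. words, musical notes, or program tokens.
print(yules_k(["a", "b", "a", "c"]))  # 1250.0
```

The sum of squared frequencies is the same quantity that appears in a plug-in estimate of second-order Rényi entropy, which is why the two are related.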
Complexity underlying symbolic systems
How complex are symbolic time series such as language, music, and programs?
Consider the number of possibilities for a time series of length n, with a parameter h,
as 2^{hn}. For a random binary series in which ones and zeros are equally likely,
h = 1. For the 26 characters in English, however, the number of possibilities
is not 26^{n},
because of various constraints, such as "q" being followed almost exclusively by "u". Shannon estimated
h to be about 1.3, but the question of acquiring a true h for human language is difficult to
answer and remains unsolved: it is unknown whether h is even positive.
Therefore, we study ways to compute the upper bound of h for various kinds of data, including
music, programs, and financial data, in addition to natural language.
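
One common way to obtain an estimate of h from above is to compress the data and count the bits spent per character. The sketch below uses `zlib` from the Python standard library purely as an example compressor; the function name and the sample string are assumptions, and a general-purpose compressor on a short sample gives only a loose bound.

```python
import math
import zlib


def compressed_bits_per_char(text: str) -> float:
    """Bits per character after zlib compression (a loose upper estimate of h)."""
    if not text:
        raise ValueError("text must be non-empty")
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)  # level 9 = strongest zlib compression
    return 8 * len(packed) / len(text)


# If all 26 letters were equally likely and independent, h would be log2(26).
uniform_rate = math.log2(26)

# Hypothetical sample; real experiments would use a large corpus.
sample = "the quick brown fox jumps over the lazy dog " * 200
print(compressed_bits_per_char(sample), uniform_rate)
```

The same function can be applied to music, program sources, or discretized financial series once they are encoded as strings, which allows a crude comparison across kinds of data.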
Analysis of long memory underlying symbolic time series
Real instances of symbolic dynamics have a bursty character, meaning that events occur in a clustered manner. For example,
the figure on the right shows how rare events occur over time (the first indicates rarer events than the second;
the second, rarer than the third). This clustering phenomenon indicates that the sequence has long memory and
thus exhibits self-similarity. We study fluctuation analysis methods for symbolic dynamics to quantify the degree
of clustering and examine different degrees of self-similarity across various systems.
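
To make the idea of clustering concrete, the sketch below measures the intervals between consecutive occurrences of a rare event and summarizes them with the burstiness coefficient B = (sigma - mu) / (sigma + mu), where mu and sigma are the mean and standard deviation of the intervals. This coefficient is a simple statistic from the literature on bursty processes, used here as a stand-in for the fluctuation analysis methods mentioned above; the function names and toy sequences are assumptions.

```python
from statistics import mean, pstdev


def return_intervals(events):
    """Gaps between consecutive positions where the event occurs (True)."""
    positions = [i for i, e in enumerate(events) if e]
    return [b - a for a, b in zip(positions, positions[1:])]


def burstiness(intervals):
    """B = (sigma - mu) / (sigma + mu): -1 for regular, near 0 for random, up to 1 for bursty."""
    if not intervals:
        raise ValueError("need at least two event occurrences")
    mu = mean(intervals)
    sigma = pstdev(intervals)
    return (sigma - mu) / (sigma + mu)


# Toy sequences (hypothetical): one perfectly regular, one clustered.
regular = [i % 5 == 0 for i in range(100)]
clustered = [i in {0, 1, 2, 3, 100, 101, 102, 103} for i in range(200)]
print(burstiness(return_intervals(regular)))    # -1.0
print(burstiness(return_intervals(clustered)))  # positive: events are clustered
```

For text, `events` could mark the positions of words whose frequency falls below a chosen threshold, with lower thresholds corresponding to rarer events.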
Symbolic systems and deep learning/machine learning methods
We discuss the potential and limitations of deep learning and other machine learning techniques with respect to
the nature of symbolic systems, and we study directions for improvement. Moreover, we explore unsupervised
and semi-supervised methods for state-of-the-art learning techniques.
Deep learning and scaling laws
Many difficult problems, such as image recognition and machine translation,
are now being solved through deep learning techniques. In these cases, which aspects of real systems
do deep learners capture or ignore? We investigate whether scaling laws
hold for data generated by a deep learner and seek a new way
to evaluate machine learning methods.
For example, the figure on the right shows how
a character-based long short-term memory (LSTM) fails to generate text with the long memory present
in the original text on which it was trained. Similar considerations
apply to financial applications based on deep learning.
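
One simple check of this kind is to compare a rank-frequency (Zipf) exponent between the original data and data generated by a learner. The sketch below estimates the exponent by least squares on the log-log rank-frequency plot; this is a crude estimator chosen for brevity, and the function name and synthetic input are assumptions rather than the evaluation procedure used in the studies described here.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Least-squares slope (sign-flipped) of log frequency against log rank."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct symbols")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic input (hypothetical): symbol r occurs 1200 / r times,
# i.e. an exact power law with exponent 1.
synthetic = [r for r in range(1, 7) for _ in range(1200 // r)]
print(zipf_exponent(synthetic))
```

Running the same function on a training corpus and on text sampled from a trained model gives two exponents to compare; a large gap suggests the model has not captured that scaling property.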
Analysis of a generative adversarial network (GAN)
A generative model is a mathematical formulation that
generates a sample similar to real data, and many such models
have been proposed using machine learning methods.
Study of a good model serves to characterize the nature of a system and
also to understand the potential of machine learning.
We are interested in so-called adversarial methods using deep learning,
which are implemented by having two networks compete against each other.
We study the fundamental potential of such a model
and seek to generate a sample resembling real financial market data.
Extraction of templates from texts
Multiword expressions with slots, or templates, such as
"Starting at __ on __ " or the expression "regard _ as _"
appear frequently in text and also in data from sources such as Twitter.
Automatic extraction of these template expressions is related to grammar inference
and is a challenging problem. We propose to do this by using a binary decision diagram (BDD),
which is mathematically equivalent to a minimal deterministic finite-state automaton (DFA).
We have studied a basic formulation and currently seek a larger application to extract patterns
from social networking service (SNS) data.
Mathematical informatics of language, financial markets, and communication
We study universal properties underlying language, finance, and communication, through computing with
various kinds of large-scale data, and we apply our understanding of those properties to engineering. In addition
to domain-specific themes, we also explore multidisciplinary targets. For example, we study financial market
analysis by using blogs and other information sources, and we simulate information spread on a large-scale
communication network.
Large-scale simulation of a communication network
After the 2011 earthquake in the Tohoku region of Japan, Twitter played a crucial role in the search
for victims and in locating resources. To study the mathematical nature underlying information delivery on social
media, we crawled the topology of an SNS on a very large scale, with over 100 million nodes. On this gigantic
graph, we search via simulation for the mathematical model of communication whose simulated
macroscopic statistics, such as the speed and bounds of information spread, best agree with those of the real data.
We also study the best way to visualize such information spread.
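
As a minimal sketch of what such a simulation can look like, the code below runs an independent cascade on a small follower graph: each newly informed node passes the message to each uninformed follower with probability p, and the number of newly informed nodes per step is recorded. The independent cascade model, the function name, and the toy graph are all assumptions for illustration; the text does not state which models were actually compared.

```python
import random


def simulate_spread(graph, seed_node, p, rng):
    """Independent cascade: returns the number of newly informed nodes per step.

    graph maps each node to the list of nodes that receive its messages.
    """
    informed = {seed_node}
    frontier = [seed_node]
    history = [1]  # step 0: only the seed node is informed
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in informed and rng.random() < p:
                    informed.add(v)
                    next_frontier.append(v)
        if next_frontier:
            history.append(len(next_frontier))
        frontier = next_frontier
    return history


# Toy graph (hypothetical): node 0 reaches 1 and 2, which both reach 3.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
history = simulate_spread(toy_graph, 0, 0.5, random.Random(42))
print(history, sum(history))
```

The per-step counts give the speed of spread and their sum gives its final reach, which are the kinds of macroscopic statistics that can be compared against real data when choosing among candidate models.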
Bitcoin price and Twitter
The bitcoin price crash at the beginning of 2018 was caused by
various social factors. The influence of news wire stories
and social media was especially crucial because
credible and fake information circulated together.
We accumulate bitcoin data and analyze the relation between Twitter data and the bitcoin price.
In particular, we seek to mine tweets that influence the actual price.
Quantification of grammaticality
How grammatically complex are adults' utterances as compared with those of children?
Or, how much more structurally complex is a literary text than a Wikipedia source?
One existing, formal way to consider such questions is
through the Chomsky hierarchy, which formulates different complexity levels of grammar
through constraints put on rewriting rules.
While the hierarchy provides qualitative categorization,
we investigate a new way to quantify grammatical
complexity by using metrics based on scaling properties.
Our method could partly answer the questions raised above.
Computational semiotics
By using semiotic methodology, we philosophically investigate symbolic systems, especially aspects that are
difficult to describe only through computational or mathematical means.
Semiotics of scent
Human vision can be explained through basic formulations such as RGB, brightness, and
saturation.
These correspond both to human physical receptors and also to
basic words such as "red" and "blue".
In contrast, scent cannot be broken down into such basic factors:
the human olfactory system has many receptors functioning in a complex manner.
As a result, words representing basic scents are very limited, and
instead, we represent the sense of smell through metaphoric expressions such as "smells like an apple" or
"scent of lavender." This suggests that an overall picture of olfactory concepts exists
in language expression. We collect olfactory expressions from large corpus data
and compare cultural differences in the overall space.
Moreover, we consider medical applications for testing basic olfactory capacity.
Self-similarity underlying symbolic systems
A sign is essentially a speculative and reflexive object.
Various key questions, such as how a sign is introduced, how it is used
and acquires meaning, what kinds of signs
exist, and what is the nature of sign systems,
are fundamental to understanding the basis of symbolic systems.
Focusing on reflexivity, we investigate
the nature of a system with symbolic systems and gain understanding
of its self-similar characteristics.

