Friday, April 18, 2025

A Simple Environment for Language Acquiring Agents

I made a Gymnasium environment for agents that acquire language. (GitHub)

In the environment, one or two objects (card suites) move around in a scene. The environment outputs a scene representation map and its text description as observation. The scene representation map consists of features (shapes and colors, each represented as a one-hot vector) of objects embedded in a 2D map. The text is in Interlingua. Verbs include: pausa (pauses), va (goes), colpa (hits), and passa (passes). Adjectives indicate the colors of objects. Adverbs indicate the direction of the movement.

An agent that acquires (learns) language from this environment is fed with the observation. It is supposed to associate object descriptions in the text with the object representation in the scene, to learn motion and interaction of objects, and to associate the learned activity representation with predicates in the text.

Sample text in the observation

Trifolio pausa
Trifolio va sub
Trifolio verde colpa Diamante
Trifolio va sup
Trifolio verde va sup

Spada va dextre con Corde
Spada blau va dextre con Corde

Diamante rubie passa Corde
Diamante rubie va sub sinistre
Diamante colpa le muro
Corde jalne passa Diamante

Tuesday, March 18, 2025

Implementation of a Parser without Grammar with Neural Sequence Memory

[日本語版]

Abstract

A parser without grammar was implemented with neural sequence memory. It parses part-of-speech (POS) sequences represented on the sequence memory to create parse trees. It groups frequently occurring POS pairs in the sequence into a binary tree. For the syntactic category of the parent (root) node of a binary tree, it used the POS inferred from the preceding and following POSs, enabling the construction of higher binary trees from pairs of a parent node and a POS or another parent node. Experiments with artificial grammar have shown that the mechanism is capable of primitive parsing.

Introduction

Human languages are known to have constituent structure [Anderson 2022], which is a recursive (tree-like) structure of constituents. The constituent structure is also thought to be the basis for compositional semantics [ibid.]. In natural language processing, parsers with manually constructed grammars have been used to construct constituent structure from word sequences. As manual grammar construction is costly, attempts have also been made to automatically construct grammar (grammar induction) and to build parsers without grammar (unsupervised parsing) [Tu 2021][Mori 1995][Kim 2019]. Since humans learn their native language without explicit grammar instructed, it is not desirable to provide explicit grammar in cognitive modeling. Recent language processing systems based on deep learning have shown remarkable performance in applications. While they do not require explicit grammar, it is not clear whether they use hierarchical constituents or whether compositionality is properly handled. Thus, creating a model that builds constituent structure (parses) without explicitly providing grammar would contribute to research on human cognitive models, and implementing such a model in neural circuits would make it more biologically plausible.

Based on the above, a neural circuit that constructs constituent structures (parses) from part-of-speech (POS) sequences was implemented and is reported below.

Method

A mechanism to construct a tree structure on sequential memory implemented by neural circuits was devised. This mechanism creates a binary tree with frequently occurring POS pairs as child nodes. As the syntactic category of the parent node of a binary tree, it used the POS estimated from the POSs (categories) preceding and succeeding the tree, to construct a higher-level binary tree from a pair of a parent node and another POS or parent node.

Sequence Memory

Neural sequential memory was used. The memory used one-hot vectors as the internal states of the sequence, and a mechanism to store the sequence through associations between internal states and mutual association between input patterns and internal states. Associations (association matrices) are set up by a single stimulus exposure (biologically corresponding to one-shot synaptic potentiation). Similar mechanisms have been reported as the competitive queuing model [Bullock 2003][Houghton 1990][Burgess 1999]

Estimating Syntactic Categories

A POS was used as the syntactic category of the parent node of a binary tree. The POS is inferred from the POSs before and after the tree. For this purpose, a learning model was used to infer a POS from the preceding and following POS. As it is similar to what is known as the CBOW (Continuous Bag of Words) in language models such as Word2Vec, it will be referred to as CBOC (Continuous Bag of Categories) below.

Parsing Algorithm

Read words up to the end of the sentence (EOS) and look up a dictionary to find POSs of the words to create one-hot vectors representing the POSs.

Memorize POS vectors in the sequence memory.

For each consecutive POS pair (head|tail), the activation value is calculated (see below) and assigned to the tail node.

Repeat the following:

For the sequence memory (tail) node with the maximum activation value, a new sequence memory node n is set for the parent node of a binary tree, whose child nodes are the tail node and the preceding node (head).
A POS vector is estimated by the CBOC predictor from the two POS vectors associated with the nodes before and after the head and tail nodes, and is associated with n.
Set links between n and the previous and next nodes
Set links between n and the constituent (child) nodes
Regard (pre|n) and (n|post) as POS pairs to calculate the activation value of n and post.
Set the activation values of child nodes head and tail of n to 0.
Delete the sequence links set for the head and tail nodes.

The implementation of the algorithm is located on GitHub.

Figure 1: Parsing Example

BOS: Beginning of Sentence, Det.: Article, Adj.: Adjective,

IV: Intransitive Verb, EOS: End of Sentence

Categories given to parent nodes: Nd1: Det.+ IV ⇒ Noun,

Nd2: BOS+IV ⇒ Proper Noun, Nd3: BOS+EOS ⇒ None

Black lines: Constituent links, Red lines: Final sequence memory links,

Green dashed line: sequence memory links set initially or midway and deleted

Experiments

Input Sentences and Constituent Structure

Word sequences generated from a context-free grammar (see Appendix) were used as input sentences. To be precise, bracketed constituent structure strings were generated from a context-free grammar, and the parser used word sequences after removing the brackets. Notation for generation probability was used in the grammar, and sentences were generated based on the specified probabilities.

Evaluation Method

The edit distance between the input and output tree structures was used for evaluation. The input-output pairs in the bracketed tree forms were fed to an edit distance calculation tool. Non-terminal symbols were not given neither in the output nor in the output.

The following were used as activity values.

Random values (baseline)
Frequencies of bigram occurrence (number of occurrences ÷ total number)
Bigram conditional probabilities p(head|tail)

Dataset

For cross-validation, a pair of datasets a and b, each containing 5000 sentences generated from the grammar with probabilities, and a pair of datasets A and B, each containing 10,000 sentences generated from a grammar without probabilities, were used. The maximum sentence lengths of datasets a, b, A, and B were 10, 11, 18, and 19, respectively.

Figure 2 shows the frequency of POS bigrams divided by the total number of data and the conditional probabilities (divided by 8 for comparison) for dataset a, sorted by the frequency.

Figure 2: Statistics of POS bigrams (dataset a）

Occ.: Frequency (number of occurrences/total number),

Prob/8: Conditional payment accuracy p(head|tail)/8

Learning

POS occurrences were counted from the generated sentences (Figure 2).

Note: Preliminary experiments have shown that both frequency of occurrence and conditional probability can be approximately approximated by perceptrons. However, both theoretically and practically, it is easier and more accurate to use simple statistical values so that there is no point in using neural learning.

A perceptron with one hidden layer (with PyTorch) was used as the CBOC predictor, and the mean squared error was used as the loss function (after having tried the cross-entropy error). The number of training epochs of the learning device was set to 20.

Results

Cross-validation was performed on each dataset pair. Namely, the system tried to parse the sentences from one of the datasets using the statistics and CBOC model obtained from the other dataset. The input and output trees were compared using an edit distance calculation tool. The cross-validated mean edit distance per word is shown in Table 1. In all datasets, using conditional probabilities as the criterion (activation values) for grouping performed better than using absolute frequencies.

Table 1: Edit distance

Activation values	Random	Frequency	Conditional Probability
Grammar with Probabilities	1.14	0.76	0
Grammar without Probabilities	1.34	0.79	0.65

For datasets based on probabilistic grammars, using conditional probabilities as activation values resulted in correct answers. For datasets based on grammars without probabilities, using conditional probabilities as activation values did not necessarily yield correct answers. For example, the following error occurred:

Correct answer:{{Det {Adj N}} {Adv IV}}

Prediction:{{{Det {Adj N}} Adv} IV}

While it is correct to group Adv:IV first, as its bigram frequency is the same as PN:Adv in the grammar without probabilities, PN:Adv{Det {Adj N}:Adv was grouped first (the estimated category for {Det {Adj N}} should be PN).

Conclusion

Incorporating prior knowledge about binary trees into a neural circuit model enabled parsing depending on the nature of the grammar that generates the sentences. It is to be verified whether natural language sentences have the same properties as the artificial grammar used here. The use of a parsed corpus such as Penn Treebank [Marcus 1999] would be used to investigate the statistical properties of the grammar and the applicability of the current method. For research aimed at language acquisition, CHILDES Treebank [Pearle 2019] based on a CHILDES corpus [Sanchez 2019]) could be used.

The current report does not compare performance with previous research. Since the ways of using and comparing trees in existing research are not uniform, it is necessary to reproduce (re-implement) the previous methods and align the conditions for strict comparison.

The current report deals with relatively simple sentences in context-free grammar. Extra mechanisms may be needed to parse sentences that include conjunctions, and sentences with multiple clauses. Even among sentences with a single clause, sentences containing prepositional phrases are to be examined. Furthermore, dealing with phenomena such as gender and number agreement would require adding grammatical attributes for those phenomena.

While POS sequences were provided as input in this report, it is ultimately desirable for a human cognitive model to be able to derive syntactic structures from acoustic input. Between the acoustic and POS levels, categories such as phonemes and morphemes are assumed to exist. Regarding phonemes, it is thought that categorization of acoustic signals into phonemes is learned in the native language environment. For extracting morphemes from phoneme sequences, statistical methods [Mochihashi 2009] and mechanisms based on patterns of non-phonemic attributes (accent, stress, etc.) can be employed. I have conducted a simple preliminary experiment to derive POS-like categories from word sequences, and confirmed that it was possible.

Considering this attempt as cognitive modeling, its biological plausibility (i.e., whether there is a corresponding mechanism in the brain) would be an issue. In the human brain, parsing is said to be carried out by Broca's area in the left hemisphere, and thus, it is thought to be primarily a function of the cerebral cortex. Since parsing is a process of time series, the cerebral cortex has the ability to handle time series, perhaps with sequence memory. Note that the brain's syntactic analysis is automatic (it does not require conscious effort). From an evolutionary perspective, it would be natural to think that human parsing evolved from a general-purpose mechanism of the cerebral cortex. The basic algorithm in this report consisted of grouping consecutive pairs that frequently occur and suppressing grouped "lower patterns." While it seems to have a cognitive advantage in that it directs attention to pairs of frequent patterns, neuroscientific research is needed to determine whether such a mechanism exists in the brain. Note that the combination of consecutive pairs is called ‘merge’ in generative linguistics, and is considered to be the fundamental process in syntactic processing.

One direction for future research is to link constituent structure with semantic representations to enable the compositional treatment of meaning.

References

[Anderson 2022] Anderson, C., et al.: Essentials of Linguistics, 2nd edition, eCampusOntario (2022)

[Tu 2021] Tu, K., et al.: Unsupervised Natural Language Parsing (Introductory Tutorial), in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts (2021)

[Mori 1995] Mori, S., Nagao, M: Parsing Without Grammar, in Proceedings of the Fourth International Workshop on Parsing Technologies, 174–185. Association for Computational Linguistics (1995)

[Kim 2019] Kim, Y., et al.: Unsupervised Recurrent Neural Network Grammars (2019). https://doi.org/10.48550/arXiv.1904.03746

[Bullock 2003] Bullock, D., Rhodes, B.: Competitive queuing for planning and serial performance. Handbook of Brain Theory and Neural Networks, MIT Press (2003)

[Houghton 1990] Houghton, G.: The problem of serial order: A neural network model of sequence learning and recall. Current research in natural language generation (1990) https://api.semanticscholar.org/CorpusID:59675195

[Burgess 1999] Burgess, N., Hitch, G. J.: Memory for serial order: A network model of the phonological loop and its timing. Psychological Review, 106 (3), 551–581 (1999) https://doi.org/10.1037/0033-295x.106.3.551

[Sanchez 2019] Sanchez, A., et al. childes-db: A flexible and reproducible interface to the child language data exchange system. Behav Res 51, 1928–1941 (2019) https://doi.org/10.3758/s13428-018-1176-7

[Pearle 2019] Pearl, L.S. and Sprouse, J.: Comparing solutions to the linking problem using an integrated quantitative framework of language acquisition, Language 95, no. 4 (2019): 583–611. https://www.jstor.org/stable/48771165, https://doi.org/10.1353/lan.2019.0067

[Marcus 1999] Marcus, M.P., et al.: Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium (1999) https://doi.org/10.35111/gq1x-j780

[Mochihashi 2009] Mochihashi, D., et al.: Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 100–108 (2009)

Appendix

Context-free grammar used

Rules with : to the right of the leftmost category symbol are phrase structure rules, and rules with - are lexical rules.

The decimal value on the left indicates the probability that that rule will be selected among a group of rules that have the same leftmost category. For rules without specifications, the probabilities were determined by subtracting the specified probabilities from 1 and equally distributing the remaining probability.

S: NP + VP

NP : It + N1

N1 : Adj + N1 0.2

N1 : N

NP : PN

VP: Adv + VP 0.2

VP : IV

VP: TV + NP

Det - le, a

N - n1, n2, n3

PN - Pn1, Pn2, Pn3

Adj - adj1, ad2, adj3

Adv - av1, av2, av3

IV - iv1, iv2, iv3

TV - tv1, tv2, tv3

Friday, December 13, 2024

A Neural Model of Rule Discovery with Relatively Short-Term Sequence Memory

I put an article with the title above on arXiv: https://arxiv.org/abs/2412.06839

Abstract: This report proposes a neural cognitive model for discovering regularities in event sequences. In a fluid intelligence task, the subject is required to discover regularities from relatively short-term memory of the first-seen task. Some fluid intelligence tasks require discovering regularities in event sequences. Thus, a neural network model was constructed to explain fluid intelligence or regularity discovery in event sequences with relatively short-term memory. The model was implemented and tested with delayed match-to-sample tasks.

Additional remarks:

It used the neural sequence memory mentioned in the previous post.
It is based on rote sequence memory. Though you may wonder a learning program must make generalization, most fluid intelligence tasks are one-shot and would not require generalization.
To test more general fluid intelligence capabilities, it would be better testing it with visual analogy tasks such as Raven's progressive matrix tests or those found in ARC.

Source code: https://github.com/rondelion/SMA4DM2S_NN

Saturday, November 16, 2024

Implementation of Neural Sequence Memory

I was forgetting to report on the neural sequence memory implemented in May.

GitHub: https://github.com/rondelion/SequenceMemory

Animals (including human beings) can keep things in (short-term) memory for performing tasks. The 'working memory' includes sequence memory; we can memorize sequences of events. Though it seems that it can be implemented with an associative memory that associates an input with the next input, it is not the case, because different input may follow the same input: e.g., A⇒B, A⇒C in ABAACCDAB… Thus, a proper sequence memory must have ‘latent’ states to represent states in sequences.

The specifications of the implementation are as follows:

A latent state is represented as a one-hot vector, which guarantees the independence among the states. The number of states corresponds to the number of events to be memorized.
Latent states have mutual associative links with the input.
Latent states have forward and backward associative links among themselves to represent a sequence.
It memorizes a sequence with a one-shot exposure by the instant reinforcement of the associative links (as in 'short-term potentiation' of synapses).
It can ‘replay’ a sequence with an input stimulus.
Latent states have decaying activation so that the least activated state can be ‘recycled.’

The idea here is similar to the competitive queuing model (see Bullock, 2003; Houghton, 1990; Burgess, 1999).

The figure below shows an input sequence (above) and remembered sequence (bottom):

Thursday, November 7, 2024

CBOW and Part-of-Speech Clustering

Word embeddings used (introduced) in Word2Vec are known to represent semantic clusters of words. While its semantic aspect has been largely focused, as the distribution hypothesis, on which the word embedding is based, is 'syntactic' in the sense it is only concerned with the formal features (distribution) of words, the embedding should represent parts-of-speech (POS) as well. So I made an experiment described below (perhaps, similar experiments may have been done elsewhere, but anyway).

made a small set of CFG grammar (see below) and generated sample sentences.
created embeddings with the continuous bag of words (CBOW) learning.
clustered the embeddings to compare with the 'ground truth' (the word-POS correspondence in the grammar).

Set-up

Number of words (voc. size): 20

Number of POS: 7

Number of sentences: 500 (100 was too small)

CBOW learning and the embedding: set up a simple perceptron like predictor that predicts a word from the two adjacent words. Weights to predict a word from the hidden layer (number of cells: 10) was used as embedding.

Clustering

The figure shows an Isomap clustering of embeddings. Words are clustered according to their parts of speech.

I tried neural network based clustering methods. As a sparse autoencoder did not work for this purpose, I tried a SOM-like method and got the following (number of cells: 14, the same training data as for the CBOW training: 500 sentences/2099 words, one epoch).

Adv 7 [0. 0. 0. 0. 0. 0. 0.13 0. 0. 0. 0. 0. 0. 0. ]

PN 10 [0. 0. 0. 0. 0. 0. 0. 0. 0.06 0.13 0. 0. 0. 0. ]

IV 6 [0. 0. 0. 0. 0. 0.11 0. 0. 0. 0. 0. 0. 0. 0. ]

Adj 2 [0. 0.09 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]

Det 1 [0.18 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]

TV 4 [0. 0. 0. 0.13 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]

N 13 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.06 0.12 0. ]

It shows the correspondence between the cells and parts-of-speech (the second column represents the index of the most correlated cell).

Though clustering does not work always (it depends on the initialized weights), it is confirmed that CBOW embeddings generally represent parts-of-speech in this set-up.

SOM-like learning code:

min_index = np.argmin(((self.weights - feature) ** 2).sum(axis=1))

for i in range(-2, 3):

try:

if i != 0:

self.weights[min_index + i] += self.alpha * (feature - self.weights[min_index + i]) / abs(i)

except:

pass

Grammar

S : NP + VP

NP : Det + N1

N1 : Adj + N

N1 : N

NP : PN

VP : Adv + VP

VP : IV

VP : TV + NP

Det - le, un

N - n1, n2, n3

PN - Pn1, Pn2, Pn3

Adj - aj1, aj2, aj3

Adv - av1, av2, av3

IV - iv1, iv2, iv3

TV - tv1, tv2, tv3

Saturday, March 9, 2024

Implementation of a Simple Visuomotor Environment and Brain-inspired Visuomotor Agent

日本語版

Abstract: As many animals, including humans, make behavioral decisions based on visual information, a cognitive model of the visuomotor system would serve as a basis in intelligence research, including AGI. This article reports on the implementation of a relatively simple system: a virtual environment that displays shapes and cursors and an agent that performs gaze shift and cursor control based on the information from the environment. The visual system is modeled after that of humans with the central and peripheral fields of view, and the agent architecture is based on the structure of the brain.

1. Introduction

This article reports on the implementation of a simple environment and agent architecture for decision making based on visual information, which would serve as part of more generic cognitive models/architectures. It also addresses human ‘active vision,’ where visual information is collected and integrated through gaze shift.

This work adopts a strategy of starting with a relatively simple model. The implemented two-dimensional visual environment displays simple figures and cursors. Figures and a cursor can be moved (dragged) by instructions from the agent.

As for the agent, the following were modeled and implemented, imitating the human visual system.

1) distinction between central and peripheral vision,

2) gaze shift based on salience in the peripheral vision [1],

3) unsupervised learning of shapes captured in the central vision,

4) reinforcement learning of cursor movement and dragging,

5) “surprise” due to changes in the environment caused by actions and habituation due to learning,

6) reward based on “surprise”.

Here 3), 4), and 5) involve learning and are provided with learning models. Agent's action consists of gaze shift and cursor movement + dragging. gaze shift in the model does not learn and is driven by salience.

2. Environment

The environment has a screen divided into an N × N grid (Figure 1). The center of the screen is a "stage" consisting of an M × M grid (M<N). The edges of the stage are marked with border lines. M different shapes are displayed on the stage. The visual information presented to the agent is a color bitmap of the field of view (M × M grid) centered on the gaze. The gaze is located at the center of a grid cell on the stage, and shifted when the environment is given a gaze shift signal (a vector of maximum and minimum values [± M, ± M]). It does not move off the stage. Two cursors of different colors are displayed on the stage. When the environment is given a cursor movement signal (a vector of maximum and minimum [± 1, ± 1]), one of the cursors may move, while it does not move off the stage. If the cursor is superimposed on a figure and the environment is given a non zero cursor move and grab signal, the figure is moved in the same direction and distance as the cursor move (i.e., dragged). Figure 1 shows an example display.

Figure 1: Environment

3. Agent

The agent receives the input of a color bitmap of the field of view from the environment, and outputs gaze shift, cursor movement, and grab signals to the environment. The agent has an architecture consisting of the following modules (Fig.2 – the following parentheses indicate module names in the figure). Salience Calculation Module (Periphery2Saliency), Gaze Shift Module (PriorityMap2Gaze), Central Visual Field Change Prediction Module (FoveaDiffPredictor), Surprise-reward calculation module (SurpriseReward), object recognition module (ObjectRecognizer), and Cursor Control Module (CursorActor). See the figure for connections between modules.

Figure 2: Architecture

The Cursor Control Module uses reinforcement learning rewarded by changes in the external world caused by its own action (contingency detection) [2].

As for correspondence with the brain, the saliency calculation module corresponds to the superior colliculus, the Gaze Shift Module corresponds to the neural circuit from the superior colliculus to the eye, and the Object Recognition Module corresponds to the ‘what path’ of the visual cortex, which performs object identification. As the Central Visual Field Change Prediction Module and the surprise-reward calculation module use the output of the object recognition module, it could correspond to a visual association cortex such as the frontal eye field [3]. The Cursor Control Module would correspond to the motor cortex.

3.1 Salience Calculation Module (Periphery2Saliency)

After reducing the resolution of the input bitmap, it creates a monochrome brightness map corresponding to the peripheral visual field, and adds an edge detection map and a time differential map to it. Though it is said that the log-polar coordinate system is used in human peripheral vision, the ordinary Cartesian coordinates were used for engineering interpretability and amenability with off-the-shelf tools such as the regular CNN.

3.2 Gaze Shift Module (PriorityMap2Gaze)

A gaze shift signal is calculated to move the gaze to the part with maximum saliency based on the saliency (priority) map from the saliency calculation module.

3.3 Object Recognition Module (ObjectRecognizer)

It feeds the bitmap of the central visual field to an unsupervised learner, and outputs the latent variables of the learner.

3.4 Central Visual Field Change Prediction Module (FoveaDiffPredictor)

‘Central visual field change’ refers to the scalar (summed) time difference of the Object Recognition Module output. The module predicts it from the outputs of the Object Recognition Module and Cursor Control Module at the previous time. If a gaze shift has occurred at the previous time, no prediction is made and the output is set to zero (saccade suppression). Prediction is learned, and its output is the prediction error.

3.5 Surprise-Reward Calculation Module (SurpriseReward)

It outputs {scalar (summed) value of time difference of Object Recognition Module output x prediction error (the output of the Central Visual Field Change Prediction Module)}.' The output becomes zero if the prediction error is zero or if there is no time change in the output of the Object Recognition Module.

3.6 Cursor Control Module (CursorActor)

It is a reinforcement learner that observes the output of the Object Recognition Module and outputs the cursor control (movement vector + grab) signal. The reward is the output of the Surprise-Reward Calculation Module.

4 Implementation and Test

The code is located here:

4.1 Environment

The environment was implemented with Python and PyGame. Card game symbols (pips) were used as figures. The initial positions of figures and cursors are at random for each episode (the initial position of the cursor controlled by the agent was set on a figure).

4.2 Agent

The agent was implemented with Python and BriCA (Brain-inspired Computing Architecture)[4], a computational platform for developing brain-inspired software. As BriCA supports modular architecture development, the reuse of the implementation in more complex architectures could be easier. With the BriCA platform, architectural design is first specified in a spreadsheet and then converted into an architecture description language (BriCA language). At runtime, the interpreter loads and executes the BriCA language description. BriCA modules exchange numerical vector signals in a token-passing manner. PyTorch was used as a machine learning platform.

Salience Calculation Module (Periphery2Saliency)

It reduces the resolution of the input bitmap, calculates a monochrome brightness map corresponding to the peripheral visual field, and adds an edge detection map and a time differential map to the brightness map with preconfigured weights.

Gaze Shift Module (PriorityMap2Gaze)

It computes the ‘priority map’ by 1) adding random noise to the output of the saliency calculation module (salience map), and 2) adding the priority map at the previous time multiplied by the damping coefficient. The gaze shift signal is calculated so that the gaze moves to the field of view corresponding to the part with the maximum value in the priority map.

Object recognition module (ObjectRecognizer)

βVAE (from Princeton U.: code) was used after kinds of autoencoders had been compared as unsupervised learners. The choice was made with the expectation that the number of output dimensions would be relatively small and it provides interpretable (distangled) latent variables.

Central Visual Field Change Prediction Module (FoveaDiffPredictor)

It predicts scalar changes in the central visual field from the output of the Object Recognition Module and Cursor Control Module at the previous time, and outputs the prediction error. A three-layer perceptron was used as a predictor.

Surprise-Reward Calculation Module (SurpriseReward)

It outputs {the scalar value of the time difference of the Object Recognition Module output × prediction error (Central Visual Field Change Prediction Module output)}.

Cursor Control Module (CursorActor)

It uses a cerebral cortex/basal ganglia loop model [5] (code), based on the hypothesis that the cerebral cortex predicts actions through learning, and the basal ganglia determines (through reinforcement learning) whether to perform the action. The implemented basal ganglia model determines whether or not it is possible to perform it based on the given observation data and type of action (Go/NoGo) through reinforcement learning. Meanwhile, the cortical model initially selects the type of action at random, and as the learning of the basal ganglia model progresses, it begins to predict and present the type of action performed from observational data. The used reinforcement learning algorithm was DQN (Deep Q-Network).

4.3 Experiments (Tests)

Experiments (tests) and learning were performed by modules starting from the area closest to the visual input.

Salience Calculation Module and Gaze Shift Module

These modules do not depend on other modules and do not perform learning. They were qualitatively tested with their own environment (Vision1Env.py), where circles with multiple colors, intensities, and sizes were presented in the field of view. Gaze shift was observed and parameters parameters (e.g., intensity, edge, time differential weight for saliency map calculation) were adjusted by the developer.

Object Recognition Module

All combinations of images that would appear in the central visual field were fed to the βVAE (with the number of latent variables=10) to be trained (TrainFovea_VAE.py). While original images were generally reconstructed after about 10,000 episodes, the latent (disentangled) variables corresponding to the elements in the images were not found.

Central Visual Field Change Prediction Module

The three-layer perceptron was trained to predict changes in the central visual field from the outputs of the Object Recognition Module and of the Cursor Control Module except for immediately after saccades. The loss became zero around episode 150.

Surprise-Reward Calculation Module

The multiplication was performed correctly (no learning is performed in this module).

Cursor Control Module

It was trained to output the cursor control (movement vector +grab) signal by observing the output of the Object Recognition Module and rewarded by the output of the Surprise-Reward Calculation Module (the Central Visual Field Change Prediction Module had not been trained).

The amount of reward acquired was tripled compared to random trials (average reward 0.12) (Fig.3).

Figure 3: Cursor Control Module learning results

Horizontal axis: number of episodes

Vertical axis: average reward (average of 5 trials)

5. Conclusion

The article reported on the implementation of an environment that displays shapes and cursors on the screen, and an agent that moves the eye and controls the cursor based on visual information.

Tasks that utilize gaze shift (active vision tasks) have been developed elsewhere. DeepMind has developed PsychLab with tasks using gaze shift [6]^*1. The image recognition learning task using gaze shift is part of what is called object centric learning (👉 review). Working memory tasks such as oculomotor delayed response tasks^*2 use gaze shift. Papers [7] and [8] propose biologically plausible models of active vision.

In this article, learning was performed using “surprise'' or prediction errors as reward, which is a regular way in unsupervised learning. Learning about changes in the environment due to one's own actions (contingencies) through prediction errors or “surprises'' appears as a theme in psychology [2]. There are various studies related to surprise, exploratory behavior, and curiosity [9][10][11](chapter 3).

Papers [12] and [13] provide neural models similar to that in this article, though more specific ([12] does not model central/peripheral vision as it is concerned with the rat).

When controlling gaze shift using reinforcement learning, it would be necessary to explicitly model the frontal eye field as the corresponding region of the brain (the model would have a mechanism similar to the Cursor Control Module). The representation of the scene consisting of kinds of objects and their location (presumably integrated around the hippocampus) would also be required in tasks using gaze shift.

A model of areas around the hippocampus is important for the recognition of scene sequences, as the hippocampus is also said to be responsible for episodic memory. The model of the prefrontal cortex would be required for working memory tasks, as the region is said to be involved in it.

Finally, the environment was implemented having in mind the modeling of visual understanding of other people's actions and language acquisition presupposing such understanding. Thus, what additional structures will be needed for those models shall be studied.

*1: In this hackathon, a subset of tasks from PsychLab was used.

*2: In this hackathon, a match-to-sample task that requires working memory and gaze shift was used.

References

[1] Veale, et al.: How is visual salience computed in the brain? Insights from behaviour, neurobiology and modelling, Phil. Trans. R. Soc. B, 372(1714) (2017). https://doi.org/10.1098/rstb.2016.0113
[2] Hiraki, K.: Detecting contingency: A key to understanding development of self and social cognition, Japanese Psychological Research, 48(3) (2006).
https://doi.org/10.1111/j.1468-5884.2006.00319.x
[3] Ferrera, V. and Barborica, A.: Internally Generated Error Signals in Monkey Frontal Eye Field during an Inferred Motion Task, Journal of Neuroscience, 30 (35) (2010). https://doi.org/10.1523/JNEUROSCI.2977-10.2010
[4] Kouichi Takahashiet al.: A Generic Software Platform for Brain-inspired Cognitive Computing, Procedia Computer Science, 71 (2015). https://doi.org/10.1016/j.procs.2015.12.185
[5] Arakawa, N.: Implementation of a Model of the Cortex Basal Ganglia Loop, ArXiv (2024). https://doi.org/10.48550/arXiv.2402.13275
[6] Leibo, J., et al.: Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents, ArXiv (2018) https://doi.org/10.48550/arXiv.1801.08116
[7] Hoang, K. et al.: Active vision: on the relevance of a bio-inspired approach for object detection, Bioinspiration & Biomimetics, 15(2) (2020).
https://doi.org/10.1088/1748-3190/ab504c
[8] McBride, S., Huelse, M., and Lee, M.: Identifying the Computational Requirements of an In- tegrated Top-Down-Bottom-Up Model for Overt Visual Attention within an Active Vision System. PLoS ONE 8(2) (2013). https://doi.org/10.1371/journal.pone.0054585
[9] Oudeyer P.Y., Kaplan , F., and Hafner, V.: Intrinsic Motivation Systems for Autonomous Mental Development, IEEE Transactions on Evolutionary Computation, 11(2). (2007). https://doi.org/10.1109/TEVC.2006.890271

[10] Schmidhuber, H.: Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010), IEEE Transactions on Autonomous Mental Development, 2(3) (2010). https://doi.org/10.1109/tamd.2010.2056368
[11] Cangelosi, A., et al.: Developmental Robotics: From Babies to Robots, MIT Press (2015) https://doi.org/10.7551/mitpress/9320.001.0001
[12] Fiore V., et al.: Instrumental conditioning driven by neutral stimuli: A model tested with a simulated robotic rat, in Proceedings of the Eighth International Conference on Epigenetic Robotics (2008).
[13] Santucci, V.G., et al.: Biological Cumulative Learning through Intrinsic Motivations: A Sim- ulated Robotic Study on the Development of Visually-Guided Reaching, in Proceedings of the Tenth International Conference on Epigenetic Robotics (2010).

Tuesday, October 3, 2023

AutoEncoder-based Predictor (implementation)

I have been 'playing around' with autoencoder implementations to realize 'a predictor,' as the principal function of the neocortex is supposed to be prediction. I tried a simple autoencoder and a sparse autoencoder from a cerenaut repository and a β-VAE implementation from a project repository of Princeton University (see the explanatory article). I chose the β-VAE, for I'll use it to model the association cortex, where the use of CNN may not be appropriate (the β-VAE does not use CNN but only Linear layers). (And the simple one may not be potent enough.)

I constructed a predictor with the encoder, decoder, and autoencoder factory from the repository with a single modification in the decoder setting. Namely, the predictor differs only with the decoder output setting; while the autoencoder predicts encoder input, the predictor predicts other input.

The implementation is found here: https://github.com/rondelion/AEPredictor

A test result with MNIST rotation (to predict rotated images) is shown below after 100 epochs of training: