Friday, December 13, 2024

A Neural Model of Rule Discovery with Relatively Short-Term Sequence Memory

I put an article with the title above on arXiv: https://arxiv.org/abs/2412.06839

Abstract: This report proposes a neural cognitive model for discovering regularities in event sequences. In a fluid intelligence task, the subject is required to discover regularities from relatively short-term memory of the first-seen task. Some fluid intelligence tasks require discovering regularities in event sequences. Thus, a neural network model was constructed to explain fluid intelligence or regularity discovery in event sequences with relatively short-term memory. The model was implemented and tested with delayed match-to-sample tasks.

Additional remarks:

  • It used the neural sequence memory mentioned in the previous post.
  • It is based on rote sequence memory.  Though you may wonder whether a learning program must generalize, most fluid intelligence tasks are one-shot and would not require generalization.
  • To test more general fluid intelligence capabilities, it would be better to test it with visual analogy tasks such as Raven's Progressive Matrices or those found in ARC.

Saturday, November 16, 2024

Implementation of Neural Sequence Memory

I had forgotten to report on the neural sequence memory I implemented in May.

GitHub: https://github.com/rondelion/SequenceMemory

Animals (including human beings) can keep things in (short-term) memory while performing tasks.  This 'working memory' includes sequence memory; we can memorize sequences of events.  It may seem that sequence memory can be implemented with an associative memory that associates an input with the next input, but this is not the case, because different inputs may follow the same input: e.g., A⇒B, A⇒C in ABAACCDAB…  Thus, a proper sequence memory must have ‘latent’ states to represent the states within sequences.

The specifications of the implementation are as follows:

  • A latent state is represented as a one-hot vector, which guarantees the independence among the states.  The number of states corresponds to the number of events to be memorized.
  • Latent states have mutual associative links with the input.
  • Latent states have forward and backward associative links among themselves to represent a sequence.
  • It memorizes a sequence with a one-shot exposure by the instant reinforcement of the associative links (as in 'short-term potentiation' of synapses).
  • It can ‘replay’ a sequence with an input stimulus.
  • Latent states have decaying activation so that the least activated state can be ‘recycled.’
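
The specifications above can be illustrated with a minimal sketch.  It is not the code in the repository: the class name, the dictionary-based links, the decay value, and the rule that replay starts from the oldest state matching the cue are all assumptions made for illustration.  Each latent state is an index (standing in for a one-hot vector), and a single exposure sets the links at full strength.

```python
class OneShotSequenceMemory:
    """Toy sequence memory with index-coded (one-hot) latent states.

    A sketch under stated assumptions, not the repository implementation.
    """

    def __init__(self, n_states, decay=0.9):
        self.n_states = n_states
        self.decay = decay                    # assumed decay factor per step
        self.activation = [0.0] * n_states    # decaying activation per latent state
        self.state_to_input = {}              # latent state -> input (associative link)
        self.forward = {}                     # latent state -> next latent state
        self.backward = {}                    # latent state -> previous latent state
        self.prev = None                      # latent state used at the previous step

    def _recycle(self):
        # Reuse the least activated latent state and clear its old links.
        s = min(range(self.n_states), key=lambda i: self.activation[i])
        self.state_to_input.pop(s, None)
        nxt = self.forward.pop(s, None)
        if nxt is not None:
            self.backward.pop(nxt, None)
        prv = self.backward.pop(s, None)
        if prv is not None:
            self.forward.pop(prv, None)
        if self.prev == s:
            self.prev = None
        return s

    def memorize(self, item):
        # One-shot learning: one exposure creates the links.
        self.activation = [a * self.decay for a in self.activation]
        s = self._recycle()
        self.activation[s] = 1.0
        self.state_to_input[s] = item
        if self.prev is not None:
            self.forward[self.prev] = s
            self.backward[s] = self.prev
        self.prev = s

    def replay(self, cue):
        # Start from the oldest state linked to the cue (an assumption),
        # then follow the forward links.
        candidates = [s for s, x in self.state_to_input.items() if x == cue]
        if not candidates:
            return []
        s = min(candidates, key=lambda i: self.activation[i])
        out, seen = [], set()
        while s is not None and s not in seen:
            out.append(self.state_to_input[s])
            seen.add(s)
            s = self.forward.get(s)
        return out
```

With eight states, memorizing A, B, A, C and replaying from A returns the whole sequence, because the two occurrences of A are held by different latent states.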

The idea here is similar to the competitive queuing model (see Bullock, 2003; Houghton, 1990; Burgess, 1999).

The figure below shows an input sequence (above) and remembered sequence (bottom):

Thursday, November 7, 2024

CBOW and Part-of-Speech Clustering

Word embeddings used (introduced) in Word2Vec are known to represent semantic clusters of words.  While attention has largely focused on their semantic aspect, the distributional hypothesis, on which word embeddings are based, is 'syntactic' in the sense that it is concerned only with the formal features (distribution) of words; the embeddings should therefore represent parts of speech (POS) as well.  So I ran the experiment described below (similar experiments may well have been done elsewhere).

  1. Made a small context-free grammar (CFG; see below) and generated sample sentences.
  2. Created embeddings with continuous bag-of-words (CBOW) learning.
  3. Clustered the embeddings and compared the result with the 'ground truth' (the word-POS correspondence in the grammar).

Set-up

Number of words (voc. size): 20
Number of POS: 7
Number of sentences: 500 (100 was too small)
CBOW learning and the embedding: a simple perceptron-like predictor was set up to predict a word from its two adjacent words.  The weights used to predict a word from the hidden layer (number of cells: 10) were used as the embedding.
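
As a rough illustration of the training data for such a predictor, the sketch below builds (context, target) pairs from tokenized sentences.  It assumes that "the two adjacent words" means the left and right neighbors and that words at sentence edges are skipped; the function name `cbow_pairs` is hypothetical and not taken from the experiment code.

```python
def cbow_pairs(sentences):
    """Return ((left, right), target) pairs for every word that has two neighbors.

    Assumption: sentence-initial and sentence-final words are not used as targets.
    """
    pairs = []
    for words in sentences:
        for i in range(1, len(words) - 1):
            pairs.append(((words[i - 1], words[i + 1]), words[i]))
    return pairs


# Example with words from the grammar in this post.
pairs = cbow_pairs([["le", "n1", "av1", "iv2"]])
```

Each pair would then be one-hot encoded and fed to the predictor, with the two context words as input and the target word as the output to be predicted.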

Clustering

The figure shows an Isomap projection of the embeddings.  Words are clustered according to their parts of speech.

I also tried neural-network-based clustering methods.  As a sparse autoencoder did not work for this purpose, I tried a SOM-like method and got the following (number of cells: 14; the same training data as for the CBOW training: 500 sentences/2099 words; one epoch).

Adv   7 [0.   0.   0.   0.   0.   0.   0.13 0.   0.   0.   0.   0.   0.   0.  ]
PN   10 [0.   0.   0.   0.   0.   0.   0.   0.   0.06 0.13 0.   0.   0.   0.  ]
IV    6 [0.   0.   0.   0.   0.   0.11 0.   0.   0.   0.   0.   0.   0.   0.  ]
Adj   2 [0.   0.09 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.  ]
Det   1 [0.18 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.  ]
TV    4 [0.   0.   0.   0.13 0.   0.   0.   0.   0.   0.   0.   0.   0.   0.  ]
N    13 [0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.06 0.12 0.  ]

It shows the correspondence between the cells and parts of speech (the second column is the index of the most correlated cell).
Though the clustering does not always work (it depends on the initial weights), it confirms that CBOW embeddings generally represent parts of speech in this set-up.

SOM-like learning code:

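import numpy as np  # at module level

# Find the best-matching cell (smallest squared distance to the feature).
min_index = np.argmin(((self.weights - feature) ** 2).sum(axis=1))
# Pull the neighbors within distance 2 toward the feature; the pull weakens with distance.
for i in range(-2, 3):
    if i == 0:
        continue  # the best-matching cell itself is not updated
    try:
        self.weights[min_index + i] += self.alpha * (feature - self.weights[min_index + i]) / abs(i)
    except IndexError:
        pass  # neighbor beyond the last cell; note that negative indices wrap around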

Grammar

S : NP + VP
NP : Det + N1
N1 : Adj + N
N1 : N
NP : PN
VP : Adv + VP
VP : IV
VP : TV + NP
Det - le, un
N - n1, n2, n3
PN - Pn1, Pn2, Pn3
Adj - aj1, aj2, aj3
Adv - av1, av2, av3
IV - iv1, iv2, iv3
TV - tv1, tv2, tv3
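
Sample sentences can be generated from this grammar with a short recursive expander, sketched below.  The rule and lexicon tables transcribe the grammar above; choosing uniformly among the alternative rules and the function name `generate` are assumptions, as the post does not say how rules were sampled.

```python
import random

# Phrase-structure rules and lexicon transcribed from the grammar above.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N1"], ["PN"]],
    "N1": [["Adj", "N"], ["N"]],
    "VP": [["Adv", "VP"], ["IV"], ["TV", "NP"]],
}
LEXICON = {
    "Det": ["le", "un"],
    "N": ["n1", "n2", "n3"],
    "PN": ["Pn1", "Pn2", "Pn3"],
    "Adj": ["aj1", "aj2", "aj3"],
    "Adv": ["av1", "av2", "av3"],
    "IV": ["iv1", "iv2", "iv3"],
    "TV": ["tv1", "tv2", "tv3"],
}


def generate(symbol, rng):
    """Expand a symbol into a list of (word, POS) pairs.

    Assumption: alternatives are chosen uniformly at random.
    """
    if symbol in LEXICON:
        return [(rng.choice(LEXICON[symbol]), symbol)]
    words = []
    for child in rng.choice(RULES[symbol]):
        words.extend(generate(child, rng))
    return words


rng = random.Random(0)
sentences = [generate("S", rng) for _ in range(500)]
```

Keeping the POS tag next to each word gives the 'ground truth' needed to evaluate the clustering.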

Saturday, March 9, 2024

Implementation of a Simple Visuomotor Environment and Brain-inspired Visuomotor Agent

Japanese version

Abstract: As many animals, including humans, make behavioral decisions based on visual information, a cognitive model of the visuomotor system would serve as a basis in intelligence research, including AGI. This article reports on the implementation of a relatively simple system: a virtual environment that displays shapes and cursors and an agent that performs gaze shift and cursor control based on the information from the environment. The visual system is modeled after that of humans with the central and peripheral fields of view, and the agent architecture is based on the structure of the brain.

1. Introduction

This article reports on the implementation of a simple environment and agent architecture for decision making based on visual information, which would serve as part of more generic cognitive models/architectures.  It also addresses human ‘active vision,’ where visual information is collected and integrated through gaze shift.

This work adopts a strategy of starting with a relatively simple model.  The implemented two-dimensional visual environment displays simple figures and cursors.  Figures and a cursor can be moved (dragged) by instructions from the agent.

As for the agent, the following were modeled and implemented, imitating the human visual system.

1) distinction between central and peripheral vision,

2) gaze shift based on salience in the peripheral vision [1],

3) unsupervised learning of shapes captured in the central vision,

4) reinforcement learning of cursor movement and dragging,

5) “surprise” due to changes in the environment caused by actions and habituation due to learning,

6) reward based on “surprise”.

Here 3), 4), and 5) involve learning and are provided with learning models.  The agent's actions consist of gaze shift and cursor movement + dragging.  Gaze shift in the model does not learn and is driven by salience.

2. Environment 

The environment has a screen divided into an N × N grid (Figure 1).  The center of the screen is a "stage" consisting of an M × M grid (M<N).  The edges of the stage are marked with border lines.  M different shapes are displayed on the stage.  The visual information presented to the agent is a color bitmap of the field of view (M × M grid) centered on the gaze.  The gaze is located at the center of a grid cell on the stage and is shifted when the environment is given a gaze shift signal (a vector whose components range from −M to +M).  It does not move off the stage.  Two cursors of different colors are displayed on the stage.  When the environment is given a cursor movement signal (a vector whose components range from −1 to +1), one of the cursors may move, though it does not move off the stage.  If the cursor is superimposed on a figure and the environment is given a non-zero cursor movement signal together with a grab signal, the figure is moved in the same direction and by the same distance as the cursor (i.e., dragged).  Figure 1 shows an example display.

Figure 1: Environment
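
The movement rules of the environment can be sketched as follows.  This is not the PyGame implementation: the class name `StageEnv` is hypothetical, only stage coordinates and the agent-controlled cursor are modeled, and keeping positions on the stage by clamping is an assumption (the text only says that the gaze and cursor do not move off the stage).

```python
class StageEnv:
    """Stage-only toy model: a gaze position, one cursor, and draggable figures."""

    def __init__(self, stage_size, figures=None):
        self.m = stage_size
        self.gaze = (stage_size // 2, stage_size // 2)
        self.cursor = (0, 0)
        self.figures = dict(figures or {})  # figure name -> (x, y) grid cell

    def _clamp(self, v):
        # Keep a coordinate on the stage (assumed behavior at the border).
        return max(0, min(self.m - 1, v))

    def shift_gaze(self, dx, dy):
        self.gaze = (self._clamp(self.gaze[0] + dx), self._clamp(self.gaze[1] + dy))

    def move_cursor(self, dx, dy, grab=False):
        # Cursor steps are limited to -1, 0, or +1 per axis.
        dx, dy = max(-1, min(1, dx)), max(-1, min(1, dy))
        new = (self._clamp(self.cursor[0] + dx), self._clamp(self.cursor[1] + dy))
        if grab and new != self.cursor:
            # Drag every figure under the cursor by the same displacement.
            for name, pos in self.figures.items():
                if pos == self.cursor:
                    self.figures[name] = new
        self.cursor = new
```

For example, with a figure under the cursor, `move_cursor(1, 0, grab=True)` moves both the cursor and the figure one cell, while the same call without `grab` moves the cursor only.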

3. Agent

The agent receives the input of a color bitmap of the field of view from the environment, and outputs gaze shift, cursor movement, and grab signals to the environment.  The agent has an architecture consisting of the following modules (Fig. 2; module names in the figure are given in parentheses): Salience Calculation Module (Periphery2Saliency), Gaze Shift Module (PriorityMap2Gaze), Central Visual Field Change Prediction Module (FoveaDiffPredictor), Surprise-Reward Calculation Module (SurpriseReward), Object Recognition Module (ObjectRecognizer), and Cursor Control Module (CursorActor).  See the figure for connections between modules.

Figure 2: Architecture

The Cursor Control Module uses reinforcement learning rewarded by changes in the external world caused by its own action (contingency detection) [2].

As for correspondence with the brain, the Salience Calculation Module corresponds to the superior colliculus, the Gaze Shift Module corresponds to the neural circuit from the superior colliculus to the eye, and the Object Recognition Module corresponds to the ‘what path’ of the visual cortex, which performs object identification.  As the Central Visual Field Change Prediction Module and the Surprise-Reward Calculation Module use the output of the Object Recognition Module, they could correspond to a visual association cortex such as the frontal eye field [3].  The Cursor Control Module would correspond to the motor cortex.

3.1 Salience Calculation Module (Periphery2Saliency) 

After reducing the resolution of the input bitmap, it creates a monochrome brightness map corresponding to the peripheral visual field, and adds an edge detection map and a time differential map to it.  Though human peripheral vision is said to use a log-polar coordinate system, ordinary Cartesian coordinates were used for engineering interpretability and compatibility with off-the-shelf tools such as regular CNNs.

3.2 Gaze Shift Module (PriorityMap2Gaze) 

A gaze shift signal is calculated to move the gaze to the part with maximum saliency based on the saliency (priority) map from the saliency calculation module.

3.3 Object Recognition Module (ObjectRecognizer) 

It feeds the bitmap of the central visual field to an unsupervised learner, and outputs the latent variables of the learner.

3.4 Central Visual Field Change Prediction Module (FoveaDiffPredictor) 

‘Central visual field change’ refers to the scalar (summed) time difference of the Object Recognition Module output. The module predicts it from the outputs of the Object Recognition Module and Cursor Control Module at the previous time.  If a gaze shift has occurred at the previous time, no prediction is made and the output is set to zero (saccade suppression). Prediction is learned, and its output is the prediction error.

3.5 Surprise-Reward Calculation Module (SurpriseReward) 

It outputs {scalar (summed) value of the time difference of the Object Recognition Module output × prediction error (the output of the Central Visual Field Change Prediction Module)}.  The output becomes zero if the prediction error is zero or if there is no temporal change in the output of the Object Recognition Module.
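
The computation in 3.4 and 3.5 can be written down as a small function.  This is a sketch under assumptions: the function and parameter names are hypothetical, the 'summed time difference' is taken to be the sum of absolute differences of the latent vector, and the prediction error is taken to be the absolute difference between the actual and predicted change.

```python
def surprise_reward(prev_latent, latent, predicted_change, gaze_shifted):
    """Return (change in the central visual field) x (prediction error).

    Assumptions: the change is the summed absolute difference of the
    Object Recognition Module output; the error is |actual - predicted|.
    """
    change = sum(abs(a - b) for a, b in zip(latent, prev_latent))
    if gaze_shifted:
        # Saccade suppression: no prediction is made and the error is zero.
        prediction_error = 0.0
    else:
        prediction_error = abs(change - predicted_change)
    return change * prediction_error
```

The product is zero when the change was predicted perfectly (habituation), when nothing changed, or right after a gaze shift, so only unpredicted changes are rewarded.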

3.6 Cursor Control Module (CursorActor) 

It is a reinforcement learner that observes the output of the Object Recognition Module and outputs the cursor control (movement vector + grab) signal. The reward is the output of the Surprise-Reward Calculation Module.

4 Implementation and Test

The code is located here:

4.1 Environment 

The environment was implemented with Python and PyGame.  Card game symbols (pips) were used as figures.  The initial positions of figures and cursors are randomized for each episode (the initial position of the cursor controlled by the agent is set on a figure).

4.2 Agent 

The agent was implemented with Python and BriCA (Brain-inspired Computing Architecture)[4], a computational platform for developing brain-inspired software. As BriCA supports modular architecture development, the reuse of the implementation in more complex architectures could be easier.  With the BriCA platform, architectural design is first specified in a spreadsheet and then converted into an architecture description language (BriCA language).  At runtime, the interpreter loads and executes the BriCA language description. BriCA modules exchange numerical vector signals in a token-passing manner.  PyTorch was used as a machine learning platform.

Salience Calculation Module (Periphery2Saliency)

It reduces the resolution of the input bitmap, calculates a monochrome brightness map corresponding to the peripheral visual field, and adds an edge detection map and a time differential map to the brightness map with preconfigured weights.
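
A minimal sketch of this weighted combination is shown below, using plain nested lists.  It is not the module's code: resolution reduction is omitted, the edge detector is a simple neighbor difference chosen for brevity, and the function names and default weights are assumptions.

```python
def brightness(rgb):
    """Convert rows of (r, g, b) pixels to a monochrome brightness map."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]


def edges(gray):
    """Simple edge map: absolute differences to the right and lower neighbors."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = gray[y][min(x + 1, w - 1)]
            down = gray[min(y + 1, h - 1)][x]
            out[y][x] = abs(right - gray[y][x]) + abs(down - gray[y][x])
    return out


def saliency(gray, prev_gray, w_intensity=1.0, w_edge=1.0, w_diff=1.0):
    """Weighted sum of the brightness, edge, and time differential maps."""
    e = edges(gray)
    return [
        [
            w_intensity * gray[y][x]
            + w_edge * e[y][x]
            + w_diff * abs(gray[y][x] - prev_gray[y][x])
            for x in range(len(gray[0]))
        ]
        for y in range(len(gray))
    ]
```

The three weights play the role of the preconfigured weights mentioned above; a bright pixel that has just appeared scores on both the brightness and the time differential terms.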

Gaze Shift Module (PriorityMap2Gaze)

It computes the ‘priority map’ by 1) adding random noise to the output of the saliency calculation module (salience map), and 2) adding the priority map at the previous time multiplied by the damping coefficient.  The gaze shift signal is calculated so that the gaze moves to the field of view corresponding to the part with the maximum value in the priority map.
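
The priority map update and the gaze shift can be sketched as below.  The sketch assumes uniform noise, a gaze-centered map (so the shift is the offset of the maximum from the map center), and hypothetical function names and default parameter values.

```python
import random


def update_priority(saliency, prev_priority, damping=0.5, noise=0.01, rng=random):
    """Priority = saliency + random noise + damping * previous priority.

    Assumption: the noise is uniform in [0, noise].
    """
    return [
        [s + rng.uniform(0.0, noise) + damping * p for s, p in zip(s_row, p_row)]
        for s_row, p_row in zip(saliency, prev_priority)
    ]


def gaze_shift(priority):
    """Offset (rows, cols) from the map center to the maximum-priority cell.

    Assumption: the map is centered on the current gaze position.
    """
    h, w = len(priority), len(priority[0])
    cells = [(y, x) for y in range(h) for x in range(w)]
    best = max(cells, key=lambda yx: priority[yx[0]][yx[1]])
    return (best[0] - h // 2, best[1] - w // 2)
```

The noise breaks ties between equally salient locations, and the damped carry-over of the previous map keeps a location attractive for a few steps after its saliency drops.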

Object recognition module (ObjectRecognizer) 

βVAE (from Princeton U.: code) was used after several kinds of autoencoders had been compared as unsupervised learners.  The choice was made with the expectation that the number of output dimensions would be relatively small and that it would provide interpretable (disentangled) latent variables.

Central Visual Field Change Prediction Module (FoveaDiffPredictor)

It predicts scalar changes in the central visual field from the output of the Object Recognition Module and Cursor Control Module at the previous time, and outputs the prediction error.  A three-layer perceptron was used as the predictor.

Surprise-Reward Calculation Module (SurpriseReward)

It outputs {the scalar value of the time difference of the Object Recognition Module output × prediction error (Central Visual Field Change Prediction Module output)}.

Cursor Control Module (CursorActor) 

It uses a cerebral cortex/basal ganglia loop model [5] (code), based on the hypothesis that the cerebral cortex predicts actions through learning, and the basal ganglia determine (through reinforcement learning) whether to perform the action.  The implemented basal ganglia model decides, through reinforcement learning, whether or not to perform an action (Go/NoGo) based on the given observation data and the type of action.  Meanwhile, the cortical model initially selects the type of action at random, and as the learning of the basal ganglia model progresses, it begins to predict and present the type of action performed from observational data.  The reinforcement learning algorithm used was DQN (Deep Q-Network).

4.3 Experiments (Tests)

Experiments (tests) and learning were performed module by module, starting from the area closest to the visual input.

Salience Calculation Module and Gaze Shift Module

These modules do not depend on other modules and do not perform learning.  They were qualitatively tested with their own environment (Vision1Env.py), where circles with multiple colors, intensities, and sizes were presented in the field of view.  Gaze shift was observed, and parameters (e.g., the intensity, edge, and time differential weights for saliency map calculation) were adjusted by the developer.

Object Recognition Module

All combinations of images that would appear in the central visual field were fed to the βVAE (with the number of latent variables=10) to be trained (TrainFovea_VAE.py).  While original images were generally reconstructed after about 10,000 episodes, the latent (disentangled) variables corresponding to the elements in the images were not found.

Central Visual Field Change Prediction Module

The three-layer perceptron was trained to predict changes in the central visual field from the outputs of the Object Recognition Module and of the Cursor Control Module, except immediately after saccades.  The loss became zero around episode 150.

Surprise-Reward Calculation Module

The multiplication was performed correctly (no learning is performed in this module).

Cursor Control Module

It was trained to output the cursor control (movement vector + grab) signal by observing the output of the Object Recognition Module, rewarded by the output of the Surprise-Reward Calculation Module (the Central Visual Field Change Prediction Module had not been trained).
The amount of reward acquired was tripled compared to random trials (average reward 0.12) (Fig. 3).

Figure 3: Cursor Control Module learning results
Horizontal axis: number of episodes
Vertical axis: average reward (average of 5 trials)

5. Conclusion 

The article reported on the implementation of an environment that displays shapes and cursors on the screen, and an agent that moves the eye and controls the cursor based on visual information.

Tasks that utilize gaze shift (active vision tasks) have been developed elsewhere.  DeepMind has developed PsychLab with tasks using gaze shift [6]*1.  The image recognition learning task using gaze shift is part of what is called object-centric learning (see review).  Working memory tasks such as oculomotor delayed response tasks*2 use gaze shift.  Papers [7] and [8] propose biologically plausible models of active vision.

In this article, learning was performed using “surprise” or prediction errors as reward, which is a common practice in unsupervised learning.  Learning about changes in the environment due to one's own actions (contingencies) through prediction errors or “surprises” appears as a theme in psychology [2].  There are various studies related to surprise, exploratory behavior, and curiosity [9][10][11] (chapter 3).

Papers [12] and [13] provide neural models similar to that in this article, though more specific ([12] does not model central/peripheral vision as it is concerned with the rat).

When controlling gaze shift using reinforcement learning, it would be necessary to explicitly model the frontal eye field as the corresponding region of the brain (the model would have a mechanism similar to the Cursor Control Module).  A representation of the scene consisting of kinds of objects and their locations (presumably integrated around the hippocampus) would also be required in tasks using gaze shift.

A model of areas around the hippocampus is important for the recognition of scene sequences, as the hippocampus is also said to be responsible for episodic memory.  A model of the prefrontal cortex would be required for working memory tasks, as the region is said to be involved in them.

Finally, the environment was implemented with a view to modeling the visual understanding of other people's actions and language acquisition presupposing such understanding.  Thus, what additional structures will be needed for those models remains to be studied.

*1: In this hackathon, a subset of tasks from PsychLab was used.
*2: In this hackathon, a match-to-sample task that requires working memory and gaze shift was used.

References 

  [1]  Veale, et al.: How is visual salience computed in the brain? Insights from behaviour, neurobiology and modelling, Phil. Trans. R. Soc. B, 372(1714) (2017). https://doi.org/10.1098/rstb.2016.0113

  [2]  Hiraki, K.: Detecting contingency: A key to understanding development of self and social cognition, Japanese Psychological Research, 48(3) (2006). https://doi.org/10.1111/j.1468-5884.2006.00319.x

  [3]  Ferrera, V. and Barborica, A.: Internally Generated Error Signals in Monkey Frontal Eye Field during an Inferred Motion Task, Journal of Neuroscience, 30(35) (2010). https://doi.org/10.1523/JNEUROSCI.2977-10.2010

  [4]  Takahashi, K., et al.: A Generic Software Platform for Brain-inspired Cognitive Computing, Procedia Computer Science, 71 (2015). https://doi.org/10.1016/j.procs.2015.12.185

  [5]  Arakawa, N.: Implementation of a Model of the Cortex Basal Ganglia Loop, ArXiv (2024). https://doi.org/10.48550/arXiv.2402.13275

  [6]  Leibo, J., et al.: Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents, ArXiv (2018). https://doi.org/10.48550/arXiv.1801.08116

  [7]  Hoang, K., et al.: Active vision: on the relevance of a bio-inspired approach for object detection, Bioinspiration & Biomimetics, 15(2) (2020). https://doi.org/10.1088/1748-3190/ab504c

  [8]  McBride, S., Huelse, M., and Lee, M.: Identifying the Computational Requirements of an Integrated Top-Down-Bottom-Up Model for Overt Visual Attention within an Active Vision System, PLoS ONE, 8(2) (2013). https://doi.org/10.1371/journal.pone.0054585

  [9]  Oudeyer, P.-Y., Kaplan, F., and Hafner, V.: Intrinsic Motivation Systems for Autonomous Mental Development, IEEE Transactions on Evolutionary Computation, 11(2) (2007). https://doi.org/10.1109/TEVC.2006.890271

  [10]  Schmidhuber, J.: Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010), IEEE Transactions on Autonomous Mental Development, 2(3) (2010). https://doi.org/10.1109/tamd.2010.2056368

  [11]  Cangelosi, A., et al.: Developmental Robotics: From Babies to Robots, MIT Press (2015). https://doi.org/10.7551/mitpress/9320.001.0001

  [12]  Fiore, V., et al.: Instrumental conditioning driven by neutral stimuli: A model tested with a simulated robotic rat, in Proceedings of the Eighth International Conference on Epigenetic Robotics (2008).

  [13]  Santucci, V.G., et al.: Biological Cumulative Learning through Intrinsic Motivations: A Simulated Robotic Study on the Development of Visually-Guided Reaching, in Proceedings of the Tenth International Conference on Epigenetic Robotics (2010).

Tuesday, October 3, 2023

AutoEncoder-based Predictor (implementation)

I have been 'playing around' with autoencoder implementations to realize 'a predictor,' as the principal function of the neocortex is supposed to be prediction.  I tried a simple autoencoder and a sparse autoencoder from a cerenaut repository and a β-VAE implementation from a project repository of Princeton University (see the explanatory article).  I chose the β-VAE, for I'll use it to model the association cortex, where the use of CNN may not be appropriate (the β-VAE does not use CNN but only Linear layers).   (And the simple one may not be potent enough.)

I constructed a predictor with the encoder, decoder, and autoencoder factory from the repository, with a single modification in the decoder setting.  Namely, the predictor differs from the autoencoder only in the decoder output setting: while the autoencoder predicts the encoder input, the predictor predicts another input.

The implementation is found here: https://github.com/rondelion/AEPredictor

A test result with MNIST rotation (to predict rotated images) is shown below after 100 epochs of training:


Sunday, July 9, 2023

Basic Salience - Saccade model

I implemented a simple salience-saccade model.  Visit the repository for details.  The model can be used for any (active) vision-based agent building.

In 2021, I wrote about Visual Task and Architectures.  The current implementation is about the where path, saliency map, and active vision (gaze control) in the post.  As for the what path, I did a rudimentary implementation in 2021.  I implemented a cortico-thalamo-BG control algorithm in 2022.  I also worked on the match-to-sample task of a non-visual type this year (previous post).

While I might go for experiments on minimal visual word acquisition, I should add the what path (object recognition) to the current model in any case.

Monday, April 24, 2023

Solving a Delayed Match-to-Sample Task with Sequential Memory

Introduction

This report presents a solution to a delayed Match-to-Sample task using episode sequences, together with its implementation.  (See my post for the importance of the M2S task in AGI research.)

A delayed Match-to-Sample task is a task to determine whether a presented (target) pattern is the same as another one (sample) presented previously in the session.  In the case of a graphic-based task, either the shape or the color of the presented graphic can be used as the matching attribute.  In this report, a cue (task switch) is presented before sample presentation to specify the matching attribute.  Both the cue and the matching pattern are low-dimensional binary vectors for the sake of simplicity.

Working memory is required to solve a DM2S task.  The agent needs to remember the cue (task switch), select a part of a pattern presented as the attribute of the sample according to the cue, remember the part, and compare it with the attribute of the target pattern presented later.  Due to the need for working memory, it is assumed that simple reinforcement learning cannot solve the problem.

In this report, the agent memorizes the sequences appearing in all task episodes (for a long term) and solves the task by finding a past sequence that would lead to success in the current episode (memory for a short term).  Implementation has shown that in the simplest setting, the agent can solve the task after experiencing several hundred episodes in most cases.

The Method

Sequence Memory

The agent memorizes the entire input-output sequences of episodes experienced.  The memory has a tree structure with the root at the end of the episode.  The tree branches according to inputs-outputs, and its nodes have information on the number of successes and the number of experiences.
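
The tree described here can be sketched as a trie built from reversed episodes, so that the root sits at the end of the episode.  The class and attribute names are hypothetical, and representing each step as a hashable (observation, output) pair is an assumption.

```python
class Node:
    """Tree node holding success and experience counts."""

    def __init__(self):
        self.children = {}    # (observation, output) step -> Node
        self.successes = 0
        self.experiences = 0


class SequenceTree:
    """Episode memory as a tree rooted at the end of the episode."""

    def __init__(self):
        self.root = Node()

    def register(self, episode, success):
        """Store one episode, given as a list of steps in temporal order."""
        node = self.root
        for step in reversed(episode):      # walk back from the last step
            node = node.children.setdefault(step, Node())
            node.experiences += 1
            node.successes += int(success)


tree = SequenceTree()
tree.register([("o1", "a1"), ("o2", "a2")], success=True)
tree.register([("o3", "a1"), ("o2", "a2")], success=False)
```

In this example the two episodes share their final step, so they share the node below the root and branch only at the earlier step.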

Using the Sequence Memory

The agent remembers the input-output sequence in each episode and searches the ‘long-term’ sequence memory for sequences that match the current sequence and lead to success.  The sequence memory is indexed with partial observation sequences as keys to allow the longest match.  Among the sub-sequences matched by the index, the one with the highest value (success rate × number of successes) at the beginning is used (the number of successes is included to eliminate the ones that succeeded by a fluke), and the action is decided by following the remainder of the stored sequence after the matched part.  The sequence memory for each episode corresponds to the working memory, and the ‘long-term’ sequence memory corresponds to the policy in reinforcement learning.
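
The value used to rank matched sub-sequences can be sketched as follows.  The function names and the tuple layout of the candidates are assumptions for illustration; only the formula (success rate × number of successes) comes from the text.

```python
def value(successes, experiences):
    """Success rate multiplied by the number of successes."""
    if experiences == 0:
        return 0.0
    return (successes / experiences) * successes


def best_candidate(candidates):
    """Pick the candidate with the highest value.

    candidates: list of (continuation, successes, experiences) tuples.
    Returns None when there is no candidate.
    """
    return max(candidates, key=lambda c: value(c[1], c[2]), default=None)


# A sequence that succeeded once out of one trial has a perfect success
# rate, but a sequence with 8 successes out of 10 is preferred.
choice = best_candidate([("fluke", 1, 1), ("solid", 8, 10)])
```

Multiplying by the number of successes is what penalizes the single lucky success: its value is 1.0, against 6.4 for the sequence that succeeded 8 times out of 10.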

Architecture


Fig.1 Architecture

The agent consists of the Gate, Episodic Memory, and Action Chooser.

Gate

Attention is paid to a part of the observation and gated observation (non-attended parts are masked) is output.  It also outputs whether there has been a change in observation (obs. change).
Attention is determined by the salience of the environmental input and the attention signal from Episodic Memory; if there is a definite attention from Episodic Memory and the target of the attention is salient, the part is selected; otherwise, one of the salient parts is selected as the target of attention with equal probability.  If there is no salient part in the observation (if it is a 0 vector), no attention is given and the attention output is a 0 vector.
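
The attention rule of the Gate can be sketched as below.  The sketch assumes that the observation is split into parts, that a part is salient when it contains a non-zero element, and that the attention signal from Episodic Memory is an index (or None when there is no definite attention); the function names are hypothetical.

```python
import random


def choose_attention(parts, instructed=None, rng=random):
    """Return the index of the attended part, or None if nothing is salient."""
    salient = [i for i, part in enumerate(parts) if any(part)]
    if not salient:
        return None                 # 0-vector observation: no attention
    if instructed in salient:
        return instructed           # definite attention on a salient part
    return rng.choice(salient)      # otherwise pick a salient part uniformly


def gate(parts, attended):
    """Mask every part except the attended one."""
    return [list(p) if i == attended else [0] * len(p) for i, p in enumerate(parts)]
```

If the instructed part is not salient, the instruction is ignored and a salient part is chosen at random, as in the description above.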

Episodic Memory

It receives gated observation, attention, obs. change, and reward from the Gate, and outputs attention instruction to Gate and action instruction to Action Chooser.
At the end of each episode, Episodic Memory registers the input-output sequence of the episode in the sequence memory.
If a (sub-)sequence of the gated observation matches a success sequence in the memory, Episodic Memory determines outputs according to the rest of the sequence.  Episodic Memory receives information about the attentional and action choices made ('efference copy') from Gate and Action Chooser respectively, to be recorded in the sequence memory.
For two steps immediately after a change in the observation (obs. change), Episodic Memory chooses only ‘attentions.’  This is to allow the agent to check the situation before outputting to the external environment (it also narrows the search space).

Action Chooser

It receives an action instruction (probability vector) from Episodic Memory, performs action selection, and passes the results to the environment and Episodic Memory.

Implementation and Experimental Results

Environment/Task

Phases

The task has the following phases:
{task switch presentation, blank, sample presentation, blank, target presentation, blank}

Input/Output

The output from the environment (observation) is a binary sequence consisting of {task switch, attribute sequence, control switch}.
The number of dimensions of an attribute sequence is the number of attributes × the attribute dimension.  Each attribute is a one-hot vector having the attribute dimension.
A task switch is a one-hot vector with the attribute dimension that specifies the attribute to be extracted (for implementation convenience, attribute dimension > number of attributes).
The control switch is also a binary vector of the attribute dimension, with the first column being 1 in the sample presentation phase, the second column being 1 in the target presentation and response phase, and all columns being 0 otherwise.  The output of the blank phases is a 0 vector.
Reward values are either 0 (failure) or 1 (success).
There are three types of inputs (actions) from the agent: {0, 1, 2}.
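
The observation layout can be illustrated with the following sketch.  It assumes that each phase shows only its own parts (the task switch alone in the task switch phase, the attributes and control switch in the sample and target phases); the function names and phase labels are hypothetical, and the actual environment may differ.

```python
def one_hot(index, dim):
    return [1 if i == index else 0 for i in range(dim)]


def observation(phase, n_attr, dim, task=None, values=None):
    """Build {task switch, attribute sequence, control switch} as one binary list.

    Assumption: parts not relevant to the phase are zero.
    """
    task_switch = [0] * dim
    attributes = [0] * (n_attr * dim)
    control = [0] * dim
    if phase == "task_switch":
        task_switch = one_hot(task, dim)
    elif phase in ("sample", "target"):
        attributes = [bit for v in values for bit in one_hot(v, dim)]
        control[0 if phase == "sample" else 1] = 1
    return task_switch + attributes + control


# Two attributes of dimension 2: the vector has 2 + 2 * 2 + 2 = 8 elements.
sample_obs = observation("sample", n_attr=2, dim=2, values=[0, 1])
```

Any phase label other than the three handled above, such as a blank phase, yields the 0 vector.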

Success Conditions

The environment gives success only when the attribute specified in the task switch matches the sample and target and the input from the agent in the target presentation phase is 2, or when the attribute specified in the task switch does not match the sample and target and the input from the agent in the target presentation phase is 1.
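
The success condition amounts to the small function below.  The function name and the representation of the sample and target as lists of attribute values are assumptions; the action codes (2 for a match, 1 for a non-match) follow the text.

```python
def reward(task, sample, target, action):
    """Return 1 on success and 0 on failure.

    task: index of the attribute specified by the task switch.
    sample, target: attribute values of the two patterns.
    action: agent input in the target presentation phase (0, 1, or 2).
    """
    match = sample[task] == target[task]
    if match:
        return 1 if action == 2 else 0
    return 1 if action == 1 else 0
```

For instance, with sample `[1, 0]` and target `[1, 1]`, action 2 succeeds when the task switch points at the first attribute, and action 1 succeeds when it points at the second.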

Implementation

Python and OpenAI Gym are used.
The agent implementation used Python and BriCA (Brain-inspired Computing Architecture), a platform for building cerebral agents, in which information is passed between modules at each time step in defined connections.

Experimental Setup

Length (steps) of the phases

Task switch presentation: 2, Blank: 1, Sample presentation: 2, Target presentation and response: 3
Number of attributes and attribute dimension: both 2, or both 3

Perplexity of inputs and actions (size of the search space)

The number of different inputs and outputs that can appear is the number shown below, and all of these must be experienced in order to gain full knowledge. Since the environment is stochastic, there is no guarantee that a complete experience can be obtained in a finite number of trials.
When number of attributes: 2 and attribute dimension: 2: 2 × (4 × 3) × (4 × 5) = 480
When number of attributes: 3 and attribute dimension: 3: 3 × (8 × 4) × (8 × 6) = 4,608
Formula: task switches × (attribute values × attention destinations) × (attribute values × (attention destinations + action types))
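
The size of the search space can be checked by evaluating the formula directly.  The function and parameter names below are hypothetical; the factor values passed in are the ones listed for the two settings, with two action types as implied by the factors (attention destinations + action types) 5 = 3 + 2 and 6 = 4 + 2.

```python
def search_space(task_switches, attribute_values, attention_destinations, action_types):
    """task switches x (values x destinations) x (values x (destinations + actions))."""
    sample_side = attribute_values * attention_destinations
    target_side = attribute_values * (attention_destinations + action_types)
    return task_switches * sample_side * target_side


small = search_space(task_switches=2, attribute_values=4,
                     attention_destinations=3, action_types=2)
large = search_space(task_switches=3, attribute_values=8,
                     attention_destinations=4, action_types=2)
```

Printing `small` and `large` gives the number of input-output combinations for the two settings, which is a lower bound on the experience needed for full knowledge of the task.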

Results


Fig. 2 Experimental results
Vertical axis: average reward, horizontal axis: episodes x 100
Blue line: number of attributes: 2, attribute dimension: 2;
Red line: number of attributes: 3, attribute dimension: 3

The learning curves differ according to the number of attributes and attribute dimension settings. In the setting with a minimum complexity (blue line – number of attributes: 2, attribute dimension: 2), the task is solved in a few hundred trials in most cases.

Comparison with Reinforcement Learning

It was examined whether reinforcement learning agents (vpg, a2c, and ppo from TensorForce) learn the task. The results are shown in the graph below; proper learning does not appear to occur.

Fig. 3 Experimental results of reinforcement learning
Vertical axis: average reward; horizontal axis: episodes x 100

Discussions

Comparison with Reinforcement Learning

The proposed system can generally solve the task once it has enough experience to tell that matched sequences are not flukes. Given the perplexity of the task (see above), it is assumed that the problem is solved with close to the minimum number of trials.
While a reinforcement learner may also maintain a graph of the ‘Markov’ series leading to the reward (e.g., Bellman backup tree), the sub-series are not normally memorized and used for matching.  In this implementation, the number of successes is also stored to avoid ‘fluke’ sequences, whereas only probability and reward evaluation values are stored in normal RL.
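The idea of keeping success counts to filter out fluke sequences can be illustrated as follows. This is a rough sketch under stated assumptions, not the code of this implementation: the class `SequenceStore`, its methods, the threshold `min_successes`, and the tuple-based sequence format are all hypothetical.

```python
# Hypothetical sketch: trust a rewarded sequence only after repeated success.
class SequenceStore:
    """Stores rewarded sequences together with their success counts."""

    def __init__(self, min_successes=2):
        self.min_successes = min_successes  # assumed threshold, not from the post
        self.successes = {}                 # sequence (as tuple) -> success count

    def record_success(self, sequence):
        key = tuple(sequence)
        self.successes[key] = self.successes.get(key, 0) + 1

    def is_reliable(self, sequence):
        """A sequence counts as non-fluke once it has succeeded often enough."""
        return self.successes.get(tuple(sequence), 0) >= self.min_successes


store = SequenceStore(min_successes=2)
seq = [("attend", 0), ("observe", 1), ("act", 2)]
store.record_success(seq)
print(store.is_reliable(seq))  # False: one success could be a fluke
store.record_success(seq)
print(store.is_reliable(seq))  # True: the success has been repeated
```

A standard RL agent would fold such outcomes into value estimates, whereas here the sequence itself is kept and matched.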

Related Research

[McCallum 1995] uses case trees for problem solving and refers to further works in the context of reinforcement learning.
My post in 2022 proposed a “model of fluid intelligence based on examining experienced sequences,” a mechanism that allows agents to discover the conditions of the sequences required by the task.  In the real world, it is not possible to know in advance how far back from the reward the agent should remember, so the proposed strategy could be applied to start with a sequence near the reward and extend the policy sequence if it does not work.
I also reported in another post in 2022 on an attempt to solve a delayed Match-to-Sample task with a brain-inspired working memory architecture, which did not store sequences and learned to select attention and action independently; it could not identify overall successful sequences.

Biological Plausibility

While the current implementation is not biologically plausible in that it does not use artificial neural networks (or other neural-mimicking mechanisms), its design was inspired by the information processing mechanisms of the brain.
Gate incorporates the mechanisms of attention and salience maps in the mammalian visual system.  If attention is thought of as eye movements, it can also be understood as the mechanism of active vision.
In the brain, episodic memory is believed to be held in the hippocampus.  If so, it is conceivable that episodic memory can be recalled from partial input-output sequences and used for action selection (see [Sasaki 2018][Dragoi 2011] for the discussion of hippocampal use of sequential memory).
In the current implementation, a single module (Episodic Memory) was used to manage the control of both attention and action; it might be better to implement modules separately because they differ in terms of timing (Gate runs before Episodic Memory while Action Chooser runs after).

Information Compression

In this implementation, the environmental input is a low-dimensional vector; even so, the number of cases becomes quite large if all of the input-output pattern sequences are to be searched (see above on perplexity).  When dealing with real environments, it would be necessary to compress information with deep learning or other methods to reduce the search space.  The pattern matching method implemented in this study is based on perfect (strict) matching; with analog data from real environments, the use of a more flexible matching method would be a must.  For this purpose, it would also be desirable to use artificial neural networks.
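The difference between strict matching and a more flexible alternative can be sketched as below. Both functions are hypothetical illustrations; the per-element tolerance is just one simple choice of flexible matching and is not what this study implemented.

```python
# Hypothetical sketch contrasting strict and tolerance-based matching.
def strict_match(a, b):
    """Perfect matching: every element must be identical."""
    return list(a) == list(b)


def tolerant_match(a, b, tolerance=0.1):
    """Flexible matching: elements may differ by up to `tolerance` (assumed value)."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


stored = [1.0, 0.0]
noisy = [0.95, 0.02]  # an analog observation close to the stored pattern
print(strict_match(stored, noisy))    # False
print(tolerant_match(stored, noisy))  # True
```

A learned similarity measure from an artificial neural network could take the place of the fixed tolerance.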
In the current implementation, Episodic Memory stores the masked environmental input (gated observation) as it is; if recognition of the attended attribute and the choice of attention are used for action, the attribute itself need not be remembered, which would reduce the perplexity.

Future Directions

Future directions may include: validation with other intelligence test tasks (e.g., analogy tasks), search for more biologically plausible architectures, tasks using images (see the information compression section above), and search for "causal inference" capabilities such as those performed by human infants.

References

[McCallum 1995] McCallum, R.A., Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State, Proceedings of the Twelfth International Conference on Machine
[Sasaki 2018] Takuya Sasaki, et al.: Dentate network activity is necessary for spatial working memory by supporting CA3 sharp-wave ripple generation and prospective firing of CA3 neurons, Nature Neuroscience vol. 21 (2018) https://doi.org/10.1038/s41593-017-0061-5
[Dragoi 2011] George Dragoi and Susumu Tonegawa: Preplay of future place cell sequences by hippocampal cellular assemblies, Nature 469 (7330) (2011)