Monday, April 24, 2023

Solving a Delayed Match-to-Sample Task with Sequential Memory

Introduction

This report presents a solution and implementation of a delayed Match-to-Sample Task using episode sequences.  (See my post for the importance of the M2S task in AGI research.)

A delayed Match-to-Sample task is a task to determine whether a presented (target) pattern is the same as another one (sample) presented previously in the session.  In the case of a graphic-based task, either the shape or the color of the presented graphic can be used as the matching attribute.  In this report, a cue (task switch) is presented before sample presentation to specify the matching attribute.  Both the cue and the matching pattern are low-dimensional binary vectors for the sake of simplicity.

Working memory is required to solve a DM2S task.  The agent needs to remember the cue (task switch), select a part of a pattern presented as the attribute of the sample according to the cue, remember the part, and compare it with the attribute of the target pattern presented later.  Due to the need for working memory, it is assumed that simple reinforcement learning cannot solve the problem.

In this report, the agent memorizes the sequences appearing in all task episodes (for a long term) and solves the task by finding a past sequence that would lead to success in the current episode (memory for a short term).  Implementation has shown that in the simplest setting, the agent can solve the task after experiencing several hundred episodes in most cases.

The Method

Sequence Memory

The agent memorizes the entire input-output sequences of episodes experienced.  The memory has a tree structure with the root at the end of the episode.  The tree branches according to inputs-outputs, and its nodes have information on the number of successes and the number of experiences.

Using the Sequence Memory

The agent remembers the input-output sequence in each episode and searches sequences in the ‘long-term’ sequence memory that matches the current sequence and leads to success.  The sequence memory is indexed with the partial observation sequences as a key to allow the longest match.  Among the sub-sequences matched by the index, the one with the highest value (success rate x number of successes) at the beginning is used (the reason for using the number of successes is to eliminate the ones that succeeded due to a fluke), and the action is decided by following the rest of the sub-sequences at the end of the sub-sequence.  The sequence memory for each episode corresponds to the working memory, and its ‘long-term’ sequence memory corresponds to the policy in reinforcement learning.

Architecture


Fig.1 Architecture

The agent consists of the Gate, Episodic Memory, and Action Chooser.

Gate

Attention is paid to a part of the observation and gated observation (non-attended parts are masked) is output.  It also outputs whether there has been a change in observation (obs. change).
Attention is determined by the salience of the environmental input and the attention signal from Episodic Memory; if there is a definite attention from Episodic Memory and the target of the attention is salient, the part is selected; otherwise, one of the salient parts is selected as the target of attention with equal probability.  If there is no salient part in the observation (if it is a 0 vector), no attention is given and the attention output is a 0 vector.

Episodic Memory

It receives gated observation, attention, obs. change, and reward from the Gate, and outputs attention instruction to Gate and action instruction to Action Chooser.
At the end of each episode, Episodic Memory registers the input-output sequence of the episode in the sequence memory.
If a (sub-)sequence of the gated observation matches a success sequence in the memory, Episodic Memory determines outputs according to the rest of the sequence.  Episodic Memory receives information about attentional and action choices made ('efferent copy') from Gate and Action Chooser respectively, to be recorded in the sequence memory.
For two steps immediately after a change in the observation (obs. change), Episodic Memory chooses only ‘attentions.’  This is to allow the agent to check the situation before outputting to the external environment (it also narrows the search space).

Action Chooser

It receives an action instruction (probability vector) from Episodic Memory, performs action selection, and passes the results to the environment and Episodic Memory.

Implementation and Experimental Results

Environment/Task

Phases

The task has the following phases:
{task switch presentation, blank, sample presentation, blank, target presentation, blank}

Input/Output

The output from the environment (observation) is a binary sequence consisting of {task switch, attribute sequence, control switch}.
The number of dimensions of an attribute sequence is the number of attributes x the attribute dimension.  Each attribute is a one-hot vector having the attribute dimension.
A task switch is a one-hot vector with the attribute dimension that specifies the attribute to be extracted (for implementation convenience, attribute dimension > number of attributes).
The number of dimensions of the control switch is also a binary vector of the attribute dimension, with the first column being 1 in the sample presentation phase, the second column being 1 in the target presentation and response phase, and with columns being 0 otherwise.  The output of the blank phases is a 0 vector.
Reward values are either 0 (failure) or 1 (success).
There are three types of inputs (actions) from the agent: {0, 1, 2}.

Success Conditions

The environment gives success only when the attribute specified in the task switch matches the sample and target and the input from the agent in the target presentation phase is 2, or when the attribute specified in the task switch does not match the sample and target and the input from the agent in the target presentation phase is 1.

Implementation

Python and Open AI Gym are used.
The agent implementation used Python and BriCA (Brain-inspired Computing Architecture), a platform for building cerebral agents, in which information is passed between modules at each time step in defined connections.

Experimental Setup

Length (steps) of the phases

Task switch presentation: 2, Blank: 1, Sample presentation: 2, Target presentation and response: 3
Number of attributes and attribute dimensions: 2 or 3, respectively

Perplexity of inputs and actions (size of the search space)

The number of different inputs and outputs that can appear is the number shown below, and all of these must be experienced in order to gain full knowledge. Since the environment is stochastic, there is no guarantee that a complete experience can be obtained in a finite number of trials.
When number of attributes: 2, attribute dimension: 2: 2 x (4 x 3) x (4 x 5) = 480
When number of attributes: 3 and attribute dimension: 3: 3 x (8 x 4) x (8 x 6) = 3,024
Solution: task switches x (attribute values x Attention destinations) x (attribute values x (attention destinations + action types))

Results


Fig. 2 Experimental results
Vertical axis: average reward, horizontal axis: episodes x 100
Blue line: number of attributes: 2, attribute dimension: 2;
Red line: number of attributes: 3, attribute dimension: 3

The learning curves differ according to the number of attributes and attribute dimension settings. In the setting with a minimum complexity (blue line – number of attributes: 2, attribute dimension: 2), the task is solved in a few hundred trials in most cases.

Comparison with Reinforcement Learning

It was examined whether the reinforcement learning agents (vpg, a2c, and ppo from TensorForce) learn the task.  The results are shown in the graph below, and it appears that proper learning does not occur.

Fig. 3 Experimental results of reinforcement learning
Vertical axis: average reward; horizontal axis: episodes x 100

Discussions

Comparison with Reinforcement Learning

The proposed system can generally solve the task if it has enough experience to tell matched sequences are not of ‘fluke.’  With the perplexity of the task (see above), it is assumed that the problem is solved with a minimum number of trials.
While a reinforcement learner may also maintain a graph of the ‘Markov’ series leading to the reward (e.g., Bellman backup tree), the sub-series are not normally memorized and used for matching.  In this implementation, the number of successes is also stored to avoid ‘fluke’ sequences, whereas only probability and reward evaluation values are stored in normal RL.

Related Research

[McCallum 1995] uses case trees for problem solving and refers to further works in the context of reinforcement learning.
My post in 2022 proposed a “model of fluid intelligence based on examining experienced sequences,” a mechanism that allows agents to discover the conditions of the sequences required by the task.  In the real world, it is not possible to know in advance how far back from the reward the agent should remember, so the proposed strategy could be applied to start with a sequence near the reward and extend the policy sequence if it does not work.
I also reported in another post in 2022 on an attempt to solve a delayed Match-to-Sample task with a brain-inspired working memory architecture, which did not store sequences and learned to select attention and action independently; it could not identify overall successful sequences.

Biological Plausibility

While the current implementation is not biologically plausible in that it does not use artificial neural networks (or other neural mimicking mechanisms), its design was inspired by the information processing mechanisms of the brain.
Gate incorporates the mechanisms of attention and salience maps in the mammalian visual system.  If attention is thought of as eye movements, it can also be understood as the mechanism of active vision.
In the brain, episodic memory is believed to be held in the hippocampus.  If so, it is conceivable that episodic memory can be recalled from partial input-output sequences and used for action selection (see [Sasaki 2018][Dragoi 2011] for the discussion of hippocampal use of sequential memory).
In the current implementation, a single module (Episodic Memory) was used to manage the control of both attention and action; it might be better to implement modules separately because they differ in terms of timing (Gate runs before Episodic Memory while Action Chooser runs after).

Information Compression

In this implementation, the environmental input is a low-dimensional vector; even so, the number of cases becomes quite large if all of the input-output pattern sequences are to be searched (see above on perplexity).  When dealing with real environments, it would be necessary to compress information with deep learning or other methods to reduce the search space.  The pattern matching method implemented in this study is based on perfect (strict) matching; with analog data from real environments, the use of a more flexible matching method would be a must.  For this purpose, it would also be desirable to use artificial neural networks.
In the current implementation, Episodic Memory stores the masked environmental input (gated observation) as it is; if recognition of the attended attribute and the choice of the attention is used for action, the attribute itself need not be remembered, and it will reduce the perplexity.

Future Directions

Future directions may include: validation with other intelligence test tasks (e.g., analogy tasks), search for more biologically plausible architectures, tasks using image (see the information compression section above), and search for "causal inference" capabilities such as those performed by human infants.

References

[McCallum 1995] McCallum, R.A., Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State, Proceedings of the Twelfth International Conference on Machine
[Sasaki 2018] Takuya Sasaki, et al.: Dentate network activity is necessary for spatial working memory by supporting CA3 sharp-wave ripple generation and prospective firing of CA3 neurons, Nature Neuroscience vol. 21 (2018) https://doi.org/10.1038/s41593-017-0061-5
[Dragoi 2011] George Dragoi and Susumu Tonegawa: Preplay of future place cell sequences by hippocampal cellular assemblies, Nature 469 (7330) (2011)