Saturday, February 12, 2022

Report on the Implementation of a Minimal Working Memory Agent

This article reports an attempt to implement working memory, a basic cognitive function essential for AGI, in a minimal form, taking biological plausibility into account.

The implementation is based on the following hypotheses:

  1. The choice of a part of the input to be retained in the working memory is driven by attention, which is an action carried out by the PFC (prefrontal cortex).

  2. The attending action is carried out by the PFC⇔Th loop controlled by the BG (the PFC-Th-BG loop), like any other action.

  3. The attending action is exclusive, as with any other action; only one attending action is executed in the dlPFC at a time.

  4. Working memory is retained for a certain period of time through the activity of neurons such as bistable neurons in the PFC, and the working memory (WM) must be "reselected" by the PFC-Th-BG loop for longer retention.

  5. The PFC uses the information stored in the working memory together with sensory information to predict and execute the actions needed to cope with the task.

The Task

The task used in the implementation is a Match-to-Sample task simplified as much as possible, whose input is a low-dimensional binary vector consisting of the following:

  • Choice of the attribute to be used for comparing the sample to the target (task switch)
    (a one-hot vector whose dimension is the number of attributes)

  • Attribute list (each attribute is represented as a two-dimensional binary vector)

  • Control vector (sample presentation phase [1, 0], target presentation phase [1, 1], others [0, 0])

The task consists of five phases: sample presentation, pause (delay), target presentation, pause (delay), and reward presentation.  The task switch is given only during the sample presentation phase.  In the target presentation phase, the correct answer is 2 ([0, 1]) if the attributes of the sample and the target specified by the task switch match, and 1 ([1, 0]) if they do not.  If the answer is correct, the task returns reward 1 after a delay; otherwise it returns reward 0.  (A sketch of such an environment in code is given after the assumptions below.)
In executing the task, the following assumptions were made on the agent side:

  • No prior knowledge of the composition of the input vector is given.

  • It does not memorize all the attributes.
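
As an illustration, such an environment could be sketched with the OpenAI Gym API (old-style four-value step return) roughly as follows. The observation layout, the action encoding (0: no output, 1: non-match, 2: match), the one-step-per-phase simplification, and all names are assumptions made for this sketch; the actual environment used in the experiments may differ.

```python
import numpy as np
import gym
from gym import spaces


class MatchToSampleEnv(gym.Env):
    """Sketch of the simplified Match-to-Sample task (details are assumed)."""

    # control vector per phase: 0 sample, 1 delay, 2 target, 3 delay, 4 reward
    CONTROL = {0: [1, 0], 1: [0, 0], 2: [1, 1], 3: [0, 0], 4: [0, 0]}

    def __init__(self, n_attributes=2):
        super().__init__()
        self.n_attributes = n_attributes
        # observation = task switch + attribute list + control vector
        self.observation_space = spaces.MultiBinary(n_attributes + 2 * n_attributes + 2)
        # 0: no output, 1: "non-match" ([1, 0]), 2: "match" ([0, 1])
        self.action_space = spaces.Discrete(3)

    def _random_attributes(self):
        # each attribute is one of the two-dimensional binary vectors [1, 0] / [0, 1]
        return np.eye(2, dtype=int)[np.random.randint(0, 2, self.n_attributes)]

    def _obs(self):
        switch = np.zeros(self.n_attributes, dtype=int)
        attrs = np.zeros((self.n_attributes, 2), dtype=int)
        if self.phase == 0:           # sample presentation: attributes + task switch
            switch[self.switch] = 1
            attrs = self.sample
        elif self.phase == 2:         # target presentation: attributes only
            attrs = self.target
        control = np.array(self.CONTROL[self.phase], dtype=int)
        return np.concatenate([switch, attrs.ravel(), control])

    def reset(self):
        self.phase = 0
        self.switch = np.random.randint(self.n_attributes)
        self.sample = self._random_attributes()
        self.target = self._random_attributes()
        self.answer = 0
        return self._obs()

    def step(self, action):
        if self.phase == 2:           # answers are accepted in the target phase
            self.answer = action
        self.phase += 1
        reward, done = 0.0, False
        if self.phase == 4:           # reward presentation (after a delay)
            match = np.array_equal(self.sample[self.switch], self.target[self.switch])
            correct = 2 if match else 1
            reward = 1.0 if self.answer == correct else 0.0
            done = True
        return self._obs(), reward, done, {}
```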

How it works

Input Register

Input:
    • observation input
    • attention signal: instructs the attribute (a part of the observation input) to be retained 

Output: whether the retained observation input attribute and the corresponding attribute of a new observation input match (recognition)

Function: When an attention signal is given, the specified portion of the observation input is retained for a certain period of time; when a new observation input comes in, it outputs whether the specified portion of the retained content matches the corresponding portion of the new input.

Implementation: Simple register + comparator

Neural Correspondence: the short-term memory layer of the PFC
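
A minimal sketch of such a register-plus-comparator is given below. The (offset, length) form of the attention signal and the retention parameter are assumptions for illustration; the actual implementation may differ.

```python
import numpy as np


class InputRegister:
    """Sketch of the Input Register: a simple register plus a comparator."""

    def __init__(self, retention_steps=10):
        self.retention_steps = retention_steps   # how long the content is retained
        self.content = None                      # retained attribute (working memory)
        self.window = None                       # attended part of the observation
        self.age = 0

    def attend(self, start, length, observation):
        """Attention signal: retain the specified part of the observation."""
        self.window = (start, start + length)
        self.content = np.array(observation[start:start + length])
        self.age = 0

    def recognize(self, observation):
        """Comparator: 1 if the retained content matches the same part of the
        new observation, 0 otherwise (or when nothing is retained anymore)."""
        if self.content is None:
            return 0
        self.age += 1
        if self.age > self.retention_steps:      # retention period expired
            self.content = None
            return 0
        lo, hi = self.window
        return int(np.array_equal(self.content, observation[lo:hi]))
```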

Register Controller

Input: observation input

Output: attention signal

Function: determines the attribute to be retained in the input, and sends an attention signal to the input register (once per episode).

Implementation: 👉 A Minimal Cortex-Basal Ganglia Architecture

Neural Correspondence: (dl) PFC-Th-BG loop

Action Determiner

Input: observation input + recognition signal (from the Register)

Output: action selection

Function: determines the action to be output from the input (once per episode).

Implementation:  👉 A Minimal Cortex-Basal Ganglia Architecture

Neural Correspondence: PFC-Th-BG loop

Learning

Reward-based learning took place in two locations, i.e., the action determiner and the register controller.  Both were implemented with the Minimal Cortex-Basal Ganglia Architecture.

Though it was expected that the two learners would bootstrap each other toward task success, this did not work, and it was found that curriculum learning barely worked (see the Results section for details).

Architecture

Fig. 1

Implementation

Frameworks used

  • Cognitive Architecture Description Framework: BriCA (Brain-inspired Computing Architecture), a computational platform for brain-based software development, was used.
  • Environment description framework: OpenAI Gym, a widely used framework for agent learning environments, was used.

Delayed reward learner

A Minimal Cortex-Basal Ganglia Architecture was used.  BriCA is also used in the learner, and PyTorch is used to model the cortex.  For delayed reward learning (the basal ganglia part), a frequency-based learning algorithm was used.
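
The frequency-based algorithm itself is not detailed in this report. The sketch below only conveys the general idea assumed here: counting how often executing ("Go") a cortically proposed action in a given state has led to reward, and gating the action by that empirical frequency. The class name, the threshold, and the exploration scheme are illustrative assumptions, not the actual code of the Minimal Cortex-Basal Ganglia Architecture.

```python
import numpy as np
from collections import defaultdict


class FrequencyBGLearner:
    """Sketch of a frequency-based Go/NoGo learner for the basal ganglia part."""

    def __init__(self, epsilon=0.1, threshold=0.5):
        self.rewarded = defaultdict(float)   # times (state, action) led to reward
        self.tried = defaultdict(float)      # times (state, action) was executed
        self.epsilon = epsilon               # exploration rate
        self.threshold = threshold           # success rate needed to release "Go"

    def rate(self, state, action):
        """Empirical reward frequency of executing `action` in `state`."""
        key = (state, action)
        if self.tried[key] == 0:
            return 0.5                       # neutral prior for unseen pairs
        return self.rewarded[key] / self.tried[key]

    def go(self, state, action):
        """Decide Go (True) / NoGo (False) for a candidate cortical action."""
        if np.random.rand() < self.epsilon:
            return bool(np.random.rand() < 0.5)          # occasional exploration
        return self.rate(state, action) > self.threshold

    def update(self, state, action, reward):
        """Credit the (state, action) pair when the delayed reward arrives."""
        key = (state, action)
        self.tried[key] += 1
        self.rewarded[key] += reward
```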

Results

The agent was trained on the task with the above implementation, but its performance did not improve even after 400,000 episodes of trials.  This approach was therefore given up, and curriculum learning was tried instead: one learner was trained while the other was replaced with a "stub" that always produced the correct output, and the resulting training data was subsequently used to train the agent without the stub (i.e., with the two learners).
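
Combining the sketches above, the first stage of such a curriculum might look roughly like this: the Register Controller is replaced by a stub that reads the task switch directly from the observation and always attends to the correct attribute, while the action side (here crudely stood in for by the frequency-based learner sketch) is trained on the task. Phase selection is hard-coded for brevity, and the observation layout and the (offset, length) attention format are the assumptions introduced earlier.

```python
import numpy as np

# MatchToSampleEnv, InputRegister, and FrequencyBGLearner are the sketches above.

def run_stage1(episodes=40000, n_attributes=2):
    """Curriculum stage 1 (sketch): train the action side with a correct
    attention stub in place of the Register Controller."""
    env = MatchToSampleEnv(n_attributes)
    register = InputRegister()
    determiner = FrequencyBGLearner()
    rewards = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        last_state, last_action, reward = 0, 1, 0.0
        while not done:
            control, action = tuple(obs[-2:]), 0
            if control == (1, 0):
                # sample phase: the stub attends to the attribute named by the switch
                switch = int(np.argmax(obs[:n_attributes]))
                register.attend(n_attributes + 2 * switch, 2, obs)
            elif control == (1, 1):
                # target phase: answer based on the recognition signal
                state = register.recognize(obs)        # 1 = match, 0 = non-match
                if np.random.rand() < 0.1:             # explore occasionally
                    action = int(np.random.choice([1, 2]))
                else:
                    action = max((1, 2), key=lambda a: determiner.rate(state, a))
                last_state, last_action = state, action
            obs, reward, done, _ = env.step(action)
        determiner.update(last_state, last_action, reward)   # delayed reward
        rewards.append(reward)
    return float(np.mean(rewards))
```

In the second stage, the stub would be removed and the two learners trained together, as described above.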

When a stub is used for the Register Controller

Fig. 2 Training Action Determiner when a stub is used for Register Controller

Horizontal axis: episodes (x 100; 40,000 episodes learned). Legend: avr. reward = average reward; reward per go = reward per action output.


Fig. 3 Training the two learners with the learning data of Action Determiner learned with a stub for Register Controller

Horizontal axis: episodes (x 100). Legend: avr. reward = average reward; reward per go = reward per action output; go in sample phase = Register Controller output performed in the sample phase; correct wm = correctly set working memory; go in target phase = actions performed in the target phase.


The reward per action output and the percentage of correctly set working memories barely exceeded 0.5.  Phase selection was relatively successful.

When a stub is used for Action Determiner
Fig. 4 Training Register Controller when a stub is used for Action Determiner

Horizontal axis: episodes (x 100). Legend: avr. reward = average reward; reward per go = reward per action output; go in sample phase = Register Controller output performed in the sample phase; correct wm = correctly set working memory; go in target phase = actions performed in the target phase.

The proportion of correctly set working memories did not approach 1, because even when the working memory is not set correctly, the agent can still be rewarded through the random selection of the Action Determiner.


Fig. 5 Training the two learners with the learning data of Register Controller learned with a stub for Action Determiner

Horizontal axis: episodes (x 100). Legend: avr. reward = average reward; reward per go = reward per action output; go in sample phase = Register Controller output performed in the sample phase; correct wm = correctly set working memory; go in target phase = actions performed in the target phase.

The reward per action output and the percentage of correctly set working memories barely exceeded 0.5.  Meanwhile, the percentage of correctly set working memories was on a downward trend over 10,000 episodes.  Phase selection was relatively successful.

Discussions

Existing model

While PBWM is known as a model of working memory, it was not adopted because of the following points of "unnaturalness" regarding biological plausibility:

  • While it assumes that the basal ganglia (BG) gate sensory input, they are supposed to gate the cortical output-thalamus loop.
    cf. Benarroch, E. (2008) The midline and intralaminar thalamic nuclei Fig.2

  • While it assumes that working memory is retained during NoGo, NoGo is the default state of the BG; if that were the case, working memory retention would become the default state.

Meanwhile, it would be worthwhile to try the 1-2-AX working memory task used in the PBWM paper for comparison.

Learning Performance

The two choices, i.e., attentional selection of the attribute to be remembered and action selection based on the working memory, could not be learned from scratch in this implementation.  The original expectation was that correct attentional choices would provide more information for correct action choices via the working memory, so that the performance of action choice would improve and, with it, the learning of the correct attentional choice, in a "bootstrapping" way.  However, this was not observed.  While it has not been proved that this strategy would not work in other implementations, the search space only grows larger in more complex tasks, so it would be better to consider other strategies that allow learning to take place within a biologically reasonable number of episodes.

Though the curriculum learning "barely" worked this time, the results were not satisfactory.  It took about 10,000 episodes for the stub-based learning to converge, which is also beyond the biologically reasonable range.  This point may be improved by changing parameters in the Minimal Cortex-Basal Ganglia Architecture.

Prospects

In order to improve learning efficiency, the search space should be made smaller.  For diachronic learning of multiple actions, it would be worthwhile to devise a (biologically plausible) mechanism that memorizes the series of observations/actions that led to success and examines similar cases, treating them as hypotheses.