Tuesday, July 27, 2021

Visual Task and Architectures (2021)

This article describes an implementation plan for a brain-inspired cognitive architecture to solve visual working memory tasks.

The Task

A visual working memory task (the delayed match-to-sample task, in particular) will be used.

The reasons are as follows:

  • Sensory input is essential for embodied AI.
  • Vision is a sensory modality widely used in AI tasks.
  • Working memory is an essential cognitive function in the performance of complex tasks (as required by AGI).
  • Match-to-sample tasks are used in the 5th WBA hackathon, and resources (a task environment and sample code) are available.

Visual system

The human visual system

The features of the human visual system listed below shall be reproduced as much as possible.

  • Central and peripheral visions
    The center of the visual field (fovea) and the periphery have different resolutions.  They are also known to differ in color and motion perception.
  • Active vision
    Since the peripheral vision does not have sufficient resolution, eye movements (saccades) are necessary for collecting information with the central vision.
    Integrating information collected at different points in space and time is regarded as an instance of the binding problem.
  • What vs. Where Paths
    The mammalian visual system has the what pathway, which identifies objects, and the where pathway, which codes the location and movement of objects.
    The two pathways originate in the small (parvo-) and large (magno-) cells of the LGN (lateral geniculate nucleus), respectively.
  • Receptive field hierarchy
    The hierarchy of the visual cortex (V1 ⇒ V2 ⇒ ...) forms a corresponding hierarchy of receptive fields, whose sizes increase at higher levels.
  • Saliency map
    A saliency map is coded in the superior colliculus (SC) to guide saccades.
  • Binocular vision

Visual system implementation

  • Central and peripheral visions
    While separate pathways were used in the 4th & 5th WBA hackathons, they can be combined into one with the log-polar coordinates used in the mammalian visual system.
  • Monocular and binocular visions
    As binocular vision is not necessary for the current task, monocular vision shall be used.
  • The What Path
    The implementation will use logarithmic polar coordinate images as input.
    Edges and blob detection shall be considered in pre-processing.
  • The Where Path
    In the tasks concerned, it will be sufficient to obtain only the gaze position for the where path.
  • Receptive field hierarchy
    The way to implement the receptive field hierarchy with existing machine learners should be studied.
  • Saliency Map
    Time integral and winner-take-all circuits shall be implemented.
    Signal strength and its time derivative shall be used for bottom-up saliency.
    Spatial filters (detection of feature points like corners and blobs) shall also be considered.
    For scanning, the saliency of the most recently visited locations will be suppressed.
  • Active vision
    Saccades move the gaze to the most salient coordinates in the saliency map when a certain saccade condition is met.  Training is required to calibrate the saccade destination.
    Recognition of figures that are too large to fit in the fovea requires image learning with active vision, in which the higher-order visual cortex requires gaze position information.
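The winner-take-all selection over the saliency map, with suppression of recently visited locations, can be sketched as follows.  The `decay` and `suppression` constants are illustrative assumptions, not part of the plan:

```python
import numpy as np

def pick_saccade_target(saliency, visited, decay=0.9, suppression=1.0):
    """Winner-take-all over a saliency map with inhibition of return.

    `decay` and `suppression` are assumed values: the inhibition trace
    of visited locations fades by `decay` per step and lowers the
    effective saliency by `suppression` times the trace.
    """
    visited = visited * decay                     # older visits fade out
    effective = saliency - suppression * visited  # suppress recent locations
    y, x = np.unravel_index(np.argmax(effective), effective.shape)
    visited[y, x] += 1.0                          # inhibit the chosen location
    return (int(y), int(x)), visited
```

Calling this repeatedly makes the gaze scan from the most salient location to the next, as described above.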

Implementation of visual cortices

The visual processing described here differs from the conventional image processing of deep learners in the following points.

  • Central and peripheral visions
  • Active vision
  • Receptive field hierarchy

Some of the assumed image processing will be handled with an off-the-shelf image processing library such as OpenCV:

  • Logarithmic polar coordinate conversion
  • Spatial filters (e.g., edge & blob detection)
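For illustration, the log-polar conversion can be sketched in plain NumPy (OpenCV provides the same transform via cv2.warpPolar with the cv2.WARP_POLAR_LOG flag).  The output size is an arbitrary assumption:

```python
import numpy as np

def log_polar(image, out_h=64, out_w=64):
    """Sample an image onto a log-polar grid centred on the image centre.

    Rows index log-radius (the fovea is magnified, the periphery
    compressed), columns index angle.  This NumPy version is only for
    illustration; in practice cv2.warpPolar would be used.
    """
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = min(cy, cx)
    # log-spaced radii from ~1 pixel out to the image border
    radii = np.exp(np.linspace(0.0, np.log(max_r), out_h))
    angles = np.linspace(0.0, 2.0 * np.pi, out_w, endpoint=False)
    ys = np.clip(np.rint(cy + radii[:, None] * np.sin(angles)), 0, h - 1).astype(int)
    xs = np.clip(np.rint(cx + radii[:, None] * np.cos(angles)), 0, w - 1).astype(int)
    return image[ys, xs]
```

Because the radii are log-spaced, a small central patch occupies many output rows while the periphery is compressed, reproducing the resolution difference between central and peripheral vision.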

The design of the receptive field hierarchy should be done carefully.  It is also necessary to consider whether and how to use convolutional networks (CNNs).

Back-propagation and supervised learning between layers will not be used, as they are not biologically plausible.  Learning will be carried out with auto-encoders within layers.
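A minimal sketch of such layer-local learning, assuming a tied-weight linear decoder and treating the tanh encoder as fixed during the update (sizes and learning rate are arbitrary assumptions):

```python
import numpy as np

class LayerAutoEncoder:
    """One cortical layer trained locally as a tied-weight auto-encoder.

    No error signal crosses layer boundaries: each layer learns to
    reconstruct its own input, and its hidden code is passed upward.
    """
    def __init__(self, n_in, n_hidden, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.lr = lr

    def encode(self, x):
        return np.tanh(self.W @ x)

    def train_step(self, x):
        h = self.encode(x)
        x_hat = self.W.T @ h          # tied-weight decoder
        err = x_hat - x               # local reconstruction error
        # gradient through the decoder only; the encoder path is
        # ignored in this update (a simplification)
        self.W -= self.lr * np.outer(h, err)
        return float((err ** 2).mean())
```

Layers would be stacked by feeding `layer1.encode(x)` as the input of the next layer, so that learning stays within each layer.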

The design choice must be made between a static and dynamic model.  As active vision requires the integration of eye movements and image input sequences, a dynamic model such as RSM will be considered.

The figure below summarizes the design.

Fig. 1 Visual System Architecture
While the eye position information is assumed to be a top-down input
from MTL in the figure, it could alternatively be a bottom-up input.

Training Vision

STEP1: Training the Saccade Driver (Fig. 1)

Saccade calibration will be performed with the difference between the fovea position after saccade and the center of saliency as the error.
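This calibration can be sketched as a delta rule on a single saccade gain.  The motor plant (`true_gain`), the learning rate, and the target range are assumed purely for illustration:

```python
import numpy as np

def calibrate_saccade(true_gain=1.3, lr=0.5, steps=100, seed=0):
    """Delta-rule calibration of a saccade gain (illustrative sketch).

    The driver issues a motor command `gain * target`; the eye lands at
    `command / true_gain` (an assumed, initially unknown motor plant).
    The post-saccade error between the fovea position and the centre of
    saliency adjusts the gain until saccades land on target.
    """
    rng = np.random.default_rng(seed)
    gain = 1.0
    for _ in range(steps):
        target = rng.uniform(-10.0, 10.0)  # centre of saliency (retinal coords)
        if abs(target) < 1e-3:
            continue
        landing = (gain * target) / true_gain
        error = landing - target           # fovea minus saliency centre
        gain -= lr * error / target        # normalized delta-rule update
    return gain
```

The gain converges to the (hidden) plant gain, so the calibrated saccade lands the fovea on the selected saliency peak.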

STEP2: Training of the visual cortex

Two possible training methods are considered: (1) auto-encoding, with log-polar images (the output of the spatiotemporal filters in Fig. 1) as input, and (2) training driven by the output of the Saccade Driver (Fig. 1), with the difference between the predicted image at t+1 and the actual image as the error.  The latter is optional.

Working memory

It is known that the dorsolateral prefrontal cortex (dlPFC) is involved in working memory, but there is no established theory on the mechanism.  Here, the implementation of a proposed model of working memory (2021-07) is considered.

The basic idea is as follows:

  • The prefrontal cortex has recurrent loops that hold short-term memory.
  • The short-term memory at a cortical region (columnar structure) starts to be retained when the thalamic gate is disinhibited by the basal ganglia (Go).
  • The short-term memory is released after a certain time period unless it is reselected as Go.
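Under these assumptions, one memory slot (columnar structure) can be sketched as follows.  The hold duration and the discrete-time treatment are illustrative simplifications, not part of the referenced model:

```python
class WorkingMemorySlot:
    """One cortical column's short-term memory gated by the basal ganglia.

    A Go signal disinhibits the thalamic gate and loads the current
    input into the recurrent loop; the content is released after
    `hold_steps` time steps unless Go reselects it.
    """
    def __init__(self, hold_steps=5):
        self.hold_steps = hold_steps
        self.content = None
        self.timer = 0

    def step(self, x, go):
        if go:                       # BG disinhibits the thalamic gate (Go)
            self.content = x         # load / refresh the recurrent loop
            self.timer = self.hold_steps
        elif self.timer > 0:
            self.timer -= 1          # maintenance without reselection
            if self.timer == 0:
                self.content = None  # memory released after the hold period
        return self.content
```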

The prefrontal cortex (dlPFC) and the corresponding basal ganglia receive the following information (via the MTL):

  • The output from the visual what path
  • Eye position (Positional Coding in Fig. 1)
  • Keyboard selection (action)
  • Reward signals

Task execution

The premotor cortex is hypothesized here to perform the task execution (keyboard selection).

The premotor cortex and the corresponding basal ganglia receive the following information:

  • Short-term memory contents (from dlPFC)
  • Keyboard selection (action)
  • Reward signals

Gaze control

The FEF (Frontal Eye Field) controls the gaze.

The FEF and the corresponding basal ganglia receive the following information:

  • Short-term memory contents (from dlPFC)
  • Eye position (Positional Coding in Fig. 1)
  • Reward signals

Control by FEF is optional.

Implementation of the cortico-thalamo-BG control

The basal ganglia are supposed to control the cortico-thalamic loop by inhibiting nuclei in the thalamus.

An Actor-Critic function from an off-the-shelf RL library will be used to implement the control by the basal ganglia. 
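To make the intended control concrete, Go/No-Go selection can be sketched as a tabular actor-critic on a toy problem.  Everything below (the two-state problem, rewards, and constants) is an illustrative assumption standing in for the off-the-shelf library, not the task itself:

```python
import numpy as np

def train_go_nogo(n_states=2, steps=2000, alpha=0.1, seed=0):
    """Tabular actor-critic for BG Go/No-Go selection (toy sketch).

    In this assumed toy problem, Go is rewarded in state 0 and
    punished in state 1.  The critic learns state values; the actor
    learns a preference for disinhibiting the thalamic gate (Go).
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)          # critic: state values
    pref = np.zeros(n_states)       # actor: preference for Go
    for _ in range(steps):
        s = rng.integers(n_states)
        p_go = 1.0 / (1.0 + np.exp(-pref[s]))
        go = rng.random() < p_go
        r = (1.0 if go else 0.0) if s == 0 else (-1.0 if go else 0.0)
        td = r - V[s]               # one-step episodes: no bootstrap term
        V[s] += alpha * td          # critic update
        pref[s] += alpha * td * ((1.0 if go else 0.0) - p_go)  # actor update
    return pref
```

An off-the-shelf library would replace this loop, but the division of labor is the same: the critic evaluates states and the actor gates Go/No-Go.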

Whether the lateral inhibitory and triggering circuits for winner-take-all control are located in the cortex or thalamus needs to be investigated.

The accumulator model will be employed.  Accumulation in the cortico-thalamic loop is hypothesized to occur while the basal ganglia circuit is Go.

cf. Agarwal, A. et al.: Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning. arXiv:1809.09147 [cs.LG].
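A minimal sketch of such gated accumulation, with an assumed fixed threshold and an optional leak term:

```python
def accumulate_to_bound(evidence, threshold=3.0, leak=0.0):
    """Evidence accumulation gated by the basal ganglia (sketch).

    `evidence` is a sequence of (sample, go) pairs.  While the BG
    circuit is Go, samples are summed in the cortico-thalamic loop;
    a decision is emitted when the accumulator crosses `threshold`.
    Both `threshold` and `leak` are assumed parameters.
    """
    acc = 0.0
    for t, (e, go) in enumerate(evidence):
        if go:
            acc = (1.0 - leak) * acc + e
            if acc >= threshold:
                return t, acc        # decision time and final value
    return None, acc                 # no decision within the sequence
```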

Learning in the executive cortex

The executive cortices (prefrontal cortex, premotor cortex, and FEF) should also learn and make predictions.  It is conceivable that prediction is used for reinforcement learning or that the results of reinforcement learning are used for prediction.  However, as the need for such learning in the current task is not clear, the implementation will be optional.

Summary: the architecture

The non-optional architecture of the current plan is shown in Fig. 2.

Fig. 2 
SC: superior colliculus, BG: basal ganglia, preMotor: premotor cortex

It is simpler than the sample architecture for the 5th WBA Hackathon (i.e., LIP: lateral intraparietal area, FEF: frontal eye field, and the motor cortex are omitted).  Meanwhile, Spatio-Temporal Filters have been added.  In the neural system, the filtering is supposed to be realized in the optic nerve system between the retina and the primary visual cortex, including the LGN.

