Saturday, December 4, 2021

A Minimal Cortex-Basal Ganglia Architecture

I have implemented a minimal decision-learning mechanism inspired by the cortex-basal ganglia-thalamus loop as part of my effort to create a brain-inspired cognitive architecture.

It is based on the following hypothesis:

  • The cortex predicts the decision.
  • The basal ganglia (BG) determines the execution of the decision.

In this implementation, the cortex part initially makes random decisions, and the BG part judges whether a decision is appropriate for execution.  In terms of learning, the BG part learns a Go/NoGo policy, while the cortex part learns to predict actions from the input and the decisions actually executed under the BG part's Go.

Reasons for the hypothesis

The hypothesis is corroborated by the following:

  • As for the cerebral cortex making action choices (predictions):
    • The thalamic matrix cells that are inhibited by the BG may not be "fine-grained" enough for choices made by cortical units such as mini-columns.
    • The GPi of the BG, which (de-)inhibits the thalamus, may likewise not be "fine-grained" enough for such cortical choices.
  • The BG has been said to control the timing of action initiation.
  • The hypothesis reconciles the role of reinforcement learning in the BG and prediction in the cerebral cortex in action selection.
  • Reinforcement learning in the basal ganglia is necessary to cope with delayed reward.

Specifications

The Cortical part

The cortical part receives "observation input" and outputs action selection.

Output predictor: learns to predict the action from the observed inputs, with the executed action as the supervisory signal.
A two-layer perceptron was used for the implementation.

Output moderator: calculates its output from the output of the predictor and a noise input.  As the predictor learns to make correct predictions, the contribution of the noise decreases.  Specifically, the rate of correct task answers was used as the probability of using the output prediction.

Output selector: selects the largest output (winner-take-all) from the moderator and gates its output with the Go/NoGo signal from the BG.
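
The sketch below shows how these three sub-components might fit together; it is a minimal illustration under the assumptions above, not the actual implementation, and names such as CorticalPart and the hidden-layer width are illustrative.

    import numpy as np
    import torch
    import torch.nn as nn

    class CorticalPart:
        """Illustrative sketch: output predictor + output moderator + output selector."""
        def __init__(self, n_obs, n_act, hidden=16):
            # Output predictor: a two-layer perceptron mapping the observation to action scores.
            self.predictor = nn.Sequential(
                nn.Linear(n_obs, hidden), nn.Sigmoid(),
                nn.Linear(hidden, n_act), nn.Sigmoid())
            self.n_act = n_act

        def moderate(self, obs, correct_rate):
            # Output moderator: use the prediction with probability = task performance rate,
            # otherwise fall back to random (noise) output candidates.
            if np.random.rand() < correct_rate:
                with torch.no_grad():
                    return self.predictor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
            return np.random.rand(self.n_act)

        def select(self, moderated, go):
            # Output selector: winner-take-all over the moderated output, gated by the BG's Go/NoGo.
            action = np.zeros(self.n_act)
            if go:
                action[int(np.argmax(moderated))] = 1.0
            return action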

Thalamus

The thalamus was not implemented; the Go/NoGo signal from the BG was passed directly to the output selector of the cortical part.
In the brain, the thalamus receives an inhibitory signal from the BG (NoGo), and when that inhibition is released, the thalamus effectively sends a Go signal to the cortex.

The BG part

Reinforcement learning for the Go/NoGo decision was performed with the cortical input and its potential output as states.

Overall architecture


Fig.1

The task for testing

Delayed Reward Task

Task Environment (CBT1Env.py)

An action is accepted while a cue is presented for a certain amount of time.
The cue is one of the following: [1,1,0], [0,0,1], [0,1,1], [1,0,0].
The observation outside the presentation period is [0,0,0].
If the cue and the action satisfy one of the following conditions, a reward of 1 is given after a certain delay:
  • Cue [1,1,0] or [0,0,1] and action [1,0]
  • Cue [0,1,1] or [1,0,0] and action [0,1]

Penalty

If an action other than [0,0] is made within the presentation period without meeting the reward condition above, a negative reward is given for the "wrong try."
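
Below is a Gym-style sketch of the task rules above.  It is not the actual CBT1Env.py; the class name MinimalCueEnv and the default presentation and delay lengths are assumptions for illustration.

    import numpy as np
    import gym

    CUES = [(1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 0, 0)]
    REWARDED_ACTION = {(1, 1, 0): (1, 0), (0, 0, 1): (1, 0),
                       (0, 1, 1): (0, 1), (1, 0, 0): (0, 1)}

    class MinimalCueEnv(gym.Env):
        """Illustrative sketch of the delayed reward task (not the actual CBT1Env.py)."""
        def __init__(self, present_steps=3, delay_steps=2, penalty=-0.7):
            self.observation_space = gym.spaces.MultiBinary(3)
            self.action_space = gym.spaces.MultiBinary(2)
            self.present_steps, self.delay_steps, self.penalty = present_steps, delay_steps, penalty

        def reset(self):
            self.cue = CUES[np.random.randint(len(CUES))]
            self.t, self.reward_at = 0, None
            return np.array(self.cue)

        def step(self, action):
            self.t += 1
            in_presentation = self.t <= self.present_steps
            obs = np.array(self.cue if in_presentation else (0, 0, 0))
            reward, done = 0.0, False
            if in_presentation and tuple(action) != (0, 0) and self.reward_at is None:
                if tuple(action) == REWARDED_ACTION[self.cue]:
                    self.reward_at = self.t + self.delay_steps   # reward 1 after a delay
                else:
                    reward = self.penalty                        # penalty for a "wrong try"
            if self.reward_at is not None and self.t >= self.reward_at:
                reward, done = 1.0, True
            if self.t > self.present_steps + self.delay_steps:
                done = True
            return obs, reward, done, {}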

Implementation and Evaluation

Frameworks used

Cognitive Architecture Description Framework: BriCA

BriCA (Brain-inspired Computing Architecture) was used to construct the architecture.  BriCA is a computational platform for brain-based software development (see [1] for its significance).

Environment description framework: OpenAI Gym

OpenAI Gym is a widely used framework for agent learning environments.

Machine learning framework: PyTorch

PyTorch is widely used as a machine learning framework.

Reinforcement Learning Framework: TensorForce

TensorForce is a widely used reinforcement learning framework.

For learning the basal ganglia, a frequency-based method was also used.

The ideas in "Minimal Cognitive Architecture" (in this blog) were used to combine BriCA, Gym, PyTorch, and TensorForce.

Training the BG part

Reinforcement learner

The BG part has an internal environment whose observations consist of the observation input and the output selector state and whose reward comes from the external environment; it decides whether to let the agent take an action (Go/NoGo).

Debugging to synchronize BriCA with the external environment, internal environment, and reinforcement learner was an onerous task.

Initially, the internal environment used the time steps of the external environment as its own.  However, learning turned out to be unstable because the learner was rewarded for producing different outputs before the agent decided on an action.  So the architecture was changed so that the agent has only one action choice per episode and the internal environment has only one time step for the presentation period (the biological implication of this remains to be examined).
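
The sketch below illustrates such a one-step internal environment, assuming TensorForce's custom Environment interface (states / actions / reset / execute).  The class name BGInternalEnv and the helper methods used to pass in the state and the external reward are illustrative, not the actual code.

    import numpy as np
    from tensorforce.environments import Environment

    class BGInternalEnv(Environment):
        """Illustrative one-step internal environment for the BG learner.
        State: observation input concatenated with the output selector candidate.
        Action: Go (1) / NoGo (0).  Reward: relayed from the external environment."""
        def __init__(self, n_obs=3, n_act=2):
            super().__init__()
            self.n_state = n_obs + n_act
            self.state = np.zeros(self.n_state)
            self.external_reward = 0.0

        def states(self):
            return dict(type='float', shape=(self.n_state,))

        def actions(self):
            return dict(type='int', num_values=2)   # 0 = NoGo, 1 = Go

        def set_state(self, obs, candidate):        # called once per presentation period
            self.state = np.concatenate([obs, candidate]).astype(float)

        def set_reward(self, reward):               # reward obtained from the external environment
            self.external_reward = reward

        def reset(self):
            return self.state

        def execute(self, actions):
            # One action per episode: the episode terminates immediately, and the
            # reward relayed from the external environment is returned.
            return self.state, True, self.external_reward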

Fig. 2 shows the results of the training.  As for RL algorithms, VPG (Vanilla Policy Gradient), A2C (Advantage Actor-Critic), and PPO (Proximal Policy Optimization) were used, as they are available in TensorForce.

Fig.2
Horizontal axis: number of episodes in hundreds
Vertical axis: Average reward for Go
Penalty = -0.7
For VPG and A2C, the average reward approached 1 after 4,000 trials.
With PPO, there was no apparent learning (the reason is unknown).

When no penalty was given, RL algorithms always chose Go, so learning was not successful.

Supplementary Result at 2023-10

In October 2023, an additional experiment with a Deep Q-Network (DQN, TensorForce) turned out to work fine.


Fig.2'
The legend is the same as in Fig. 2.
The green line shows the result for DQN.

Frequency-based learner

Mechanism: For each observed value, the learner estimates the probability of receiving a positive reward for a Go decision and performs a coin toss (Bernoulli trial) with that probability to determine Go/NoGo.  Given n coin tosses with k heads and n-k tails, the posterior over the success probability is a beta distribution with parameters a = k+1 and b = n-k+1, whose mean is a/(a+b) = (k+1)/((k+1) + (n-k+1)) = (k+1)/(n+2).  The frequency-based learner has the advantage of probabilistic interpretability.
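
The following is a minimal sketch of such a learner; the per-state dictionary bookkeeping and the interface are illustrative choices, not the actual code.

    import numpy as np

    class FrequencyBasedGoLearner:
        """Sketch of a frequency-based Go/NoGo learner: for each observed state, keep a
        Beta(k+1, n-k+1) posterior over the probability that Go yields a positive reward."""
        def __init__(self):
            self.heads = {}   # k: number of Go decisions followed by reward, per state
            self.tosses = {}  # n: number of Go decisions, per state

        def decide(self, state):
            key = tuple(state)
            k, n = self.heads.get(key, 0), self.tosses.get(key, 0)
            p = (k + 1) / (n + 2)                # posterior mean a/(a+b) = (k+1)/(n+2)
            return int(np.random.rand() < p)     # coin toss: 1 = Go, 0 = NoGo

        def update(self, state, went, rewarded):
            if went:                             # only Go decisions count as tosses
                key = tuple(state)
                self.tosses[key] = self.tosses.get(key, 0) + 1
                if rewarded:
                    self.heads[key] = self.heads.get(key, 0) + 1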

Results: The results converged to a/(a+b), i.e., roughly the number of co-occurrences of Go and reward divided by the total number of Go decisions.  As in the case of reinforcement learning, this value does not reach 1 even after learning progresses, leaving room for trial and error (and thus for resulting errors).  Fig. 3 shows a graph of the results.  The average reward for Go approaches 1 faster than with reinforcement learning.

Fig.3
Horizontal axis: number of episodes in hundreds
Vertical axis: Average reward for Go

Comparison with Reinforcement Learning

The results of learning the task with reinforcement learners are shown in Fig. 4.
As the number of actions per episode was set to one for the results in Fig. 3, the experiment was conducted with only one step for the presentation period and no reward delay.
As reinforcement learning algorithms, VPG, A2C, and PPO from TensorForce were used.

Test program: CBT1EnvRLTest.py


Fig.4
Horizontal axis: number of episodes in hundreds
Vertical axis: Average reward
Penalty = -0.7

In Fig. 4, VPG performed better.  However, learning was generally not stable (e.g., PPO also worked when the presentation period was set to 3 steps and the reward delay to 2 steps).

Training the cortical part

A two-layer perceptron (with PyTorch) was trained with the observation as input and the executed action choice as the target.  The task performance rate was used as the probability of using the cortical output prediction for the Go/NoGo judgment in the BG part (when the prediction is not used, the output candidates are determined randomly).

Since the perceptron learned more slowly than the reinforcement learner, each input was trained for one hundred epochs, which resulted in proper learning (Fig. 5).

As the loss function, binary cross entropy (considered suitable for binary classification tasks) was used.  As the optimization algorithm, stochastic gradient descent (SGD) was used with its learning rate set to 0.5 (determined through trial and error).
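
The sketch below puts these settings together (binary cross entropy, SGD with a learning rate of 0.5, one hundred epochs per sample); the hidden-layer width and the function name are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Two-layer perceptron trained with BCE loss and SGD (lr = 0.5), as described above.
    model = nn.Sequential(nn.Linear(3, 16), nn.Sigmoid(),
                          nn.Linear(16, 2), nn.Sigmoid())
    criterion = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

    def train_on_executed_action(observation, executed_action, epochs=100):
        """Use the action actually executed under the BG's Go as the supervisory signal."""
        x = torch.as_tensor(observation, dtype=torch.float32)
        y = torch.as_tensor(executed_action, dtype=torch.float32)
        for _ in range(epochs):                  # repeat to compensate for the slow learning
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        return loss.item()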

Fig. 5 plots the change in the loss value when the frequency-based learner was used to train the BG part.  The loss value decreases following the progress of BG learning.

Fig.5
Horizontal axis: number of episodes in hundreds
Vertical axis: Average reward for Go or loss value
Penalty = -0.7
Blue line: Average reward for frequency learning
Red line: Loss for frequency learning

Biological plausibility

The internal mechanisms of the basal ganglia and the thalamus were not implemented, so their biological plausibility cannot be evaluated.

While the cortical implementation consists of three parts (output predictor, output moderator, and output selector), there is no direct evidence for their biological plausibility.  Nonetheless, as the cerebral cortex codes output, input-output mapping must take place somewhere inside it, and output is related at least to Layer 5.  There are also findings that output is gated by Go/NoGo signals via the basal ganglia and thalamus.  The output moderator was implemented as a random selection between noise and prediction; finding a functionally similar mechanism in the brain is left as homework for neuroscience.

If action choices are coded in (mini-)columns, inter-column lateral inhibition (for winner-take-all) would be expected, though I have not found evidence for it in the literature.

The task performance rate was used as the probability of using prediction in the implementation of the cortical part.  This assumption would not be unreasonable, as dopaminergic signals are provided to the biological cortex.

As for the connection between the cortex and the basal ganglia, the implementation passes both the input to the cortical part and the output candidate selected in the cortical part to the BG part.  The former corresponds to the connection from the cortex to the striatal patch in Figure 1.3 of [2], and the latter to the connection from the cortex to the striatal matrix in the same figure.

Issues

A question to be asked is whether the mechanism works in more complex tasks.  It will be tested in tasks requiring working memory or more realistic perceptual input, for which an architecture that includes models of specific cortical areas and other brain regions will be required.

It is said that evidence accumulation takes place in action decisions in the basal ganglia [3][4].  When dealing with uncertain input or multiple types of evidence, an accumulation mechanism should be introduced.

In addition, the drift-diffusion model discussed in [5][6], which models the trade-off between taking action and assessing the situation (making a choice at the expense of accuracy under time pressure), will also need to be implemented.

References

[1] Takahashi, K. et al.: A Generic Software Platform for Brain-Inspired Cognitive Computing, Procedia Computer Science 71:31-37, doi: 10.1016/j.procs.2015.12.185 (2015)

[2] Yoshizawa, T.: A Physiological Study of the Striatum Based on Reinforcement Learning Models of the Basal Ganglia, Doctoral Dissertation, NAIST-IS-DD1461011 (2018)

[3] Agarwal, A. et al.: Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning. arXiv:1809.09147 [cs.LG] (2018)

[4] Bogacz, R. et al.: The Basal Ganglia and Cortex Implement Optimal Decision Making Between Alternative Actions, Neural Computation 19(2):442-477, doi: 10.1162/neco.2007.19.2.442 (2007)

[5] Dunovan, K. et al.: Believer-Skeptic Meets Actor-Critic: Rethinking the Role of Basal Ganglia Pathways during Decision-Making and Reinforcement Learning, doi: 10.3389/fnins.2016.00106 (2016)

[6] Ratcliff, R. and McKoon, G.: The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks. doi: 10.1162/neco.2008.12-06-420 (2008)

        Tuesday, July 27, 2021

        Visual Task and Architectures (2021)

        This article describes an implementation plan for a brain-inspired cognitive architecture to solve visual working memory tasks.

        The Task

        A visual working memory task (the delayed match-to-sample task, in particular) will be used.

        The reasons are as follows:

        • Sensory input is essential for embodied AI.
        • Vision is a sensory modality widely used in AI tasks.
        • Working memory is an essential cognitive function in the performance of complex tasks (as required by AGI).
        • Match-to-sample tasks were used in the 5th WBA hackathon, and resources (a task environment and sample code) are available.

        Visual system

        The human visual system

        The features of the human visual system listed below shall be reproduced as much as possible.

        • Central and peripheral visions
          The center of the visual field (fovea) and the periphery have different resolutions.  It is also known that they have different color and motion perceptions.
        • Active vision
          Since the peripheral vision does not have sufficient resolution, eye movements (saccades) are necessary for collecting information with the central vision.
          Information integration from different points in time-space is regarded as a case of the binding problem.
        • What vs. Where Paths
          The mammalian visual system has the what pathway, which identifies objects, and the where pathway, which codes the location and movement of objects.
          The two pathways originate in the small (parvo-) and large (magno-) cells of the LGN (lateral geniculate nucleus), respectively.
        • Receptive field hierarchy
          The hierarchy of the visual cortex (V1⇒V2...) has a hierarchy of receptive fields.
        • Saliency map
          It is coded in the superior colliculus (SC) to guide the saccades.
        • Binocular vision

        Visual system implementation

        • Central and peripheral visions
          While separate pathways were used in the 4th & 5th WBA hackathons, they can be combined into one with the log-polar coordinate used in the mammalian visual system.
        • Monocular and binocular visions
          As binocular vision is not necessary for the current task, the monocular vision shall be used.
        • The What Path
          The implementation will use logarithmic polar coordinate images as input.
          Edges and blob detection shall be considered in pre-processing.
        • The Where Path
          In the tasks concerned, it will be sufficient to obtain only the gaze position for the where path.
        • Receptive field hierarchy
          The way to implement the receptive field hierarchy with existing machine learners should be studied.
        • Saliency Map
          Time integral and winner-take-all circuits shall be implemented.
          Signal strength and its time derivative shall be used for bottom-up saliency.
          Spatial filters (detection of feature points like corners and blobs) shall also be considered.
          For scanning, the saliency of the most recently visited locations will be suppressed (see the sketch after this list).
        • Active vision
          Saccades move the gaze to the most salient coordinates in the saliency map when a certain saccade condition is met.  Training is required to calibrate the saccade destination.
          Recognition of figures that are too large to fit in the fovea requires image learning with active vision, in which the higher-order visual cortex should require gaze position information.
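
        The following sketch illustrates the intended saliency computation (signal strength plus its time derivative), the winner-take-all selection, and the suppression of recently visited locations; the equal weighting and the decay factor are illustrative assumptions.

            import numpy as np

            def saliency_step(frame, prev_frame, visited, decay=0.9):
                """Sketch: bottom-up saliency with winner-take-all and inhibition of return."""
                strength = np.abs(frame)                    # signal strength
                derivative = np.abs(frame - prev_frame)     # time derivative of the signal
                saliency = (strength + derivative) * (1.0 - visited)   # suppress visited spots
                target = np.unravel_index(np.argmax(saliency), saliency.shape)  # winner-take-all
                visited *= decay                            # suppression fades over time
                visited[target] = 1.0                       # mark the newly attended location
                return target, visited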

        Implementation of visual cortices

        The visual processing described here differs from the conventional image processing of deep learners in the following points.

        • Central and peripheral visions
        • Active vision
        • Receptive field hierarchy

        Some of the assumed image processing will be handled with an off-the-shelf image processing library such as OpenCV (a sketch follows the list):

        • Logarithmic polar coordinate conversion
        • Spatial filters (e.g., edge & blob detection)
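
        For instance, the logarithmic polar coordinate conversion can be sketched with OpenCV's warpPolar as below; the output size and the use of the image center as the gaze point are illustrative assumptions.

            import cv2

            def to_log_polar(image, out_size=(64, 64)):
                """Sketch: log-polar ('retinal') conversion centered on the gaze point."""
                h, w = image.shape[:2]
                center = (w / 2.0, h / 2.0)                 # here simply the image center
                max_radius = min(center)
                return cv2.warpPolar(image, out_size, center, max_radius,
                                     cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)

            # Edge and blob detection in pre-processing could use, e.g., cv2.Canny and
            # cv2.SimpleBlobDetector_create() on the converted image.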

        The design of the receptive field hierarchy should be done carefully.  It is also necessary to consider whether and how to use convolutional networks (CNNs).

        Back-propagation and supervised learning between layers will not be used, as they are not biologically plausible.  Learning will be carried out with auto-encoders within layers.

        The design choice must be made between a static and dynamic model.  As active vision requires the integration of eye movements and image input sequences, a dynamic model such as RSM will be considered.

        The figure below summarizes the design.

        Fig. 1 Visual System Architecture
        While the eye position information is assumed to be a top-down input
        from MTL in the figure, it might as well be a bottom-up input.

        Training Vision

        STEP1: Training the Saccade Driver (Fig. 1)

        Saccade calibration will be performed with the difference between the fovea position after saccade and the center of saliency as the error.

        STEP2: Training of the visual cortex

        Two possible training methods are: (1) auto-encoding with log-polar coordinate images (the output of the spatio-temporal filters in Fig. 1) as input, and (2) training with the output of the Saccade Driver (Fig. 1), using the difference between the predicted t+1 image and the actual image as the error.  The latter is optional.

        Working memory

        It is known that the dorsolateral prefrontal cortex (dlPFC) is involved in working memory, but there is no established theory on the mechanism.  Here, the implementation of a proposed model of working memory (2021-07) is considered.

        The basic idea is as follows (a minimal sketch appears after the list):

        • The prefrontal cortex has recurrent loops that hold short-term memory.
        • The short-term memory at a cortical region (columnar structure) starts to be retained when the thalamic gate is disinhibited by the basal ganglia (Go).
        • The short-term memory is released after a certain time period unless it is reselected as Go.
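
        The sketch below illustrates this gating idea for a single cortical column; the retention length and the interface are illustrative assumptions, not a specification.

            class ShortTermMemoryColumn:
                """Sketch: a column starts holding an item when the thalamic gate is
                disinhibited by the BG (Go) and releases it unless reselected as Go."""
                def __init__(self, hold_steps=10):
                    self.hold_steps = hold_steps
                    self.content = None
                    self.remaining = 0

                def update(self, go, candidate):
                    if go:                                  # gate opened by the BG
                        self.content = candidate
                        self.remaining = self.hold_steps    # (re)start the retention period
                    elif self.remaining > 0:
                        self.remaining -= 1                 # recurrent loop keeps the memory
                        if self.remaining == 0:
                            self.content = None             # released unless reselected
                    return self.content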

        The prefrontal cortex (dlPFC) and the corresponding basal ganglia receive the following information (via the MTL):

        • The output from the visual what path
        • Eye position (Positional Coding in Fig. 1)
        • Keyboard selection (action)
        • Reward signals

        Task execution

        The premotor cortex is hypothesized here to perform the task execution (keyboard selection).

        The premotor cortex and the corresponding basal ganglia receive the following information:

        • Short-term memory contents (from dlPFC)
        • Keyboard selection (action)
        • Reward signals

        Gaze control

        The FEF (Frontal Eye Field) controls the gaze.

        The FEF and the corresponding basal ganglia receive the following information:

        • Short-term memory contents (from dlPFC)
        • Eye position (Positional Coding in Fig. 1)
        • Reward signals

        Control by FEF is optional.

        Implementation of the cortico-thalamo-BG control

        The basal ganglia is supposed to control the cortico-thalamic loop by inhibiting nuclei in the thalamus.

        An Actor-Critic function from an off-the-shelf RL library will be used to implement the control by the basal ganglia. 

        Whether the lateral inhibitory and triggering circuits for winner-take-all control are located in the cortex or thalamus needs to be investigated.

        The accumulator model will be employed.  Accumulation in the cortico-thalamic loop will be hypothesized to occur while the basal ganglia circuit is Go.

        cf. Agarwal, A. et al.: Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning. arXiv:1809.09147 [cs.LG].
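
        A minimal sketch of this hypothesis is given below; the threshold value and the form of the evidence stream are illustrative assumptions.

            import numpy as np

            def accumulate_while_go(evidence_stream, go_stream, threshold=1.0):
                """Sketch: evidence is integrated in the cortico-thalamic loop only while
                the BG signal is Go; a decision is committed when a threshold is crossed."""
                acc = 0.0
                for evidence, go in zip(evidence_stream, go_stream):
                    if go:                          # accumulation only during Go
                        acc += evidence
                    if abs(acc) >= threshold:       # enough evidence: commit to a choice
                        return np.sign(acc), acc
                return 0.0, acc                     # no commitment within the episode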

        Learning in the executive cortex

        The executive cortices (prefrontal cortex, premotor cortex, and FEF) should also learn and make predictions.  It is conceivable that prediction is used for reinforcement learning or that the results of reinforcement learning are used for prediction.  However, as the need for such learning in the current task is not clear, the implementation will be optional.

        Summary: the architecture

        The non-optional architecture of the current plan is shown in Fig. 2.

        Fig. 2 
        SC: superior colliculus, BG: basal ganglia, preMotor: premotor cortex

        It is simpler than the sample architecture for the 5th WBA Hackathon (i.e., the LIP: lateral intraparietal area, the FEF: frontal eye field, and the motor cortex are omitted).  Meanwhile, Spatio-Temporal Filters have been added.  In the neural system, this filtering is supposed to be realized in the pathway between the retina and the primary visual cortex, including the LGN.


        Sunday, April 25, 2021

        Minimal Cognitive Architecture

        I experimented with a cognitive architecture with a minimal setup (Fig. 1).

        Code on GitHub

        Fig.1

        The objectives of the experimentation are as follows:
        • Usability checking of BriCA, a brain-inspired computational framework, which performs computation by passing numerical vectors among modules.
        • Curriculum learning
          With the hypothesis that perceptual components in animals or generally intelligent agents are trained independently with tasks, the perceptual component was trained first, then used later.
        • Separating an internal learning environment from the exterior environment
          An internal environment was set up with the hypothesis that the brain is a society of agents, each of which has its own environment.
        With these objectives met, the architecture will serve as a design pattern for more complex architectures.

        The following software frameworks were used:

        Task and reinforcement learning

        OpenAI Gym CartPole is an introductory task for reinforcement learning.
        Tensorforce provides a sample actor-critic (PPO) code to solve the task.
        Actor-critic was chosen as the brain (basal ganglia) is supposed to use an actor-critic mechanism.

        Training the perceptual module

        The observation data from CartPole is just a four-dimensional numeric array, so dubbing the perceptual module 'VisualComponent' is a bit of an exaggeration; a simple auto-encoder without a CNN was used.
        The observation data for training the perceptual module was collected while the observation was fed directly to the motor component learning with PPO.  Though the data could be collected while the motor component chooses actions randomly, this would limit the range of the data, as the pole falls after only a few actions in each episode.
        The auto-encoder model is trained with an independent utility and saved to a file to be used in the perceptual module later.  In this 'curriculum learning', I followed the set-up of the working memory hackathon held by WBAI and Cerenaut.
        The output dimension of the auto-encoder was the same as that of the input, so no information compression was performed, as the input dimension (four) is already low enough (or too low).
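
        The sketch below shows such an auto-encoder over the four-dimensional observation, with the code dimension equal to the input dimension (the actual code is on GitHub); the optimizer, learning rate, and hidden activation are illustrative assumptions.

            import torch
            import torch.nn as nn

            # Simple auto-encoder for the 4-dimensional CartPole observation; the code
            # dimension equals the input dimension, so no compression takes place.
            autoencoder = nn.Sequential(
                nn.Linear(4, 4), nn.Tanh(),   # encoder
                nn.Linear(4, 4))              # decoder: reconstruct the observation
            optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
            criterion = nn.MSELoss()

            def train_step(observations):
                """observations: tensor of shape (batch, 4) collected while the motor acts."""
                optimizer.zero_grad()
                loss = criterion(autoencoder(observations), observations)
                loss.backward()
                optimizer.step()
                return loss.item()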

        Using BriCA

        BriCA is a brain-inspired computational framework.  While it can be used to model signal passing among brain regions, it constrains the computation: information flow is limited to the 'axonal' direction, and it has its own time steps.
        For one thing, gradient back-propagation cannot cross module boundaries, which is one motivation for training the auto-encoder-based perceptual module in an unsupervised manner.
        An issue with BriCA's own time steps is the discrepancy with those of the (Gym) environment.  It takes more than one BriCA step to complete an observe-action loop, which can negatively affect reinforcement learning (especially when the learner is not tuned to such a setup).  Thus, in this experimentation, I added a token mechanism for the system to synchronize with the observe-action loop.
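
        The token idea can be sketched in a framework-agnostic way as below; the class name TokenGate and its interface are illustrative and do not refer to BriCA's API.

            class TokenGate:
                """Sketch: a component acts on an input only when the attached token is new,
                so that one observe-action loop of the environment maps onto several BriCA
                steps without duplicated learning updates."""
                def __init__(self):
                    self.last_token = None

                def accept(self, token):
                    if token == self.last_token:
                        return False              # same environment step seen again: skip
                    self.last_token = token       # the environment has advanced one step
                    return True

            # Usage sketch: the environment side increments a token each observe-action cycle
            # and attaches it to the observation; each component calls accept() before firing.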

        Internal Environment

        The brain (or basal ganglia) is supposed to have multiple reinforcement learning modules to control various external and internal actions.  That is, they require their own internal environments that accept module actions and produce their own observations.  In this experimentation, the motor component has its own internal environment.  Tensorforce was used as the framework, for it enables a flexible learning setup.

        Result

        The agent with the learned perceptual module could learn the task, though the scores were not as good as those of the system with the observation fed directly to the motor component.  This is presumably due to information loss in the learned perceptual module; the observation is low-dimensional, so the auto-encoder gains no benefit from information compression.

        Future Direction

        The architecture created in the experimentation will serve as a design pattern for coming cognitive modelings.  The mind map below (Fig. 2) shows possible extensions.

        Fig.2