Saturday, December 4, 2021

A Minimal Cortex-Basal Ganglia Architecture

I have implemented a minimal decision-learning mechanism inspired by the cortex-basal ganglia-thalamus loop, as part of my effort to create a brain-inspired cognitive architecture.

It is based on the following hypothesis:

  • The cortex predicts the decision.
  • The basal ganglia (BG) determines the execution of the decision.

In this implementation, the cortical part initially makes random decisions, and the BG part judges whether a decision is appropriate for execution.  In terms of learning, while the BG part learns a Go/NoGo policy, the cortical part learns (to predict) actions from the input and the decisions actually executed by the BG part's Go.

Reasons for the hypothesis

The hypothesis is corroborated by the following:

  • As for the cerebral cortex making action choices (predictions):
    • The thalamic matrix, which undergoes inhibition from the BG, may not be "fine-grained" enough for choices made by cortical units such as mini-columns.
    • The GPi of the BG, which (de-)inhibits the thalamus, may likewise not be "fine-grained" enough for choices made by cortical regions.
  • The BG has been said to control the timing of action initiation.
  • The hypothesis reconciles the role of reinforcement learning in the BG and prediction in the cerebral cortex in action selection.
  • Reinforcement learning in the basal ganglia is necessary to cope with delayed reward.


The Cortical part

The cortical part receives "observation input" and outputs action selection.

Output predictor: learns to predict the action from the observed inputs, using the executed action as the supervisory signal.  A two-layer perceptron was used for the implementation.

Output moderator: computes its output from the predictor's output and a noise input.  As the predictor learns to make correct predictions, the contribution of the noise decreases.  Specifically, the rate of correct task answers was used as the probability of using the prediction as output.

Output selector: selects the largest output (winner-take-all) from the moderator and gates its output with the Go/NoGo signal from the BG.
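The three cortical components above can be sketched as follows.  This is a minimal sketch, not the original code: the sizes (`OBS_DIM`, `HIDDEN`, `N_ACTIONS`), the class and function names, and the exact fallback-to-noise scheme are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

OBS_DIM, HIDDEN, N_ACTIONS = 3, 8, 2  # sizes are illustrative assumptions

class Predictor(nn.Module):
    """Output predictor: a two-layer perceptron from observation to action scores."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, HIDDEN), nn.Sigmoid(),
            nn.Linear(HIDDEN, N_ACTIONS), nn.Sigmoid())

    def forward(self, obs):
        return self.net(obs)

def moderate(prediction, success_rate, rng):
    """Output moderator: use the prediction (as a NumPy array) with probability
    equal to the rate of correct task answers; otherwise output noise."""
    if rng.random() < success_rate:
        return prediction
    return rng.random(N_ACTIONS)  # noise candidate

def select(candidate, go):
    """Output selector: winner-take-all over the moderated candidate,
    gated by the BG's Go/NoGo signal (NoGo -> null action)."""
    action = np.zeros(N_ACTIONS)
    if go:
        action[int(np.argmax(candidate))] = 1.0
    return action
```

With `success_rate = 0`, the cortical part behaves as the random decision maker described above; as the predictor improves, the moderator increasingly passes the prediction through.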


The thalamus was not implemented; Go/NoGo signals from the BG were passed directly to the output selector of the cortical part.
In the brain, the thalamus receives inhibitory (NoGo) signals from the BG, and when that inhibition is lifted, a Go signal is sent to the cortex.

The BG part

Reinforcement learning for the Go/NoGo decision was performed with the cortical input and the cortex's candidate output as the state.

Overall architecture


The task for testing

Delayed Reward Task

Task Environment

An action is accepted while a cue is presented for a certain number of time steps.
The cue is one of the following: [1,1,0], [0,0,1], [0,1,1], [1,0,0].
The observation value outside the presentation period is [0,0,0].
If the relationship between the cue and the action satisfies one of the following conditions, reward 1 is given after a certain delay:
  • Cue [1,1,0] or [0,0,1] and action [1,0]
  • Cue [0,1,1] or [1,0,0] and action [0,1]


If an action other than [0,0] is taken within the presentation period without meeting the reward conditions above, a negative reward is given for the "wrong try."
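The task above can be sketched as a Gym-style environment.  This is a sketch under assumptions: the classic 4-tuple `step` API, the period lengths (`PRESENT`, `DELAY`), the penalty value, and the class name are illustrative, not the original implementation.

```python
import numpy as np

CUES = [[1, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 0]]
# index of the rewarded action for each cue: 0 -> [1,0], 1 -> [0,1]
ACTION_FOR_CUE = {(1, 1, 0): 0, (0, 0, 1): 0, (0, 1, 1): 1, (1, 0, 0): 1}
PRESENT, DELAY, PENALTY = 3, 2, -0.7  # assumed step counts and penalty

class DelayedRewardTask:
    """Gym-style sketch of the delayed reward task."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.cue = CUES[self.rng.integers(len(CUES))]
        self.t = 0
        self.pending = 0.0  # reward waiting to be delivered after the delay
        return np.array(self.cue)

    def step(self, action):  # action: [1,0], [0,1], or [0,0] (no action)
        reward = 0.0
        if self.t < PRESENT and any(action):
            if int(np.argmax(action)) == ACTION_FOR_CUE[tuple(self.cue)]:
                self.pending = 1.0  # correct: reward 1 after the delay
            else:
                reward = PENALTY    # "wrong try" penalty
        self.t += 1
        if self.t == PRESENT + DELAY and self.pending:
            reward = self.pending   # deliver the delayed reward
            self.pending = 0.0
        obs = np.array(self.cue) if self.t < PRESENT else np.zeros(3)
        done = self.t >= PRESENT + DELAY
        return obs, reward, done, {}
```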

Implementation and Evaluation

Frameworks used

Cognitive Architecture Description Framework: BriCA

BriCA (Brain-inspired Computing Architecture) was used to construct the architecture.  BriCA is a computational platform for brain-based software development (see [1] for its significance).

Environment description framework: OpenAI Gym

OpenAI Gym is a widely used framework for agent learning environments.

Machine learning framework: PyTorch

PyTorch is widely used as a machine learning framework.

Reinforcement Learning Framework: TensorForce

TensorForce is a reinforcement learning framework also widely used.

For learning in the basal ganglia part, a frequency-based method was also used.

The ideas in "Minimal Cognitive Architecture" (in this blog) were used to combine BriCA, Gym, PyTorch, and TensorForce.

Training the BG part

Reinforcement learner

The BG part has an internal environment whose observations are the observation input and the output selector state, and whose reward comes from the external environment; it decides whether to let the agent take an action (Go/NoGo).
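The arrangement can be sketched as follows.  The class name, the `set_state` interface, and the `gate` helper are my own illustrative assumptions; the actual wiring through BriCA ports differs.

```python
import numpy as np

class BGInternalEnv:
    """Internal environment for the BG part's reinforcement learner.

    Its observation is the external observation concatenated with the
    output selector's candidate action, and its reward is relayed from
    the external environment."""
    def __init__(self):
        self.obs = np.zeros(0)
        self.candidate = np.zeros(0)
        self.last_reward = 0.0

    def set_state(self, external_obs, candidate, reward):
        # called by the architecture on each cycle
        self.obs = np.asarray(external_obs, dtype=float)
        self.candidate = np.asarray(candidate, dtype=float)
        self.last_reward = float(reward)

    def observation(self):
        # the state presented to the Go/NoGo learner
        return np.concatenate([self.obs, self.candidate])

def gate(go, action):
    """Apply the learner's Go/NoGo decision to the selected action."""
    return np.asarray(action, dtype=float) if go else np.zeros(len(action))
```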

Debugging to synchronize BriCA with the external environment, internal environment, and reinforcement learner was an onerous task.

Initially, the internal environment used the time steps of the external environment as its own.  However, learning turned out not to be stable, because the learner could be rewarded for producing different outputs before the agent settled on an action.  So the architecture was changed so that the agent has only one action choice per episode and the internal environment has only one time step for the presentation period (the biological implications of this remain to be examined).

Fig. 2 shows the results of the training.  As RL algorithms, VPG (Vanilla Policy Gradient), A2C (Advantage Actor-Critic), and PPO (Proximal Policy Optimization) were used, as they are available in TensorForce.

Horizontal axis: number of episodes (×100)
Vertical axis: average reward for Go
Penalty = -0.7
For VPG and A2C, the average reward approached 1 after 4,000 trials.
With PPO, there was no apparent learning (the reason is unknown).

When no penalty was given, the RL algorithms always chose Go, so learning was not successful.

Supplementary Result (October 2023)

In October 2023, an additional experiment with a Deep Q-Network (TensorForce) turned out to work fine.

Legends are the same as in Fig. 2; the green line shows the result for DQN.

Frequency-based learner

Mechanism: it estimates the probability of receiving a positive reward for the Go decision for each observed value, and performs a coin toss (Bernoulli trial) with that probability to determine Go/NoGo.  After n coin tosses, the estimated probability of heads follows a beta distribution with parameters a = k+1 and b = n−k+1, whose mean is a/(a+b) = (k+1)/((k+1) + (n−k+1)) = (k+1)/(n+2), where k is the number of heads and n−k is the number of tails.  The frequency-based learner has the advantage of probabilistic interpretability.
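Such a learner can be sketched as follows (a minimal sketch: the per-state dictionary of Beta counts and the class name are illustrative assumptions).

```python
import numpy as np

class FrequencyLearner:
    """Frequency-based Go/NoGo learner.

    For each observed state it keeps Beta(a, b) counts, starting from the
    uniform prior Beta(1, 1): a-1 = rewarded Go trials (heads), b-1 =
    unrewarded Go trials (tails).  It decides Go by a coin toss with the
    posterior mean a/(a+b) = (k+1)/(n+2)."""
    def __init__(self, seed=0):
        self.counts = {}  # state -> (a, b)
        self.rng = np.random.default_rng(seed)

    def decide(self, state):
        a, b = self.counts.get(state, (1, 1))
        return self.rng.random() < a / (a + b)  # Go with probability a/(a+b)

    def update(self, state, rewarded):
        # called after a Go trial, once the (delayed) reward is known
        a, b = self.counts.get(state, (1, 1))
        if rewarded:
            a += 1  # another head: Go followed by reward
        else:
            b += 1  # another tail: Go with no reward
        self.counts[state] = (a, b)
```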

Results: the results converged to a/(a+b) ≈ (number of co-occurrences of Go and reward) ÷ (total number of Go decisions).  As in the case of reinforcement learning, this value does not reach 1 even after learning progresses, leaving room for trial and error (and thus for errors).  Fig. 3 shows a graph of the results.  The average reward for Go approaches 1 faster than with reinforcement learning.

Horizontal axis: number of episodes (×100)
Vertical axis: average reward for Go

Comparison with Reinforcement Learning

The results of learning the task with reinforcement learners alone are shown in Fig. 4.
As the number of actions per episode was set to one for the results in Fig. 3, this experiment was conducted with only one step for the presentation period and no reward delay.
As reinforcement learning algorithms, VPG, A2C, and PPO in TensorForce were used.

Test program:

Horizontal axis: number of episodes (×100)
Vertical axis: average reward
Penalty = -0.7

In Fig. 4, VPG performed best.  However, learning was generally not stable (e.g., PPO also worked when the presentation period was set to 3 steps and the reward delay to 2 steps).

Training the cortical part

A two-layer perceptron (implemented with PyTorch) was trained with the observations and the executed action choices.  The task performance rate was used as the probability of using the cortical output prediction for the Go/NoGo judgment in the BG part (if the prediction is not used, the output candidates are determined randomly).

Since the learning rate of the perceptron was low compared to that of the reinforcement learner, each input was presented a hundred times in epoch loops, which resulted in proper learning (Fig. 5).

As the loss function, binary cross-entropy (considered suitable for binary classification tasks) was used.  As the optimization algorithm, stochastic gradient descent (SGD) was used, with its learning rate set to 0.5 (through trial and error).
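The training loop can be sketched as follows.  The layer sizes, the epoch count, and the use of the four cues as a full batch are illustrative assumptions; only the two-layer perceptron, BCE loss, and SGD with lr=0.5 come from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# two-layer perceptron: observation (3) -> action scores (2); sizes assumed
model = nn.Sequential(
    nn.Linear(3, 8), nn.Sigmoid(),
    nn.Linear(8, 2), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.BCELoss()

# cues and the executed (rewarded) actions as supervisory signals
x = torch.tensor([[1, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 0]],
                 dtype=torch.float32)
y = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float32)

initial_loss = loss_fn(model(x), y).item()

for epoch in range(5000):  # repeated presentation, as in the epoch loops above
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```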

Fig. 5 plots the change in the loss value while the frequency-based learner trains the BG part.  The loss value decreases in a manner that follows the BG learning.

Horizontal axis: number of episodes (×100)
Vertical axis: average reward for Go, or loss value
Penalty = -0.7
Blue line: average reward with frequency-based learning
Red line: loss with frequency-based learning

Biological plausibility

The interior mechanisms of the basal ganglia and the thalamus were not implemented, so they cannot be evaluated.

While the cortical implementation consists of three parts (output predictor, output moderator, and output selector), there is no direct evidence for their biological plausibility.  Nonetheless, as the cerebral cortex codes output, input-output mapping must take place somewhere inside it, and output is related to at least Layer 5.  There are also findings that output is gated by Go/NoGo signals via the basal ganglia and thalamus.  The implementation of the output moderator used random selection between noise and prediction; it could be homework for neuroscience to find a functionally similar mechanism in the brain.

If action choices are coded in (mini-)columns, inter-column lateral inhibition (for winner-take-all) should be found, though I have not found it in the literature.

The task performance rate was used as the probability of using the prediction in the implementation of the cortical part.  This assumption is not unreasonable, as dopaminergic signals are provided to the biological cortex.

As for the connection between the cortex and the basal ganglia, the implementation passes both the input to the cortical part and the cortical output candidate to the BG part.  The former corresponds to the connection from the cortex to the striatal patch in Figure 1.3 of [2], and the latter to the connection from the cortex to the striatal matrix in the same figure.


A question to be asked is whether the mechanism works in more complex tasks.  It will be tested in tasks requiring working memory or with more practical perceptual input, for which an architecture that includes models of particular cortical areas and other brain regions will be required.

It is said that evidence accumulation takes place in action decisions in the basal ganglia [3][4].  When dealing with uncertain input or multiple sources of evidence, such an accumulation mechanism should be introduced.

In addition, the drift-diffusion model discussed in the literature [5][6], which models the trade-off between taking action and understanding the situation (making a choice at the expense of accuracy under time pressure), will also need to be implemented.
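For reference, the basic drift-diffusion process can be sketched as follows: evidence accumulates with drift v plus Gaussian noise until it crosses +a (choice 1) or −a (choice 2).  The parameter values are illustrative, not fitted to anything.

```python
import numpy as np

def drift_diffusion(v=0.3, a=1.0, dt=0.01, sigma=1.0, seed=0, max_steps=100000):
    """Simulate one drift-diffusion decision.

    Returns (choice, decision_time): choice 1 for the upper boundary +a,
    choice 2 for the lower boundary -a, and 0 if no boundary is reached."""
    rng = np.random.default_rng(seed)
    x, t = 0.0, 0.0
    for _ in range(max_steps):
        # Euler step of dx = v*dt + sigma*dW
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x >= a:
            return 1, t
        if x <= -a:
            return 2, t
    return 0, t
```

Lowering the boundary a makes decisions faster but less accurate, which is exactly the speed-accuracy trade-off under time pressure mentioned above.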


[1] Takahashi, K. et al.: A Generic Software Platform for Brain-Inspired Cognitive Computing, Procedia Computer Science 71:31-37, doi: 10.1016/j.procs.2015.12.185 (2015)

[2] Yoshizawa, T.: A Physiological Study of the Striatum Based on Reinforcement Learning Models of the Basal Ganglia, Doctoral Dissertation, NAIST-IS-DD1461011 (2018)

[3] Agarwal, A. et al.: Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning. arXiv:1809.09147 [cs.LG] (2018)

[4] Bogacz, R. et al.: The Basal Ganglia and Cortex Implement Optimal Decision Making Between Alternative Actions, Neural Computation 19(2):442-477, doi: 10.1162/neco.2007.19.2.442 (2007)

[5] Dunovan, K. et al.: Believer-Skeptic Meets Actor-Critic: Rethinking the Role of Basal Ganglia Pathways during Decision-Making and Reinforcement Learning (2016)

[6] Ratcliff, R. and McKoon, G.: The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks, Neural Computation, doi: 10.1162/neco.2008.12-06-420 (2008)
