I have implemented a minimal decision learning mechanism inspired by the cortex-basal ganglia-thalamus loop as part of my effort to create a brain-inspired cognitive architecture.
It is based on the following hypothesis:
- The cortex predicts the decision.
- The basal ganglia (BG) determines the execution of the decision.
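The division of labor can be sketched as follows (a minimal illustration of the hypothesis, not the implementation described below; names are placeholders):

```python
import numpy as np

def cortical_prediction(observation, weights):
    """Cortex: score each action candidate from the observation (a learned mapping)."""
    return observation @ weights

def decide(observation, weights, bg_go):
    """The cortex proposes a choice; the BG decides whether it is executed."""
    candidate = int(np.argmax(cortical_prediction(observation, weights)))  # cortical choice
    if bg_go(observation, candidate):  # Go/NoGo decision learned by the BG
        return candidate               # Go: execute the predicted action
    return None                        # NoGo: withhold the action
```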
Reasons for the hypothesis
The hypothesis is supported by the following considerations:
- As for the cerebral cortex making action choices (predictions):
- Thalamic matrices that undergo inhibition from the BG may not be "fine-grained" enough for choices made by cortical regions such as mini-columns.
- The GPi of the BG, which (de-)inhibits the thalamus, may also not be "fine-grained" enough for choices made by cortical regions.
- The BG has been said to control the timing of action initiation.
- The hypothesis reconciles the roles of reinforcement learning in the BG and of prediction in the cerebral cortex in action selection.
- Reinforcement learning in the basal ganglia is necessary to cope with delayed reward.
Specifications
The Cortical part
Output predictor: learns to predict the action from the observed inputs, with the executed action as the supervisory signal.
A two-layer perceptron was used for the implementation.
Output moderator: computes its output from the predictor's output and a noise input. As the predictor learns to make correct predictions, the contribution of the noise decreases; specifically, the rate of correct task answers was used as the probability of using the output prediction.
Output selector: selects the largest output (winner-take-all) from the moderator and gates its output with the Go/NoGo signal from the BG.
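A minimal sketch of these three components (assuming one-hot action candidates and a PyTorch two-layer perceptron for the predictor; names and dimensions are illustrative, not the actual code):

```python
import numpy as np
import torch
import torch.nn as nn

class OutputPredictor(nn.Module):
    """Two-layer perceptron: observation -> predicted action scores."""
    def __init__(self, n_obs, n_hidden, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_actions), nn.Sigmoid())

    def forward(self, obs):
        return self.net(obs)

def output_moderator(prediction, n_actions, correct_rate, rng):
    """Use the prediction with probability equal to the rate of correct answers;
    otherwise output a random (noise) candidate."""
    if rng.random() < correct_rate:
        return prediction.detach().numpy()
    return rng.random(n_actions)

def output_selector(moderated, go):
    """Winner-take-all over the moderated output, gated by the BG Go/NoGo signal."""
    return int(np.argmax(moderated)) if go else None
```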
Thalamus
The BG part
Overall architecture
The task for testing
Delayed Reward Task
Task Environment (CBT1Env.py)
Penalty
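The actual task environment (CBT1Env.py) is not reproduced here; the following is only a rough sketch of a delayed reward task of this kind, assuming a Gym-style interface, a cue shown during a presentation period, one choice per episode, a delayed reward for the correct choice, and a penalty otherwise (all parameter values are placeholders):

```python
import numpy as np
import gym
from gym import spaces

class DelayedRewardEnv(gym.Env):
    """Illustrative delayed reward task (not the actual CBT1Env.py)."""

    def __init__(self, n_cues=2, presentation=3, delay=2, penalty=-0.5):
        super().__init__()
        self.n_cues, self.presentation, self.delay, self.penalty = n_cues, presentation, delay, penalty
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n_cues,), dtype=np.float32)
        self.action_space = spaces.Discrete(n_cues + 1)  # one action per cue + "no action"
        self.rng = np.random.default_rng()

    def _obs(self):
        obs = np.zeros(self.n_cues, dtype=np.float32)
        if self.t < self.presentation:
            obs[self.cue] = 1.0  # the cue is visible only during the presentation period
        return obs

    def reset(self):
        self.cue = int(self.rng.integers(self.n_cues))
        self.t = 0
        self.chosen = None
        return self._obs()

    def step(self, action):
        if self.chosen is None and action != self.n_cues:
            self.chosen = action  # the first (and only) choice of the episode
        self.t += 1
        done = self.t >= self.presentation + self.delay  # the reward is delayed to the end
        reward = 0.0
        if done and self.chosen is not None:
            reward = 1.0 if self.chosen == self.cue else self.penalty  # penalty for a wrong choice
        return self._obs(), reward, done, {}
```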
Implementation and Evaluation
Frameworks used
BriCA (Brain-inspired Computing Architecture) was used to construct the architecture. BriCA is a computational platform for brain-based software development (see [1] for its significance).
OpenAI Gym is a widely used framework for agent learning environments.
PyTorch is widely used as a machine learning framework.
TensorForce is a reinforcement learning framework also widely used.
For training the BG part, a frequency-based method was also used.
The ideas in "Minimal Cognitive Architecture" (in this blog) were used to combine BriCA, Gym, PyTorch, and TensorForce.
Training the BG part
Reinforcement learner
The BG part has an internal environment, whose observations consist of the external observation input and the output selector states, and whose reward comes from the external environment; it decides whether to let the agent take an action (Go/NoGo).
Debugging the synchronization of BriCA with the external environment, the internal environment, and the reinforcement learner was an onerous task.
Initially, the internal environment used the time steps of the external environment as its own time steps. However, learning turned out to be unstable, because the learner was rewarded for producing different outputs before the agent decided on an action. So the architecture was changed so that the agent has only one action choice per episode and the internal environment has only one time step for the presentation period (the biological implications of this remain to be examined).
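A minimal sketch of such an internal environment (assuming a Gym-style interface with a single Go/NoGo decision per episode; the set_state helper and all names are illustrative, not the actual code):

```python
import numpy as np
import gym
from gym import spaces

class BGInternalEnv(gym.Env):
    """Illustrative internal environment for the BG part: the observation is the
    external observation concatenated with the output selector state, the action
    is Go/NoGo, and the reward is relayed from the external environment."""

    def __init__(self, n_obs, n_actions):
        super().__init__()
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n_obs + n_actions,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = NoGo, 1 = Go
        self._obs = np.zeros(n_obs + n_actions, dtype=np.float32)
        self._reward = 0.0

    def set_state(self, external_obs, selector_state, external_reward):
        """Called by the surrounding architecture to relay the external observation,
        the output selector state, and the external reward."""
        self._obs = np.concatenate([external_obs, selector_state]).astype(np.float32)
        self._reward = float(external_reward)

    def reset(self):
        return self._obs

    def step(self, action):
        # One Go/NoGo decision per episode; the episode ends immediately.
        return self._obs, self._reward, True, {}
```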
Fig. 2 shows the results of the training. As RL algorithms, VPG (Vanilla Policy Gradient), A2C (Advantage Actor-Critic), and PPO (Proximal Policy Optimization) were used, as they are available in TensorForce.
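For reference, creating such agents with TensorForce looks roughly like the following (a sketch assuming TensorForce's Agent.create/Environment.create interface and the BGInternalEnv sketch above; dimensions and hyperparameters are placeholders):

```python
from tensorforce import Agent, Environment

# Wrap the single-step internal environment sketched above (illustrative dimensions).
environment = Environment.create(
    environment=BGInternalEnv(n_obs=2, n_actions=3), max_episode_timesteps=1)

# The three algorithms compared, under their TensorForce agent names.
agents = {name: Agent.create(agent=name, environment=environment, batch_size=10)
          for name in ('vpg', 'a2c', 'ppo')}
```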
Supplementary Result at 2023-10
In October 2023, an additional experiment with a Deep Q-Network (DQN, TensorForce) was carried out and also worked well. The green line shows the result for DQN.
Frequency-based learner
Mechanism: for each observed value, it estimates the probability of receiving a positive reward for the Go decision, and performs a coin toss (a Bernoulli trial) with that probability to determine Go/NoGo. After n coin tosses, the posterior distribution of the head probability is a beta distribution (with parameters a and b), whose mean is a/(a+b) = (k+1)/((k+1)+(n-k+1)) = (k+1)/(n+2), where k is the number of heads and n-k is the number of tails. The frequency-based learner has the advantage of probabilistic interpretability.
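A minimal sketch of this learner (one counter pair per observed value, i.e., a uniform Beta(1, 1) prior; names are illustrative):

```python
import numpy as np

class FrequencyBasedLearner:
    """Go/NoGo by a coin toss with the estimated probability that Go is rewarded.
    For each observed value, the posterior is Beta(a, b) with a = k + 1 rewarded
    and b = (n - k) + 1 unrewarded Go decisions, whose mean is (k + 1) / (n + 2)."""

    def __init__(self, n_obs_values, rng=None):
        self.a = np.ones(n_obs_values)  # rewarded Go count + 1
        self.b = np.ones(n_obs_values)  # unrewarded Go count + 1
        self.rng = rng or np.random.default_rng()

    def decide(self, obs_value):
        p_go = self.a[obs_value] / (self.a[obs_value] + self.b[obs_value])
        return self.rng.random() < p_go  # True = Go, False = NoGo

    def update(self, obs_value, went, rewarded):
        if went:  # only Go decisions reveal whether a reward follows
            if rewarded:
                self.a[obs_value] += 1
            else:
                self.b[obs_value] += 1
```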
Results: The results converged to a/(a+b) ≈ (number of co-occurrences of Go and reward) ÷ (number of co-occurrences of Go and reward + number of occurrences of Go without reward). As in the case of reinforcement learning, this value does not reach 1 even after learning progresses, leaving room for trial and error (and thus for the resulting errors). Fig. 3 shows a graph of the results. The average reward for Go approaches 1 faster than in reinforcement learning.
Comparison with Reinforcement Learning
The results of learning the task with reinforcement learners are shown in Fig. 4.
Since the number of actions per episode was set to one for the results in Fig. 3, this experiment was also conducted with only one step for the presentation period and no reward delay.
As reinforcement learning algorithms, VPG, A2C, and PPO in TensorForce were used.
Test program: CBT1EnvRLTest.py
In Fig. 4, VPG performed best. However, learning was generally not stable and depended on the settings (e.g., PPO also worked when the presentation period was set to 3 steps and the reward delay to 2 steps).
Training the cortical part
A two-layer perceptron (implemented with PyTorch) was trained with the observations and the executed action choices. The task performance rate was used as the probability of using the cortical output prediction as the output candidate submitted to the BG part for the Go/NoGo judgment (if the prediction is not used, the output candidate is determined randomly).
Since the learning of the perceptron was slow compared to that of the reinforcement learner, each input was trained over a hundred epoch loops, which resulted in proper learning (Fig. 5).
As the loss function, Binary Cross Entropy (which is considered suitable for binary classification tasks) was used. As the optimization algorithm, Stochastic Gradient Descent (SGD) was used, with its learning rate set to 0.5 (through trial and error).
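A minimal sketch of this training step (BCE loss and SGD with a learning rate of 0.5 as above; the network dimensions, the sigmoid output layer, and the exact epoch loop are illustrative assumptions):

```python
import torch
import torch.nn as nn

n_obs, n_hidden, n_actions = 2, 8, 3  # illustrative dimensions

# Two-layer perceptron: observation -> action probabilities (sigmoid output for BCE).
model = nn.Sequential(
    nn.Linear(n_obs, n_hidden), nn.Sigmoid(),
    nn.Linear(n_hidden, n_actions), nn.Sigmoid())
criterion = nn.BCELoss()                                 # binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)  # learning rate 0.5

def train_on_sample(observation, executed_action, epochs=100):
    """Train on one (observation, executed action) pair, repeated over
    a hundred epoch loops to compensate for the perceptron's slower learning."""
    x = torch.tensor(observation, dtype=torch.float32)
    target = torch.zeros(n_actions)
    target[executed_action] = 1.0  # the executed action is the supervisory signal
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()
    return loss.item()
```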
Fig. 5 plots the change in the loss value when the frequency-based learner was used for training the BG part. The loss value decreases as the BG learning proceeds.
Biological plausibility
While the cortical implementation consists of three parts (output predictor, output moderator, and output selector), there is no evidence for their biological plausibility. Nonetheless, as the cerebral cortex codes output, input-output mapping must take place somewhere inside it, and output is related at least to Layer 5. There are also findings that output is gated by Go/NoGo signals via the basal ganglia and thalamus. In the implementation of the output moderator, a random selection between noise and prediction was used; finding a functionally similar mechanism in the brain is left as homework for neuroscience.
If action choices are coded in (mini-)columns, inter-column lateral inhibition (for winner-take-all) would be expected, though I have not found it in the literature.
The task performance rate was used as the probability of using the prediction in the implementation of the cortical part. This assumption would not be unreasonable, as dopaminergic signals are provided to the biological cortex.
Issues
A question to be asked is whether the mechanism works in more complex tasks. It will be tested in tasks requiring working memory or with more practical perceptual input, for which an architecture that includes models of particular cortical areas and other brain regions will be required.
In addition, the drift-diffusion model discussed in the literature [5][6], which models the trade-off between taking action and understanding the situation (making a choice at the expense of accuracy under time pressure), will also need to be implemented.
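For reference, the drift-diffusion model accumulates noisy evidence until a decision threshold is reached; a lower threshold yields faster but less accurate choices. A minimal simulation sketch (parameter values are placeholders):

```python
import numpy as np

def drift_diffusion_trial(drift=0.2, noise=1.0, threshold=1.0, dt=0.01, rng=None):
    """Accumulate evidence x until |x| crosses the threshold; return the chosen
    alternative (1 if the upper bound is hit, else 0) and the decision time."""
    rng = rng or np.random.default_rng()
    x, t = 0.0, 0.0
    while abs(x) < threshold:
        x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return (1 if x > 0 else 0), t
```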
References
[1] Takahashi, K. et al.: A Generic Software Platform for Brain-Inspired Cognitive Computing, Procedia Computer Science 71:31-37, doi: 10.1016/j.procs.2015.12.185 (2015)
[2] Yoshizawa, T.: A Physiological Study of the Striatum Based on Reinforcement Learning Models of the Basal Ganglia, Doctoral Dissertation, NAIST-IS-DD1461011 (2018)
[3] Agarwal, A. et al.: Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning. arXiv:1809.09147 [cs.LG] (2018)
[4] Bogacz, R. et al.: The Basal Ganglia and Cortex Implement Optimal Decision Making Between Alternative Actions, Neural Computation 19(2):442-477, doi: 10.1162/neco.2007.19.2.442 (2007)
[5] Dunovan, K. et al.: Believer-Skeptic Meets Actor-Critic: Rethinking the Role of Basal Ganglia Pathways during Decision-Making and Reinforcement Learning, Frontiers in Neuroscience 10:106, doi: 10.3389/fnins.2016.00106 (2016)
[6] Ratcliff, R. and McKoon, G.: The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks, Neural Computation 20(4):873-922, doi: 10.1162/neco.2008.12-06-420 (2008)