Japanese version
Abstract: Since many animals, including humans, make behavioral decisions based on visual information, a cognitive model of the visuomotor system would serve as a basis for intelligence research, including AGI. This article reports on the implementation of a relatively simple system: a virtual environment that displays shapes and cursors, and an agent that performs gaze shifts and cursor control based on information from the environment. The visual system is modeled after that of humans, with central and peripheral fields of view, and the agent architecture is based on the structure of the brain.
1. Introduction
This article reports on the implementation of a simple environment and agent architecture for decision making based on visual information, which would serve as part of more generic cognitive models/architectures. It also addresses human ‘active vision,’ where visual information is collected and integrated through gaze shift.
This work adopts a strategy of starting with a relatively simple model. The implemented two-dimensional visual environment displays simple figures and cursors. Figures and a cursor can be moved (dragged) by instructions from the agent.
As for the agent, the following were modeled and implemented, imitating the human visual system.
1) distinction between central and peripheral vision,
2) gaze shift based on salience in the peripheral vision [1],
3) unsupervised learning of shapes captured in the central vision,
4) reinforcement learning of cursor movement and dragging,
5) “surprise” due to changes in the environment caused by actions and habituation due to learning,
6) reward based on “surprise”.
Here, 3), 4), and 5) involve learning and are provided with learning models. The agent's actions consist of gaze shifts and cursor movement plus dragging. Gaze shift is not learned in this model; it is driven by salience.
2. Environment
The environment has a screen divided into an N × N grid (Figure 1). The center of the screen is a "stage" consisting of an M × M grid (M < N), whose edges are marked with border lines. M different shapes are displayed on the stage. The visual information presented to the agent is a color bitmap of the field of view (an M × M grid) centered on the gaze. The gaze is located at the center of a grid cell on the stage and is shifted when the environment is given a gaze shift signal (a vector whose components range over [−M, M]); it does not move off the stage. Two cursors of different colors are displayed on the stage. When the environment is given a cursor movement signal (a vector whose components range over [−1, 1]), one of the cursors may move; it, too, does not move off the stage. If the cursor is superimposed on a figure and the environment is given a non-zero cursor movement signal together with a grab signal, the figure is moved in the same direction and distance as the cursor (i.e., dragged). Figure 1 shows an example display.
Figure 1: Environment
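The stage dynamics described above can be sketched in Python as follows. This is a minimal illustration, not the actual PyGame implementation; the class name `Env`, the parameter defaults, and the clipping details are assumptions:

```python
import numpy as np

class Env:
    """Minimal sketch of the stage dynamics: gaze shift, cursor move, drag."""
    def __init__(self, N=15, M=5, n_figures=3):
        self.N, self.M = N, M
        lo = (N - M) // 2                      # stage occupies the central M x M cells
        self.lo, self.hi = lo, lo + M
        rng = np.random.default_rng()
        self.gaze = np.array([N // 2, N // 2])
        self.cursor = rng.integers(lo, self.hi, size=2)
        self.figures = [rng.integers(lo, self.hi, size=2) for _ in range(n_figures)]

    def _clip(self, pos):
        # positions never leave the stage
        return np.clip(pos, self.lo, self.hi - 1)

    def shift_gaze(self, v):
        # gaze shift signal: components range over [-M, M]
        self.gaze = self._clip(self.gaze + np.clip(v, -self.M, self.M))

    def move_cursor(self, v, grab=False):
        # cursor movement signal: components range over [-1, 1]
        v = np.clip(v, -1, 1)
        on_figure = [i for i, f in enumerate(self.figures)
                     if np.array_equal(f, self.cursor)]
        self.cursor = self._clip(self.cursor + v)
        if grab and on_figure:
            # drag: the figure moves with the same direction and distance as the cursor
            self.figures[on_figure[0]] = self.cursor.copy()
```

A gaze shift request larger than ±M is clamped to ±M before being clipped to the stage, matching the signal ranges above.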
3. Agent
The agent receives as input a color bitmap of the field of view from the environment, and outputs gaze shift, cursor movement, and grab signals to the environment. The agent's architecture consists of the following modules (Fig. 2; the parentheses below give the module names in the figure): the Salience Calculation Module (Periphery2Saliency), Gaze Shift Module (PriorityMap2Gaze), Central Visual Field Change Prediction Module (FoveaDiffPredictor), Surprise-Reward Calculation Module (SurpriseReward), Object Recognition Module (ObjectRecognizer), and Cursor Control Module (CursorActor). See the figure for the connections between modules.
Figure 2: Architecture

The Cursor Control Module uses reinforcement learning rewarded by changes in the external world caused by its own actions (contingency detection) [2].
As for correspondence with the brain, the Salience Calculation Module corresponds to the superior colliculus, the Gaze Shift Module corresponds to the neural circuit from the superior colliculus to the eye, and the Object Recognition Module corresponds to the 'what pathway' of the visual cortex, which performs object identification. As the Central Visual Field Change Prediction Module and the Surprise-Reward Calculation Module use the output of the Object Recognition Module, they could correspond to a visual association area such as the frontal eye field [3]. The Cursor Control Module would correspond to the motor cortex.
3.1 Salience Calculation Module (Periphery2Saliency)
After reducing the resolution of the input bitmap, this module creates a monochrome brightness map corresponding to the peripheral visual field, and adds an edge detection map and a time differential map to it. Though the log-polar coordinate system is said to be used in human peripheral vision, ordinary Cartesian coordinates were used for engineering interpretability and compatibility with off-the-shelf tools such as regular CNNs.
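The processing in this module can be sketched as follows. This is a hedged sketch: the block-averaging downscale, the finite-difference edge map, and the default weights are assumptions, not the actual implementation:

```python
import numpy as np

def saliency_map(frame, prev_frame, w_int=1.0, w_edge=1.0, w_diff=1.0, k=4):
    """Sketch of Periphery2Saliency: downscale, then intensity + edges + time diff."""
    def downscale(img):
        # reduce resolution by k x k block averaging
        h, w = img.shape[0] // k * k, img.shape[1] // k * k
        return img[:h, :w].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

    def to_gray(img):
        # monochrome brightness map
        return img.mean(axis=-1) if img.ndim == 3 else img

    cur = downscale(to_gray(frame))
    prev = downscale(to_gray(prev_frame))
    gy, gx = np.gradient(cur)
    edges = np.hypot(gx, gy)          # crude edge map: gradient magnitude
    diff = np.abs(cur - prev)         # time differential map
    return w_int * cur + w_edge * edges + w_diff * diff
```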
3.2 Gaze Shift Module (PriorityMap2Gaze)
A gaze shift signal is calculated from the saliency (priority) map provided by the Salience Calculation Module so as to move the gaze to the part with maximum salience.
3.3 Object Recognition Module (ObjectRecognizer)
It feeds the bitmap of the central visual field to an unsupervised learner and outputs the learner's latent variables.
3.4 Central Visual Field Change Prediction Module (FoveaDiffPredictor)
'Central visual field change' refers to the scalar (summed) time difference of the Object Recognition Module's output. The module predicts it from the outputs of the Object Recognition Module and the Cursor Control Module at the previous time step. If a gaze shift occurred at the previous time step, no prediction is made and the output is set to zero (saccadic suppression). The prediction is learned, and the module's output is the prediction error.
3.5 Surprise-Reward Calculation Module (SurpriseReward)
It outputs the product of the scalar (summed) time difference of the Object Recognition Module's output and the prediction error (the output of the Central Visual Field Change Prediction Module). The output becomes zero if the prediction error is zero or if there is no time change in the output of the Object Recognition Module.
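The computation amounts to a single product. A sketch follows; the use of an absolute-value sum for the 'scalar (summed) time difference' is an assumption:

```python
import numpy as np

def surprise_reward(z_t, z_prev, prediction_error):
    """Sketch of SurpriseReward: (summed latent change) x prediction error.

    z_t, z_prev: Object Recognition Module latent vectors at t and t-1.
    """
    change = np.abs(z_t - z_prev).sum()   # scalar (summed) time difference
    return change * prediction_error      # zero if no change or no error
```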
3.6 Cursor Control Module (CursorActor)
It is a reinforcement learner that observes the output of the Object Recognition Module and outputs the cursor control (movement vector + grab) signal. The reward is the output of the Surprise-Reward Calculation Module.
4. Implementation and Test
The code is located here.
4.1 Environment
The environment was implemented with Python and PyGame. Card game symbols (pips) were used as figures. The initial positions of the figures and cursors are randomized for each episode (the initial position of the cursor controlled by the agent is set on a figure).
4.2 Agent
The agent was implemented with Python and BriCA (Brain-inspired Computing Architecture) [4], a computational platform for developing brain-inspired software. As BriCA supports modular architecture development, reusing the implementation in more complex architectures should be easier. With the BriCA platform, the architectural design is first specified in a spreadsheet and then converted into an architecture description language (BriCA language). At runtime, the interpreter loads and executes the BriCA language description. BriCA modules exchange numerical vector signals in a token-passing manner. PyTorch was used as the machine learning platform.
Salience Calculation Module (Periphery2Saliency)
It reduces the resolution of the input bitmap, calculates a monochrome brightness map corresponding to the peripheral visual field, and adds an edge detection map and a time differential map to the brightness map with preconfigured weights.
Gaze Shift Module (PriorityMap2Gaze)
It computes the 'priority map' by 1) adding random noise to the output of the Salience Calculation Module (the salience map) and 2) adding the priority map from the previous time step multiplied by a damping coefficient. The gaze shift signal is calculated so that the gaze moves to the part of the field of view corresponding to the maximum value in the priority map.
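These two steps can be sketched as follows (the noise scale and damping coefficient values are assumptions):

```python
import numpy as np

def priority_update(salience, prev_priority, damping=0.5, noise_scale=0.01,
                    rng=np.random.default_rng()):
    """Sketch of the priority map: noisy salience plus damped previous priority."""
    noise = rng.normal(0.0, noise_scale, salience.shape)
    return salience + noise + damping * prev_priority

def gaze_shift(priority, gaze_pos):
    """Shift vector moving the gaze to the cell with maximum priority."""
    target = np.unravel_index(np.argmax(priority), priority.shape)
    return np.array(target) - np.array(gaze_pos)
```

The random noise breaks ties between equally salient cells, and the damped previous map gives the gaze a mild tendency to stay on recently prioritized locations.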
Object Recognition Module (ObjectRecognizer)
A βVAE (from Princeton U.: code) was used, after several kinds of autoencoders had been compared as unsupervised learners. It was chosen with the expectation that the number of output dimensions would be relatively small and that it would provide interpretable (disentangled) latent variables.
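For reference, the βVAE objective is the standard VAE loss with the KL term weighted by β > 1, which pressures the latent code toward disentangled factors. A sketch assuming an MSE reconstruction term (the actual Princeton implementation may differ in these details):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Sketch of the beta-VAE objective: reconstruction + beta-weighted KL.

    mu, logvar: parameters of the approximate posterior q(z|x).
    """
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, exp(logvar)) and the prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```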
Central Visual Field Change Prediction Module (FoveaDiffPredictor)
It predicts scalar changes in the central visual field from the outputs of the Object Recognition Module and the Cursor Control Module at the previous time step, and outputs the prediction error. A three-layer perceptron was used as the predictor.
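A sketch of such a predictor in PyTorch (the dimensions, hidden size, and the helper `prediction_error` are assumptions; the actual network details are in the repository):

```python
import torch
import torch.nn as nn

class FoveaDiffPredictor(nn.Module):
    """Sketch of the three-layer perceptron: (latent z, action) -> scalar change."""
    def __init__(self, z_dim=10, action_dim=5, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, z_prev, action_prev):
        # predict the scalar central-visual-field change at the next step
        return self.net(torch.cat([z_prev, action_prev], dim=-1)).squeeze(-1)

def prediction_error(pred, actual, saccade):
    # saccadic suppression: output zero right after a gaze shift
    return torch.zeros_like(actual) if saccade else (actual - pred).abs()
```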
Surprise-Reward Calculation Module (SurpriseReward)
It outputs the product of the scalar (summed) time difference of the Object Recognition Module's output and the prediction error (the output of the Central Visual Field Change Prediction Module).
Cursor Control Module (CursorActor)
It uses a cerebral cortex/basal ganglia loop model [5] (code), based on the hypothesis that the cerebral cortex predicts actions through learning while the basal ganglia determines, through reinforcement learning, whether to perform each action. The implemented basal ganglia model learns, through reinforcement learning, whether an action may be performed (Go/NoGo) given the observation data and the type of action. Meanwhile, the cortical model initially selects the type of action at random and, as the learning of the basal ganglia model progresses, begins to predict and present the type of action to be performed from the observation data. The reinforcement learning algorithm used was DQN (Deep Q-Network).
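The Go/NoGo gating can be illustrated with a toy tabular stand-in (the actual model uses DQN and the cortical predictor from [5]; the state and action encodings here are assumptions):

```python
import numpy as np

class GoNoGoGate:
    """Toy sketch of the basal ganglia side: Q-learning over Go/NoGo
    for a proposed action type."""
    def __init__(self, n_states, n_action_types, lr=0.1, gamma=0.9, eps=0.1):
        self.q = np.zeros((n_states, n_action_types, 2))  # index 0 = NoGo, 1 = Go
        self.lr, self.gamma, self.eps = lr, gamma, eps
        self.rng = np.random.default_rng(0)

    def decide(self, state, action_type):
        # epsilon-greedy choice between gating the action off (NoGo) or on (Go)
        if self.rng.random() < self.eps:
            return int(self.rng.integers(2))
        return int(self.q[state, action_type].argmax())

    def update(self, state, action_type, go, reward, next_state):
        # standard Q-learning backup on the chosen Go/NoGo decision
        target = reward + self.gamma * self.q[next_state].max()
        self.q[state, action_type, go] += self.lr * (target - self.q[state, action_type, go])
```

Over trials the gate learns to pass (Go) the action types whose execution is rewarded, while the cortical side gradually learns to propose those same action types.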
4.3 Experiments (Tests)
Experiments (tests) and training were performed module by module, starting from the area closest to the visual input.
Salience Calculation Module and Gaze Shift Module
These modules do not depend on other modules and do not perform learning. They were qualitatively tested in a dedicated environment (Vision1Env.py), where circles of various colors, intensities, and sizes were presented in the field of view. Gaze shifts were observed, and parameters (e.g., the intensity, edge, and time-differential weights for the saliency map calculation) were adjusted by the developer.
Object Recognition Module
All combinations of images that could appear in the central visual field were fed to the βVAE (with 10 latent variables) for training (TrainFovea_VAE.py). While the original images were generally reconstructed after about 10,000 episodes, latent (disentangled) variables corresponding to the elements in the images were not found.
Central Visual Field Change Prediction Module
The three-layer perceptron was trained to predict changes in the central visual field from the outputs of the Object Recognition Module and the Cursor Control Module, except immediately after saccades. The loss became zero at around episode 150.
Surprise-Reward Calculation Module
The multiplication was performed correctly (no learning is performed in this module).
Cursor Control Module
It was trained to output the cursor control signal (movement vector + grab) from observations of the Object Recognition Module's output, rewarded by the output of the Surprise-Reward Calculation Module (the Central Visual Field Change Prediction Module had not been trained). The amount of reward acquired was three times that of random trials (average reward 0.12) (Fig. 3).
Figure 3: Cursor Control Module learning results. Horizontal axis: number of episodes; vertical axis: average reward (average of 5 trials).
5. Conclusion
This article reported on the implementation of an environment that displays shapes and cursors on a screen, and an agent that shifts its gaze and controls a cursor based on visual information.
Tasks that utilize gaze shift (active vision tasks) have been developed elsewhere. DeepMind has developed PsychLab, which includes tasks using gaze shift [6]*1. Image recognition learning tasks using gaze shift form part of what is called object-centric learning (👉 review). Working memory tasks such as oculomotor delayed response tasks*2 also use gaze shift. Papers [7] and [8] propose biologically plausible models of active vision.
In this article, learning was performed using 'surprise,' or prediction error, as the reward, which is a common practice in unsupervised learning. Learning about changes in the environment caused by one's own actions (contingencies) through prediction errors or 'surprise' appears as a theme in psychology [2]. There are various studies related to surprise, exploratory behavior, and curiosity [9][10][11] (chapter 3). Papers [12] and [13] provide neural models similar to the one in this article, though more specific ([12] does not model central/peripheral vision, as it is concerned with the rat).
If gaze shift were controlled with reinforcement learning, it would be necessary to explicitly model the frontal eye field as the corresponding brain region (the model would have a mechanism similar to the Cursor Control Module). A representation of the scene, consisting of the kinds of objects and their locations (presumably integrated around the hippocampus), would also be required in tasks using gaze shift.
A model of the areas around the hippocampus is important for the recognition of scene sequences, as the hippocampus is also said to be responsible for episodic memory. A model of the prefrontal cortex would be required for working memory tasks, as that region is said to be involved in working memory.
Finally, the environment was implemented with a view to modeling the visual understanding of other people's actions, and language acquisition presupposing such understanding. What additional structures those models will require remains to be studied.
*2: In this hackathon, a match-to-sample task requiring working memory and gaze shift was used.
References
- [1] Veale, R., et al.: How is visual salience computed in the brain? Insights from behaviour, neurobiology and modelling, Phil. Trans. R. Soc. B, 372(1714) (2017). https://doi.org/10.1098/rstb.2016.0113
- [2] Hiraki, K.: Detecting contingency: A key to understanding development of self and social cognition, Japanese Psychological Research, 48(3) (2006). https://doi.org/10.1111/j.1468-5884.2006.00319.x
- [3] Ferrera, V. and Barborica, A.: Internally Generated Error Signals in Monkey Frontal Eye Field during an Inferred Motion Task, Journal of Neuroscience, 30(35) (2010). https://doi.org/10.1523/JNEUROSCI.2977-10.2010
- [4] Takahashi, K., et al.: A Generic Software Platform for Brain-inspired Cognitive Computing, Procedia Computer Science, 71 (2015). https://doi.org/10.1016/j.procs.2015.12.185
- [5] Arakawa, N.: Implementation of a Model of the Cortex Basal Ganglia Loop, arXiv (2024). https://doi.org/10.48550/arXiv.2402.13275
- [6] Leibo, J., et al.: Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents, arXiv (2018). https://doi.org/10.48550/arXiv.1801.08116
- [7] Hoang, K., et al.: Active vision: on the relevance of a bio-inspired approach for object detection, Bioinspiration & Biomimetics, 15(2) (2020). https://doi.org/10.1088/1748-3190/ab504c
- [8] McBride, S., Huelse, M., and Lee, M.: Identifying the Computational Requirements of an Integrated Top-Down-Bottom-Up Model for Overt Visual Attention within an Active Vision System, PLoS ONE, 8(2) (2013). https://doi.org/10.1371/journal.pone.0054585
- [9] Oudeyer, P.-Y., Kaplan, F., and Hafner, V.: Intrinsic Motivation Systems for Autonomous Mental Development, IEEE Transactions on Evolutionary Computation, 11(2) (2007). https://doi.org/10.1109/TEVC.2006.890271
- [10] Schmidhuber, J.: Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010), IEEE Transactions on Autonomous Mental Development, 2(3) (2010). https://doi.org/10.1109/tamd.2010.2056368
- [11] Cangelosi, A., et al.: Developmental Robotics: From Babies to Robots, MIT Press (2015). https://doi.org/10.7551/mitpress/9320.001.0001
- [12] Fiore, V., et al.: Instrumental conditioning driven by neutral stimuli: A model tested with a simulated robotic rat, in Proceedings of the Eighth International Conference on Epigenetic Robotics (2008).
- [13] Santucci, V.G., et al.: Biological Cumulative Learning through Intrinsic Motivations: A Simulated Robotic Study on the Development of Visually-Guided Reaching, in Proceedings of the Tenth International Conference on Epigenetic Robotics (2010).