Tuesday, December 2, 2025

A Language Model Grounded to a Simple Visual Environment with Active Vision

[Japanese version]

Abstract: When modeling language acquisition to realize human-like AGI, it is important to set up an adequate cognitive model and its grounding in the environment. In this report, a relatively simple environment was created, in which one or two figures move around on a screen, with their movements described in text. A constructed agent learns a language model that describes the movement of the figures by observing input from the environment, including the live text commentary. The agent's vision, modeled after that of humans, fetches the features of the figures from the environment through gaze shifts. Word prediction is based on the statistical features of the previous word and on the features of the figures, including their movement and placement, computed within the agent.

  1. Introduction

This report presents a simple model of human language acquisition, which learns a "language model" by observing one or two figures moving around in a two-dimensional space and the live text commentary, and generates live text that describes the movement of the figures.

The report focuses on active vision as a human cognitive function. Humans can extract detailed information from only one location within their field of view at a time. This stems from the fact that only the central portion of the human visual field possesses high resolution (central vs. peripheral vision). To extract detailed information from multiple objects, gaze shifts are required, and the information obtained sequentially must be integrated (bound) for further processing. This mechanism is termed active vision. Research on active vision considering biological plausibility includes [1] and [2].

With regard to language functions, active vision is required when humans recognize or generate sentences referring to multiple visual objects. Research linking language acquisition in young children to active vision is found in [3].

The learning mechanisms to be reported were kept as simple as possible. One could ground the language model in the environment by preparing large datasets of images paired with descriptive text and training a model such as a transformer (known as a multimodal LLM [4]). While this method seems sound from an engineering point of view, the use of massive datasets and 'deep' backpropagation is considered biologically implausible from the perspective of achieving human-like AGI or building a human cognitive model.

  2. Environment

In the environment, figures move around on a two-dimensional stage. The number of figures is either one or two; the shape of a figure is one of the card suits, and each figure has a distinct color from a set of four colors. The initial position, initial movement direction, shape, and color are selected randomly (if there are two figures, they do not share the same attribute values). When a figure collides with the boundary of the stage or with another figure, it reverses its direction according to rigid-body collision rules. Also, once a certain amount of time has passed since its last motion change, a figure probabilistically changes direction, most likely toward the center of the stage. The environment provides a live commentary of the figures' movement in text. The text contains the following elements, in order from the beginning (a sketch of how such a sentence is assembled follows the list):

Subject: name of the shape of a figure
Adjective: name of the color of the figure (optional)
Directional adverb: motion direction (up, down, left, right)
Verb: word representing one of {move, reverse, stop, collide, pass}
Object: when the number of figures is 2:

  • If the verb represents move and there is another figure that moves along: "con" + the name of its shape
  • If the verb represents collide: either the name of a wall or the name of the shape of another figure
  • If the verb represents passing-by: the name of the shape of another figure
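
Putting these elements together, the following is a minimal sketch of how such a commentary sentence could be assembled. The class and field names are illustrative assumptions, and the example words are English stand-ins; the environment actually generates Interlingua text (see §5.1).

from dataclasses import dataclass
from typing import Optional

# Minimal sketch of the commentary structure: Subject, optional Adjective,
# Directional adverb, Verb, optional Object. Names and example words are
# illustrative assumptions; the environment's actual vocabulary is Interlingua.
@dataclass
class Commentary:
    subject: str                # name of the shape of the figure
    adjective: Optional[str]    # color word (optional)
    adverb: str                 # motion direction: up / down / left / right
    verb: str                   # one of {move, reverse, stop, collide, pass}
    obj: Optional[str] = None   # object, only when two figures are present

    def to_text(self) -> str:
        words = [self.subject]
        if self.adjective:
            words.append(self.adjective)
        words.extend([self.adverb, self.verb])
        if self.obj:
            words.append(self.obj)
        return " ".join(words)

# e.g. Commentary("heart", "red", "left", "pass", "diamond").to_text()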

The output from the environment is as follows:

Stage map: the shapes and colors of figures are given as features at their positions on the map. For human observers, image rendering is optionally given (Figure 1).
Text: Live text about the movement of figures (not given if the verb cannot be determined, as at the beginning of an episode).

Fig. 1 The text says "Heart passes Diamond."

  3. Agent

The ‘agent’ observes input from the environment and learns a language model that describes the situation.

3.1 Language Model

The language model is trained to predict the next word (1-hot vector) from the following set of features (a sketch of assembling this input vector follows the list):
  • Statistical features of the previous word (see §5.1, "Acquiring Statistical Features of Words")
  • Shape features of the figure gazed at (1-hot vector)
  • Color features of the figure gazed at (1-hot vector)
  • Whether a gaze shift occurred (for distinguishing between subject and object)
  • Features of figures other than shape and color (see §3.3, "Computing Features of Figures for Language Model Training"):
      • Motion direction of the figure gazed at
      • Presence of figures near the figure gazed at
      • When another figure is present near the figure being gazed at:
          • Whether that figure is approaching
          • Whether the figures are moving in the same direction
          • Whether a collision with that figure is predicted
      • Whether the figure gazed at has a boundary in its vicinity
      • Whether the figure gazed at has changed its direction
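
As a concrete illustration, the following is a minimal sketch of how these features could be packed into a single input vector for the next-word predictor. The dimensions and dictionary keys are assumptions made here for illustration (the statistical word features have 10 elements as in §5.1; shapes, colors, and directions are treated as 1-hot over four categories).

import numpy as np

# Assumed sizes: 4 shapes, 4 colors, 4 motion directions (Sec. 2).
N_SHAPES, N_COLORS, N_DIRS = 4, 4, 4
DIRS = ("up", "down", "left", "right")

def one_hot(i, n):
    v = np.zeros(n, dtype=np.float32)
    v[i] = 1.0
    return v

def lm_input(prev_word_feat, gazed, gaze_shifted):
    """Concatenate the Sec. 3.1 features into one vector.
    'gazed' is a dict holding the computed features of the figure gazed at
    (see Sec. 3.3); its keys are illustrative assumptions."""
    parts = [
        np.asarray(prev_word_feat, dtype=np.float32),          # statistical features of previous word
        one_hot(gazed["shape"], N_SHAPES),                      # shape of the gazed figure (1-hot)
        one_hot(gazed["color"], N_COLORS),                      # color of the gazed figure (1-hot)
        np.asarray([float(gaze_shifted)], dtype=np.float32),    # whether a gaze shift occurred
        one_hot(DIRS.index(gazed["direction"]), N_DIRS),        # motion direction of the gazed figure
        np.asarray([float(gazed["near_figure"]),                # another figure nearby?
                    float(gazed["approaching"]),                # ... approaching?
                    float(gazed["same_direction"]),             # ... moving in the same direction?
                    float(gazed["collision_predicted"]),        # ... collision predicted?
                    float(gazed["near_boundary"]),              # boundary in the vicinity?
                    float(gazed["direction_changed"])],         # direction changed?
                   dtype=np.float32),
    ]
    return np.concatenate(parts)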

3.2 Gaze Shift

The gaze shifts to the most salient figure in the visual field. The field of view does not change with a gaze shift; the gaze only determines from which part of the stage detailed information is obtained. The agent learns whether to shift its gaze.

3.2.1 Learning mechanism

Gaze shift occurs when the current word describes a shape and color different from those of the figure being gazed at. Specifically, the agent retrieves the shape and color features associated with the word (for words describing shape or color), takes the dot product with the features obtained from the figure gazed at on the map, and multiplies the result by the salience of the figure. The association between shape and color words and the shape and color features of figures is learned beforehand (see §5.2).
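
The following is a minimal sketch of this matching computation, under the assumption that the gaze moves to the figure whose salience-weighted match with the current word's associated features is highest; the exact decision rule used in the report may differ.

import numpy as np

def match_score(word_assoc_feat, figure_feat, salience):
    """Dot product between the shape/color features associated with the current
    word (via the table learned in Sec. 5.2) and the features of a figure on
    the map, multiplied by the figure's salience."""
    return float(np.dot(word_assoc_feat, figure_feat)) * salience

def choose_gaze(word_assoc_feat, figures, current_idx):
    """Hypothetical decision rule: gaze at the figure with the highest
    salience-weighted match; a shift happens when that figure is not the
    one currently gazed at."""
    scores = [match_score(word_assoc_feat, f["features"], f["salience"]) for f in figures]
    best = int(np.argmax(scores))
    return best, best != current_idx   # (new gaze target, whether a shift occurred)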

Learning gaze shift
The agent learns whether to shift the gaze based on the pair of statistical features of the previously predicted word and the next predicted word.  Specifically, the agent would shift the gaze when the previously predicted word has the attribute of a verb that takes an object and the next predicted word has the attribute of a word that describes a shape.
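
A minimal sketch of how training examples for this learner could be formed, assuming the input is simply the concatenation of the two words' statistical features and the target is whether a gaze shift actually occurred (as determined by the matching mechanism above); the names are illustrative.

import numpy as np

def gaze_shift_example(prev_word_feat, next_word_feat, shift_occurred):
    """Build one (input, target) pair for the gaze-shift learner.
    Input: statistical features of the previously predicted word and the next
    predicted word, concatenated. Target: 1.0 if a gaze shift occurred (Go),
    0.0 otherwise (No Go)."""
    x = np.concatenate([prev_word_feat, next_word_feat]).astype(np.float32)
    y = np.float32(1.0 if shift_occurred else 0.0)
    return x, y

In generation mode (§3.2.2), the same pair of features is formed, and the trained learner's output decides whether to shift.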

3.2.2 Generation mode

In text generation mode, there is no word input, so whether to shift the gaze is determined from the pair of statistical features of the previously predicted word and the next predicted word.

3.3 Computing Features of Figures for Language Model Training

The agent calculates the following for all figures in its field of view, in addition to shape and color:
  • Direction of movement of the figure: difference from the previous position
  • Presence or absence of figures near the figure
  • If there is another figure in the vicinity of the figure:
      • Whether the nearby figure is approaching
      • Whether the nearby figure is moving in the same direction
      • Whether a collision with the nearby figure is predicted (determined by a built-in algorithm)
  • Presence of boundaries near the figure
  • Whether the figure has changed its direction
The calculation is performed regardless of whether the figure is being gazed at or not. In relation to human vision, this means that the calculation is parallel and does not depend on central vision (in actual human vision, the situation is more complicated than the setting here, since peripheral vision also changes with gaze shifts). (See related discussions of feature integration theory [5].)
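
The following is a minimal sketch of this per-figure computation. The proximity threshold, the coordinate convention (screen coordinates with y increasing downward), and the crude collision heuristic are assumptions for illustration, not the report's built-in algorithm.

import numpy as np

NEAR_DIST = 2.0  # assumed proximity threshold (stage units)

def quantize_direction(delta):
    """Map a displacement vector to one of up/down/left/right
    (screen coordinates assumed: y grows downward)."""
    dx, dy = delta
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"

def figure_features(pos, prev_pos, other_pos=None, other_prev=None,
                    near_boundary=False, direction_changed=False):
    """Compute the Sec. 3.3 features for one figure; 'other_*' describe the
    other figure, if any."""
    pos, prev_pos = np.asarray(pos, float), np.asarray(prev_pos, float)
    feats = {"direction": quantize_direction(pos - prev_pos),   # difference from previous position
             "near_figure": False, "approaching": False,
             "same_direction": False, "collision_predicted": False,
             "near_boundary": near_boundary,                    # assumed to be supplied by the map
             "direction_changed": direction_changed}
    if other_pos is not None:
        other_pos, other_prev = np.asarray(other_pos, float), np.asarray(other_prev, float)
        dist = np.linalg.norm(other_pos - pos)
        feats["near_figure"] = dist < NEAR_DIST
        if feats["near_figure"]:
            prev_dist = np.linalg.norm(other_prev - prev_pos)
            feats["approaching"] = dist < prev_dist
            feats["same_direction"] = quantize_direction(other_pos - other_prev) == feats["direction"]
            # crude stand-in for the built-in collision predictor
            feats["collision_predicted"] = feats["approaching"] and dist < NEAR_DIST / 2
    return feats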

4 Implementation

Python was used for the implementation (see the code on GitHub). PyGame was used for rendering the environment, and PyTorch as the machine learning framework. The language model used a perceptron with one hidden layer, with Cross Entropy Loss as the loss function and Softmax as the output function; the next word was predicted by a "dice roll" over the resulting multinomial distribution. To decide whether to shift the gaze, another perceptron with one hidden layer was used, with Binary Cross Entropy Loss as the loss function and Sigmoid as the output function; it outputs a two-dimensional vector (Go vs. No Go), and the element with the larger value was selected. The association between words and the shape and color features of figures was represented by a correlation table obtained from the data.
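
As an illustration, a minimal PyTorch sketch of the next-word predictor and the sampling step is given below. The hidden size and vocabulary size follow §5 (60 hidden units, 20 words); the input size, the ReLU activation, and the SGD optimizer are assumptions made here.

import torch
import torch.nn as nn

class WordPredictor(nn.Module):
    """One-hidden-layer perceptron for next-word prediction (ReLU is an assumption)."""
    def __init__(self, n_features, n_hidden=60, n_words=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_words))

    def forward(self, x):
        return self.net(x)  # logits; CrossEntropyLoss applies the softmax internally

model = WordPredictor(n_features=32)          # 32 stands in for the Sec. 3.1 feature size
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# one training step on a (feature vector, observed next word) pair
x = torch.randn(1, 32)                        # placeholder for the assembled features
target = torch.tensor([3])                    # index of the observed next word
loss = loss_fn(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# generation: "dice roll" over the softmax (multinomial) distribution
probs = torch.softmax(model(x), dim=-1)
next_word = torch.multinomial(probs, num_samples=1)

The gaze-shift learner can be sketched in the same way, with a 10-unit hidden layer and a two-element output (Go vs. No Go).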

5 Experiment

5.1 Acquiring Statistical Features of Words

The statistical features of words used in the language model were obtained by pre-training a perceptron with CBOW (continuous bag of words) prediction on text from the environment. The features were taken from the weight matrix mapping the 1-hot word vectors to the hidden layer (a kind of embedding). The number of elements in the hidden layer (= 10) was kept smaller than the number of words (= 20) so that the features would represent word categories. (Distributional semantic representations of words are fundamental in recent machine-learning-based language processing, and it is assumed that corresponding representations exist in the brain.)
In the pre-training, the environment ran for 1,000 episodes (episode length = 100), and the generated text (53,982 sentences) was used.
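
A minimal sketch of this CBOW pre-training is given below, assuming the context words are summed into a bag-of-words input vector and the input-to-hidden weights are read out as the word features; the context window and the optimizer are assumptions.

import torch
import torch.nn as nn

VOCAB, HIDDEN = 20, 10   # 20 words, 10 hidden units (Sec. 5.1)

class CBOW(nn.Module):
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(VOCAB, HIDDEN, bias=False)   # its weights become the word features
        self.out_proj = nn.Linear(HIDDEN, VOCAB)

    def forward(self, context_bow):                           # bag-of-words vector of the context
        return self.out_proj(self.in_proj(context_bow))       # logits over the vocabulary

model = CBOW()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# one training step: predict the center word from its context words
context = torch.zeros(1, VOCAB)
context[0, [2, 7]] = 1.0                                      # 1-hot context words summed (indices assumed)
center = torch.tensor([5])                                    # index of the center word (assumed)
loss = loss_fn(model(context), center)
opt.zero_grad()
loss.backward()
opt.step()

# statistical features of word i: row i of the transposed input weight matrix
word_features = model.in_proj.weight.detach().T               # shape (VOCAB, HIDDEN)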
Figure 2 shows the distribution of acquired features visualized with Isomap.  Clusters of words represent colors (bottom left), shapes (bottom right), vertical directions (top left), and horizontal directions (top right).  Verbs cluster toward the bottom left.  The definite article “le” for “le muro” (the wall) is in the center bottom left, while “muro” is at the top right edge.  Note that the text is in Interlingua.

Fig. 2 Distribution of statistical features of words

5.2 Acquisition of Shape and Color Word-Feature Associations

The agent ran with one figure in the environment for 100 episodes (episode length = 100) in learning mode to acquire a correlation table between shape and color words and features (learning stops when the number of iterations exceeds a fixed threshold).
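
A minimal sketch of how such a correlation table could be accumulated by co-occurrence counting, assuming 20 words and eight shape/color features (4 + 4) and a simple row normalization; the stopping rule mentioned above is omitted.

import numpy as np

VOCAB, N_FEATS = 20, 8   # 20 words; 4 shape + 4 color features (assumed layout)

counts = np.zeros((VOCAB, N_FEATS))

def update(word_idx, figure_feats):
    """Accumulate co-occurrence of an observed word with the 1-hot shape/color
    features of the (single) figure on the stage."""
    counts[word_idx] += figure_feats

def word_to_features(word_idx):
    """Normalized row: the shape/color features associated with a word."""
    row = counts[word_idx]
    total = row.sum()
    return row / total if total > 0 else row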

5.3 Training 

The agent was trained for 2,000 episodes (episode length = 100) in the environment with two figures, using the statistical word features and the associations between words and shape/color features obtained in the pre-training stages. The number of units in the hidden layer of the language-model learner was set to 60, and the number of units in the hidden layer of the gaze-shift learner to 10. Of the text input from the environment, only live reports about a figure were used for learning.

5.4 Evaluation 

The agent ran for 100 episodes (episode length = 100) to generate sentences (2,642 sentences).
Agreement between the subject and the figure on shape and color
Sentences in which a shape word was output as the subject totaled 2,296 (87%). Among those in which a color-describing adjective followed the subject, the color matched the actual color of the figure in 84% of cases. Table 1 shows how well the other elements matched reality, for sentences whose subject matched a figure in the environment (percentages are rounded to integers).

Table 1

                  Recall    Precision
  Verb              63%        70%
  Direction         82%        82%
  Go-along exp.     61%        85%
  Object            75%       100%

The performance in verb selection was poor. Note, however, that verbs occur with varying frequencies (e.g., "va", denoting movement, accounts for one-third of occurrences), and guessing at random from this distribution yields recall = precision of approximately 48%. The recall for the "go-along" expression, which indicates that the two figures move in the same direction, was also poor, probably due to insufficient feature engineering. The object (the name of the shape that is passed or accompanied) matched reality at a rather high rate. The primary goal of this report was to apply the mechanism of acquiring information from objects through gaze shifts (active vision) to language models, and the successful selection of objects validates the mechanism. Descriptions of collisions between figures were not generated in this evaluation, likely because of the small number of such instances in the training data.

6 Conclusion

The purpose of this report was to verify a language model that obtains features of figures through gaze shifts. The experiment confirmed this by implementing a mechanism that obtains features from objects within the field of view through gaze shifts and selects subject and object words accordingly. To improve the verb-selection results, the animation of the figures in the environment could be in-betweened (rendered with intermediate frames to smooth the motion). Episodic memory may be effective for the problem that collisions between figures, which rarely occur in the environment, were not reported in the generated text. Humans achieve efficient language acquisition with far fewer sentence exposures than in our experiment, through mechanisms such as "fast mapping," perhaps using episodic memory. (While deep transformers and large datasets might also be effective, they run against the biological plausibility pursued here.) The backpropagation used in this report could be replaced with a more biologically plausible alternative (e.g., [6]). In addition, this report did not cover phonological functions (such as phoneme recognition and the double segmentation that divides phoneme sequences into words) or generative grammatical functions.

References

[1] Hoang, K., et al.: Active vision: on the relevance of a bio-inspired approach for object detection. Bioinspiration & Biomimetics, 15(2) (2020). https://doi.org/10.1088/1748-3190/ab504c
[2] McBride, S., Huelse, M., and Lee, M.: Identifying the Computational Requirements of an Integrated Top-Down-Bottom-Up Model for Overt Visual Attention within an Active Vision System. PLoS ONE 8(2) (2013). https://doi.org/10.1371/journal.pone.0054585
[3] Yu, C.: Embodied Active Vision in Language Learning and Grounding. In Lecture Notes in Computer Science, vol. 4840 (2007). https://doi.org/10.1007/978-3-540-77343-6_5
[4] Zhang, D., et al.: MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv (2024). https://doi.org/10.48550/arXiv.2401.13601
[5] Kristjánsson, Á., Egeth, H.: How feature integration theory integrated cognitive psychology, neurophysiology, and psychophysics. Atten Percept Psychophys 82, 7–23 (2020). https://doi.org/10.3758/s13414-019-01803-7
[6] Nejad, K., et al.: Self-supervised predictive learning accounts for cortical layer-specificity. Nat Commun 16, 6178 (2025). https://doi.org/10.1038/s41467-025-61399-5