Tuesday, December 2, 2025

A Language Model Grounded to a Simple Visual Environment with Active Vision

[Japanese version]

Abstract: When modeling language acquisition to realize human-like AGI, it is important to set up an adequate cognitive model and its grounding in the environment. In this report, a relatively simple environment was created, in which one or two figures move around on a screen, with their movements described in text. A constructed agent learns a language model that describes the movement of the figures by observing input from the environment, including live text commentary. The agent’s vision, modeled after that of humans, fetches the features of the figures from the environment through gaze shifts. Word prediction is based on the statistical features of the previous word, the features of the figures, and the movement and placement of the figures computed within the agent.

  1. Introduction

This report presents a simple model of human language acquisition, which learns a "language model" by observing one or two figures moving around in a two-dimensional space together with live text commentary, and generates live text that describes the movement of the figures.

The report focuses on active vision as a human cognitive function. Humans can extract detailed information from only one location within their field of view at a time. This stems from the fact that only the central portion of the human visual field possesses high resolution (central vs. peripheral vision). To extract detailed information from multiple objects, gaze shifts are required, and the information obtained sequentially must be integrated (bound) for further processing. This mechanism is termed active vision. Research on active vision considering biological plausibility includes [1] and [2].

With regard to language functions, active vision is required when humans recognize or generate sentences referring to multiple visual objects. Research linking language acquisition in young children to active vision is found in [3].

The learning mechanisms to be reported were kept as simple as possible. One could ground the language model in the environment by preparing large datasets of images paired with descriptive text and training a model such as a transformer (as in multimodal LLMs [4]). While this method seems sound from an engineering point of view, the use of massive datasets and ‘deep’ backpropagation is considered biologically implausible from the perspective of achieving human-like AGI or building a human cognitive model.

  2. Environment 

In the environment, figures move around on a two-dimensional stage. The number of figures is one or two; each figure has one of the card-suit shapes and a distinct color from a set of four colors. The initial position, initial movement direction, shape, and color are selected randomly (when there are two figures, they never share the same attributes). When a figure collides with the boundary of the stage or with another figure, it reverses its direction according to rigid-body collision rules. In addition, once a certain amount of time has passed since its last motion change, a figure probabilistically changes its direction, most likely toward the center of the stage. The environment provides a live commentary of the figures’ movement in text. The text contains the following elements, in order from the beginning:

Subject: name of the shape of a figure
Adjective: name of the color of the figure (optional)
Directional adverb: motion direction (up, down, left, right)
Verb: word representing one of {move, reverse, stop, collide, pass}
Object: when the number of figures is 2:

  • If the verb represents move and there is another figure that moves along: “con”+the name of its shape
  • If the verb represents collide: either the name of a wall or the name of the shape of another figure
  • If the verb represents passing-by: the name of the shape of another figure

The output from the environment is as follows:

Stage map: the shapes and colors of figures are given as features at their positions on the map. For human observers, image rendering is optionally given (Figure 1).
Text: Live text about the movement of figures (not given if the verb cannot be determined, as at the beginning of an episode).

Fig. 1 The text says "Heart passes Diamond."

  3. Agent 

The ‘agent’ observes input from the environment and learns a language model that describes the situation.

3.1 Language Model

The language model is trained to predict the next word (1-hot vector) from the following set of features (a sketch of this encoding is given after the list):
  • Statistical features of the previous word (see §5.1, "Acquiring Statistical Features of Words")
  • Shape features of the figure gazed at (1-hot vector)
  • Color features of the figure gazed at (1-hot vector)
  • Whether a gaze shift occurred (for distinguishing between subject and object)
  • Features of figures other than shape and color (see §3.3, "Computing Features of Figures for Language Model Training")
  • Motion direction of the figure gazed at
  • Presence of figures near the figure gazed at
  • When another figure is present near the figure gazed at, the following features are also used:
    • Whether the other figure is approaching
    • Whether the figures are moving in the same direction
    • Whether a collision with the other figure is predicted
  • Whether the figure gazed at has a boundary in its vicinity
  • Whether the figure gazed at has changed its direction
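To make the input encoding concrete, here is a minimal sketch of how such a feature vector might be assembled. The field names, sizes, and dict-based record are illustrative assumptions, not the actual implementation.

```python
import numpy as np

N_SHAPES, N_COLORS, N_DIRS = 4, 4, 4   # assumed sizes (card suits, colors, up/down/left/right)

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    if index is not None:
        v[index] = 1.0
    return v

def build_lm_input(prev_word_embedding, gazed):
    """Concatenate the features listed above into one input vector.
    `gazed` is assumed to be a dict-like record of the figure currently gazed at."""
    parts = [
        np.asarray(prev_word_embedding, dtype=np.float32),  # statistical features of the previous word (sec. 5.1)
        one_hot(gazed["shape"], N_SHAPES),                   # shape of the gazed figure
        one_hot(gazed["color"], N_COLORS),                   # color of the gazed figure
        [float(gazed["gaze_shifted"])],                      # whether a gaze shift occurred
        one_hot(gazed["direction"], N_DIRS),                 # motion direction
        [float(gazed["neighbor_present"])],                  # another figure nearby?
        [float(gazed["neighbor_approaching"])],              # nearby figure approaching?
        [float(gazed["neighbor_same_direction"])],           # moving in the same direction?
        [float(gazed["collision_predicted"])],               # collision predicted?
        [float(gazed["boundary_near"])],                     # stage boundary nearby?
        [float(gazed["direction_changed"])],                 # direction changed?
    ]
    return np.concatenate([np.asarray(p, dtype=np.float32) for p in parts])
```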

3.2 Gaze Shift

The gaze shifts to the most salient figure in the visual field.  The field of view does not change with the gaze shift; it only determines from which part of the stage detailed information is obtained. The agent learns whether to shift the gaze.

3.2.1 Learning mechanism

Gaze shift occurs when the current word describes a shape and color different from those of the figure being gazed at. Specifically, the agent retrieves the shape and color features associated with the word describing shape or color, takes the dot product with the features obtained from the gazed figure on the map, and multiplies the result by the salience of the figure. The association between shape/color words and the shape and color features of figures is learned beforehand (see §5.2).
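A minimal sketch of this comparison (function and variable names are illustrative; the word-associated shape/color features are those acquired in §5.2):

```python
import numpy as np

def word_figure_match(word_features, figure_features, salience):
    """Match between the shape/color features associated with the current word
    and those of a figure on the stage map, weighted by the figure's salience
    (a direct, simplified reading of the description above)."""
    return float(np.dot(word_features, figure_features)) * salience

# A gaze shift would be triggered when this score for the currently gazed figure
# is low, i.e. when the word describes a different shape or color.
```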

Learning gaze shift
The agent learns whether to shift the gaze based on the pair of statistical features of the previously predicted word and the next predicted word.  Specifically, the agent would shift the gaze when the previously predicted word has the attribute of a verb that takes an object and the next predicted word has the attribute of a word that describes a shape.

3.2.2 Generation mode

In text generation, there is no word input, so whether to shift the gaze is determined from the pair of statistical features of the previously predicted word and the next predicted word.

3.3 Computing Features of Figures for Language Model Training

The agent calculates the following for all figures in its field of view, in addition to shape and color:
  • Direction of movement of the figure: difference from the previous position
  • Presence or absence of figures near the figure
  • If there is another figure in the vicinity of a figure, the following features are also used:
    • Whether the nearby figure is approaching
    • Whether the nearby figure is moving in the same direction
    • Whether a collision with the nearby figure is predicted (determined by a built-in algorithm)
  • Presence of boundaries near the figure
  • Whether the figure has changed its direction
The calculation is performed regardless of whether the figure is being gazed at. In relation to human vision, this means that the calculation is parallel and does not depend on central vision (in actual human vision, the situation is more complicated than the setting here, since peripheral vision also changes with gaze shift). (See related discussions on feature integration theory [5].)
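The following sketch illustrates how these per-figure features might be computed from successive map observations. The proximity threshold, stage size, and the simple linear collision test are assumptions, not the actual built-in algorithm.

```python
import numpy as np

NEAR_DIST = 3.0                      # assumed proximity threshold (map units)

def motion_features(pos, prev_pos, other_pos=None, other_prev_pos=None,
                    stage_size=(20, 20)):
    """Per-figure features described above. Positions are 2D numpy arrays;
    `other_*` are None when only one figure is present."""
    v = pos - prev_pos                                       # direction of movement
    feats = {"direction": np.sign(v)}
    feats["boundary_near"] = bool(
        np.any(pos < NEAR_DIST) or np.any(np.asarray(stage_size) - pos < NEAR_DIST))
    feats["neighbor_present"] = False
    if other_pos is not None:
        dist = np.linalg.norm(other_pos - pos)
        feats["neighbor_present"] = bool(dist < NEAR_DIST)
        if feats["neighbor_present"]:
            ov = other_pos - other_prev_pos
            prev_dist = np.linalg.norm(other_prev_pos - prev_pos)
            feats["approaching"] = bool(dist < prev_dist)    # distance is shrinking
            feats["same_direction"] = bool(np.all(np.sign(v) == np.sign(ov)))
            # crude stand-in for the built-in collision predictor:
            feats["collision_predicted"] = bool(
                np.linalg.norm((other_pos + ov) - (pos + v)) < dist)
    return feats
```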

4 Implementation

Python was used for the implementation (see the code on GitHub). PyGame was used for rendering the environment, and PyTorch as the machine learning framework. The language model is a perceptron with one hidden layer, with Cross Entropy Loss as the loss function and Softmax as the output function; the next word is predicted by a "dice roll" from the resulting multinomial distribution. To decide whether to shift the gaze, another perceptron with one hidden layer was used, with Binary Cross Entropy Loss as the loss function and Sigmoid as the output function; it outputs a two-dimensional vector (Go vs. No Go), and the larger value is selected. The association between words and the shape and color features of figures was represented in a correlation table obtained from the data.
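A minimal PyTorch sketch of the two learners as described: a one-hidden-layer perceptron for next-word prediction (softmax output, cross-entropy loss, multinomial sampling) and one for the gaze-shift decision. The hidden sizes follow §5.3; the input sizes, the Sigmoid hidden activation, and the use of the logit-based loss variants are assumptions.

```python
import torch
import torch.nn as nn

class OneHiddenLayerPerceptron(nn.Module):
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid(),
                                 nn.Linear(n_hidden, n_out))
    def forward(self, x):
        return self.net(x)                     # logits; softmax/sigmoid applied below

# Language model: predicts the next word from the feature vector of sec. 3.1.
lm = OneHiddenLayerPerceptron(n_in=29, n_hidden=60, n_out=20)    # input size assumed
lm_loss = nn.CrossEntropyLoss()                                  # applies softmax internally

def sample_next_word(features):
    probs = torch.softmax(lm(features), dim=-1)
    return torch.multinomial(probs, num_samples=1)               # the "dice roll"

# Gaze-shift learner: two-dimensional output (Go vs. No Go), larger value selected.
gaze = OneHiddenLayerPerceptron(n_in=20, n_hidden=10, n_out=2)   # input size assumed
gaze_loss = nn.BCEWithLogitsLoss()                               # BCE with sigmoid applied internally

def should_shift_gaze(pair_features):
    return bool(torch.argmax(gaze(pair_features)) == 0)          # which index means "Go" is assumed
```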

5 Experiment

5.1 Acquiring Statistical Features of Words

The statistical features of words used in the language model were obtained by pre-training a perceptron with CBOW (continuous bag of words) prediction for text from the environment.  The features were the weight matrix from the 1-hot vectors of words to the hidden layer (a kind of embedding). The number of elements in the hidden layer (=10) was kept smaller than the number of words (=20) so that the features would represent word categories.  (Distributional semantic representations of words are fundamental in recent machine learning-based language processing, and it is assumed that corresponding representations would exist in the brain.)
In the pre-training, the environment ran for 1,000 episodes (episode length = 100), and the generated text (53,982 sentences) was used.
Figure 2 shows the distribution of acquired features visualized with Isomap.  Clusters of words represent colors (bottom left), shapes (bottom right), vertical directions (top left), and horizontal directions (top right).  Verbs cluster toward the bottom left.  The definite article “le” for “le muro” (the wall) is in the center bottom left, while “muro” is at the top right edge.  Note that the text is in Interlingua.

Fig.2 Distribution of statistical features of words
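A minimal sketch of this CBOW pre-training setup (vocabulary of 20 words, 10 hidden units, features taken from the input-to-hidden weights). The context handling and the omission of the training loop are simplifications.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN = 20, 10                   # hidden kept smaller than the vocabulary

class CBOW(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_hidden = nn.Linear(VOCAB, HIDDEN, bias=False)  # these weights become the word features
        self.to_output = nn.Linear(HIDDEN, VOCAB)
    def forward(self, context_one_hots):                       # shape: (batch, n_context, VOCAB)
        h = self.to_hidden(context_one_hots).mean(dim=1)       # average the context words
        return self.to_output(h)                               # logits over the vocabulary

model = CBOW()
loss_fn = nn.CrossEntropyLoss()          # training loop over the generated sentences omitted
# After training, the "statistical features" of word i are row i of this matrix:
word_features = model.to_hidden.weight.detach().T              # shape (VOCAB, HIDDEN)
```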

5.2 Acquisition of Shape and Color Word-Feature Associations.

The agent ran with one figure in the environment for 100 episodes (episode length = 100) in learning mode to acquire a correlation table between shape and color words and features (learning stops when the number of iterations exceeds a fixed threshold).

5.3 Training 

The agent was trained for 2,000 episodes (episode length = 100) in the environment with two figures, using the statistical features of words and the word-to-shape/color-feature associations obtained in the pre-training stages. The number of units in the hidden layer of the LM learner was set to 60, and the number of units in the hidden layer of the gaze-shift learner was set to 10. Of the text input from the environment, only live reports about a figure were used for learning.

5.4 Evaluation 

The agent ran for 100 episodes (episode length = 100) to generate sentences (2,642 sentences).
Agreement between subject and figure on shape and color
Sentences in which shape words were output as subjects totaled 2,296 (87%). Among those in which a color-describing adjective followed the subject, the color matched the actual color of the figure in 84% of cases. Table 1 shows the match rates of the other elements with reality, computed over sentences whose subject matched a figure in the environment (percentages are rounded to integers).

Table 1

                 Recall    Precision
Verb               63%        70%
Direction          82%        82%
Go-along exp.      61%        85%
Object             75%       100%

The performance in verb selection was poor. Note, however, that verbs have varying rates of occurrence (e.g., "va," denoting movement, accounts for one-third), and this distribution yields recall = precision of approximately 48% for a random guess. The recall for the "go-along expression," indicating that two figures move in the same direction, was also poor, probably due to poor feature engineering. The object (the name of the shape passing by or accompanying) matched reality at a rather high rate. The primary goal of this report was to apply the mechanism of acquiring information from objects through gaze shifts (active vision) to a language model, and the successful selection of objects validates the mechanism. Descriptions of collisions between figures were not generated in this evaluation, likely because such instances were rare in the training data.

6 Conclusion

The purpose of this report was to verify a language model that obtains features of figures through gaze shifts. The experiment confirmed this by implementing a mechanism that obtains features from objects within the field of view via gaze shifts and selects subject and object words. To improve verb selection, the animation of figures in the environment could be in-betweened (made smoother). For the problem that collisions between figures, which rarely occur in the environment, were not reported in the generated text, episodic memory may be effective. Humans achieve efficient language acquisition with far fewer sentence exposures than in our experiment, through mechanisms such as "fast mapping," perhaps using episodic memory. (While deep transformers and large datasets might also be effective, they run against the biological plausibility pursued here.) The backpropagation used in this report could be replaced with a more biologically plausible alternative (e.g., [6]). In addition, this report did not cover phonological functions (such as phoneme recognition and the double segmentation that segments phoneme sequences into words) or generative grammatical functions.

References

[1] Hoang, K. et al.: Active vision: on the relevance of a bio-inspired approach for object detection. Bioinspiration & Biomimetics, 15(2) (2020). https://doi.org/10.1088/1748-3190/ab504c
[2] McBride, S., Huelse, M., and Lee, M.: Identifying the Computational Requirements of an Integrated Top-Down-Bottom-Up Model for Overt Visual Attention within an Active Vision System. PLoS ONE 8(2) (2013). https://doi.org/10.1371/journal.pone.0054585
[3] Yu, C.: Embodied Active Vision in Language Learning and Grounding. In Lecture Notes in Computer Science, vol. 4840 (2007). https://doi.org/10.1007/978-3-540-77343-6_5
[4] Zhang, D., et al.: MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv (2024). https://doi.org/10.48550/arXiv.2401.13601
[5] Kristjánsson, Á., Egeth, H.: How feature integration theory integrated cognitive psychology, neurophysiology, and psychophysics. Atten Percept Psychophys 82, 7–23 (2020). https://doi.org/10.3758/s13414-019-01803-7
[6] Nejad, K., et al.: Self-supervised predictive learning accounts for cortical layer-specificity. Nat Commun 16, 6178 (2025). https://doi.org/10.1038/s41467-025-61399-5







Friday, April 18, 2025

A Simple Environment for Language Acquiring Agents

 I made a Gymnasium environment for agents that acquire language. (GitHub)

In the environment, one or two objects (card suits) move around in a scene.  The environment outputs a scene representation map and its text description as the observation.  The scene representation map consists of features (shapes and colors, each represented as a one-hot vector) of objects embedded in a 2D map.  The text is in Interlingua.  Verbs include: pausa (pauses), va (goes), colpa (hits), and passa (passes).  Adjectives indicate the colors of objects.  Adverbs indicate the direction of movement.

An agent that acquires (learns) language from this environment is fed the observation.  It is supposed to associate object descriptions in the text with object representations in the scene, to learn the motion and interactions of objects, and to associate the learned activity representations with predicates in the text.
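As a usage illustration, here is a minimal sketch of interacting with the environment through the standard Gymnasium API. The registration id, the observation keys, and the treatment of the action are assumptions; only reset/step follow the actual Gymnasium interface.

```python
import gymnasium as gym

env = gym.make("SimpleLanguageEnv-v0")          # hypothetical registration id
obs, info = env.reset()
for _ in range(100):                            # one episode; length assumed
    action = env.action_space.sample()          # may be a dummy action in this observation-only setting
    obs, reward, terminated, truncated, info = env.step(action)
    scene_map = obs["map"]                      # 2D map with one-hot shape/color features (key assumed)
    commentary = obs["text"]                    # e.g. "Corde jalne passa Diamante" (key assumed)
    if terminated or truncated:
        break
env.close()
```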


Sample text in the observation

Trifolio pausa
Trifolio va sub
Trifolio verde colpa Diamante
Trifolio va sup
Trifolio verde va sup
Spada va dextre con Corde
Spada blau va dextre con Corde 
Diamante rubie passa Corde
Diamante rubie va sub sinistre
Diamante colpa le muro
Corde jalne passa Diamante

 


Tuesday, March 18, 2025

Implementation of a Parser without Grammar with Neural Sequence Memory

Abstract

A parser without grammar was implemented with neural sequence memory.  It parses part-of-speech (POS) sequences represented on the sequence memory to create parse trees, grouping frequently occurring POS pairs in the sequence into binary trees.  For the syntactic category of the parent (root) node of a binary tree, it uses the POS inferred from the preceding and following POSs, enabling the construction of higher binary trees from pairs of a parent node and a POS or another parent node.  Experiments with an artificial grammar have shown that the mechanism is capable of primitive parsing.

Introduction

Human languages are known to have constituent structure [Anderson 2022], which is a recursive (tree-like) structure of constituents.  The constituent structure is also thought to be the basis for compositional semantics [ibid.].  In natural language processing, parsers with manually constructed grammars have been used to construct constituent structure from word sequences.  As manual grammar construction is costly, attempts have also been made to construct grammars automatically (grammar induction) and to build parsers without grammar (unsupervised parsing) [Tu 2021][Mori 1995][Kim 2019].  Since humans learn their native language without being explicitly taught grammar, it is not desirable to provide explicit grammar in cognitive modeling.  Recent language processing systems based on deep learning have shown remarkable performance in applications.  While they do not require explicit grammar, it is not clear whether they use hierarchical constituents or whether compositionality is properly handled.  Thus, creating a model that builds constituent structure (parses) without explicitly provided grammar would contribute to research on human cognitive models, and implementing such a model in neural circuits would make it more biologically plausible.

Based on the above, a neural circuit that constructs constituent structures (parses) from part-of-speech (POS) sequences was implemented and is reported below.

Method

A mechanism to construct a tree structure on sequence memory implemented by neural circuits was devised.  This mechanism creates a binary tree with frequently occurring POS pairs as child nodes.  As the syntactic category of the parent node of a binary tree, it uses the POS estimated from the POSs (categories) preceding and succeeding the tree, so that a higher-level binary tree can be constructed from a pair of a parent node and another POS or parent node.

Sequence Memory

Neural sequence memory was used.  The memory uses one-hot vectors as the internal states of the sequence, and stores a sequence through associations between successive internal states together with mutual associations between input patterns and internal states.  The associations (association matrices) are set up by a single stimulus exposure (biologically corresponding to one-shot synaptic potentiation).  Similar mechanisms have been reported as the competitive queuing model [Bullock 2003][Houghton 1990][Burgess 1999].
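A minimal sketch of such a memory with one-shot (outer-product) associations; the dimensions, the readout scheme, and the dense matrix representation are assumptions rather than the actual neural implementation.

```python
import numpy as np

class SequenceMemory:
    """One-hot internal states chained by an association matrix, with input
    patterns bound to states in both directions by single-exposure updates."""
    def __init__(self, n_states, n_input):
        self.next_state = np.zeros((n_states, n_states))   # state_t -> state_{t+1}
        self.state_to_in = np.zeros((n_input, n_states))   # state -> input pattern
        self.in_to_state = np.zeros((n_states, n_input))   # input pattern -> state
        self.n_states = n_states

    def store(self, patterns):
        for t, p in enumerate(patterns):
            s = np.eye(self.n_states)[t]                   # one-hot internal state
            if t > 0:
                prev = np.eye(self.n_states)[t - 1]
                self.next_state += np.outer(s, prev)       # one-shot chaining of states
            self.state_to_in += np.outer(p, s)             # bind state <-> input pattern
            self.in_to_state += np.outer(s, p)

    def recall(self, length):
        s = np.eye(self.n_states)[0]
        out = [self.state_to_in @ s]
        for _ in range(length - 1):
            s = np.eye(self.n_states)[np.argmax(self.next_state @ s)]
            out.append(self.state_to_in @ s)
        return out
```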

Estimating Syntactic Categories

A POS was used as the syntactic category of the parent node of a binary tree.  The POS is inferred from the POSs before and after the tree.  For this purpose, a learning model was used to infer a POS from the preceding and following POS.  As it is similar to what is known as the CBOW (Continuous Bag of Words) in language models such as Word2Vec, it will be referred to as CBOC (Continuous Bag of Categories) below.

Parsing Algorithm

  1. Read words up to the end of the sentence (EOS) and look up a dictionary to find the POSs of the words, creating one-hot vectors representing the POSs.
    • Memorize the POS vectors in the sequence memory.
    • For each consecutive POS pair (head|tail), calculate the activation value (see below) and assign it to the tail node.
  2. Repeat the following:
    • For the sequence memory (tail) node with the maximum activation value, create a new sequence memory node n as the parent node of a binary tree whose child nodes are the tail node and the preceding (head) node.
    • Estimate a POS vector with the CBOC predictor from the two POS vectors associated with the nodes before and after the head and tail nodes (pre and post), and associate it with n.
    • Set links between n and the previous and next nodes.
    • Set links between n and the constituent (child) nodes.
    • Regard (pre|n) and (n|post) as POS pairs and calculate the activation values of n and post.
    • Set the activation values of the child nodes (head and tail) of n to 0.
    • Delete the sequence links set for the head and tail nodes.

The implementation of the algorithm is located on GitHub.
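The following is a simplified sketch of the merge loop, operating on a plain Python list instead of the neural sequence memory and recomputing pair scores each iteration rather than maintaining per-node activations; the activation function and CBOC predictor are passed in as callables.

```python
def parse(pos_seq, activation, cboc_predict):
    """Greedy grammarless parse (simplified sketch of the algorithm above).
    pos_seq: list of POS symbols including the BOS and EOS sentinels.
    activation(head, tail): score for grouping the consecutive pair (head|tail).
    cboc_predict(pre, post): category of a new parent node from its neighbours."""
    # Each node is (category, subtree); leaves carry the POS itself as the subtree.
    nodes = [(pos, pos) for pos in pos_seq]
    while len(nodes) > 3:                       # stop when only BOS, root, EOS remain
        # 1. pick the consecutive pair (sentinels excluded) with the largest activation
        best = max(range(2, len(nodes) - 1),
                   key=lambda i: activation(nodes[i - 1][0], nodes[i][0]))
        head, tail = nodes[best - 1], nodes[best]
        # 2. estimate the parent's category from the neighbouring categories (CBOC)
        parent_cat = cboc_predict(nodes[best - 2][0], nodes[best + 1][0])
        # 3. replace the pair by the parent node; the children leave the sequence
        nodes[best - 1:best + 1] = [(parent_cat, (head[1], tail[1]))]
    return nodes[1][1]                          # bracketed tree between BOS and EOS
```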

Figure 1: Parsing Example
BOS: Beginning of Sentence, Det.: Article, Adj.: Adjective,
IV: Intransitive Verb, EOS: End of Sentence
Categories given to parent nodes: Nd1: Det.+ IV ⇒ Noun,
Nd2: BOS+IV ⇒ Proper Noun, Nd3: BOS+EOS ⇒ None
Black lines: Constituent links, Red lines: Final sequence memory links,
Green dashed line: sequence memory links set initially or midway and deleted

Experiments

Input Sentences and Constituent Structure

Word sequences generated from a context-free grammar (see Appendix) were used as input sentences.  To be precise, bracketed constituent structure strings were generated from a context-free grammar, and the parser used word sequences after removing the brackets.  Notation for generation probability was used in the grammar, and sentences were generated based on the specified probabilities.

Evaluation Method

The edit distance between the input and output tree structures was used for evaluation. The input-output pairs in bracketed tree form were fed to an edit distance calculation tool. Non-terminal symbols were given in neither the input nor the output.

The following were used as activation values.

  • Random values (baseline)
  • Frequencies of bigram occurrence (number of occurrences ÷ total number)
  • Bigram conditional probabilities p(head|tail)

Dataset

For cross-validation, a pair of datasets a and b, each containing 5000 sentences generated from the grammar with probabilities, and a pair of datasets A and B, each containing 10,000 sentences generated from a grammar without probabilities, were used.  The maximum sentence lengths of datasets a, b, A, and B were 10, 11, 18, and 19, respectively.

Figure 2 shows the frequency of POS bigrams divided by the total number of data and the conditional probabilities (divided by 8 for comparison) for dataset a, sorted by the frequency.

Figure 2: Statistics of POS bigrams (dataset a)
Occ.: Frequency (number of occurrences/total number),
Prob/8: Conditional probability p(head|tail)/8

Learning

POS bigram occurrences were counted from the generated sentences (Figure 2).

Note: Preliminary experiments showed that both the frequency of occurrence and the conditional probability can be approximated by perceptrons.  However, both theoretically and practically, it is easier and more accurate to use the simple statistical values, so there is little point in using neural learning here.

A perceptron with one hidden layer (implemented with PyTorch) was used as the CBOC predictor, with the mean squared error as the loss function (after the cross-entropy error had been tried).  The number of training epochs of the CBOC learner was set to 20.
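A minimal sketch of the CBOC predictor as described (one hidden layer, mean squared error, 20 epochs); the hidden size, the optimizer, and the data format are assumptions.

```python
import torch
import torch.nn as nn

N_POS = 10                                     # number of POS categories (assumed)

class CBOC(nn.Module):
    """Predict the (one-hot) POS of a parent node from the categories before and after it."""
    def __init__(self, n_pos=N_POS, n_hidden=16):          # hidden size assumed
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * n_pos, n_hidden), nn.Sigmoid(),
                                 nn.Linear(n_hidden, n_pos))
    def forward(self, prev_cat, next_cat):
        return self.net(torch.cat([prev_cat, next_cat], dim=-1))

model, loss_fn = CBOC(), nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())           # optimizer choice assumed
training_triples = []   # placeholder: (pre, post, middle) one-hot tensors from the corpus
for epoch in range(20):                                    # 20 epochs, as in the text
    for prev_cat, next_cat, target in training_triples:
        optimizer.zero_grad()
        loss_fn(model(prev_cat, next_cat), target).backward()
        optimizer.step()
```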

Results

Cross-validation was performed on each dataset pair.  Namely, the system tried to parse the sentences from one of the datasets using the statistics and CBOC model obtained from the other dataset.  The input and output trees were compared using an edit distance calculation tool.  The cross-validated mean edit distance per word is shown in Table 1.  In all datasets, using conditional probabilities as the criterion (activation values) for grouping performed better than using absolute frequencies.

Table 1: Edit distance

Activation values                Random    Frequency    Conditional Probability
Grammar with Probabilities        1.14       0.76              0
Grammar without Probabilities     1.34       0.79              0.65

For the datasets based on the probabilistic grammar, using conditional probabilities as activation values yielded correct answers.  For the datasets based on the grammar without probabilities, using conditional probabilities as activation values did not always yield correct answers.  For example, the following error occurred:

Correct answer:{{Det {Adj N}} {Adv IV}}

Prediction:{{{Det {Adj N}} Adv} IV}

While it is correct to group Adv:IV first, its bigram frequency is the same as that of PN:Adv in the grammar without probabilities, so {Det {Adj N}}:Adv was grouped first (the estimated category for {Det {Adj N}} is PN, making the pair equivalent to PN:Adv).

Conclusion

Incorporating prior knowledge about binary trees into a neural circuit model enabled parsing, depending on the nature of the grammar that generates the sentences.  Whether natural language sentences have the same properties as the artificial grammar used here remains to be verified.  A parsed corpus such as the Penn Treebank [Marcus 1999] could be used to investigate the statistical properties of the grammar and the applicability of the current method.  For research aimed at language acquisition, the CHILDES Treebank [Pearle 2019], based on a CHILDES corpus [Sanchez 2019], could be used.

The current report does not compare performance with previous research.  Since the ways of using and comparing trees in existing research are not uniform, it is necessary to reproduce (re-implement) the previous methods and align the conditions for strict comparison. 

The current report deals with relatively simple sentences in context-free grammar.  Extra mechanisms may be needed to parse sentences that include conjunctions, and sentences with multiple clauses.  Even among sentences with a single clause, sentences containing prepositional phrases are to be examined.  Furthermore, dealing with phenomena such as gender and number agreement would require adding grammatical attributes for those phenomena.

While POS sequences were provided as input in this report, it is ultimately desirable for a human cognitive model to be able to derive syntactic structures from acoustic input.  Between the acoustic and POS levels, categories such as phonemes and morphemes are assumed to exist.  Regarding phonemes, it is thought that categorization of acoustic signals into phonemes is learned in the native language environment.  For extracting morphemes from phoneme sequences, statistical methods [Mochihashi 2009] and mechanisms based on patterns of non-phonemic attributes (accent, stress, etc.) can be employed.  I have conducted a simple preliminary experiment to derive POS-like categories from word sequences, and confirmed that it was possible.

Considering this attempt as cognitive modeling, its biological plausibility (i.e., whether there is a corresponding mechanism in the brain) would be an issue.  In the human brain, parsing is said to be carried out by Broca's area in the left hemisphere, and thus it is thought to be primarily a function of the cerebral cortex.  Since parsing is a temporal process, the cerebral cortex presumably has the ability to handle time series, perhaps with sequence memory.  Note that the brain's syntactic analysis is automatic (it does not require conscious effort).  From an evolutionary perspective, it would be natural to think that human parsing evolved from a general-purpose mechanism of the cerebral cortex.  The basic algorithm in this report consists of grouping consecutive pairs that occur frequently and suppressing the grouped "lower patterns."  While it seems to have a cognitive advantage in that it directs attention to pairs of frequent patterns, neuroscientific research is needed to determine whether such a mechanism exists in the brain.  Note that the combination of consecutive pairs is called ‘merge’ in generative linguistics and is considered to be the fundamental process in syntactic processing.

One direction for future research is to link constituent structure with semantic representations to enable the compositional treatment of meaning.

References

[Anderson 2022] Anderson, C., et al.: Essentials of Linguistics, 2nd edition, eCampusOntario (2022)

[Tu 2021] Tu, K., et al.: Unsupervised Natural Language Parsing (Introductory Tutorial), in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts (2021)

[Mori 1995] Mori, S., Nagao, M.: Parsing Without Grammar, in Proceedings of the Fourth International Workshop on Parsing Technologies, 174–185. Association for Computational Linguistics (1995)

[Kim 2019] Kim, Y., et al.: Unsupervised Recurrent Neural Network Grammars (2019). https://doi.org/10.48550/arXiv.1904.03746

[Bullock 2003] Bullock, D., Rhodes, B.: Competitive queuing for planning and serial performance. Handbook of Brain Theory and Neural Networks, MIT Press (2003)

[Houghton 1990] Houghton, G.: The problem of serial order: A neural network model of sequence learning and recall. Current research in natural language generation (1990) https://api.semanticscholar.org/CorpusID:59675195 

[Burgess 1999] Burgess, N., Hitch, G. J.: Memory for serial order: A network model of the phonological loop and its timing. Psychological Review, 106 (3), 551–581 (1999) https://doi.org/10.1037/0033-295x.106.3.551 

[Sanchez 2019] Sanchez, A., et al.: childes-db: A flexible and reproducible interface to the child language data exchange system. Behav Res 51, 1928–1941 (2019) https://doi.org/10.3758/s13428-018-1176-7

[Pearle 2019] Pearl, L.S. and Sprouse, J.: Comparing solutions to the linking problem using an integrated quantitative framework of language acquisition, Language 95, no. 4 (2019): 583–611. https://www.jstor.org/stable/48771165, https://doi.org/10.1353/lan.2019.0067

[Marcus 1999] Marcus, M.P., et al.: Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium (1999) https://doi.org/10.35111/gq1x-j780

[Mochihashi 2009] Mochihashi, D., et al.: Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 100–108 (2009)

Appendix

Context-free grammar used

Rules with : to the right of the leftmost category symbol are phrase structure rules, and rules with - are lexical rules.

The decimal value on the left indicates the probability that that rule will be selected among a group of rules that have the same leftmost category.  For rules without specifications, the probabilities were determined by subtracting the specified probabilities from 1 and equally distributing the remaining probability.

S : NP + VP
NP : Det + N1
N1 : Adj + N1 0.2
N1 : N
NP : PN
VP : Adv + VP 0.2
VP : IV
VP : TV + NP
Det - le, a
N - n1, n2, n3
PN - Pn1, Pn2, Pn3
Adj - adj1, adj2, adj3
Adv - av1, av2, av3
IV - iv1, iv2, iv3
TV - tv1, tv2, tv3
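As an illustration of how sentences might be sampled from this grammar, here is a minimal sketch; it reproduces the rule probabilities above (unspecified probabilities split equally), while the bracketed output format and function names are assumptions.

```python
import random

# (probability, expansion) per left-hand category; lexical categories expand to terminals.
RULES = {
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.5, ["Det", "N1"]), (0.5, ["PN"])],           # unspecified: remaining mass split equally
    "N1": [(0.2, ["Adj", "N1"]), (0.8, ["N"])],
    "VP": [(0.2, ["Adv", "VP"]), (0.4, ["IV"]), (0.4, ["TV", "NP"])],
}
LEXICON = {
    "Det": ["le", "a"], "N": ["n1", "n2", "n3"], "PN": ["Pn1", "Pn2", "Pn3"],
    "Adj": ["adj1", "adj2", "adj3"], "Adv": ["av1", "av2", "av3"],
    "IV": ["iv1", "iv2", "iv3"], "TV": ["tv1", "tv2", "tv3"],
}

def expand(cat):
    """Return a bracketed constituent string generated from category `cat`."""
    if cat in LEXICON:
        return random.choice(LEXICON[cat])
    probs, expansions = zip(*RULES[cat])
    children = random.choices(expansions, weights=probs)[0]
    return "{" + " ".join(expand(c) for c in children) + "}"

print(expand("S"))      # e.g. "{{le {adj1 n2}} {av3 iv1}}"
```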