I made a Gymnasium environment for agents that acquire language. (GitHub)
In the environment, one or two objects (card suits) move around in a scene. The environment outputs a scene representation map and a text description of the scene as the observation. The scene representation map consists of object features (shapes and colors, each represented as a one-hot vector) embedded in a 2D map. The text is in Interlingua. Verbs include: pausa (pauses), va (goes), colpa (hits), and passa (passes). Adjectives indicate the colors of objects. Adverbs indicate the direction of movement.
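To make the map format concrete, here is a minimal sketch of how such a scene representation could be built. The vocabularies are taken from the sample sentences; the feature ordering, the 4x4 map size, and the names `one_hot` and `make_scene_map` are assumptions for illustration and may not match the actual environment.

```python
# Hypothetical vocabularies taken from the sample sentences; the real
# environment's feature ordering and map size may differ.
SHAPES = ["Trifolio", "Diamante", "Spada", "Corde"]
COLORS = ["verde", "rubie", "blau", "jalne"]


def one_hot(index, size):
    """Return a list of `size` zeros with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def make_scene_map(objects, height=4, width=4):
    """Build a height x width grid whose cells each hold a feature vector.

    `objects` is a list of (shape, color, row, col) tuples. Empty cells
    hold an all-zero vector; occupied cells hold the shape one-hot
    concatenated with the color one-hot.
    """
    depth = len(SHAPES) + len(COLORS)
    grid = [[[0] * depth for _ in range(width)] for _ in range(height)]
    for shape, color, row, col in objects:
        grid[row][col] = (one_hot(SHAPES.index(shape), len(SHAPES))
                          + one_hot(COLORS.index(color), len(COLORS)))
    return grid


scene = make_scene_map([("Trifolio", "verde", 1, 2),
                        ("Diamante", "rubie", 3, 0)])
print(scene[1][2])  # [1, 0, 0, 0, 1, 0, 0, 0]
print(scene[3][0])  # [0, 1, 0, 0, 0, 1, 0, 0]
```

In this sketch the first four entries of each cell encode the shape and the last four encode the color, so an agent can read both attributes of an object from the cell it occupies.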
An agent that acquires (learns) language from this environment receives the observation as input. It is supposed to associate object descriptions in the text with the object representations in the scene, to learn the motion and interaction of objects, and to associate the learned activity representations with predicates in the text.
Sample text in the observation
Trifolio pausa
Trifolio va sub
Trifolio verde colpa Diamante
Trifolio va sup
Trifolio verde va sup

Spada va dextre con Corde
Spada blau va dextre con Corde

Diamante rubie passa Corde
Diamante rubie va sub sinistre
Diamante colpa le muro
Corde jalne passa Diamante
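The sample sentences appear to follow a simple pattern: a subject noun, an optional color adjective, a verb, then optional direction adverbs and an optional object. As a rough sketch of how a description could be broken into roles, the following tags each token against small vocabularies. The word lists are inferred from the samples only, the function name `tag_sentence` is hypothetical, and this is not the grammar the environment itself uses.

```python
# Vocabularies inferred from the sample text; likely incomplete.
VERBS = {"pausa", "va", "colpa", "passa"}
COLORS = {"verde", "rubie", "blau", "jalne"}
DIRECTIONS = {"sub", "sup", "dextre", "sinistre"}
NOUNS = {"Trifolio", "Diamante", "Spada", "Corde", "muro"}


def tag_sentence(sentence):
    """Assign each known token in a one-clause description to a role."""
    roles = {"subject": None, "color": None, "verb": None,
             "directions": [], "object": None}
    for token in sentence.split():
        if token in VERBS:
            roles["verb"] = token
        elif token in COLORS:
            roles["color"] = token
        elif token in DIRECTIONS:
            roles["directions"].append(token)
        elif token in NOUNS:
            # A noun before the verb is the subject; one after it is the object.
            if roles["verb"] is None:
                roles["subject"] = token
            else:
                roles["object"] = token
        # Other words, such as "con" and "le", are skipped.
    return roles


print(tag_sentence("Diamante rubie va sub sinistre"))
# {'subject': 'Diamante', 'color': 'rubie', 'verb': 'va',
#  'directions': ['sub', 'sinistre'], 'object': None}
```

A learning agent would not be given such a lexicon, of course. The sketch only shows the structure that the agent is expected to discover by grounding the text in the scene map.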