Setting:
- Representation of things in the environment
There is always one object in the environment. It has one of three shapes (rectangle, triangle, circle) and one of three colors (red, blue, green), and the system is given two symbols representing the shape and the color of the object, respectively.
- Sentential strings
The system is also given a sentence string describing the same situation (shape and color). The task for the system is to extract the lexical entries (strings) that represent the shapes and colors.
The sentences are in Interlingua. The grammar for this experiment is as follows (in a pseudo BNF):
S ⇒ "tomas"? "reguarda"? S' "nonne"?
S' ⇒ "illoes"? "un" Shape Color?
S' ⇒ {"illo" | "le" Shape} "es" Color
Shape ⇒ {"rectangulo" | "triangulo" | "circulo"}
Color ⇒ {"rubie" | "blau" | "verde"}
where "tomas" is the name of the system being addressed, "reguarda" means "look!", "illo" means "that", "es" means "is", and "nonne" means "isn't it". '?' marks an optional element (0 or 1 occurrence). Tokens are concatenated without spaces.
The following are sample sentences randomly generated from the grammar above:
3 1 illoesrubie
1 2 tomasilloesblaunonne
1 1 unrectangulo
2 1 letrianguloesrubie
1 3 reguardalerectanguloesverde
2 1 reguardailloesrubienonne
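The two numbers at the beginning of each sentence are the shape and color symbols, numbered in the order listed above (shape: 1 = rectangle, 2 = triangle, 3 = circle; color: 1 = red, 2 = blue, 3 = green).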
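For reference, sentences like the samples above could be produced by a small generator along the following lines. This is only a sketch in Python: the names (generate, maybe), the 50/50 choice at every optional element and between the two S' rules, and the uniform choice of shape and color are assumptions, not details of the original experiment. The numeric codes assume the order in which the shapes and colors are listed in the setting, which agrees with the samples.

```python
import random

SHAPES = ["rectangulo", "triangulo", "circulo"]  # assumed codes 1, 2, 3
COLORS = ["rubie", "blau", "verde"]              # assumed codes 1, 2, 3


def maybe(token, rng):
    """Return token half of the time, else '' (the '?' of the grammar)."""
    return token if rng.random() < 0.5 else ""


def generate(rng):
    """Return (shape_code, color_code, sentence) for one random scene."""
    shape_code = rng.randint(1, 3)
    color_code = rng.randint(1, 3)
    shape = SHAPES[shape_code - 1]
    color = COLORS[color_code - 1]
    if rng.random() < 0.5:
        # S' => "illoes"? "un" Shape Color?
        body = maybe("illoes", rng) + "un" + shape + maybe(color, rng)
    else:
        # S' => {"illo" | "le" Shape} "es" Color
        subject = "illo" if rng.random() < 0.5 else "le" + shape
        body = subject + "es" + color
    # S => "tomas"? "reguarda"? S' "nonne"?
    sentence = (maybe("tomas", rng) + maybe("reguarda", rng)
                + body + maybe("nonne", rng))
    return shape_code, color_code, sentence


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(6):
        print(*generate(rng))
```

Note that the first S' rule can produce a sentence that mentions no color at all, and "illo" sentences mention no shape, so the symbols and the sentence do not always carry the same information.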
Algorithm:
The system collects all the n-grams in the given sentences and calculates a tf*idf score for each n-gram against each shape and color in the environment. The n-grams with the highest scores for a particular color or shape are taken to represent that feature.
First try:
Blue : bla,1.098066
Red : bi,1.087502
Blue : au,1.065256
Triangle : iangul,1.053565
Red : oesr,1.021488
Blue : loesbla,1.017333
Green : verd,1.000361
Rectangle: ectangulo,0.987110
Circle : rculo,0.954782
Green : sverd,0.945333
Number of sentences: 1,000
The left-hand column is the feature to be represented; the right-hand column is the n-gram and its score.
At first glance this looks like a disaster, but on closer inspection the extracted strings are mostly fragments of the shape and color words.
Second try:
Instead of tf*idf, tf*idf*ngram.length was used. The result seems better.
Sentence Count=50
Rectangle: rectangulo,2.628845
Triangle : triangulo,2.613976
Blue : uloesblau,2.477903
Red : loesrubie,2.287229
Triangle : rianguloes,2.263571
Rectangle: nrectangul,2.250493
Triangle : letriangu,2.159996
Rectangle: unrectang,2.147516
Red : uloesrubie,2.109837
Circle : lecirculoe,1.920132
Sentence Count=100
Circle : circulo,2.277660
Red : loesrubie,2.239706
Blue : loesblau,2.206975
Green : loesverde,2.143210
Red : uloesrubie,2.090366
Rectangle: rectangulo,2.048243
Blue : illoesblau,2.045903
Triangle : triangulo,2.039147
Green : illoesverd,1.920606
Circle : uncirculo,1.905309
Sentence Count=500
Rectangle: rectangulo,2.330337
Triangle : triangulo,2.217422
Red : loesrubie,2.180421
Blue : loesblau,2.163493
Green : loesverde,2.127100
Circle : circulo,1.923614
Rectangle: nrectangul,1.856743
Triangle : rianguloes,1.801742
Rectangle: unrectang,1.771783
Blue : illoesblau,1.755595
Sentence Count=1000
Triangle : triangulo,2.314919
Rectangle: rectangulo,2.272904
Red : loesrubie,2.244439
Blue : loesblau,2.115484
Green : loesverde,2.077109
Circle : circulo,1.857920
Triangle : ntriangulo,1.785375
Red : lloesrubie,1.773095
Triangle : rianguloes,1.710290
Green : illoesverd,1.709630
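The two scoring schemes used in the tries above (tf*idf and tf*idf*ngram.length) can be sketched roughly as follows. The exact definitions of tf and idf are not spelled out in these notes, so everything here is an assumption: tf is taken as the fraction of a feature's sentences that contain the n-gram, idf as the log of (number of feature values / number of feature values whose sentences contain the n-gram), each dimension (shape or color) is scored separately, and the n-gram length range of 2 to 10 is a guess based on the shortest and longest strings in the results. Absolute scores will therefore not match the figures reported above.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, min_n=2, max_n=10):
    """Set of character n-grams of text (each counted once per sentence)."""
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}


def score_ngrams(samples, weight_by_length=False):
    """Rank n-grams for each feature value of one dimension.

    samples: iterable of (feature, sentence) pairs, e.g. ("Red", "illoesrubie").
    Returns {feature: [(ngram, score), ...]} sorted by descending score.
    """
    sentence_count = Counter()         # feature -> number of sentences
    containing = defaultdict(Counter)  # feature -> ngram -> sentences with it
    for feature, sentence in samples:
        sentence_count[feature] += 1
        containing[feature].update(ngrams(sentence))
    # df: number of feature values in which the n-gram occurs at all
    df = Counter(g for counts in containing.values() for g in counts)
    n_features = len(sentence_count)
    ranked = {}
    for feature, counts in containing.items():
        scored = []
        for g, c in counts.items():
            tf = c / sentence_count[feature]
            idf = math.log(n_features / df[g])
            score = tf * idf * (len(g) if weight_by_length else 1)
            scored.append((g, score))
        ranked[feature] = sorted(scored, key=lambda p: (-p[1], p[0]))
    return ranked


if __name__ == "__main__":
    toy = [("Red", "illoesrubie"), ("Red", "lecirculoesrubie"),
           ("Blue", "illoesblau"), ("Blue", "letrianguloesblau"),
           ("Green", "illoesverde"), ("Green", "lerectanguloesverde")]
    for feature, scored in score_ngrams(toy, weight_by_length=True).items():
        print(feature, scored[:3])
```

Even on this six-sentence toy input, the length-weighted top candidate for Red is "loesrubie" rather than "rubie", the same kind of fusion with the preceding "es" seen in the results.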
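With this setting the system does not seem able to extract the color terms cleanly: they consistently come out fused with the preceding "es" (for example "loesrubie", "loesblau", "loesverde"). Presumably it must also learn other terms such as "illo", and lexical items could then be determined by fitting them into sentences, that is, by looking for the best segmentation.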
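As a rough illustration of what "looking for the best segmentation" could mean, here is a small dynamic-programming sketch. It is not part of the experiment above: the function name, the cost-based lexicon, and the unknown_cost penalty are all hypothetical. Given a lexicon that maps known strings to a cost, it splits a sentence into known entries at minimum total cost, falling back to single characters where nothing in the lexicon fits.

```python
def best_segmentation(sentence, lexicon, unknown_cost=5.0):
    """Split sentence into lexicon entries at minimum total cost.

    lexicon maps string -> cost (lower is better). A character that no
    entry covers becomes a one-character token costing unknown_cost.
    """
    n = len(sentence)
    best = [0.0] + [float("inf")] * n  # best[i]: min cost of sentence[:i]
    back = [0] * (n + 1)               # back[i]: start of the last token
    for end in range(1, n + 1):
        for start in range(end):
            piece = sentence[start:end]
            if piece in lexicon:
                cost = lexicon[piece]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(sentence[back[i]:i])
        i = back[i]
    return tokens[::-1]


if __name__ == "__main__":
    # Hypothetical lexicon: every known entry costs 1.
    lexicon = {w: 1.0 for w in
               ["tomas", "reguarda", "illo", "le", "es", "nonne",
                "triangulo", "rubie"]}
    print(best_segmentation("letrianguloesrubie", lexicon))
    # -> ['le', 'triangulo', 'es', 'rubie']
```

With such a procedure, a stretch of leftover single characters between known entries (for example the "blau" in "tomasilloesblaunonne" under the lexicon above) would mark a candidate for a new lexical item.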