Tuesday, September 23, 2014

Preliminary experiment on Lexicon Acquisition #1

As part of the Phase III experiments, I conducted a preliminary experiment on lexicon acquisition, in which a simple learner associates substrings of given sentence strings with signals that represent things in the environment.

  • Representation of things in the environment
    There is always an object in the environment.  It has one of three shapes (rectangle, triangle, circle) and one of three colors (red, blue, green), and the system is given two symbols representing the shape and the color of the object, respectively.
  • Sentential strings
    The system is given a sentence string also representing the situation (shape and color).  The task for the system is to extract lexical entries (strings) that would represent the shapes and colors.
    The sentences are in Interlingua.  The grammar for this experiment is as follows (in a pseudo BNF):
    S  ⇒ "tomas"? "reguarda"? S' "nonne"?
    S' ⇒ "illoes"? "un" Shape Color?
    where "tomas" is the name of the system being addressed, "reguarda" means "look!", "illo" means "that," "es" means "is," and "nonne" means "isn't it?"  The '?' marks an element that occurs 0 or 1 times.
    The following are sample sentences randomly generated from the grammar above:
3 1 illoesrubie
1 2 tomasilloesblaunonne
1 1 unrectangulo
2 1 letrianguloesrubie
1 3 reguardalerectanguloesverde
2 1 reguardailloesrubienonne
The two numbers at the beginning of each sentence represent the shape and the color of the object.
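For illustration, the labeled lines above could be read as follows.  This is only a sketch: the mapping from numbers to features is my reading of the samples (color 1 goes with "rubie", 2 with "blau", 3 with "verde"; shape 1 with "rectangulo", 2 with "triangulo", leaving 3 for the circle), and the names SHAPES, COLORS and parse_line are hypothetical.

```python
# Assumed number-to-feature mapping, inferred from the sample sentences.
SHAPES = {1: "Rectangle", 2: "Triangle", 3: "Circle"}
COLORS = {1: "Red", 2: "Blue", 3: "Green"}


def parse_line(line):
    """Split 'shape color sentence' into (shape name, color name, sentence)."""
    shape_id, color_id, sentence = line.split()
    return SHAPES[int(shape_id)], COLORS[int(color_id)], sentence


samples = [
    "3 1 illoesrubie",
    "1 2 tomasilloesblaunonne",
    "2 1 letrianguloesrubie",
]
parsed = [parse_line(line) for line in samples]
for shape, color, sentence in parsed:
    print(shape, color, sentence)
```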
The system collects all the character n-grams in the given sentences and calculates a tf*idf score for each n-gram with respect to each shape and color in the environment.  The n-grams with the highest scores for a particular color or shape are taken to represent that feature.
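The scoring can be sketched roughly as below.  The exact tf and idf definitions are not spelled out in this post, so the ones here are assumptions: tf is the fraction of a feature's sentences that contain the n-gram, and idf is the log of the number of feature values in one dimension (e.g. the three colors) divided by the number of values whose sentences contain the n-gram.  The n-gram length range (2 to 10) and all function names are likewise assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(sentence, min_n=2, max_n=10):
    """Return the set of distinct character n-grams in a sentence."""
    return {sentence[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(sentence) - n + 1)}


def tfidf_scores(labeled):
    """labeled: (feature, sentence) pairs for ONE dimension, e.g. color."""
    sentences_per_feature = Counter()
    hits = defaultdict(Counter)  # feature -> n-gram -> sentences containing it
    for feature, sentence in labeled:
        sentences_per_feature[feature] += 1
        hits[feature].update(char_ngrams(sentence))
    n_features = len(sentences_per_feature)
    # df: number of feature values whose sentences contain the n-gram
    df = Counter(g for counts in hits.values() for g in counts)
    scores = {}
    for feature, counts in hits.items():
        for g, c in counts.items():
            tf = c / sentences_per_feature[feature]   # assumed tf
            idf = math.log(n_features / df[g])        # assumed idf
            scores[(feature, g)] = tf * idf
    return scores


labeled = [
    ("Red", "illoesrubie"),
    ("Blue", "tomasilloesblaunonne"),
    ("Green", "reguardalerectanguloesverde"),
    ("Red", "letrianguloesrubie"),
]
scores = tfidf_scores(labeled)
print(scores[("Red", "rubie")], scores[("Red", "es")])
```

Under these assumptions, an n-gram that occurs with every color (such as "es") gets an idf of zero, while one that occurs only with a single color keeps the full weight.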
First try:
Blue     :       bla,1.098066
Red      :        bi,1.087502
Blue     :        au,1.065256
Triangle :    iangul,1.053565
Red      :      oesr,1.021488
Blue     :   loesbla,1.017333
Green    :      verd,1.000361
Rectangle: ectangulo,0.987110
Circle   :     rculo,0.954782
Green    :     sverd,0.945333 
Number of sentences: 1,000
The left-hand column shows the feature, and the right-hand column shows the n-gram taken to represent it together with its score.
At first glance, it looks like a disaster, but if you look closely, the extracted strings are mostly parts of the shape/color words.
Second try:
Instead of tf*idf, the score tf*idf*ngram.length was used, which favors longer n-grams.  The results look better.
Sentence Count=50
Triangle : triangulo,2.613976
Blue     : uloesblau,2.477903
Red      : loesrubie,2.287229
Triangle :rianguloes,2.263571
Triangle : letriangu,2.159996
Rectangle: unrectang,2.147516
Red      :uloesrubie,2.109837
Circle   :lecirculoe,1.920132
Sentence Count=100
Circle   :   circulo,2.277660
Red      : loesrubie,2.239706
Blue     :  loesblau,2.206975
Green    : loesverde,2.143210
Red      :uloesrubie,2.090366
Blue     :illoesblau,2.045903
Triangle : triangulo,2.039147
Green    :illoesverd,1.920606
Circle   : uncirculo,1.905309
Sentence Count=500
Triangle : triangulo,2.217422
Red      : loesrubie,2.180421
Blue     :  loesblau,2.163493
Green    : loesverde,2.127100
Circle   :   circulo,1.923614
Triangle :rianguloes,1.801742
Rectangle: unrectang,1.771783
Blue     :illoesblau,1.755595
Sentence Count=1000
Triangle : triangulo,2.314919
Red      : loesrubie,2.244439
Blue     :  loesblau,2.115484
Green    : loesverde,2.077109
Circle   :   circulo,1.857920
Triangle :ntriangulo,1.785375
Red      :lloesrubie,1.773095
Triangle :rianguloes,1.710290
Green    :illoesverd,1.709630
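The length weighting used in the second try amounts to something like the sketch below.  The input scores here are made-up toy values (not experimental results), chosen only to show that when several fragments of a word tie on tf*idf, multiplying by the n-gram length puts the longest one first; the function names are hypothetical.

```python
def length_weighted(scores):
    """Multiply each (feature, n-gram) score by the n-gram's length."""
    return {(feature, g): s * len(g) for (feature, g), s in scores.items()}


def top_k(scores, k):
    """Return the k highest-scoring ((feature, n-gram), score) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy input: three fragments of the same word with equal tf*idf (illustrative).
toy_scores = {
    ("Red", "bi"): 1.0,
    ("Red", "ubie"): 1.0,
    ("Red", "rubie"): 1.0,
}
weighted = length_weighted(toy_scores)
for (feature, ngram), score in top_k(weighted, 3):
    print(f"{feature:9}: {ngram},{score:f}")
```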
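It seems that the learner cannot extract color terms cleanly in this setting: they come out with fragments of the preceding words attached (e.g., "loesrubie").  Presumably, it must also learn other terms such as "illo," and lexical items may then be determined by fitting them into sentences (i.e., by looking for the best segmentation).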
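One possible reading of "looking for the best segmentation" is a dynamic-programming search over a candidate lexicon, sketched below.  This was not part of the experiment: the lexicon, its uniform scores, the penalty for unexplained characters, and the function name are all assumptions made for illustration.

```python
def best_segmentation(sentence, lexicon, unknown_penalty=-1.0):
    """Segment a sentence to maximize the sum of lexicon scores.

    lexicon maps candidate words to scores; a character not covered by
    any word costs unknown_penalty.  Returns (score, pieces).
    """
    n = len(sentence)
    best = [(0.0, [])] + [None] * n  # best[i]: best result for sentence[:i]
    for end in range(1, n + 1):
        # Option 1: leave the last character unexplained.
        prev_score, prev_pieces = best[end - 1]
        candidates = [(prev_score + unknown_penalty,
                       prev_pieces + [sentence[end - 1]])]
        # Option 2: end a known word at this position.
        for start in range(end):
            word = sentence[start:end]
            if word in lexicon:
                score, pieces = best[start]
                candidates.append((score + lexicon[word], pieces + [word]))
        best[end] = max(candidates, key=lambda c: c[0])
    return best[n]


# Hypothetical lexicon with uniform scores.
lexicon = {w: 1.0 for w in ["tomas", "illo", "es", "rubie", "blau", "nonne"]}
score, pieces = best_segmentation("illoesrubie", lexicon)
print(score, pieces)
```

With such a search, a candidate like "loesrubie" would compete against the combination "illo" + "es" + "rubie", which is the sense in which learning the other terms could help isolate the color words.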
