Saturday, October 11, 2014

Simple Word Segmentation Experiment

I ran a simple word segmentation experiment using the word candidates produced by the method in the previous post.  The segmentation logic is a quasi-Viterbi search whose cost function is the number of segments.  When no word candidate is found, a single character is used as a segment instead.  Here is part of the result:

^lecirculoesrubie$
% le circulo e s rubie $
# le circulo es rubie $
^tomasuncirculo$
% tomas un circulo $ 
# tomas un circulo $ 
^lerectanguloesrubie$
% le rectangulo e s rubie $ 
# le rectangulo es rubie $ 
^illoesrubie$
% il loesrubie $ 
# illo es rubie $ 
^ilfacefrigide$
% ilfacefrigide $ 
# il face frigide $
^tomasilloesblau$
% tomasilloes blau $ 
# tomas illo es blau $ 
* Lines starting with "^" are the strings to be segmented.
* Lines starting with "%" are the segmentation results.
* Lines starting with "#" are the intended segmentations.
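
The search described at the top of this post can be sketched roughly as follows. This is only a minimal illustration, not the code used in the experiment: the function name `segment`, the small `CANDIDATES` subset, and the choice to let a single character cost one segment anywhere in the string are all assumptions made for the sketch.

```python
# A subset of the word candidates listed in this post (assumed here for brevity).
CANDIDATES = {
    "le", "il", "illo", "illoes", "circulo", "rubie", "tomas", "un",
    "blau", "loesrubie", "tomasilloes", "ilfacefrigide",
}


def segment(text, candidates):
    """Split text into the fewest segments.

    A segment is either a known candidate or, as a fallback,
    a single character (assumed to cost one segment as well).
    """
    n = len(text)
    max_len = max([1] + [len(w) for w in candidates])
    # best[i] = minimal number of segments covering text[:i]
    best = [0] + [n + 1] * n
    # back[i] = start index of the last segment in the best split of text[:i]
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in candidates or end - start == 1:
                cost = best[start] + 1
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start
    # Follow the back-pointers from the end of the string.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(text[start:end])
        end = start
    return segments[::-1]


if __name__ == "__main__":
    for s in ["lecirculoesrubie", "tomasuncirculo", "tomasilloesblau"]:
        print("^" + s + "$")
        print("% " + " ".join(segment(s, CANDIDATES)) + " $")
```

When several splits have the same number of segments (for example "il loesrubie" and "illoes rubie"), which one is returned depends on the order in which the search visits them, so a different tie-breaking rule could give a different result there.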

Of course, this all depends on the extracted word candidates (see below).
Children learning a language can use cues for extracting words beyond the statistical properties of the strings they hear, e.g., semantic cues (cf. the preliminary experiment) or accents (I suspect accents are very important for perceiving words).  In any case, to make the language model more sensible, grammatical categories (classes) must be introduced, and semantics (symbol grounding) should be considered again...

Word candidates used in the experiment above:
        lo,757
        il,537*
      loes,440
      illo,291*
    illoes,260
        le,256*
    angulo,251*
        au,251
        un,250
        ta,205
        as,196
    ilface,170
      blau,167*
     rubie,165*
     verde,154*
      esun,152
 triangulo,133*
 loesrubie,130
   circulo,129*
rectangulo,118*
  loesblau,117
 loesverde,117
  anguloes,115
     tomas,109*
       lor,108
   ascolta,87*
ilfacecalor,86
ilfacefrigide,84
  illoesun,76
ilesunpaucobscur,76
  reguarda,59*
     nonne,39*
   uloblau,37
tomasilloes,36
* Intended words are marked with '*'.
* The numbers are the frequencies of the strings in the corpus of 1000 utterances.
* Candidates occurring fewer than 30 times were not used.
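
For illustration, a list in the "string,frequency" format above could be read and filtered with the frequency cutoff like this. It is a rough sketch under assumptions: the helper name `load_candidates` and its `min_count` parameter are hypothetical, and the "rare,12" entry in the example is made up to show the cutoff.

```python
def load_candidates(lines, min_count=30):
    """Parse 'string,frequency' lines (an optional trailing '*' marks an
    intended word) and keep entries occurring at least min_count times."""
    candidates = {}
    for line in lines:
        entry = line.strip().rstrip("*")
        if "," not in entry:
            continue  # skip blank lines and notes
        word, count = entry.rsplit(",", 1)
        count = count.strip()
        if not count.isdigit():
            continue  # skip anything that is not a frequency
        if int(count) >= min_count:
            candidates[word.strip()] = int(count)
    return candidates


if __name__ == "__main__":
    # "rare,12" is a made-up entry, included only to show the cutoff.
    sample = ["        lo,757", "        il,537*", "      rare,12"]
    print(load_candidates(sample))
```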
