^lecirculoesrubie$
% le circulo e s rubie $
# le circulo es rubie $
^tomasuncirculo$
% tomas un circulo $
# tomas un circulo $
^lerectanguloesrubie$
% le rectangulo e s rubie $
# le rectangulo es rubie $
^illoesrubie$
% il loesrubie $
# illo es rubie $
^ilfacefrigide$
% ilfacefrigide $
# il face frigide $
^tomasilloesblau$
% tomasilloes blau $
# tomas illo es blau $
* Lines starting with "%" are the segmentation result.
* Lines starting with "#" are the intended segmentation.
Of course, this all depends on the extracted word candidates (see below).
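To make that dependence concrete, here is a minimal sketch of one possible segmenter. It is not necessarily the procedure used in the experiment. It assumes a unigram model: each candidate's probability is its count divided by the sum of all candidate counts, the split with the highest total log-probability is chosen by dynamic programming, and any character not covered by a candidate is emitted as a one-letter piece with a fixed penalty. The names segment, CANDIDATES and unknown_logp are hypothetical; the counts are copied from the candidate list given below.

```python
import math

# Counts copied from the candidate list in this note.
CANDIDATES = {
    "lo": 757, "il": 537, "loes": 440, "illo": 291, "illoes": 260,
    "le": 256, "angulo": 251, "au": 251, "un": 250, "ta": 205,
    "as": 196, "ilface": 170, "blau": 167, "rubie": 165, "verde": 154,
    "esun": 152, "triangulo": 133, "loesrubie": 130, "circulo": 129,
    "rectangulo": 118, "loesblau": 117, "loesverde": 117,
    "anguloes": 115, "tomas": 109, "lor": 108, "ascolta": 87,
    "ilfacecalor": 86, "ilfacefrigide": 84, "illoesun": 76,
    "ilesunpaucobscur": 76, "reguarda": 59, "nonne": 39,
    "uloblau": 37, "tomasilloes": 36,
}


def segment(utterance, counts, unknown_logp=-10.0):
    """Return the most probable split of `utterance` under a unigram model.

    Assumption: unknown_logp is an arbitrary penalty for a single letter
    that no candidate covers; it is lower than any candidate's log-probability.
    """
    total = sum(counts.values())
    max_len = max(len(word) for word in counts)
    n = len(utterance)
    best_score = [0.0] + [-math.inf] * n   # best log-probability up to each position
    back = [0] * (n + 1)                   # start index of the last piece

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = utterance[start:end]
            if piece in counts:
                logp = math.log(counts[piece] / total)
            elif len(piece) == 1:
                logp = unknown_logp        # fallback: uncovered single letter
            else:
                continue
            score = best_score[start] + logp
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the utterance.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(utterance[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    for u in ["lecirculoesrubie", "illoesrubie", "tomasilloesblau"]:
        print("%", " ".join(segment(u, CANDIDATES)), "$")
```

Under these assumptions, segment("illoesrubie", CANDIDATES) gives il loesrubie rather than illo es rubie, because "es" is not a candidate at all and il with loesrubie is jointly more frequent than illoes with rubie. Likewise lecirculoesrubie can only be covered by falling back to the single letters e and s. Adding "es" to the dictionary, or removing fused strings such as loesrubie, changes the output directly.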
Children learning a language can rely on cues for extracting words other than the statistical properties of the strings they hear, e.g., semantic cues (cf. the preliminary experiment) or accents (I suspect accents are very important for perceiving words). In any case, to make the language model more sensible, grammatical categories (classes) must be introduced, and semantics (symbol grounding) should be considered again.
Word candidates used in the experiment above:
lo,757
il,537*
loes,440
illo,291*
illoes,260
le,256*
angulo,251*
au,251
un,250
ta,205
as,196
ilface,170
blau,167*
rubie,165*
verde,154*
esun,152
triangulo,133*
loesrubie,130
circulo,129*
rectangulo,118*
loesblau,117
loesverde,117
anguloes,115
tomas,109*
lor,108
ascolta,87*
ilfacecalor,86
ilfacefrigide,84
illoesun,76
ilesunpaucobscur,76
reguarda,59*
nonne,39*
uloblau,37
tomasilloes,36
* Intended words are marked with '*'.
* The numbers are the frequencies of the strings in the corpus of 1000 utterances.
* Candidates occurring fewer than 30 times were not used.
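The frequency cutoff can be sketched as follows. This note does not say how candidate strings were proposed before the cutoff was applied, so the exhaustive substring enumeration here is an assumption, as are the names substring_counts and frequent and the min_len parameter. The three-utterance corpus and the cutoff of 2 are toy values for illustration only; the experiment used 1000 utterances and a cutoff of 30.

```python
from collections import Counter


def substring_counts(utterances, max_len=16):
    """Count every occurrence of every substring of up to max_len letters."""
    counts = Counter()
    for u in utterances:
        for i in range(len(u)):
            for j in range(i + 1, min(len(u), i + max_len) + 1):
                counts[u[i:j]] += 1
    return counts


def frequent(counts, min_count=30, min_len=2):
    """Keep strings of at least min_len letters seen at least min_count times."""
    return {s: c for s, c in counts.items()
            if c >= min_count and len(s) >= min_len}


if __name__ == "__main__":
    toy_corpus = ["tomasuncirculo", "illoesrubie", "lecirculoesrubie"]
    kept = frequent(substring_counts(toy_corpus), min_count=2)
    # Print the surviving strings, most frequent first.
    for s, c in sorted(kept.items(), key=lambda item: -item[1]):
        print(f"{s},{c}")
```

Even on this toy corpus, fused strings such as loesrubie pass the cutoff alongside circulo and rubie, while tomas (seen once) does not. A pure frequency threshold therefore cannot separate intended words from frequent word sequences, which is consistent with the unmarked entries in the list above.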