Thursday, November 8, 2012

Suffix finder

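Well, this isn't really an AI topic, just an elementary NLP exercise.
The AWK script below extracts word suffixes from a word histogram (one "word count" pair per line).
A suffix of up to three characters is counted whenever the rest of the word, the stem, is itself a word in the histogram.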
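
The script reads one "word count" pair per line and skips lines that start with an underscore. A minimal sketch of how such a histogram could be built from plain text is given here; the `word_histogram` helper and its tokenization (letters and apostrophes, lowercased) are assumptions for illustration, not the preprocessing actually used for the corpus results further down.

```shell
# Hypothetical helper: read plain text on stdin, print "word count" lines.
word_histogram() {
  # Turn every run of characters other than letters and apostrophes into a
  # newline, lowercase the result, then count each distinct word.
  tr -cs "[:alpha:]'" '\n' \
    | tr '[:upper:]' '[:lower:]' \
    | awk 'NF { count[$1]++ } END { for (w in count) print w, count[w] }'
}

# Example: "the" is counted twice, every other word once.
printf 'The dog walked. The dogs walk.\n' | word_histogram | LC_ALL=C sort
```

Keeping apostrophes inside words is a deliberate choice in this sketch, since it lets possessives and contractions survive as single tokens.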

-----

#! /usr/bin/awk -f
# Creates the suffix histogram of a word histogram
#
/^[^_]/ {
  words[$1]=$2  # word histogram
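  # Copy of the word histogram, used only for lookups. It works around awk
  # misbehaving when the array being iterated is referenced inside the loop:
  # a lookup such as words[k] creates the missing key as a side effect.
  wordt[$1]=$2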
}
END {
  for (x in words) {
    if (length(x)>3) {  # word length > 3
      for (i=length(x)-1;i>=length(x)-3;i--) {  # suffix length <= 3
        # Look for a suffix whose stem part is itself a word.
        stem = substr(x,1,i)
        if (wordt[stem]!="") {
          suffix = substr(x,i+1)
          # Total occurrences of the words that carry this suffix.
          suffixes[suffix] += words[x]
          # Number of distinct words that carry this suffix.
          suffix_count[suffix]++
        }
      }
    }
  }
  for (key in suffixes) {
    # Print only suffixes that are attached to at least 10 distinct words.
    if (suffix_count[key]>=10) print key, suffixes[key]
  }
}
-----
Result from the CHILDES/Brown/Adam corpus (sorted by count):


's 2664
s 1812
r 935
y 539
ed 446
es 444
e 282
n 250 # broke-n, etc.
d 168
er 145
ly 135
'd 76
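
The script itself prints suffixes in arbitrary order, so a listing sorted by count needs a separate sort step. The sketch below runs a condensed version of the same search on a toy histogram and pipes it through `sort`; the `suffix_histogram` wrapper, its threshold argument, and the toy data are assumptions for illustration, and the sketch uses awk's `in` operator for the stem lookup instead of a second array.

```shell
# Hypothetical wrapper: read "word count" lines on stdin, print "suffix total".
# The first argument is the minimum number of distinct words per suffix
# (defaults to 10, the threshold used in the script above).
suffix_histogram() {
  awk -v min_words="${1:-10}" '
    /^[^_]/ { words[$1] = $2 }
    END {
      for (w in words) {
        n = length(w)
        if (n <= 3) continue              # only words longer than 3
        for (i = n - 1; i >= n - 3; i--) {  # suffix length 1 to 3
          stem = substr(w, 1, i)
          # "in" tests membership without creating the key.
          if (stem in words) {
            suffix = substr(w, i + 1)
            total[suffix] += words[w]
            kinds[suffix]++
          }
        }
      }
      for (s in total)
        if (kinds[s] >= min_words) print s, total[s]
    }
  '
}

# Toy run with the threshold lowered to 1, sorted by count in descending order.
suffix_histogram 1 <<'EOF' | sort -k2,2nr
walk 10
walks 4
walked 3
dog 7
dogs 5
cat 2
EOF
# prints:
# s 9
# ed 3
```

Here "s" collects 4 + 5 occurrences from "walks" and "dogs", and "ed" collects 3 from "walked"; with a threshold of 2 only "s" would remain.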

