rondelion AI: Suffix finder

Thursday, November 8, 2012

Suffix finder

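This isn't really an AI topic so much as an elementary NLP exercise.
The AWK script below extracts word suffixes from a word histogram: for each word longer than three characters, it splits off a suffix of one to three characters and, if the remaining stem is itself a word in the histogram, adds the word's count to that suffix's total.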
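
The script reads its input as one "word count" pair per line. The post does not say how the histogram was built, so the following is only a sketch of one way to produce such a file from plain text, assuming whitespace-separated tokens and lowercased words. The function name `word_histogram` and the file names in the usage comment are made up for illustration.

```sh
# Hypothetical helper (not from the original post):
# build a "word count" histogram from plain text on stdin.
word_histogram() {
  awk '{
    for (i = 1; i <= NF; i++) {
      w = tolower($i)
      gsub(/[.,;:!?"()]/, "", w)   # strip common punctuation, keep apostrophes
      if (w != "") count[w]++
    }
  }
  END { for (w in count) print w, count[w] }'
}

# Usage (file names are assumptions):
#   word_histogram < corpus.txt > words.hist
```

The output order is unspecified, which does not matter here because the suffix script loads the whole histogram into an array before doing any work.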

-----

#! /usr/bin/awk -f
# Creates a suffix histogram from a word histogram.
# Input: one "word count" pair per line; lines starting with "_" are skipped.
/^[^_]/ {
    words[$1] = $2   # word histogram
    wordt[$1] = $2   # copy of the histogram used for lookups (to evade the awk bug
                     # that destroys strings within nested references to an array)
}
END {
    for (x in words) {
        if (length(x) > 3) {   # word length > 3
            # Try suffix lengths 1 to 3.
            for (i = length(x) - 1; i >= length(x) - 3; i--) {
                stem = substr(x, 1, i)
                # Look for a suffix whose stem part is itself a word.
                if (wordt[stem] != "") {
                    suffix = substr(x, i + 1)
                    suffixes[suffix] += words[x]   # total count of words with this suffix
                    suffix_count[suffix]++         # number of distinct words with this suffix
                }
            }
        }
    }
    for (key in suffixes) {
        # Print only suffixes used by at least 10 distinct words.
        if (suffix_count[key] >= 10) print key, suffixes[key]
    }
}
-----
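
To make the counting concrete, here is a compact variant of the script above, not the original itself. It uses AWK's `in` operator for the stem lookup instead of a second array, and it takes the minimum number of distinct words as a parameter so that a toy histogram produces output. The names `suffix_histogram` and `min_types` and the sample data are assumptions introduced for this sketch.

```sh
# Hypothetical variant of the suffix script with an adjustable threshold.
# $1 = minimum number of distinct words a suffix must attach to.
suffix_histogram() {
  awk -v min_types="$1" '
  /^[^_]/ { words[$1] = $2 }
  END {
    for (x in words) {
      n = length(x)
      if (n <= 3) continue                 # only words longer than 3 characters
      for (i = n - 1; i >= n - 3; i--) {   # suffix lengths 1 to 3
        if (substr(x, 1, i) in words) {    # stem must itself be a word
          suffix = substr(x, i + 1)
          total[suffix] += words[x]        # summed counts of matching words
          types[suffix]++                  # number of distinct matching words
        }
      }
    }
    for (s in total)
      if (types[s] >= min_types) print s, total[s]
  }'
}

# Toy run:
#   printf 'walk 10\nwalks 4\nwalked 3\nwalker 1\ntalk 5\ntalked 2\n' | suffix_histogram 2
#   prints: ed 5
```

In the toy run only `ed` passes a threshold of 2: it attaches to two distinct words (walked, talked) whose counts sum to 3 + 2 = 5, while `s` and `er` each attach to a single word.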
Result from the CHILDES/Brown/Adam corpus (sorted by count):

's 2664
s 1812
r 935
y 539
ed 446
es 444
e 282
n 250 # broke-n, etc.
d 168
er 145
ly 135
'd 76
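
The script itself prints suffixes in whatever order AWK's `for (key in suffixes)` loop visits them, so getting a list sorted by count takes a separate step. One way to do that is sketched below. The function name `sort_by_count` and the file names in the usage comment are assumptions, not from the original post.

```sh
# Hypothetical helper: sort "suffix count" lines by count, largest first.
sort_by_count() {
  sort -k2,2nr
}

# Usage (file names are assumptions):
#   awk -f suffix.awk words.hist | sort_by_count
```

Sorting on the second field works because the suffixes contain no whitespace, so the count is always field 2.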
