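This AWK script builds a suffix histogram from a word histogram. The input has one word and its count per line. For every word longer than three characters, each ending of one to three characters counts as a suffix if the remaining stem is itself a word in the histogram.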
-----
#! /usr/bin/awk -f
# Creates the suffix histogram of a word histogram
#
# Read every non-empty line that does not start with an underscore.
/^[^_]/ {
    words[$1] = $2   # word histogram
    wordt[$1] = $2   # copy used for lookups inside the loop over words (evades the awk bug that destroys strings within a nested reference to an array)
}
END {
    for (x in words) {
        if (length(x) > 3) {   # only words longer than 3 characters
            # Try suffix lengths 1 to 3.
            for (i = length(x) - 1; i >= length(x) - 3; i--) {
                stem = substr(x, 1, i)
                # Count the suffix only if its stem is itself a word.
                if (wordt[stem] != "") {
                    suffix = substr(x, i + 1)
                    suffixes[suffix] += words[x]   # total occurrences of words with this suffix
                    suffix_count[suffix]++         # number of distinct words with this suffix
                }
            }
        }
    }
    # Print only suffixes used by at least 10 distinct words.
    for (key in suffixes) {
        if (suffix_count[key] >= 10) print key, suffixes[key]
    }
}
-----
Result of running the script on the CHILDES/Brown/Adam corpus (sorted by count):
's 2664
s 1812
r 935
y 539
ed 446
es 444
e 282
n 250 # broke-n, etc.
d 168
er 145
ly 135
'd 76
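
To see how counts like these arise, the same stem test can be tried on a toy histogram. This is only a sketch under stated assumptions: the five-word histogram is invented, `suffix_hist` is a hypothetical helper name, and the minimum of 10 distinct words is dropped so that such a small input prints anything at all. It tests stems with the `in` operator, which does not create array elements, so no copy of the array is needed here. The `sort` call is one way to get the ordering by count, since awk's `for (key in array)` visits keys in no particular order.

```sh
# suffix_hist: hypothetical helper; reads "word count" lines on stdin
# and prints "suffix total" lines, largest total first.
suffix_hist() {
    awk '
    { words[$1] = $2 }
    END {
        for (w in words) {
            n = length(w)
            if (n <= 3) continue                 # same length rule as the script
            for (i = n - 1; i >= n - 3; i--) {   # suffix lengths 1 to 3
                stem = substr(w, 1, i)
                if (stem in words)               # stem must itself be a word
                    total[substr(w, i + 1)] += words[w]
            }
        }
        for (s in total) print s, total[s]
    }' | sort -k2,2nr
}

# Toy histogram, made up for illustration.
printf '%s\n' 'walk 5' 'walks 3' 'walked 2' 'dog 4' 'dogs 6' | suffix_hist
```

With this input, "walks" and "dogs" both contribute to the suffix s (3 + 6 = 9), and "walked" contributes to ed (2), so the output is "s 9" followed by "ed 2".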