Monday, January 14, 2013

A survey on parts of speech

Towards the end of last year, I conducted a small experiment surveying the distribution of words and parts of speech in the CHILDES corpus (the Adam set by Roger Brown, 2004).  The purpose was to see to what extent the distribution of words around a target word contributes to determining its part of speech.  The occurrence of certain words directly before and after the target word served as the feature vector for discrimination.  First, the vector of occurrence probabilities of those words was collected for each part of speech.  In the discrimination phase, the part of speech whose probability vector was closest (in Euclidean distance) to that of the target word was chosen as the candidate part of speech.  With 200 preceding words and 200 following words as features, this method yielded 64% accuracy.  The result was not much different when the target words were limited to unambiguous ones.  Cosine similarity did not work at all.
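
To make the procedure concrete, here is a minimal Python sketch of the nearest-profile idea described above. It is not the code used in the experiment. The function names, the toy tagged sentence, and the choice to take the most frequent words as context features (with the same list for both the preceding and the following position) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def top_words(tagged, n):
    """Return the n most frequent words, used here as context features (an assumption)."""
    counts = Counter(word for word, _ in tagged)
    return [word for word, _ in counts.most_common(n)]


def context_profiles(tagged, features, key):
    """Map each key (a word or a tag) to a vector of context probabilities.

    The vector has 2 * len(features) entries: the probability that each
    feature word occurs directly before, then directly after, the item.
    """
    index = {w: i for i, w in enumerate(features)}
    n = len(features)
    sums = defaultdict(lambda: [0.0] * (2 * n))
    totals = Counter()
    for pos, item in enumerate(tagged):
        k = key(item)
        vec = sums[k]  # creates an all-zero vector on first access
        totals[k] += 1
        if pos > 0 and tagged[pos - 1][0] in index:
            vec[index[tagged[pos - 1][0]]] += 1
        if pos + 1 < len(tagged) and tagged[pos + 1][0] in index:
            vec[n + index[tagged[pos + 1][0]]] += 1
    return {k: [c / totals[k] for c in v] for k, v in sums.items()}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def guess_tag(word_vec, tag_profiles):
    """Pick the tag whose profile is closest to the word's profile."""
    return min(tag_profiles, key=lambda t: euclidean(word_vec, tag_profiles[t]))


# Toy data: (word, tag) pairs, invented purely for illustration.
tagged = [
    ("the", "det"), ("dog", "n"), ("runs", "v"),
    ("the", "det"), ("cat", "n"), ("sleeps", "v"),
    ("a", "det"), ("dog", "n"), ("sleeps", "v"),
]

features = top_words(tagged, 3)
tag_profiles = context_profiles(tagged, features, key=lambda item: item[1])
word_profiles = context_profiles(tagged, features, key=lambda item: item[0])

print(guess_tag(word_profiles["cat"], tag_profiles))
```

On this toy data the sketch assigns "cat" to the noun tag. Note that the target word's own occurrences are included in the tag profiles here; a real evaluation would hold the test words out.
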
Apparently, this method alone does not suffice to determine parts of speech.  The result also shows the difficulty of clustering parts of speech by means of word distribution probabilities (and Euclidean distance).  So I gave up on continuing experiments along this line.  While massive data and state-of-the-art mathematical modeling methods such as iHMM or NPYLM may yield promising results, human children may be using different strategies, such as building semantic categories before learning grammatical categories.

1 comment:

  1. I belatedly read the seminal paper by Redington, Chater, and Finch (1998), "Distributional Information: A Powerful Cue for Acquiring Syntactic Categories," according to which distributional analysis does work with the CHILDES corpus. They used 1.5 million words to obtain their statistics. Ahem, my survey lacked the amount...