Word embeddings, as introduced in Word2Vec, are known to represent semantic clusters of words. While this semantic aspect has received most of the attention, the distributional hypothesis on which word embeddings are based is 'syntactic' in the sense that it is only concerned with the formal features (the distribution) of words, so the embeddings should represent parts of speech (POS) as well. I therefore ran the experiment described below (similar experiments have probably been done elsewhere, but anyway).
- made a small CFG (see below) and generated sample sentences.
- created embeddings with continuous bag-of-words (CBOW) learning.
- clustered the embeddings and compared the clusters with the 'ground truth' (the word-POS correspondence in the grammar).
Set-up
Number of words (voc. size): 20
Number of POS: 7
Number of sentences: 500 (100 was too small)
CBOW learning and the embedding: I set up a simple perceptron-like predictor that predicts a word from its two adjacent words. The weights that predict a word from the hidden layer (number of cells: 10) were used as the embeddings.
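For concreteness, here is a minimal sketch of such a predictor (the names, the averaging of the two context words, and the training-loop details are my assumptions; the actual code is not shown here):

import numpy as np

V, H, lr = 20, 10, 0.1                 # vocabulary size, hidden cells, learning rate
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, H))      # context-word weights (input to hidden)
W_out = rng.normal(0, 0.1, (H, V))     # hidden-to-output weights; column w is the embedding of word w

# toy word-id sentences; in the experiment these come from the grammar below
sentences = [[0, 5, 12, 3], [1, 6, 14]]
triples = [(s[i - 1], s[i], s[i + 1]) for s in sentences for i in range(1, len(s) - 1)]

for left, target, right in triples:
    h = (W_in[left] + W_in[right]) / 2          # hidden activation from the two adjacent words
    p = np.exp(h @ W_out); p /= p.sum()         # softmax prediction over the vocabulary
    p[target] -= 1.0                            # p is now the cross-entropy gradient w.r.t. the scores
    W_in[left] -= lr * (W_out @ p) / 2          # update the two context words...
    W_in[right] -= lr * (W_out @ p) / 2
    W_out -= lr * np.outer(h, p)                # ...and the output weights (used as the embeddings)

embeddings = W_out.T                            # one 10-dimensional vector per word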
Clustering
The figure shows an Isomap projection of the embeddings. The words are clustered according to their parts of speech.
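(For reference, a 2-D Isomap projection like this can be produced with scikit-learn roughly as follows; embeddings and words stand for the learned vectors and the corresponding word strings.)

from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

# embeddings: (20, 10) array of learned word vectors; words: the 20 word strings
points = Isomap(n_neighbors=5, n_components=2).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1])
for (x, y), w in zip(points, words):
    plt.annotate(w, (x, y))
plt.show()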
I also tried neural-network-based clustering methods. As a sparse autoencoder did not work for this purpose, I tried a SOM-like method and got the following result (number of cells: 14; the same training data as for the CBOW training: 500 sentences / 2099 words; one epoch).
Adv 7 [0. 0. 0. 0. 0. 0. 0.13 0. 0. 0. 0. 0. 0. 0. ]
PN 10 [0. 0. 0. 0. 0. 0. 0. 0. 0.06 0.13 0. 0. 0. 0. ]
IV 6 [0. 0. 0. 0. 0. 0.11 0. 0. 0. 0. 0. 0. 0. 0. ]
Adj 2 [0. 0.09 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
Det 1 [0.18 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
TV 4 [0. 0. 0. 0.13 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
N 13 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.06 0.12 0. ]
The table shows the correspondence between the cells and the parts of speech (the second column is the index of the cell most correlated with each POS).
Though the clustering does not always work (it depends on the initial weights), this confirms that the CBOW embeddings generally represent parts of speech in this set-up.
SOM-like learning code:
import numpy as np

# Find the best-matching cell (winner) for the input feature.
min_index = np.argmin(((self.weights - feature) ** 2).sum(axis=1))
# Pull the winner's neighbors (within distance 2) toward the feature,
# scaling the update down by the distance from the winner.
for i in range(-2, 3):
    j = min_index + i
    if i != 0 and 0 <= j < len(self.weights):
        self.weights[j] += self.alpha * (feature - self.weights[j]) / abs(i)
Grammar
S : NP + VP
NP : Det + N1
N1 : Adj + N
N1 : N
NP : PN
VP : Adv + VP
VP : IV
VP : TV + NP
Det - le, un
N - n1, n2, n3
PN - Pn1, Pn2, Pn3
Adj - aj1, aj2, aj3
Adv - av1, av2, av3
IV - iv1, iv2, iv3
TV - tv1, tv2, tv3
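For completeness, sample sentences can be generated from this grammar with a simple recursive expansion; a minimal sketch (the actual generation code is not shown in the post, so the details here are assumptions):

import random

rules = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N1"], ["PN"]],
    "N1": [["Adj", "N"], ["N"]],
    "VP": [["Adv", "VP"], ["IV"], ["TV", "NP"]],
}
lexicon = {
    "Det": ["le", "un"],
    "N":   ["n1", "n2", "n3"],
    "PN":  ["Pn1", "Pn2", "Pn3"],
    "Adj": ["aj1", "aj2", "aj3"],
    "Adv": ["av1", "av2", "av3"],
    "IV":  ["iv1", "iv2", "iv3"],
    "TV":  ["tv1", "tv2", "tv3"],
}

def generate(symbol="S"):
    # Expand a non-terminal by picking a rule at random; emit a random word for a POS.
    if symbol in lexicon:
        return [random.choice(lexicon[symbol])]
    return [w for s in random.choice(rules[symbol]) for w in generate(s)]

sentences = [generate() for _ in range(500)]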