MATLAB: Increasing vocabulary of pre-trained word embeddings


Can we extend the pre-trained word embeddings and increase the vocabulary?

Best Answer

  • Yes. To add more words to the existing vocabulary provided by 'fastTextWordEmbedding', you can try the following:
    1. Obtain the wordEmbedding object from 'fastTextWordEmbedding':
    >> emb = fastTextWordEmbedding;
    2. Obtain the vocabulary from the wordEmbedding object:
    >> vocab = emb.Vocabulary;
    3. Add more words to the string array, for example:
    >> vocab(end+1) = 'Hi';
    >> vocab(end+1) = 'Hello';
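    4. Write the words and their vectors to a text file with UTF-8 encoding in either the word2vec or GloVe text embedding format, or to a zip file containing a text file of this format. Each line of the file holds a word followed by its embedding vector, so every word needs a vector: the vectors for the existing words are available from word2vec(emb, emb.Vocabulary), and vectors for the added words must be supplied separately. You can use fopen, fprintf and fclose for this step.
    5. Use 'readWordEmbedding' to read this text file with the additional words and get a new word embedding object. The doc page for 'readWordEmbedding' explains more about why the file needs to be in the above format.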
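
The steps above can be sketched end to end as follows. This is only a rough sketch, not a definitive implementation: the new words, the file name 'extendedEmbedding.vec' and the small random placeholder vectors are assumptions made for illustration. In practice you would supply meaningful vectors for the new words (for example, trained or derived from related words). It assumes Text Analytics Toolbox and the fastText support package are installed.

```matlab
% Load the pretrained embedding (requires the fastText support package).
emb = fastTextWordEmbedding;

% Existing vocabulary and its vectors (one row per word).
vocab = emb.Vocabulary;
vectors = word2vec(emb, vocab);

% Hypothetical new words; keep only those not already in the vocabulary.
newWords = ["mynewword1" "mynewword2"];
newWords = newWords(~isVocabularyWord(emb, newWords));

% ASSUMPTION: placeholder vectors. Replace with meaningful vectors.
newVectors = 0.01 * randn(numel(newWords), emb.Dimension);

allWords = [vocab(:); newWords(:)];
allVectors = [vectors; newVectors];

% Write in word2vec text format: a header line "numWords dimension",
% then one line per word: the word followed by its vector components.
filename = 'extendedEmbedding.vec';   % hypothetical file name
fid = fopen(filename, 'w', 'n', 'UTF-8');
fprintf(fid, '%d %d\n', numel(allWords), emb.Dimension);
fmt = ['%s' repmat(' %.6g', 1, emb.Dimension) '\n'];
for i = 1:numel(allWords)
    fprintf(fid, fmt, char(allWords(i)), allVectors(i, :));
end
fclose(fid);

% Read the extended embedding back in as a new wordEmbedding object.
newEmb = readWordEmbedding(filename);
```

Note that writing the full fastText vocabulary in a loop like this produces a large file and can take a while. You can check the result with isVocabularyWord(newEmb, newWords).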