my_ml_notes

The World of Embeddings in Natural Language Processing

Definition:

Embeddings are continuous vector representations of words or tokens that capture semantic meaning in a high-dimensional space.

Since machines do not understand words, we first need to convert them to numbers in a process called tokenization.
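As a minimal sketch of that idea (a toy whitespace tokenizer with a made-up vocabulary, not the subword tokenizers real systems use, see [~Ref5]):

```python
# Toy tokenization: map each word to an integer id.
text = "the cat sat on the mat"

vocab = {}        # word -> integer id
token_ids = []
for word in text.split():
    if word not in vocab:
        vocab[word] = len(vocab)
    token_ids.append(vocab[word])

print(vocab)      # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```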

Word similarity is important in semantic tasks. For example, knowing how similar two words are can help in identifying the similarity of the phrases or sentences in which those words are used.

Vector semantics is the standard way to represent word meaning in NLP (Natural Language Processing) [~Ref1]. The idea of vector semantics is to represent a word as a point in a multidimensional space derived from the distribution of its word neighbors. Vectors that represent words in this way are called embeddings.

Word2vec is a common example of dense vectors. The figure below shows a two-dimensional PCA projection of embeddings for some words from a word2vec model, which visualizes how the embeddings have learned the semantic meaning of the words (e.g., countries are grouped together in the vector space). For an example of PCA (Principal Component Analysis), see [~Ref3].


Picture: Image by Author ([~Ref2])
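A plot like the one above can be reproduced roughly as follows. This is a sketch, assuming gensim, scikit-learn, and matplotlib are installed; the pretrained model name is gensim's downloader identifier for the Google News word2vec vectors and triggers a large download on first use.

```python
# Sketch: project pretrained word2vec embeddings to 2-D with PCA and plot them.
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = api.load("word2vec-google-news-300")   # pretrained word2vec vectors

words = ["france", "germany", "italy", "spain",
         "dog", "cat", "horse",
         "king", "queen", "prince"]
vectors = [model[w] for w in words]            # one 300-d vector per word

coords = PCA(n_components=2).fit_transform(vectors)  # 2-D PCA projection

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("2-D PCA projection of word2vec embeddings")
plt.show()
```

Words with related meanings (countries, animals, royalty) should land near each other in the projected space.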

Measuring similarity between two words (v and w) is normally done using the cosine function, which is based on the dot product operator from linear algebra (a.k.a. inner product) [~Ref1]: \(\text{dot-product}(v,w) = v \cdot w = \sum_{i=1}^{N} v_i w_i\). The cosine similarity is this dot product normalized by the lengths of the two vectors.
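A quick NumPy sketch of both quantities, using made-up toy vectors rather than real embeddings:

```python
# Dot product and cosine similarity between two word vectors.
import numpy as np

v = np.array([0.2, 0.7, -0.1, 0.4])   # toy embedding of word v
w = np.array([0.1, 0.6,  0.0, 0.5])   # toy embedding of word w

dot = np.dot(v, w)                                       # sum_i v_i * w_i
cosine = dot / (np.linalg.norm(v) * np.linalg.norm(w))   # normalize by vector lengths

print(dot, cosine)
```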

Dense versus sparse vectors:

Sparse vectors are very long vectors with mostly zero entries, because “most words simply never occur in the context of others”.

Dense vectors, in contrast to sparse vectors, are much shorter and contain real-valued numbers that can be negative. Dense vectors work better for NLP tasks as they do better at capturing synonymy. Examples of methods for computing dense embeddings are skip-gram and word2vec.
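A small sketch of the contrast, with toy numbers chosen only for illustration:

```python
# Sparse count-based vector versus dense embedding (toy values).
import numpy as np

vocab_size = 10_000

# Sparse: one entry per vocabulary word; almost all entries stay zero because
# most words never co-occur with the target word.
sparse_vec = np.zeros(vocab_size)
sparse_vec[42] = 3      # co-occurred 3 times with vocabulary word 42
sparse_vec[1337] = 1    # co-occurred once with vocabulary word 1337

# Dense: short, real-valued (possibly negative), learned by a model such as word2vec.
dense_vec = np.array([0.12, -0.48, 0.33, 0.05, -0.91, 0.27])

print(np.count_nonzero(sparse_vec), "non-zero entries out of", sparse_vec.size)
print("dense vector has", dense_vec.size, "dimensions")
```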

Types of Embeddings

Some types of embeddings that are worth mentioning are:

Comparing Embeddings

References:

[~Ref1] Dan Jurafsky’s book on language models

[~Ref2] Notes on Natural Language Processing with Deep Learning

[~Ref3] Principal Component Analysis - PCA - Step-by-step use of PCA for dimensionality reduction.

[~Ref4] Natural Language Processing with Deep Learning CS224N, Christopher Manning, Stanford.

[~Ref5] Tokenizer Arena