Lemmatization and Stemming using spaCy

Lemmatization

A lemma is the base form of a token. For example, the lemma of walking, walks, and walked is walk. Lemmatization is the process of reducing words to their lemmas.

The following code shows how to reduce words to their lemmas.

Customizing Lemmatization

Custom lemmatization may be needed, for example, when a text refers to a geographical location by a nickname. The following code shows how to map “Angeltown” to “San Francisco”.

Stemming

Stemming refers to reducing a word to its root form. The stem does not have to be a valid word. Stemming algorithms remove affixes (suffixes and prefixes). For example, the stem of “university” is “univers”.

spaCy does not include a stemmer. Instead, it provides lemmatization via dictionary lookup, and each language has its own dictionary. The NLTK package offers several stemmers, including Porter and Lancaster.
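As a brief illustration, the two NLTK stemmers mentioned above can be compared as follows. This sketch assumes the nltk package is installed; the word list is chosen here for illustration, and the stemmers need no additional data downloads.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Illustrative words: three inflections of "walk" plus "university".
words = ["walking", "walks", "walked", "university"]

# Collect (Porter stem, Lancaster stem) for each word.
stems = {word: (porter.stem(word), lancaster.stem(word)) for word in words}

for word, (porter_stem, lancaster_stem) in stems.items():
    print(f"{word:12} porter={porter_stem:10} lancaster={lancaster_stem}")
```

The Porter stemmer reduces “university” to “univers”, which is not a dictionary word. The Lancaster stemmer is generally more aggressive, so its stems can be shorter.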
