The lemmatizer derives the base form (lemma) of a word using a set of rules and an optional dictionary, both of which express the relation between word forms and base forms.
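The interplay of rules and an optional dictionary can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the suffix rules, the `lemmatize` function and the sample dictionary are hypothetical and do not reproduce the rule format, the rules or the interface of the CST Lemmatizer.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical suffix rewrite rules: (word-form suffix, lemma suffix).
# Illustrative only; these are not the rules used by the CST Lemmatizer.
RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(word: str, dictionary: Optional[Dict[str, str]] = None) -> str:
    """Return a base form for `word`.

    The optional dictionary is consulted first; if the word is not listed,
    the longest matching suffix rule is applied. Words that match neither
    are returned unchanged (lower-cased).
    """
    form = word.lower()

    # 1. A dictionary entry, when available, takes precedence over the rules.
    if dictionary and form in dictionary:
        return dictionary[form]

    # 2. Otherwise pick the longest rule suffix that the word ends with.
    best: Optional[Tuple[str, str]] = None
    for suffix, replacement in RULES:
        if form.endswith(suffix) and len(form) > len(suffix):
            if best is None or len(suffix) > len(best[0]):
                best = (suffix, replacement)

    if best is None:
        return form

    suffix, replacement = best
    return form[: len(form) - len(suffix)] + replacement


if __name__ == "__main__":
    irregular = {"mice": "mouse"}  # hypothetical dictionary entry
    for w in ["ponies", "walked", "cats", "mice"]:
        print(w, "->", lemmatize(w, irregular))
```

In this sketch the dictionary handles irregular forms that no suffix rule could derive, while the rules cover words that the dictionary does not list.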
The CST Lemmatizer works for Bulgarian, Danish, English, Estonian, Farsi, French, Greek, Icelandic, Italian, Latin, Macedonian, Dutch, Polish, Portuguese, Romanian, Russian, Slovakian, Serbian, Slovenian, Spanish, Czech, German, Ukrainian and Hungarian. The program supports input in XML or plain text format.
The linguistic resources for all 24 languages are available at http://cst.dk/download/cstlemma/ (they will be moved to the CLARIN-DK repository).
The training data come from various sources. A large part came from the Slovenian CLARIN homepage http://www.clarin.si (MULTEXT-East encyclopedias for Bulgarian, Estonian, Farsi, Macedonian, Romanian, Slovakian, Slovenian, Czech, Ukrainian and Hungarian).
What are Lemmatizers and Stemmers?
Lemmatizers and stemmers are valuable human language technology tools that improve precision and recall in an information retrieval setting. For example, stemming and lemmatization make it possible to match a query in one morphological form with a word in a document in another morphological form. Lemmatizers can also be used in lexicography to find new words in text material and to determine how frequently they are used. Other applications include the creation of index lists for book indexes and of keyword lists. Lemmatization is the process of reducing a word to its base form, normally the dictionary look-up form (lemma) of the word.
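The retrieval use case described above, matching a query in one morphological form against a document word in another, can be sketched in Python as follows. This is an illustration under stated assumptions: the `LEMMAS` table, `lemma_of`, `build_index` and `search` are hypothetical names, and the small lookup table merely stands in for a real lemmatizer.

```python
from typing import Dict, List, Set

# Hypothetical toy lemma table; a real system would call a lemmatizer here.
LEMMAS: Dict[str, str] = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "mice": "mouse",
}


def lemma_of(token: str) -> str:
    """Map a token to its lemma, falling back to the lower-cased token."""
    token = token.lower()
    return LEMMAS.get(token, token)


def build_index(docs: List[str]) -> Dict[str, Set[int]]:
    """Build an inverted index keyed by lemma instead of by surface form."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in enumerate(docs):
        for token in text.split():
            lemma = lemma_of(token.strip(".,;:!?"))
            index.setdefault(lemma, set()).add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the ids of documents containing any form of the query word."""
    return sorted(index.get(lemma_of(query), set()))


if __name__ == "__main__":
    docs = ["She ran home.", "Mice like cheese.", "He runs daily."]
    index = build_index(docs)
    print(search(index, "running"))
    print(search(index, "mouse"))
```

Because both the documents and the query are reduced to lemmas before comparison, the query "running" finds the documents containing "ran" and "runs", which a comparison of surface forms would miss.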
CST = Center for SprogTeknologi / Center for Language and Technology