CLARIN – Anne Sofie Jakobsen

Making texts ready for your corpus in one take using CLARIN-DK tools

– Which tools did you use?

“I used the CLARIN-DK tools, which allowed me to convert and annotate a large number of files for a corpus. As part of my PhD project, I am investigating Danish academic vocabulary, and for that purpose I am compiling a corpus of Danish academic texts, AcaDan.”

– How did the tool help you?

“The tool sped up the process of preparing the files for the corpus because the workflow combined the necessary preparatory steps. Each file was converted from PDF to plain text (.txt), which was then tokenized, lemmatized and POS-tagged. Processing one file takes under a minute, and it is easy to process several files simultaneously and still get individual output files.”

– Would you recommend the tool and to whom?

“Yes, it has been very helpful in my corpus compilation. I would recommend it to researchers who are creating their own corpora and need a fast and easy way of preparing the texts. Note that the tool also offers other kinds of workflows than the one described here, so you can choose the workflow that fits your needs.”


Digital Humanities Lab Denmark
