Inspiration

Research examples

Get inspired by examples of research performed by using the DIGHUMLAB infrastructure.

Tour de CLARIN: Denmark

“Tour de CLARIN” is an initiative that aims to highlight prominent activities of a particular CLARIN national consortium, in this case CLARIN-DK....

Read More

Experiments with Big Video

New technologies give enhanced methods for video ethnography. The DIGHUMLAB community VILA supports research into embodied human interaction in a wide range of environments and with a focus on social...

Read More

Gesta Danorum

Language technology, a shortcut to scientific evidence. This case is an example of how language technology can be exploited in research within the humanities....

Read More

LARM.fm

This case from LARM.fm explains how to locate missing metadata for radio programmes by using the programme schedules. Be inspired to use LARM.fm....

Read More

Research publications

Find examples of relevant publications from the DIGHUMLAB community.

Language-based Materials and Tools

Martinez Alonso, Hector; Plank, Barbara; Johannsen, Barbara, Trærup, Anders; Søgaard, Anders: Active Learning for Sense Annotation. In: Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015. Linköping University...

Read More

LARM.fm research publications

Have, I. & Nielsen, J. (2020). LARM.fm user manual: A digital radio and TV research infrastructure for researchers, teachers and students. 3rd version. Aarhus: DIGHUMLAB. Michelsen, M., Have, I., Krogh, M., & Nielsen, S. K. (eds.). (2018). Tunes...

Read More

Netlab

Scientific articles Brügger, N.: Digital Humanities in the 21st Century: Digital Material as a Driving Force.  Digital Humanities Quarterly, 10(3), 2016. Brügger, N.: Humanities, Digital Humanities, Media studies, Internet studies : An...

Read More

Experimental labs

Davidsen, J. & Kjær, M. (ed.): Introduktion til videoanalyse, Samfundslitteratur, 2018. p. 13-35 Davidsen, J., & McIlvenny, P (2017). Research on Language and Social interaction – Blog Davidsen, J. &...

Read More

User impressions

LARM.fm – Silke Holmqvist

“Media in general, both radio and TV, are heavily under-prioritised as historical sources, even though they are central for our modern time. In a historical perspective, we know very little...

Read More

LARM.fm in practice: A use case with Silke Holmqvist, PhD student, AU

What the LARM.fm platform offers simply cannot be found anywhere else

Silke Holmqvist, a PhD student at the Department of History and Classical Studies at Aarhus University, is putting the finishing touches on her PhD project on perceptions of guest workers’ emotional lives on television from the 1960s to the 1980s. During the project, LARM.fm has enabled her to map the metadata for one of the project’s key source groups, TV programmes.

Silke Holmqvist’s PhD project is about perceptions of guest workers’ emotional lives, as shown and interpreted through images circulating in the public sphere, for example through the media, from the 1960s to the 1980s.

“‘The Turk’ in particular, as he was called back then in the early 1970s, was portrayed as virtuous, hard-working and loyal, for example. In the early 1970s, the stereotype is that he does not drink, goes to bed early and saves his money for his family. At the same time, Danes have started buying television sets, changing the wallpaper, and buying new curtains and designer sofas for their detached houses. An idealised image emerges of the dedicated, thrifty and admirable brown man who arrives at the railway station to devote a period of his life to work far away from his family. Later, in the late 1970s and the 1980s, we see guest workers in Denmark being framed as a problem. Now he was seen on television in social housing estates, for example. He was shown as work-shy, patriarchal and living on welfare benefits. But it is the same man, and throughout the whole period guest workers have of course been just as different from one another as ethnic Danes are,” explains Silke Holmqvist.

Overview | PhD project, Silke Holmqvist

PROJECT TITLE

Inventing an immigrant – An emotional geography of guest worker images in Denmark c. 1960-1989

RESEARCH QUESTION

Excerpt from the project: “I ask how media, including mass media but also the urban environment as a medium in itself, influenced the changing contours of the figure of the guest worker. The research design pays attention to the interaction between the emotional repertoires associated with or identified by visible minority workers and the urban places (material and fictional geographies) in which the guest worker was installed”.

THEORETICAL FOUNDATION

  • Cultural history
  • Immigration and minority studies
  • Emotional geography

EMPIRICAL FOUNDATION (a selection)

  • Television, radio and the printed press on the immigration of the period
  • Collections of guest workers’ recollections
  • Turkish/Danish literature, autobiographies by guest workers

What have you used LARM.fm for?

“I started my PhD by sitting and leafing through all the television programme schedules from three decades at the Royal Danish Library. At the time I did not know about LARM.fm’s digitisation of the programme schedules, and it would have been great to be able to search digitally instead of leafing through them,” says Silke.

Silke has used LARM.fm to check programme schedules: partly for the programmes she had already found herself in the printed programme schedules at the Royal Danish Library, and partly for the DR programmes she was given access to. With LARM.fm she has been able to double-check the programmes’ metadata: when they were broadcast, whether that was in prime time, when they were rebroadcast, how the programmes are described, and so on.

Silke Holmqvist says that she has also used LARM.fm in her teaching, where she has encouraged her students to use it because it is an obvious place to find source material.

“Media in general, both radio and TV, are so under-prioritised as historical sources, yet they are so central to our modern era. From a historical perspective, we know so little about what actually went on in prime time. Danes sat and soaked up an hour and twenty-something minutes on average every day over a twenty-year period, but we do not know very much about what they were watching. And LARM.fm can help give a really good insight into that,” explains Silke.

How Silke has used LARM.fm: an example

“I knew which decades I wanted to search in. I delimited them in five-year intervals to make sure I did not get an unmanageable number of results. Then I created my own projects, where I first searched for gæstearbejder (guest worker), fremmedarbejder (foreign worker), arbejdsindvandring (labour immigration), tyrker (Turk), pakistaner (Pakistani), jugoslaver (Yugoslav) and so on. Then I ran the same searches across decades.

By and large I knew what ought to be there from the physical programme schedules at the Royal Danish Library, but sometimes I was surprised to find a programme I did not know about, or a radio broadcast I could listen to. I have been pleasantly surprised by how many radio programmes I have been able to access.”

Three advantages of LARM.fm

Useful features

Creating your own projects, annotating, and skipping forward in the audio by 10 seconds all work really well.

Intuitive user interface

The platform is easy to access and quick to get started with, and it has a user-friendly interface.

A good place to find historical source material

Radio and TV are under-prioritised as sources but are central to the modern period.

It is really rare not to find something in a LARM.fm search that puts your work into perspective

According to Silke Holmqvist, LARM.fm is an obvious place for colleagues and students to find source material because it has a lot to offer:

“I can easily recommend LARM.fm to others, and I do so all the time. If a colleague is working on a specific historical topic, it is really rare not to find something in a LARM.fm search, such as a radio programme that puts what you are working on into perspective. LARM.fm is a really good place to get a broad orientation on all sorts of things that have happened in our past.”

Silke looks forward to finishing and defending her thesis. Her hope is to be able to continue working with the medium of television and the minority theme in a postdoc position.

“I will continue to need the LARM.fm platform to be available, because what it has simply cannot be found anywhere else,” Silke concludes.

About the researcher

Silke Holmqvist is a PhD student in History and Classical Studies at Aarhus University. She holds a BA in the history of ideas and an MA in cultural history from Aarhus University.

Her research interests include modern history, visual history, cultural history, minority studies and emotional geography.

Read more

Tour de CLARIN: Denmark

In April and May 2019, CLARIN-DK was featured in the CLARIN series “Tour de CLARIN” designed to highlight the many national consortia under CLARIN-EU. 

Since 2016, the Tour de CLARIN initiative has been periodically highlighting prominent user involvement activities in the CLARIN network in order to

  • increase the visibility of its members,
  • reveal the richness of the CLARIN landscape,
  • and display the full range of activities that show what CLARIN has to offer to researchers, teachers, students, professionals and the general public interested in using and processing language data in various forms.

Read the blog posts about CLARIN-DK below or download the full second volume of the “Tour de CLARIN” series.


Introduction | Tour de CLARIN: Denmark

Written by Costanza Navarretta, edited by Darja Fišer and Jakob Lenardič

Denmark has been a member of CLARIN ERIC since February 2012 and is one of its founding members. The Danish infrastructure CLARIN-DK was funded through two projects, the DK-CLARIN (2008-2010), and the DIGHUMLAB project (2011-2017). Since 2018, CLARIN-DK has been funded by the Faculty of Humanities and the Department of Nordic Studies and Linguistics, University of Copenhagen. The Danish national coordinator is Costanza Navarretta and the leading institution is the Centre for Language Technology, which is part of the Department of Nordic Studies and Linguistics.

CLARIN-DK involves the following institutions:

CLARIN-DK is a stable national research infrastructure where researchers can securely deposit, share and download language resources such as domain-specific corpora (e.g., The Danish Parliament Corpus 2009 – 2017 and the Johannes V. Jensen Corpus, a literary corpus collecting the works of the famous early 20th-century modernist poet Johannes V. Jensen), as well as lexicons, word lists, speech transcriptions, and audio/video files. CLARIN-DK also offers online language technology tools, including a tokeniser, a PoS tagger, a lemmatiser for Danish and English, a named entity recogniser for Danish, a keyword extractor, a TEI-to-text converter and a pipeline for linguistic annotation. Tools for performing basic frequency counts of words in textual data are also included, as are visualisation and corpus linguistics tools developed by other research groups, such as Korp and Voyant. Aside from being a certified B Centre, CLARIN-DK also runs a Knowledge Centre called DANSK, which provides expertise and help with using the language resources and technologies offered by the Danish consortium together with the Danish Language Council.

CLARIN-DK is involved in various Danish research projects and networks. For example, it is part of the Danish collaboration initiative DIGHUMLAB, which involves various research communities, such as NetLAB, which is aimed at the cross-disciplinary study of internet materials, and LARM.fm, which is an online platform used for automatically locating missing metadata of broadcast radio programmes. CLARIN-DK is also a partner in the externally funded research project Infrastrukturalisme, with PI Henrik Jørgensen, Aarhus University. The consortium is also involved in a research network, Multimodal Child Language Acquisition, with the University of Hong Kong and The Chinese University of Hong Kong (PI Costanza Navarretta), and contributes tools and guidance to a number of research activities, including the linguistic annotation of medieval documents and TEI encoding of literary corpora, mainly at the University of Copenhagen. CLARIN-DK is also involved in research data management and the promotion of FAIR data in the Humanities.

The CLARIN-DK team participates in the following CLARIN committees: the Standing Committee for CLARIN Technical Centres (Lene Offersgaard, Bart Jongejan), the Legal and Ethical Issues Committee (Sussi Olsen), and the Assessment Committee (Lene Offersgaard as Chair).


Tool | CLARIN-DK presents the CST lemmatizer

Written by Bart Jongejan and Costanza Navarretta, edited by Darja Fišer and Jakob Lenardič

Lemmatizers generalize over the different forms of a word used in free text and provide its lemma, which is the base or dictionary look-up form. They are therefore one of the basic NLP tools which are not only important for NLP, but also for lexicographic work and all text-based studies. They are especially indispensable in morphologically rich languages that have a large number of word forms for the same lemma, which severely hinders querying or processing all of them in running text.

The CST lemmatizer has been developed over many years and as part of various projects, especially the Danish STO (Jongejan and Haltrup 2005) and the Nordic Tvärsök (Jongejan and Dalianis 2009). While it was initially used as a tool to support Danish lexicographic work, it has gradually been extended with a dynamic self-learning algorithm which learns new lemmatization rules from morphological lexica that contain the relations between word forms and their corresponding lemmas. The lemmatization rules are organized in a decision tree.

Unlike other state-of-the-art stemmers and rule-based lemmatizers, the current version of the CST lemmatizer does not learn lemmatization rules from word endings alone: it recognizes a wide variety of derivational patterns, e.g., prefixation, infixation and suffixation. It can therefore deal with languages with different morphological systems. Currently, the CST lemmatizer has been trained on 25 languages. The language-specific trained versions of the CST lemmatizer available from the Center for Language Technology are listed in Figure 1.
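To make the idea of learning lemmatization rules from a morphological lexicon concrete, here is a minimal Python sketch. It is not the CST lemmatizer's algorithm: it is an assumption-laden simplification that learns suffix rewrite rules only (no prefixes or infixes, and a flat rule table instead of a decision tree), and the tiny Danish lexicon is made up for illustration.

```python
from collections import Counter, defaultdict


def learn_suffix_rules(lexicon):
    """Learn (word-form suffix -> lemma suffix) rules from (form, lemma) pairs."""
    votes = defaultdict(Counter)
    for form, lemma in lexicon:
        # Length of the longest common prefix of the form and its lemma.
        k = 0
        while k < min(len(form), len(lemma)) and form[k] == lemma[k]:
            k += 1
        # Record the rewrite with progressively more left context.
        for start in range(k, -1, -1):
            if form[start:]:  # skip the empty suffix
                votes[form[start:]][lemma[start:]] += 1
    # Keep the most frequent replacement for each suffix.
    return {suffix: repl.most_common(1)[0][0] for suffix, repl in votes.items()}


def lemmatize(word, rules):
    """Apply the rule for the longest known suffix; unknown words are returned as-is."""
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in rules:
            return word[:start] + rules[suffix]
    return word


# Hypothetical toy lexicon of (word form, lemma) pairs.
TOY_LEXICON = [
    ("biler", "bil"), ("bilen", "bil"),
    ("huse", "hus"), ("huset", "hus"),
    ("stolen", "stol"), ("bordet", "bord"),
]

rules = learn_suffix_rules(TOY_LEXICON)
print(lemmatize("skabet", rules))  # unseen word, handled by the learned "et" rule -> skab
```

Preferring the longest matching suffix lets specific rules (whole words seen in training) override general ones (a bare ending), which loosely mirrors the specific-before-general ordering a decision tree of rules provides.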


Figure 1: The languages for which the trained CST-lemmatiser is available.

Danish and English texts can be lemmatized online with the CST lemmatizer. The lemmatizer is available for download via GitHub. Figure 2 shows the CLARIN-DK web service for the CST lemmatizer, while Figure 3 shows a Danish example sentence that was lemmatized with the tool.


Figure 2: The online CST lemmatiser on CLARIN-DK.


Figure 3: Lemmatization of the Danish sentence Dog, året der er gået, kan også have budt på tunge stunder – ikke alt er glæde for os alle (“However, the past year can also have provided sad moments – not everything can give happiness to all of us”), which is taken from the Danish Queen’s 2017 New Year’s Eve speech.

The CST lemmatizer trained for Danish has been used in many NLP projects, but also outside the NLP community. Frederik Hjorth, a political science researcher at the Department of Political Science, University of Copenhagen, has applied the CST lemmatizer to political speeches as one of the preprocessing steps in order to investigate how members of the existing political parties have addressed right-wing populists who have been challenging the order of the established political system (Hjorth 2018). The results of the study indicate that young politicians are often willing to engage with the populists as well as with other politicians across the political spectrum in the name of democratic freedom (which Hjorth calls the strategy of engagement), while older politicians often describe the populist challengers as morally illegitimate (which Hjorth calls the strategy of disparagement) and refuse to enter into discussion with them.

The CST lemmatizer was also used for many other languages in different linguistic projects. For example, it was trained on Russian (Sharoff and Nivre 2011) and then used e.g. for event identification (Solovyev and Ivanov 2016), and for anaphora and co-reference resolution (Toldova et al. 2014).

References

Jongejan, Bart and Dorte Haltrup. 2005. The CST Lemmatiser. Center for Sprogteknologi, University of Copenhagen version 2.7. http://cst.dk/online/lemmatiser/cstlemma.pdf

Jongejan, Bart and Hercules Dalianis. 2009. Automatic Training of Lemmatization Rules That Handle Morphological Changes in Pre-, in- and Suffixes Alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 – ACL-IJCNLP ’09. Suntec, Singapore: Association for Computational Linguistics, p. 145.

Hjorth, Frederik. 2018. Establishment Responses to Populist Challenges: Evidence from Legislative Speech. 2018 Annual Meeting of the Danish Political Science Association. http://fghjorth.github.io/papers/responses.pdf

Sharoff, Serge and Joachim Nivre. 2011. The proper place of men and machines in language technology: Processing Russian without any linguistic knowledge. In Proc. Computational Linguistics and Intelligent Technologies DIALOGUE2011, Bekasovo, 591–604. https://pdfs.semanticscholar.org/36df/5fbe04f425e9b089437e979581d1f5375a94.pdf

Solovyev, Valery and Vladimir Ivanov. 2016. Knowledge-driven event extraction in Russian: corpus-based linguistic resources. Computational Intelligence and Neuroscience, 11 pages. https://doi.org/10.1155/2016/4183760

Toldova, Svetlana et al. 2014. RU-EVAL-2014: Evaluating Anaphora and Coreference Resolution for Russian. Computational Linguistics and Intellectual Technologies, Vol. 13 (20), pp. 681-694.


Resource | CLARIN-DK presents the Grundtvig’s Works Corpus

Written by Dorte H. Hansen and Costanza Navarretta, edited by Darja Fišer and Jakob Lenardič

Nikolai Frederik Severin Grundtvig was a theologian, a priest, a philosopher, a poet, a writer, a teacher and a politician (member of the Rigsdagen, one of the two parts of the Parliament), who lived in Denmark between 1783 and 1872. He was a contemporary of Hans Christian Andersen and Søren Kierkegaard. Grundtvig’s ideas have had a lasting impact on many areas of Danish culture, such as education, politics and the church. For example, Grundtvig advocated a reform of the school system, which also included educating adults to participate actively in society and in cultural life. Grundtvig is therefore considered to be the mind behind the folk high school. He was part of the national romantic movement and contributed to the development of Danish national awareness. Grundtvig’s written works are thus an important key to the understanding of Danish culture and mentality.


Figure 1: N.F.S. Grundtvig

The collection Grundtvig’s Works is published by the Grundtvig Center at the University of Aarhus and will contain 1,000 text-critical and annotated editions of N.F.S. Grundtvig’s printed works when finalized in 2030. The works are available to the public through a searchable interface that includes registers of persons, places and Bible citations.

The researchers at the Grundtvig Center wanted a reliable and consistent way to cite the publication and a sustainable and interoperable environment in which they could share the work with other scholars and the general public. Since the Grundtvig Center itself does not offer the possibility of downloading the underlying files, CLARIN-DK was approached as a repository provider.


Figure 2: The corpus in the CLARIN-DK repository

The corpus, now deposited in CLARIN-DK’s DSpace repository (http://hdl.handle.net/20.500.12115/31), consists of approximately 1,300 TEI-encoded XML files, of which approximately 450 are critical editions manually annotated with person names, place names, mythological names, Bible citations and comments. When new versions of the works are released, they will be uploaded as new versions of the corpus in the CLARIN-DK repository.


Figure 3: A look into Haandbog i Verdens-Historien (Handbook in World History) from 1833.

The language excerpt in Figure 3 shows the old orthography from before the Danish orthographic reform of 1948, e.g.:

Original: … som Man i det attende Aarhundrede troede, at Solen, efter Sigende, staaer stille istedenfor at staae op …
Normalised Danish: … som man troede i det 18. århundrede, at solen efter sigende står stille i stedet for at stå op …
Literal English translation: … as thought in the 18th century, that the sun after what they said, is staying still instead of rising …

Furthermore, the excerpt shows the manual mark-up of the corpus, done by philologists at the Grundtvig Center. There are references to, e.g., person names (Joseph), mythological places (Midgaard) and actual places (Europe), as well as comments on parts of the text (Overhuggelse af Knuden, literal English translation: the cut of the knot). The actual comment is not shown in the text.
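As a rough illustration of how such mark-up can be read programmatically, the Python sketch below extracts person and place names from a TEI fragment using only the standard library. The fragment is made up for this example and is not taken from the corpus; the use of the standard TEI elements persName and placeName is an assumption, and the actual files may encode names differently (for instance with additional attributes or other elements for mythological names).

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>... <persName>Joseph</persName> ... <placeName>Midgaard</placeName>
    ... <placeName>Europe</placeName> ...</p>
  </body></text>
</TEI>"""


def extract_names(tei_xml):
    """Return (person names, place names) found in a TEI document string."""
    root = ET.fromstring(tei_xml)
    persons = ["".join(el.itertext()).strip()
               for el in root.iterfind(".//tei:persName", TEI_NS)]
    places = ["".join(el.itertext()).strip()
              for el in root.iterfind(".//tei:placeName", TEI_NS)]
    return persons, places


persons, places = extract_names(SAMPLE)
print(persons, places)
```

Lists extracted this way could feed a register of persons and places or a Linked Data export, provided the element names are adjusted to the corpus's real encoding.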

The corpus is an excellent resource for researchers who wish to apply digital methods to investigate various aspects of Grundtvig and his epoch. For example, researchers might want to investigate Grundtvig as a historical person, address the literary language or orthography of the 19th century, or dig into his work when studying the theoretical background of the Danish folk high school tradition. The corpus is also important for scholars applying Linked Data in order to investigate the 19th century, since the corpus contains annotations of people, places and events.


Event | CLARIN-DK presents the event “Teaching the teachers – an interactive workshop for the Voyant Tools”

Written by Lene Offersgaard and Dorte H. Hansen, edited by Darja Fišer and Jakob Lenardič

Digital methods are only slowly gaining ground in the teaching of literary studies in Denmark. While many lecturers are interested in introducing digital methods to their students, they often lack the knowledge of existing tools. From previous workshops, CLARIN-DK learned that neither traditional NLP tools like lemmatizers, POS-taggers, and named entity recognizers, nor simple command line scripting, were suitable in such teaching scenarios. This is why CLARIN-DK started to explore other technologies, such as data visualization tools that could serve as a better and easier entry point to the use of digital methodologies for non-computational researchers and teachers.

We opted for Voyant Tools, introduced to us by information specialists from HUMlab – a datalab at the Copenhagen University Library. Voyant Tools is an online environment that performs automatic text analysis with functionalities such as word frequency lists, frequency distribution plots, and KWIC displays (Figure 1). CLARIN-DK and HUMlab have organized several interactive workshops presenting the use of this environment to lecturers and researchers at the Faculty of Humanities at the University of Copenhagen. CLARIN-DK hosted a dedicated event at the Department of Nordic Studies and Linguistics on 21 November 2018, which was attended by 12 teachers and researchers.


Figure 1: the Voyant Tools

In order to tailor the events to the needs of the participants, CLARIN-DK asked some of them in advance which literary works were most relevant to showcase and which research questions could be investigated and discussed during the events. They opted for novels written around the Modern Breakthrough period, an era in Scandinavian literature which started at the end of the 19th century and in which Naturalism replaced Romanticism. The Archive of Danish Literature (http://adl.dk) provided a collection of 54 novels. The novels were preprocessed and uploaded to a local instance of Voyant Tools by the CLARIN-DK team and information specialists from HUMlab.

One research question addressed the use of terms before and after the Modern Breakthrough (1870 – 1890): whether it was possible to visualize changes in the use of, for example, terms for emotions (like love), which are typical of the Romantic period, compared to more concrete terms (like work), which should be more common in the Naturalist novels. Using the Trends tool in Voyant (Figure 2), it was found that the term for love is used relatively more often before 1875 than after 1888. Moreover, the term for work is not used in the novels before 1875, while it is used after that. The use of these terms therefore indicates a shift in common themes around the Modern Breakthrough. However, this simplistic method cannot single out the individual novels that represent the Modern Breakthrough.
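The before/after comparison described here boils down to relative term frequencies per period. The Python sketch below shows one way to compute them; it is a generic illustration, not a description of how Voyant's Trends tool works internally. The token lists are invented, and the Danish words kærlighed (love) and arbejde (work) are assumed stand-ins for the terms used in the workshop.

```python
from collections import Counter


def relative_frequency(tokens, term):
    """Occurrences of term per 1,000 tokens (0.0 for an empty token list)."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[term] / len(tokens)


def period_frequencies(novels, term, before=1875, after=1888):
    """Pool the novels published before/after the cut-off years and compare.

    novels is a list of (publication year, list of tokens) pairs.
    """
    early = [tok for year, toks in novels if year < before for tok in toks]
    late = [tok for year, toks in novels if year > after for tok in toks]
    return relative_frequency(early, term), relative_frequency(late, term)


# Hypothetical toy data: (year, tokens).
TOY_NOVELS = [
    (1850, ["kærlighed", "og", "kærlighed", "hjerte"]),
    (1890, ["arbejde", "og", "kærlighed", "fabrik"]),
]

print(period_frequencies(TOY_NOVELS, "kærlighed"))
print(period_frequencies(TOY_NOVELS, "arbejde"))
```

Normalising by the number of tokens matters because the two periods are unlikely to contain the same amount of text; raw counts would otherwise favour the larger period.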


Figure 2: The chronological distribution of the terms love vs. work for the period between 1826 and 1899 with regard to 54 novels

We therefore investigated whether other tools in Voyant could also confirm the differences between the two literary periods. In the ScatterPlot tool it is possible, among other things, to visualize the results of a document similarity analysis. Figure 3 shows the document similarity using the TF-IDF frequency count for all novels in the corpus. In the figure, the novels by Herman Bang and a few novels by Sophus Schandorph are clearly separated from the other works. The late 19th-century novels by these two writers are considered representative of the Modern Breakthrough. It was now up to the researchers to interpret the similarities in the other groups of the scatter plot and from there to pose more research questions.
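For readers unfamiliar with TF-IDF, the following Python sketch shows the general technique: each document becomes a vector of term weights, and documents are compared by cosine similarity. This is a textbook formulation under stated assumptions (term frequency normalised by document length, natural-log inverse document frequency); how Voyant weights terms and projects documents into a scatter plot is not described in this text, and the toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a document name to its token list; returns name -> {term: weight}."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "novels".
TOY_DOCS = {
    "a": ["love", "heart", "love"],
    "b": ["love", "heart", "dream"],
    "c": ["work", "factory", "work"],
}

vecs = tfidf_vectors(TOY_DOCS)
print(cosine(vecs["a"], vecs["b"]), cosine(vecs["a"], vecs["c"]))
```

Documents that share distinctive vocabulary score closer to 1, while documents with no terms in common score 0, which is the kind of separation that lets a group of novels stand apart in a similarity plot.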


Figure 3: Novel similarity based on TF-IDF counts.

In this and other workshops, the participants soon realized that studying texts through isolated words (word forms) was limiting, and there was a clear need for lemmatization. Moreover, the need for PoS-tagged texts became evident since some researchers were interested in investigating adjectives showing emotions, while others were interested in analysing events, requiring the automatic extraction of verbs. Despite this, Voyant Tools has proved to be very illustrative and useful to get a first quantitative overview of a collection of novels, and it allowed the comparison of two or more novels.

As a follow up to this event, the CLARIN-DK team will organize a workshop introducing corpus tools and corpus querying techniques in linguistically annotated texts for Literary Studies. The event will also showcase how automatic linguistic annotations are performed on texts from before and after the Danish orthographic reform of 1948, and discuss how it is possible to circumvent problems encountered when applying NLP tools developed for contemporary texts to older texts.


Interview | Interview with Klaus Nielsen, the chief editor at the Grundtvig Study Centre

The interview was conducted via Skype by Jakob Lenardič.

1. What is your scholarly background and your current academic position?

I obtained my PhD from the University of Copenhagen in 2012 and my thesis was a combination of traditional literary theory and book history, a philological field that focuses on a more mechanical-analytical study of the publication process of literary works. I focused on Gittes monologer, a famous collection of satirical poems by the Danish poet Per Højholt published in different versions between 1980 and 1984. I was able to observe crucial textual differences between their various published versions, which allowed me to arrive at a much richer interpretation of the poems that wouldn’t be possible with the final, best-known 1984 version alone. This showed me how important it is to combine traditional qualitative literary analysis with analytical methods that also take into consideration non-textual information such as publication history.

I now work as chief editor at Grundtvig Study Centre, where we are preparing a critical edition of the collected works of N.F.S. Grundtvig, a very prolific and multidisciplinary Danish author who published around 37,000 pages of text from 1804 to his death in 1872. We are making this corpus available in an online environment, with manual annotations that follow the scholarly standards of textual criticism. In a sense, my PhD was an important methodological steppingstone for my current work related to the Grundtvig’s Works Corpus, which also involves a close study of the differences between the various published editions.

2. The Grundtvig’s Works Corpus has been published through the CLARIN-DK repository. How did this collaboration start? How do you benefit from this collaboration?

We released the first version of our corpus through the CLARIN-DK repository in 2018 at the suggestion of Lene Offersgaard, with whom we were collaborating on a related project at the time. This was a great opportunity for us because we had been receiving feedback from some of our more devoted users who said they wanted the corpus in a downloadable format. We’ve also made an agreement with CLARIN-DK that as soon as we publish a new version of the corpus through our online environment, we’ll also update the version deposited in the repository with the newest, more richly annotated one.

3. How is Grundtvig’s corpus structured? What are some of the challenges you come across when annotating the corpus?

The corpus is extremely varied in terms of content, since Grundtvig was a polymath who wrote on a variety of different subjects. Perhaps most prominently, he wrote books on Danish history and Nordic mythology, carried out linguistic studies of Old Icelandic and Old English, translated from Latin, wrote political and philosophical texts, and composed around 1,500 hymns, many of which are still sung today in Denmark. For this reason, Grundtvig’s views are representative of the intellectual and cultural zeitgeist of Denmark in the 19th century.

There’s a downside to his varied repertoire, in that annotation is still manually intensive. We do use a database for place and person names that we feed into a named-entity recognizer, but even in this case, we often have to manually verify the results. For example, Grundtvig often refers to the philosopher Søren Kierkegaard, who was a contemporary of his, and our software is generally successful in identifying this particular named entity. However, Grundtvig often refers to him by his last name only, but since Søren Kierkegaard had a brother who was also a published author in the same period, we have to manually check the automatic recognition to make sure that the software made a link to the correct referent. In addition to this, we often come across obsolete words, in which case we manually add their possible historical meaning. This can only be done by closely reading and interpreting the surrounding text. Nevertheless, we will use the parts that have already been annotated as a baseline for a semi-automated processing of the remaining two-thirds of the corpus in the future.

One of the greatest challenges in terms of mark-up pertains to identifying Biblical references, especially in cases where Grundtvig doesn’t use direct quotes taken from the Bible but his own modified variants, or where he makes indirect references to the more obscure motifs and quotes. Although we have theologians, both internal and external, who closely read the texts and manually identify such references, it would be invaluable if we could also make use of a language tool that would help automate this process of identification. I don’t think that such a tool exists yet, but it would be a very welcome addition to the CLARIN infrastructure in my opinion. Similarly, it would be great to have a tool that can automatically recognize proverbs and sayings, which abound in Grundtvig’s works, given that his work is a major part of the Danish cultural heritage. Although I’m not an expert in digital technologies, it seems that developing such a tool wouldn’t be too hard a task, as there already exist ready-made digital collections of Danish proverbs that could be used as a baseline for training the tool.

4. Has the corpus been successfully used by an external research project?

Yes, Baunvig and Nielbo (2017) have used our corpus in a case study to determine how digital methods can benefit the analysis of very large collections of written text, and uncover new perspectives and interpretations. Grundtvig Studies is a popular subfield in literary history in Denmark, and many studies on Grundtvig have been published in the past fifty years. However, previous researchers weren’t able to use digital methods and tools, which means that their claims were influenced by the limitations inherent to a purely manual approach to analysis. As I’ve said, Grundtvig produced around 37,000 pages in his lifetime, which is simply too much text for an individual researcher to read and then be able to recollect the finer details. For instance, there is an older study in which it is claimed that Grundtvig started suffering from a series of psychological problems in the 1830s, which was reflected in the texts he wrote in this decade. However, Baunvig and Nielbo (2017) were able to show, by using quantitative methods such as measuring the amount of information entropy in the corpus, that his psychological turmoil actually started earlier than was previously claimed, which is of course an important finding from a purely historical viewpoint. There has also been a follow-up study of our corpus conducted by Nielbo et al. (2018).
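The interview does not describe how the entropy measurement was carried out, so the following is only a minimal sketch of one common way to quantify "information entropy" in text: the Shannon entropy of a text's word-frequency distribution, computed per year. It is not the procedure used by Baunvig and Nielbo; the function name, the whitespace tokenisation and the toy texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the word-frequency distribution of `text`.

    Tokenisation is a plain lowercase whitespace split (an assumption;
    a real study would use proper linguistic preprocessing).
    """
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # H = sum over word types of -p * log2(p)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical toy data: one short text per year (not taken from the corpus).
texts_by_year = {
    1825: "lys lys lys lys",      # one repeated word: lowest possible entropy
    1835: "lys mørke liv død",    # four different words: higher entropy
}

entropy_by_year = {year: word_entropy(t) for year, t in texts_by_year.items()}

for year in sorted(entropy_by_year):
    print(year, round(entropy_by_year[year], 3))
```

A series like `entropy_by_year` can then be plotted over time, so that shifts in how repetitive or varied an author's vocabulary is become visible without reading every page.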

5. What makes this corpus particularly valuable for the CLARIN infrastructure?

I think that our rather thorough manual approach to the corpus is an important contribution for a more accurate understanding of the historical developments of the Danish language, especially its orthography. What is important in this respect is that there were no orthographic rules in Grundtvig’s time, only tendencies, which means that spelling was quite liberal in comparison to contemporary Danish. Consequently, we’re often in doubt whether the way Grundtvig spelled a certain word is an instance of spelling variation that was attested at the time or if it is just a spelling mistake on his part. This is particularly problematic in cases where Grundtvig’s idiosyncratic spelling can’t be found in the historical dictionaries of 19th century Danish, since this intuitively makes you think that the spelling variant was a mistake. However, such dictionaries weren’t compiled on the basis of the original edition but often used later published editions that had gone through the editing process, where spelling variation was normalized. This means that if a researcher wanted to study the vocabulary of 19th century Danish just on the basis of such dictionaries, he or she would miss the attested variations and consequently get a warped view of how people actually wrote at the time. By contrast, we spend a lot of time closely analysing and proofreading the materials, so we are able to present a resource that serves as a much more complex, as well as accurate, presentation of the linguistic situation at the time.

6. Could you give an example of such orthographic variation? How did you resolve it?

I actually came across a fairly interesting orthographic problem just recently when I was annotating Grundtvig’s History of the Northmen, which is one of the few texts he had written in English. In this text, Grundtvig used the word kempion in the sense of “champion” or “hero”; however, this spelling variant isn’t listed in the Oxford English Dictionary, which only includes the variant campion with an a instead of an e. Because my colleagues and I weren’t sure how to solve this issue, we consulted a Professor of Middle English, and he believed it to be a spelling mistake that should be corrected in the edited corpus, given that the Oxford English Dictionary is extremely comprehensive and thorough in its account of English etymology. However, when I searched for the variant kempion on Google, I found out that it was actually attested at the time, and it was for instance used by Sir Walter Scott in his 1822 novel The Pirate, which Grundtvig was alluding to.

7. Are there any other aspects of the CLARIN-DK infrastructure that are important for your work at the centre?

Yes, especially in relation to how proactively they reach out as part of their user-involvement initiative. Last year, CLARIN-DK organized a tutorial for the philologists at our centre where they demonstrated how Voyant tools can simplify our annotation process. Using Voyant has turned out to be extremely helpful when we come across obsolete phrases the meaning of which we don’t know and can’t find in the historical dictionaries. By using Voyant’s extended search capabilities and visualisation tools, we are now able to easily chart the occurrences of this unknown phrase in the entire corpus, and then extract only those texts where this phrase seems to occur in a similar context, which then helps us determine its actual meaning.

I am also pleased to say that CLARIN-DK has already made the first version of our corpus available through their installation of the Voyant tools. We plan on updating this test version with newer ones with regularity. In the long run, I believe the availability of the corpus through CLARIN-DK’s Voyant tools will significantly streamline user assistance.

8. Your professional website says that you’re also interested in audio literature. Is this something that you’re still actively researching?

No, my research on audio literature was mostly confined to my PhD project, because Per Højholt, who is the author of the poems that I was analysing, had read them aloud on Danish radio in the 1980s. By using an audio-analysis program called PRAAT, I measured prosodic features such as the author’s pitch and reading speed, and I was able to see how he deliberately changed his voice in accordance with the way the point-of-view character developed through the course of the poems’ narrative. This was a rather small but important finding since it hadn’t been previously acknowledged in the relevant literature on Gittes Monologer how the author’s spoken performance of his own work added new dimensions to the understanding of the poems themselves.

9. What kind of new research questions does audio literature offer in the context of Digital Humanities? Do you think that CLARIN could contribute to this field?

When I was writing my thesis, research on audio literature was still a very new field, but nowadays it is more readily agreed upon that audio recordings can serve as crucial material for textual analysis. Literary theorists are now conducting important research on the link between the reader of the audio text and the content of the text itself, and this opens up many interesting questions. Let’s say, for instance, that we are dealing with a novel written in the first person, and that the narrator is a woman. Should the reader of the audio version then also be a woman, or conversely, what interpretative repercussions would arise if the reader were actually a man? That is, the person’s voice crucially affects the way people perceive the text, much in the same way that the sort of typography of an old book can evoke various pre-conceptions in the reader about the book’s content.

Given how audio literature opens up interesting questions relevant for the emerging digital humanities, I think that new digital tools for analysing recorded literary works would serve as very welcome additions to the CLARIN infrastructure.

10. What are your hopes for CLARIN-DK in the future?

I think that one of the future challenges for Digital Humanities in Denmark is to find a common platform where our whole research community can have more unified and interoperable access to as many carefully annotated resources as possible. I believe that CLARIN-DK is an excellent candidate in the country for this, because our experience with releasing the Grundtvig’s Works corpus has proven to us that their repository is a stable environment through which corpora can be released in a sustainable fashion and with well-presented metadata. On top of that, the repository also allows us to integrate our corpora with other services in the consortium. For this reason, it can only be a good thing if more digital humanities scholars in Denmark decide to deposit their resources in the CLARIN-DK repository.


Tour de CLARIN Volume II publication now available

This second volume of Tour de CLARIN is organized into two parts. In Part 1, we present the seven CLARIN countries which have been featured since November 2018, when the first volume was published: Estonia, Latvia, Denmark, Italy, Slovenia, Hungary, and Bulgaria.

In Part 2, we present the work of the four Knowledge Centres that have been visited thus far: the Knowledge Centre for treebanking, the Knowledge Centre for the Languages of Sweden, the TalkBank Knowledge Centre, and the Czech Knowledge Centre for Corpus Linguistics.


Download now

Get in touch with CLARIN-DK

Questions or comments? Do you wish to join the community? Or do you want to get started with a workshop or a meeting?

Please get in touch with Costanza Navarretta, community lead in CLARIN-EU.

Read more

DIGITAL JOURNEYS: Helle’s case from the Digital Literacy course

New Paths, Old Sources: Cityscapes in the Danish Press, 1905-2005

Helle Strandgaard Jensen and Mikkel Thelle, both Associate Professors at Department of History and Classical Studies, Aarhus University, have studied the changing representation of cities in historical newspapers. By studying cityscapes in the Danish press, 1905-2005 they created a workflow which enables historians to do distant readings of newspapers. Participating in the Digital Literacy course has taught Helle more about the possibilities and limitations of digital methods.

Motivation

Helle’s motivation for participating in the Digital Literacy course was twofold. First, she had previously worked with digital literacy from a historical perspective, looking at how concepts of ‘media literacy’ had changed in the second half of the twentieth century, and now wanted to explore the phenomenon in relation to her own discipline. Secondly, Helle and her colleague Mikkel Thelle wanted to provide fellow historians with an approach to digital methods they could adapt to their own subfield:

“We wanted to show other historians some of the advantages of using digital methods. We wanted to prove how explorative approaches can complement the research methods we traditionally use in our field.”

In History, digital methods can be used for overcoming several challenges. One is to provide an explorative approach to large data sets. Using distant reading methods, researchers can explore connections in large amounts of data without committing to specific research questions from the beginning. These connections can generate new types of questions for later close reading:

“Historians have always worked with a lot of different types of data which can be combined in different ways. Historians have used statistics, big data and computers in their work since the 1950s. But to have an explorative approach is important. Being able to examine connections in a large set of data generates questions that can complement other types of data.”

About the project

New Paths, Old Sources: Cityscapes in the Danish Press, 1905-2005

Helle Strandgaard Jensen is an Associate Professor of Contemporary Cultural History. Her research focuses on two areas: Media history and historians’ use of digital and analogue archives. The Digital Literacy project is made in collaboration with her colleague Mikkel Thelle and aims to show other historians how digital methods can be used:

“The question was if we were able to produce a workflow that could be adopted by historians fairly easily, provide them with an understanding of what digital methods can do, and finally allow them to work with the research questions they are used to working with.”

Method

  • An IT supporter from the Digital Literacy course, Ross Deans Kristensen-McLachlan, developed a script that automatically aggregates the data Helle and Mikkel wanted to collect for their project. The script allowed the researchers to ask many kinds of questions about the representation of Danish cities in three big newspapers from 1905 to 2005. The results, in the form of changing cityscapes, can then be visualised on a heat map.
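The script itself is not shown in this case, so the following is only a rough sketch of the kind of aggregation such a workflow might perform: counting how often each city is mentioned per decade and arranging the counts as a grid that a plotting library could draw as a heat map. The input format (pairs of year and text), the function names and the sample records are hypothetical and do not reflect the actual SMURF output or the project's code.

```python
from collections import Counter


def mentions_per_decade(records, cities):
    """Count mentions of each city per decade.

    `records` is an iterable of (year, text) pairs; this input format is
    an assumption made for the sketch. Returns a Counter keyed by
    (decade, city). Matching is a simple case-insensitive substring count.
    """
    counts = Counter()
    for year, text in records:
        decade = (year // 10) * 10
        lowered = text.lower()
        for city in cities:
            counts[(decade, city)] += lowered.count(city.lower())
    return counts


def as_grid(counts, decades, cities):
    """Arrange counts as rows (decades) by columns (cities) for a heat map."""
    return [[counts[(decade, city)] for city in cities] for decade in decades]


# Hypothetical sample records, not real newspaper data.
records = [
    (1905, "Aarhus vokser, og Aarhus bygger"),
    (1912, "København og Aarhus"),
    (1955, "København København"),
]
cities = ["Aarhus", "København"]
decades = [1900, 1910, 1950]

grid = as_grid(mentions_per_decade(records, cities), decades, cities)
for decade, row in zip(decades, grid):
    print(decade, row)
```

Keeping the counting step separate from the grid step means the same counts can be regrouped, for example by year instead of decade, without re-reading the source texts.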

Data

  • Query results from SMURF which collects data from Mediestream, the media collections of the Royal Danish Library.

New personal competences

Experience with digital methods

Knowing the limits and possibilities of digital methods.

Working with large data sets

Better understanding of which research questions can be asked with a large set of data.

Communication and teaching

Communicating and teaching digital methods to students and colleagues.

Next steps

There is a growing interest in digital methods at the Department of History and Classical Studies. A BA course has been established in which students are introduced to basic digital approaches. Having participated in the Digital Literacy course has also been helpful in this regard:

“I’ve gained more confidence when I now go back and teach my students in digital methods. What entry level do we need to have, and how do we make the challenges smaller for novices?”

The workflow that Helle and Mikkel created in the Digital Literacy project will be incorporated in teaching and will also be used for asking other types of research questions.

With her colleague Adela Sobotkova, Helle is the leader of Center for Digital History (CEDHAR). In this role, she will continue to work with digital methods in research and teaching.

Behind the researcher

Helle Strandgaard Jensen is an Associate Professor of Scandinavian cultural history at the Department of History and Classical Studies, School of Culture and Society, Aarhus University. 

She is also co-director of Center for Digital History Aarhus (CEDHAR).

Her research interests include e.g. contemporary media history in Scandinavia, Western Europe and the US after 1945.

Behind the Digital Literacy course

The Digital Literacy project is a competence development project organised by the Digital Arts Initiative at Aarhus University. It is a unique opportunity for researchers to qualify themselves in the digital area – with their own research questions as a point of departure.

Read more

Mapping the Danish web and Danish digital history

How the mapping of the Danish web happens

Through the supercomputer at The Royal Danish Library and newly developed algorithms, Professor Niels Brügger dives into the Danish part of the World Wide Web to map our digital history. Here he tells how.

Read more

DIGITAL JOURNEYS: Vladimir’s case from the Digital Literacy course

Powering large-scale reviews of energy security vs. social impact literature with topic modelling to locate cross-referencing between them

Vladimir Douglas Pacheco Cueva, Associate Professor of International Studies at Aarhus University, has embarked on a digital quest to expand his data sets to test if his analyses and hypotheses hold once scaled up. This case gives an insight into his digital journey through (and beyond) his participation in the Digital Literacy course at Arts, Aarhus University.

Read more

DIGITAL JOURNEYS: Kirstine and Anne’s case from the Digital Literacy course

Unveiling the character gallery of sermons: Labelling and social network analysis of 11,955 contemporary Danish sermons

Kirstine Helboe Johansen, Associate Professor in Practical Theology, and Anne Agersnap, PhD student in The Study of Religion, both Aarhus University, are interested in questions of how religion is actualised in contemporary society – and how such questions can be addressed digitally. This case gives an insight into their digital journey through (and beyond) their participation in the Digital Literacy course.

Read more

DIGITAL JOURNEYS: Janne’s case from the Digital Literacy course

Investigating the historical development of tracking and e-commerce technologies on the Danish Web

Janne Nielsen, assistant professor at the Department of Media and Journalism Studies at Aarhus University, is widening her digital horizon to face the concrete challenges of her everyday research. Her participation in the Digital Literacy course has whetted her digital appetite, and this case provides an insight into her digital journey through (and beyond) the course.

Read more

DIGITAL JOURNEYS: Anne’s case from the Digital Literacy course

Tracing Cold War perceptions of nuclear weapons in Denmark through distant (and close) reading

Anne Sørensen, history researcher at the School of Communication and Culture, Aarhus University, has embarked on a journey to expand her digital horizon – most recently by participating in the Digital Literacy course. This case gives insight into her digital journey through (and beyond) the course.

Read more

Experiments with Big Video

New technologies give enhanced methods for video ethnography

Researchers at Aalborg University have been experimenting with new technologies and enhanced methods for EMCA and video ethnography. One key focus has been to collect richer video and sound recordings in a variety of settings.

Read more

Gesta Danorum

Language technology, a shortcut to scientific evidence

This case is an example of how language technology can be exploited in research within the humanities. The resource that this case is based on is Gesta Danorum written about 1200 by the Danish historian, Saxo.

Read more

LARM.fm

Locating missing metadata for radio programmes by using the programme schedules

Programmes that are part of a series often have the name of the series as their title. In this case, a search for the series title results in a list of programmes with the same title.

Read more

 

Research publications

Find examples of relevant publications from the DIGHUMLAB community.

Language-based Materials and Tools

Martinez Alonso, Hector; Plank, Barbara; Johannsen, Anders Trærup; Søgaard, Anders: Active Learning for Sense Annotation. In: Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015. Linköping University Electronic Press, 2015.

Martinez Alonso, Hector; Johannsen, Anders Trærup; Olsen, Sussi; Nimb, Sanni; Sørensen, Nicolai; Braasch, Anna; Søgaard, Anders; Pedersen, Bolette Sandford: Supersense tagging for Danish. In: Proceedings of the 20th Nordic Conference of Computational Linguistics NODALIDA 2015. Linköping University Electronic Press, 2015.

Read more

LARM.fm research publications

Andersen, J. S., Thøgersen, J., Larsen, B.: Larm Audio Research Archive – en infrastruktur til forskning og undervisning i radio og lyd 2010-2014

Granly, E., Stougaard, B. and Have, I. (eds.), Sound Archives, special issue of the journal SoundEffects, Vol. 5, No. 2.

Kreutzfeldt, Jacob: ”State Controled Avantgarde?: Emil Bønnelyckes radiophonic city portrait of Copenhagen”. A Cultural History of the Avant-Garde in the Nordic Countries. ed. / Tania Ørum; Per Stounbjerg et al. Vol. 2 Edition Rodopi B.V, 2015.

Read more

Netlab

Brügger, N., Schroeder, R (editors): The Web as History. UCL Press. 2017

Brügger, N.: Digital Humanities in the 21st Century: Digital Material as a Driving Force.  Digital Humanities Quarterly, 10(3), 2016.

Brügger, N.: Humanities, Digital Humanities, Media studies, Internet studies: An inaugural lecture. Center for Internetforskning, Aarhus Universitet, 2015. 16 pp. (Skrifter fra Center for Internetforskning)

Read more

Experimental labs

Davidsen, J. & Kjær, M. (ed.): Introduktion til videoanalyse, Samfundslitteratur, 2018. p. 13-35

Davidsen, J., & McIlvenny, P (2017). Research on Language and Social interaction – Blog

Davidsen, J. & Ryberg, T. (2017): “This is the size of one meter”: Children’s bodily-material collaboration. Intern. J. Comput.-Support. Collab. Learn (2017) 12: 65. doi:10.1007/s11412-017-9248-8

Read more

 
