DARIAH Partners with Princeton University to Deliver NLP Training for Humanists

The Center for Digital Humanities at Princeton University has recently received a grant from the US-based National Endowment for the Humanities to train humanities scholars in the development of natural language processing (NLP) tools for lesser-resourced languages.

The “New Languages for NLP” workshop series will be hosted at Princeton in 2021 and 2022 in cooperation with the Digital Research Infrastructure for Arts and Humanities (DARIAH) and the Library of Congress LC Labs.

Participants will work together over the course of a year, from June 2021 to May 2022, and will meet for three intensive workshops, where they will learn how to annotate linguistic data and train statistical language models using cutting-edge NLP tools.

They will also learn best practices in project and research data management and join discussions with leaders in the fields of multilingual NLP and digital humanities (DH). In addition, they will advance their own research projects by creating, employing and interrogating text-analysis tools and methods, while increasing much-needed linguistic diversity in the field of NLP.

“In addition to helping a number of researchers create and adopt NLP tools, we will be creating a body of knowledge and learning resources to share with the wider community via our educational portal DARIAH-Campus,” Tasovac said. “Knowledge is brittle and, especially in this age of information overload, it is essential that we capture it rather than let it get buried under an avalanche of digital noise. As a research infrastructure, we’re delighted that we can work together with Princeton on delivering, preserving and sustaining the outputs of this project.”

NLP has revolutionized our ability to analyze texts at scale. However, of the world’s more than 7,500 languages, the major NLP resources support only eighty-five. While large linguistic datasets exist for high-resource languages such as English or German, text mining, topic modeling and other methods of computational text analysis are unavailable for the vast majority of languages, especially those that are minority, regional or endangered.

A Call for Proposals will be published in early November. For more information, see the project website.