Automating metadata extraction and cataloguing: experiences from the National Libraries of Norway and Finland

PierreBeauguitte · August 20, 2024, 5:05am

Presenters

Pierre Beauguitte (@PierreBeauguitte)
Osma Suominen (@Osma)

Slides and Recording

Slides (PDF)
Recordings: TIB AV Portal, YouTube

Abstract

The increasing volume of grey literature, such as reports produced by public sector organizations and academia, poses significant cataloguing, discoverability, and accessibility challenges in digital libraries. To help address these challenges, the National Library of Norway (NLN) and the National Library of Finland (NLF) have explored different strategies to automatically extract bibliographic metadata from PDF files. This presentation will first discuss METEOR, an open-source tool developed by the NLN that uses rule-based logic and keywords and is already integrated in the production workflow as a suggestion engine for librarians. Meanwhile, the NLF is exploring the potential of fine-tuned, locally hosted large language models for extracting bibliographic metadata. The strengths and weaknesses of both approaches are analyzed, as well as the common obstacles they face. This session will also present our joint efforts to prepare high quality datasets for training and evaluation of metadata extraction systems along with newly developed metrics suited to the task. Finally, the discussion will focus on the integration of external catalogues and authority registries in these processes, enabling the use of persistent identifiers for entities in the metadata. Our presentation seeks to share practical solutions, promote methodology exchange, and inspire community collaboration.

Kat · November 26, 2024, 2:31pm

Q: Roughly how much work went into creating the FinGreyLit document set?
A: A summer intern worked on the curation, and staff members also invested the equivalent of 2-4 weeks of full-time work.

niklasl · November 26, 2024, 2:33pm

Have you tried this approach for other kinds of literature, such as deriving metadata e.g. from title pages of “regular” books (e.g. novels)?

Osma · November 26, 2024, 2:39pm

Currently, no. However, the FinGreyLit data set does include some (freely available) e-books and book chapters; see the statistics report for details (currently 89 books and 35 book parts).

PierreBeauguitte · November 26, 2024, 2:41pm

The same concept is definitely applicable to the colophon and/or title page of a book, but we only rarely get books delivered without metadata, so we haven’t focused so much on it. But yes, we have experimented a little with it!
The question of sharing datasets is also trickier with books: most are protected by copyright, whereas a lot of grey literature is openly available.

hudakhan · November 26, 2024, 5:02pm

Hi all! Are the slides for this talk available? I’d be very interested in sharing with colleagues. Thank you so much!

Osma · November 26, 2024, 5:18pm

Oh yes, here are the slides! The recording will be published later as well.

SWIB24 Automating Metadata Extraction and Cataloguing.pdf (2.3 MB)

hudakhan · November 26, 2024, 6:47pm

Thank you so much Osma!

eduards · November 27, 2024, 8:24am

Hi! Thank you for the excellent presentation! I’m curious—could you share which ILS or cataloguing system METEOR is integrated into at the National Library of Norway?

PierreBeauguitte · November 27, 2024, 9:41am

Hi, and thank you for the feedback! The METEOR API produces a JSON response that is loaded as suggestions in DIMO, our registration / mini-cataloguing software. DIMO is developed in-house, but unfortunately not open source (yet?). After revision by a cataloger, the metadata is written as a MARC21 record and sent to our catalog, which in this instance is Alma (Ex Libris).
I hope this answers your question!
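
For readers wondering what that hand-off from JSON suggestions to MARC21 fields could look like, here is a minimal Python sketch. The JSON keys (`title`, `authors`, `publisher`, `year`), the sample values, and the helper names `to_marc_fields` and `render` are assumptions made up for illustration; they are not METEOR's actual response schema or DIMO's code. Only the MARC21 tags used (100, 245, 264, 700) are standard.

```python
import json

# Hypothetical, already-reviewed suggestion payload. The real METEOR
# response may use different keys and carry more information.
SAMPLE_RESPONSE = """
{
  "title": "Annual report 2023",
  "authors": ["Hansen, Kari", "Berg, Ola"],
  "publisher": "Example Agency",
  "year": "2023"
}
"""


def to_marc_fields(metadata):
    """Map a reviewed metadata dict to a list of (tag, subfields) pairs."""
    fields = []
    authors = metadata.get("authors", [])

    # 100: main entry, personal name (first author only).
    if authors:
        fields.append(("100", {"a": authors[0]}))

    # 245: title statement.
    if metadata.get("title"):
        fields.append(("245", {"a": metadata["title"]}))

    # 264: publication statement ($b publisher, $c date).
    imprint = {}
    if metadata.get("publisher"):
        imprint["b"] = metadata["publisher"]
    if metadata.get("year"):
        imprint["c"] = metadata["year"]
    if imprint:
        fields.append(("264", imprint))

    # 700: added entries for any remaining authors.
    for name in authors[1:]:
        fields.append(("700", {"a": name}))

    return fields


def render(fields):
    """Render fields in a simple human-readable form, one field per line."""
    lines = []
    for tag, subfields in fields:
        parts = " ".join(f"${code} {value}" for code, value in subfields.items())
        lines.append(f"{tag} {parts}")
    return "\n".join(lines)


print(render(to_marc_fields(json.loads(SAMPLE_RESPONSE))))
```

A production workflow would write a complete record (leader, control fields, indicators) with a dedicated MARC library rather than plain tuples; the sketch only shows the mapping step.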