LLM-Based Information Extraction to Support Scientific Literature Research and Publication Workflows
We evaluated the performance of open-weight and proprietary LLMs on the automatic extraction of key concepts from scientific texts, focusing on a specific example domain: business process management in computer science. Using the full text of each paper, the models answered expert-generated extraction questions for coding 122 papers from the Business Process Management Conference, and their answers were compared against our manually created gold standard.
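The comparison against the gold standard can be sketched as follows. This is a minimal, hypothetical illustration, assuming one short answer per extraction question and a lenient normalized string match; the names and data are invented for the example, not taken from the actual evaluation.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions where the model answer matches the gold answer."""
    matches = sum(
        normalize(predictions.get(qid, "")) == normalize(gold[qid])
        for qid in gold
    )
    return matches / len(gold)

# Toy data; the real study coded 122 papers with expert-generated questions.
gold = {"q1": "BPMN", "q2": "case study"}
predictions = {"q1": "bpmn", "q2": "survey"}
print(exact_match_accuracy(predictions, gold))  # → 0.5
```

In practice, longer free-text answers would call for softer metrics (e.g. token overlap or embedding similarity) rather than exact match.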
Open-weight models like Qwen-2.5 and Llama-3.3 appear competitive with proprietary models. Building on this work, in 2026 we plan to train specialized small models for cost-efficient local extraction and to explore workflow integration in Zotero. We additionally plan to explore embedding and retrieval for longer key concepts, such as a paper's research question, via a centralized search service to improve related-work discovery.
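The planned retrieval over embedded key concepts can be sketched as a nearest-neighbor search by cosine similarity. This is a toy illustration only: the two-dimensional vectors stand in for real embeddings of papers' research questions, and the function names are invented for the example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k corpus entries most similar to the query."""
    ranked = sorted(corpus.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [paper_id for paper_id, _ in ranked[:k]]

# Toy vectors standing in for embeddings of research questions.
corpus = {"paper_a": [1.0, 0.0], "paper_b": [0.9, 0.1], "paper_c": [0.0, 1.0]}
print(top_k([1.0, 0.05], corpus, k=2))  # → ['paper_a', 'paper_b']
```

A centralized service would replace the in-memory dictionary with a vector index, but the ranking principle is the same.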
Samy Ateia (PhD Student, University of Regensburg, Project: Semantic Aspects, part of NFDIxCS)