From Triples to Text: LLM-Based Approach to Querying Wikibase Content

Presenter

Kolja Bailly (@baillyk)

Slides and Recordings

Abstract

The Open Science Lab (OSL) at TIB Hannover develops open-source solutions for the management of research data with Wikibase, an extension of the MediaWiki software suite. This presentation shows the integration of AI-based approaches within MediaWiki, utilizing Retrieval-Augmented Generation (RAG), a methodology that allows Large Language Models (LLMs) to interact with custom data sources. The llama_index_mediawiki-service is a containerized solution, based on the LlamaIndex framework, designed to run an LLM that enhances the usability and accessibility of data hosted on MediaWiki instances. Computation can run on remote services such as the Hugging Face API or GWDG SAIA, or locally, preserving user privacy by keeping all data local. The results provide context-aware responses to user queries in natural language or support the user in the creation of SPARQL queries. OSL has updated the service to index data saved in several structured formats, including MediaWiki pages and Wikibase statements. By leveraging LlamaIndex, a vector index can be created that stores data from the Wiki instance in a format that allows comparison by semantic similarity. A demo instance of the service has been applied to a Wiki containing data about historic manor houses in the Baltic Sea Region, a joint project between OSL and the University of Greifswald. While still in development, this demo offers a promising step towards easy-to-use and free-of-charge open-source LLM integration in MediaWiki.
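The retrieval step the abstract describes (a vector index over wiki content, queried by semantic similarity, with the top hits handed to the LLM as context) can be sketched in a few lines. The actual service uses LlamaIndex with a real embedding model; the toy bag-of-words embedding and the example documents below are illustrative assumptions, not the service's implementation:

```python
import math

# Toy "embedding": a bag-of-words count vector over a fixed vocabulary.
# The real service uses LlamaIndex with a proper embedding model; this
# sketch only illustrates the retrieval step of RAG.
def embed(text, vocab):
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical wiki snippets (pages and statements flattened to text).
documents = [
    "Manor house Griebenow is located in the Baltic Sea Region",
    "SPARQL is a query language for RDF triples",
    "Wikibase stores statements as subject predicate object triples",
]

vocab = sorted({w for d in documents for w in d.lower().split()})
index = [(d, embed(d, vocab)) for d in documents]  # the "vector index"

def retrieve(query, k=1):
    """Return the k documents most semantically similar to the query."""
    q = embed(query, vocab)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved snippets would be prepended to the user's question as
# context for the LLM, which then answers in natural language.
print(retrieve("what query language works with RDF triples"))
```

With a real embedding model the same pipeline also works across languages (a question in English can retrieve German snippets), which is presumably how the multilingual behavior in the demo arises.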

Q: If you have a SPARQL template with just a few blanks to fill, isn’t using an LLM silly? You could use a form instead, couldn’t you?

Q: In the examples, the data was in German while the questions and answers were in English. How does this work across languages?

Q: If the RAG module becomes standalone, will it expose a stable API so that platforms like Arches or custom FastAPI services can plug in easily?

Q: From a user perspective: some examples had correct answers, but others did not. As a user, how do you verify an answer? Can you go back to the original data sources?

Q: Is this a sensible starting point for building a system if I have a large RDF graph I want to “chat with”?

@Osma had a question around multiple components in the system (I didn’t capture the specifics, sorry!)

Q: Can you ask the user to rephrase the question so that better answers may be retrieved? Or give instructions that help the user get the best answer?

My Q: You have many components in the system. How do you handle updates? What happens when integrations break? What about security issues?

These are the slides of my presentation: 2025_11_19 SWIB25-Wikibase4research RAG.pdf (1.4 MB)
