Human-Centric Ground Truth Evaluation and Acceptance (Hu-GTVA): An Oversight by Design process for RAG-LLM evaluation
RISE Research Institutes of Sweden, Digital Systems, Data Science. ORCID iD: 0009-0004-8393-1683
RISE Research Institutes of Sweden, Digital Systems, Industrial Systems. ORCID iD: 0000-0002-3687-6755
RISE Research Institutes of Sweden, Digital Systems, Mobility and Systems. ORCID iD: 0000-0001-9215-3896
2025 (English) Report (Other academic)
Abstract [en]

We present Hu-GTVA, a framework for human-grounded test case generation, validation, and acceptance designed to create contextual ground truths for the evaluation of retrieval-augmented generation (RAG) systems in high-stakes public-sector contexts. The framework addresses the challenge of aligning RAG system evaluations with expert-grounded domain knowledge by combining automated test case generation, structured expert annotation, and dual-review protocols. We demonstrate its application in collaboration with the Swedish National Financial Management Authority (ESV), where it supports the evaluation of Konsekvenshjälpen, a RAG-LLM system for regulatory impact assessment assistance. Hu-GTVA takes conceptual motivation from both the principle of Oversight by Design and the regulatory requirement of Human Oversight under Article 14 of the EU AI Act. Oversight by Design emphasizes integrating oversight considerations as early as the design phase, while Human Oversight defines who, when, and what must be governed to ensure accountable AI use. Drawing from both, Hu-GTVA introduces structured expert review, acceptance criteria, and quantitative agreement metrics to bring human judgment into the evaluation process before deployment. Designed for modularity and domain adaptability, the framework can be extended to other high-risk settings such as healthcare or critical infrastructure. Hu-GTVA offers a reproducible and human-centered pre-hoc RAG-LLM evaluation pipeline.

Place, publisher, year, edition, pages
Borås: RISE Research Institutes of Sweden, 2025, p. 30
Series
RISE Rapport
Keywords [en]
Human oversight, oversight by design, ground truth, RAG, RAG LLM, RAG evaluation, RAGChecker, RAGAS, TruLens
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:ri:diva-80077
ISBN: 978-91-90109-25-0 (print)
OAI: oai:DiVA.org:ri-80077
DiVA, id: diva2:2023841
Available from: 2025-12-22. Created: 2025-12-22. Last updated: 2026-01-22. Bibliographically approved.

Open Access in DiVA

fulltext (1337 kB), 136 downloads
File information
File name: FULLTEXT01.pdf
File size: 1337 kB
Checksum (SHA-512): 2f1c672659575b48f17407a3aa047c8a8fa5d2889ef8bedf3c393e127c2fd41030cdb80fbcfe305bc51c59c788896dc986e6677a45e589a0f45c57a2b74ac80f
Type: fulltext
Mimetype: application/pdf

Authority records

Fahria, Kabir; Mowla, Nishat; Stenberg, Susanne
