Overview of ELOQUENT 2024: Shared Tasks for Evaluating Generative Language Model Quality
Silo AI, Finland. ORCID iD: 0000-0003-4042-4919
RISE Research Institutes of Sweden, Digital Systems, Data Science. ORCID iD: 0000-0003-3246-1664
RISE Research Institutes of Sweden, Digital Systems, Data Science. ORCID iD: 0000-0002-9162-6433
University of Edinburgh, UK. ORCID iD: 0009-0001-3339-6334
2024 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 14959 LNCS, p. 53-72. Article in journal (Refereed). Published.
Abstract [en]

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. The tasks for the first year of ELOQUENT were (1) Topical quiz, in which language models are probed for topical competence; (2) HalluciGen, in which we assessed the ability of models to generate and detect hallucinations; (3) Robustness, in which we assessed the robustness and consistency of a model output given variation in the input prompts; and (4) Voight-Kampff, run in partnership with the PAN lab, with the aim of discovering whether it is possible to automatically distinguish human-generated text from machine-generated text. This first year of experimentation has shown, as expected, that using self-assessment with models judging models is feasible, but not entirely straightforward, and that a judicious comparison with human assessment and application context is necessary to be able to trust self-assessed quality judgments.
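The "models judging models" idea mentioned above can be illustrated with a minimal, hypothetical sketch (this is not the ELOQUENT implementation): one model answers a quiz question and a second model is prompted to grade that answer against a rubric. The `query_model` wrapper is an assumed placeholder for whatever inference backend is available.

```python
def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around an LLM inference endpoint; plug in your own backend."""
    raise NotImplementedError("connect this to an actual inference API")


def self_assess(candidate_model: str, judge_model: str, question: str) -> str:
    # Step 1: the candidate model answers a topical quiz question.
    answer = query_model(candidate_model, question)

    # Step 2: a judge model scores the answer. As the abstract notes, such
    # self-assessed scores still need to be compared against human assessment
    # and the application context before they can be trusted.
    rubric = (
        "Rate the following answer for topical competence on a 1-5 scale "
        "and briefly justify the rating.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return query_model(judge_model, rubric)
```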

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024. Vol. 14959 LNCS, p. 53-72
Keywords [en]
Generative language model; Human assessment; Language model; LLM; Modeling quality; Quality criteria; Self-assessed quality; Shared task; Generative adversarial networks
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:ri:diva-76049. DOI: 10.1007/978-3-031-71908-0_3. Scopus ID: 2-s2.0-85205360663. OAI: oai:DiVA.org:ri-76049. DiVA, id: diva2:1909234
Available from: 2024-10-30. Created: 2024-10-30. Last updated: 2025-04-22. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text (Scopus)

Authority records

Dürlich, Luise; Gogoulou, Evangelia; Nivre, Joakim; Zahra, Shorouq

Search in DiVA

By author/editor
Karlgren, Jussi; Dürlich, Luise; Gogoulou, Evangelia; Guillou, Liane; Nivre, Joakim; Talman, Aarne; Zahra, Shorouq
By organisation
Data Science
In the same journal
Lecture Notes in Computer Science
Natural Language Processing
