Publications (10 of 10)
Sahlgren, M., Karlgren, J., Dürlich, L., Gogoulou, E., Talman, A. & Zahra, S. (2024). ELOQUENT 2024 - Robustness Task. In: CEUR Workshop Proceedings. Paper presented at the 25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, 9-12 September 2024 (pp. 703-707). CEUR-WS, 3740
ELOQUENT 2024 - Robustness Task
2024 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2024, Vol. 3740, p. 703-707. Conference paper, Published paper (Refereed)
Abstract [en]

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. One of the tasks for the first year of ELOQUENT was the robustness task, in which we assessed the robustness and consistency of model output given variation in the input prompts. We found that consistency did indeed vary, both across prompt items and across models. On a methodological note, we found that using an oracle model to assess the submitted responses is feasible, and we intend to investigate the consistency of such assessments across different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for further assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.
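As an illustration of how such consistency can be quantified, here is a minimal sketch (not the task's official scoring; the encoder name is an assumption): it embeds the responses a model gives to paraphrased prompts and averages their pairwise cosine similarities.

```python
# Hedged sketch: measure output consistency across prompt variants as the mean
# pairwise cosine similarity of response embeddings (1.0 = same meaning).
from itertools import combinations

from sentence_transformers import SentenceTransformer  # assumed encoder library
from sklearn.metrics.pairwise import cosine_similarity

def consistency_score(responses: list[str], encoder: SentenceTransformer) -> float:
    """Mean pairwise cosine similarity over all response pairs."""
    embeddings = encoder.encode(responses)
    pairs = list(combinations(range(len(responses)), 2))
    sims = [cosine_similarity(embeddings[i:i + 1], embeddings[j:j + 1])[0, 0]
            for i, j in pairs]
    return float(sum(sims) / len(sims))

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
answers = ["Water boils at 100 °C at sea level.",
           "At sea-level pressure, water boils at 100 degrees Celsius.",
           "Water freezes at 0 °C."]
print(consistency_score(answers, encoder))  # lower score = less consistent model
```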

Place, publisher, year, edition, pages
CEUR-WS, 2024
Keywords
High level languages; First year; Human assessment; Language model; Model outputs; Oracle model; Quality characteristic; Quality criteria; Generative adversarial networks
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-75019 (URN); 2-s2.0-85201575633 (Scopus ID)
Conference
25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, 9-12 September 2024
Note

This lab has been supported by the European Commission through the DeployAI project (grant number 101146490), by the Swedish Research Council (grant number 2022-02909), and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)]. 

Available from: 2024-09-09. Created: 2024-09-09. Last updated: 2025-09-23. Bibliographically approved.
Karlgren, J., Dürlich, L., Gogoulou, E., Guillou, L., Nivre, J., Sahlgren, M. & Talman, A. (2024). ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality. Paper presented at the 46th European Conference on Information Retrieval (ECIR 2024), Glasgow, UK, 24-28 March 2024. Lecture Notes in Computer Science, 14612 LNCS, 459-465
ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality
2024 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 14612 LNCS, p. 459-465. Article in journal (Refereed). Published
Abstract [en]

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to bring together some high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. The selected tasks for this first year of ELOQUENT are (1) probing a language model for topical competence; (2) assessing the ability of models to generate and detect hallucinations; (3) assessing the robustness of a model output given variation in the input prompts; and (4) establishing the possibility to distinguish human-generated text from machine-generated text.

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024
Keywords
Benchmarking; CLEF; Generative language model; Human assessment; Language model; LLM; Modeling quality; Multilinguality; Quality benchmark; Quality criteria; Shared task; Computational linguistics
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-72876 (URN); 10.1007/978-3-031-56069-9_63 (DOI); 2-s2.0-85189366495 (Scopus ID)
Conference
46th European Conference on Information Retrieval (ECIR 2024), Glasgow, UK, 24-28 March 2024
Available from: 2024-04-26. Created: 2024-04-26. Last updated: 2025-09-23. Bibliographically approved.
Dong, G., Bate, A., Haguinet, F., Westman, G., Dürlich, L., Hviid, A. & Sessa, M. (2024). Optimizing Signal Management in a Vaccine Adverse Event Reporting System: A Proof-of-Concept with COVID-19 Vaccines Using Signs, Symptoms, and Natural Language Processing. Drug Safety, 47(2), 173
Optimizing Signal Management in a Vaccine Adverse Event Reporting System: A Proof-of-Concept with COVID-19 Vaccines Using Signs, Symptoms, and Natural Language Processing
2024 (English). In: Drug Safety, ISSN 0114-5916, E-ISSN 1179-1942, Vol. 47, no. 2, p. 173. Article in journal (Refereed). Published
Abstract [en]

Introduction: The Vaccine Adverse Event Reporting System (VAERS) has already been challenged by an extreme increase in the number of individual case safety reports (ICSRs) after the market introduction of coronavirus disease 2019 (COVID-19) vaccines. Evidence from the scientific literature suggests that when there is an extreme increase in the number of ICSRs recorded in spontaneous reporting databases (such as VAERS), an accompanying increase in the number of disproportionality signals (sometimes referred to as ‘statistical alerts’) is expected.

Objectives: The objective of this study was to develop a natural language processing (NLP)-based approach to optimize signal management by excluding disproportionality signals related to listed adverse events following immunization (AEFIs). COVID-19 vaccines were used as a proof-of-concept.

Methods: VAERS was used as a data source, and Finding Associated Concepts with Text Analysis (FACTA+) was used to extract signs and symptoms of listed AEFIs from MEDLINE for COVID-19 vaccines. Disproportionality analyses were conducted according to guidelines and recommendations provided by the US Centers for Disease Control and Prevention. Using the signs and symptoms of listed AEFIs, we computed the proportion of disproportionality signals dismissed for COVID-19 vaccines under this approach. Nine NLP techniques, including Generative Pre-Trained Transformer 3.5 (GPT-3.5), were used to automatically retrieve Medical Dictionary for Regulatory Activities Preferred Terms (MedDRA PTs) from the signs and symptoms extracted with FACTA+.

Results: Overall, 17% of disproportionality signals for COVID-19 vaccines were dismissed because they reported signs and symptoms of listed AEFIs. Eight of the nine NLP techniques showed suboptimal performance in retrieving MedDRA PTs. GPT-3.5 achieved an accuracy of 78% in correctly assigning MedDRA PTs.

Conclusion: Our approach reduced the need for manual exclusion of disproportionality signals related to listed AEFIs and may lead to better optimization of time and resources in signal management.
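For readers unfamiliar with disproportionality analysis, the sketch below shows one standard signal statistic, the proportional reporting ratio (PRR); it is a generic textbook illustration, not necessarily the exact statistic computed in the study.

```python
# Hedged illustration of a disproportionality statistic (PRR) over a 2x2
# contingency table of spontaneous reports; thresholds vary by guideline.
def proportional_reporting_ratio(a: int, b: int, c: int, d: int) -> float:
    """
    a: reports with the vaccine of interest AND the event of interest
    b: reports with the vaccine of interest, other events
    c: reports with other vaccines AND the event of interest
    d: reports with other vaccines, other events
    PRR = [a / (a + b)] / [c / (c + d)]
    """
    return (a / (a + b)) / (c / (c + d))

# A signal is often flagged when PRR >= 2 (plus minimum-count and chi-square
# criteria in practice).
print(proportional_reporting_ratio(a=30, b=970, c=40, d=3960))  # -> 3.0
```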

Place, publisher, year, edition, pages
Adis, 2024
National Category
Other Medical Engineering; Natural Language Processing
Identifiers
urn:nbn:se:ri:diva-68786 (URN); 10.1007/s40264-023-01381-6 (DOI); 2-s2.0-85178895864 (Scopus ID)
Available from: 2024-01-15. Created: 2024-01-15. Last updated: 2025-09-23.
Karlgren, J., Dürlich, L., Gogoulou, E., Guillou, L., Nivre, J., Sahlgren, M., . . . Zahra, S. (2024). Overview of ELOQUENT 2024: Shared Tasks for Evaluating Generative Language Model Quality. Lecture Notes in Computer Science, 14959 LNCS, 53-72
Overview of ELOQUENT 2024: Shared Tasks for Evaluating Generative Language Model Quality
2024 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 14959 LNCS, p. 53-72. Article in journal (Refereed). Published
Abstract [en]

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. The tasks for the first year of ELOQUENT were (1) Topical quiz, in which language models are probed for topical competence; (2) HalluciGen, in which we assessed the ability of models to generate and detect hallucinations; (3) Robustness, in which we assessed the robustness and consistency of model output given variation in the input prompts; and (4) Voight-Kampff, run in partnership with the PAN lab, with the aim of discovering whether it is possible to automatically distinguish human-generated text from machine-generated text. This first year of experimentation has shown, as expected, that using self-assessment with models judging models is feasible, but not entirely straightforward, and that a judicious comparison with human assessment and application context is necessary before self-assessed quality judgments can be trusted.

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024
Keywords
Generative language model; Human assessment; Language model; LLM; Modeling quality; Quality criteria; Self-assessed quality; Shared task; Generative adversarial networks
National Category
Natural Language Processing
Identifiers
urn:nbn:se:ri:diva-76049 (URN); 10.1007/978-3-031-71908-0_3 (DOI); 2-s2.0-85205360663 (Scopus ID)
Available from: 2024-10-30. Created: 2024-10-30. Last updated: 2025-09-23. Bibliographically approved.
Dürlich, L., Gogoulou, E., Guillou, L., Nivre, J. & Zahra, S. (2024). Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen. In: CEUR Workshop Proceedings. Paper presented at the 25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, 9-12 September 2024 (pp. 691-702). CEUR-WS, 3740
Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen
2024 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2024, Vol. 3740, p. 691-702. Conference paper, Published paper (Refereed)
Abstract [en]

In the HalluciGen task we aim to discover whether LLMs have an internal representation of hallucination. Specifically, we investigate whether LLMs can be used both to generate and to detect hallucinated content. In the cross-model evaluation setting we take this a step further and explore the viability of using one LLM to evaluate output produced by another LLM. We include generation, detection, and cross-model evaluation steps for two scenarios: paraphrase and machine translation. Overall, we find that the performance of the baselines and submitted systems is highly variable; however, initial results are promising, and lessons learned from this year's task will provide a solid foundation for future iterations. In particular, we highlight that, ideally, generated output should be validated by humans to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.
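A minimal sketch of what the detection step could look like, assuming a generic `ask_llm` callable standing in for any chat-model client; the prompt wording and function names here are illustrative, not the lab's official template.

```python
# Hedged sketch: ask a judge LLM which of two hypotheses is hallucinated.
from typing import Callable

def detect_hallucination(source: str, hyp_a: str, hyp_b: str,
                         ask_llm: Callable[[str], str]) -> str:
    """Return "A" or "B": the hypothesis the judge flags as hallucinated."""
    prompt = (
        f"Source: {source}\n"
        f"Hypothesis A: {hyp_a}\n"
        f"Hypothesis B: {hyp_b}\n"
        "Exactly one hypothesis contains information not supported by the "
        "source. Answer with the single letter A or B."
    )
    answer = ask_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"
```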

Place, publisher, year, edition, pages
CEUR-WS, 2024
Keywords
Computational linguistics; Computer aided language translation; Modeling languages; Cross model; Detection models; Evaluation; Generative language model; Hallucination; Internal representation; Language model; Machine translations; Model evaluation; Performance; Machine translation
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-75021 (URN); 2-s2.0-85201646530 (Scopus ID)
Conference
25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, 9-12 September 2024
Note

This lab has been partially supported by the Swedish Research Council (grant number 2022-02909) and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)].

Available from: 2024-09-06. Created: 2024-09-06. Last updated: 2025-09-23. Bibliographically approved.
Bevendorff, J., Wiegmann, M., Karlgren, J., Dürlich, L., Gogoulou, E., Talman, A., . . . Stein, B. (2024). Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2024. In: CEUR Workshop Proceedings. Paper presented at the 25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September 2024 (pp. 2486-2506). CEUR-WS, 3740
Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2024
2024 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2024, Vol. 3740, p. 2486-2506. Conference paper, Published paper (Refereed)
Abstract [en]

The “Voight-Kampff” Generative AI Authorship Verification task aims to determine whether a text was generated by an AI or written by a human. As in its fictional inspiration, the Voight-Kampff task structures AI detection as a builder-breaker challenge: the builders, participants in the PAN lab, submit software to detect AI-written text, and the breakers, participants in the ELOQUENT lab, submit AI-written text with the goal of fooling the builders. We formulate the task in a way that is reminiscent of a traditional authorship verification problem, where, given a pair of texts, their human or machine authorship is to be inferred. For this first task installment, we further restrict the problem so that each pair is guaranteed to contain one human and one machine text. Hence the task description reads: given two texts, one authored by a human, one by a machine, pick out the human. In total, we evaluated 43 detection systems (30 participant submissions and 13 baselines), ranging from linear classifiers to perplexity-based zero-shot systems. We tested them on 70 individual test set variants organized in 14 base collections, each designed around different constraints such as short texts, Unicode obfuscations, or language switching. The top systems achieve very high scores, proving themselves not perfect but sufficiently robust across a wide range of specialized testing regimes. Code used for creating the datasets and evaluating the systems, baselines, and data are available on GitHub.
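As a hedged illustration of the perplexity-based zero-shot systems mentioned above (not a reconstruction of any specific submission), the sketch below scores both texts of a pair with GPT-2 and guesses that the higher-perplexity text is the human-written one, since machine text tends to sit closer to a language model's expectations.

```python
# Hedged zero-shot baseline: pick the higher-perplexity text as the human one.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

def pick_human(text_1: str, text_2: str) -> int:
    """Return 0 or 1: the index of the text guessed to be human-written."""
    return 0 if perplexity(text_1) > perplexity(text_2) else 1
```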

Place, publisher, year, edition, pages
CEUR-WS, 2024
Keywords
Problem oriented languages; AI detection; Authorship verification; Detection system; Human-machine; One-machine; Task description; Task structure; Verification problems; Verification task; Written texts
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-75037 (URN); 2-s2.0-85201598034 (Scopus ID)
Conference
25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September 2024
Note

The “Voight-Kampff” Generative AI Authorship Detection Task @ PAN 2024 has been funded as part of the OpenWebSearch project by the European Commission (OpenWebSearch.eu, GA 101070014).

Available from: 2024-09-06. Created: 2024-09-06. Last updated: 2025-09-23. Bibliographically approved.
Dürlich, L., Gogoulou, E. & Nivre, J. (2023). On the Concept of Resource-Efficiency in NLP. In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). Paper presented at the 24th Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 135-145).
On the Concept of Resource-Efficiency in NLP
2023 (English). In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023, p. 135-145. Conference paper, Published paper (Refereed)
Abstract [en]

Resource-efficiency is a growing concern in the NLP community. But what are the resources we care about and why? How do we measure efficiency in a way that is reliable and relevant? And how do we balance efficiency and other important concerns? Based on a review of the emerging literature on the subject, we discuss different ways of conceptualizing efficiency in terms of product and cost, using a simple case study on fine-tuning and knowledge distillation for illustration. We propose a novel metric of amortized efficiency that is better suited for life-cycle analysis than existing metrics.
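The abstract does not spell the metric out; as a hedged illustration only, the intuition behind amortizing can be written so that one-off training cost is spread over the n uses a model sees during its life cycle, so efficiency improves with use. The symbols below are assumptions, and the paper's exact definition may differ.

```latex
% Illustrative formulation only, not the paper's notation:
% P: product (e.g., task performance), C_train: one-off training cost,
% c_inf: per-use inference cost, n: number of uses over the life cycle.
E_{\text{amortized}}(n) = \frac{P}{\frac{C_{\text{train}}}{n} + c_{\text{inf}}}
```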

National Category
Natural Language Processing
Identifiers
urn:nbn:se:ri:diva-67526 (URN)
Conference
24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Available from: 2023-10-11. Created: 2023-10-11. Last updated: 2025-09-23. Bibliographically approved.
Dürlich, L., Nivre, J. & Stymne, S. (2023). What Causes Unemployment? Unsupervised Causality Mining from Swedish Governmental Reports. In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2023). Paper presented at the 2nd Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2023), 22 May 2023 (pp. 25-29). Association for Computational Linguistics
What Causes Unemployment? Unsupervised Causality Mining from Swedish Governmental Reports
2023 (English). In: Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2023), Association for Computational Linguistics, 2023, p. 25-29. Conference paper, Published paper (Refereed)
Abstract [en]

Extracting statements about causality from text documents is a challenging task in the absence of annotated training data. We create a search system for causal statements about user-specified concepts by combining pattern matching of causal connectives with semantic similarity ranking, using a language model fine-tuned for semantic textual similarity. Preliminary experiments on a small test set from Swedish governmental reports show promising results in comparison to two simple baselines. 
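A minimal sketch of such a search system follows; the connective list is a small illustrative sample, and the Swedish sentence encoder named below is an assumption, not necessarily the fine-tuned model used in the paper.

```python
# Hedged sketch: filter sentences by causal-connective patterns, then rank the
# survivors by semantic similarity to a user query about a cause or effect.
import re

from sentence_transformers import SentenceTransformer, util

# A few Swedish causal connectives ("leads to", "causes", "is caused by",
# "depends on", "because of"); the real resource would be larger.
CONNECTIVES = re.compile(
    r"\b(leder till|orsakar|orsakas av|beror på|på grund av)\b", re.IGNORECASE)

encoder = SentenceTransformer("KBLab/sentence-bert-swedish-cased")  # assumed model

def search_causal(query: str, sentences: list[str], top_k: int = 5) -> list[str]:
    candidates = [s for s in sentences if CONNECTIVES.search(s)]
    if not candidates:
        return []
    q_emb = encoder.encode(query, convert_to_tensor=True)
    c_emb = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1])
    return [s for s, _ in ranked[:top_k]]
```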

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023
Keywords
Computational linguistics; Search engines; Semantics; Annotated training data; Language model; Pattern-matching; Search system; Semantic similarity; Similarity rankings; Swedish; Test sets; Text document; Textual similarities; Pattern matching
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-68044 (URN); 2-s2.0-85175867685 (Scopus ID); 9781959429739 (ISBN)
Conference
2nd Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2023), 22 May 2023
Note

This work was funded by Vinnova in the project 2019-02252: Datalab for results in the public sector. We thank Sven-Olof Junker, Martin Sparr, Fredrik Carlsson, Sebastian Reimann, and Gustav Finnveden for valuable discussions. The computations were enabled by resources in project UPPMAX 2020/2-2 at the Uppsala Multidisciplinary Center for Advanced Computational Science.

Available from: 2023-11-23. Created: 2023-11-23. Last updated: 2025-09-23. Bibliographically approved.
Dürlich, L., Reimann, S., Finnveden, G., Nivre, J. & Stymne, S. (2022). Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish. In: Proceedings of the First Workshop on Natural Language Processing for Political Sciences (PoliticalNLP), Marseille, France, 24 June 2022. Paper presented at the First Workshop on Natural Language Processing for Political Sciences (pp. 46-55).
Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish
2022 (English). In: Proceedings of the First Workshop on Natural Language Processing for Political Sciences (PoliticalNLP), Marseille, France, 24 June 2022, p. 46-55. Conference paper, Published paper (Refereed)
Abstract [en]

Causality detection is the task of extracting information about causal relations from text. It is an important task for different types of document analysis, including political impact assessment. We present two new data sets for causality detection in Swedish. The first data set is annotated with binary relevance judgments, indicating whether a sentence contains causality information or not. In the second data set, sentence pairs are ranked for relevance with respect to a causality query, containing a specific hypothesized cause and/or effect. Both data sets are carefully curated and mainly intended for use as test data. We describe the data sets and their annotation, including detailed annotation guidelines. In addition, we present pilot experiments on cross-lingual zero-shot and few-shot causality detection, using training data from English and German.

Keywords
text analysis, causality, causality detection, annotation, cross-lingual transfer
National Category
Natural Language Processing
Identifiers
urn:nbn:se:ri:diva-59295 (URN)
Conference
First Workshop on Natural Language Processing for Political Sciences
Available from: 2022-05-30. Created: 2022-05-30. Last updated: 2025-09-23. Bibliographically approved.
Nivre, J., Basirat, A., Dürlich, L. & Moss, A. (2022). Nucleus Composition in Transition-based Dependency Parsing. Computational Linguistics, 48(4), 849-886
Nucleus Composition in Transition-based Dependency Parsing
2022 (English). In: Computational Linguistics, ISSN 0891-2017, E-ISSN 1530-9312, Vol. 48, no. 4, p. 849-886. Article in journal (Refereed). Published
Abstract [en]

Dependency-based approaches to syntactic analysis assume that syntactic structure can be analyzed in terms of binary asymmetric dependency relations holding between elementary syntactic units. Computational models for dependency parsing almost universally assume that an elementary syntactic unit is a word, while the influential theory of Lucien Tesnière instead posits a more abstract notion of nucleus, which may be realized as one or more words. In this article, we investigate the effect of enriching computational parsing models with a concept of nucleus inspired by Tesnière. We begin by reviewing how the concept of nucleus can be defined in the framework of Universal Dependencies, which has become the de facto standard for training and evaluating supervised dependency parsers, and explaining how composition functions can be used to make neural transition-based dependency parsers aware of the nuclei thus defined. We then perform an extensive experimental study, using data from 20 languages to assess the impact of nucleus composition across languages with different typological characteristics, and utilizing a variety of analytical tools including ablation, linear mixed-effects models, diagnostic classifiers, and dimensionality reduction. The analysis reveals that nucleus composition gives small but consistent improvements in parsing accuracy for most languages, and that the improvement mainly concerns the analysis of main predicates, nominal dependents, clausal dependents, and coordination structures. Significant factors explaining the rate of improvement across languages include entropy in coordination structures and frequency of certain function words, in particular determiners. Analysis using dimensionality reduction and diagnostic classifiers suggests that nucleus composition increases the similarity of vectors representing nuclei of the same syntactic type. 
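As a generic sketch of what a composition function can look like (illustrative; the article's exact architecture may differ), the snippet below merges the vectors of a head and a dependent that belong to one nucleus into a single learned representation, which would then replace the two separate word vectors in the parser's configuration.

```python
# Hedged sketch of a nucleus composition function for a neural parser.
import torch
import torch.nn as nn

class NucleusComposition(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)  # learned composition weights

    def forward(self, head: torch.Tensor, dep: torch.Tensor) -> torch.Tensor:
        """Compose head and dependent vectors into one nucleus vector."""
        return torch.tanh(self.combine(torch.cat([head, dep], dim=-1)))

comp = NucleusComposition(dim=128)
head_vec, dep_vec = torch.randn(128), torch.randn(128)
nucleus_vec = comp(head_vec, dep_vec)  # single unit replacing the word pair
```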

Place, publisher, year, edition, pages
MIT Press Journals, 2022
Keywords
Abstracting, Computational linguistics, Structure (composition), Abstract notions, Computational modelling, Coordination structures, Dependency parser, Dependency parsing, Dependency relation, Dimensionality reduction, Nuclei composition, Syntactic analysis, Syntactic structure, Syntactics
National Category
Computer Sciences
Identifiers
urn:nbn:se:ri:diva-61573 (URN); 10.1162/coli_a_00450 (DOI); 2-s2.0-85143253082 (Scopus ID)
Note

We are grateful to Miryam de Lhoneux, Artur Kulmizev, and Sara Stymne for valuable comments and suggestions. We thank the action editor and the three reviewers for constructive comments that helped us improve the final version. The research presented in this article was supported by the Swedish Research Council (Vetenskapsrådet, grant 2016-01817).

Available from: 2022-12-20. Created: 2022-12-20. Last updated: 2025-09-23. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-3246-1664