Publications (10 of 11)
Sahlgren, M., Karlgren, J., Dürlich, L., Gogoulou, E., Talman, A. & Zahra, S. (2024). ELOQUENT 2024 - Robustness Task. In: CEUR Workshop Proceedings: . Paper presented at 25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble. 9 September 2024 through 12 September 2024 (pp. 703-707). CEUR-WS, 3740
ELOQUENT 2024 - Robustness Task
2024 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2024, Vol. 3740, p. 703-707. Conference paper, Published paper (Refereed)
Abstract [en]

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to apply high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. One of the tasks for the first year of ELOQUENT was the robustness task, in which we assessed the robustness and consistency of model output given variation in the input prompts. We found that consistency did indeed vary, both across prompt items and across models. On a methodological note, we found that using an oracle model to assess the submitted responses is feasible, and we intend to investigate the consistency of such assessments across different oracle models. We intend to run this task in coming editions of ELOQUENT to establish a solid methodology for assessing consistency, which we believe to be a crucial component of trustworthiness as a top-level quality characteristic of generative language models.
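
The core measurement lends itself to a compact illustration. Below is a minimal sketch of one way to score a system's consistency across paraphrased prompts, using mean pairwise cosine similarity of its responses; the encoder checkpoint and the similarity measure are illustrative assumptions, not the lab's actual evaluation pipeline.

```python
# Minimal sketch of one way to score output consistency across prompt
# variants, NOT the lab's actual evaluation pipeline: mean pairwise cosine
# similarity of responses. The encoder checkpoint is an illustrative choice.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise cosine similarity over responses to one prompt's variants."""
    embeddings = encoder.encode(responses, convert_to_tensor=True)
    pairs = list(combinations(range(len(responses)), 2))
    sims = [util.cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs]
    return sum(sims) / len(sims)

# Responses a system under test gave to paraphrases of the same prompt.
print(consistency_score([
    "The capital of Sweden is Stockholm.",
    "Stockholm is Sweden's capital city.",
    "I cannot answer questions about geography.",
]))
```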

Place, publisher, year, edition, pages
CEUR-WS, 2024
Keywords
High level languages; First year; Human assessment; Language model; Model outputs; Oracle model; Quality characteristic; Quality criteria; Generative adversarial networks
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-75019 (URN); 2-s2.0-85201575633 (Scopus ID)
Conference
25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble. 9 September 2024 through 12 September 2024
Note

This lab has been supported by the European Commission through the DeployAI project (grant number 101146490), by the Swedish Research Council (grant number 2022-02909), and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)]. 

Available from: 2024-09-09 Created: 2024-09-09 Last updated: 2024-09-09. Bibliographically approved
Karlgren, J., Dürlich, L., Gogoulou, E., Guillou, L., Nivre, J., Sahlgren, M. & Talman, A. (2024). ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality. Paper presented at 46th European Conference on Information Retrieval, ECIR 2024. Glasgow, UK. 24 March 2024 through 28 March 2024. Lecture Notes in Computer Science, 14612 LNCS, 459-465
ELOQUENT CLEF Shared Tasks for Evaluation of Generative Language Model Quality
2024 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 14612 LNCS, p. 459-465. Article in journal (Refereed). Published
Abstract [en]

ELOQUENT is a set of shared tasks for evaluating the quality and usefulness of generative language models. ELOQUENT aims to bring together high-level quality criteria, grounded in experiences from deploying models in real-life tasks, and to formulate tests for those criteria, preferably implemented to require minimal human assessment effort and in a multilingual setting. The selected tasks for this first year of ELOQUENT are (1) probing a language model for topical competence; (2) assessing the ability of models to generate and detect hallucinations; (3) assessing the robustness of model output given variation in the input prompts; and (4) establishing whether human-generated text can be distinguished from machine-generated text.

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024
Keywords
Benchmarking; CLEF; Generative language model; Human assessment; Language model; LLM; Modeling quality; Multilinguality; Quality benchmark; Quality criteria; Shared task; Computational linguistics
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-72876 (URN); 10.1007/978-3-031-56069-9_63 (DOI); 2-s2.0-85189366495 (Scopus ID)
Conference
46th European Conference on Information Retrieval, ECIR 2024. Glasgow, UK. 24 March 2024 through 28 March 2024
Available from: 2024-04-26 Created: 2024-04-26 Last updated: 2024-05-15. Bibliographically approved
Ekgren, A., Gyllensten, A. C., Stollenwerk, F., Öhman, J., Isbister, T., Gogoulou, E., . . . Sahlgren, M. (2024). GPT-SW3: An Autoregressive Language Model for the Scandinavian Languages. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings: . Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 Hybrid, Torino, Italy. 20 May 2024 through 25 May 2024 (pp. 7886-7900). European Language Resources Association (ELRA)
GPT-SW3: An Autoregressive Language Model for the Scandinavian Languages
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 7886-7900. Conference paper, Published paper (Refereed)
Abstract [en]

This paper details the process of developing the first native large generative language model for the North Germanic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation, applications, and considerations for release strategies. We discuss the pros and cons of developing large language models for smaller languages and in relatively peripheral regions of the globe, and we hope that this paper can serve as a guide and reference for other researchers who undertake the development of large generative models for smaller languages.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
Computational linguistics; Auto-regressive; Data collection; Development process; Generative model; Language model; Large language model; Low resource languages; Multilinguality; Peripheral regions; Release strategies; Data handling
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-73777 (URN); 2-s2.0-85195971043 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 Hybrid, Torino, Italy. 20 May 2024 through 25 May 2024
Note

The GPT-SW3 initiative has been enabled by the collaboration and support from the following organizations: RISE (collaboration on experiments, data storage and compute), NVIDIA (support with the deduplication code base and Nemo Megatron), Vinnova (funding via contracts 2019-02996, 2020-04658 and 2022-00949), and WASP WARA media and language (access to Berzelius via SNIC/NAISS). The computations and data handling were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) and the Swedish National Infrastructure for Computing (SNIC) at Berzelius, partially funded by the Swedish Research Council through grant agreements 2022-06725 and 2018-05973. Johan Raber at the National Supercomputer Center is acknowledged for assistance concerning technical and implementational aspects of making the code run on the Berzelius resources.

Available from: 2024-06-25 Created: 2024-06-25 Last updated: 2024-06-25. Bibliographically approved
Dürlich, L., Gogoulou, E., Guillou, L., Nivre, J. & Zahra, S. (2024). Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen. In: CEUR Workshop Proceedings: . Paper presented at 25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble. 9 September 2024 through 12 September 2024 (pp. 691-702). CEUR-WS, 3740
Overview of the CLEF-2024 Eloquent Lab: Task 2 on HalluciGen
2024 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2024, Vol. 3740, p. 691-702. Conference paper, Published paper (Refereed)
Abstract [en]

In the HalluciGen task we aim to discover whether LLMs have an internal representation of hallucination. Specifically, we investigate whether LLMs can be used to both generate and detect hallucinated content. In the cross-model evaluation setting we take this a step further and explore the viability of using an LLM to evaluate output produced by another LLM. We include generation, detection, and cross-model evaluation steps for two scenarios: paraphrase and machine translation. Overall, we find that the performance of the baselines and submitted systems is highly variable; however, initial results are promising, and lessons learned from this year's task will provide a solid foundation for future iterations. In particular, we highlight that, ideally, human validation of generated output is needed to ensure the robustness of the cross-model evaluation results. We aim to address this challenge in future iterations of HalluciGen.
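
The cross-model evaluation step can be illustrated compactly: one LLM judges another's paraphrase for hallucinated content. In the sketch below, `query_llm` is a hypothetical stand-in for whatever API serves the judge model, and the prompt template is an illustrative assumption, not the task's actual one.

```python
# Sketch of cross-model evaluation for the paraphrase scenario: a judge LLM
# labels another model's output. `query_llm` is a HYPOTHETICAL callable
# standing in for a real model API; the prompt wording is illustrative.

JUDGE_PROMPT = """Source sentence: {src}
Candidate paraphrase: {hyp}
Does the paraphrase introduce information not supported by the source?
Answer with exactly one word: hallucination or faithful."""

def judge_paraphrase(query_llm, src: str, hyp: str) -> str:
    answer = query_llm(JUDGE_PROMPT.format(src=src, hyp=hyp))
    return "hallucination" if "hallucination" in answer.lower() else "faithful"

# Dummy judge for demonstration; a real run would query an actual model.
print(judge_paraphrase(lambda _: "hallucination",
                       "The cat sat on the mat.",
                       "The cat sat on the mat on Mars."))
```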

Place, publisher, year, edition, pages
CEUR-WS, 2024
Keywords
Computational linguistics; Computer aided language translation; Modeling languages; Cross model; Detection models; Evaluation; Generative language model; Hallucination; Internal representation; Language model; Machine translations; Model evaluation; Performance; Machine translation
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-75021 (URN); 2-s2.0-85201646530 (Scopus ID)
Conference
25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble. 9 September 2024 through 12 September 2024
Note

This lab has been partially supported by the Swedish Research Council (grant number 2022-02909) and by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10039436 (Utter)].

Available from: 2024-09-06 Created: 2024-09-06 Last updated: 2024-09-06. Bibliographically approved
Bevendorff, J., Wiegmann, M., Karlgren, J., Dürlich, L., Gogoulou, E., Talman, A., . . . Stein, B. (2024). Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2024. In: CEUR Workshop Proceedings: . Paper presented at 25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble, France. 9 September 2024 through 12 September 2024 (pp. 2486-2506). CEUR-WS, 3740
Overview of the “Voight-Kampff” Generative AI Authorship Verification Task at PAN and ELOQUENT 2024
2024 (English). In: CEUR Workshop Proceedings, CEUR-WS, 2024, Vol. 3740, p. 2486-2506. Conference paper, Published paper (Refereed)
Abstract [en]

The “Voight-Kampff” Generative AI Authorship Verification task aims to determine whether a text was generated by an AI or written by a human. As in its fictional inspiration, the Voight-Kampff task structures AI detection as a builder-breaker challenge: the builders, participants in the PAN lab, submit software to detect AI-written text, and the breakers, participants in the ELOQUENT lab, submit AI-written text with the goal of fooling the builders. We formulate the task in a way that is reminiscent of a traditional authorship verification problem, where, given a pair of texts, their human or machine authorship is to be inferred. For this first task installment, we further restrict the problem so that each pair is guaranteed to contain one human and one machine text. Hence the task description reads: given two texts, one authored by a human, one by a machine, pick out the human. In total, we evaluated 43 detection systems (30 participant submissions and 13 baselines), ranging from linear classifiers to perplexity-based zero-shot systems. We tested them on 70 individual test set variants organized in 14 base collections, each designed around different constraints such as short texts, Unicode obfuscations, or language switching. The top systems achieve very high scores, proving themselves not perfect but sufficiently robust across a wide range of specialized testing regimes. Code used for creating the datasets and evaluating the systems, baselines, and data are available on GitHub.
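
A perplexity-based zero-shot detector of the kind mentioned above fits in a few lines: score both texts of a pair with a language model and guess that the less predictable (higher-perplexity) text is the human one. GPT-2 is used here as an illustrative scorer, not one of the evaluated systems.

```python
# Sketch of a perplexity-based zero-shot detector for the pair setting:
# guess that the text the scoring model finds MORE predictable (lower
# perplexity) is machine-written, so the other one is the human. GPT-2 is
# an illustrative scorer, not one of the evaluated PAN/ELOQUENT systems.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def pick_human(text_a: str, text_b: str) -> str:
    return "a" if perplexity(text_a) > perplexity(text_b) else "b"

print(pick_human("The quick brown fox jumps over the lazy dog.",
                 "As an AI language model, I can summarize the text."))
```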

Place, publisher, year, edition, pages
CEUR-WS, 2024
Keywords
Problem oriented languages; AI detection; Authorship verification; Detection system; Human-machine; One-machine; Task description; Task structure; Verification problems; Verification task; Written texts
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:ri:diva-75037 (URN); 2-s2.0-85201598034 (Scopus ID)
Conference
25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. Grenoble, France. 9 September 2024 through 12 September 2024
Note

The “Voight-Kampff” Generative AI Authorship Detection Task @ PAN 2024 has been funded as part of the OpenWebSearch project by the European Commission (OpenWebSearch.eu, GA 101070014).

Available from: 2024-09-06 Created: 2024-09-06 Last updated: 2024-09-06. Bibliographically approved
Dürlich, L., Gogoulou, E. & Nivre, J. (2023). On the Concept of Resource-Efficiency in NLP. In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa): . Paper presented at 24th Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 135-145).
On the Concept of Resource-Efficiency in NLP
2023 (English). In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023, p. 135-145. Conference paper, Published paper (Refereed)
Abstract [en]

Resource-efficiency is a growing concern in the NLP community. But what are the resources we care about and why? How do we measure efficiency in a way that is reliable and relevant? And how do we balance efficiency and other important concerns? Based on a review of the emerging literature on the subject, we discuss different ways of conceptualizing efficiency in terms of product and cost, using a simple case study on fine-tuning and knowledge distillation for illustration. We propose a novel metric of amortized efficiency that is better suited for life-cycle analysis than existing metrics.
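
To make the cost side concrete, here is a toy calculation of amortized per-query cost: one-off training cost spread over all downstream queries, plus marginal inference cost. This formula is a simplified assumption for illustration; the paper's actual amortized-efficiency metric is defined there.

```python
# Toy calculation of amortized per-query cost: one-off training cost spread
# over all downstream queries, plus the marginal inference cost. This
# formula is a simplified ASSUMPTION for illustration; the paper defines
# its own amortized-efficiency metric.

def amortized_cost_per_query(train_cost: float, infer_cost: float,
                             n_queries: int) -> float:
    """Cost (e.g., kWh or EUR) per query once training is amortized."""
    return train_cost / n_queries + infer_cost

# A distilled model is cheaper per query but needed extra training compute:
teacher   = amortized_cost_per_query(1_000.0, 0.010, n_queries=1_000_000)
distilled = amortized_cost_per_query(1_200.0, 0.002, n_queries=1_000_000)
print(f"teacher: {teacher:.6f}  distilled: {distilled:.6f}")
# At high query volume the distilled model wins despite its higher up-front
# cost; at low volume the ordering can flip.
```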

National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:ri:diva-67526 (URN)
Conference
24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Available from: 2023-10-11 Created: 2023-10-11 Last updated: 2024-05-15. Bibliographically approved
Gogoulou, E., Ekgren, A., Isbister, T. & Sahlgren, M. (2022). Cross-lingual Transfer of Monolingual Models. In: 2022 Language Resources and Evaluation Conference, LREC 2022: . Paper presented at 13th International Conference on Language Resources and Evaluation Conference, LREC 2022, 20 June 2022 through 25 June 2022 (pp. 948-955). European Language Resources Association (ELRA)
Cross-lingual Transfer of Monolingual Models
2022 (English). In: 2022 Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), 2022, p. 948-955. Conference paper, Published paper (Refereed)
Abstract [en]

Recent studies in cross-lingual learning using multilingual models have cast doubt on the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. We introduce a method for transferring monolingual models to other languages through continuous pre-training and study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform an English model trained from scratch, independently of the source language. After probing the model representations, we find that model knowledge from the source language enhances the learning of syntactic and semantic knowledge in English. Licensed under CC-BY-NC-4.0.
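
The transfer recipe can be sketched with standard tooling: keep the monolingual model's pre-trained weights and continue masked-language-model pre-training on target-language data. The model and dataset names below are illustrative assumptions; the paper's actual source models, tokenizer handling, and training schedule differ from this minimal setup.

```python
# Sketch of cross-lingual transfer via continued pre-training: reuse the
# monolingual model's weights, keep the MLM objective, train on English
# text. Model/dataset names are illustrative assumptions, not the paper's
# exact recipe (tokenizer handling and schedule differ).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

source = "bert-base-german-cased"  # assumed monolingual source checkpoint
tokenizer = AutoTokenizer.from_pretrained(source)
model = AutoModelForMaskedLM.from_pretrained(source)

english = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = english.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="transfer-en", max_steps=1_000),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # the German-initialized model now pre-trains on English
```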

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2022
Keywords
Learning systems, Cross-lingual, Generalisation, Model knowledge, Model representation, Pre-training, Semantics knowledge, Source language, Semantics
National Category
Computer Sciences
Identifiers
urn:nbn:se:ri:diva-62612 (URN); 2-s2.0-85144427655 (Scopus ID); 9791095546726 (ISBN)
Conference
13th International Conference on Language Resources and Evaluation Conference, LREC 2022, 20 June 2022 through 25 June 2022
Note

Funding details: Vinnova, 2019-02996. This work is supported by the Swedish innovation agency (Vinnova) under contract 2019-02996.

Available from: 2023-01-24 Created: 2023-01-24 Last updated: 2024-05-15. Bibliographically approved
Ekgren, A., Gyllensten, A., Gogoulou, E., Heiman, A., Verlinden, S., Öhman, J., . . . Sahlgren, M. (2022). Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish. In: 2022 Language Resources and Evaluation Conference, LREC 2022: . Paper presented at 13th International Conference on Language Resources and Evaluation Conference, LREC 2022, 20 June 2022 through 25 June 2022 (pp. 3509-3518). European Language Resources Association (ELRA)
Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish
2022 (English). In: 2022 Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), 2022, p. 3509-3518. Conference paper, Published paper (Refereed)
Abstract [en]

We present GPT-SW3, a 3.5 billion parameter autoregressive language model, trained on a newly created 100 GB Swedish corpus. This paper provides insights into the data collection and training process and discusses the challenges of proper evaluation. The results of quantitative evaluation using perplexity indicate that GPT-SW3 is a competent model in comparison with existing autoregressive models of similar size. Additionally, we perform an extensive prompting study which reveals the good text generation capabilities of GPT-SW3. Licensed under CC-BY-NC-4.0.
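
A prompting study of this kind boils down to constructing few-shot prompts and sampling continuations. A minimal sketch follows, assuming the small public GPT-SW3 checkpoint name on Hugging Face; the paper's actual prompts and model sizes differ.

```python
# Sketch of a few-shot prompt of the kind such a prompting study uses. The
# checkpoint name is an ASSUMPTION (a small public GPT-SW3 release on
# Hugging Face); the paper's actual prompts and model sizes differ.
from transformers import pipeline

generator = pipeline("text-generation", model="AI-Sweden-Models/gpt-sw3-126m")

prompt = (
    "Översätt till engelska.\n"
    "svenska: Jag tycker om kaffe. engelska: I like coffee.\n"
    "svenska: Det regnar idag. engelska:"
)
print(generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"])
```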

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2022
Keywords
Evaluation, Language models, Prompting, Auto-regressive, Autoregressive modelling, Data collection process, Language model, Large-scales, Quantitive, Swedishs, Training process, Computational linguistics
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:ri:diva-62614 (URN); 2-s2.0-85144340418 (Scopus ID); 9791095546726 (ISBN)
Conference
13th International Conference on Language Resources and Evaluation Conference, LREC 2022, 20 June 2022 through 25 June 2022
Available from: 2023-01-24 Created: 2023-01-24 Last updated: 2024-05-15. Bibliographically approved
Gogoulou, E., Boman, M., Ben Abdesslem, F., Isacsson, N., Kaldo, V. & Sahlgren, M. (2021). Predicting treatment outcome from patient texts: The case of internet-based cognitive behavioural therapy. In: EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference: . Paper presented at 16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021, 19 April 2021 through 23 April 2021 (pp. 575-580). Association for Computational Linguistics (ACL)
Predicting treatment outcome from patient texts: The case of internet-based cognitive behavioural therapy
2021 (English). In: EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, Association for Computational Linguistics (ACL), 2021, p. 575-580. Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the feasibility of applying standard text categorisation methods to patient text in order to predict treatment outcome in Internet-based cognitive behavioural therapy. The data set is unique in its detail and size for regular care for depression, social anxiety, and panic disorder. Our results indicate that there is a signal in the depression data, albeit a weak one. We also perform terminological and sentiment analyses, which confirm those results.
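
As a rough illustration of standard text categorisation in this setting, the sketch below fits a bag-of-words linear classifier. The texts and labels are fabricated placeholders (real patient texts are sensitive and not public), and the paper does not necessarily use these exact features.

```python
# Rough sketch of standard text categorisation in this setting: TF-IDF
# features plus a linear classifier. Texts and labels are FABRICATED
# placeholders; the paper's actual feature set may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I slept better this week", "everything still feels hopeless",
         "the exercises are helping", "I could not leave the house"]
labels = [1, 0, 1, 0]  # 1 = positive treatment outcome (placeholder coding)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the breathing exercises helped me sleep"]))
```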

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2021
Keywords
Computational linguistics, Sentiment analysis, Data set, Internet based, Social anxieties, Treatment outcomes, Patient treatment
National Category
Applied Psychology
Identifiers
urn:nbn:se:ri:diva-53529 (URN); 2-s2.0-85107290691 (Scopus ID); 9781954085022 (ISBN)
Conference
16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021, 19 April 2021 through 23 April 2021
Available from: 2021-06-17 Created: 2021-06-17 Last updated: 2024-05-15. Bibliographically approved
Carlsson, F., Gogoulou, E., Ylipää, E., Cuba Gyllensten, A. & Sahlgren, M. (2021). Semantic Re-tuning with Contrastive Tension. Paper presented at the International Conference on Learning Representations, 2021.
Semantic Re-tuning with Contrastive Tension
2021 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Extracting semantically useful natural language sentence representations from pre-trained deep neural networks such as Transformers remains a challenge. We first demonstrate that pre-training objectives impose a significant task bias onto the final layers of models, with a layer-wise survey of the Semantic Textual Similarity (STS) correlations for multiple common Transformer language models. We then propose a new self-supervised method called Contrastive Tension (CT) to counter such biases. CT frames the training objective as a noise-contrastive task between the final layer representations of two independent models, in turn making the final layer representations suitable for feature extraction. Results from multiple common unsupervised and supervised STS tasks indicate that CT outperforms previous State Of The Art (SOTA), and when combining CT with supervised data we improve upon previous SOTA results with large margins.
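
The CT objective described above can be sketched in a few lines: two independently updated encoders score sentence pairs by the dot product of their representations, trained with a binary noise-contrastive loss where an identical sentence fed to both models is the positive pair. The toy encoder, the mean pooling, and the positive/negative ratio below are simplified assumptions.

```python
# Toy sketch of the Contrastive Tension objective: two independent encoders,
# dot-product logits, binary noise-contrastive loss. In CT both encoders
# start from the SAME pre-trained checkpoint; here they are random toys, and
# pooling and the positive/negative ratio are simplified assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for a Transformer: mean-pools projected token embeddings."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (batch, seq, dim) -> sentence vectors: (batch, dim)
        return self.proj(token_embs).mean(dim=1)

def ct_loss(model1, model2, batch_a, batch_b, labels):
    """labels[i] = 1.0 iff batch_a[i] and batch_b[i] are the same sentence."""
    logits = (model1(batch_a) * model2(batch_b)).sum(dim=-1)  # dot products
    return F.binary_cross_entropy_with_logits(logits, labels)

enc1, enc2 = ToyEncoder(), ToyEncoder()
a = torch.randn(8, 16, 32)                        # fake token embeddings
b = torch.cat([a[:2], torch.randn(6, 16, 32)])    # 2 positives, 6 negatives
labels = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
ct_loss(enc1, enc2, a, b, labels).backward()      # updates both encoders
```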

National Category
Computer Sciences
Identifiers
urn:nbn:se:ri:diva-59816 (URN)
Conference
International Conference on Learning Representations, 2021
Available from: 2022-07-28 Created: 2022-07-28 Last updated: 2024-05-15
Identifiers
ORCID iD: orcid.org/0000-0002-9162-6433
