1 - 20 of 20
  • 1.
    Blomqvist, Eva
    et al.
    Linköping University, Sweden.
    Alirezaie, Marjan
    Örebro University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Towards causal knowledge graphs - position paper. 2020. In: CEUR Workshop Proceedings, CEUR-WS, 2020, p. 58-62. Conference paper (Refereed)
    Abstract [en]

    In this position paper, we highlight that being able to analyse the cause-effect relationships for determining the causal status among a set of events is an essential requirement in many contexts and argue that it cannot be overlooked when building systems targeting real-world use cases. This is especially true for medical contexts where the understanding of the cause(s) of a symptom, or observation, is of vital importance. However, most approaches purely based on Machine Learning (ML) do not explicitly represent and reason with causal relations, and may therefore mistake correlation for causation. In the paper, we therefore argue for an approach to extract causal relations from text, and represent them in the form of Knowledge Graphs (KG), to empower downstream ML applications, or AI systems in general, with the ability to distinguish correlation from causation and reason with causality in an explicit manner. So far, the bottlenecks in KG creation have been scalability and accuracy of automated methods, hence, we argue that two novel features are required from methods for addressing these challenges, i.e. (i) the use of Knowledge Patterns to guide the KG generation process towards a certain resulting knowledge structure, and (ii) the use of a semantic referee to automatically curate the extracted knowledge. We claim that this will be an important step forward for supporting interpretable AI systems, and integrating ML and knowledge representation approaches, such as KGs, which should also generalise well to other types of relations, apart from causality. © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

  • 2.
    Brännvall, Rickard
    et al.
    RISE Research Institutes of Sweden, Digital Systems, Data Science.
    Forsgren, Henrik
    RISE Research Institutes of Sweden, Digital Systems, Data Science.
    Linge, Helena
    RISE Research Institutes of Sweden, Digital Systems, Data Science.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Salehi, Alireza
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Rahimian, Fatemeh
    RISE Research Institutes of Sweden, Digital Systems, Data Science.
    Homomorphic encryption enables private data sharing for digital health: Winning entry to the Vinnova innovation competition Vinter 2021-22. 2022. In: 34th Workshop of the Swedish Artificial Intelligence Society, SAIS 2022, Institute of Electrical and Electronics Engineers Inc., 2022. Conference paper (Refereed)
    Abstract [en]

    People living with type 1 diabetes often use several apps and devices that help them collect and analyse data for better monitoring and management of their disease. When such health related data is analysed in the cloud, one must always carefully consider privacy protection and adhere to laws regulating the use of personal data. In this paper we present our experience at the pilot Vinter competition 2021-22 organised by Vinnova. The competition focused on digital services that handle sensitive diabetes related data. The architecture that we proposed for the competition is discussed in the context of a hypothetical cloud-based service that calculates diabetes self-care metrics under strong privacy preservation. It is based on Fully Homomorphic Encryption (FHE) - a technology that makes computation on encrypted data possible. Our solution promotes safe key management and data life-cycle control. Our benchmarking experiment demonstrates execution times that scale well for the implementation of personalised health services. We argue that this technology has great potential for AI-based health applications, opens up new markets for third-party providers of such services, and will ultimately promote patient health and a trustworthy digital society.
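    The paper's architecture uses fully homomorphic encryption; as a minimal illustration of the underlying principle (a server aggregating health metrics without ever seeing the plaintext), here is a toy additively homomorphic Paillier scheme in pure Python. This is not the FHE scheme the paper uses, the primes are tiny and hard-coded, and the "minutes-in-range" metric is a hypothetical example; a real deployment would use an FHE library.

    ```python
    import math
    import random

    def l_func(x, n):
        # L(x) = (x - 1) // n, used in Paillier decryption
        return (x - 1) // n

    def keygen(p, q):
        # Toy key generation with caller-supplied primes (INSECURE: demo only).
        n = p * q
        lam = math.lcm(p - 1, q - 1)
        g = n + 1  # standard simple choice of generator
        mu = pow(l_func(pow(g, lam, n * n), n), -1, n)
        return (n, g), (lam, mu)

    def encrypt(pub, m):
        n, g = pub
        r = random.randrange(1, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

    def decrypt(pub, priv, c):
        n, _ = pub
        lam, mu = priv
        return (l_func(pow(c, lam, n * n), n) * mu) % n

    def he_add(pub, c1, c2):
        # Homomorphic addition: the product of two ciphertexts
        # decrypts to the sum of the underlying plaintexts.
        n, _ = pub
        return (c1 * c2) % (n * n)

    # Hypothetical self-care metric: total minutes-in-range reported by
    # three devices, summed server-side on encrypted values only.
    pub, priv = keygen(1789, 1861)
    readings = [310, 295, 402]
    ciphertexts = [encrypt(pub, m) for m in readings]
    total_ct = ciphertexts[0]
    for c in ciphertexts[1:]:
        total_ct = he_add(pub, total_ct, c)
    print(decrypt(pub, priv, total_ct))  # 1007 = 310 + 295 + 402
    ```

    The server only ever handles `ciphertexts` and `total_ct`; the private key stays with the patient, which is the key-management property the abstract emphasises.
    
    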

  • 3.
    Capshaw, Riley
    et al.
    Linköping University, Sweden.
    Blomqvist, Eva
    Linköping University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Alirezaie, Marjan
    Örebro University, Sweden.
    BERT is as Gentle as a Sledgehammer: Too Powerful or Too Blunt? It Depends on the Benchmark. 2021. Conference paper (Other academic)
    Abstract [en]

    In this position statement, we wish to contribute to the discussion about how to assess quality and coverage of a model.

    We believe that BERT's prominence as a single-step pipeline for contextualization and classification highlights the need for benchmarks to evolve concurrently with models. Much recent work has touted BERT's raw power for solving natural language tasks, so we used a 12-layer uncased BERT pipeline with a linear classifier as a quick-and-dirty model to score well on the SemEval 2010 Task 8 dataset for relation classification between nominals. We initially expected there to be significant enough bias from BERT's training to influence downstream tasks, since it is well-known that biased training corpora can lead to biased language models (LMs). Gender bias is the most common example, where gender roles are codified within language models. To handle such training data bias, we took inspiration from work in the field of computer vision. Tang et al. (2020) mitigate human reporting bias over the labels of a scene graph generation task using a form of causal reasoning based on counterfactual analysis. They extract the total direct effect of the context image on the prediction task by "blanking out" detected objects, intuitively asking "What if these objects were not here?" If the system still predicts the same label, then the original prediction is likely caused by bias in some form. Our goal was to remove any effects from biases learned during BERT's pre-training, so we analyzed total effect (TE) instead. However, across several experimental configurations we found no noticeable effects from using TE analysis. One disappointing possibility was that BERT might be resistant to causal analysis due to its complexity. Another was that BERT is so powerful (or blunt?) that it can find unanticipated trends in its input, rendering any human-generated causal analysis of its predictions useless. We nearly concluded that what we expected to be delicate experimentation was more akin to trying to carve a masterpiece sculpture with a self-driven sledgehammer. 
We then found related work where BERT fooled humans by exploiting unexpected characteristics of a benchmark. When we used BERT to predict a relation for random words in the benchmark sentences, it guessed the same label as it would have for the corresponding marked entities roughly half of the time. Since the task had nineteen roughly-balanced labels, we expected much less consistency. This finding repeated across all pipeline configurations; BERT was treating the benchmark as a sequence classification task! Our final conclusion was that the benchmark is inadequate: all sentences appeared exactly once with exactly one pair of entities, so the task was equivalent to simply labeling each sentence. We passionately claim from our experience that the current trend of using larger and more complex LMs must include concurrent evolution of benchmarks. We as researchers need to be diligent in keeping our tools for measuring as sophisticated as the models being measured, as any scientific domain does.
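    The diagnostic described above (re-predicting with random words in place of the marked entities and checking whether the label changes) can be packaged as a small harness. The sketch below uses a stub predictor that deliberately ignores the entity arguments, reproducing the sentence-classification behaviour the authors observed; the function names and 19-label setup are illustrative, not the authors' code.

    ```python
    import random

    def label_agreement(predict, examples, seed=0):
        """Fraction of examples for which the model predicts the same label
        for the true entity pair and for a random pair of words from the
        same sentence. A value near 1.0 suggests the model is labelling the
        sentence, not the entity pair."""
        rng = random.Random(seed)
        same = 0
        for sentence, e1, e2 in examples:
            r1, r2 = rng.sample(sentence.split(), 2)
            if predict(sentence, e1, e2) == predict(sentence, r1, r2):
                same += 1
        return same / len(examples)

    # Stub predictor that, like the behaviour reported above, keys on the
    # sentence alone and ignores which words are marked as entities.
    def sentence_only_predict(sentence, e1, e2):
        return hash(sentence) % 19  # one of 19 pseudo-labels

    examples = [
        ("The machine produces the component", "machine", "component"),
        ("The fire was caused by a short circuit", "fire", "circuit"),
    ]
    print(label_agreement(sentence_only_predict, examples))  # 1.0
    ```

    On SemEval 2010 Task 8, where each sentence carries exactly one entity pair, a genuinely entity-sensitive model should score well below 1.0 on this probe; the roughly 0.5 agreement the authors report is what motivated their conclusion about the benchmark.
    
    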

  • 4.
    Danielsson, Benjamin
    et al.
    Linköping University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Lundberg, Peter
    Linköping University, Sweden.
    Al-Abasse, Yosef
    Linköping University, Sweden.
    Jönsson, Arne
    Linköping University, Sweden.
    Eneling, Emma
    Linköping University, Sweden.
    Stridsman, Magnus
    Linköping University, Sweden.
    Classifying Implant-Bearing Patients via their Medical Histories: a Pre-Study on Swedish EMRs with Semi-Supervised GAN-BERT. 2022. In: 2022 Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), 2022, p. 5428-5435. Conference paper (Refereed)
    Abstract [en]

    In this paper, we compare the performance of two BERT-based text classifiers whose task is to classify patients (more precisely, their medical histories) as having or not having implant(s) in their body. One classifier is a fully-supervised BERT classifier. The other one is a semi-supervised GAN-BERT classifier. Both models are compared against a fully-supervised SVM classifier. Since fully-supervised classification is expensive in terms of data annotation, with the experiments presented in this paper, we investigate whether we can achieve a competitive performance with a semi-supervised classifier based only on a small amount of annotated data. Results are promising and show that the semi-supervised classifier has a competitive performance when compared with the fully-supervised classifier. © licensed under CC-BY-NC-4.0.
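    The fully-supervised SVM baseline mentioned above could, for instance, be set up as a TF-IDF plus linear-SVM pipeline. The sketch below is a rough stand-in on synthetic English toy data (the actual Swedish EMR data is not public, and the real features and hyperparameters are not given in the abstract):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Tiny synthetic stand-in for the (non-public) medical histories:
    # label 1 = history mentions an implant, 0 = it does not.
    texts = [
        "patient has a pacemaker implanted in 2015",
        "hip prosthesis inserted after fracture",
        "stent placed in coronary artery",
        "routine checkup, no remarks",
        "patient complains of seasonal allergies",
        "blood pressure slightly elevated, diet advised",
    ]
    labels = [1, 1, 1, 0, 0, 0]

    # TF-IDF features feeding a linear support-vector classifier.
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["implanted pacemaker battery replaced"])[0])  # 1
    ```

    The point of the paper is that GAN-BERT approaches this kind of supervised performance while needing far fewer labelled examples, since the GAN component also learns from unlabelled histories.
    
    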

  • 5.
    Falkenjack, Johan
    et al.
    Linköping University, Sweden.
    Santini, Marina
    RISE - Research Institutes of Sweden, ICT, SICS.
    Jönsson, Arne
    Linköping University, Sweden.
    An Exploratory Study on Genre Classification using Readability Features. 2016. In: The Sixth Swedish Language Technology Conference (SLTC), 2016. Conference paper (Refereed)
    Abstract [en]

    We present a preliminary study that explores whether text features used for readability assessment are reliable genre-revealing features. We empirically explore the difference between genre and domain. We carry out two sets of experiments with both supervised and unsupervised methods. Findings on the Swedish national corpus (the SUC) show that readability cues are good indicators of genre variation.
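    One concrete readability feature of the kind used in such work is LIX (läsbarhetsindex), the standard Swedish readability measure: average sentence length plus the percentage of long words (more than six letters). The abstract does not list the paper's actual feature set, so the sketch below is only an illustration of the feature type:

    ```python
    import re

    def lix(text):
        """LIX readability index: words/sentences + 100 * long_words/words,
        where a long word has more than six letters."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"\w+", text)
        long_words = [w for w in words if len(w) > 6]
        return len(words) / len(sentences) + 100 * len(long_words) / len(words)

    easy = "Jag har en katt. Den är grå. Den sover nu."
    hard = ("Myndigheten ansvarar för samhällsomfattande "
            "informationsförsörjning och infrastrukturutveckling.")
    print(lix(easy) < lix(hard))  # True
    ```

    Features like this are attractive genre cues precisely because they capture how a text is written rather than what it is about, which is the genre/domain contrast the study explores.
    
    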

  • 6.
    Jerdhaf, O
    et al.
    Linköping University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Lundberg, P
    Linköping University, Sweden.
    Karlsson, A
    Linköping University, Sweden.
    Jönsson, A
    Linköping University, Sweden.
    Implant Term Extraction from Swedish Medical Records – Phase 1: Lessons Learned. 2021. Conference paper (Other academic)
  • 7.
    Jerdhaf, Oskar
    et al.
    Linköping University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Lundberg, Peter
    Linköping University, Sweden.
    Karlsson, Anette
    Linköping University, Sweden.
    Jönsson, Arne
    Linköping University, Sweden.
    Focused Terminology Extraction for CPSs: The Case of "Implant Terms" in Electronic Medical Records. 2021. In: 2021 IEEE International Conference on Communications Workshops (ICC Workshops), 2021. Conference paper (Refereed)
    Abstract [en]

    Language Technology is an essential component of many Cyber-Physical Systems (CPSs) because specialized linguistic knowledge is indispensable to prevent fatal errors. We present the case of automatic identification of implant terms. The need for automatic identification of implant terms stems from safety concerns, because patients who have an implant may or may not be eligible for Magnetic Resonance Imaging (MRI). Normally, MRI scans are safe. However, in some cases an MRI scan may not be recommended. It is important to know whether a patient has an implant, because MRI scanning is incompatible with some implants. At present, the process of ascertaining whether a patient could be at risk is lengthy, manual, and based on the specialized knowledge of medical staff. We argue that this process can be sped up, streamlined and made safer by sieving through patients’ medical records. In this paper, we explore how to discover implant terms in electronic medical records (EMRs) written in Swedish with an unsupervised approach. To this aim we use BERT, a state-of-the-art deep learning algorithm based on pre-trained word embeddings. We observe that BERT discovers a solid proportion of terms that are indicative of implants.

  • 8.
    Jerdhaf, Oskar
    et al.
    Linköping University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Lundberg, Peter
    Linköping University, Sweden.
    Karlsson, Anette
    Linköping University, Sweden.
    Jönsson, Arne
    Linköping University, Sweden.
    Implant Terms: Focused Terminology Extraction with Swedish BERT - Preliminary Results. 2020. Conference paper (Refereed)
    Abstract [en]

    Certain implants are imperative to detect before MRI scans. However, implant terms, like ‘pacemaker’ or ‘stent’, are sparse and difficult to identify in noisy and hastily written electronic medical records (EMRs). In this paper, we explore how to discover implant terms in Swedish EMRs with an unsupervised approach. To this purpose, we use BERT, a state-of-the-art deep learning algorithm, and fine-tune a model built on pre-trained Swedish BERT. We observe that BERT discovers a solid proportion of indicative implant terms.

  • 9.
    Rennes, Evelina
    et al.
    Linköping University, Sweden.
    Santini, Marina
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Jönsson, Arne
    Linköping University, Sweden.
    The Swedish Simplification Toolkit: Designed with Target Audiences in Mind. 2022. In: 2nd Workshop on Tools and Resources for REAding DIfficulties, READI 2022 - collocated with the International Conference on Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), 2022, p. 31-38. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present the current version of The Swedish Simplification Toolkit. The toolkit includes computational and empirical tools that have been developed over the years to explore a still neglected area of NLP, namely the simplification of “standard” texts to meet the needs of target audiences. Target audiences, such as people affected by dyslexia, aphasia, autism, but also children and second language learners, require different types of text simplification and adaptation. For example, while individuals with aphasia have difficulties in reading compounds (such as arbetsmarknadsdepartement, eng. ministry of employment), second language learners struggle with cultural-specific vocabulary (e.g. konflikträdd, eng. afraid of conflicts). The toolkit allows users to select the types of simplification that meet the specific needs of the target audience they belong to. The Swedish Simplification Toolkit is one of the first attempts to overcome the one-fits-all approach that is still dominant in Automatic Text Simplification, and proposes a set of computational methods that, used individually or in combination, may help individuals reduce reading (and writing) difficulties.

  • 10.
    Santini, Marina
    RISE - Research Institutes of Sweden, ICT, SICS.
    Improving Cross-Lingual Enterprise Information Access. 2014. Conference paper (Refereed)
    Abstract [en]

    In this position paper it is argued that cross-lingual enterprise information access is underdeveloped and underexploited. Some use cases are presented. It is pointed out that very little of the extensive research findings in cross-lingual and multilingual information retrieval have penetrated enterprise search. It is claimed that with little investment in R&D, it would be relatively easy to create a re-usable cross-lingual enterprise search module to automatically and reliably translate search queries (one of the most widely used approaches in Cross-Lingual Information Retrieval) from a foreign language to a target language in order to retrieve relevant documents.

  • 11.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden, ICT, SICS.
    Danielsson, Benjamin
    Linköping University, Sweden.
    Jönsson, Arne
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Introducing the notion of ‘contrast’ features for language technology. 2019. In: Database and Expert Systems Applications (DEXA 2019), Springer Verlag, 2019, p. 189-198. Conference paper (Refereed)
    Abstract [en]

    In this paper, we explore whether there exist ‘contrast’ features that help recognize if a text variety is a genre or a domain. We carry out our experiments on the text varieties that are included in the Swedish national corpus, called Stockholm-Umeå Corpus or SUC, and build several text classification models based on text complexity features, grammatical features, bag-of-words features and word embeddings. Results show that text complexity features and grammatical features systematically perform better on genres rather than on domains. This indicates that these features can be used as ‘contrast’ features because, when in doubt about the nature of a text category, they help bring it to light.

  • 12.
    Santini, Marina
    et al.
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Jerdhaf, O
    Linköping University, Sweden.
    Karlsson, A
    Linköping University, Sweden.
    Eneling, E
    Linköping University, Sweden.
    Stridsman, M
    Linköping University, Sweden.
    Jönsson, A
    Linköping University, Sweden.
    Lundberg, P
    Linköping University, Sweden; CMIV, Sweden.
    The Potential of AI-Based Clinical Text Mining to Improve Patient Safety: the Case of Implant Terms and Patient Journals. 2021. Conference paper (Other academic)
  • 13.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden, ICT, SICS.
    Jönsson, Arne
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Nyström, Mikael
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Alirezaie, Marjan
    Örebro University, Sweden.
    A Web Corpus for eCare: Collection, Lay Annotation and Learning - First Results. 2017. Conference paper (Refereed)
    Abstract [en]

    In this position paper, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running the risk of getting into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without affecting performance. To support our claims, we describe the design, construction and limitations of a very specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish. The corpus contains documents about chronic diseases. The sublanguage used in each document has been labelled as “lay” or “specialized” by a lay annotator. The corpus is designed as a flexible text resource, where additional medical documents will be appended over time. Experiments show that the lay-specialized labels assigned by the lay annotator are reliably learned by standard classifiers. More specifically, Experiment 1 shows that scalability is not an issue when increasing the size of the datasets to be learned from 156 up to 801 documents. Experiment 2 shows that lay-specialized labels can be learned regardless of the large amount of disturbing factors, such as machine translated documents or low-quality texts that are numerous in the corpus.

  • 14.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS.
    Jönsson, Arne
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Strandqvist, Wiktor
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Cederblad, Gustav
    Linköping University, Sweden.
    Nyström, Mikael
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Alirezaie, Marjan
    Örebro University, Sweden.
    Lind, Leili
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society. Linköping University, Sweden.
    Blomqvist, Eva
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Linden, Maria
    Mälardalen University, Sweden.
    Kristoffersson, Annica
    Örebro University, Sweden.
    Designing an Extensible Domain-Specific Web Corpus for “Layfication”: A Case Study in eCare at Home, Chapter 6. 2019. In: Cyber-Physical Systems for Social Applications / [ed] Maya Dimitrova and Hiroaki Wagatsuma, Hershey PA, USA 17033: Engineering Science Reference, 2019, p. 98-155. Chapter in book (Other academic)
    Abstract [en]

    In the era of data-driven science, corpus-based language technology is an essential part of cyber-physical systems. In this chapter, the authors describe the design and the development of an extensible domain-specific web corpus to be used in a distributed social application for the care of the elderly at home. The domain of interest is the medical field of chronic diseases. The corpus is conceived as a flexible and extensible textual resource, where additional documents and additional languages will be appended over time. The main purpose of the corpus is to be used for building and training language technology applications for the “layfication” of the specialized medical jargon. “Layfication” refers to the automatic identification of more intuitive linguistic expressions that can help laypeople (e.g., patients, family caregivers, and home care aides) understand medical terms, which often appear opaque. Exploratory experiments are presented and discussed.

  • 15.
    Santini, Marina
    et al.
    RISE Research Institutes of Sweden, Digital Systems, Prototyping Society.
    Rennes, Evelina
    Linköping University, Sweden.
    Holmer, Daniel
    Linköping University, Sweden.
    Jönsson, Arne
    Linköping University, Sweden.
    Human-in-the-Loop: Where Does Text Complexity Lie? 2021. Conference paper (Other academic)
    Abstract [en]

    In this position statement, we would like to contribute to the discussion about how to assess quality and coverage of a model. In this context, we verbalize the need of linguistic features’ interpretability and the need of profiling textual variations. These needs are triggered by the necessity to gain insights into intricate patterns of human communication. Arguably, the functional and linguistic interpretation of these communication patterns contributes to keeping humans’ needs in the loop, thus demoting the myth of powerful but dehumanized Artificial Intelligence. The desideratum to open up the “black boxes” of AI-based machines has become compelling. Recent research has focussed on how to make sense of and popularize deep learning models and has explored how to “probe” these models to understand how they learn. The BERTology science is actively and diligently digging into BERT’s complex clockwork. However, much remains to be unearthed: “BERTology has clearly come a long way, but it is fair to say we still have more questions than answers about how BERT works”. It is therefore not surprising that add-on tools are being created to inspect pre-trained language models with the aim to cast some light on the “interpretation of pre-trained models in the context of downstream tasks and domain-specific data”. Here we do not propose any new tool, but we try to formulate and exemplify the problem by taking the case of text simplification/text complexity. When we compare a standard text and an easy-to-read text (e.g. lättsvenska or simple English) we wonder: where does text complexity lie? Can we pin it down? According to Simple English Wikipedia, “(s)imple English is similar to English, but it only uses basic words. We suggest that articles should use only the 1,000 most common and basic words in English. They should also use only simple grammar and shorter sentences.” This characterization of a simplified text does not provide much linguistic insight: what is meant by simple grammar? Linguistic insights are also missing from state-of-the-art NLP models for text simplification, since these models are basically monolingual neural machine translation systems that take a standard text and “translate” it into a simplified type of (sub)language. We do not gain any linguistic understanding of what is being simplified and why. We just get the task done (which is of course good). We know for sure that standard and easy-to-read texts differ in a number of ways and we are able to use BERT to create classifiers that discriminate the two varieties. But how are linguistic features re-shuffled to generate a simplified text from a standard one? With traditional statistical approaches, such as Biber’s MDA (based on factor analysis) we get an idea of how linguistic features co-occur and interact in different text types and why. Since pre-trained language models are more powerful than traditional statistical models, like factor analysis, we would like to see more research on "disclosing the layers" so that we can understand how different co-occurrences of linguistic features contribute to the make-up of specific varieties of texts, like simplified vs standard texts. Would it be possible to update the iconic example

  • 16.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden, ICT, SICS.
    Strandqvist, Wiktor
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Jönsson, Arne
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Profiling Domain Specificity of Specialized Web Corpora using Burstiness: Explorations and Open Issues. 2018. In: Proceedings of SLTC2018, 2018. Conference paper (Other academic)
  • 17.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS.
    Strandqvist, Wiktor
    Jönsson, Arne
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS.
    Profiling specialized web corpus qualities: A progress report on "Domainhood". 2019. In: Argentinian Journal of Applied Linguistics, ISSN 2314-3576, Vol. 7, no 7. Article in journal (Refereed)
  • 18.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden, ICT, SICS.
    Strandqvist, Wiktor
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Nyström, Mikael
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Alirezaie, Marjan
    Örebro University, Sweden.
    Jönsson, Arne
    RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
    Can We quantify domainhood?: Exploring measures to assess domain-specificity in web corpora. 2018. In: Commun. Comput. Info. Sci., 2018, p. 207-217. Conference paper (Refereed)
    Abstract [en]

    Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Whitney-Wilcoxon Test, Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
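    There are several formulations of burstiness in the literature; one simple variant scores a term by its average frequency in just the documents where it occurs, on the intuition that domain terms are absent from most documents but frequent where they do appear. A minimal sketch under that assumption (not necessarily the exact formulation used in the paper):

    ```python
    from collections import Counter

    def burstiness(docs):
        """Map each term to its mean within-document frequency, computed
        only over the documents that contain it (one simple variant)."""
        totals, doc_freq = Counter(), Counter()
        for doc in docs:
            counts = Counter(doc.lower().split())
            for term, c in counts.items():
                totals[term] += c
                doc_freq[term] += 1
        return {t: totals[t] / doc_freq[t] for t in totals}

    corpus = [
        "stent stent stent inserted",          # domain term, bursty
        "the patient was discharged",
        "the report was filed the same day",
    ]
    scores = burstiness(corpus)
    print(scores["stent"] > scores["the"])  # True
    ```

    Note how the function term "stent" outscores the evenly spread function word "the": frequency alone would not separate them as cleanly, which is the property that makes burstiness attractive for domainhood profiling.
    
    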

  • 19.
    Santini, Marina
    et al.
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS.
    Strandqvist, Wiktor
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Nyström, Mikael
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Alirezaie, Marjan
    Örebro University, Sweden.
    Jönsson, Arne
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Can We Quantify Domainhood?: Exploring Measures to Assess Domain-Specificity in Web Corpora. 2018. In: DEXA 2018: Database and Expert Systems Applications, Springer, 2018, p. 207-217. Conference paper (Refereed)
    Abstract [en]

    Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Whitney-Wilcoxon Test, Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.

  • 20.
    Strandqvist, Wiktor
    et al.
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Santini, Marina
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS.
    Lind, Leili
    RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
    Jönsson, Arne
    Linköping University, Sweden.
    Towards a Quality Assessment of Web Corpora for Language Technology Applications. 2018. In: Technological Innovation for Specialized Linguistic Domains: Languages for Digital Lives and Cultures, Proceedings of TISLID’18, 2018. Conference paper (Refereed)
    Abstract [en]

    In the experiments presented in this paper we focus on the creation and evaluation of domain-specific web corpora. To this purpose, we propose a two-step approach, namely (1) the automatic extraction and evaluation of term seeds from personas and use cases/scenarios; (2) the creation and evaluation of domain-specific web corpora bootstrapped with the term seeds automatically extracted in step 1. Results are encouraging and show that: (1) it is possible to create a fairly accurate term extractor for relatively short narratives; (2) it is straightforward to evaluate a quality such as domain-specificity of web corpora using well-established metrics.
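    Step 1 (extracting term seeds from short persona narratives) can be approximated with a simple frequency-ratio keyword extractor: rank words by how over-represented they are in the narrative relative to a background corpus. The function name, the smoothing, and the toy texts below are all illustrative assumptions, not the paper's actual method:

    ```python
    import re
    from collections import Counter

    def seed_terms(narrative, background, top_n=3):
        """Rank words by their relative frequency in the narrative divided
        by their add-one-smoothed relative frequency in a background text."""
        fg = Counter(re.findall(r"\w+", narrative.lower()))
        bg = Counter(re.findall(r"\w+", background.lower()))
        fg_total, bg_total = sum(fg.values()), sum(bg.values())
        score = {w: (fg[w] / fg_total) / ((bg[w] + 1) / (bg_total + len(bg)))
                 for w in fg}
        return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])][:top_n]

    persona = ("Anna is 78 and lives alone with chronic heart failure; "
               "a nurse checks her heart monitor readings every week")
    background = ("the quick brown fox jumps over the lazy dog and "
                  "the dog runs away over the hill")
    print(seed_terms(persona, background))
    ```

    Terms extracted this way ("heart" ranks first here, being repeated in the narrative and absent from the background) would then be fed as seeds to a web-corpus bootstrapping tool in step 2.
    
    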
