Towards a Quality Assessment of Web Corpora for Language Technology Applications
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. ORCID iD: 0000-0002-5737-8149
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden. ORCID iD: 0000-0001-5702-7720
Linköping University, Sweden.
2018 (English). In: Technological Innovation for Specialized Linguistic Domains: Languages for Digital Lives and Cultures: Proceedings of TISLID’18, 2018. Conference paper, Published paper (Refereed)
Abstract [en]

In the experiments presented in this paper, we focus on the creation and evaluation of domain-specific web corpora. To this end, we propose a two-step approach: (1) the automatic extraction and evaluation of term seeds from personas and use cases/scenarios; (2) the creation and evaluation of domain-specific web corpora bootstrapped with the term seeds automatically extracted in step 1. The results are encouraging and show that (1) it is possible to create a fairly accurate term extractor for relatively short narratives; (2) it is straightforward to evaluate a quality such as the domain-specificity of web corpora using well-established metrics.

Place, publisher, year, edition, pages
2018.
Keywords [en]
corpus evaluation, term extraction, log-likelihood, rank correlation, Kullback-Leibler distance
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:ri:diva-37624
OAI: oai:DiVA.org:ri-37624
DiVA, id: diva2:1283327
Conference
Technological Innovation for Specialized Linguistic Domains (TISLID 18), Languages for digital lives and cultures, Ghent, Belgium, 24-26 May 2018.
Available from: 2019-01-29. Created: 2019-01-29. Last updated: 2023-12-05. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Santini, Marina; Lind, Leili
