Can We Quantify Domainhood?: Exploring Measures to Assess Domain-Specificity in Web Corpora
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. ORCID iD: 0000-0002-5737-8149
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
Örebro University, Sweden.
2018 (English). In: DEXA 2018: Database and Expert Systems Applications, Springer, 2018, p. 207-217. Conference paper, Published paper (Refereed)
Abstract [en]

Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have assessed the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on assessing the domain representativeness of web corpora, and we claim that it is possible to assess their degree of domain-specificity, or domainhood. We present a case study in which we explore the effectiveness of different measures, namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness, to gauge domainhood. Our findings indicate that burstiness is the most suitable measure for singling out domain-specific words from a specialized corpus and for quantifying domainhood.
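
To make two of the measures named in the abstract concrete, here is a minimal Python sketch. It is illustrative only and does not reproduce the paper's implementation, which this record does not describe. It assumes one common definition of burstiness (total occurrences of a word divided by the number of documents that contain it) and computes Kullback–Leibler divergence over unigram distributions with add-alpha smoothing. The function names, the `alpha` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def burstiness(docs, word):
    """Mean occurrences of `word` per document that contains it.

    Assumed definition: collection frequency / document frequency.
    Each document is a list of tokens.
    """
    counts = [doc.count(word) for doc in docs]
    doc_freq = sum(1 for c in counts if c > 0)
    return sum(counts) / doc_freq if doc_freq else 0.0


def kl_divergence(p_tokens, q_tokens, alpha=1.0):
    """KL(P || Q) in bits between two unigram distributions.

    Add-alpha smoothing over the joint vocabulary keeps every
    probability non-zero (a choice made for this sketch).
    """
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        kl += p * math.log2(p / q)
    return kl


# Toy "specialized corpus": three tokenized documents (made up).
docs = [
    ["insulin", "insulin", "insulin", "the"],
    ["the", "patient", "waits"],
    ["the", "clinic", "opens"],
]

bursty = burstiness(docs, "insulin")   # concentrated in one document
spread = burstiness(docs, "the")       # spread evenly across documents
```

In this toy data the content word clusters in a single document while the function word appears once everywhere, so the content word gets the higher burstiness score under the assumed definition.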

Place, publisher, year, edition, pages
Springer, 2018. p. 207-217
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:ri:diva-37623
DOI: 10.1007/978-3-319-99133-7_17
Scopus ID: 2-s2.0-85052001976
ISBN: 9783319991320 (print)
OAI: oai:DiVA.org:ri-37623
DiVA, id: diva2:1283326
Conference
International Conference on Database and Expert Systems Applications
Available from: 2019-01-29 Created: 2019-01-29 Last updated: 2020-01-31 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Santini, Marina
