Can We Quantify Domainhood?: Exploring Measures to Assess Domain-Specificity in Web Corpora
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. ORCID iD: 0000-0002-5737-8149
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
RISE - Research Institutes of Sweden (2017-2019), ICT, SICS. Linköping University, Sweden.
Örebro University, Sweden.
2018 (English). In: DEXA 2018: Database and Expert Systems Applications, Springer, 2018, pp. 207-217. Conference paper, published paper (refereed)
Abstract [en]

Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have assessed the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
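Two of the measures named in the abstract can be illustrated with a minimal sketch. This record does not give the paper's exact formulations, so the definitions here are assumptions: burstiness is taken as the mean number of occurrences of a word in the documents that contain it, and Kullback–Leibler divergence is computed over add-one-smoothed unigram distributions. The function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter


def burstiness(docs, word):
    """Mean occurrences of `word` per document that contains it.

    Assumed definition for illustration; the paper may use another one.
    `docs` is a list of token lists.
    """
    counts = [doc.count(word) for doc in docs]
    containing = [c for c in counts if c > 0]
    if not containing:
        return 0.0
    return sum(containing) / len(containing)


def kl_divergence(corpus_p, corpus_q):
    """KL(P || Q) in bits between add-one-smoothed unigram distributions."""
    p_counts = Counter(tok for doc in corpus_p for tok in doc)
    q_counts = Counter(tok for doc in corpus_q for tok in doc)
    vocab = set(p_counts) | set(q_counts)
    # Add-one smoothing keeps every probability non-zero over the joint vocabulary.
    p_total = sum(p_counts.values()) + len(vocab)
    q_total = sum(q_counts.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + 1) / p_total
        q = (q_counts[w] + 1) / q_total
        kl += p * math.log2(p / q)
    return kl


# Hypothetical toy corpora: a "specialized" one and a "general" one.
specialized = [
    ["insulin", "dose", "insulin", "pump", "insulin"],
    ["the", "patient", "takes", "the", "dose"],
    ["insulin", "therapy", "the", "insulin"],
]
general = [
    ["the", "weather", "is", "the", "topic"],
    ["the", "dose", "of", "news"],
]

if __name__ == "__main__":
    for w in ("insulin", "the", "pump"):
        print(w, burstiness(specialized, w))
    print("KL(specialized || general):", kl_divergence(specialized, general))
```

In this toy data the domain word "insulin" clusters within the documents that mention it and so scores higher than the function word "the", which is the kind of contrast a burstiness-based measure relies on.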

Place, publisher, year, edition, pages
Springer, 2018. pp. 207-217
National subject category
Engineering and Technology
Identifiers
URN: urn:nbn:se:ri:diva-37623
DOI: 10.1007/978-3-319-99133-7_17
Scopus ID: 2-s2.0-85052001976
ISBN: 9783319991320 (print)
OAI: oai:DiVA.org:ri-37623
DiVA, id: diva2:1283326
Conference
International Conference on Database and Expert Systems Applications
Available from: 2019-01-29 Created: 2019-01-29 Last updated: 2020-01-31. Bibliographically approved

Open Access in DiVA

No full text in DiVA


Person

Santini, Marina
