Can we quantify domainhood? Exploring measures to assess domain-specificity in web corpora
2018 (English). In: Commun. Comput. Info. Sci., 2018, p. 207-217. Conference paper, Published paper (Refereed)
Abstract [en]
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
Place, publisher, year, edition, pages
2018. p. 207-217
Keywords [en]
Big data, Data mining, Expert systems, Search engines, Domain specific, Domain specificity, Kendall correlation coefficients, Log likelihood, Modern languages, Specialized corpora, Web Corpora, Wilcoxon test, Information management
National Category
Natural Sciences
Identifiers
URN: urn:nbn:se:ri:diva-35898
DOI: 10.1007/978-3-319-99133-7_17
Scopus ID: 2-s2.0-85052001976
ISBN: 9783319991320 (print)
OAI: oai:DiVA.org:ri-35898
DiVA, id: diva2:1261499
Conference
International Conference on Database and Expert Systems Applications, DEXA 2018: Database and Expert Systems Applications, pp. 207-217
Note
Funding text: Acknowledgement. This research was supported by E-care@home, a “SIDUS - Strong Distributed Research Environment” project funded by the Swedish Knowledge Foundation. Project website: http://ecareathome.se/.
2018-11-07, 2018-11-07, 2019-06-18. Bibliographically approved