A Web Corpus for eCare: Collection, Lay Annotation and Learning - First Results
RISE - Research Institutes of Sweden, ICT, SICS. ORCID iD: 0000-0002-5737-8149
RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
RISE - Research Institutes of Sweden, ICT, SICS. Linköping University, Sweden.
Örebro University, Sweden.
2017 (English). Conference paper, Published paper (Refereed)
Abstract [en]

In this position paper, we put forward two claims: 1) it is possible to design a dynamic and extensible corpus without running into scalability problems; 2) it is possible to devise noise-resistant Language Technology applications without affecting performance. To support these claims, we describe the design, construction and limitations of a highly specialized medical web corpus, called eCare_Sv_01, and we present two experiments on lay-specialized text classification. eCare_Sv_01 is a small corpus of web documents written in Swedish about chronic diseases. The sublanguage used in each document has been labelled as “lay” or “specialized” by a lay annotator. The corpus is designed as a flexible text resource to which additional medical documents will be appended over time. The experiments show that standard classifiers reliably learn the lay-specialized labels assigned by the lay annotator. More specifically, Experiment 1 shows that scalability is not an issue when the size of the training datasets increases from 156 to 801 documents. Experiment 2 shows that the lay-specialized labels can be learned despite the many disturbing factors in the corpus, such as machine-translated documents and low-quality texts.

Place, publisher, year, edition, pages
2017. p. 71-78
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:ri:diva-37625
DOI: 10.15439/2017f531
OAI: oai:DiVA.org:ri-37625
DiVA, id: diva2:1283332
Conference
2nd International Workshop on Language Technologies and Applications (LTA'17), Prague, Czech Republic, 3-6 September, 2017
Available from: 2019-01-29. Created: 2019-01-29. Last updated: 2019-08-15. Bibliographically approved.

Authority record: Santini, Marina
