Assembling a Balanced Corpus from the Internet
RISE, Swedish ICT, SICS. ORCID iD: 0000-0003-4042-4919
Number of authors: 3
1998 (English)
Conference paper, Published paper (Refereed)
Abstract [en]

For empirically oriented textual research it is crucial to have materials available for extraction of statistics, training probabilistic algorithms, and testing hypotheses about language and language processing in general.

In recent years, the awareness that text is not just text, but that texts come in several forms, has spread from the more theoretical and literary subfields of linguistics to the more practically oriented fields of information retrieval and natural language processing. As a consequence, several test collections available for research explicitly attempt to cover many or most well-established textual genres, or functional styles, in well-balanced proportions (Francis and Kucera, 1982; Källgren, 1990).

The creation of such a collection is a complex matter in several respects. Our research area is building retrieval tools for the Internet, and thus, for our purposes, the choice of genres to include is one of the more central problems: there is no well-established genre palette for Internet materials. To find materials to experiment with, we need to create them in a form suitable for our purposes. This is a double-edged problem, involving both vaguely expressed user expectations and the establishment of categories using large numbers of features which, taken singly, have low predictive and explanatory power. This paper gives an outline of the methodology we use for determining which genres to include.

Place, publisher, year, edition, pages
1998, 5.
National subject category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:ri:diva-20983
OAI: oai:DiVA.org:ri-20983
DiVA, id: diva2:1041017
Conference
11th Nordic Conference of Computational Linguistics
Project
Easify
Available from: 2016-10-31 Created: 2016-10-31 Last updated: 2025-09-23 (bibliographically approved)

By author/editor
Karlgren, Jussi
By organisation
SICS