An Evaluation of Bag-of-Concepts Representations in Automatic Text Classification
Number of Authors: 1
2005 (English). Independent thesis, advanced level (degree of Master (Two Years)). Student thesis.
Abstract [en]

Automatic text classification is the process of automatically assigning text documents to pre-defined document classes. Traditionally, documents are represented in the so-called bag-of-words model, in which each document is a vector whose dimensions correspond to words. In this project, a representation called bag-of-concepts has been evaluated. This representation is based on models that represent the meanings of words in a vector space; documents are then represented as linear combinations of the words' meaning vectors. The resulting vectors are high-dimensional and very dense. We have investigated two different methods for reducing the dimensionality of the document vectors: feature selection based on gain ratio, and random mapping. Two domains of text have been used: abstracts of medical articles in English and texts from Internet newsgroups. The former has been of primary interest, while the latter has been used for comparison. The classification has been performed using three different machine learning methods: Support Vector Machine, AdaBoost and Decision Stump. The results of the evaluation are difficult to interpret, but suggest that the new representation gives significantly better results on document classes for which the classical method fails. The two representations seem to give equal results on document classes for which the classical method works fine. Both dimensionality reduction methods are robust. Random mapping, while much less computationally expensive, shows greater variance.
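
The bag-of-concepts construction and the random mapping step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The toy concept vectors, the unweighted sum used as the linear combination, the Gaussian projection matrix, and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical 4-dimensional "meaning" vectors for a few words.
# In practice these would come from a word-space model; the values
# here are invented purely for illustration.
CONCEPTS = {
    "tumor": [1.0, 0.0, 0.5, 0.0],
    "cancer": [0.9, 0.1, 0.4, 0.0],
    "modem": [0.0, 1.0, 0.0, 0.7],
}


def bag_of_concepts(tokens, concept_vectors, dim):
    """Sum the meaning vectors of the tokens (assumed: unweighted sum).

    Tokens without a meaning vector are skipped.
    """
    doc = [0.0] * dim
    for tok in tokens:
        vec = concept_vectors.get(tok)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            doc[i] += x
    return doc


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim matrix of scaled Gaussian entries.

    Assumption: a dense Gaussian matrix; other random mapping schemes
    (for example sparse ternary matrices) are also common.
    """
    rng = random.Random(seed)
    scale = 1.0 / out_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Map a vector to the lower-dimensional space: matrix times vec."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]


# Build a document vector, then reduce it from 4 to 2 dimensions.
doc_vec = bag_of_concepts(["tumor", "cancer", "tumor", "the"], CONCEPTS, 4)
R = random_projection_matrix(4, 2, seed=42)
reduced = project(R, doc_vec)
```

Because the projection is linear, reducing the document vector gives the same result as reducing each word's meaning vector first and summing afterwards, which is one reason random mapping is cheap compared with feature selection that must score every dimension.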

Place, publisher, year, edition, pages
2005, 1. , 72 p.
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:ri:diva-22652
OAI: oai:DiVA.org:ri-22652
DiVA: diva2:1042217
Note
Report number: TRITA-NA-E05150, 2005.
Available from: 2016-10-31
Created: 2016-10-31
Bibliographically approved

Open Access in DiVA

No full text
