This paper presents work on a method to detect names of proteins in running text. The detection and categorisation of named entities, such as names of people, organisations and places, in classical MUC-style information extraction tasks (Borthwick 1998) might be regarded as a solved problem. But names of proteins present a slightly different challenge because of their varying structural characteristics and the specifics of the text domains in which they appear. This certainly holds true for other biological substances, and probably for many other kinds of terminology as well. We present the different steps involved in our approach to this problem, and show how combinations of them influence recall and precision.
This paper presents work on a method to detect names of proteins in running text. Our system - Yapex - uses a combination of lexical and syntactic knowledge, heuristic filters and a local dynamic dictionary. The syntactic information given by a general-purpose off-the-shelf parser supports the correct identification of the boundaries of protein names, and the local dynamic dictionary finds protein names in positions incompletely analysed by the parser. We present the different steps involved in our approach to protein tagging, and show how combinations of them influence recall and precision. We evaluate the system on a corpus of MEDLINE abstracts and compare it with the KeX system (Fukuda et al., 1998) along four different notions of correctness.
A prerequisite for all higher-level information extraction tasks is the identification of unknown names in text. Today, when large corpora can consist of billions of words, it is of utmost importance to develop accurate techniques for the automatic detection, extraction and categorization of named entities in these corpora. Although named entity recognition might be regarded as a solved problem in some domains, it still poses a significant challenge in others. In this work we focus on one of the more difficult tasks, the identification of protein names in text. This task presents several interesting difficulties because of the named entities' varying structural characteristics, their sometimes unclear status as names, the lack of common standards and fixed nomenclatures, and the specifics of the texts in the molecular biology domain in which they appear. We describe how we approached these and other difficulties in the implementation of Yapex, a system for the automatic identification of protein names in text. We also evaluate Yapex under four different notions of correctness and compare its performance to that of another publicly available system for protein name recognition.
The lack of persons trained in computational linguistic methods is a severe obstacle to making the Internet and computers accessible to people all over the world in their own languages. The paper discusses the experiences of designing and teaching an introductory course in Natural Language Processing to graduate computer science students at Addis Ababa University, Ethiopia, in order to initiate the education of computational linguists in the Horn of Africa region.
This paper describes how a proposed project will research the expression of attitude, affect, and sentiment in text in order to automatically identify and extract such expressions. The project's starting points are a set of hypotheses: + There are syntactic and lexical markers in text such that attitudinal information can be harvested using them; + Players, or discourse referents, in text are one such crucial marker for modeling topicality in general and attitudinal information flow in particular; + Attitudes in texts are dependent on text type and domain; + Attitudinal information can be applied in the development of practical tools for information access, among other application areas; + An extended notion of relevance will afford us an empirical evaluation model for our theories and experiments.
Evaluation of multimedia and multilingual information access systems needs to be performed from a usage oriented perspective. This document outlines use cases from the three use case domains of the PROMISE project and gives some initial pointers to how their respective characteristics can be extrapolated to determine and guide evaluation activities, both with respect to benchmarking and to validation of the usage hypotheses. The use cases will be developed further during the course of the evaluation activities and workshops projected to occur in coming CLEF conferences.
This paper describes experiments to find attitudinal expressions in written English text. The experiments are based on an analysis of text with respect not only to the vocabulary of content terms present in it (which most other approaches use as a basis for analysis) but also to structural features of the text as represented by the presence of function words (in other approaches often removed by stop lists) and by the presence of constructional features (typically disregarded by most other analyses). In our analysis, following a constructional grammatical framework, structural features are treated similarly to vocabulary features. Our results give us reason to conclude - provisionally, until more empirical verification experiments can be performed - that: * Linguistic structural information does help in establishing whether a sentence is opinionated or not; whereas * Linguistic information of this specific type does not help in distinguishing sentences of differing polarity.
This paper describes experiments to use non-terminological information to find attitudinal expressions in written English text. The experiments are based on an analysis of text with respect not only to the vocabulary of content terms present in it (which most other approaches use as a basis for analysis) but also to the presence of structural features of the text represented by constructional features (typically disregarded by most other analyses). In our analysis, following a construction grammar framework, structural features are treated as occurrences, similarly to the treatment of vocabulary features. The constructional features in play are chosen to potentially signify opinion but are not specific to negative or positive expressions. The framework is used to classify clauses, headlines, and sentences from three different shared collections of attitudinal data. We find that constructional features transfer well across different text collections and that the information couched in them integrates easily with a vocabulary-based approach, yielding improvements in classification without complicating the application end of the processing framework.
Whereas many applications of natural language processing for molecular biology focus on protein name tagging for the purpose of high-level information extraction from large corpora of scientific text, such as automatic identification of protein-protein interactions, high-quality protein name tagging has a value in itself. The aim of this study was to design, implement, and evaluate a high-accuracy protein name tagger, and to give proof-of-concept for some of the most basic applications of protein name tagging in an information retrieval setting, namely browsing support, active database cross linking, and enhanced query functionality. A combination of heuristics, dictionary look-up, syntactic analysis, and a local dynamic dictionary was used to create the protein name tagger. This tagger outperforms a previously published similar system when benchmarked on a corpus of manually annotated Medline abstracts. In addition to evaluating the tagging performance, the implemented algorithm was used to add mark-up to a corpus of approximately 10,000 Medline abstracts, which were indexed in a state-of-the-art information retrieval system. Indexing highlights many basic benefits of adding named entity mark-up such as protein names. One obvious benefit is that the search process is enhanced by the addition of a search field. Furthermore, the mark-up can be used for providing active hyperlinks between protein entities in presented documents and protein sequence databases, such as SwissProt, when both databases are indexed in the same information retrieval system. Efficient links can also be constructed in the opposite direction, providing high-precision retrieval of documents relevant for protein entries. Fast and accurate cross linking can be obtained by using an efficient implementation of the field-based approximate cosine measure, which is a simple standard information retrieval technique for document similarity searching.
This poster presents methods, results, implementation details, and features of a prototype system.
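The cosine measure mentioned in the preceding abstract can be illustrated with a minimal sketch. This computes the plain (exact) cosine similarity between term-frequency vectors built from one text field; it is not the approximate, field-based implementation used in the described system, whose details the abstract does not give. The helper names `term_freq` and `cosine` and the example strings are hypothetical.

```python
import math
from collections import Counter


def term_freq(field_text: str) -> Counter:
    """Build a term-frequency vector from whitespace-separated, lowercased tokens."""
    return Counter(field_text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Exact cosine similarity between two sparse term-frequency vectors."""
    # Only terms shared by both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty field is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical contents of a "protein name" field in an abstract and a database entry.
abstract_field = term_freq("p53 mdm2 p53")
entry_field = term_freq("p53 tumor suppressor")
similarity = cosine(abstract_field, entry_field)
```

Restricting the vectors to a single marked-up field, as assumed here, is what lets tagged protein names drive the similarity score rather than the general vocabulary of the abstract.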
This paper introduces four different notions of correctness to be used when measuring the performance of protein name taggers, each of which reflects certain characteristics of the tagger under evaluation. The discussion of the different notions is centered on the evaluation of two protein name taggers: Yapex, developed by the authors, and KeX, developed by Fukuda et al. (1998). To illustrate the differences between the ways of evaluating, both taggers are applied to a corpus of 101 MEDLINE abstracts in which all occurrences of protein names have been marked up by domain experts.
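How the choice of correctness notion changes precision and recall can be sketched as follows. The abstract above does not define its four notions, so the two matching criteria used here (exact boundary match and any-overlap match) are assumptions chosen purely for illustration, as are the function names and the example spans.

```python
from typing import Callable, Set, Tuple

Span = Tuple[int, int]  # half-open character or token span [start, end)


def exact_match(gold: Span, pred: Span) -> bool:
    """Assumed strict criterion: both boundaries must agree."""
    return gold == pred


def overlap_match(gold: Span, pred: Span) -> bool:
    """Assumed lenient criterion: the spans share at least one position."""
    return gold[0] < pred[1] and pred[0] < gold[1]


def precision_recall(
    gold: Set[Span], pred: Set[Span], match: Callable[[Span, Span], bool]
) -> Tuple[float, float]:
    """Precision and recall of predicted spans under a given matching criterion."""
    matched_pred = sum(1 for p in pred if any(match(g, p) for g in gold))
    matched_gold = sum(1 for g in gold if any(match(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: one prediction is exact, one has a wrong right
# boundary, and one is spurious.
gold_spans = {(0, 2), (5, 7)}
pred_spans = {(0, 2), (5, 6), (9, 10)}
strict = precision_recall(gold_spans, pred_spans, exact_match)
lenient = precision_recall(gold_spans, pred_spans, overlap_match)
```

Under these assumptions, a tagger that finds names but misplaces their boundaries scores noticeably better with the lenient criterion than with the strict one, which is the kind of difference a multi-notion evaluation is meant to expose.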
This paper reports experiments for the CoNLL 2010 shared task on learning to detect hedges and their scope in natural language text. We have addressed the experimental tasks as supervised linear maximum margin prediction problems. For sentence-level hedge detection in the biological domain we use an L1-regularised binary support vector machine, while for sentence-level weasel detection in the Wikipedia domain we use an L2-regularised approach. We model the in-sentence uncertainty cue and scope detection task as an L2-regularised approximate maximum margin sequence labelling problem, using the BIO-encoding. In addition to surface-level features, we use a variety of linguistic features based on a functional dependency analysis. A greedy forward selection strategy is used in exploring the large set of potential features. Our official Task 1 results are an F1-score of 85.2 for the biological domain and 55.4 for the Wikipedia set. For Task 2, our official results are 2.1 for the entire task with a score of 62.5 for cue detection. After fixing the remaining errors and bugs, our final results are: Task 1, biological: 86.0, Wikipedia: 58.2; Task 2, scopes: 39.6, cues: 78.5.
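The BIO-encoding named in the abstract above can be sketched as below: each token is tagged as Beginning a labelled span, Inside one, or Outside all spans, which turns span detection into per-token sequence labelling. The function name `bio_encode`, the label `CUE`, and the example sentence and span are hypothetical and not taken from the shared task data.

```python
from typing import List, Tuple


def bio_encode(
    n_tokens: int, spans: List[Tuple[int, int]], label: str = "CUE"
) -> List[str]:
    """Encode non-overlapping half-open token spans [start, end) as BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError(f"span {(start, end)} is outside the sentence")
        tags[start] = f"B-{label}"  # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the span
    return tags


# Hypothetical sentence with an invented two-token cue at positions 2..3.
tokens = ["These", "results", "might", "suggest", "a", "role", "."]
tags = bio_encode(len(tokens), [(2, 4)])
tagged = list(zip(tokens, tags))
```

A sequence labeller then predicts one such tag per token, and contiguous B/I runs are read back as cue or scope spans.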