Non-Topical Factors in Information Access

Jussi Karlgren
Human Machine Interaction and Language Engineering, SICS, Stockholm, Sweden
Department of General Linguistics, Helsinki University, Helsinki, Finland
jussi@sics.se

Abstract: Research in information retrieval has traditionally concentrated on making assumptions about the content of documents based on very shallow semantic analysis through word occurrence statistics of various kinds. But texts are more than bags of words, and the semantic analysis information retrieval systems typically use is overly simple. There is ample reason to try to broaden the view of what text is and why. Better content analysis alone will not be enough. Texts are more than their meaning. Texts have structure, they have context, they are written in a style conformant or discordant to a genre they are to be understood in, they may be carefully written or hastily thrown together, they are written by various types of agent for various reasons. Besides information to be found in the text or from the author, texts are used by readers of various backgrounds, for various reasons, and with varying degrees of satisfaction. This paper outlines a framework within which to find more knowledge from texts than an approximation of their topic, and gives examples of how to use this knowledge to design useful tools for information access.

What is in a Document?
======================

Arguably, the most important characteristic of a document is what it is about. The topic of a document is the most important criterion for selecting it from a collection. How to model the topic of documents in a collection effectively is the most obvious goal for most research in information retrieval. But researchers in the field are keenly aware of the fact that the interaction of information seeking users and the tools to access information sources is important in itself.
Information can be sought for various reasons and with various ideas of how to determine what documents or other information bearing units are relevant. Indeed, no interactive information access system can disregard the requirements posed by users -- and this is most often understood as allowing users to request documents from the collection with a minimum of complexity, by modeling document and query topic concisely and succinctly, and having systems respond rapidly with their inner workings transparent and understandable.

More Than Topic
===============

Given that documents, besides topic, have internal structure, textual style, and usage context - not independently of topic, but highly dependent on it - the question is whether this sort of fairly vague and informal variation can be used for information access purposes. Most systems today put considerable effort into neutralizing and hiding this sort of spurious variation from the user. But this information could be retained, just as easily. Modeling document variation along any number of vague dimensions can be done through none too complex analysis. This is a first step towards new knowledge sources: less focussed than the obvious ones but knowledge sources just the same, and important situational factors in manual text categorization.

Text Usage: Collaborative Search
================================

One of the most obvious characteristics of a document, besides its content, is its usage: documents have *context*, or *ecology*: texts are actually used by people for whom they are important [Walker, 1981]. The knowledge that documents are used differently and for different purposes can be used for better design of interfaces, to support various different access strategies [e.g. Belkin, 1994] - or by explicitly categorizing documents by the company they keep, by the other documents retrieved together with them [e.g. Karlgren, 1990; Resnick *et al* 1994; Hill, Maes].
The latter type of information can be used directly in retrieval. By gathering information about document usage or by asking users to list documents they appreciate, a system will be able to *recommend* other similar documents - irrespective of topic.

This sort of systematic study of text usage rests on two observations. Firstly, users often have a fair idea of what they have read, and they can relate their query to their own readership history; the situation and the request in information retrieval situations can often be formulated as a form of "I read *A Good Book* - I want more of the same" posed to a librarian or a colleague, or to a number of them.

Secondly, an ordinarily unorganized bookcase may self-organize - somewhat unsystematically - based on users' behavior. Interesting documents may be found next to each other. They are interesting because someone placed or left them there, and they are placed there because they have some relation to the original document. In fact, in a library or a bookstore, people around an interesting bookcase tend to be interesting people. You tend to be able to get good reading tips from them. Similarly, a good librarian will remember that a certain book tends to be read by a certain set of people, and another book by the same set of people, and that there may be a similarity between the books, even though they may not be catalogued together - as of course, they often will not be. Anyone who has tried to organize a bookcase by topic knows how many cases of unexpected category conflict one encounters.

Text Style, Text Genre, and Text Structure
==========================================

To a text reader, an obvious important dimension of textual variation is that of *style*. Stylistic variation between texts of the same topic is often at least as noticeable as the topical variation between texts of different topic but same genre or variety; style is, broadly defined, the difference between two ways of saying the same thing.
Stylistic variation in a given text, given the liberal definition above, can occur in many ways and on many linguistic levels: lexical choice, choice of syntactic structures, choice of cohesion markers on a textual level, and so forth. Some choices are constrained by the intended audience and discourse ecology the text is produced in; some are left to be entirely determined by the author's preferences, personal idiosyncrasies, and other qualitative factors of writing and editing. The former form the basis for distinguishing *genres* or *functional styles* that can be found in texts: newspapers, legal text, fictional prose, poetry; the latter, variation based on *individual style* [e.g. Vachek, 1975].

Stylistic variation is not orthogonal to or independent of variation that relates directly to content and topic, but is mainly along other dimensions. Naturally, there is covariance. Texts about certain topics may only occur in certain genres, and texts in certain genres may only treat certain topics; most topics do, however, occur in several genres.

Stylistic variation is easy to detect using surface cues in the text. Douglas Biber, for instance, has investigated what general dimensions of variation can be found in texts, and found that texts, across languages, speech and text, and genres can all be understood to vary along a small number of dimensions [1988, 1989, 1995]. Further experiments with simple stylistic cues, such as can be computed using readily available linguistic analysis tools, show positive results. Even simple measures of terminological complexity - e.g. word length, relative frequency of long words, type/token ratios, relative frequency of various indicative lexical items - paired with measures of syntactic complexity - e.g. clause length, number of complex conjunctions - give enough purchase to allow the analysis and recognition of text genres through their individual characteristics with relative ease [Karlgren and Cutting, 1994].
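As a rough illustration of how cheaply such cues can be computed, the following Python sketch extracts a handful of the surface measures named above. The tokenizer, the sentence splitter, the seven-character threshold for long words, and the pronoun list are all assumptions made for this example; they are not the feature set of the cited experiments.

```python
import re

# Assumed list of singular first person pronouns (illustrative only).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def style_features(text, long_word_len=7):
    """Return a few surface cues for one text.

    Words are runs of letters and apostrophes; sentences are split on
    ., ! and ?. Both are crude approximations chosen for brevity.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    n = len(words)
    return {
        # terminological complexity
        "mean_word_length": sum(len(w) for w in words) / n,
        "long_word_ratio": sum(len(w) >= long_word_len for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        # a stand-in for syntactic complexity
        "mean_sentence_length": n / len(sentences),
        # one indicative lexical class
        "first_person_ratio": sum(w in FIRST_PERSON for w in words) / n,
    }


if __name__ == "__main__":
    sample = "I like my cat. The cat likes me."
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind could then be handed to any standard classifier; which classifier and which features work best for a given collection is an empirical question.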
The measures were based on approximations of syntactic complexity - a dimension which exhibits considerable variation between genres [Losee, 1996; Menshikov, 1962]. Indeed, most stylistic measures heretofore have been attempts to find shortcuts for measuring syntactic complexity along with lexical complexity as measured by word lengths and occurrence frequencies of various sorts.

Besides the microstructure of text and its lexical items, *text structure* in itself is an important cue for manual categorization of texts. The subtopic structure of texts can be identified without too much complex computational machinery [e.g. Hearst and Plaunt, 1993; Hearst, 1997; Salton and Allan, 1994], by measuring the appearance and gradual disappearance of content bearing words throughout the length of the text.

In summary, text has easily distinguishable *form* in addition to *content* and *context*.

How can we use knowledge of document variation?
===============================================

The types of variation mentioned above can be applied to the design and implementation of information access systems in several ways. Judgments or measures of documents based on new information could be added to today's relevance-based systems; alternatively the information can be used to categorize documents for presentation to the user.

Improving the Concept of Relevance
----------------------------------

Information retrieval systems typically present results as ranked lists of documents, sorted by system-determined estimates of likely *relevance* to a search query. Given that we know more of documents today, adding this knowledge to a typical bare-bones retrieval system is not a trivial design task. If we know of a document that it may be on an interesting topic, that it is a newspaper item, that its style of writing seems to be personal and subjective, and that its quality seems to be rather low - how will we be able to convert all this information into a single dimensional ranking?
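One simple way to avoid forcing everything into a single score is to leave the topical ranking untouched and use the non-topical evidence only as a pass/fail filter. The Python sketch below illustrates that option under stated assumptions: the `Hit` record, its `genre` label, and the `min_words` threshold are hypothetical names invented for the example, and the genre labels are assumed to come from some separate classifier.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    topical_score: float  # score from the underlying ranking system
    genre: str            # hypothetical label from a genre classifier
    n_words: int          # document length in words


def filter_then_rank(hits, allowed_genres, min_words=50):
    """Drop hits that fail the non-topical tests, then rank the rest.

    Non-topical evidence never changes the order of the surviving hits;
    it only decides which hits survive.
    """
    kept = [
        h for h in hits
        if h.genre in allowed_genres and h.n_words >= min_words
    ]
    return sorted(kept, key=lambda h: h.topical_score, reverse=True)


if __name__ == "__main__":
    hits = [
        Hit("a", 0.90, "error page", 12),
        Hit("b", 0.70, "newspaper", 400),
        Hit("c", 0.80, "newspaper", 30),
        Hit("d", 0.75, "report", 900),
    ]
    for h in filter_then_rank(hits, {"newspaper", "report"}):
        print(h.doc_id, h.topical_score)
```

The design choice here is deliberate: the filter sidesteps the question of how to weight genre against topic, at the cost of discarding documents outright rather than demoting them.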
In fact, experiments have shown that stylistic and structural analysis can be implemented as filters for relevance judgments [Karlgren, 1996; Strzalkowski *et al* 1996]. Depending on the collection at hand, certain types of document can often be discarded from retrieved sets - in a web context, very short documents containing the words "can not be found" usually are less useful than others. This form of information can be encoded in some way to be combined with probabilistic ranking retrieval systems.

Many or most knowledge based systems combine information from different sources by *weighting*, typically combining them in a linear combination of scores. But there is no reason to assume that variables engage in a relationship of a type that is suitable for linear combination. Some might be binary: "If singular first person pronouns are present, a text is not a legal text"; or the relationship may be more complex: "If there are numerous tense shifts and a relatively high incidence of personal pronouns the text may be an interview", which could contrast with "If there are plenty of pronouns in the text it may be fiction". Better ways of combining evidence, through decision trees or preferably a combination of decision trees, general pattern matching techniques, and algebraic techniques, are absolutely necessary to be able to make use of and understand linguistic data.

And in any case, ranked lists give users little help in understanding and utilizing document variation: a richer representation of retrieval results to match a richer understanding of documents on the part of the system is a more fruitful approach.

Presenting Vague Categories
---------------------------

If we want to present users with information on e.g.
document style or genre, we must (i) identify suitable categories or dimensions such as genres, (ii) choose criteria to cluster texts of the same category with usefully predictable results, and (iii) make use of these categories in an information access situation.

In the case of stylistic variation, useful genres must be based on differences known and recognized by readers. These differences mean different things to users, and may be difficult to recognize automatically. Naming them must be done judiciously to create and present a palette of genres both reasonably consistent with what users expect and conveniently computable using the measures of stylistic variation available to us [Dewe *et al.*, 1998].

(At this point we need to understand that *vagueness* is a desirable property in human language. If we want to present the user with categories based on notions of "quality" or "subjectivity" we are better off using abstract, suitably vague terms to describe the document sets, such as "Commercial text" and "Personal text", rather than trying to force decisions based on very concrete and well-defined terminology.)

As but one example of such presentations, our prototype system combines stylistic analysis with topical clustering to broaden the ranking of a probabilistic background system into a matrix presentation of topical cluster by genre [Bretan *et al.*, 1998].

Style, Relevance, and Quality
=============================

In interviews with experienced internet users we found that from the users' point of view, it is very likely that an information retrieval or filtering problem is framed as a problem of low *quality* of information, not of low *topicality*. Although an information stream such as a mailing list or a Netnews conference is an interesting medium with occasional nuggets of interesting information, most of the material is irrelevant, uninteresting, and, simply put, of low quality.
Quality is a many-faceted property, and cannot be addressed with simple metrics - which brings us back to the problem of combining evidence addressed above.

Open Research Questions
=======================

Besides the questions of general computational interest, and those related specifically to statistics and machine learning, there are important questions to address that relate specifically to information access.

Typical bare-bones term-based information retrieval systems have weak points. The information flow between user and system is poor, and text can yield more information than the word statistics utilized by systems today. The main hindrance to understanding text better is not faulty statistics or processing constraints, but faulty understanding of what text is and why.

A first step to understanding text better is to experiment with stylistic and structural analysis of texts. Such experiments show that it is possible to extract non-topical information from texts with comparatively little bother. While it is true you get what you pay for, even shoddy analysis is informative: texts can be categorized in genres, provided the genres in question are well chosen. And this type of information can be used and has been used for both better presentation of information retrieval results and for reranking the output of probabilistic systems.

This is but a start. We need a better understanding of the small and vague clues readers use to pass judgments on texts.

References
==========

Nicholas J. Belkin. (1994). "Design Principles for Electronic Textual Resources: Investigating Users and Uses of Scholarly Information", *Studies in the memory of Donald Walker*, Kluwer.

Douglas Biber. (1989). "A typology of English texts", *Linguistics*, 27:3-43.

Douglas Biber. (1988). *Variation across speech and writing*. Cambridge University Press.

Ivan Bretan, Johan Dewe, Anders Hallberg, Niklas Wolkert and Jussi Karlgren. (1998).
"Web-Specific Genre Visualization". *Proceedings of WebNet '98*, Orlando, Florida, November 1998.

Johan Dewe, Jussi Karlgren, and Ivan Bretan. (1998). "Assembling a Balanced Corpus from the Internet". *Proceedings of the 11th Nordic Conference on Computational Linguistics*, Copenhagen.

Marti Hearst. (1997). "TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages". *Computational Linguistics*.

Marti Hearst and Christian Plaunt. (1993). "Subtopic Structuring for Full-length Document Access". *Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval*, Pittsburgh. New York: ACM.

Jussi Karlgren. (1990). "An Algebra for Recommendations", *Syslab Working Paper* 179, Department of Computer and System Sciences, Stockholm University, Stockholm.

Jussi Karlgren. (1996). "Stylistic Variation in an Information Retrieval Experiment". In *Proceedings NeMLaP 2*, Bilkent, September 1996. Ankara: Bilkent University. (In the Computation and Language E-Print Archive: cmp-lg/9608003).

Jussi Karlgren. (1998). "Stylistic Experiments for Information Retrieval". In Tomek Strzalkowski (ed.), *Natural Language Information Retrieval*. Kluwer.

Jussi Karlgren and Douglass Cutting. (1994). "Recognizing Text Genres with Simple Metrics Using Discriminant Analysis". Paper presented at the 14th International Conference on Computational Linguistics (COLING-94), Kyoto, July 1994. (In the Computation and Language E-Print Archive: cmp-lg/9410008).

George R. Klare. (1963). *The Measurement of Readability*. Iowa University Press.

Robert M. Losee. (1996). "Text Windows and Phrases Differing by Discipline, Location in Document, and Syntactic Structure". *Information Processing and Management*. (In the Computation and Language E-Print Archive: cmp-lg/9602003).

Pattie Maes. (1994). "Agents that Reduce Work And Information Overload". *Communications of the ACM* 37:7.

I. I. Menshikov. (1974). "K voprosu o zhanrovo-stilevoy obuslovlennosti sintaksicheskoy struktury frazy".
("On genre-dependent stylistic variation of the syntactic structure in the clause"). In *Voprosy statisticheskoy stilistiki*. Golovin et al. (eds.) 1974. Kiev: Naukova dumka; Akademia Nauk Ukrainskoy SSR.

Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl. (1994). "GroupLens: An Open Architecture for Collaborative Filtering of Netnews", *Proceedings of CSCW 94*, Chapel Hill.

Gerard Salton and James Allan. (1994). "Automatic Text Decomposition and Structuring", *Procs. 4th RIAO - Intelligent Multimedia Information Retrieval Systems and Management*, New York.

Tomek Strzalkowski, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang Lin, Jose Perez-Carballo, Troy Straszheim, Jin Wang, Jon Wilding. (1997). "Natural Language Information Retrieval: TREC-6 Report". In *Proceedings of TREC-6*, Donna Harman (ed.).

Josef Vachek. (1975). "Some remarks on functional dialects of standard languages". In *Style and Text - Studies presented to Nils Erik Enkvist*. Hakan Ringbom (ed.). Stockholm: Skriptor and Turku: Abo Akademi.

Donald E. Walker. (1981). "The Organization and Use of Information: Contributions of Information Science, Computational Linguistics, and Artificial Intelligence", *Journal of the American Society for Information Science* 32(5):347-363.