Plagiarism detection is a challenge for linguistic models: most currently implemented models use simple occurrence statistics for linguistic items. In this paper we report two experiments related to plagiarism detection in which we use a model of distributional semantics and sentence stylistics to compare, sentence by sentence, the likelihood of a text being partly plagiarised. The results of the comparison are displayed for visual inspection by a plagiarism assessor.
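To make the comparison concrete, here is a minimal sketch, not the paper's actual model: each sentence is profiled by a mean word vector (distributional semantics) plus a couple of simple stylistic features, and sentences whose profile deviates strongly from the document centroid are flagged for the assessor. The word-vector source, the feature set, and the threshold are all illustrative assumptions.

```python
# Hypothetical sketch: flag sentences whose combined semantic/stylistic
# profile deviates from the rest of the document. Not the paper's model;
# word vectors, features, and the threshold are illustrative assumptions.
import numpy as np

def sentence_profile(sentence, word_vectors, dim=300):
    """Mean word vector plus two simple stylistic features."""
    words = sentence.lower().split()
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    semantic = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    stylistic = np.array([len(words),                                      # sentence length
                          np.mean([len(w) for w in words]) if words else 0.0])  # mean word length
    return semantic, stylistic

def flag_outliers(sentences, word_vectors, z_threshold=2.0):
    """Mark sentences whose profile is far from the document centroid."""
    profiles = [np.concatenate(sentence_profile(s, word_vectors))
                for s in sentences]
    X = np.vstack(profiles)
    centroid = X.mean(axis=0)
    dists = np.linalg.norm(X - centroid, axis=1)
    z = (dists - dists.mean()) / (dists.std() or 1.0)
    return [s for s, score in zip(sentences, z) if score > z_threshold]
```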
There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, and emerging robust NLP tools. These meaningful, semantic, annotations hold the promise to significantly enhance information access, by enhancing the depth of analysis of today’s systems. Currently, we have only started exploring the possibilities and only begin to understand how these valuable semantic cues can be put to fruitful use. Unleashing the potential of semantic annotations requires us to think outside the box, by combining the insights of natural language processing (NLP) to go beyond bags of words, the insights of databases (DB) to use structure efficiently even when aggregating over millions of records, the insights of information retrieval (IR) in effective goal-directed search and evaluation, and the insights of knowledge management (KM) to get to grips with the greater whole. This workshop aims to bring together researchers from these different disciplines to work together on one of the greatest challenges in the years to come. The desired result of the workshop will be to gain concrete insight into the potential of semantic annotations, and into concrete steps to take this research forward; to synchronize related research happening in NLP, DB, IR, and KM, in ways that combine the strengths of each discipline; and to have a lively, interactive workshop where every participant contributes actively and which inspires attendees to think freely and creatively, working towards a common goal.
A short summary of findings from the 2005 SIGIR workshop on stylistics in text.
Report from the Swedish Government's IT Policy Strategy Group.
After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use cases, and socio-economic and legal aspects. These were assessed through two central studies: firstly, a concerted vision of the functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a related discussion of the requirements they pose as technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as are emerging legal challenges.
Researchers with a complex information need typically slice and dice their problem into several queries and subqueries, and laboriously combine the answers post hoc to solve their tasks. Consider planning a social event on the last day of SIGIR, in the unknown city of Beijing, factoring in distances, timing, and preferences on budget, cuisine, and entertainment. A system supporting the entire search episode should “know” a lot, either from profiles or implicit information, or from explicit information in the query or from feedback. This may lead to the (interactive) construction of a complexly structured query, but sometimes the most obvious query for a complex need is dead simple: entertain me. Rather than returning ten blue links in response to a 2.4-word query, the desired system should support searchers throughout their whole task or search episode, by iteratively constructing a complex query or search strategy, by exploring the result space at every stage, and by combining the partial answers into a coherent whole. The workshop brought together a varied group of researchers covering both user- and system-centered approaches, who worked together on the problem and potential solutions. There was a strong feeling that we made substantial progress. First, there was general optimism about the wealth of contextual information that can be derived from context or natural interactions without the need for obtrusive explicit feedback. Second, the task of “contextual suggestions”, matching specific types of results against rich profiles, was identified as a manageable first step, and concrete plans for such a track were discussed in the aftermath of the workshop. Third, the identified dimensions of variation, such as the level of engagement, or user versus system initiative, give clear suggestions of the types of input a searcher is willing or able to give and the type of response expected from a system.
The first report on alternative evaluation methodology summarizes work done within the PROMISE environment, especially within Work Package 4, Evaluation Metrics and Methodologies. The report outlines efforts to develop and support alternative, automated evaluation methodologies, with a special focus on generating ground truth from existing data sources such as log files or annotations. Events like LogCLEF 2011, PatOlympics 2011, and the CHiC 2011 workshop are presented and reviewed with respect to their impact on the three main use case domains.
We describe WEST, a WEb browser for Small Terminals, that aims to solve some of the problems associated with accessing web pages on hand-held devices. Through a novel combination of text reduction and focus+context visualization, users can access web pages from a very limited display environment, since the system will provide an overview of the contents of a web page even when it is too large to be displayed in its entirety. To make maximum use of the limited resources available on a typical hand-held terminal, much of the most demanding work is done by a proxy server, allowing the terminal to concentrate on the task of providing responsive user interaction. The system makes use of some interaction concepts reminiscent of those defined in the Wireless Application Protocol (WAP), making it possible to utilize the techniques described here for WAP-compliant devices and services that may become available in the near future.
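As a rough illustration of the division of labour described above, the following toy sketch shows a proxy-side text reduction step: each block of a page is condensed to a few content words, so the terminal need only render the overview and expand blocks on demand. This is not WEST's actual algorithm; the stopword list and word limit are invented for illustration.

```python
# Toy illustration of proxy-side text reduction: the proxy condenses each
# block of a page to a short "keyword" form so a small terminal only has
# to render the overview and expand on demand. A sketch only; WEST's
# actual reduction and focus+context rendering are richer.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for", "is"}

def reduce_block(text, max_words=5):
    """Keep the first few content words of a text block as its overview."""
    words = re.findall(r"[A-Za-z']+", text)
    content = [w for w in words if w.lower() not in STOPWORDS]
    return " ".join(content[:max_words]) + ("…" if len(content) > max_words else "")

page_blocks = ["The Wireless Application Protocol defines services for handheld devices.",
               "A proxy server performs the most demanding processing steps."]
print([reduce_block(b) for b in page_blocks])
```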
Textbook on formal languages.
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective we inventory the impact and legal consequences of these technical advances and point out future directions of research.
Participative Research labOratory for Multimedia and Multilingual Information Systems Evaluation (PROMISE) is a Network of Excellence, starting in conjunction with this first independent CLEF 2010 conference, and designed to support and develop the evaluation of multilingual and multimedia information access systems, largely through the activities taking place in Cross-Language Evaluation Forum (CLEF) today, and taking it forward in important new ways. PROMISE is coordinated by the University of Padua, and comprises 10 partners: the Swedish Institute for Computer Science, the University of Amsterdam, Sapienza University of Rome, University of Applied Sciences of Western Switzerland, the Information Retrieval Facility, the Zurich University of Applied Sciences, the Humboldt University of Berlin, the Evaluation and Language Resources Distribution Agency, and the Centre for the Evaluation of Language Communication Technologies. The single most important step forward for multilingual and multimedia information access which PROMISE will work towards is to provide an open evaluation infrastructure in order to support automation and collaboration in the evaluation process.
We discuss the synergetic effects that can be obtained in an integrated multimodal interface framework comprising, on the one hand, a visual language-based modality and, on the other, natural language analysis and generation components. Besides a visual language with high expressive power, the framework includes a cross-modal translation mechanism which enables mutual illumination of interface language syntax and semantics. Special attention has been paid to how to address problems of robustness and pragmatics through unconventional methods which aim to enable user control of the discourse management process.
A scheme for integration of a natural language interface into a multimodal environment is presented, with emphasis on the synergetic results that can be achieved, which are argued to be: 1) complementary expressiveness; 2) mutual illumination of language syntax and semantics; 3) robust pragmatics and graceful recovery from failed natural language analyses through the reification of discourse objects to enable user control of discourse management.
This paper describes the starting points, from the standpoint of individual privacy and identity monitoring and from a technological perspective, for designing and building tools to help individual users track and monitor their presence on the web. Our design models and represents facets of identity by tracking their mentions in text. It is intended to provide a basis for discussion of how to redress the information imbalance users are subjected to today, owing to their lack of overview of their own traces.
This paper discusses the creation of re-usable log files for investigating interactive cross-language search behaviour. The study was run as part of iCLEF 2008-09, where the goal was to generate a record of user-system interactions during interactive cross-language image searches. The level of entry to iCLEF was made purposely low, with a default search interface and online game environment provided by the organisers. User-system interaction and input from users were recorded in log files for future investigation. This novel approach to running iCLEF resulted in logs containing more than 2 million lines of data.
Participation in evaluation campaigns for interactive information retrieval systems has met with variable success over the years. In this paper we discuss the large-scale interactive evaluation of multilingual information access systems, as part of the Cross-Language Evaluation Forum evaluation campaign. In particular, we describe the evaluation planned for 2008, which is based on interaction with content from Flickr, the popular online photo-sharing service. The proposed evaluation seeks to reduce entry costs, stimulate user evaluation and encourage greater participation in the interactive track of CLEF.
This paper presents a proposal for iCLEF 2006, the interactive track of the CLEF cross-language evaluation campaign. In the past, iCLEF has addressed applications such as information retrieval and question answering. However, for 2006 the focus has turned to text-based image retrieval from Flickr. We describe Flickr, the challenges this kind of collection presents to cross-language researchers, and suggest initial iCLEF tasks.
For empirically oriented textual research it is crucial to have materials available for extraction of statistics, training probabilistic algorithms, and testing hypotheses about language and language processing in general.

In recent years, the awareness that text is not just text, but that texts come in several forms, has spread from the more theoretical and literary subfields of linguistics to the more practically oriented information retrieval and natural language processing fields. As a consequence, several test collections available for research explicitly attempt to cover many or most well-established textual genres, or functional styles, in well-balanced proportions (Francis and Kucera, 1982; Källgren, 1990).

The creation of such a collection is a complex matter in several respects. Our research area is to build retrieval tools for the Internet, and thus, for our purposes, the choice of genres to include is one of the more central problems: there is no well-established genre palette for Internet materials. To find materials to experiment with, we need to create them in a form suitable for our purposes. This is a double-edged problem, involving both vaguely expressed user expectations and establishing categories using large numbers of features which, taken singly, have low predictive and explanatory power. This paper gives an outline of the methodology we use for determining which genres to include.
Users pose very short queries to information retrieval systems. This study shows that the apparent length of the query field has an effect on the length of the query users enter.
This report constitutes the proceedings of the workshop on Information Access in a Multilingual World: Transitioning from Research to Real-World Applications, held at SIGIR 2009 in Boston, July 23, 2009. Multilingual Information Access (MLIA) is at a turning point wherein substantial real-world applications are being introduced after fifteen years of research into cross-language information retrieval, question answering, statistical machine translation and named entity recognition. Previous workshops on this topic have focused on research and small-scale applications. The focus of this workshop was on technology transfer from research to applications and on what future research needs to be done which facilitates MLIA in an increasingly connected multilingual world.
Multilingual Information Access (MLIA) is at a turning point wherein substantial real-world applications are being introduced after fifteen years of research into cross-language information retrieval, question answering, statistical machine translation and named entity recognition. Previous workshops on this topic have focused on research and small-scale applications. The focus of this workshop was on technology transfer from research to applications and on what future research needs to be done which facilitates MLIA in an increasingly connected multilingual world.
This paper summarizes the task design for iCLEF 2006 (the CLEF interactive track). Compared to previous years, we have proposed a radically new task: searching images in a naturally multilingual database, Flickr, which has millions of photographs shared by people all over the planet, tagged and described in a wide variety of languages. Participants are expected to build a multilingual search front-end to Flickr (using Flickr’s search API) and study the behaviour of the users for a given set of searching tasks. The emphasis is put on studying the process, rather than evaluating its outcome.
As an initial effort to identify universal and language-specific factors that influence the behavior of distributional models, we have formulated a distributionally determined word similarity network model, implemented it for eleven different languages, and compared the resulting networks. In the model, vertices represent words, and two words are linked if they occur in similar contexts. The model is found to capture clear isomorphisms across languages in terms of syntactic and semantic classes, as well as functional categories of abstract discourse markers. Language-specific morphology is found to be a dominating factor for the accuracy of the model.
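A minimal sketch of the kind of network construction described, assuming only a tokenised corpus: words are linked when their co-occurrence context vectors are sufficiently similar. The window size and similarity threshold are illustrative assumptions, not the study's parameters.

```python
# Minimal sketch of a distributionally determined word similarity network:
# vertices are words, and two words are linked when their context vectors
# are sufficiently similar. Window size and threshold are illustrative.
from collections import Counter, defaultdict
from itertools import combinations
import math

def context_vectors(tokens, window=2):
    """Count co-occurrences within a symmetric window."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vectors[w][tokens[j]] += 1
    return vectors

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def similarity_network(tokens, threshold=0.3):
    """Return the set of edges between distributionally similar words."""
    vecs = context_vectors(tokens)
    return {(u, v) for u, v in combinations(vecs, 2)
            if cosine(vecs[u], vecs[v]) >= threshold}
```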
Report from the workshop
Purpose – This paper aims to investigate how readers assess the relevance of retrieved documents in a foreign language they know well, compared with their native language, and whether work-task scenario descriptions have an effect on the assessment process. Design/methodology/approach – Queries, test collections, and relevance assessments from the 2002 Interactive CLEF were used. Swedish first-language speakers, fluent in English, were given simulated information-seeking scenarios and presented with retrieval results in both languages. Twenty-eight subjects in four groups were asked to rate the retrieved text documents by relevance. A two-level work-task scenario description framework was developed and applied to facilitate the study of context effects on the assessment process. Findings – Relevance assessment takes longer in a foreign language than in the user's first language. The quality of the assessments, as compared with pre-assessed results, is inferior to that of assessments made in the users' first language. Work-task scenario descriptions had an effect on the assessment process, as shown both by measured access time and by subjects' self-reports. However, no effects on results were detectable through traditional relevance ranking. This may be an argument for extending the traditional IR experimental topical relevance measures to cater for context effects. Originality/value – An extended two-level work-task scenario description framework was developed and applied. Contextual aspects had an effect on the relevance assessment process. English texts took longer to assess than Swedish ones and were assessed less well, especially for the most difficult queries. The IR research field needs to close this gap and design information access systems with users' language competence in mind.
This technical report collects three years of experimentation in interactive cross-language information retrieval by SICS in the annual Cross-Language Evaluation Forum (CLEF) evaluation campaigns of 2003, 2004, and 2005. We varied the simulated task context and measured user performance in a document assessment task, finding that the choice of language and task context indeed have effects on the amount of effort users need to expend to achieve task completion.
The study presented involves several different contextual aspects and is the latest in a continuing series of exploratory experiments on information access behaviour in a multi-lingual context [1, 2]. This year’s interactive cross-lingual information access experiment was designed to measure three parameters we expected would affect the performance of users in cross-lingual tasks in languages in which the users are less than fluent. Firstly, introducing new technology, we measured the effect of topic-tailored term expansion on query formulation. Secondly, introducing a new component in the interactive interface, we investigated, without measuring against a control group, the effect of a bookmark panel on user confidence in the reported result. Thirdly, we ran subjects pair-wise and allowed them to communicate verbally, to investigate how people may cooperate and collaborate with a partner during a search session while performing a similar but non-identical search task.
This paper reports on the user-centered design methodology and techniques used for the elicitation of user requirements, and on how these requirements informed the first phase of the user interface design for a Cross-Language Information Retrieval System. We describe a set of factors involved in the analysis of the data collected and, finally, discuss the implications for user interface design based on the findings.
This paper proposes a novel method for automatically acquiring multi-lingual lexica from non-parallel data and reports some initial experiments to prove the viability of the approach. Using established techniques for building mono-lingual vector spaces, two independent semantic vector spaces are built from textual data. These vector spaces are related to each other using a small reference word list of manually chosen reference points taken from available bi-lingual dictionaries. Other words can then be related to these reference points, first in the one language and then in the other. In the present experiments, we apply the proposed method to comparable but non-parallel English-German data. The resulting bi-lingual lexicon is evaluated using an online English-German lexicon as gold standard. The results clearly demonstrate the viability of the proposed methodology.
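The reference-point idea can be sketched as follows, assuming two pre-built mono-lingual vector spaces (dicts of word vectors) and aligned lists of reference words in each language: every word is re-represented by its similarities to the reference points, which makes the two spaces directly comparable. The names and the nearest-neighbour lookup are illustrative, not the paper's exact procedure.

```python
# Sketch of the reference-point idea: each word is re-represented by its
# similarity to a small list of reference words with known translations,
# making the two monolingual spaces comparable. Vector spaces and the
# aligned reference lists are assumed to be given; names are illustrative.
import numpy as np

def to_reference_space(word_vec, reference_vecs):
    """Similarity of one word to each reference point (unit-normalised)."""
    sims = np.array([word_vec @ r / (np.linalg.norm(word_vec) * np.linalg.norm(r))
                     for r in reference_vecs])
    return sims / (np.linalg.norm(sims) or 1.0)

def translate(word, src_space, tgt_space, src_refs, tgt_refs):
    """Map a source word into reference space and pick the nearest target word."""
    q = to_reference_space(src_space[word], [src_space[r] for r in src_refs])
    best, best_sim = None, -1.0
    for tgt_word, tgt_vec in tgt_space.items():
        cand = to_reference_space(tgt_vec, [tgt_space[r] for r in tgt_refs])
        sim = float(q @ cand)
        if sim > best_sim:
            best, best_sim = tgt_word, sim
    return best
```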
Documents can be assigned keywords by frequency analysis of the terms found in the document text, which arguably is the primary source of knowledge about the document itself. By including a hierarchically organised domain-specific thesaurus as a second knowledge source, the quality of such keywords was improved considerably, as measured by match to previously manually assigned keywords. In the presented experiment, the combination of the evidence from frequency analysis and the hierarchically organised thesaurus was done using inductive logic programming.
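For illustration only, the following sketch combines the two knowledge sources with a simple additive boost: a candidate's frequency score is increased when its broader thesaurus terms also occur in the document. The actual experiment combined the evidence with inductive logic programming; this stand-in just conveys how the thesaurus contributes a second signal.

```python
# Illustrative combination of the two knowledge sources: term frequency
# scores, boosted when a candidate's broader thesaurus terms also occur
# in the document. A stand-in for the ILP combination in the experiment.
from collections import Counter

def keywords(tokens, thesaurus_parents, boost=0.5, top_n=5):
    """thesaurus_parents maps a term to its broader terms in the thesaurus."""
    freq = Counter(tokens)
    scores = {}
    for term, f in freq.items():
        score = f
        for parent in thesaurus_parents.get(term, []):
            if parent in freq:               # broader term also present
                score += boost * freq[parent]
        scores[term] = score
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
```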
We propose that analysing interviews, from a metaphorical point of view, with subjects who have been exposed to anthropomorphic characters can provide insights into how characters in the interface are perceived. In a study of the Agneta & Frida system (two characters that comment on the contents of web pages in an ironic, humorous manner) we found that subjects who used Agneta & Frida used more narrative verbs and adverbs than users who only browsed the web pages. In the latter case, more spatial verbs and adverbs were used. This may imply that normal web browsing is perceived as navigation through a space, while Agneta & Frida provides for a more narrative experience.
There is a need to make the interfaces of route guidance systems more flexible, so that they can adapt to specific driver needs. Today's systems are primarily aimed at tourists, and interfaces for drivers who have more experience of a city have not been investigated. In this paper we describe a study with very experienced driver-navigators, from which we have deduced principles for how route descriptions are constructed and expressed by humans. Some of these principles are implementable, and a rough outline of a program is presented. Given a plan for how to get from A to B in a city, the program produces a verbal description of that plan. The goal is to incorporate verbal descriptions in route guidance systems, primarily aimed at driver-navigators with some knowledge of the city.
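A toy rendering of the outlined program, under invented assumptions about the plan format: the plan is a list of (action, road, landmark) steps, and the description strings follow the sort of landmark-anchored phrasing the deduced principles suggest.

```python
# Toy sketch of the outlined program: given a plan as a list of road
# segments, produce a verbal description anchored on landmarks. The plan
# format and the phrasing rules are invented for illustration.
def describe(plan):
    """plan: list of (action, road, landmark) tuples; landmark may be None."""
    phrases = []
    for action, road, landmark in plan:
        if action == "follow":
            phrases.append(f"follow {road}" + (f" past {landmark}" if landmark else ""))
        elif action == "turn_left":
            phrases.append(f"turn left onto {road}" + (f" at {landmark}" if landmark else ""))
        elif action == "turn_right":
            phrases.append(f"turn right onto {road}" + (f" at {landmark}" if landmark else ""))
    return ", then ".join(phrases) + "."

print(describe([("follow", "Ringvägen", None),
                ("turn_left", "Götgatan", "the church"),
                ("follow", "Götgatan", "the square")]))
```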
We examine the need for plan inference in intelligent help mechanisms. We argue that previous approaches have drawbacks that need to be overcome to make plan inference useful. Firstly, plans have to be inferred - not extracted from the users’ help requests. Secondly, the plans inferred must be more than a single goal or solitary user command.
Utilising adaptive interface techniques in the design of systems introduces certain risks. An adaptive interface is not static, but will actively adapt to the perceived needs of the user. Unless carefully designed, these changes may lead to an unpredictable, obscure and uncontrollable interface. Therefore the design of adaptive interfaces must ensure that users can inspect the adaptivity mechanisms and control their results. One way to do this is to rely on the user's understanding of the application and the domain, and relate the adaptivity mechanisms to domain-specific concepts. We present an example, the adaptive hypertext help system POP, which is being built according to these principles, and discuss the design considerations and empirical findings that led to this design.
There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, and emerging robust NLP tools. These meaningful, semantic, annotations hold the promise to significantly enhance information access, by enhancing the depth of analysis of today’s systems. Currently, we have only started exploring the possibilities and only begin to understand how these valuable semantic cues can be put to fruitful use. The workshop had an interactive format consisting of keynotes, boasters and posters, breakout groups and reports, and a final discussion, which was prolonged into the evening. There was a strong feeling that we made substantial progress. Specifically, each of the breakout groups contributed to our understanding of the way forward. First, annotations and use cases come in many different shapes and forms depending on the domain at hand, but at a higher level there are commonalities in annotation tools, indexing methods, user interfaces, and general methodology. Second, there is a framework emerging to view annotation as (1) a linking procedure, connecting (2) an analysis of information objects with (3) a semantic model of some sort, expressing relations that contribute to (4) a task of interest to end users. Third, we should look at complex tasks that cannot be comprehensibly articulated in a few keywords, and embrace interaction both to incrementally refine the search request and to explore the results at various stages, guided by the semantic structure.
There is an increasing amount of structure on the Web as a result of modern Web languages, user tagging and annotation, and emerging robust NLP tools. These meaningful, semantic, annotations hold the promise to significantly enhance information access, by enhancing the depth of analysis of today’s systems. Currently, we have only started exploring the possibilities and only begin to understand how these valuable semantic cues can be put to fruitful use. Unleashing the potential of semantic annotations requires us to think outside the box, by combining the insights of natural language processing (NLP) to go beyond bags of words, the insights of databases (DB) to use structure efficiently even when aggregating over millions of records, the insights of information retrieval (IR) in effective goal-directed search and evaluation, and the insights of knowledge management (KM) to get to grips with the greater whole. The workshop aims to bring together researchers from these different disciplines to work together on one of the greatest challenges in the years to come. The desired result of the workshop will be concrete insight into the potential of semantic annotations, and concrete steps to take this research forward; to synchronize related research happening in NLP, DB, IR, and KM, in ways that combine the strengths of each discipline; and to have a lively, interactive workshop where everyone contributes and that inspires attendees to think “outside the box.”
We describe a style of computing that differs from traditional numeric and symbolic computing and is suited for modeling neural networks. We focus on one aspect of “neurocomputing,” namely, computing with large random patterns, or high-dimensional random vectors, and ask what kind of computing they perform and whether they can help us understand how the brain processes information and how the mind works. Rapidly developing hardware technology will soon be able to produce the massive circuits that this style of computing requires. This chapter develops a theory on which the computing could be based.
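A small demonstration of what computing with high-dimensional random vectors looks like in practice, assuming the common bipolar (+1/-1) representation: random vectors are nearly orthogonal, binding by elementwise multiplication is invertible, and bundling by addition keeps the result similar to each of its parts. The dimensionality and operations here are one standard instantiation, not necessarily the chapter's exact formalism.

```python
# Demonstration of computing with high-dimensional random vectors:
# random bipolar vectors are nearly orthogonal, binding (elementwise
# multiplication) is invertible, and bundling (summation) keeps the
# result similar to its parts. Dimension chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 10_000

def rand_vec():
    return rng.choice([-1, 1], size=d)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

role, filler, other = rand_vec(), rand_vec(), rand_vec()
bound = role * filler                       # binding: dissimilar to both parts
print(round(cos(bound, filler), 3))         # ~0.0, nearly orthogonal
print(round(cos(bound * role, filler), 3))  # unbinding recovers filler: 1.0
bundle = bound + other                      # bundling: similar to both parts
print(round(cos(bundle, other), 3))         # ~0.7
```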
Dilemma is a lexicography component in a tool kit for translation. At the request of a text writer, Dilemma presents relevant lexical information extracted by statistical processing from previously translated parallel texts. Dilemma is currently used in the ongoing translation of EC legislation into the languages of candidate member countries.
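The statistical core of extracting lexical correspondences from previously translated parallel texts can be sketched as follows, assuming sentence-aligned text pairs: word pairs are scored by how often they co-occur in aligned sentences, here with a Dice coefficient. This is an illustration of the general technique, not Dilemma's actual pipeline.

```python
# Sketch of extracting lexical correspondences from sentence-aligned
# parallel text: score source/target word pairs by the Dice coefficient
# over aligned sentence pairs. Alignment and preprocessing are assumed.
from collections import Counter
from itertools import product

def dice_lexicon(aligned_pairs, min_count=2):
    """aligned_pairs: iterable of (source_sentence, target_sentence) strings."""
    src_n, tgt_n, pair_n = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        src_words, tgt_words = set(src_sent.split()), set(tgt_sent.split())
        src_n.update(src_words)
        tgt_n.update(tgt_words)
        pair_n.update(product(src_words, tgt_words))
    return {(s, t): 2 * c / (src_n[s] + tgt_n[t])
            for (s, t), c in pair_n.items() if c >= min_count}
```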
Texts exhibit considerable stylistic variation. This paper reports an experiment where a large corpus of documents is analyzed using various simple stylistic metrics. A subset of the corpus has been previously assessed to be relevant for answering given information retrieval queries. The experiment shows that this subset differs significantly from the rest of the corpus in terms of the stylistic metrics studied.
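A sketch of how such an experiment can be set up, with illustrative metric choices (mean sentence length, mean word length, type-token ratio) and a rank-sum test for the significance of the difference between the relevant subset and the rest of the corpus; the paper's own metrics and test may differ.

```python
# Sketch of the experimental comparison: compute simple stylistic metrics
# per document and test whether the relevant subset differs from the rest.
# The metrics and the rank-sum test are illustrative choices.
import numpy as np
from scipy.stats import mannwhitneyu

def metrics(text):
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    return (len(words) / max(len(sentences), 1),       # mean sentence length
            np.mean([len(w) for w in words]),          # mean word length
            len(set(words)) / max(len(words), 1))      # type-token ratio

def compare(relevant_docs, other_docs):
    rel = np.array([metrics(d) for d in relevant_docs])
    oth = np.array([metrics(d) for d in other_docs])
    for i, name in enumerate(["sentence length", "word length", "type-token ratio"]):
        stat, p = mannwhitneyu(rel[:, i], oth[:, i])
        print(f"{name}: p = {p:.4f}")
```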
For the purposes of two recent student projects hosted at SICS, we defined a target notion based on trust in lieu of topical relevance. Given controversial search task topics that interested them, subjects performed the experiments with enthusiasm and reported that the experiment had influenced their state of mind. This forms an implicit test of trust in the retrieved material. While the respondents reported a medium to low-medium range of trust in the materials, and did not believe they had found all pertinent facets of opinion pertaining to the topic, they still adjusted their opinions on the matter to some extent and reported having learned about the topic.
Minutes of Rocquencourt Workshop – INRIA March 13, 2007
The 7th CHORUS workshop on “Affect, Appeal, and Sentiment as Factors Influencing Interaction with Multimedia Information” was held on May 28, 2009, Brussels, immediately following the Third CHORUS Conference, hosted by the European Commission at their Avenue Beaulieu premises. Participation was limited to invited speakers, and comprised sixteen researchers from fourteen research institutes in eight countries.
The Second CHORUS Conference and third Yahoo! Research Workshop on the Future of Web Search was held during April 4-5, 2008, in Grandvalira, Andorra, to discuss future directions in multi-medial information access and other specialised topics in the near future of retrieval. Attendance was at capacity, with 97 participants from 11 countries and 3 continents.
Compounds, especially in languages where compounds are formed by concatenation, without intervening whitespace between elements, pose challenges to simple text retrieval algorithms. Search queries that include compounds may not retrieve texts where elements of those compounds occur in uncompounded form; search queries that lack compounds will not retrieve texts where the salient elements are buried inside compounds. This study explores the distributional characteristics of compounds and their constituent elements using Swedish, a compounding language, as a test case. The compounds studied are taken from experimental search topics given for CLEF, the Cross-Language Evaluation Forum, and their distributions are related to relevance assessments made on the collection under study and evaluated in terms of divergence from the expected random distribution over documents. The observations made have direct ramifications for, e.g., query analysis and term weighting approaches in information retrieval system design.
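One direct ramification for query analysis is compound splitting. The sketch below shows a greedy splitter, assuming a known lexicon of simplex and compound entries and a simplified treatment of the Swedish linking "s"; it decomposes a query term so that both the compound and its elements can be matched at retrieval time.

```python
# Illustrative greedy compound splitter for a concatenating language such
# as Swedish: decompose a query term into known lexicon entries so both
# the compound and its elements can be matched. The lexicon and the
# handling of the linking "s" are simplifications.
def split_compound(word, lexicon, min_len=3):
    """Return a list of constituent elements, or [word] if none found."""
    if word in lexicon:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        if head in lexicon:
            tail = split_compound(rest, lexicon, min_len)
            if all(t in lexicon for t in tail):
                return [head] + tail
            if rest[0] == "s":               # try dropping a linking morpheme
                tail = split_compound(rest[1:], lexicon, min_len)
                if all(t in lexicon for t in tail):
                    return [head] + tail
    return [word]

lexicon = {"flyg", "plats", "flygplats", "buss"}
print(split_compound("flygplatsbuss", lexicon))   # ['flyg', 'plats', 'buss']
```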