The technology of information access has gone through a slow but steady process of adapting to the growing availability of electronically stored information. When libraries were small, access to a piece of information could be achieved by asking the librarian, a "wise sage" who was supposed to have read every book in the library. The librarian could tell you which book contained the information you needed and where the book was located.
Automatic construction of hypertexts for self-referencing: the Hyper-TextBook project
We present the results of the Hyper-TextBook project. The aim of the project was to design, develop and test a methodology and a tool for the fully automatic authoring of hypertexts from full-text documents. The target documents were textbooks because of their specific characteristics and usage, and the project aimed at automatically creating hypertextual versions of textbooks, i.e. hyper-textbooks. In this first phase of the project hyper-textbooks have been designed and implemented to be used mostly as self-reference sources.
Abstract. This contribution proposes a framework to generate auxiliary rich TV content metadata by processing social network data. Based on simple criteria to identify authoritative social media sources, we have analysed Twitter short messages relative to TV program content and devised a method to compute their informative value. We have extracted dozens of features and characterized such social data in terms of quality and relevancy.
Abstract. In classical Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem is known as “term mismatch”. A similar problem can be found in spoken document retrieval and spoken query processing, where terms misrecognized by the speech recognition process can hinder the retrieval of potentially relevant documents. I will call this problem “term misrecognition”, by analogy to the term mismatch problem.
Abstract. In large-scale distributed retrieval, challenges of latency, heterogeneity, and dynamicity emphasise the importance of infrastructural support in reducing the development costs of state-of-the-art solutions. We present a service-based infrastructure for distributed retrieval which blends middleware facilities and a design framework to 'lift' the resource sharing approach and the computational services of a European Grid platform into the domain of e-Science applications.
The ability to infer the characteristics of offenders from their criminal behaviour ('offender profiling') has only been partially successful since it has relied on subjective judgments based on limited data. Words and structured data used in crime descriptions recorded by the police relate to behavioural features. Thus, Language Modelling was applied to an existing police archive to link behavioural features with significant characteristics of offenders. Both multinomial and multiple Bernoulli models were used.
Abstract. Context has long been considered very useful to help the user assess the actual relevance of a document. In web searching, for example, context can help assess the relevance of a web page by showing how the page is related to other pages in the same web site. Such information is very difficult to convey and visualize in a user-friendly way.
Abstract. User-generated short documents assume an important role in online communication due to the established utilization of social networks and real-time text messaging on the Internet. In this paper we compare the statistics of different online user-generated datasets and traditional TREC collections, investigating their similarities and differences. Our results support the applicability of traditional techniques also to user-generated short documents, albeit with proper preprocessing.
Abstract. We investigate the utility of topic models for the task of personalizing search results based on information present in a large query log. We define generative models that take both the user and the clicked document into account when estimating the probability of query terms. These models can then be used to rank documents by their likelihood given a particular query and user pair.
Abstract. Information Retrieval (IR) aims at modelling, designing and implementing systems able to provide fast and effective content-based access to a large amount of information. Information can be of any kind: textual, visual, or auditory. The aim of such systems is to estimate the relevance of documents to a user information need. This is a very hard and complex task for many different reasons that a large volume of research has attempted to explain and tackle.
Abstract. Blog post opinion retrieval aims at finding blog posts that are relevant and opinionated about a user's query. In this paper we propose a simple probabilistic model for assigning relevant opinion scores to documents. The key problem is how to capture opinion expressions in the document that are related to the query topic. Current solutions enrich general opinion lexicons by finding query-specific opinion lexicons using pseudo-relevance feedback on external corpora or the collection itself.
Abstract. This paper presents the results of some experiments investigating the use of Neural Networks in the learning engine of a Connectionist Information Retrieval system called CIRS. CIRS uses the learning and generalisation capabilities of the Back Propagation learning algorithm to acquire and use application domain knowledge in the form of a sub-symbolic knowledge representation. This paper describes the architecture of CIRS and reports on experiments on three different learning strategies.
MIND is an EU-funded project that addresses some of the issues that arise when people have routine access to a large number (possibly thousands) of heterogeneous and distributed multimedia Digital Libraries (DLs) over the Internet and the Web. When so many DLs are available, the first information access task is resource selection. This is predominantly an ineffective manual task as users are unaware of the contents of each individual library in terms of quantity, quality, information type, provenance and likely relevance.
Abstract. Collaborative filtering systems based on ratings make it easier for users to find content of interest on the Web and as such they constitute an area of much research. In this paper we first present a Bayesian latent variable model for rating prediction that models ratings over each user's latent interests and also each item's latent topics.
Information retrieval is becoming increasingly concerned with resource selection and data fusion for distributed archives. In distributed information retrieval, a user submits a query to a broker, which determines a solution for how to yield a given number of documents from all available resources.
News professionals, such as Radio, TV and Newsprint journalists and editors, now have at their disposal a large and varied collection of digital information resources. News Agencies such as ANSA, Reuters and AP can, for example, provide live feeds of breaking stories directly into a newsroom. Journalists can also search and browse a variety of online news archives, digital libraries and web repositories when researching and compiling a report.
1. INTRODUCTION Recently, user-generated data has been growing rapidly and becoming one of the most important sources of information on the web. This data contains a lot of information to be processed, such as opinions and experiences, which can be useful in many applications. Forums, mailing lists, on-line discussions, community question answering sites and social networks like Facebook are some of these data resources that have attracted researchers lately. The blogosphere (the collection of blogs on the web) is one of the main sources of information in this category.
Prior-art search is a critical step in the examination procedure of a patent application. This study explores automatic query generation from patent documents to facilitate the time-consuming and labor-intensive search for relevant patents. It is essential for this task to identify discriminative terms in different fields of a query patent, which enables us to distinguish relevant patents from non-relevant patents.
Abstract. User-generated content is one of the main sources of information on the Web nowadays. With the huge amount of this type of data being generated every day, having an efficient and effective retrieval system is essential. The goal of such a retrieval system is to enable users to search through this data and retrieve documents relevant to their information needs. Among the different retrieval tasks for user-generated content, retrieving and ranking streams is an important one that has various applications.
Papers by Fabio Crestani