Measuring the complexity of a collection of documents

Vishwa Vinay*, Ingemar J. Cox, Natasa Milic-Frayling, Ken Wood

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

7 Citations (Scopus)

Abstract

Some text collections are more difficult to search or more complex to organize into topics than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters. We compute this quantity for document collections represented as sets of term vectors. We consider applications of the Cox-Lewis statistic in three scenarios: comparing the clusterability of different text collections under the same representation, comparing different representations of the same text collection, and predicting query performance from the clusterability of the query's result set. Our experimental results show a correlation between the observed effectiveness and this statistic, thereby demonstrating the utility of such data analysis in text retrieval.
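
The abstract does not spell out which variation of the Cox-Lewis statistic the authors use, so the following is only a minimal sketch of the general nearest-neighbour idea behind such statistics, not the paper's method. It assumes Euclidean distance, sampling origins drawn uniformly from the bounding box of the data, and a plain average of the ratios; the names `nearest` and `clustering_tendency` are hypothetical. For each random origin it finds the closest data point, then that point's own closest neighbour, and compares the two distances. Larger average ratios suggest that the points sit in tight groups separated by empty space.

```python
import math
import random


def nearest(point, candidates):
    """Return (distance, index) of the candidate closest to `point`."""
    best = min(range(len(candidates)),
               key=lambda i: math.dist(point, candidates[i]))
    return math.dist(point, candidates[best]), best


def clustering_tendency(points, n_samples=200, seed=0):
    """Average ratio of origin-to-data distance over data-to-data distance.

    Illustrative assumption: origins are uniform in the data's bounding box.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[k] for p in points) for k in range(dims)]
    highs = [max(p[k] for p in points) for k in range(dims)]

    ratios = []
    for _ in range(n_samples):
        # Random sampling origin inside the bounding box of the data.
        origin = [rng.uniform(lows[k], highs[k]) for k in range(dims)]
        d_origin, i = nearest(origin, points)
        # Nearest neighbour of the selected data point, excluding itself.
        others = points[:i] + points[i + 1:]
        d_data, _ = nearest(points[i], others)
        if d_data > 0:  # skip exact duplicates to avoid division by zero
            ratios.append(d_origin / d_data)
    if not ratios:
        raise ValueError("no usable samples: all nearest neighbours coincide")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy 2-D data: one uniform cloud and one with two tight blobs.
    uniform = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(100)]
    clustered = [[rng.gauss(c, 0.1), rng.gauss(c, 0.1)]
                 for c in (0, 10) for _ in range(50)]
    print("uniform  :", clustering_tendency(uniform))
    print("clustered:", clustering_tendency(clustered))
```

Real term vectors are high-dimensional and sparse, so a practical version would likely need a different sampling scheme and distance measure than this toy two-dimensional setup.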

Original language: English
Title of host publication: Advances in Information Retrieval - 28th European Conference on IR Research, ECIR 2006, Proceedings
Publisher: Springer Verlag
Pages: 107-118
Number of pages: 12
ISBN (Print): 3540333479, 9783540333470
Publication status: Published - 2006
Externally published: Yes
Event: 28th European Conference on Information Retrieval Research, ECIR 2006 - London, United Kingdom
Duration: 10 Apr 2006 to 12 Apr 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3936 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th European Conference on Information Retrieval Research, ECIR 2006
Country/Territory: United Kingdom
City: London
Period: 10/04/06 to 12/04/06
