Mining Web Logs to Improve User Experience in Web Search
Doctoral thesis, 2008
The World Wide Web continues to grow in size and diversity, which makes it increasingly
hard for users to find valuable information: documents are heterogeneous in form and
content, little is known about their reliability and prestige, and there is
a great deal of redundancy.
Usually search engines look for documents that contain specific keywords or phrases
stated by the users as queries. There might be millions of pages containing those keywords
and they may be related to a variety of different topics.
Traditional retrieval strategies yield increasingly poor results because of a dramatic increase
in irrelevant material (ballast) among the results. Search engine users thus increasingly experience information overload.
With these difficulties in mind, there is a large ongoing research effort whose goal
is to deliver appropriate information to users; this is what is meant by improving users'
Web search experience.
The aim of this thesis is to design, test and analyze different approaches to the
problem of characterizing users' search behavior and improving the search process in the
context of Web search. Three main aspects are addressed when tackling this problem
through Web log mining: user recommendations to improve the search process, automatic
detection of user information needs and modeling of user information needs.
To improve Web search through user recommendations, a suite of simple and efficient
algorithms tailored to mixture models is presented. Tests are carried
out on a broad range of data generated according to a spectrum of subclasses of mixture
models, and on real data collected from a Hungarian news portal log and from the Chilean
TodoCl web search log; the resulting performance is shown to be of high quality. Other
application areas where mixture models are used also benefit from these results, for
example dating services, e-commerce, virtual collaborative communities, Internet Service
Providers and the analysis of gene expression data in bioinformatics.
The main contribution in user behavior characterization is a complete study
of all major learning approaches applied to automatically detecting user intent in Web search.
The three machine learning techniques analyzed for mining user intent are fully supervised,
semi-supervised and unsupervised learning. In this context the semi-supervised
approach shows significant improvements over the supervised approach, which was previously
considered the best one for mining user intent and interests.
This study is also of more general interest in exploring the true potential of all learning
techniques in large scale settings such as the Web, in terms of both their scalability and their accuracy.
In the user intent modeling context the contribution is a new categorization
of users' intentions from the point of view of facets, with the aim of improving
on previous classification schemes. Initially a set of queries was manually labeled with
the new faceted classification scheme to find relationships between the facets, to aid in the
manual labeling process and to understand users' intentions. The distribution of the queries
within the facets shows that the facets are relevant, since each produces a division of the
query space that allows a better understanding of user needs.
Web log mining