Category Archives: search

information retrieval, desktop search, personalized search, text analysis, search engines

Alexa's Public Crawler Database

What a great idea from Alexa (Amazon.com): the Alexa Web Search Platform, which rents out computing and storage resources for analyzing large percentages of the entire Web. Opening this up to anyone with an analytics or business idea is certainly a Web 2.0 kind of thing: outsource your data collection and the hardware to analyze it.

Now why not a program for academic research access to the data stores?

WWW2006 Workshop – Logging Traces of Web Activity

I am one of the organizers for the WWW2006 Workshop – Logging Traces of Web Activity: The Mechanics of Data Collection at the WWW2006 Conference in Edinburgh, Scotland in May 2006.

We invite position papers for the WWW 2006 workshop “Logging Traces of Web Activity: The Mechanics of Data Collection”. Many WWW researchers require logs of user behaviour on the Web. Researchers study the interactions of web users, both with respect to general behaviour and in order to develop and evaluate new tools and techniques.

Traces of web activity are used for a wide variety of research and commercial purposes including user interface usability and evaluations of user behaviour and patterns on the web. Currently, there is a lack of available logging tools to assist researchers with data collection and it can be difficult to choose an appropriate technique. There are several tradeoffs associated with different methods of capturing log-based data. There are also challenges associated with processing, analyzing and utilizing the collected data.

This one day workshop will examine the trade-offs and challenges inherent to the different logging approaches and provide workshop attendees the opportunity to discuss both previous data collection experiences and upcoming challenges. The goal of this workshop is to establish a community of researchers and practitioners to contribute to a shared repository of logging knowledge and tools. The workshop will consist of a panel discussion, participant presentations, demonstrations of logging tools and prototypes, and a discussion of the next steps for the group. Participation is open to researchers, practitioners, and students in the field.

The deadline for workshop proposals is January 10, 2006. I hope to see you there.

New Book: Theories of Information Behavior

I have been remiss in not mentioning that a new book I wrote a chapter for, Theories of Information Behavior, is finally out.

From the blurb:

This unique book presents authoritative overviews of more than 70 conceptual frameworks for understanding how people seek, manage, share, and use information in different contexts. A practical and readable reference to both well-established and newly proposed theories of information behavior, the book includes contributions from 85 scholars from 10 countries. Each theory description covers origins, propositions, methodological implications, usage, links to related conceptual frameworks, and listings of authoritative primary and secondary references. The introductory chapters explain key concepts, theory, method connections, and the process of theory development.

Check out the Table of Contents (pdf file). (Mine is the last chapter in the book; funnily enough, the chapters are organized alphabetically by chapter title.)

Amazon.com link to Theories of Information Behavior. The American Society for Information Science & Technology member price is 20% off now.

SIGIR 2006 Call for Papers

The ACM Special Interest Group for Information Retrieval (SIGIR) has its SIGIR 2006 Draft Call for Papers out already. The conference will be in Seattle next August.

SIGIR is one of the best academic conferences for keeping up with what’s new and what’s possible in Web search and, increasingly, in desktop search and mobile device search. For 2006 I expect we will see more about vertical search and even blog search, as well as some new insights into user behavior for IR.

Call for Papers: WWW2006 Conference

New notice for participation at the 15th Annual World Wide Web conference in Edinburgh, Scotland (one of my favorite cities).

I will be a reviewer again this year in the Browsers and User Interfaces track, where there are usually a number of amazing systems and interfaces. Here’s some text describing the track:

The Browsers and User Interfaces track at WWW’2006 focuses on promoting novel research directions and providing a forum where researchers, theoreticians, and practitioners can introduce new approaches, paradigms, applications, share their knowledge and opinions about problems and solutions related to accessing and interacting with data, services, and other humans over the Web. We invite original papers describing both theoretical and experimental research including (but not limited to) the following topics:

  • Browsers and user experience on mobile devices
  • Browser interoperability
  • Novel client-side applications
  • Multimodal interfaces, including speech interaction
  • Information visualization on the Web
  • Multilingual Web content design
  • Novel browsing and navigation paradigms
  • Web interaction with the real world, including robotics and sensor networks
  • Adaptive Web displays and Web personalization
  • Ubiquitous web access, shared displays, and wearable computing
  • Web usability and user experience
  • Web accessibility
  • Web-based collaboration and collaborative Web use
  • Web-logs and online journalism

Hope to see you there.

Study of Yahoo and Google Indices

A fresh approach to analyzing which search engine has the more comprehensive index: A Comparison of the Size of the Yahoo and Google Indices. It would be interesting to see this study at another order of magnitude, perhaps with MSN included. What I like best is that the study authors released the code for the tests. I seem to be finding that more academics are providing code so that others can attempt to verify a study firsthand, build on it to make related comparisons, and, most importantly, peer review the code logic behind what the study claims.

The New New Portal

The ingenuity of various independent developers, in conjunction with simple scripting, open source databases, and XML data formats such as RSS, is making old-school (1994-1997) portals nearly obsolete. Take this great idea, which annotates a prototypical New York Times front page with links to related blog posts (and other feeds): The Annotated NY Times – About

Throw in Bloglines, with its easy-to-use, Web-based interface for any number of RSS feeds, and, very soon, a few personal tweaks with Greasemonkey. Add your own personal view of the blogosphere using Technorati tags or, even more personally oriented, Pluck with its client interface/information dashboard++, and you can kiss your portal application providers goodbye.

Oracle’s recent buyout of PeopleSoft may not be so smart in the long, long run, when every business unit, not to mention every employee, can crank out structured data feeds, tweak simple logic to act on others’ sources, and keep up to date with everything in the organization with just a few clicks on everyone’s favorite orange button.
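
To make the "simple logic acting on someone else's feed" point concrete, here is a minimal sketch in Python. It assumes an ordinary RSS 2.0 feed; the sample feed, its URLs, and the function name `matching_items` are all made up for illustration and are not taken from any of the tools mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Business Unit Feed</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed for illustration</description>
    <item>
      <title>Quarterly search log report posted</title>
      <link>http://example.com/reports/q1</link>
    </item>
    <item>
      <title>Cafeteria menu updated</title>
      <link>http://example.com/menu</link>
    </item>
  </channel>
</rss>"""


def matching_items(rss_text, keyword):
    """Return (title, link) pairs for items whose title contains keyword."""
    root = ET.fromstring(rss_text)
    matches = []
    # RSS 2.0 items are plain <item> elements nested under <channel>.
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # Case-insensitive substring match on the item title only.
        if keyword.lower() in title.lower():
            matches.append((title, link))
    return matches


if __name__ == "__main__":
    for title, link in matching_items(SAMPLE_RSS, "search"):
        print(title, link)
```

A dozen lines of filtering like this, pointed at a colleague's feed instead of the inline sample, is roughly the kind of tweak that used to require a portal vendor.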