6. Search and Retrieval
6.1 The Importance and Role of an Effective Search and Retrieval Mechanism

A collection of resources of any kind is only useful if it contains the resource(s) an enquirer requires, and the enquirer is able to locate those resource(s). The efficient location of resources depends upon an effective system for creating and using metadata about the resources. Metadata is data about data, or in the case of images, data about images. The issue of search and retrieval in the proposed web site (or any non-trivial site) is essentially a microcosm of the problem faced by users of, and contributors to, the World Wide Web in its entirety.

The role of the search-retrieval mechanism may be illustrated graphically.

[Fig 6.1: Searching a Collection]

The aim of the search-retrieval mechanism is to minimise the shaded areas in Fig 6.1, i.e. to maximise both the proportion of relevant items that are returned and the proportion of returned items that are relevant. These two measures were termed recall and precision by Cleverdon, whose work is described by Lesk [58].
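As an illustrative aside (the image identifiers below are hypothetical), precision and recall for a single query can be computed directly from the set of items a search returns and the set of items actually relevant:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision = |retrieved & relevant| / |retrieved|;
    recall = |retrieved & relevant| / |relevant|."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 images returned, 3 of them relevant,
# out of 6 relevant images in the whole collection.
retrieved = {"img01", "img02", "img03", "img04"}
relevant = {"img01", "img02", "img03", "img05", "img06", "img07"}
print(precision_recall(retrieved, relevant))  # (0.75, 0.5)
```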
It is likely that the items of interest will have varying degrees of relevance to the enquirer, which should be reflected by the search-retrieval mechanism. Relevance is normally indicated by the order in which results are shown, with the most relevant items appearing first.

Kobayashi and Takeda [56] describe a three-way trade-off "between speed of information retrieval, precision and recall". Of particular importance to web users is "the number of valuable sites... listed in the first page of retrieved results". Criticism of retrieval times was also found in my survey of image users (Appendix B). This seeming impatience is almost paradoxical given that the web provides desktop access to enormous quantities of valuable information, the acquisition of which would otherwise have demanded visits to numerous reference libraries. However, technology will inevitably deliver ever faster infrastructure and make domestic broadband access the norm.

6.2 The Indexing of Textual Items

6.3 Problems Associated with the Indexing of Non-textual Materials

The Introduction to the Library of Congress Thesaurus for Graphic Materials I [59] states that "By their very nature, most pictures are 'of' something", but "In addition, pictorial works are sometimes 'about' something; that is, there is an underlying intent or theme expressed in addition to the concrete elements depicted." For example, a photograph may be of a long line of cars outside a petrol station; the same picture may be about protests over fuel tax rises. The same introduction goes on to say "Subject cataloging must take into account both of these aspects if it is to satisfy as many search queries as possible."

6.4 The Vocabulary Mismatch Problem

6.5 Content Based Image Retrieval

One practical application of CBIR currently available is the filtering of pornography. British-based company First 4 Internet [37] launched version 1.8.1 of its Image Composition Analysis software in May 2001 "to protect PC users from pornographic images". AltaVista's Image Search feature [2] makes use of content based retrieval to find images that are "visually similar" to a selected one, where "Similarity is based on visual characteristics such as dominant colors, shapes and textures".

6.6 Vocabularies

6.6.1 Thesauri

The J. Paul Getty Trust has made a number of vocabulary databases available online: the Art and Architecture Thesaurus (AAT) [49], the Union List of Artist Names (ULAN) [52] and the Getty Thesaurus of Geographic Names (TGN) [50]. The Library of Congress Thesaurus for Graphic Materials I [60] is also available on the Web.

Differences in description are likely to occur between indexer and searcher, between different indexers, and even for the same indexer at different times. The latter two cases would result in an inconsistent catalogue. This problem underlines the need for a thesaurus to be consulted in both the cataloguing and searching processes. Millstead [66] describes how thesauri were originally designed to facilitate consistent indexing of documents, but predicted (in 1998) that they "may soon be used more at retrieval than at input". Millstead recommends that the thesaurus remain invisible to the user unless they choose to examine it, writing "we need to design our interfaces so that users need not interact directly with the thesaurus to any greater extent than they wish or need to".

Thesauri are consulted by search algorithms to ensure that the broadest range of relevant resources is returned even where the terminology used in the search criteria differs from that found in the resource description. Thesauri can also be used to suggest alternative search terms.
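A minimal sketch of this kind of thesaurus-based query expansion, assuming a hand-built mapping from each preferred term to its equivalent terms (the vocabulary below is invented for illustration):

```python
# Hypothetical thesaurus: preferred term -> equivalent or related terms.
THESAURUS = {
    "car": {"automobile", "motor car"},
    "petrol station": {"gas station", "filling station"},
}

def expand_query(terms: list) -> set:
    """Expand search terms with thesaurus entries so a search still
    matches records catalogued under different terminology."""
    expanded = set(terms)
    for term in terms:
        # Add the synonyms of a preferred term...
        expanded |= THESAURUS.get(term, set())
        # ...and map a synonym back to its preferred term.
        for preferred, synonyms in THESAURUS.items():
            if term in synonyms:
                expanded.add(preferred)
    return expanded

print(expand_query(["gas station"]))
# -> {'gas station', 'petrol station'}
```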
Thesauri are often specific to the collection they describe. However, the increasing use of the web both to publish and to search across a diverse range of resources has created the need for a mechanism to interrogate different sets of metadata effectively. Doerr [22] considers this problem in detail and calls for a common methodology for thesaurus construction.

6.7 Types of Metadata

One widely recognised standard for metadata categorisation is the Dublin Core, from the Dublin Core Metadata Initiative [92]. The Dublin Core consists of fifteen elements "likely to be useful across a broad range of vertical industries and disciplines of study": Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights.

The World Wide Web Consortium (W3C) has developed the Resource Description Framework (RDF) [57] in response to the problems of accurate information retrieval on the web. RDF is a means of making web resources machine-understandable and is used to describe resources and their properties. A resource is anything identifiable by a URI (Uniform Resource Identifier). Resources, properties and their values (also referred to as subject, predicate and object respectively) are represented by RDF statements. For example, in the sentence "John Smith is the creator of the resource http://Jsmith/Home", creator is a property of the resource and John Smith is its value. RDF statements can be represented by directed graphs (see Fig 6.4). Properties and their values may also be described by URIs.

RDF uses XML (see 5.5) as its language of implementation. The W3C Specification document provides examples of web pages described with RDF according to the Dublin Core elements.
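As a hedged illustration (using the open-source rdflib library, not anything from the proposed site itself), the John Smith statement above can be built as a subject-predicate-object triple, with the Dublin Core creator element as the predicate, and serialised in the XML syntax mentioned above:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC  # Dublin Core elements, e.g. dc:creator

g = Graph()
resource = URIRef("http://Jsmith/Home")               # subject: the resource
g.add((resource, DC.creator, Literal("John Smith")))  # predicate and object

# Serialise the statement as RDF/XML, the implementation syntax
# discussed above.
print(g.serialize(format="xml"))
```

Running this prints an RDF/XML document expressing the same statement as the directed-graph representation of Fig 6.4.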
Kobayashi and Takeda [56] describe how the usefulness of metadata on the web is often diluted by its manipulation to improve search engine result rankings. Such techniques include attaching popular (but irrelevant) keywords and "spamming, i.e., excessive, repeated use of keywords or 'hidden' text". One means of hiding text is to display it in the same colour as the background, rendering it invisible to the user but not to automatic indexing software.

6.8 Review and Refinement

6.9 The Human Factor

Two kinds of expertise may be identified: firstly the cataloguer's knowledge of the subject matter, and secondly the system designer's expertise in providing a framework for the efficient storage of metadata and the accurate location and retrieval of resources in response to requests (encompassing skills such as database design, normalisation and query formulation using, for example, SQL). It cannot be assumed that those most suited to cataloguing resources possess I.T. skills, nor that those most capable of designing the search mechanism infrastructure have detailed knowledge of the subject matter; close communication between these two skill sets is therefore essential during the design phase.

Ideally, once the structure of the search mechanism is determined, the process of cataloguing should, as far as possible, benefit from automation. The cataloguer is presented with a form to be completed for each item being added to the collection. In addition to serving as an aide-mémoire for the completeness of each record, the form may also ensure consistency by, for example, offering list boxes of fixed choices, although in this case it should be sufficiently flexible to allow the cataloguer to define new categories where those offered are inadequate. The underlying software would automatically update the metadata database, ensuring that integrity is maintained. The cataloguer would have no need to be aware of the actual database structure.

It seems likely that the problem of information overload (see 2.5.4) associated with the web will bring about more intelligent search tools and the wider availability of increasingly intelligent agents (see 2.9). For example, where the initial search criteria are vague or ambiguous, the software may respond with further questions (in natural English) in order to elicit a more precise description of requirements.
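A purely speculative sketch of such an interaction (the table of ambiguous terms and their senses is invented for the example):

```python
# Hypothetical table of ambiguous search terms and candidate senses.
AMBIGUOUS = {
    "jaguar": ["the animal", "the make of car"],
    "mercury": ["the planet", "the chemical element"],
}

def refine(query: str) -> str:
    """If the query term is ambiguous, ask a clarifying question in
    plain English; otherwise accept the query as given."""
    senses = AMBIGUOUS.get(query.lower())
    if senses:
        return f"By '{query}', do you mean {' or '.join(senses)}?"
    return f"Searching for '{query}'..."

print(refine("jaguar"))
# By 'jaguar', do you mean the animal or the make of car?
```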