9. Search and Retrieval Mechanism

The database structure and search mechanism are described.

9.1 Database Design
A relational database structure was selected in preference to a flat-file design for the following reasons:

to take advantage of the performance benefits offered by indexing;
to avoid the repetition of storing the same data many times over;
to improve consistency (e.g. images taken in the same location would always display the same location caption).

The database structure is of the form:

Fig 9.1 Database Structure

Table	Fields
photo	ID, description, locID
phkw	phID, kwID
keyword	keyword, ID
location	fullLocation, ID
locPlace	locID, placeID
places	placeName, ID

Fig 9.2 Database Tables and Fields

N.B. fullLocation refers to a full location, e.g. Burnley, Lancashire, England, Britain, UK; placeName is a single term e.g. New York, Asia, Brighton.

The number of categories of metadata held for each image is currently considerably less than that recommended by standards such as the Dublin Core [92]. This is deliberately so and reflects the fact that many of the Dublin Core categories (e.g. publisher, language etc.) have no meaning or relevance to the image collection at the present time. The structure and implementation of the database mean that additional categories could be added relatively easily should the need arise; e.g. a creator field may become necessary if twinIsles accepts work from other photographers, a date field may be added in response to user feedback.

Keywords consist of single words. The facility to search for phrases (fixed groups of words) as commonly found on web search engines has not been included in the prototype. Instead, the ranking algorithm (described below) has been adopted as a means of providing best match first results without the need for phrase matching. However, phrase matching has been identified as a possible future enhancement to the image retrieval mechanism.

9.2 Collections
Of the initial batch of scanned images (approx 400) certain images appeared to fall naturally into groups. These groups, known as collections, were identified and placed together on fixed HTML pages resembling search result pages. Collections appear as unique items in the database.

Where all images in a collection share the same description e.g. the Sumo collection they do not appear individually in the database (see note below). However where the collection consists of a diverse range of images, e.g. Reflections, Shadows and Light, each image remains in the database in its own right. The implications of this policy are that a search for Sumo returns a single thumbnail linking to the Sumo collection, a search for Japan returns the Sumo collection along with other images, a search for reflections returns the Reflections Shadows and Light collection along with other images matching the search (some of which are also contained within the collection) and a search for windows returns some of the images from the Reflections Shadows and Light collection but not the collection itself (since the search criteria is not sufficiently strongly related to that collection's description).
Initially five collections have been identified and representative thumbnails placed on the search page. Further collections will be identified as the number of images grows and in response to monitoring of the search logs.

Note: in fact every image is catalogued individually in the database, however those images which should only appear as part of a collection are prevented from being returned in search results by not being referenced by other tables.

9.3 Database Creation and Population
The initial batch of scanned images and collections was catalogued using a spreadsheet. The catalogue contained each item's unique identifier (the item file name without its extension), a textual description (which would serve as a caption), the location (an optional field) and any additional keywords not appearing in the description but which could be used as search terms. The additional keywords were used for general concepts e.g. buildings, religion and people.

The description and additional keyword fields were exported to a text file and individual keywords were extracted using a purpose-written Java program. (Java was used as a procedural language and was chosen merely for its ready availability).

The resulting sequential list of keywords was imported back into the spreadsheet where it was edited manually to remove words such as "a" and "the". Each keyword was then given an identifier (an integer). During this process synonyms were identified, which were given identical identifiers. Thus, someone searching for marriage would receive results featuring the word wedding as a keyword. In keeping with the convention found in other thesauri plural forms were added to singular noun keywords. The keywords and identifiers were saved as a text file.

Another Java program was used to identify which keywords in the keywords file appeared within each photograph's description and additional keywords. A further text file was created consisting of a comma delimited list of photograph identifiers and keyword identifiers.

The location of each item was initially entered as a single place name at the most specific level, e.g. Brighton. The individual place names used were identified and a fully qualified location (full location) was created for each, e.g. the full location for Brighton is Brighton, Sussex, England, Britain, UK. Thus, Brighton images would be returned to a user searching for Sussex, England, Britain, UK, or indeed, Brighton. Each full location was given a unique identifier and a text file created holding the full locations and their corresponding identifiers.

Each unique place name used in the fully qualified locations was extracted with a Java program and given an identifier. Multi-word places such as Hong Kong, Covent Garden are indexed both on their full name and also on the (significant) individual components of that name. The place names and identifiers were stored in a text file. Another text file was created, also with a Java program, associating place name identifiers with location identifiers.

Finally a text file was created containing the details of each item, namely its identifier, textual description and identifier of its location (which may be null).

A database, consisting of six tables matching the structure described above, was created in Microsoft Access. The data was then imported into Access from the text files created from the above description.

Microsoft Access was selected for the following reasons:

It is ODBC-compliant. ODBC (Open DataBase Connectivity) is a Microsoft created standard to which many databases conform. Adherence to it increases flexibility allowing the database to be accessed with different software and/or allowing the original software to be used with a different database.
It conforms to the standard relational database model and utilises standard SQL (Structured Query Language) thus ensuring future portability of the system.
It is extremely common and thus well supported in terms of documentation and availability of assistance.
It was available in the development environment.

9.4 The Search
It was decided that the search would be implemented in ASP for the following reasons:

It is supported as standard on widely used Microsoft web servers, and is available for other popular servers.
As a Microsoft technology it integrates well with Access, the database used to store the image catalogue.
It can access any ODBC-compliant database, thus providing flexibility for future migration to an alternative database.
It utilises SQL to query the database.
The technology is free to users of Microsoft Windows operating systems, and hence was available in the development environment without additional expenditure.
It uses the relatively easily learned VBScript scripting language.

In keeping with many web-based search tools the user is offered a choice of quick and advanced search modes. The quick search is presented by default and consists of a single text box. The advanced search provides two text boxes; one for subject, the other for location.

In order to keep the number of images per page to a manageable number a maximum of 30 images is displayed per page. Where more images exist the user is offered a "more" link, which passes the page number and search string back to the same ASP page via the query string.

9.4.1 The Quick Search
An ASP implementation of the simple search was built. The user enters a search string in a text box. The string is parsed using VBScript to 1) remove apostrophes (to adhere to SQL syntax), 2) split it into individual (space delimited) words, 3) remove non-alphanumeric characters from the start and end of each word. An SQL query of the following form is constructed:

SELECT photo.ID, description, fullLocation
FROM photo, phkw, keyword, location
WHERE (keyword='term1' OR keyword='term2' OR…) AND
photo.ID=phkw.phID AND phkw.kwID=keyword.ID AND (photo.locID=location.ID OR (photo.locID IS NULL AND fullLocation IS NULL)) UNION ALL
SELECT photo.ID, description, fullLocation
FROM photo, location, locplace, places
WHERE (placename='term1' OR placename='term2' OR…) AND photo.locID=location.ID AND location.ID=locPlace.locID AND locPlace.placeID=places.ID
ORDER BY photo.ID

As only a single search string is used the terms contained must be sought in both keywords and place names. The queries for keyword and place name are joined with UNION ALL and the final query terminated with ORDER BY photo.ID. Since UNION ALL returns duplicate rows the results provide a rudimentary ranking of closeness of match, i.e. the more search term matchings the more times an item was returned.

The results are then collated using VBScript with a score being assigned to each distinct item returned equal to the number of times it was returned. VBScript is then used to display the results showing the most relevant items (in terms of numbers of matchings) first.

9.4.2 The Advanced Search
VBScript is used to determine whether the user has entered subject, location or both and control is transferred to the appropriate ASP search processing page. In the case of subject only being entered the following query is created:
SELECT photo.ID, description, fullLocation
FROM photo, phkw, keyword, location
WHERE (keyword='term1' OR keyword='term2' OR…) AND
photo.ID=phkw.phID AND phkw.kwID=keyword.ID AND (photo.locID=location.ID OR (photo.locID IS NULL AND fullLocation IS NULL)) GROUP BY photo.ID, description, fullLocation
ORDER BY COUNT(phkw.phID) DESC

Items are ranked according to the number of matches between search terms and their description.

In the case of location only being entered the following query is created:
SELECT photo.ID, description, fullLocation
FROM photo, location, locplace, places
WHERE (placename='place1' OR placename='place2' OR…) And photo.locID=location.ID And location.ID=locPlace.locID And locPlace.placeID=places.ID
GROUP BY photo.ID, description, fullLocation
ORDER BY COUNT(photo.ID) DESC

Items are ranked according to the number of matches between search terms and their location.

In the case of both subject and location being entered the following query is created:
SELECT photo.ID, description, fullLocation
FROM photo, phkw, keyword, location
WHERE (keyword='term1' OR keyword='term2' OR…) AND photo.ID=phkw.phID AND phkw.kwID=keyword.ID AND
photo.locID=location.ID AND location.ID IN (
SELECT locID
FROM locplace, places
WHERE (placename='place1' OR placename='place2' OR…) AND locPlace.placeID=places.ID)
GROUP BY photo.ID, description, fullLocation
ORDER BY COUNT(photo.ID) DESC

Items are ranked according to the number of matches between subject search terms and their description. Only items which match at least one location term are returned.
For each search type a recordset is created in ASP and stepped through. The GROUP BY and ORDER BY clauses mean that the content of the record set is already ranked in best match first order.

Consideration was given to supporting Boolean operators such as AND and OR in the search but this facility is not included in the prototype for two reasons. Firstly these operators require some skill on the part of the user if ambiguity is not to arise, e.g. does cat AND dog OR mouse mean (cat AND dog) OR (cat AND mouse), or does it mean (cat AND dog) OR mouse? Secondly, the ranking algorithm described above provides the advantages of AND and OR searching with ANDed results being returned first, followed by the ORed results. User feedback and search logs will be carefully monitored and the policy for Boolean searching will be reviewed if appropriate.

9.4.3 Proposed Modifications
As the size of the database grows it may become necessary/desirable to adopt some specialised processing to handle searches which would potentially return a very large number of results. Such searches would present a particular problem in the case of the Quick Search, in which the recordset needs additional server-side processing in order to rank the results.

The simplest approach would be to return only the first n results, notifying the user and directing them to the Advanced Search. However, a more satisfactory solution would be to identify keywords (and/or placenames) with large numbers of matches and to amend the SQL query so that these terms are ANDed with the other term(s). E.g. in the case of a search for "festivals Japan", the term Japan would produce many matches. As the search mechanism stands each of these would be returned along with every match for festivals (with images matching both terms being shown first). The proposed modification would mean that only images matching Japan AND festivals would be returned. This would most likely be in accordance with the searcher's intentions.

At present results arising from multiple search terms are ranked according to the number of matches between items and search criteria, however no ranking is applied to results from a single search term. It may be beneficial to rank such results based upon the number of times the thumbnails have previously been clicked, effectively displaying the most popular items first.

Adding new images to the collection is currently somewhat complex as it involves a similar process to that described above for the initial database population. This would be unsuitable for a cataloguer who did not have detailed knowledge of the database structure and considerable I.T. skills, neither of which should be required of a cataloguer. Therefore an automated system of cataloguing should be developed in which a form is completed for each image to be added. The system should refer to a thesaurus describing the various relationships between its terms. Appendix E suggests a model for such a system.

Often users make keyboard or spelling errors when entering data into a computer. A possible enhancement to both search and cataloguing features would be a spell-checker that would not only identify misspelt words but also suggest valid alternatives.

9.5 The Category Index
It was recognized that, due to the initially relatively small size of the image collection, users would benefit from some guidance as to the nature of twinIsles' content if a large number of searches were not to end in frustration. A category index was developed from the major themes of the stock held. This was placed on the main search page. The index incorporates a degree of hierarchy, e.g. Japan includes Tokyo, which in turn includes homeless people.

The index includes the collections described in 9.2 above. Other categories are displayed dynamically in response to the database being queried. This has the advantage that new images are automatically included in the appropriate categories.

twinIsles.dev >> Photography on the Web

Next Other Considerations

e-mail me with your comments and suggestions | Home