As described in the main body of the report, the twinIsles cataloguing/retrieval
process does not make use of a formal thesaurus. Instead, synonymous and
hierarchical relationships are defined manually: synonymous terms are
allocated identical keyword IDs, and hierarchy is represented by
including broader terms as additional keywords.
This approach was considered sufficient for the purposes of launching
the prototype. However, if twinIsles is to grow further, some form of
automated cataloguing system should be implemented along with a formal
thesaurus to which it, and the search mechanism, may refer.
This appendix suggests a modified database incorporating a formal thesaurus
and the cataloguing and search systems which refer to it.
The Thesaurus
The suggested structure permits homographs among the terms (e.g. glasses
could be something to drink out of or an aid for the visually impaired)
and models the hierarchical, synonymous and associative relationships
(see 6.6.1).
The initial thesaurus could be generated from the initial set of images,
or could be adapted from an existing thesaurus, a number of which are
publicly available e.g. the Library of Congress Thesaurus for Graphic
Materials [61] (see below).
In either case it will be necessary to periodically update the thesaurus
to accommodate new items which may be added to the collection.
Fig E.1 Proposed structure of modified database
| Table     | Fields                 | Notes                                                        |
|-----------|------------------------|--------------------------------------------------------------|
| PHOTO     | ID, description, locID | locID: Location ID                                           |
| PHCON     | phID, conID            |                                                              |
| CONCEPT   | ID, termID             | termID: ID of preferred term                                 |
| TERM      | ID, term               | A term may be a single word or a phrase.                     |
| ADDTERM   | conID, termID          | This table models the synonymous relationship.               |
| HIERARCHY | conID, lft, rgt        | This table allows hierarchy to be modelled. See Celko [14].  |
| RELCON    | mainConID, relConID    | This table models the associative relationship.              |
| LOCATION  | ID, location           | Full location for caption.                                   |
| LOCPLACE  | locID, placeID         |                                                              |
| PLACE     | ID, placeName          | Name of a single place.                                      |
Fig E.2 Proposed Tables and Fields
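The tables in Fig E.2 could be declared roughly as follows. This is a minimal sketch using Python's standard sqlite3 module, not a definitive schema: the table and column names follow Fig E.2, but the column types, key constraints, the helper concepts_for_term and the sample terms are all assumptions made for illustration. The sample data shows how the homograph "glasses" resolves to two concepts.

```python
import sqlite3

# Column names follow Fig E.2; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE PLACE     (ID INTEGER PRIMARY KEY, placeName TEXT NOT NULL);
CREATE TABLE LOCATION  (ID INTEGER PRIMARY KEY, location TEXT NOT NULL);
CREATE TABLE LOCPLACE  (locID INTEGER REFERENCES LOCATION(ID),
                        placeID INTEGER REFERENCES PLACE(ID));
CREATE TABLE TERM      (ID INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
CREATE TABLE CONCEPT   (ID INTEGER PRIMARY KEY,
                        termID INTEGER REFERENCES TERM(ID));  -- preferred term
CREATE TABLE ADDTERM   (conID INTEGER REFERENCES CONCEPT(ID),
                        termID INTEGER REFERENCES TERM(ID));  -- synonyms
CREATE TABLE HIERARCHY (conID INTEGER REFERENCES CONCEPT(ID),
                        lft INTEGER, rgt INTEGER);            -- nested sets
CREATE TABLE RELCON    (mainConID INTEGER REFERENCES CONCEPT(ID),
                        relConID INTEGER REFERENCES CONCEPT(ID));
CREATE TABLE PHOTO     (ID TEXT PRIMARY KEY, description TEXT,
                        locID INTEGER REFERENCES LOCATION(ID));
CREATE TABLE PHCON     (phID TEXT REFERENCES PHOTO(ID),
                        conID INTEGER REFERENCES CONCEPT(ID));
"""


def concepts_for_term(db, term):
    """Return the IDs of every concept that `term` names, preferred or not."""
    rows = db.execute(
        """SELECT c.ID FROM CONCEPT c JOIN TERM t ON t.ID = c.termID
           WHERE t.term = ?
           UNION
           SELECT a.conID FROM ADDTERM a JOIN TERM t ON t.ID = a.termID
           WHERE t.term = ?""",
        (term, term),
    ).fetchall()
    return sorted(r[0] for r in rows)


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# Hypothetical sample data: one homograph shared by two concepts.
db.executemany("INSERT INTO TERM VALUES (?, ?)",
               [(1, "drinking glasses"), (2, "spectacles"), (3, "glasses")])
db.executemany("INSERT INTO CONCEPT VALUES (?, ?)", [(1, 1), (2, 2)])
db.executemany("INSERT INTO ADDTERM VALUES (?, ?)", [(1, 3), (2, 3)])

print(concepts_for_term(db, "glasses"))  # [1, 2]
```

Because a term is stored once in TERM and linked to concepts through CONCEPT and ADDTERM, the same string can name several concepts without duplication.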
A distinction is made between concepts (underlying things being represented)
and the terms (keywords and phrases) used to represent them, an approach
inspired by Cross et al [17].
Photographs are indexed by concept, rather than term. This implicitly
associates a photograph with all the terms describing its associated concept(s)
as well as all those describing hierarchically related child concepts.
It was noted during the cataloguing of images for the prototype that
some descriptors consist of more than one word, e.g. "bullet train" and
"new year". These must be stored as phrases within the database, so the
indexing system must provide a mechanism for indicating such phrases,
e.g. by enclosing them in double quotes.
The search mechanism must therefore also identify phrases. There are
two means of implementing this requirement:
- Every search string could be parsed into single words, 2-word phrases,
3-word phrases, ..., up to the n-word phrase (= entire search string). This
would result in 0.5(n^2+n) search terms arising from an n-word query. This
places the burden of effort onto the computer system.
- The user could be requested to indicate phrases e.g. by placing them
in double quotes (as is the convention on the leading search engines).
This places the burden of effort onto the user. Provided images were
also indexed under the individual words forming any associated phrases
a reasonably satisfactory result set could be expected even where the
user neglected to identify phrases.
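The two alternatives above can be sketched in a few lines of Python. The function names all_phrases and quoted_phrases are hypothetical, and the sketch assumes words are separated by whitespace and that phrases are marked with plain double quotes.

```python
import re


def all_phrases(query):
    """Option 1: every contiguous run of 1..n words in the query."""
    words = query.split()
    n = len(words)
    return [" ".join(words[i:i + size])
            for size in range(1, n + 1)
            for i in range(n - size + 1)]


def quoted_phrases(query):
    """Option 2: double-quoted text is one phrase; other words stand alone."""
    pairs = re.findall(r'"([^"]+)"|(\S+)', query)
    return [phrase or word for phrase, word in pairs]


terms = all_phrases("bullet train new year")
print(len(terms))  # 10, i.e. 0.5 * (4**2 + 4)
print(quoted_phrases('"bullet train" japan'))  # ['bullet train', 'japan']
```

The first function shows the quadratic growth in search terms directly; the second does almost no work but relies on the user supplying the quotes.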
Note that a separate HIERARCHY table is required to model the fact that
concepts may appear in multiple hierarchies.
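The lft and rgt fields suggest the nested set representation described by Celko, in which a concept's narrower concepts are those whose lft value falls between its own lft and rgt. The following sketch illustrates the idea with Python's sqlite3 module. The concept IDs and numbering are hypothetical, and it assumes that all rows share a single numbering space, so that a concept placed under two parents simply has two rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE HIERARCHY (conID INTEGER, lft INTEGER, rgt INTEGER)")

# Hypothetical concepts: 1 animals, 2 pets, 3 dogs, 4 goldfish, 5 fish.
# Goldfish (4) appears twice: once under pets and once under fish.
db.executemany("INSERT INTO HIERARCHY VALUES (?, ?, ?)", [
    (1, 1, 12),   # animals
    (2, 2, 7),    #   pets
    (3, 3, 4),    #     dogs
    (4, 5, 6),    #     goldfish
    (5, 8, 11),   #   fish
    (4, 9, 10),   #     goldfish (second placement)
])


def narrower(db, con_id):
    """Return IDs of all concepts nested inside any row of `con_id`."""
    rows = db.execute(
        """SELECT DISTINCT child.conID
           FROM HIERARCHY AS parent
           JOIN HIERARCHY AS child
             ON child.lft > parent.lft AND child.lft < parent.rgt
           WHERE parent.conID = ?""",
        (con_id,),
    ).fetchall()
    return sorted(r[0] for r in rows)


print(narrower(db, 2))  # [3, 4]
print(narrower(db, 5))  # [4]
```

Keeping lft and rgt in a table of their own, rather than on CONCEPT, is what allows the second goldfish row to exist.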
In order to estimate the usefulness of employing an existing thesaurus,
e.g. the Library of Congress Thesaurus for Graphic Materials I, keywords
from the prototype database were compared with terms in this thesaurus.
The file containing the keywords from the prototype database was 7KB.
The file containing the terms that did not match was 4KB (after removing
terms which were singular forms of plurals which had matched), i.e. judging
by file size, less than 50% of twinIsles' keywords matched terms in the
Library of Congress
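A comparison of this kind could also be made by counting terms rather than comparing file sizes. The sketch below is one possible approach, not the procedure actually used: the function name, the naive plural rule (appending "s" or "es") and the sample keyword and thesaurus lists are all assumptions for illustration.

```python
def unmatched_keywords(keywords, thesaurus_terms):
    """Return keywords with no match in the thesaurus, ignoring case.

    A singular keyword counts as matched if its naive plural is present.
    """
    thesaurus = {t.lower() for t in thesaurus_terms}
    missing = []
    for kw in keywords:
        k = kw.lower()
        if k in thesaurus or k + "s" in thesaurus or k + "es" in thesaurus:
            continue
        missing.append(kw)
    return missing


# Hypothetical sample lists, not taken from either real vocabulary.
keywords = ["temple", "bullet train", "castle", "shrine"]
thesaurus = ["Temples", "Castles", "Railroads"]

missing = unmatched_keywords(keywords, thesaurus)
match_rate = 1 - len(missing) / len(keywords)
print(missing, match_rate)  # ['bullet train', 'shrine'] 0.5
```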
The Cataloguing System
The following is a description of what is required from an automated image
cataloguing system suitable for use with twinIsles' database from a user
and system perspective.
| User (cataloguer) | System |
|---|---|
| Enters image ID (filename). | |
| | Checks for uniqueness of ID. Checks for existence of image. Displays thumbnail of image to aid cataloguing. |
| Enters description (caption to be displayed with image). | |
| | Parses description into individual words (i.e. removes extraneous punctuation, conjunctions, articles etc). Consults thesaurus for synonyms and child terms. Displays all generated and found keywords/concepts. See note below. |
| Reviews and edits displayed keywords/concepts, deleting and adding terms as appropriate. | |
| | Updates database. See note below. |
| Enters location by selecting from list or, where location is not present in list, typing it (in full) in text box. | |
| | Updates place and locplace tables in case of new location. Updates photo record. |
Note: Where new keywords are entered, their synonymous, hierarchical
and associative relationships would need to be identified and added to
the thesaurus. This could be done at the time, or new keywords could be
saved and correctly placed in the thesaurus at some later time, possibly
by another person. The former method has the advantage of making the relationships
available for the remainder of the current cataloguing session.
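The system's parsing and thesaurus-consultation step could be sketched as follows. This is an illustration only: the noise-word list, the function name candidate_keywords and the dictionary-based thesaurus (one mapping for synonyms, one for child terms) are assumptions, and multi-word phrases are not handled.

```python
import re

# Assumed noise-word list; a real system would use a fuller one.
NOISE_WORDS = {"a", "an", "and", "at", "in", "of", "on", "the"}


def candidate_keywords(description, synonyms, children):
    """Suggest keywords for a caption, for the cataloguer to review."""
    # Lower-case the caption and drop punctuation.
    words = re.findall(r"[a-z0-9']+", description.lower())

    # Keep each non-noise word once, in order of appearance.
    found = []
    for word in words:
        if word not in NOISE_WORDS and word not in found:
            found.append(word)

    # Add synonyms and child terms found in the thesaurus.
    suggestions = list(found)
    for word in found:
        for extra in synonyms.get(word, []) + children.get(word, []):
            if extra not in suggestions:
                suggestions.append(extra)
    return suggestions


# Hypothetical thesaurus entries.
print(candidate_keywords("A dog at the temple.",
                         {"dog": ["hound"]},
                         {"dog": ["puppy"]}))
# ['dog', 'temple', 'hound', 'puppy']
```

Words absent from both mappings (here "temple") are the new keywords that the note above says must later be placed in the thesaurus.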
In the case of a homogeneous collection, e.g. a museum of butterflies,
the required thesaurus could probably be well defined from the beginning,
i.e. only in rare cases would new terms need to be added. However, with
a heterogeneous collection such as that of twinIsles which grows in an
unpredictable manner, it is likely that the addition of new keywords would
be a common requirement.
The cataloguing of collections (i.e. sets of similar or related items)
would require a different procedure (and different cataloguing interface).
The Search System
The following is a description of what is required from a search system
interfacing with a thesaurus.
1. The user enters a query.
2. The query is parsed into individual words and phrases, i.e. search
terms. Noise words (e.g. a, and, the) and extraneous punctuation are removed.
3. An initial set of concepts is identified from the search terms (using
the synonymous relationship).
4. The initial set of concepts is expanded to include narrower concepts,
e.g. the concept of pets would be expanded to include dogs, goldfish and
budgies.
5. Images matching the expanded concept set are retrieved.
6. The images are ranked in order of relevance.
7. Related concepts are identified.
8. Matching images are displayed.
9. The related concepts are displayed, giving the user the opportunity
to expand the search.
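Steps 2 to 7 above can be sketched with in-memory dictionaries standing in for the thesaurus and index tables. All data and names here are hypothetical. The sketch handles single words only, assumes the narrower-concept mapping already contains every descendant, and ranks images by the number of matching concepts, which is one possible relevance measure rather than a prescribed one.

```python
NOISE_WORDS = {"a", "and", "the"}

# Hypothetical thesaurus and index data.
TERM_TO_CONCEPTS = {"pets": {1}, "pet": {1}, "dogs": {2},
                    "goldfish": {3}, "budgies": {4}, "vets": {5}}
NARROWER = {1: {2, 3, 4}}          # pets -> dogs, goldfish, budgies
RELATED = {1: {5}}                 # pets is associated with vets
PHOTO_CONCEPTS = {"p1.jpg": {2}, "p2.jpg": {1, 3}, "p3.jpg": {5}}


def search(query):
    """Return (ranked image IDs, related concept IDs) for a query."""
    # Step 2: split into terms, dropping punctuation and noise words.
    terms = [w.strip('.,;:!?"').lower() for w in query.split()]
    terms = [t for t in terms if t and t not in NOISE_WORDS]

    # Step 3: initial concepts via the synonymous relationship.
    initial = set()
    for term in terms:
        initial |= TERM_TO_CONCEPTS.get(term, set())

    # Step 4: expand to narrower concepts.
    expanded = set(initial)
    for concept in initial:
        expanded |= NARROWER.get(concept, set())

    # Steps 5 and 6: retrieve matches, rank by number of matching concepts.
    hits = {photo: len(concepts & expanded)
            for photo, concepts in PHOTO_CONCEPTS.items()
            if concepts & expanded}
    ranked = sorted(hits, key=lambda photo: (-hits[photo], photo))

    # Step 7: related concepts not already covered by the search.
    related = set()
    for concept in initial:
        related |= RELATED.get(concept, set())
    return ranked, sorted(related - expanded)


print(search("the pets"))  # (['p2.jpg', 'p1.jpg'], [5])
```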
Variants on the above include:
- Showing the concepts associated with each displayed image so that
the user may search on one or more of these.
- Allowing the user to select a number of displayed images as the basis
for a further search. The search would be on the concept(s) associated
with those images.