The University of Alberta Archives EAD Pilot Project
It has become a cliché to observe that digital technology and broadband networks are revolutionizing how society manages information. Among the many implications of this phrase, archivists have concerned themselves with one in particular: has digital technology redefined the traditional archival notion of a record and the archival practices required for its reliable preservation? The University of Alberta Archives has become part of this discussion with a photo digitization and EAD (Encoded Archival Description) finding aid conversion pilot project.
The project began with an e-mail message in January from the Canadian Council of Archives announcing that funds remained available after accounting for support for digital heritage activities in the 2001 fiscal year. Although the Archives had already arranged for a digitization project through an ASA-supported CAIN initiative, we could not resist the allure of new financial support. The Archives put together a proposal to convert three RAD-compliant finding aids and a selection of photos, text, and other graphic material from three fonds. The records and finding aids came from the Violet Archer fonds, the Studio Theatre fonds and the Chinook/Fringe Theatre fonds. Selecting these records for digitization was not difficult since the U of A had already formulated a two-page draft list of selection criteria for digital conversion projects (available on request). Technical issues and administrative concerns over digitizing records in a university context had already been addressed in a background report (also available on request) for the Archives' ASA digitization project.
The proposal we submitted broke down into two phases. In phase one a technician marked up finding aids with the SGML-based EAD data structure. The Library of Congress has made the SGML document type definition (DTD) for encoding text in EAD freely available (www.loc.gov/ead/). The DTD has been downloaded, along with some macros (more will be created on site), from the suite of software tools provided by the Society of American Archivists and commonly known as the EAD Cookbook (www.iath.virginia.edu/ead/cookbookhelp.html). The Cookbook was useful for quickly encoding the higher levels of the finding aid files. However, we were working with "legacy" finding aids, each over a hundred pages, written in MS Word and describing a considerable amount of item-level material. To expedite the conversion of such finding aids to a suitable EAD format, Peter Wong, a computer science graduate hired for the project, wrote a C++ program to help parse the MS Word document for item-level entries. As Peter describes it,
each column of the finding aid was copied manually into a separate text document. In our case, these columns were 'Box', 'File', 'Description', and 'Date', resulting in 'Box.txt', 'File.txt', 'Description.txt', and 'Date.txt'. The program then parsed each text file, concatenated them, and inserted the proper EAD tags. The result was an output file, which would contain item level entries with the desired EAD 'c' level tags. The contents of this output file were then copied and pasted into the desired finding aid.
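The column-merging step Peter describes can be illustrated with a short sketch. It is written in Python rather than the C++ of the original program, whose source is not reproduced here, so the function names, the choice of EAD elements inside each 'c' component and the sample entry are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def read_column(path):
    """Read one exported column file (e.g. Box.txt), one entry per line."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


def build_item_components(boxes, files, descriptions, dates):
    """Combine four parallel columns into item-level EAD <c> components.

    The element layout (<did>, <container>, <unittitle>, <unitdate>) is an
    assumed, typical EAD pattern, not the project's documented output.
    """
    columns = (boxes, files, descriptions, dates)
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all four columns must have the same number of rows")

    components = []
    for box, file_no, description, date in zip(*columns):
        components.append(
            '<c level="item"><did>'
            f'<container type="box">{escape(box.strip())}</container>'
            f'<container type="folder">{escape(file_no.strip())}</container>'
            f'<unittitle>{escape(description.strip())}</unittitle>'
            f'<unitdate>{escape(date.strip())}</unitdate>'
            '</did></c>'
        )
    return "\n".join(components)


if __name__ == "__main__":
    # Invented sample row; real input would come from read_column("Box.txt") etc.
    print(build_item_components(["1"], ["3"], ["Programme & poster"], ["1975"]))
```

Escaping each value before it is wrapped in tags keeps characters such as '&' in a description from producing an invalid file, and the row-count check catches a column that was copied incompletely.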
The files resulting from this process were in fact XML (Extensible Markup Language) files. They are the RAD-compliant text files, now tagged to give them the hierarchically structured content needed to describe the interrelationships of records in a fonds. In this way, the structure of the document is completely isolated from the presentation. This enables us to preserve the XML file so that it can be migrated to whatever becomes the flavour of the week for web presentation. The authenticity and reliability of the records are not directly endangered by web technology obsolescence; the intellectual control of the records is secure.
Phase two required giving these XML files a web presentation. We originally intended to use the stylesheets supplied with the EAD Cookbook suite of software. However, we moved over to the New York University stylesheets after discovering their NYU EAD Production Guide (www.nyu.edu/library/bobst/collections/findingaids/ead/stylesheets.html). The stylesheets give the visual presentation to the encoded XML files. A Java parser validates the tagged file against the EAD DTD, checking for any tagging errors. A stylesheet processor then applies a stylesheet, which gives instructions on how each element should be presented as HTML on the Web. Originally we used XT, the Java XSLT processor written by James Clark, but we have decided to move to Saxon since XT is no longer maintained.
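The separation of structure from presentation can be shown in miniature. The sketch below is not the NYU stylesheets and does not use XSLT or Saxon; it is a Python stand-in, under assumed element names, that maps each 'c' component of an EAD fragment to an HTML list item while leaving the XML untouched.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_components(ead_xml):
    """Render every <c> component of an EAD fragment as an HTML list item.

    ET.fromstring raises ET.ParseError on malformed markup. That is only a
    well-formedness check; it does not validate against the EAD DTD.
    Nested components are flattened into one list in document order.
    """
    root = ET.fromstring(ead_xml)
    items = []
    for component in root.iter("c"):
        title = component.findtext("did/unittitle", default="")
        date = component.findtext("did/unitdate", default="")
        items.append(f"<li>{escape(title)} <em>{escape(date)}</em></li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Because the rendering rules live in one function (or, in the project itself, in a stylesheet), the presentation can be replaced without editing the encoded finding aid.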
At this point we received useful expertise from the University of Alberta SunSITE Project (http://archives.sunsite.ualberta.ca). We needed to present our encoded finding aids in such a way that the descriptions could include a link at the item level to the scanned digital images we were placing on the SunSITE's Oracle server via a link with Oracle Forms. We wanted this to be done 'on the fly' rather than hard coding the HTML links into our descriptions and thereby infecting our structured content files with irrelevant markup information. Bess Sadler, a SunSITE staff member, applied Cocoon, free, open-source software for publishing XML on the Web (http://xml.apache.org/cocoon/). She modified the NYU stylesheets to function with Cocoon and the Saxon Java processor. Her work essentially solved the problem of how to parse and present networked metadata in a way that links it to the information in the database. In addition to using Cocoon, Bess wrote a Perl script that looks at the containers (EAD terminology for a generic subordinate body of material such as an item or series) described in the XML file, and searches for items in the Oracle database that belong within the container. When an item is found, the script generates the appropriate XML tag and inserts it in the correct location in the HTML presentation. In this way, the HTML link is only generated on the web page. When selected, this link uses a unique identifier (an accession number) to find the appropriate image in the database.
Bess is willing to share any software she wrote for the project (including adjustments to the NYU stylesheets) under the GNU General Public License, meaning the programs are free to use, modify and share, provided that redistributed copies stay under the same license and keep her credit. She can be contacted at esadler@ualberta.ca. All the scripts and software are running on a Pentium III 500 MHz desktop computer with 500 MB of RAM running Linux 2.4.7. The UofA Archives project can be viewed at http://sunsite.ualberta.ca/Projects/Archives/. The UofA SunSITE Project is available to host other projects.
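The container lookup performed by the Perl script can be sketched as follows. This is a hedged illustration in Python, not Bess's script: an SQLite connection stands in for the Oracle database, and the table name `images`, its columns, the URL pattern and the accession number format are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html import escape
from urllib.parse import quote


def links_for_containers(ead_xml, connection, base_url):
    """Map each (box, folder) pair in the finding aid to HTML image links.

    The XML is only read, never modified, so the links exist solely in
    the generated presentation.
    """
    root = ET.fromstring(ead_xml)
    links = {}
    for did in root.iter("did"):
        box = did.findtext("container[@type='box']")
        folder = did.findtext("container[@type='folder']")
        if box is None or folder is None:
            continue  # no container information to match against
        rows = connection.execute(
            "SELECT accession_no FROM images WHERE box = ? AND folder = ? "
            "ORDER BY accession_no",
            (box, folder),
        ).fetchall()
        links[(box, folder)] = [
            f'<a href="{escape(base_url + quote(accession))}">{escape(accession)}</a>'
            for (accession,) in rows
        ]
    return links


if __name__ == "__main__":
    # Hypothetical schema and data, created in memory for the demonstration.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE images (accession_no TEXT, box TEXT, folder TEXT)")
    db.execute("INSERT INTO images VALUES ('UAA-0001', '1', '3')")
    fragment = (
        '<dsc><c level="item"><did>'
        '<container type="box">1</container>'
        '<container type="folder">3</container>'
        '<unittitle>Poster</unittitle></did></c></dsc>'
    )
    print(links_for_containers(fragment, db, "http://example.org/image?id="))
```

Keying the lookup on the accession number means a link stays valid if an image file is moved or renamed, as long as the database record is kept current.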
Why EAD? There are many approaches to recording structural metadata for digital resources and the collections they originate from. These approaches fall into two basic categories. First, one can capture the structure through basic data management techniques. Many software packages exist that allow one to store digital images (or sound files and other material) in file-system directory schemes. Such schemes are designed to represent the hierarchical structure of the images. Second, in the approach we selected, one can use EAD as a structured markup language to link images and enable access to various levels of a fonds. Ultimately, markup languages can be coordinated with data management techniques to represent and make accessible data in databases. A final reason for the attractiveness of text encoding is cost. Apart from software for digital imaging, all the software used on this project has been open source. The only expenses incurred have come from hardware costs and labour.
Using EAD is not without concerns. As with the other functions for handling digital documents in an archives, there are no benchmark standards for digital access. The cumulative effect of display and access involves decisions concerning "file formats, compression processes, scripting routines, transfer protocols, Web browsers, processor capabilities and server specifications." (Kenney and Rieger, Moving Theory into Practice). There is a large array of metadata that must be captured in a networked framework. A short list includes resource discovery, rights management, preservation, administration, provenance and more. Moreover, the provenance of the records is rarely linear. As Daniel Pitti has explained,
Relations between records, creators and functions and activities are dynamic and complex, and not fixed and simple. Creators are related to other creators. Records are related to other records. Functions and activities are related to other functions and activities. (D. Pitti, "Creator Description: Encoded Archival Context," Institute for Advanced Technology in the Humanities, University of Virginia, 2001, p.4.)
It is in part for this reason that the EAD tags defined in document type definitions (DTDs) are relatively general. This is compounded by the fact that SGML and its XML and EAD descendants are designed to separate content from presentation and formatting. These conditions result in a large variety of web presentations for information that lacks descriptive precision and structure. As Elizabeth Shaw noted recently on an EAD listserv:
EAD should be a means to provide structural and semantic markup of archival description. It has always concerned me that it is a loosely structured set of markup, trying to accommodate everyone's idea of the way description should be 'presented.' And even as there are descriptions of what belongs within which tags, the interpretation across archival repositories varies... EAD's highly lax structure has made it more difficult to mine what could be a very rich descriptive structure. In fact, I would argue that its laxness has actually confounded people's ability to modify it to their own descriptive needs by inhibiting the very commonalties that it was developed to promote. (EAD listserv EAD@sun8.loc.gov, September 17, 2001, pp. 1-2).
For this reason, the University of Alberta Archives described its EAD encoding endeavors as a pilot project. While we await the second version of the EAD DTDs, the Archives is considering submitting suggestions for changes to the DTDs. At the moment EAD is a system evolving in a generally useful direction, and the Archives anticipates future EAD encoding projects.
Returning to the question posed at the start of this article, our experiences in this project suggest that despite the developments in networked information, the fundamental role of the archivist has not dramatically changed. Archivists continue to make use of digital information through identifying and securing the "juridical-administrative, procedural, provenancial, documentary, and technological contexts." (Anne Gilliland-Swetland, "Preserving the Authenticity of Contingent Digital Objects," D-Lib Magazine, Vol.6, No.7/8, July/August 2000, p.2) A digital record is a record created in electronic form. The important contextual signposts are the same. The significant digital problem is taking a digital object and its physical properties and securely attaching this relevant context.
Submitted by: Raymond Frogner, Associate Archivist, University of Alberta Archives