ELECTRONIC INFORMATION AND DIGITIZATION
Preservation and Security Challenges
17. Preservation, Security, and Digital Content
Carl Fleischhauer
Our keywords for this publication, preservation
and security, are variously defined, certainly in the context of
digital materials. The variation is even greater, and the meanings
assigned even broader, when we look at the terms as they are applied to
entities in an organization, for example, the "Preservation Department"
or the "Security Office." Because this book is about strategic
stewardship, I think that it is fitting to discuss things from an
organizational perspective and to take a broader view.
Let me say a few words about the first term,
"preservation." For illustration, allow me to oversimplify a bit
about the old days, when our concern was focused exclusively on physical
objects. Let us take books, for instance. Preservation departments
within libraries were organized to conserve the books and, when
necessary, to reformat them (which usually meant microfilming them).
If thought was given to adding intellectual value, this was seen as
the business of other units within the library, perhaps where a
selection process assembled the collection, where cataloging took
place, or where the reference staff placed materials in the hands of
readers. Added intellectual value was a matter of providing a
context for a given item within a larger, cataloged collection. And the
collection itself had intellectual value: the whole was greater than the
sum of the parts.
With digital activities, we have seen organizational
boundaries begin to weaken. The Library of Congress program called
American Memory is a classic (but by no means the only) example
of a reformatting project motivated by the desire to expand
access. [1] Such activities reproduce the original objects and
at the same time add intellectual value by improving access, especially
when they produce searchable texts. Here, even the parts are "greater"
than they were before digitization.
I know that the reproductions in American Memory are
often described as access-quality copies, but the program and allied
efforts at the Library of Congress have also investigated copy-making in
the service of preservation. (Please note that we use the term
"preservation copy" at the Library of Congress even when we retain the
original item.) For some types of material, we have begun to produce
very high-quality reproductions. The digital images produced by the
Geography and Map Division, for example, surpass in quality the 105mm
microfiche traditionally produced for the division, and the paper
output from the scans far and away surpasses any output obtainable from
the film. I am not sure if the division has started using the p-word for
their digital copies, but the images certainly fill the same niche that
analog preservation copies used to fill.
Meanwhile, the Prints and Photographs Division is
starting to hear from publishers who are well satisfied with the
uncompressed master images they download from the Web, suggesting that
here, too, digital images can take the place of what were called
preservation copy negatives. At the Library of Congress and elsewhere,
we are impressed by the effective service to researchers that is offered by, say, color
images of manuscript pages. Such digital images are susceptible to
enlargement and improvements in legibility when examined using
well-selected software. (Alas, audio and video lag behind in terms of
online quality, but I am convinced they will catch up soon.)
In the face of these developments, the people who are
developing digital reformatting programs have moved from discussing
the basics to discussing the niceties. When should we reproduce a book's
pages as ink imprints, that is, capture what amounts to the
typography and lines in drawings? This approach permits us to create
clean paper facsimiles, as demonstrated by the Cornell University
library scanning projects of the 1990s. Alternatively, when shall we make
images of the pages that use a photographic approach to capture the look
of the sheets (including paper color) in the manner demonstrated by
Octavo CD-ROM editions of rare volumes? [2] When is it important
to conserve the original bound volume even in the face of lower image
quality? Are there ever times when the circumstances dictate disbinding
a book in order to make perfect facsimile images? And, to echo my
earlier remarks, this question: for this body of material, how shall we
add value and improve access? This last topic is now understood to
include not only options like exchangeable MARC (MAchine-Readable
Cataloging) catalog records, but also standardized finding aids,
searchable full texts, and "exposing and harvesting" the detailed,
local data that live in intimate relationships with digitized
content.
The Library of Congress has only just begun to
examine the parallel set of issues for "born-digital" content. To some
degree, the added intellectual value will come, as in the
past, from assembling and cataloging a collection and
providing access to it. The challenge of distributed custody (the
likelihood that digital content will be held by different libraries
or publisher-owners) will be met by the development
of refined conventions for describing and indexing this dispersed
content.
What does all this mean for libraries and archives?
One answer is drawn from an organizational model in which an office or a
family of related offices tackles a mix of issues: first, making
reproductions, the reformatting aspect of preservation; second,
analyzing and processing born-digital content; third, contributing to
the addition of intellectual value by various means; and fourth,
preserving content once in digital form, the other aspect of
preservation, presumably including the value-added elements. I will note
here an interesting discussion we are having at the Library that seeks to
distinguish the role of keeper of the digital content from the role of
shaper and indexer of the digital content for end-user access.
What can we say about preserving content in digital
form? Solutions are beginning to emerge. Different methods will be
applied depending on whether we are talking about the content in a
library's custody or the content for which a library takes
responsibility but does not have custody. The latter (what the
National Library of Australia calls "remote management") is out of
scope for this discussion. It is worth mentioning, however, that the
preservation of remote resources will include certifying custodians and
binding them in legal agreements. The former, however, is the question
we address here: how will we preserve that which is in our custody?
One key element is covered by the broad term
"security." A recent report from the National Research Council notes
that security has conventionally encompassed secrecy, confidentiality,
integrity, and availability. [3] Using the broad term
"trustworthiness," the report adds other terms or, as the report puts
it, other "dimensions": correctness, reliability, privacy, safety, and
survivability. The report notes that the dimensions are not independent: increasing one (say,
confidentiality) can decrease another (availability).
Information technology professionals know that it is difficult to manage
these dimensions in a networked environment in which many software
applications are commercial packages produced by third parties.
Consideration of security, or
trustworthiness, leads organizations to hone their skills in a
family of actions. Many operational and administrative components must
be brought together to provide a trustworthy networked environment. In
my conversations with colleagues at the Library of Congress, I have
started to hear what amounts to a checklist of these components:
(1) ensure the physical security of buildings,
hardware, and cabling;
(2) install and integrate firewalls and routers to
control network traffic;
(3) authenticate users and authorize their access to
appropriate zones within the institution's systems;
(4) protect the integrity of systems and data
against corruption caused by accident, errors, or infiltration by
unauthorized persons;
(5) monitor data integrity;
(6) monitor network traffic;
(7) back up systems and data and establish disaster
recovery plans; and
(8) develop guidelines for individual users and train
them in the use of these guidelines.
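Item (5) on the checklist, monitoring data integrity, is often put into practice by recording a checksum for each file and recomputing it on a schedule. The following is a minimal sketch under that assumption, not a description of any system in use at the Library of Congress; the function names and the choice of SHA-256 are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files now under root with a previously stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
    }
```

A manifest built when content is accepted can be stored apart from the content itself; a later call to audit() then reports files that have gone missing, appeared unexpectedly, or changed. A checksum of this kind detects alteration but says nothing about who made it, which is one reason the checklist treats integrity and authorization as separate items.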
The trustworthiness that will result from assembling
these components will provide a necessary (but not sufficient) condition
for the preservation of content. Several commentators have
summarized the known approaches to preserving digital content in five
categories, at least two of which are addressed by a trustworthy networked information
system: (1) refreshing the bits and (2) using better media.
"Refreshing" refers to copying a stream of bits from one location to
another, whether the physical medium is the same or not, to keep the
bits alive without change. "Better media" refers to the longevity and
technology-independence of storage media, which may be more important
for offline than for online storage.
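Refreshing, as defined above, can be sketched in a few lines: copy the bit stream to a new location and confirm that the copy is identical before relying on it. The sketch below is an illustration under assumed names (refresh, file_digest); it is not drawn from any production system, and the use of SHA-256 to compare the two copies is an assumption.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the bits are unchanged.

    Returns the digest shared by both copies; raises OSError and removes
    the new copy if the digests differ.
    """
    before = file_digest(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = file_digest(destination)
    if before != after:
        destination.unlink()
        raise OSError(f"refresh failed for {source}: digests differ")
    return after
```

The destination may sit on the same kind of medium or a different one; the check is the same either way, which matches the point that refreshing is indifferent to the physical medium.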
As an aside, I will confess that I do not quite know
the best location for "authenticity" in this cyber-geography. Like many
others, I have been impressed by the papers on this topic resulting
from the Council on Library and Information Resources discussion in
January 2000. [4] But the papers demonstrate that the issues are too
numerous to permit easy resolution. "Authentic" in what sense? If
authenticity is a matter of a document's properties, do we mean all
properties or just some properties? To what degree is the need met by
technological elements: checksums, encryption, signatures? How shall we
distinguish the elements pertaining to "integrity" from those
pertaining to "authenticity"? Cliff Lynch's paper in this volume
reminds us of our dependence on "trust," especially trust in an
intermediary to whom we turn to authenticate a
document. [5]
Let me leave authenticity to those better able to
explain it and return to the five digital-content preservation
categories. I mentioned (1) refreshing the bits and (2) using better
media, associating them with security and trustworthiness. The truth is,
one could define the term "trustworthiness" to cover the next three
categories as well: (3) migration of content, (4) emulation of the
technical environment, and (5) digital paleography. Migration includes
the transformation of content from one data representation (digital
format) to another, that is, from one digital format for images (say,
TIFF, or tagged image file format) to a future standard format that
provides enhanced functionality. Emulation requires that one use
the power of a new generation of technology to function as if it were
the technology of a previous generation. For example, the provision of
future access to the computer game Myst would almost certainly
require the emulation of Windows 95 (the Microsoft operating system
released in 1995) and other elements. Peter Hirtle's definition of
"digital paleography" alludes to "the venerable science and art of
reading and deciphering old or obscure handwriting." Hirtle envisions
digital paleographers who can, say, read files encoded in HTML 1.0 (the
first version of Hypertext Markup Language) and convert them to
"whatever standard may then be current, be it XML [Extensible Markup
Language] and a stylesheet, a hand-held markup language, or an eBook
standard." [6]
How might we accomplish migration or emulation? (I
cannot think of a thing to say about digital paleography.) One answer is
to seize the moment when content first arrives at our door or when we
create it in a reformatting program. This is our best opportunity to be
sure we have a preservable digital object. An analogy may be drawn with
preservable physical objects. We seek to acquire physical books bound in
signatures and printed on acid-free paper, knowing that they are
inherently easier to conserve than cheaply bound volumes printed on
acidic paper. We produce preservation microfilm on polyester-based film
and process it according to preservation standards in the laboratory. By
the same token, we will wish to acquire born-digital books with texts in
an accepted markup language and illustrations in standard image file
formats. Such items will be inherently easier to migrate than electronic
books with texts and illustrations in proprietary formats that require
special software for viewing. Or, in the case of an eBook in a
proprietary structure that provides valuable "behavior," the most
preservable electronic instance will be one that is accompanied by
thorough documentation and tools to maximize playability as computer systems
change. Everyone agrees that these circumstances cry out for special
metadata: we need information about the form and structure of the
content, about the systems that might have to be emulated to play it,
about access restrictions, and more. We need technical metadata to
support migration, emulation, or a judicious combination of the
two.
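To make the idea of technical metadata concrete, here is a hedged sketch of a record that captures the kinds of information listed above: form and structure, the systems that might have to be emulated, and access restrictions. The field names and sample values are hypothetical and do not follow any particular metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TechnicalMetadata:
    """Hypothetical preservation record for one digital object."""
    identifier: str
    file_format: str                      # form of the content
    format_is_proprietary: bool
    structure_note: str                   # how the parts fit together
    rendering_environment: list[str] = field(default_factory=list)  # systems to emulate
    access_restrictions: str = "none recorded"


def to_json(record: TechnicalMetadata) -> str:
    """Serialize the record so it can be stored alongside the content."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Sample values are invented for illustration.
example = TechnicalMetadata(
    identifier="example-0001",
    file_format="TIFF",
    format_is_proprietary=False,
    structure_note="one image file per page, in page order",
    rendering_environment=[],
)
```

A record for an object that depends on a particular operating system would list that system under rendering_environment, giving a future custodian the starting point for emulation; a record with an empty list and a standard format signals a candidate for migration.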
It is worth a word about "look and feel" or "object
behavior" in these examples. Reformatting (which can be viewed as a
type of migration) transforms the look and feel of the physical book: a
microfilm has no pages to turn, no paper to touch, just frames to
advance. Similarly, we can expect some change in the look and feel of
the migrated cyber book as next-generation software renders the text in
a different way, or a new, higher-resolution display screen or printer
renders the illustration at reduced size. We trust that these changes
will be minor (I do not quite dare say "aesthetic") and that
the information contained by text and picture will remain intact. In
contrast, the look and feel of the non-migratable cyber
book (or a book for which a curator is willing to pay the price to
maintain its look and feel) will remain unchanged as long as system
emulation can be provided. Although experts differ on this, we worry
that the level of effort required to accomplish emulation will be
greater than the effort needed to migrate, when migration is possible.
As one of my colleagues points out, it may be easier to apply methods
for searching an extended digital corpus at any given time if that
corpus has been migrated into newer formats.
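The choice described in this paragraph can be reduced, with much oversimplification, to a two-question rule: migrate when migration is possible and the look and feel need not be held constant; otherwise emulate. The sketch below encodes only that rule. The names are hypothetical, and a real decision would also weigh cost, documentation, and the needs of searching across a corpus.

```python
from enum import Enum


class Approach(Enum):
    MIGRATE = "migrate"
    EMULATE = "emulate"


def suggest_approach(migratable: bool, keep_look_and_feel: bool) -> Approach:
    """Hypothetical triage rule based only on the two factors named in the text."""
    if migratable and not keep_look_and_feel:
        # Migration is possible and minor changes in rendering are acceptable.
        return Approach.MIGRATE
    # Either the object cannot be migrated, or a curator has decided that
    # its look and feel is worth the expected extra effort of emulation.
    return Approach.EMULATE
```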
Reformatting programs make cyber objects that
reproduce original physical items. At the Library of Congress, the general
strategy for digital reformatting has been to produce migratable
content, that is, reproductions of the originals and an object structure
designed to permit migration. These reproductions are structured to
provide a representation of the original item that is as good as or better than
conventional reformatted copies. These copies must be at least as good
as a microfilm's representation of a book or an analog audio tape's
representation of a wax cylinder. In no case are these reproductions
intended to emulate the complete look and feel or behavior of the
original items. But there is an opportunity here to add value of a
different sort, as in the case of rendering the text in searchable
form.
What is a library's role when others are the makers
of the objects? The event of acquiring offers a useful point of
consideration, representing as it does the transaction between a library
and the maker or the maker's representative. At this point, there may be
an opportunity (as has been the case with the push for the use of
acid-free paper in book manufacturing) to influence makers to
produce digital content in more preservable form. The desire to
influence makers has special meaning for the Library of Congress, where
some acquisitions result from the workings of the copyright law. In
these instances, the Library can define "best editions" (the form of a
work desired by the Library for its collections) in ways that are most
supportive of content preservation. The acquisition event is also a
moment for the analysis of arriving digital content and the
documentation of the features that are relevant for preservation
planning. It may also be a moment for carrying out a cost-benefit
analysis that weighs one preservation approach against another.
What do these ideas mean for the institution and its
organization? Earlier, I alluded to an office or offices that would
make reproductions, add intellectual value, and preserve content in
digital form. But if digital content preservation entails operating a
trustworthy networked information system, migrating content, and
emulating systems, to say nothing of analysis-upon-acquisition and the
execution of legal agreements for remote resources, well, we are
surely not talking about an office in the singular. This calls for
distributed responsibilities and carries a strong need for computer
science expertise. Come to think of it, I guess this begins to describe
a library in a digital age. It reminds us that securing and preserving
digital content require a collective effort that will depend on the
contributions of many people.