The Library of Congress >> To Preserve and Protect

Publications (The Library of Congress)


ELECTRONIC INFORMATION AND DIGITIZATION
Preservation and Security Challenges

17. Preservation, Security, and Digital Content
Carl Fleischhauer

Our keywords for this publication—preservation and security—are variously defined, certainly in the context of digital materials. The variation is even greater, and the meanings assigned even broader, when we look at the terms as they are applied to entities in an organization, for example, the "Preservation Department" or the "Security Office." Because this book is about strategic stewardship, I think that it is fitting to discuss things from an organizational perspective and to take a broader view.

Let me say a few words about the first term, "preservation." For illustration, allow me to oversimplify a bit about the old days, when our concern was focused exclusively on physical objects. Let us take books, for instance. Preservation departments within libraries were organized to conserve the books and—when necessary—to reformat them (which usually meant microfilming them). If thought was given to adding intellectual value, this was seen as the business of other units within the library, perhaps where a selection process assembled the collection, where cataloging took place, or where the reference staff placed materials in the hands of readers. Added intellectual value was a matter of providing a context for a given item within a larger, cataloged collection. And the collection itself had intellectual value: the whole was greater than the sum of the parts.

With digital activities, we have seen organizational boundaries begin to weaken. The Library of Congress program called American Memory is a classic—but by no means the only—example of a reformatting project motivated by the desire to expand access. [1] Such activities reproduce the original objects and at the same time add intellectual value by improving access, especially when they produce searchable texts. Here, even the parts are "greater" than they were before digitization.

I know that the reproductions in American Memory are often described as access-quality copies, but the program and allied efforts at the Library of Congress have also investigated copy-making in the service of preservation. (Please note that we use the term "preservation copy" at the Library of Congress even when we retain the original item.) For some types of material, we have begun to produce very high-quality reproductions. The digital images produced by the Geography and Map Division, for example, surpass in quality the 105mm microfiche traditionally produced for the division, and the paper output from the scans far and away surpasses any output obtainable from the film. I am not sure if the division has started using the p-word for their digital copies, but the images certainly fill the same niche that analog preservation copies used to fill.

Meanwhile, the Prints and Photographs Division is starting to hear from publishers who are well satisfied with the uncompressed master images they download from the Web, suggesting that here, too, digital images can take the place of what were called preservation copy negatives. At the Library of Congress and elsewhere, we are impressed by the effective service to researchers that is offered by, say, color images of manuscript pages. Such digital images are susceptible to enlargement and improvements in legibility when examined using well-selected software. (Alas, audio and video lag behind in terms of online quality, but I am convinced they will catch up soon.)

In the face of these developments, the people who are developing digital reformatting programs have moved from discussing the basics to discussing the niceties. When should we reproduce a book's pages as ink imprints, that is, capture what amounts to the typography and lines in drawings? This approach permits us to create clean paper facsimiles, as demonstrated by the Cornell University library scanning projects of the 1990s. Alternatively, when shall we make images of the pages that use a photographic approach to capture the look of the sheets (including paper color) in the manner demonstrated by Octavo CD-ROM editions of rare volumes? [2] When is it important to conserve the original bound volume even in the face of lower image quality? Are there ever times when the circumstances dictate disbinding a book in order to make perfect facsimile images? And, to echo my earlier remarks, this question: for this body of material, how shall we add value and improve access? This last topic is now understood to include not only options like exchangeable MARC (MAchine-Readable Cataloging) catalog records, but also standardized finding aids, searchable full texts, and "exposing and harvesting" the detailed, local data that live in intimate relationships with digitized content.

The Library of Congress has only just begun to examine the parallel set of issues for "born-digital" content. To some degree, the added intellectual value will come—as in the past—from assembling and cataloging a collection and providing access to it. The challenge of distributed custody—the likelihood that digital content will be held by different libraries or publisher-owners—will be met by the development of refined conventions for describing and indexing this dispersed content.

What does all this mean for libraries and archives? One answer is drawn from an organizational model in which an office or a family of related offices tackles a mix of issues: first, making reproductions, the reformatting aspect of preservation; second, analyzing and processing born-digital content; third, contributing to the addition of intellectual value by various means; and fourth, preserving content once in digital form, the other aspect of preservation, presumably including the value-added elements. I will note here an interesting discussion we are having at the Library, which aims to distinguish the role of keeper of the digital content from the role of shaper and indexer of the digital content for end-user access.

What can we say about preserving content in digital form? Solutions are beginning to emerge. Different methods will be applied depending on whether we are talking about the content in a library's custody or the content for which a library takes responsibility but does not have custody. The latter—what the National Library of Australia calls "remote management"—is out of scope for this discussion. It is worth mentioning, however, that the preservation of remote resources will include certifying custodians and binding them in legal agreements. The former, however, is the question we address here: how will we preserve that which is in our custody?

One key element is covered by the broad term "security." A recent report from the National Research Council notes that security has conventionally encompassed secrecy, confidentiality, integrity, and availability. [3] Using the broad term "trustworthiness," the report adds other terms or, as the report puts it, other "dimensions": correctness, reliability, privacy, safety, and survivability. The report notes that the dimensions are not independent: to increase one (say, confidentiality) will inevitably decrease another (availability). Information technology professionals know that it is difficult to manage these dimensions in a networked environment in which many software applications are commercial packages produced by third parties.

Consideration of security—or trustworthiness—leads organizations to hone their skills in a family of actions. Many operational and administrative components must be brought together to provide a trustworthy networked environment. In my conversations with colleagues at the Library of Congress, I have started to hear what amounts to a checklist of these components:

(1) ensure the physical security of buildings, hardware, and cabling;

(2) install and integrate firewalls and routers to control network traffic;

(3) authenticate users and authorize their access to appropriate zones within the institution's systems;

(4) protect the integrity of systems and data against corruption caused by accident, errors, or infiltration by unauthorized persons;

(5) monitor data integrity;

(6) monitor network traffic;

(7) back up systems and data and establish disaster recovery plans; and

(8) develop guidelines for individual users and train them in the use of these guidelines.

The trustworthiness that will result from assembling these components will provide a necessary, but not sufficient, condition for the preservation of content. Several commentators have summarized the known approaches to preserving digital content in five categories, at least two of which are addressed by a trustworthy networked information system: (1) refreshing the bits and (2) using better media. "Refreshing" refers to copying a stream of bits from one location to another, whether the physical medium is the same or not, to keep the bits alive without change. "Better media" refers to the longevity and technology-independence of storage media, which may be more important for offline than for online storage.
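
To make "refreshing" concrete, here is a minimal sketch in Python of copying a bit stream to a new location and confirming that nothing changed along the way. It is an illustration under stated assumptions, not a description of any system in use at the Library: the function names `sha256_of` and `refresh_bits`, the choice of SHA-256 as the checksum, and the chunk size are all mine.

```python
import hashlib
from pathlib import Path

# Assumption: read in 1 MiB pieces so that large files need little memory.
CHUNK_SIZE = 1024 * 1024


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_bits(source: Path, target: Path) -> str:
    """Copy `source` to `target` and confirm the copy is bit-identical.

    Returns the shared digest; raises IOError if the two digests differ.
    """
    before = sha256_of(source)
    with source.open("rb") as src, target.open("wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK_SIZE), b""):
            dst.write(chunk)
    # Re-read the new copy from its own location rather than trusting the write.
    after = sha256_of(target)
    if before != after:
        raise IOError(f"refresh failed: {source} and {target} differ")
    return after
```

The digest returned by `refresh_bits` could be stored with the object and recomputed on a schedule, which is one simple way to carry out the "monitor data integrity" item in the checklist above.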

As an aside, I will confess that I do not quite know the best location for "authenticity" in this cyber-geography. Like many others, I have been impressed by the papers on this topic resulting from the Council on Library and Information Resources discussion in January 2000. [4] But the papers demonstrate that the issues are too numerous to permit easy resolution. "Authentic" in what sense? If authenticity is a matter of a document's properties, do we mean all properties or just some properties? To what degree is the need met by technological elements: checksums, encryption, signatures? How shall we distinguish the elements pertaining to "integrity" from those pertaining to "authenticity"? Cliff Lynch's paper in this volume reminds us of our dependence on "trust," especially trust in an intermediary to whom we turn to authenticate a document. [5]

Let me leave authenticity to those better able to explain it and return to the five digital-content preservation categories. I mentioned (1) refreshing the bits and (2) using better media, associating them with security and trustworthiness. The truth is, one could define the term "trustworthiness" to cover the next three categories as well: (3) migration of content, (4) emulation of the technical environment, and (5) digital paleography. Migration includes the transformation of content from one data representation (digital format) to another, that is, from one digital format for images (say, TIFF, or Tagged Image File Format) to a future standard format that provides enhanced functionality. Emulation requires that one use the power of a new generation of technology to function as if it were the technology of a previous generation. For example, the provision of future access to the computer game Myst would almost certainly require the emulation of Windows 95 (the Microsoft operating system released in 1995) and other elements. Peter Hirtle's definition of "digital paleography" alludes to "the venerable science and art of reading and deciphering old or obscure handwriting." Hirtle envisions digital paleographers who can, say, read files encoded in HTML 1.0 (the first version of Hypertext Markup Language) and convert them to "whatever standard may then be current, be it XML [Extensible Markup Language] and a stylesheet, a hand-held markup language, or an eBook standard." [6]

How might we accomplish migration or emulation? (I cannot think of a thing to say about digital paleography.) One answer is to seize the moment when content first arrives at our door or when we create it in a reformatting program. This is our best opportunity to be sure we have a preservable digital object. An analogy may be drawn with preservable physical objects. We seek to acquire physical books bound in signatures and printed on acid-free paper, knowing that they are inherently easier to conserve than cheaply bound volumes printed on acidic paper. We produce preservation microfilm on polyester-based film and process it according to preservation standards in the laboratory. By the same token, we will wish to acquire born-digital books with texts in an accepted markup language and illustrations in standard image file formats. Such items will be inherently easier to migrate than electronic books with texts and illustrations in proprietary formats that require special software for viewing. Or—in the case of an eBook in a proprietary structure that provides valuable "behavior"—the most preservable electronic instance will be one that is accompanied by thorough documentation and tools to maximize playability as computer systems change. Everyone agrees that these circumstances cry out for special metadata: we need information about the form and structure of the content, about the systems that might have to be emulated to play it, about access restrictions, and more. We need technical metadata to support migration, emulation, or a judicious combination of the two.
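
As a rough illustration of what such technical metadata might record, here is a short Python sketch. Everything in it is hypothetical: the `TechnicalMetadata` record, its field names, and the rule of thumb in `suggest_approach` are my own inventions for the sake of the example and do not represent a standard or a Library of Congress policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TechnicalMetadata:
    """Hypothetical record of the facts gathered when content arrives."""

    title: str
    text_format: str            # form of the text, e.g. "XML"
    image_formats: List[str]    # form of the illustrations, e.g. ["TIFF"]
    proprietary: bool           # True if special software is needed for viewing
    valued_behavior: bool       # True if look and feel must be retained
    required_systems: List[str] = field(default_factory=list)
    access_restrictions: str = "none recorded"


def suggest_approach(record: TechnicalMetadata) -> str:
    """Illustrative rule of thumb only: pick a starting preservation approach."""
    if record.proprietary and record.valued_behavior:
        # Behavior tied to a proprietary structure points toward emulation.
        return "emulation"
    if record.proprietary:
        return "migration after format analysis"
    # Accepted markup and standard image formats are the easiest to migrate.
    return "migration"
```

A born-digital book in an accepted markup language with standard image files would come out as "migration," while an eBook whose value lies in proprietary behavior would come out as "emulation," with `required_systems` naming what would have to be emulated to play it. A real decision would also weigh cost, as discussed below.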

It is worth a word about "look and feel" or "object behavior" in these examples. Reformatting (which can be viewed as a type of migration) transforms the look and feel of the physical book: a microfilm has no pages to turn, no paper to touch, just frames to advance. Similarly, we can expect some change in the look and feel of the migrated cyber book as next-generation software renders the text in a different way, or a new, higher resolution display screen or printer renders the illustration at reduced size. We trust that these changes will be minor—I do not quite dare say "aesthetic"—and that the information contained by text and picture will remain intact. In contrast, the look and feel of the non-migratable cyber book—or a book for which a curator is willing to pay the price to maintain its look and feel—will remain unchanged as long as system emulation can be provided. Although experts differ on this, we worry that the level of effort required to accomplish emulation will be greater than the effort needed to migrate, when migration is possible. As one of my colleagues points out, it may be easier to apply methods for searching an extended digital corpus at any given time if that corpus has been migrated into newer formats.

Reformatting programs make cyber objects that reproduce original physical items. At the Library of Congress, the general strategy for digital reformatting has been to produce migratable content, that is, reproductions of the originals and an object structure designed to permit migration. These reproductions are structured to provide a representation of the original item that is as good as or better than conventional reformatted copies. These copies must be at least as good as a microfilm's representation of a book or an analog audio tape's representation of a wax cylinder. In no case are these reproductions intended to emulate the complete look and feel or behavior of the original items. But there is an opportunity here to add value of a different sort, as in the case of rendering the text in searchable form.

What is a library's role when others are the makers of the objects? The event of acquiring offers a useful point of consideration, representing as it does the transaction between a library and the maker or the maker's representative. At this point, there may be an opportunity—as has been the case with the push for the use of acid-free paper in book manufacturing—to influence makers to produce digital content in more preservable form. The desire to influence makers has special meaning for the Library of Congress, where some acquisitions result from the workings of the copyright law. In these instances, the Library can define "best editions" (the form of a work desired by the Library for its collections) in ways that are most supportive of content preservation. The acquisition event is also a moment for the analysis of arriving digital content and the documentation of the features that are relevant for preservation planning. It may also be a moment for carrying out a cost-benefit analysis that weighs one preservation approach against another.

What do these ideas mean for the institution and its organization? Earlier, I alluded to an office or offices that would make reproductions, add intellectual value, and preserve content in digital form. But if digital content preservation entails operating a trustworthy networked information system, migrating content, and emulating systems, to say nothing of analysis-upon-acquisition and the execution of legal agreements for remote resources—well, we are surely not talking about an office in the singular. This calls for distributed responsibilities and carries a strong need for computer science expertise. Come to think of it, I guess this begins to describe a library in a digital age. It reminds us that securing and preserving digital content require a collective effort that will depend on the contributions of many people.



   September 15, 2008