Long-Term Persistence

When storing digital data for the long term, one generally distinguishes between preserving the digital bits on a storage medium and preserving the interpretability of those bits (what they represent, e.g. a video file or a text document in a certain format). These are different problems that require different solutions.

Bit stream preservation

Bit stream preservation is about preserving the digital bits on a data carrier (storage medium). All current storage technology has a limited lifetime; this applies to the individual media as well as to the technology as a whole. Hard disks, for example, typically last only about 5 years before they fail. The types of hard disk and the way they connect to the computer also change over the years, and it is quite likely that hard disk technology as we know it will at some point be replaced by some other storage technology. Because both the media and the storage technology are short-lived, two measures need to be taken:

    • the data needs to be migrated to newer, up-to-date storage media at regular intervals
    • backup copies of the data need to be created, ideally several of them in geographically separate locations
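Both measures only help if one can confirm that a migrated or backed-up copy still contains the same bits as the original. A common way to check this is to record a checksum for each file and recompute it on every copy. The following is a minimal sketch of that idea in Python; the function names are hypothetical and it is not a description of the procedure actually used for the DOBES archive.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large media files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_copies(expected_digest, copies):
    """Return the copies that are missing or no longer match the recorded digest.

    `expected_digest` is the checksum recorded when the file was archived;
    `copies` is a list of paths to the replicas (hypothetical locations).
    """
    damaged = []
    for copy in copies:
        path = Path(copy)
        if not path.is_file() or sha256_of(path) != expected_digest:
            damaged.append(str(path))
    return damaged
```

A copy reported by `find_damaged_copies` would then be replaced from one of the intact copies, which is why keeping more than one backup matters.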

At the MPI for Psycholinguistics, where all DOBES data is stored, the storage technology is typically replaced with newer technology every 5 years. At the moment, a Hierarchical Storage Management system is in use (SUN/ORACLE SAM-FS) that stores the data on hard drives as well as on data tapes (LTO5). Data can be moved back and forth dynamically between tape and hard drive depending on a number of parameters, e.g. how often a file is accessed.

For all archived DOBES data, 7 copies currently exist (counting the double storage strategy at each of the two computer centers as two copies):

  • at the MPI, two copies are created dynamically on different storage media and in different locations in the building
  • a copy is distributed dynamically to the GWDG in Göttingen (Germany), which is one of the two big computer centers of the Max Planck Society and which itself has a double storage strategy
  • another copy is distributed dynamically to the RZG in Garching-Munich, which is the other big computer center of the Max Planck Society and which also has a double storage strategy (exchange of all data with the Leibniz Computer Center)
  • another copy is distributed dynamically to the MPI for Evolutionary Anthropology in Leipzig

For the archived data in the two computer centers of the Max Planck Society, the president has given a 50-year institutional guarantee to preserve the bits.

Interpretability

The most difficult aspect of preserving digital data for the long term is the interpretability of the formats. File formats and encodings typically have a limited lifetime as well, and it can be very difficult to read a format after it has become obsolete. One example is the WordPerfect format, which was very popular until the mid-90s but is hardly used any more today. If at some point no software is available any more to read such an obsolete format, the data is effectively lost even if the files are perfectly intact. In order to minimize this risk, most archives try to use standardized, open, non-proprietary formats as much as possible. For textual material, XML-based formats are often preferred since they carry both the content and its structure in one and the same document, as plain (Unicode) text.
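To illustrate why such a format stays interpretable, the sketch below reads a small XML document with nothing but a generic XML parser from the Python standard library. The element and attribute names (`transcript`, `utterance`, `speaker`) are invented for this illustration and do not correspond to any format accepted by the archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the markup describes the structure, the text is plain Unicode.
RECORD = """<?xml version="1.0" encoding="UTF-8"?>
<transcript>
  <utterance speaker="A">Grüß Gott</utterance>
  <utterance speaker="B">Dzień dobry</utterance>
</transcript>
"""


def read_utterances(xml_text):
    """Return (speaker, text) pairs found in the document."""
    # Parse from UTF-8 bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(u.get("speaker"), u.text) for u in root.iter("utterance")]
```

Because the structure is spelled out in the text itself, a reader without any parser could still open the file in a plain text editor and recover who said what.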

The file formats accepted for the DOBES archive are, as far as possible, standardized, open and non-proprietary. Some compromises are made, however, where a certain format is the de facto standard and no better alternative is available. The list of accepted formats can be found in appendix A of the manual of the LAMUS web-based archive upload system.