Digital Data Management, Curation, and Archiving

Types of Data

There is a wide array of data types. Data can be described as experimental or observational. Data can also be tabular, or take the form of images, video or sound. Data can also be derived from other data, or be the result of simulation. It's important to think broadly about your definition of data as well. Some projects may not create data in a traditional sense, but rather result in design schematics, for instance, or 3D models, or just a series of images.

When describing your data, try to be as descriptive as possible to help others in interpreting it. You might also briefly describe the source or origins of your data, the amount of data in bytes or even the number of files.
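As a minimal sketch of how the amount of data might be measured, the Python function below counts the files in a directory and adds up their sizes in bytes. The function name summarize_directory is only an illustration, and the sketch assumes all of your data lives under a single top-level directory.

```python
from pathlib import Path


def summarize_directory(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    # rglob("*") walks the whole tree, including subdirectories.
    for path in Path(root).rglob("*"):
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
    return file_count, total_bytes
```

Calling summarize_directory on your project directory gives two numbers you can quote directly in your data description.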

Directory and File Naming Conventions

How you name your directories, files and variables can have a huge impact on making your data usable and understandable to others. The structure of your directory system is important, too. When planning your file structure and naming conventions, there are a number of things to consider. Imagine someone reading your directories and files as a sentence. In the examples below, file paths use a slash (/) to separate directory and file names.

  • Use human readable names: a23/seq is not good, Guppy/Sequences is better.
  • Use the project name for the top directory of each project
  • Subdirectories might be named:
    • By experimental run
    • By types of data (e.g., images, analysis, sequences)
  • Avoid spaces - many statistical programs cannot handle spaces in variable names. Spaces in file names and folders can cause problems as well. Use underscores or lower camel case for names with multiple words:
    • variableOne
    • September_2012
  • Plan your directory structure so that it reflects the structure of your data.
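
The two naming styles in the list above can be produced automatically. The following Python sketch is one possible way to do it. The function names are hypothetical, and it assumes the words in the original name are separated by spaces or other whitespace.

```python
import re


def to_underscore_name(name):
    """Replace each run of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


def to_lower_camel_case(name):
    """Join whitespace-separated words as lowerCamelCase."""
    words = name.split()
    if not words:
        return ""
    first = words[0].lower()
    # Capitalize the first letter of every later word.
    rest = "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
    return first + rest
```

For example, to_underscore_name("September 2012") returns "September_2012", and to_lower_camel_case("variable one") returns "variableOne".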

 

File Types

Prior to embarking on a research project, our team of Research Data Librarians will be happy to work with you to identify the best data format solutions for your work. We can work with you to find the current standards for data in your field, and create data management plans that both conform to these standards and ensure long-term access to your data.

If you already have data, we can work with you to determine how you want your data to be used, who you wish to have access to your data, and prepare your data for long-term storage and access.

Some general information is provided below, but please contact us for current information about established best practices regarding file types and recommended preservation formats. For the especially curious, the Library of Congress provides an excellent guide to the Sustainability of Digital Formats, to which your librarian may refer in the course of a consultation.

Text and Textual Data

While it's fine to use MS Word, MS Excel, or other proprietary software for your day-to-day creation and handling of your data, sustainable, long-term sharing and archiving are often better supported through transformation to established, open formats including text, comma separated value, and PDF. As human-readable, structured text formats, text (.txt) and comma separated value tabular files (.csv) are almost always readable by other software and are the easiest to archive and curate. Almost all word processing and spreadsheet software can import text files. If you have any questions about your particular case, please contact us. We'll be happy to help.

Image, Video and Audio Data

Curation of image, video and audio data poses unique challenges. While open formats exist, standard practices in many fields favor the use of proprietary standards or use image standards with embedded data that would be lost if converted to other formats. One recommended practice is to save data both in the source application format and in an open format if possible. It is particularly important to use lossless rather than lossy compression. It is also very important to document how the images or audio files were created, although the level of detail will vary greatly between capture devices. For instance, documentation for images of paintings might include lighting conditions, distance from the plane, exposure time, et cetera. For other devices these parameters might not make any sense at all. There are often existing discipline-specific standards. If you are unsure of existing standards, please ask us. We'll research it for you.

Other Data Formats

It is impossible to predict the variety of data formats that are used or will be used. If you are unsure how best to manage or archive your data, please don't hesitate to contact us. We will research it for you and develop a management and curation plan custom made for your data.

Private and Sensitive Data

Sometimes data will contain private information about people, information deemed to be secret by the government, or references to ecologically, culturally or otherwise sensitive places. You probably already know if your data does. If you have an IRB approval, include that here.

If you do have any privacy or sensitive data issues, describe how you secure your data, for instance by using password-protected and encrypted hard drives. Also describe how you will anonymize, obscure or remove this information when you make your data public.
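
As one illustration of obscuring and removing private information, the Python sketch below drops private fields from a record and replaces a direct identifier with a keyed hash. The function and field names are hypothetical. Note that this is pseudonymization rather than full anonymization: anyone holding the secret key can recompute the pseudonyms, and the remaining fields may still identify a person, so treat it as a starting point only.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Return a keyed hash (HMAC-SHA256) that stands in for an identifier.

    secret_key must be bytes and should be kept private.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def strip_private_fields(record, private_fields, id_field, secret_key):
    """Return a copy of record without private fields and with a pseudonymous ID."""
    cleaned = {k: v for k, v in record.items() if k not in private_fields}
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned
```

Because the same identifier and key always produce the same pseudonym, records belonging to one person can still be linked across files without revealing the name.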

Documentation

Computer programmers know the importance of well documented code. Experienced programmers know that if they don't document as they write, their documentation will suffer. Similarly, document your data as it is created, and document it so that someone in your field who is unfamiliar with your data would be able to understand it. Programmers who have experience with large projects know how important it is to keep all the components of a program organized in a clear, well defined file structure.

Tabular data can often be documented internally in the comments field. It is also good practice to place a text (.txt) or README file in the same directory as your data to serve as a "data dictionary" for the fields in your tabular data. The data dictionary should describe the data in each field, and also describe any transformations or external dependencies associated with the data.
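
A data dictionary can be started from the column names of the data file itself. The Python sketch below is a minimal example that assumes a CSV file whose first row holds the field names. The function name and the template layout are illustrative, not an established standard.

```python
import csv


def write_data_dictionary(csv_path, dictionary_path):
    """Write a plain-text template with one entry per column of csv_path."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # The first row is assumed to contain the field names.
        header = next(csv.reader(f), [])
    with open(dictionary_path, "w", encoding="utf-8") as out:
        out.write(f"Data dictionary for {csv_path}\n\n")
        for column in header:
            out.write(f"Field: {column}\n")
            out.write("  Description: TODO\n")
            out.write("  Units or allowed values: TODO\n")
            out.write("  Transformations or dependencies: TODO\n\n")
    return header
```

The TODO entries are placeholders to be filled in by hand. The script only guarantees that no field is left out of the dictionary.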

If your data is complex, with many different files and file types, organize them in a well-structured hierarchical directory system. Each directory should have a README.txt file that describes the files and data located inside. If your field has established metadata schemas for data, use them.
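
To check that every directory has its README.txt, something like the following Python sketch could be run over a project. The function name is hypothetical, and it assumes the file is named exactly README.txt unless told otherwise.

```python
from pathlib import Path


def directories_missing_readme(root, readme_name="README.txt"):
    """Return the directories under root (including root) that lack a README."""
    root = Path(root)
    candidates = [root] + [p for p in root.rglob("*") if p.is_dir()]
    # Paths are reported relative to root; root itself appears as ".".
    return sorted(
        d.relative_to(root).as_posix()
        for d in candidates
        if not (d / readme_name).is_file()
    )
```

An empty list means every directory in the hierarchy is documented.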

If you are unsure about the existence of established metadata standards in your field, just contact us and we'll find out for you. We can also assist you in creating your metadata document(s).

Data Acquisition, Integrity, and Quality

Acquiring your data and maintaining its quality and integrity is probably one of your main concerns. How you do that will depend on the hardware and technology at your disposal. Most of us have learned these lessons the hard way, but for those lucky few who haven't, here is what can happen. When considering how to store and maintain your data, you should consider the following issues.

Potential Problems

Corruption

Your data can be corrupted in any number of ways. Probably the most common reason is user error. For instance, sorting an Excel file incorrectly and then saving it can result in useless data. Hardware failures can result in unreadable disks. If your data is accessible over a network, your data can be maliciously corrupted. Sometimes, the corruption can be as simple as wishing you hadn't made some changes to a document, but realizing it too late, after having saved over the version you preferred.

Many errors are introduced during the acquisition of data as well. Hand-entered data can be highly error prone, as well as subject to "creative input" when there is an interest in the outcome. Wherever possible, use automated data entry methods. If you must enter data by hand, consider having a second or third person inspect the data. Also, creating web-based data entry forms can allow you to validate the data as it's entered. If you need help creating these forms, feel free to contact us.
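
The kind of validation a data entry form performs can be sketched in a few lines of Python. The field names (site, date, length_mm) and the plausible range used here are assumptions made up for this example. Your own rules would come from your study design.

```python
from datetime import datetime


def validate_record(record):
    """Return a list of problems found in one hand-entered record.

    The record is a dict of strings; an empty list means no problems.
    """
    problems = []
    if not record.get("site", "").strip():
        problems.append("site is required")
    try:
        datetime.strptime(record.get("date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("date must be a valid date such as 2012-09-01")
    try:
        length = float(record.get("length_mm", ""))
        # The 0-1000 mm range is a hypothetical plausibility check.
        if not 0 < length < 1000:
            problems.append("length_mm is outside the plausible range")
    except ValueError:
        problems.append("length_mm must be a number")
    return problems
```

Rejecting a record at entry time, while the person entering it still has the source in front of them, is usually much cheaper than finding the error during analysis.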

Incompatibility

If you are collaborating on a project, accidental corruption of data and other documents is very easy. When passing documents over email, or storing multiple versions of documents on different machines, it is all too easy to edit an obsolete document or to end up with documents that are difficult, if not impossible, to merge.

Sometimes when working with others, you introduce incompatibility by your choice of file types. For instance, using the most recent version of Excel when others only have an older version can make your files unreadable to them. Also, using rare and obscure file types can limit accessibility. Try to use only widely supported and, ideally, open formats when sharing data.

Loss

Data loss, like corruption, can happen in any number of ways and, like corruption, is probably most often the result of user error. Files are accidentally deleted, or data from a database is accidentally removed. Data can also be lost due to faulty hardware or malicious activity.

Exposure

Many studies gather or use data that contains private information about people, references to ecologically or culturally sensitive places, or information with economic value, such as material related to patents. Just as you wouldn't want someone accessing your personal computer to get information such as credit card numbers or social security numbers, you may not want your data to be exposed to others.

Solutions

To avoid the problems described above, you should consider the strategies below: redundancy, backup, access control and version control. In fact, these strategies are so basic that you should use them for all your digital assets, not just those associated with a particular project.

Redundancy

Redundancy is the first layer of protection. For instance, using RAID (Redundant Array of Independent Disks) on your computer can help protect you from hard drive failures. In a mirrored configuration, RAID copies your data onto one or more additional disks, so that if one disk fails, you still have the data on at least one other. If a disk fails, you just replace it and the software rebuilds your array. However, this does not prevent corruption or loss of data by user error, as such changes are immediately mirrored on all the disks. RAID is one example of redundancy; depending on your project and infrastructure, there may be other places to implement redundant hardware.

Backup

Most people are familiar with backing up their data. This can take many forms, often writing changes to a second hard drive, USB drive or optical disk. The gold standard for backup, however, is incremental backup performed automatically, at regular intervals, to a different location than your primary storage.

By using a different location, you minimize the probability of both storage locations being destroyed by a single disaster such as a flood. Automation ensures the backup happens even if you forget to do it. Incremental backup keeps snapshots of your drive at different times, so that you can roll back to the last known valid state. Otherwise, it's possible to discover problems only after your drive was backed up and the errors were written to the backup storage.
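
The idea of keeping snapshots from different times can be illustrated with a short Python sketch. Note the simplification: this version makes a full copy each time, whereas dedicated incremental backup tools store only what changed since the previous snapshot. The function names are hypothetical, and backup_root is assumed to be on a different drive or machine from the source and not inside the source directory.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot(source_dir, backup_root):
    """Copy source_dir into a new timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    destination = Path(backup_root) / f"{Path(source_dir).name}_{stamp}"
    # copytree raises FileExistsError if a snapshot with this stamp exists,
    # so an earlier snapshot is never silently overwritten.
    shutil.copytree(source_dir, destination)
    return destination


def list_snapshots(backup_root):
    """Return snapshot folders sorted by name (oldest first for one project)."""
    return sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
```

Running snapshot from a scheduled task provides the automation, and list_snapshots shows which earlier states are available to roll back to.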

Access Control

If your computer is not password protected, then it should be. Ideally, all the files for your project should be stored on a central computer with controlled access over a network. Even if your data can be viewed by the public, the ability to add, modify or delete data should be limited.

Passwords should be complex and safe. If you have many passwords to remember, consider using a password keeping program such as KeePass to store and create passwords. Then you only need remember one password, the one for your password keeper. Remember, most security breaches are the result of people using easy passwords or giving away their password, not due to brute force hacking.
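
Password managers generate complex passwords for you, but the idea is simple enough to sketch in Python with the standard secrets module, which is designed for security-sensitive random choices. The function name and the default length of 20 characters are assumptions for this example.

```python
import secrets
import string


def generate_password(length=20):
    """Return a random password drawn from letters, digits and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    # secrets.choice uses a cryptographically strong random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

A password produced this way is not meant to be memorized. Store it in your password keeper.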

If keeping your data private is truly a concern, also consider encrypting your data to add another, redundant layer of security.

Version Control

We have probably all wrestled with the issue of keeping track of the most recent version of a document, while still keeping a record of previous versions. If you are working on a collaborative team, this problem becomes even more of an issue. Often multiple copies of a document are passed around, edited and passed around again, and eventually it becomes impossible to know which document is the "correct" one. There are a number of strategies to alleviate this problem.

You could enforce naming conventions, for instance naming documents with the date, or the date and time if needed. This works OK with one or two people working on a document, but it requires that you remember to do it. You can also end up with a large number of files, and it doesn't prevent two people from working on a file at the same time.
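
A date-based naming convention is easier to keep consistent when the names are generated rather than typed. The Python sketch below shows one possible convention. The function name and the year-month-day layout are assumptions, chosen because names in that order sort chronologically in a file listing.

```python
from datetime import datetime


def dated_filename(stem, extension, when=None, include_time=False):
    """Build a name such as stem_2012-09-14.ext, optionally with the time."""
    when = when or datetime.now()
    fmt = "%Y-%m-%d_%H%M" if include_time else "%Y-%m-%d"
    return f"{stem}_{when.strftime(fmt)}.{extension}"
```

For example, dated_filename("guppy_measurements", "csv", datetime(2012, 9, 14)) returns "guppy_measurements_2012-09-14.csv".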

Software developers are probably familiar with version control software such as Git and Subversion (SVN). These tools allow you to keep track of versions of files. They also allow you to create branches of a project and then merge them, documenting who made what changes and when. They also function as an additional backup. Although they were developed for computer code, they can track any kind of file, although they work best with plain-text formats. If you wish to learn more, GitHub.com is an excellent service that offers great documentation and is either free or inexpensive depending on your level of needs.

Again, if you have any questions or concerns about this, please feel free to contact us.

Intellectual Property Rights Management

It is becoming increasingly recognized that data is most powerful when it is made available for others to use. Funding agencies are starting to require that researchers make their data available to others. That doesn't mean you don't have rights regarding your data. As long as you are within these requirements, you are free to decide how people use your data. Some important considerations are:

  • Can people make commercial use of your data?
  • Can people use your data with or without citing you?
  • Would you like an embargo period on your data?

The legal aspects of copyrighting data are not clear cut. Most people consider numerical or tabular data to not be under copyright. However, charts, figures, images, videos, and other forms of data can be. In either case, you may attach a license to your data. Licensing your data not only may protect you legally but, perhaps more importantly, tells others how you wish your data to be used. There are a number of online resources for licensing listed below. Additionally, many products of research are patentable. If you intend to seek a patent, you should describe how this will affect use of your data.

Creative Commons provides a licensing tool designed to help you select the correct license for your needs. It is well known and is used by Flickr, for instance. It is the most flexible of the options listed here.

The Open Data Commons provides tools for making data open, and may be more applicable to databases and tabular data.

There are a number of licensing schemes for software. This would also include scripts for statistical packages.

The GNU General Public License is perhaps the most well known. This license is somewhat restrictive in that it is a "copyleft" license, requiring anyone who distributes derivatives of your software to release them under the same GNU license.

The Apache License Version 2.0 is sometimes considered a better, more modern option, giving others more latitude in how your software is used.

The BSD Open Source License is another popular alternative, favored because of its simplicity.

The important issue here is not only to protect your legal rights with your data, but also to provide guidance to others who wish to use your data.