Data Citations

The Dataverse Network Project standardizes the citation of data sets. Until this Project, citations of data were inconsistent or nonexistent in many publications, with future access and scholarly recognition highly uncertain. When you contribute a study to the Dataverse Network, the citation is calculated and presented automatically.

See the following for detailed information about how the Project implements citations:

The Standard

The citation standard defined here offers proper recognition to authors as well as permanent identification through the use of global, persistent identifiers in place of URLs, which can change frequently. Use of universal numerical fingerprints (UNFs) guarantees to the scholarly community that future researchers will be able to verify that data retrieved is identical to that used in a publication decades earlier, even if it has changed storage media, operating systems, hardware, and statistical program format.

The following is an authentic example of a replication data-set citation (from King and Zeng, 2007, International Studies Quarterly, p. 209):

Gary King; Langche Zeng, 2006, "Replication Data Set for 'When Can History be Our Guide? The Pitfalls of Counterfactual Inference'" hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== Murray Research Archive [distributor]

This citation has six components. Three are readable by humans: the author, title, and year. Two components are machine-readable, and one is optional. Of the machine-readable components of this citation, the unique global identifier begins with "hdl" (this refers to the international handle system) and is designed to persist even if URLs--or the web itself--are replaced with something else. The universal numerical fingerprint begins with "UNF". When the citation appears online, the global identifier is hot-linked to the URL that resolves it, which works in browsers available today. In print, the URL is also included in the citation.
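As an illustration of what "machine-readable" means here, the two identifiers can be pulled out of the example citation with simple pattern matching. This is a minimal sketch: the function name and the regular expressions are hypothetical and are not part of the citation standard, and the use of the public hdl.handle.net proxy to build a link is an assumption about how the handle would be resolved.

```python
import re

CITATION = (
    "Gary King; Langche Zeng, 2006, \"Replication Data Set for 'When Can "
    "History be Our Guide? The Pitfalls of Counterfactual Inference'\" "
    "hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== "
    "Murray Research Archive [distributor]"
)

# Hypothetical patterns: "hdl:" followed by the handle, and
# "UNF:<version>:<base64 hash>".
HANDLE_RE = re.compile(r"\bhdl:(\S+)")
UNF_RE = re.compile(r"\bUNF:(\d+(?:\.\d+)?):([A-Za-z0-9+/]+={0,2})")


def machine_readable_parts(citation):
    """Return the handle and UNF found in a citation string, if any."""
    handle = HANDLE_RE.search(citation)
    unf = UNF_RE.search(citation)
    return {
        "handle": handle.group(1) if handle else None,
        # Assumption: the handle is resolved through the hdl.handle.net proxy.
        "handle_url": "https://hdl.handle.net/" + handle.group(1) if handle else None,
        "unf_version": unf.group(1) if unf else None,
        "unf_hash": unf.group(2) if unf else None,
    }


parts = machine_readable_parts(CITATION)
print(parts["handle"])       # 1902.1/DXRXCFAWPK
print(parts["unf_version"])  # 3
print(parts["unf_hash"])     # DaYlT6QSX9r0D50ye+tXpA==
```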

Four features make the UNF especially useful:

Citations also can have optional features in a standard format, such as "Murray Research Archive [distributor]", which lists a network type in square brackets that is selected from a given, controlled vocabulary.

Learn more: Micah Altman and Gary King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib Magazine, Vol. 13, No. 3/4 (March).

Technical Details

The UNF portion of the citation standard for data sets uses a specific algorithm to compute the approximated semantic content of a digital object. This approximated content is then put into a normalized (or canonicalized) form, and a hash function is used to compute a unique fingerprint for the resulting normalized, approximated object. The resulting hash (a string of characters) is thus independent of the storage medium and format of the object.
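The idea of approximating, normalizing, and then hashing can be illustrated with a toy fingerprint. The sketch below is not the UNF algorithm: the function name is hypothetical, the rounding and string format are simplified, and SHA-256 with a hexadecimal digest is chosen only for convenience. It shows that the same values read from a text format and from a binary format yield the same fingerprint.

```python
import hashlib
import struct


def toy_fingerprint(values, digits=7):
    """Toy illustration only: approximate, canonicalize, then hash."""
    # Approximate each value to a fixed number of significant digits and
    # write it in one canonical textual form.
    canonical = "\n".join(f"{v:.{digits - 1}e}" for v in values)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same three values stored in two different formats.
as_text = "1.5,2.25,1000000.0"
as_binary = struct.pack(">3d", 1.5, 2.25, 1e6)

from_text = [float(field) for field in as_text.split(",")]
from_binary = list(struct.unpack(">3d", as_binary))

# The fingerprint depends on the values, not on how they were stored.
print(toy_fingerprint(from_text) == toy_fingerprint(from_binary))  # True
# Differences beyond the chosen precision are ignored ...
print(toy_fingerprint([1.5]) == toy_fingerprint([1.50000000001]))  # True
# ... while differences within it change the fingerprint.
print(toy_fingerprint([1.5]) == toy_fingerprint([1.6]))            # False
```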

Version 3 of the UNF algorithm is currently used by the Dataverse Network Project. This algorithm can be used on digital objects containing vectors of numbers, vectors of character strings, data sets comprising such vectors, and studies comprising one or more such data sets. Version 4 offers better security, at the cost of a longer UNF, because it hashes with SHA256 rather than MD5.

The UNF algorithm applied to the content of a data set or study is as follows:

  1. Round each element in a numeric vector to k significant digits using the IEEE 754 round towards zero rounding mode. The default value of k is seven, the maximum expressible in single-precision floating point calculations.
    UNF calculation for vectors of character strings is identical, except that you truncate to k characters and the default value of k is 128.
  2. Convert each vector element to a character string in exponential notation, omitting noninformational zeros.
    If an element is missing, represent it as a string of three null characters.

    If an element is an IEEE 754 nonfinite floating-point special value, represent it as the signed, lowercase, IEEE minimal printable equivalent (that is, +inf, -inf, or +nan).

    Each character string comprises the following:

    • A sign character.
    • A single leading digit.
    • A decimal point.
    • Up to k-1 digits following the decimal, consisting of the remaining k-1 digits of the number, omitting trailing zeros.
    • A lowercase letter "e".
    • A sign character.
    • The digits of the exponent, omitting trailing zeros.

    For example, the number pi at five digits is represented as +3.1415e+ (an exponent of zero leaves no exponent digits), and the number 300 is represented as the string +3.e+2.

  3. Terminate character strings representing nonmissing values with a POSIX end-of-line character.
  4. Encode each character string with Unicode bit encoding. Versions 3 through 4 use UTF-32BE; Version 4.1 uses UTF-8.
  5. Combine the vector of character strings into a single sequence, with each character string separated by a POSIX end-of-line character and a null byte.
  6. Compute a hash on the resulting sequence using the standard MD5 hashing algorithm for Version 3 and using SHA256 for Version 4.
    The resulting hash is base64 encoded to support readability.
  7. Calculate the UNF for each lower-level data object, using a consistent UNF version and level of precision across the individual UNFs being combined.
  8. Sort the base64 representation of UNFs in POSIX locale sort order.
  9. Apply the UNF algorithm to the resulting vector of character strings using k at least as large as the length of the underlying character string.
  10. Combine UNFs from multiple variables to form a single UNF for an entire data frame, and then combine UNFs for a set of data frames to form a single UNF that represents an entire research study.
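The steps above can be sketched in Python for Version 3. This is a minimal sketch under stated assumptions, not the Project's implementation, and it has not been checked against UNFs produced by the official tools. All function names are hypothetical. The sketch assumes that each number is parsed from Python's shortest decimal representation before truncation, that an exponent of zero is written with no digits (as in the pi example), that leading zeros are dropped from the exponent, and that steps 3 and 5 together mean each nonmissing element is followed by an end-of-line character and a null byte.

```python
import base64
import hashlib
import math
from decimal import Context, ROUND_DOWN

MISSING = "\0\0\0"  # step 2: a missing element is three null characters


def normalize_number(x, k=7):
    """Steps 1-2: truncate to k significant digits and format the result."""
    if x is None:
        return None
    if math.isnan(x):
        return "+nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    # ROUND_DOWN truncates toward zero. Assumption: start from str(x).
    d = Context(prec=k, rounding=ROUND_DOWN).create_decimal(str(x))
    sign, digit_tuple, exp10 = d.as_tuple()
    digits = "".join(map(str, digit_tuple))
    exponent = 0 if d.is_zero() else exp10 + len(digits) - 1
    mantissa = digits[0] + "." + digits[1:].rstrip("0")
    exp_digits = str(abs(exponent)) if exponent else ""  # assumption: zero -> ""
    exp_sign = "-" if exponent < 0 else "+"
    return f"{'-' if sign else '+'}{mantissa}e{exp_sign}{exp_digits}"


def normalize_string(s, k=128):
    """Character strings are truncated to k characters."""
    return None if s is None else s[:k]


def hash_normalized(items):
    """Steps 3-6 for Version 3: terminate, UTF-32BE, MD5, base64."""
    # Assumption: each nonmissing element is followed by "\n" and a null byte.
    sequence = "".join(MISSING if s is None else s + "\n\0" for s in items)
    digest = hashlib.md5(sequence.encode("utf-32-be")).digest()
    return "UNF:3:" + base64.b64encode(digest).decode("ascii")


def unf_numeric(values, k=7):
    """UNF of one numeric vector."""
    return hash_normalized(normalize_number(v, k) for v in values)


def combine_unfs(unfs):
    """Steps 7-9: sort the base64 parts, then fingerprint them as strings."""
    hashes = sorted(u.split(":", 2)[2] for u in unfs)
    k = max(len(h) for h in hashes)  # k at least as long as each string
    return hash_normalized(normalize_string(h, k) for h in hashes)


print(normalize_number(math.pi, k=5))  # +3.1415e+
print(normalize_number(300))           # +3.e+2

# Two hypothetical variables combined into a data-frame fingerprint.
age = unf_numeric([21, 35, None, 47.5])
income = unf_numeric([1e4, 2.5e4, float("nan"), 3e4])
frame = combine_unfs([age, income])
```

Because the per-variable UNFs are sorted before they are combined, the data-frame fingerprint in this sketch does not depend on the order in which the variables are supplied.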

Learn more: Software for computing UNFs is available in an R Module, which includes a Windows standalone tool and code for Stata and SAS languages. See also Micah Altman and Gary King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib Magazine, Vol. 13, No. 3/4 (March); Micah Altman, Jeff Gill, and Michael McDonald. 2003. Numerical Issues in Statistical Computing for the Social Scientist. New York: John Wiley; and Micah Altman, Jeff Gill, and Michael McDonald. 2006. "R Modules for Accurate and Reliable Computing," UseR! 2006.