UNF Version 3

Version 3 of the UNF algorithm was used by the Dataverse Network software prior to version 2.0, and was implemented in R code. This algorithm was used on digital objects containing vectors of numbers, vectors of character strings, data sets comprising such vectors, and studies comprising one or more such data sets.

The UNF V3 algorithm applied to the content of a data set or study is as follows:

1. Round each element in a numeric vector to k significant digits using the IEEE 754 round-towards-zero rounding mode. The default value of k is seven, the maximum expressible in single-precision floating-point calculations. The UNF calculation for vectors of character strings is identical, except that each string is truncated to k characters and the default value of k is 128.

2. Convert each vector element to a character string in exponential notation, omitting noninformational zeros. If an element is missing, represent it as a string of three null characters. If an element is an IEEE 754, nonfinite, floating-point special value, represent it as the signed, lowercase, IEEE minimal printable equivalent (that is, +inf, -inf, or +nan).

Each character string comprises the following:

A sign character.

A single leading digit.

A decimal point.

Up to k-1 digits following the decimal, consisting of the remaining k-1 digits of the number, omitting trailing zeros.

A lowercase letter "e".

A sign character.

The digits of the exponent, omitting trailing zeros.

For example, the number pi at five digits is represented as +3.1415e+, and the number 300 is represented as the string +3.e+2.

3. Terminate character strings representing nonmissing values with a POSIX end-of-line character.

4. Encode each character string with Unicode bit encoding. Versions 3 through 4 use UTF-32BE; Version 4.1 uses UTF-8.

5. Combine the vector of character strings into a single sequence, with each character string separated by a POSIX end-of-line character and a null byte.

6. Compute a hash on the resulting sequence using the standard MD5 hashing algorithm for Version 3 and using SHA256 for Version 4. The resulting hash is base64 encoded to support readability.

7. Calculate the UNF for each lower-level data object, using a consistent UNF version and level of precision across the individual UNFs being combined.

8. Sort the base64 representation of UNFs in POSIX locale sort order.

9. Apply the UNF algorithm to the resulting vector of character strings using k at least as large as the length of the underlying character string.

10. Combine UNFs from multiple variables to form a single UNF for an entire data frame, and then combine UNFs for a set of data frames to form a single UNF that represents an entire research study.
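The numeric part of the Version 3 steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original R implementation: the function names normalize_number and unf_v3 are hypothetical, zero is assumed to be written as "+0.e+", a zero exponent is assumed to be written with no digits (as in the pi example), and each nonmissing string is assumed to be followed by an end-of-line character and then a null byte. The steps as written do not pin down every byte-level detail, so the output may not match UNFs produced by the Dataverse Network software.

```python
import base64
import hashlib
import math
from decimal import Decimal, ROUND_DOWN


def normalize_number(x, k=7):
    """Return the exponential-notation string for x at k significant digits."""
    if x is None:
        return "\0\0\0"  # missing value: three null characters
    x = float(x)
    if math.isnan(x):
        return "+nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    sign = "-" if math.copysign(1.0, x) < 0 else "+"
    d = abs(Decimal(repr(x)))
    if d == 0:
        return sign + "0.e+"  # assumption: zero keeps its single leading digit
    # Keep k significant digits and drop the rest (round towards zero).
    quantum = Decimal(1).scaleb(d.adjusted() - k + 1)
    d = d.quantize(quantum, rounding=ROUND_DOWN)
    digits = "".join(map(str, d.as_tuple().digits)).rstrip("0")
    exponent = d.adjusted()
    exp_sign = "-" if exponent < 0 else "+"
    # Assumption: a zero exponent is written with no digits, as in "+3.1415e+".
    exp_digits = str(abs(exponent)) if exponent else ""
    return f"{sign}{digits[0]}.{digits[1:]}e{exp_sign}{exp_digits}"


def unf_v3(values, k=7):
    """Hash a numeric vector: UTF-32BE bytes, MD5 digest, base64 text."""
    parts = []
    for v in values:
        s = normalize_number(v, k)
        if v is not None:
            s += "\n\0"  # end-of-line terminator plus null-byte separator
        parts.append(s)
    data = "".join(parts).encode("utf-32-be")
    digest = hashlib.md5(data).digest()
    return "UNF:3:" + base64.b64encode(digest).decode("ascii")


print(normalize_number(math.pi, 5))  # +3.1415e+
print(normalize_number(300))         # +3.e+2
print(unf_v3([1.0, 2.5, None]))
```

Using the decimal module keeps the truncation in base 10, which avoids binary floating-point artifacts when digits are dropped.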

Learn more: Software for computing UNFs is available in an R Module, which includes a Windows standalone tool and code for Stata and SAS languages. Also see the following for more details: Micah Altman and Gary King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib Magazine, Vol. 13, No. 3/4 (March).

Data Citation

The Dataverse Network Project standardizes the citation of data sets. Until this Project, citations of data were inconsistent or nonexistent in many publications, with future access and scholarly recognition highly uncertain. When you create a study in the Dataverse Network, the citation is generated and presented automatically.

At IQSS, we continually work on a number of projects to facilitate and promote the use of data citation. We recently hosted a workshop on Data Citation Principles where we had a lively and rewarding discussion to generate guidelines for data citation. Find more information on the workshop here.

To learn more about data citation, see publications related to this topic and our recent workshop on the topic.

The Standard

The citation standard defined here offers proper recognition to authors as well as permanent identification through the use of global, persistent identifiers in place of URLs, which can change frequently. Use of universal numerical fingerprints (UNFs) guarantees to the scholarly community that future researchers will be able to verify that data retrieved is identical to that used in a publication decades earlier, even if it has changed storage media, operating systems, hardware, and statistical program format.

Following is an authentic example of a replication data-set citation (from International Studies Quarterly, King and Zeng, 2007, p. 209):

Gary King; Langche Zeng, 2006, "Replication Data Set for 'When Can History be Our Guide? The Pitfalls of Counterfactual Inference'" hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== Murray Research Archive [distributor]

This citation has six components. Four are readable by humans: the author(s), title, year, and distributor (only the title is required, but authors and year are strongly recommended). Two components are machine-readable: the unique global identifier and the universal numerical fingerprint. The unique global identifier begins with "hdl" (this refers to the international HANDLE.NET system). This identifier is designed to persist even if URLs--or the web itself--are replaced with something else. When the citation appears online, the identifier is hot-linked to the URL that references the identifier, which works in browsers available today. In print, the URL is also included in the citation.
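As an illustration of the two machine-readable components, the following Python sketch pulls the handle and the UNF out of a citation string. The regular expressions and the names CitationIds and extract_ids are hypothetical; they assume the "hdl:" and "UNF:version:hash" layout seen in the example citation above and are not an official parser.

```python
import base64
import re
from typing import NamedTuple, Optional


class CitationIds(NamedTuple):
    handle: Optional[str]
    unf_version: Optional[str]
    unf_hash: Optional[str]


# Assumed layouts, taken from the example citation only.
HANDLE_RE = re.compile(r"\bhdl:(\S+)")
UNF_RE = re.compile(r"\bUNF:([0-9.]+):([A-Za-z0-9+/]+={0,2})")


def extract_ids(citation: str) -> CitationIds:
    """Return the handle and UNF parts found in a citation, or None for each."""
    handle = HANDLE_RE.search(citation)
    unf = UNF_RE.search(citation)
    return CitationIds(
        handle.group(1) if handle else None,
        unf.group(1) if unf else None,
        unf.group(2) if unf else None,
    )


citation = (
    "Gary King; Langche Zeng, 2006, "
    "\"Replication Data Set for 'When Can History be Our Guide? "
    "The Pitfalls of Counterfactual Inference'\" "
    "hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== "
    "Murray Research Archive [distributor]"
)

ids = extract_ids(citation)
print(ids.handle)       # 1902.1/DXRXCFAWPK
print(ids.unf_version)  # 3
print(len(base64.b64decode(ids.unf_hash)) * 8)  # 128
```

The decoded hash is 128 bits long, which is the size of an MD5 digest.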

The universal numerical fingerprint begins with "UNF". Four features make the UNF especially useful:

1. The UNF algorithm's cryptographic technology ensures that the alphanumeric identifier will change when any portion of the data set changes. Not only does this assure future researchers that they can use the same data set referenced in a years-old journal article, it enables the data set's owner to track each iteration of the owner's research. When an original data set is updated or incorporated into a new, related data set, the algorithm generates a unique UNF each time.

2. The UNF is determined by the content of the data, not the format in which it is stored. For example, you create a data set in Stata or R. Five years later, you need to look at your data set again, but the data was converted to the next big thing (NBT). You can use NBT, recompute the UNF, and verify for certain that the data set you're downloading is the same one you created originally. That is, the UNF will not change.

3. Knowing only the UNF, journal editors can be confident that they are referencing a specific data set that never can be changed, even if they do not have permission to see the data. In a sense, the UNF is the ultimate summary statistic.

4. The UNF's noninvertible, cryptographic properties guarantee that acquiring the UNF of a data set conveys no information about the content of the data. Authors can take advantage of this property to distribute the full citation of a data set--including the UNF--even if the data is proprietary or highly confidential, all without the risk of disclosure.

Citations also can have optional features in a standard format, such as "Murray Research Archive [distributor]", which lists a network type in square brackets that is selected from a given, controlled vocabulary.

For information on how to implement the Universal Numerical Fingerprint (UNF), see "A Fingerprint Method for the Verification of Scientific Data" in the publication section.

Learn more:

Micah Altman and Gary King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib Magazine, Vol. 13, No. 3/4 (March). 

Universal Numerical Fingerprint

The Universal Numerical Fingerprint (UNF) portion of the citation standard for data sets uses a specific algorithm to compute the approximated semantic content of a digital object. This approximated content is then put into a normalized (or canonicalized) form, and a hash function is used to compute a unique fingerprint for the resulting normalized, approximated object. The resulting hash (a string of characters) is thus independent of the storage medium and format of the object.

Version 3 of the UNF algorithm was implemented, using R code, by the Project prior to implementation of Dataverse Network software version 2.0. With the release of Dataverse Network software version 2.0, UNF version 5 is implemented and uses Java code.

If a study was created in a dataverse hosted by a Dataverse Network using software prior to version 2.0, the UNF calculations for that study and all subsettable files comply with UNF version 3 standards. After the Dataverse Network on which such studies are hosted is updated to software version 2.0 or later, all new studies and subsettable files contributed to a dataverse in that Network will comply with UNF version 5 standards. If a new subsettable file is uploaded to an existing study for which the UNF was calculated using version 3 standards, the new file's UNF is calculated using version 5 of the standard and a new UNF also is calculated for the study using version 5.

Learn more: Micah Altman, Jeff Gill and Michael McDonald, 2003, Numerical Issues in Statistical Computing for the Social Scientist, New York: John Wiley. 


UNF Version 5

Version 5 of the UNF algorithm is used by the Dataverse Network software version 2.0 and later, and is implemented in Java code. As in version 3, this algorithm is used on digital objects containing vectors of numbers, vectors of character strings, data sets comprising such vectors, and studies comprising one or more such data sets. Version 5 adds normalization forms for date, time, duration, bitstring and logical values, in addition to those used for numeric and character values in version 3.

The UNF V5 algorithm applied to the content of a data set or study is as follows:

1.  Calculate a UNF for each element as follows:

Round elements in a numeric vector to k significant digits using the IEEE 754 round towards nearest, ties to even rounding mode. The default value of k is seven, the maximum expressible in single-precision floating point calculations.

Calculate the UNF for vectors of character strings as above, except that each string is truncated to k characters and the default value of k is 128.

Normalize boolean values to numeric values of either 0, 1, or missing. No rounding is applied.

Normalize bit fields by converting to big-endian form, truncating all leading empty bits, aligning to a byte boundary by re-padding with leading zero bits, and base64 encoding to form a character string representation. No rounding is applied, and missing values are represented by three null bytes.

Normalize time, date, and durations based on a single, unambiguous representation selected from the many described in the ISO 8601 standard.

Convert calendar dates to a character string of the form YYYY-MM-DD. Partial dates in the form YYYY or YYYY-MM are permitted.

Time representation is based on the ISO 8601 extended format, hh:mm:ss.fffff. When .fffff represents fractions of a second, it must contain no trailing (non-significant) zeroes, and is omitted if valued at zero. Other fractional representations, such as fractional minutes and hours, are not permitted. If the time zone of the observation is known, convert the time value to the UTC time zone and append a “Z” to the time representation.

Format elements that comprise a combined date and time by concatenating the (full) date representation, “T”, and the time representation. Do not use partial date representations for combined date and time values.

For type-specific approximation, delete the entire component of the time, date, or combined time-date in the following order: fractional seconds, seconds, minutes, hours, day, time zone indicator (if any), and month.

Represent time intervals by using two date-time values, formatted as defined previously, and separated by a solidus (“/”). Represent durations in the ISO 8601 form P[n]Y[n]M[n]DT[n]H[n]M[n]S, where each [n] represents the number of years, months, days, hours, minutes, and seconds (respectively) in the duration.

Fractional values of seconds (only) are permitted in the form of nnn.fffff. Where n=0, the “0” is required. All other leading and trailing zeroes, fractional hours and minutes, and truncated values are prohibited. Use durations only where the actual start time is not known; otherwise, use a time interval.

2.  Convert each vector element to a character string in exponential notation, omitting noninformational zeros.

If an element is missing, represent it as a string of three null characters.

If an element is an IEEE 754, nonfinite, floating-point special value, represent it as the signed, lowercase, IEEE minimal printable equivalent (that is, +inf, -inf, or +nan).

Each character string comprises the following:

A sign character.

A single leading digit.

A decimal point.

Up to k-1 digits following the decimal, consisting of the remaining k-1 digits of the number, omitting trailing zeros.

A lowercase letter "e".

A sign character.

The digits of the exponent, omitting trailing zeros.

For example, the number pi at five digits is represented as +3.1416e+ (3.14159... rounded to the nearest value at five digits), and the number 300 is represented as the string +3.e+2.

3.  Terminate character strings representing nonmissing values with a POSIX end-of-line character.

4.  Encode each character string with Unicode bit encoding. Version 5 uses UTF-8.

5.  Combine the vector of character strings into a single sequence, with each character string separated by a POSIX end-of-line character and a null byte.

6.  Compute a hash on the resulting sequence using the standard SHA256 hashing algorithm. The resulting hash is base64 encoded to support readability.

7.  Calculate the UNF for each lower-level data object, using a consistent UNF version and level of precision across the individual UNFs being combined.

8.  Sort the base64 representation of UNFs in POSIX locale sort order.

9.  Apply the UNF algorithm to the resulting vector of character strings using k at least as large as the length of the underlying character string.

10.  Combine UNFs from multiple variables to form a single UNF for an entire data frame, and then combine UNFs for a set of data frames to form a single UNF that represents an entire research study.
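The Version 5 steps for numeric, boolean, and character vectors can be sketched in Python as follows. This is one reading of the prose above and not the Dataverse Network's Java code: the names normalize, vector_hash, and combine are hypothetical, zero is assumed to be written as "+0.e+", a zero exponent is assumed to carry no digits, and each nonmissing string is assumed to be followed by an end-of-line character and a null byte. Because the steps do not fix every byte-level detail, the hashes produced here may differ from official UNF values.

```python
import base64
import hashlib
import math
from decimal import Decimal, ROUND_HALF_EVEN


def normalize(value, k=7, str_k=128):
    """Normalize one element into its terminated character string."""
    if value is None:
        return "\0\0\0"  # missing: three null characters, no terminator
    if isinstance(value, str):
        return value[:str_k] + "\n\0"  # truncate, then EOL and null byte
    x = float(value)  # booleans become 0.0 or 1.0
    if math.isnan(x):
        body = "+nan"
    elif math.isinf(x):
        body = "+inf" if x > 0 else "-inf"
    else:
        sign = "-" if math.copysign(1.0, x) < 0 else "+"
        d = abs(Decimal(repr(x)))
        if d == 0:
            body = sign + "0.e+"  # assumption: zero keeps a leading digit
        else:
            # Round to k significant digits, nearest with ties to even.
            quantum = Decimal(1).scaleb(d.adjusted() - k + 1)
            d = d.quantize(quantum, rounding=ROUND_HALF_EVEN)
            digits = "".join(map(str, d.as_tuple().digits)).rstrip("0")
            exp = d.adjusted()
            # Assumption: a zero exponent is written with no digits.
            exp_text = ("-" if exp < 0 else "+") + (str(abs(exp)) if exp else "")
            body = f"{sign}{digits[0]}.{digits[1:]}e{exp_text}"
    return body + "\n\0"


def vector_hash(values, k=7, str_k=128):
    """UTF-8 encode the normalized vector, hash with SHA-256, base64 encode."""
    data = "".join(normalize(v, k, str_k) for v in values).encode("utf-8")
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def combine(hashes):
    """Combine lower-level hashes: sort them, then hash them as strings."""
    ordered = sorted(hashes)  # code-point order equals POSIX order for ASCII
    longest = max(len(h) for h in ordered)  # k at least the string length
    return vector_hash(ordered, str_k=longest)


age = vector_hash([23, 35.5, None])
name = vector_hash(["Ann", "Bob", None])
print("UNF:5:" + combine([age, name]))
```

Sorting before combining makes the data-set fingerprint independent of the order in which the variables are stored.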

Learn more: Micah Altman. 2008. "A Fingerprint Method for Verification of Scientific Data", in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag.
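The date and time normalization rules in step 1 of the Version 5 algorithm can be illustrated with Python's datetime module. This is a hedged sketch rather than the reference implementation: the function names are hypothetical, fractional seconds are limited to Python's microsecond resolution, and a value is treated as having a known time zone only when the datetime object carries one.

```python
from datetime import datetime, timedelta, timezone


def format_time(t):
    """hh:mm:ss plus fractional seconds without trailing zeros."""
    text = t.strftime("%H:%M:%S")
    if t.microsecond:
        text += "." + f"{t.microsecond:06d}".rstrip("0")
    return text


def normalize_date(year, month=None, day=None):
    """YYYY-MM-DD, or the partial forms YYYY and YYYY-MM."""
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
        if day is not None:
            parts.append(f"{day:02d}")
    return "-".join(parts)


def normalize_datetime(dt):
    """Full date, "T", then the time; UTC with "Z" when the zone is known."""
    zone_known = dt.utcoffset() is not None
    if zone_known:
        dt = dt.astimezone(timezone.utc)
    text = normalize_date(dt.year, dt.month, dt.day) + "T" + format_time(dt.time())
    return text + "Z" if zone_known else text


def normalize_interval(start, end):
    """Two date-time values separated by a solidus."""
    return normalize_datetime(start) + "/" + normalize_datetime(end)


eastern = timezone(timedelta(hours=-5))  # example offset, chosen for illustration
stamp = datetime(2007, 3, 1, 14, 30, 0, 500000, tzinfo=eastern)
print(normalize_datetime(stamp))                        # 2007-03-01T19:30:00.5Z
print(normalize_datetime(datetime(2007, 3, 1, 9, 5)))   # 2007-03-01T09:05:00
print(normalize_date(2007, 3))                          # 2007-03
```

Converting to UTC before formatting means that two observations of the same instant recorded in different zones normalize to the same string.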