UNF Version 5

Version 5 of the UNF algorithm is used by the Dataverse Network software version 2.0 and later, and is implemented in Java code. As in version 3, this algorithm is used on digital objects containing vectors of numbers, vectors of character strings, data sets comprising such vectors, and studies comprising one or more such data sets. Version 5 adds normalization forms for date, time, duration, bitstring and logical values, in addition to those used for numeric and character values in version 3.

The UNF V5 algorithm applied to the content of a data set or study is as follows:

1.  Calculate a UNF for each element as follows:

• Round elements in a numeric vector to k significant digits using the IEEE 754 round towards nearest, ties to even rounding mode. The default value of k is seven, the maximum expressible in single-precision floating point calculations.

• Calculate the UNF for vectors of character strings as above, except truncate to k characters and the default value of k is 128.

• Normalize boolean values to numeric values of either 0, 1, or missing. No rounding is applied.

• Normalize bit fields by converting to big-endian form, truncating all leading empty bits, aligning to a byte boundary by re-padding with leading zero bits, and base64 encoding to form a character string representation. No rounding is applied, and missing values are represented by three null bytes.

• Normalize time, date, and durations based on a single, unambiguous representation selected from the many described in the ISO 8601 standard.

• Convert calendar dates to a character string of the form YYYY-MM-DD. Partial dates in the form YYYY or YYYY-MM are permitted.

Time representation is based on the ISO 8601 extended format, hh:mm:ss.fffff. When .fffff represents fractions of a second, it must contain no trailing (non-significant) zeroes, and is omitted if valued at zero. Other fractional representations, such as fractional minutes and hours, are not permitted. If the time zone of the observation is known, convert the time value to the UTC time zone and append a ”Z” to the time representation.

Format elements that comprise a combined date and time by concatenating the (full) date representation, “T”, and the time representation. Do not use partial date representations for combined date and time values.

For type-specific approximation, delete the entire component of the time, date, or combined time-date in the following order: fractional seconds, seconds, minutes, hours, day, time zone indicator (if any), and month.

Represent durations by using two date-time values, formatted as defined previously, and separated by a solidus (“/”), where each [n] represents the number of years, months, dates, hours, minutes, and seconds (respectively) in the duration.

Fractional values of seconds (only) are permitted in the form of nnn.fffff. Where n=0, the “0” is required. All other leading and trailing zeroes, fractional hours and minutes, and truncated values are prohibited. Use durations only where the actual start time is not known, otherwise use a time interval must be used.

2.  Convert each vector element to a character string in exponential notation, omitting noninformational zeros.

If an element is missing, represent it as a string of three null characters.

If an element is an IEEE 754, nonfinite, floating-point special value, represent it as the signed, lowercase, IEEE minimal printable equivalent (that is, +inf, -inf, or +nan).

Each character string comprises the following:

• A sign character.

• A single leading digit.

• A decimal point.

• Up to k-1 digits following the decimal, consisting of the remaining k-1 digits of the number, omitting trailing zeros.

• A lowercase letter "e."

• A sign character.

• The digits of the exponent, omitting trailing zeros.

For example, the number pi at five digits is represented as -3.1415e+, and the number 300 is represented as the string +3.e+2.

3.  Terminate character strings representing nonmissing values with a POSIX end-of-line character.

4.  Encode each character string with Unicode bit encoding. Version 5 uses UTF-8.

5.  Combine the vector of character strings into a single sequence, with each character string separated by a POSIX end-of-line character and a null byte.

6.  Compute a hash on the resulting sequence using the standard SHA256 hashing algorithm. The resulting hash is base64 encoded to support readability.

7.  Calculate the UNF for each lower-level data object, using a consistent UNF version and level of precision across the individual UNFs being combined.

8.  Sort the base64 representation of UNFs in POSIX locale sort order.

9.  Apply the UNF algorithm to the resulting vector of character strings using k at least as large as the length of the underlying character string.

10.  Combine UNFs from multiple variables to form a single UNF for an entire data frame, and then combine UNFs for a set of data frames to form a single UNF that represents an entire research study.

Learn more:
Micah Altman. 2008. "A Fingerprint Method for Verification of Scientific Data", in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. (Web site)