Subset, Analyze, and Download Data Sets

Subsettable data files can be subset and analyzed online by using the Dataverse Network application. For analysis, the Dataverse Network offers a user interface to Zelig, a powerful, R-based statistical computing tool. A comprehensive set of Statistical Analysis Models is provided.

After you find the data set that you want, access the Subset and Analysis options to use the online tools. Then, you can subset data by variables or observations, translate it into a convenient format, download subsets, and apply statistics and analysis.

Review the Data Recode and Subset Tips before you start.

Statistical Analysis Models

You can apply any of the following advanced statistical models to all or some variables in a data set:

  • Categorical data analysis: Cross tabulation
  • Event count models, for event count dependent variables:
    • Negative binomial regression
    • Poisson regression
  • Models for continuous bounded dependent variables:
    • Exponential regression for duration
    • Gamma regression for continuous positives
    • Log-normal regression for duration
    • Weibull regression for duration
  • Models for continuous dependent variables:
    • Analysis of variance
    • Least squares regression
    • Linear regression for left-censoreds
  • Models for dichotomous dependent variables:
    • Logistic regression for binaries
    • Probit regression for binaries
    • Rare events logistic regression for binaries
  • Models for ordinal dependent variables:
    • Ordinal logistic regression for ordered categoricals
    • Ordinal probit regression for ordered categoricals

Access Subset and Analysis Options

You can subset and analyze data files before you download the file or your subsets.

To access the Subset and Analysis options for a data set:

  1. Click the title of the study from which you want to analyze or download a file or subset.
  2. Click the Documentation, Data and Analysis tab for the study.
  3. In the list of study files, locate the data file that you want to download, subset, or analyze.
    You can download data sets for a file only if the file entry includes the subset icon.
  4. Click the Access Subset/Analysis link associated with the selected file.
    If prompted, check the I accept box and click Continue to accept the Terms of Use, and then click the Access Subset/Analysis link again.
    You see the Data File page listing data for the file that you choose to analyze.

Recode and Case-Subset Data

Review the Data Recode and Subset Tips before you start work with a study's files.

To recode and subset variables within a data set:

  1. In the Data File page, click the Recode and Case-Subsetting tab.
  2. On the right side of the variable list, use the Show drop-down list and select one of the following options to show variables in predefined quantities: All, 50, 20, or 10.
  3. Scroll down the screen and click the check boxes to select variables from the table of available values. When you select a variable, it is added to the Selected Variables box at the top of the tab.
    To remove a variable from this box, deselect it from the Variable Type list at the bottom of the screen.
    To select all variables, click the check box beside the column name, Variable Type.
  4. Select one variable in the Selected Variables box, and then click the right Arrow button.
    The existing name and label of the variable appear in the New Variable Name and New Variable Label boxes.
  5. In the New Variable Name field, change the variable name to a unique value that is not used in the data file.
    The new variable label is optional.
  6. In the table below the Variable Name fields, you can check one or more values to drop them from the subset, or enter new values, labels, or ranges (as a condition) as needed. Click the Add Value/Range button to create more entries in the value table.
    Note: Click the i Info button to view tips on how to use the Recode and Subset table. Also, see Data Recode and Subset Tips for more information about adding values and ranges.
  7. Click the Apply Recodes button.
    Your renamed variables appear in the Selected Variables box.
    Note: If you enter a variable name that is already in use, you see the message The variable Name you entered is found among the existing variables; enter a new variable name.
  8. Select another variable in the Selected Variables box, click the right Arrow button, and repeat the recode action.
    Repeat this process for each variable that you choose to recode.
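
Conceptually, a recode maps each old value of a variable to a new value, stores the result under a new and unique variable name, and excludes observations whose values you dropped. The following Python sketch illustrates that idea only. It is not the code that the Dataverse Network runs; the names recode, mapping, and dropped are hypothetical, and keeping unmapped values unchanged is an assumption of this sketch.

```python
def recode(rows, old_name, new_name, mapping, dropped=()):
    """Return new rows with `new_name` derived from `old_name`.

    rows    -- list of dicts, one per observation
    mapping -- dict of old value -> new value
               (assumption: unmapped values are kept unchanged)
    dropped -- old values whose observations are excluded from the subset
    """
    if rows and new_name in rows[0]:
        # Mirrors the rule that the new variable name must not already be in use.
        raise ValueError(f"variable name {new_name!r} is already in use")
    result = []
    for row in rows:
        value = row[old_name]
        if value in dropped:
            continue  # observation excluded from the subset
        new_row = dict(row)
        new_row[new_name] = mapping.get(value, value)
        result.append(new_row)
    return result


# Made-up sample data; 9 is treated here as a value to drop.
survey = [
    {"id": 1, "educ": 1},
    {"id": 2, "educ": 2},
    {"id": 3, "educ": 3},
    {"id": 4, "educ": 9},
]
recoded = recode(survey, "educ", "educ2", {1: 0, 2: 0, 3: 1}, dropped={9})
print(recoded)
```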

Continue to download a subset.

Data Recode and Subset Tips

Use the following guidelines when working with data files:

  • Recoding:
    • You must fill at least the first (new value) and last (condition) columns of the table; the second column is optional and for a new value label.
    • If the old variable you chose for recoding has information about its value-labels, you can prefill the table with these data for convenience, and then modify these prefilled data.
    • To exclude a value from your recoding scheme, click the check box in the same row.
  • Subsetting:
    • If the variable you chose for subsetting has information about its value-labels, you can prefill the table with these data for convenience.
    • To exclude a value in the last column of the table, click the check box in the same row.
    • To include a particular value or range, enter it in the last column whose header shows the name of the variable for subsetting.
  • Entering a value or range as a condition for subsetting or recoding:
    • Suppose the variable you chose for recoding is x.
      If your condition is x==3, enter 3.
      If your condition is x < -3, enter (--3.
      If your condition is x > -3, enter -3-).
      If your condition is -3 < x < 3, enter (-3, 3).
    • Use square brackets ([]) for closed ranges.
    • You can enter nonoverlapping values and ranges separated by a comma, such as 0,[7-9].
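
To make the difference between open and closed ranges concrete, the Python sketch below tests whether a value satisfies a range condition. It models only the meaning of the conditions (parentheses exclude an endpoint, square brackets include it). It does not parse the entry syntax used in the Recode and Subset table, and the function name in_range and the sample values are hypothetical.

```python
import math


def in_range(x, low=-math.inf, high=math.inf, low_closed=False, high_closed=False):
    """Return True if x lies in the range; a closed end includes its endpoint."""
    above = x >= low if low_closed else x > low
    below = x <= high if high_closed else x < high
    return above and below


values = [-5, -3, 0, 3, 7, 8, 9, 10]

# Open range, -3 < x < 3: both endpoints are excluded.
open_hits = [v for v in values if in_range(v, -3, 3)]
# Closed range, 7 <= x <= 9: both endpoints are included.
closed_hits = [v for v in values if in_range(v, 7, 9, True, True)]
# Unbounded below, x < -3.
below_hits = [v for v in values if in_range(v, high=-3)]
# Several nonoverlapping conditions: the value 0, or the closed range 7 to 9.
combined_hits = [v for v in values if v == 0 or in_range(v, 7, 9, True, True)]

print(open_hits, closed_hits, below_hits, combined_hits)
```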

Download Subsets

You can download a subset of variables within a study file. You also can recode a subset of variables and download the recoded subset, if you choose.

To download a subset of variables:

  1. In the Data File page, click the Download Subset tab.
  2. Click the radio button for the appropriate File Format in which to download the variables: Text, R Data, S plus, or Stata.
  3. On the right side of the tab, use the Show drop-down list to select the quantities of variables to list at one time: All, 50, 20, or 10.
  4. Scroll down the screen and click the check boxes to select variables from the table of available values. When you select a variable, it is added to the Selected Variables box at the top of the tab.
    To remove a variable from this box, deselect it from the Variable Type list at the bottom of the screen.
    To select all variables, click the check box beside the column name, Variable Type.
  5. Click the Download button. If prompted, check the I accept box and then click the Continue button to accept the Terms of Use. Then, click Download again.
  6. Follow your browser's prompt to open or save the data file to your computer's disk drive.
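
A variable subset is, in effect, the data file restricted to the variables that you selected. As a rough Python sketch of that idea (not the application's own code; the function name write_subset, the sample data, and the use of tab-delimited output to stand in for the Text format are all assumptions):

```python
import csv
import io


def write_subset(rows, selected, out):
    """Write only the selected variables of each row as tab-delimited text."""
    writer = csv.DictWriter(
        out,
        fieldnames=selected,
        delimiter="\t",
        extrasaction="ignore",  # silently skip variables that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data with four variables.
data = [
    {"id": 1, "age": 34, "income": 52000, "region": "NE"},
    {"id": 2, "age": 51, "income": 61000, "region": "SW"},
]
buffer = io.StringIO()
write_subset(data, ["id", "income"], buffer)
subset_text = buffer.getvalue()
print(subset_text)
```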

Apply Descriptive Statistics

When you run descriptive statistics for data, you can do any of the following with the analysis results:

  • Open the results in a new window to save or print the results.
  • Download the R workspace in which the statistics were analyzed, for replication of the analysis. See Replicate Analysis for more information.
  • View citation information for the data analyzed, and for the full data set from which you selected variables to analyze. See View Citations for more information.

To apply descriptive statistics to a data set or subset:

  1. In the Data File page, click the Descriptive Statistics tab.
  2. Click one or both of the Descriptive Statistics options: Univariate Numeric Summaries and Univariate Graphic Summaries.
  3. On the right side of the tab, use the Show drop-down list to select one of the following options to show variables in predefined quantities: All, 50, 20, or 10.
  4. Scroll down the screen and click the check boxes to select variables from the table of available values. When you select a variable, it is added to the Selected Variables box at the top of the tab.
    To remove a variable from this box, deselect it from the Variable Type list at the bottom of the screen.
    To select all variables, click the check box beside the column name, Variable Type.
  5. Click the Run Statistics button.
    If prompted, check the I accept box and then click the Continue button to accept the Terms of Use. Then, click Run Statistics again.
    You see the Dataverse Analysis page.
  6. To save or print the results, scroll to the Descriptive Statistics section and click the link Open results in a new window. You then can print or save the window contents.
    To save the analysis, scroll to the Replication section and click the button Download Workspace File.
    Review the Citation Information for the data set and for the subset which you analyzed.
  7. Click the link Back to Analysis and Subsetting to return to the previous page and continue analysis of your selected variables.

Perform Advanced Analysis

When you run advanced statistical analysis for data, you can do any of the following with the analysis results:

  • Open the results in a new window to save or print the results.
  • Download the R workspace in which the statistics were analyzed, for replication of the analysis. See Replicate Analysis for more information.
  • View citation information for the data analyzed, and for the full data set from which you selected variables to analyze. See View Citations for more information.

To run statistical models for selected variables:

  1. In the Data File page, click the Advanced Statistical Analysis tab.
  2. Scroll down the screen and click the check boxes to select variables from the table of available values. When you select a variable, it is added to the Selected Variables box at the top of the tab.
    To remove a variable from this box, deselect it from the Variable Type list at the bottom of the screen.
    To select all variables, click the check box beside the column name, Variable Type.
  3. Select a model from the Choose a Statistical Model drop-down list.
  4. Select one variable in the Selected Variables box, and then click the applicable arrow button to assign a function to that variable from within the analysis model.
    You see the name of the variable in the appropriate function box.
    Note: Some functions allow a specific type of variable only, while other functions allow multiple variable types. Types include Character, Continuous, and Discrete. If you assign an incorrect variable type to a function, you see an Incompatible type error message.
  5. Repeat the variable and function assignments until your model is complete.
  6. Select your Output and Analysis options.
  7. Click the Run Model button.
    If prompted, check the I accept box and then click the Continue button to accept the Terms of Use. Then, click Run Model again.
    You see the Dataverse Analysis page.
  8. To save or print the results, scroll to the Advanced Statistical Analysis section and click the link Open results in a new window. You then can print or save the window contents.
    To save the analysis, scroll to the Replication section and click the button Download Workspace File.
    Review the Citation Information for the data set and for the subset which you analyzed.
  9. Click the link Back to Analysis and Subsetting to return to the previous page and continue analysis of your selected variables.

View Summary Statistics

When a subsettable data file is uploaded for a study, the DVN code calculates summary statistics for each variable within that data file. On any tab of the Data File page, you can view the summary statistics for each variable in the data file. Information listed comprises the following:

  • For continuous variables, the application calculates summary statistics that are listed in the DDI schema.
  • For discrete variables, the application tabulates values and their labels as a frequency table.
    Note, however, that if the number of categories is more than 50, the values are not tabulated.
  • The UNF value for each variable is included.
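
The two kinds of summary described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the DVN code: the function names and sample data are hypothetical, and the statistics computed for continuous variables are a plausible selection rather than the exact set defined in the DDI schema. The UNF calculation is not shown.

```python
import statistics
from collections import Counter

MAX_CATEGORIES = 50  # discrete variables with more categories are not tabulated


def summarize_continuous(values):
    """Basic summary statistics for a continuous variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def summarize_discrete(values, labels=None):
    """Frequency table of (value, label, count), or None if too many categories."""
    counts = Counter(values)
    if len(counts) > MAX_CATEGORIES:
        return None  # values are not tabulated
    labels = labels or {}
    return [(value, labels.get(value, ""), n) for value, n in sorted(counts.items())]


age_summary = summarize_continuous([23, 35, 35, 41, 58])
gender_table = summarize_discrete([1, 2, 2, 1, 2], {1: "Male", 2: "Female"})
too_many = summarize_discrete(list(range(51)))  # 51 distinct categories

print(age_summary)
print(gender_table)
print(too_many)
```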

To view summary statistics for a variable:

  1. In the Data File page, click any tab.
  2. On the right side of the tab, use the Show drop-down list to select one of the following options to show variables in predefined quantities: All, 50, 20, or 10.
  3. Scroll down the page and locate a variable for which you choose to view summary statistics. Then, click the Summary Statistics icon for that variable to toggle the summary information on and off.

Replicate Analysis

You can save the R workspace in which the DVN performed an advanced analysis. When you download the workspace file, you download a zipped archive that contains four files. Together, these files enable you to recreate the subset analysis in another R environment:

  • citationFile.<identifier>.txt - The citation for the subset that you analyzed.
  • rhistoryFile.<identifier>.R - The R code used to perform the analysis.
  • tempsubsetfile.<identifier>.tab - The subset data that you analyzed.
  • tmpRWSfile.<identifier>.RData - The R object file used to perform the analysis.

To download this workspace for your analysis:

  1. For any subset, Apply Descriptive Statistics or Perform Advanced Analysis.
  2. On the Dataverse Analysis or Advanced Statistical Analysis page, scroll to the Replication section and click the button Download Workspace File.
  3. Follow any browser prompts to save the zipped archive.
    When the archive file is saved to your local storage, extract the contents to use the four files that compose the R workspace.
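
If you prefer to unpack the archive with a script rather than a desktop tool, a minimal Python sketch might look like the following. The function name extract_workspace, the archive name workspace.zip, and the identifier 12345 are hypothetical; the demonstration builds a stand-in archive with placeholder contents so that the example runs without a real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_workspace(archive_path, dest_dir):
    """Extract a downloaded workspace archive and return the sorted file names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


# Stand-in archive: "12345" is a made-up identifier and the file contents
# are placeholders, not real DVN output.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "workspace.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        for name in (
            "citationFile.12345.txt",
            "rhistoryFile.12345.R",
            "tempsubsetfile.12345.tab",
            "tmpRWSfile.12345.RData",
        ):
            archive.writestr(name, "placeholder")
    extracted = extract_workspace(archive_path, Path(tmp) / "workspace")
    print(extracted)
```

With a real download, pass the path of the saved archive and a destination folder of your choice to extract_workspace.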