Learn more about the Data


About the Data

The All of Us Research Program offers a structured, tiered-data access model to accommodate various levels of data sensitivity and user requirements. This model is designed to promote inclusivity and transparency while safeguarding participant privacy and ensuring data security. For more details, please read the Precision Medicine Initiative Data Security Policy Principles and Framework and the Privacy and Trust Principles.

Read more about the All of Us Program and how data was collected.


Data Sources

All of Us integrates data from various sources, including surveys, electronic health records (EHRs), biosamples, physical measurements, and wearables such as Fitbit. Read more about the data included in All of Us datasets here.

View the data roadmap for more information on data that has been included and data that is planned for inclusion.

Data Curation Process

The All of Us Research Program uses the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) to standardize EHR data for all researchers. Read more about the OMOP CDM and about the Data Curation Process (1) (2).

Here are some tutorials to understand OHDSI Standardized Vocabularies.

Data Dictionary

The All of Us Data Dictionary details participant data, privacy modifications, and specifies whether fields are standard OMOP or custom to All of Us. It also outlines data cleaning methods, lists custom concept IDs, and tracks changes via versioning.

Explore the Registered Tier Data Dictionary and Controlled Tier Data Dictionary.

For a searchable database of available concepts with metadata, please visit the OHDSI Athena tool at Athena.


OMOP

The OMOP Common Data Model (CDM) harmonizes and standardizes health data collected from diverse sources. In the All of Us Research Hub, raw data (e.g., from electronic health records (EHRs), surveys, wearables, and genomics) is converted into a structured format with a common vocabulary system, which allows consistent querying and analysis across datasets from different healthcare systems.
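As a rough illustration of what a common vocabulary makes possible, the sketch below builds a toy in-memory database and counts distinct participants per condition by joining an event table to the concept table. The table and column names (concept, condition_occurrence, person_id, condition_concept_id) follow the OMOP CDM. The rows, the concept IDs, the concept names, and the helper participants_per_condition are made up for this example, and sqlite3 stands in for whatever query engine the Researcher Workbench actually provides.

```python
import sqlite3

# Toy in-memory database. Table and column names follow the OMOP CDM;
# every row, concept ID, and concept name below is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id   INTEGER PRIMARY KEY,
    concept_name TEXT
);
CREATE TABLE condition_occurrence (
    person_id            INTEGER,
    condition_concept_id INTEGER
);
INSERT INTO concept VALUES
    (1, 'Example condition A'),
    (2, 'Example condition B');
INSERT INTO condition_occurrence VALUES
    (101, 1), (101, 1), (102, 1), (102, 2), (103, 1);
""")


def participants_per_condition(conn):
    """Count distinct participants per condition name (hypothetical helper)."""
    rows = conn.execute("""
        SELECT c.concept_name, COUNT(DISTINCT co.person_id)
        FROM condition_occurrence AS co
        JOIN concept AS c ON c.concept_id = co.condition_concept_id
        GROUP BY c.concept_name
        ORDER BY c.concept_name
    """).fetchall()
    return dict(rows)


counts = participants_per_condition(conn)
print(counts)  # {'Example condition A': 3, 'Example condition B': 1}
```

Because every source maps its local codes to the same concept IDs, one query of this shape can run unchanged across data that originated in different healthcare systems.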

Learn more about how the data is structured using the OMOP CDM:


Curated Data Sets

All of Us Research Program data in its final format, after harmonization and refinement, are referred to as a curated dataset.

  • Restriction on Identifiable Data: Directly identifiable information, such as names, contact details, participant IDs, IP addresses, and raw medical records with potential identifiers, is not released in the Curated Data Repository (CDR).

  • Data Encryption and Anonymization: All participant data is encrypted, and obvious identifiers are removed from research data. Identifiable information such as names and addresses is kept separate from health information.

  • Independent Security Reviews: External reviewers regularly assess and test the program's security measures to ensure they are effective against current threats.

There are three tiers of data access: Public (no login required); Registered (login required); and Controlled (additional approval required). Learn more about selecting the right data tier for your project. More information about privacy differences between data tiers.

Public Tier

Data Included:

 

What is it? Anonymized, aggregate-level data that poses negligible risk to the privacy of research participants.

  • No individual participant-level information is included.

  • Contains only summary statistics and aggregate information.

  • Aggregate bin size is set at 20 individuals.

  • Counts lower than 20 are displayed as 20; counts higher than 20 are rounded up to the closest multiple of 20 (e.g., a count of 1245 will be displayed as 1260).
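The binning rule above can be sketched in a few lines of Python. This is a minimal illustration, not the program's actual implementation. The function name public_tier_count is hypothetical, and treating a raw count of zero like any other count below 20 is an assumption, since the text does not say how zero is displayed.

```python
BIN_SIZE = 20  # aggregate bin size stated for the Public Tier


def public_tier_count(raw_count: int, bin_size: int = BIN_SIZE) -> int:
    """Return the count as the rule above would display it (hypothetical helper)."""
    if raw_count < 0:
        raise ValueError("counts cannot be negative")
    # Counts below the bin size are shown as the bin size itself.
    # Assumption: a raw count of 0 is handled the same way.
    if raw_count < bin_size:
        return bin_size
    # Round up to the next multiple of bin_size using integer arithmetic;
    # exact multiples are left unchanged.
    return -(-raw_count // bin_size) * bin_size


print(public_tier_count(7))     # 20
print(public_tier_count(1245))  # 1260
```

Rounding up rather than to the nearest multiple means a displayed count never understates the true count and never reveals a group smaller than 20.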

 

Where to access? Accessible through the All of Us Research Hub without logging in.

These data are available to everyone through Data Snapshots and the Data Browser, an interactive tool on the Research Hub. Note that counts may differ between Data Snapshots and the Data Browser due to lag time in the curation process.

 

  • Enrollment Data:

    • Counts by state and census region.

    • Stratified by demographics such as gender, age group, and race/ethnicity.

  • Derived or Analyzed Data:

    • Genomic summary statistics.

    • Numbers of participants below a threshold to be determined.

  • Precomputed Results:

    • Genome-wide association study results for a given phenotype.

  • Medical Records Data:

    • Binned counts of lab results, medications, and diagnosis and procedure codes.

    • Text fields are not included.

  • Aggregate Counts of Participant-Provided Structured Field Data:

    • Includes demographics, access to healthcare, general health questionnaires, and others.

    • Text fields are not included.

  • Age Group Binning:

    • Ages are binned using 10-year groups, except for the first group, which spans ages 18-29.

    • Ages <18 are omitted.

    • Ages ≥90 are grouped together.

    • Example age groups: 18-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90+.
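The age groups listed above can be reproduced with a short function. This is a sketch under the stated rules only. The name age_group is hypothetical, and the function assumes ages are whole years.

```python
from typing import Optional


def age_group(age: int) -> Optional[str]:
    """Map an age in whole years to its Public Tier group (hypothetical helper).

    Returns None for ages under 18, which are omitted.
    """
    if age < 18:
        return None
    if age < 30:
        return "18-29"  # the first group starts at 18, not 20
    if age >= 90:
        return "90+"    # ages 90 and above are grouped together
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


print([age_group(a) for a in (17, 18, 45, 89, 93)])
# [None, '18-29', '40-49', '80-89', '90+']
```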

Registered Tier

Data Included:

 

What is it? Includes participant-level data with transformations to protect privacy.

Date Transformation: All of a participant's dates are shifted backwards by the same random number of days, between 1 and 365. The offset differs from participant to participant.
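A practical consequence of a per-participant shift is that the intervals between one participant's events are preserved, while calendar dates are not comparable across participants. The sketch below illustrates this with made-up events. How the program actually draws and stores offsets is not described here, so assign_offsets and shift_back are hypothetical helpers, and the seeded random generator is used only to make the example repeatable.

```python
import random
from datetime import date, timedelta


def assign_offsets(person_ids, rng):
    """Draw one shift of 1 to 365 days (inclusive) per participant."""
    return {pid: rng.randint(1, 365) for pid in sorted(person_ids)}


def shift_back(events, offsets):
    """Move each (person_id, date) event back by that person's offset."""
    return [(pid, d - timedelta(days=offsets[pid])) for pid, d in events]


# Made-up events: two for participant 1, one for participant 2.
events = [
    (1, date(2020, 3, 1)),
    (1, date(2020, 3, 15)),
    (2, date(2021, 7, 4)),
]

rng = random.Random(0)  # seeded only so the sketch is repeatable
offsets = assign_offsets({pid for pid, _ in events}, rng)
shifted = shift_back(events, offsets)

# The 14-day gap between participant 1's events survives the shift.
gap_before = events[1][1] - events[0][1]
gap_after = shifted[1][1] - shifted[0][1]
print(gap_before == gap_after)  # True
```

Analyses based on time between events (for example, days from diagnosis to first prescription) are therefore unaffected, while analyses that depend on exact calendar dates or seasonality need extra care.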

How to Access? Approved researchers with a login to the secure Researcher Workbench.

More information: Explore the Registered Tier Data Dictionary

  • EHR Domains: Conditions, Drug Exposures, Labs & Measurements, Procedures, and Death Reports (1)

  • Physical Measurements

  • Wearable Devices (e.g. Fitbit) Learn More: (1) (2)

  • Survey Questions:

    • Basics

    • Overall Health

    • Lifestyle

    • Health Care Access & Utilization

    • Personal & Family Health History

    • COVID-19 Participant Experience (COPE)

    • COPE Minute Survey

    • Social Determinants of Health

    • Learn more: (1) (2)

  • Genomics: SNV/Indel Variants. Learn more

Fields Removed:

  • All free-text fields and unstructured documents (EHR).

  • Geo-location data smaller than US state (except EHR site). Learn more here.

  • Race and Ethnicity subgroups (e.g., Hmong, Filipino, Caribbean).

  • Participants aged > 89 (Demographic).

  • Living situation, active duty military status (PPI).

  • Death causes, diagnosis codes subject to public knowledge (EHR).

  • Billing codes for sex, sexuality, or gender categories (EHR).

Fields Generalized:

  • Race/Ethnicity: Less common races grouped; multi-racial non-Hispanic grouped as ‘Mixed racial group’.

  • Sex at Birth: Grouped into ‘Male’, ‘Female’, and a generalized group.

  • Gender Identity: Grouped into ‘Man’, ‘Woman’, and a generalized group.

  • Sexual Orientation: Non-‘Straight’ selections grouped together.

  • Education: Less than high school/GED grouped together.

  • Employment: Grouped into ‘Employed for wages/self-employed’, ‘Not employed’, and ‘Prefer not to answer’.

Controlled Tier

Data included:

 

What is it? Contains all Registered Tier data plus data elements that may not directly identify individual participants but could increase re-identification risk when combined with other data.

How to access? Access is granted to researchers who meet additional requirements on top of a login to the secure Researcher Workbench.

More information: Explore Controlled Tier Data Dictionary.

  • More detailed demographic, EHR, and survey data than the Registered Tier.

  • Individual level Genomics data:

  • Omics Data: Includes gene expression, metabolome, and proteome data.

  • All by All Tables: Map known and novel associations across 3,400 phenotypes with gene-based and single-variant data from nearly 250,000 whole genome sequences. Learn more: User Support All by All Tables

  • Free text fields that may retain sensitive information:

    • Clinician Notes

    • Narrative Data from Participant Provided Information that exclude explicit identifiers

  • Externally Sourced Socioeconomic Data: Drawn from the U.S. Census American Community Survey and linked via three-digit ZIP code. Learn more

Fields Included

  • Dates: Retains all dates, including month and year, in shifted form to preserve participant privacy.

  • Geographic Data: Includes census tract and/or 3-digit zip codes.

  • Exact Ages: Includes exact ages, even for participants aged > 89.


Access Data in the Research Hub.

  1. Create an account. Data is made available on the secure All of Us Research Hub, where researcher activity is monitored. Authorization for access to the registered and controlled data tiers will be user based, rather than project based.

  2. Complete Mandatory Researcher Registration and Training: Researchers must register with the program, complete ethics training, and agree to a responsible data use code of conduct before accessing data.

  3. Read and Agree to the Data User Code of Conduct.

  4. Obtain a Data Passport: Upon account creation, authorized users will receive a “data passport”, a prerequisite for accessing the registered and controlled data tiers and for creating workspaces for research projects.

  5. Create a project workspace for each unique research project.

  6. Submit project descriptions for each project workspace created, which are made public and searchable to support auditing, public engagement, and compliance with privacy and transparency principles.


Notes about USC IRB Approval

The Researcher Workbench employs a data passport model, through which authorized users do not need IRB review to begin a research project. Most authorized users will not be conducting human subjects research with All of Us data for two reasons:

(1) The research will not directly involve participants, only their data.

(2) The data available in the Researcher Workbench has been carefully checked and altered to remove identifying information while preserving its scientific utility. Nevertheless, we encourage anyone using All of Us data to apply the ethical principles of research with human participants to their work.

However, please note that a USC IRB review is required before initiating a USC-affiliated research project. The Human Research Protection Program at USC requires this review whenever a Data Use Agreement (DUA) is needed to access the dataset (Not Human Subjects Research Worksheet, p. 6).

Please pursue the following with USC IRB based on the listed conditions:

NHSR* self-determination 

Indicated for non-research purposes using NHSR data

(1) no intent for research; or intent for conference presentation dependent on reach of conference, AND

(2) Using the following NHSR Data:

  • All of Us tier 1, tier 2

  • N3C tier 1, tier 2 

Requires NO ACTION on part of researcher.

NOTE: N3C Code of Conduct indicates that data can only be used for research purposes and should be publicly disseminated in some form

*NHSR: Not Human Subjects Research

NHSR Determination

Indicated for non-human subjects research

(1) Any intent for research on NHSR; or intent for conference presentation dependent on reach of conference, AND

(2) Using the following NHSR Data:

  • All of Us tier 1, tier 2 

  • N3C tier 1, tier 2 

ACTION: iStar item 1.1 

NOTE: A journal may request proof of an NHSR determination when reviewing a submission. In this circumstance, a researcher without an NHSR Determination cannot retroactively request one.

IRB Review, exempt category 4

Indicated for secondary research uses of identifiable private information or identifiable biospecimens.

(1) Having any research or public dissemination intent, AND

(2) Conducting human subjects research using the following data:

  • All of Us – tier 3 (genomic data) (identifiable biospecimen) 

  • N3C – tier 3 limited data (zip code, treatment dates, identifiable private information, no genomic data) 

ACTION:

(1) Submit an Exempt Review in section 5.1 of iStar.

(2) Use the Social Behavioral/Secondary Research protocol template.


Learn more: