About the Data

Where Does the Data Come From?

N3C does not recruit participants. Instead, it receives de-identified, harmonized EHR data from more than 230 sites under 83 Data Transfer Agreements. These sites provide data on individuals who were tested for COVID-19 or who exhibited related symptoms.

Once you gain access to N3C's Enclave, your homepage will offer a brief overview of the most current N3C Cohort Statistics.

N3C Cohort Definition (Phenotype)

N3C identifies patients and controls through a common COVID-19 phenotype, which defines the data pull for the limited data set.

The latest COVID-19 phenotype documentation identifies lab-confirmed, suspected, and possible cases of COVID-19 and matches them to controls on demographic factors in a 1:2 ratio (cases:controls).
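
To make the 1:2 ratio concrete, here is a minimal sketch of exact matching on demographic keys. It is not N3C's actual matching procedure: the function name, the record layout, and the matching keys (age_group and sex) are assumptions chosen only for illustration.

```python
from collections import defaultdict

CONTROLS_PER_CASE = 2  # the 1:2 cases:controls ratio described above


def match_controls(cases, controls, keys=("age_group", "sex")):
    """Greedily assign each case up to two unused controls with identical key values.

    `keys` is a hypothetical choice of demographic factors, not the N3C definition.
    """
    # Index control IDs by their demographic key values.
    pool = defaultdict(list)
    for control in controls:
        pool[tuple(control[k] for k in keys)].append(control["person_id"])

    matches = {}
    for case in cases:
        available = pool[tuple(case[k] for k in keys)]
        matches[case["person_id"]] = available[:CONTROLS_PER_CASE]
        # Remove the assigned controls so they are not reused for another case.
        del available[:CONTROLS_PER_CASE]
    return matches


# Made-up records, for illustration only.
cases = [
    {"person_id": 1, "age_group": "40-49", "sex": "F"},
    {"person_id": 2, "age_group": "60-69", "sex": "M"},
]
controls = [
    {"person_id": 10, "age_group": "40-49", "sex": "F"},
    {"person_id": 11, "age_group": "40-49", "sex": "F"},
    {"person_id": 12, "age_group": "40-49", "sex": "F"},
    {"person_id": 13, "age_group": "60-69", "sex": "M"},
]
matches = match_controls(cases, controls)
```

In this toy input the second case finds only one eligible control, which shows that a 1:2 ratio is a target rather than a guarantee when the control pool is small.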

For more detailed information, you can view the latest phenotype on the GitHub Wiki.

The Phenotype Explorer is a tool within the N3C Enclave that helps researchers interactively explore and visualize COVID-19 phenotype definitions and criteria.


 

Data Format

N3C ingests data in formats such as PCORnet, ACT, OMOP, and TriNetX and harmonizes them into the OMOP Common Data Model (CDM) version 5.3.1 for comprehensive analytics. The OMOP CDM standardizes how clinical EHR data are organized, which allows data from various sources to be integrated. For more details on data harmonization, visit here. Learn more about OMOP Vocabulary here (1) (2).

The central table in the OMOP vocabulary system is the concept table.
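
As a rough illustration of why the concept table is central, the sketch below joins a clinical table to it so that numeric concept IDs become readable names. The table and column names follow the OMOP CDM, but the rows are made-up placeholders, and an in-memory SQLite database stands in for the Enclave, whose own tooling may differ.

```python
import sqlite3

# In-memory database standing in for an OMOP instance (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id    INTEGER PRIMARY KEY,
    concept_name  TEXT,
    domain_id     TEXT,
    vocabulary_id TEXT
);
CREATE TABLE condition_occurrence (
    person_id            INTEGER,
    condition_concept_id INTEGER
);
-- Placeholder rows, not real vocabulary content.
INSERT INTO concept VALUES
    (1001, 'Example condition A', 'Condition', 'SNOMED'),
    (1002, 'Example condition B', 'Condition', 'SNOMED');
INSERT INTO condition_occurrence VALUES (1, 1001), (2, 1002), (3, 1001);
""")


def conditions_with_names(conn):
    """Return each condition record with the name looked up in the concept table."""
    query = """
        SELECT co.person_id, co.condition_concept_id, c.concept_name
        FROM condition_occurrence AS co
        LEFT JOIN concept AS c
               ON c.concept_id = co.condition_concept_id
        ORDER BY co.person_id
    """
    return conn.execute(query).fetchall()


rows = conditions_with_names(conn)
for row in rows:
    print(row)
```

The same join pattern applies to the other clinical tables, since each stores concept IDs that resolve through the concept table.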

Available Data

The data set contains real-world data from patients who were tested for COVID-19 or whose symptoms are consistent with COVID-19. It also contains data from individuals infected with pathogens such as SARS 1, MERS, and H1N1, which can support comparative studies.

The data are limited to retrospective electronic health record data. The specific variables available may vary by contributing institution.

The Data Dictionary catalogues the data available in N3C based on the OMOP Common Data Model specifications.

List of Available Data

  • Demographic Information:

    • Age

    • Gender

    • Race/Ethnicity

    • Geographic location

    • Social determinants of Health

 

  • Clinical Diagnoses and Conditions:

    • COVID-19 diagnosis (e.g., PCR test results, ICD-10 codes)

    • Comorbidities (e.g., diabetes, hypertension)

    • Other medical conditions (e.g., cardiovascular diseases, respiratory diseases)

 

  • Laboratory Results:

    • Blood tests (e.g., complete blood count, metabolic panel)

    • Biomarkers (e.g., inflammatory markers, D-dimer)

    • Viral load measurements

  • Vital Signs and Physiological Measurements:

    • Blood pressure

    • Heart rate

    • Respiratory rate

    • Body temperature

 

  • Medication and Treatment Data:

    • Prescription medications

    • Dosages and frequencies

    • Treatment protocols for COVID-19 and other conditions

 

  • Procedures and Interventions:

    • Surgeries

    • Medical procedures (e.g., intubation, ventilation)

    • Therapeutic interventions (e.g., oxygen therapy, antiviral treatment)

  • Clinical Outcomes:

    • Admission, Transfer, Discharge

    • Hospitalizations

    • Intensive care unit (ICU) admission

    • Mortality

    • Long COVID Clinic Visits

 

  • Longitudinal Data:

    • Time-stamped records of clinical events and measurements, allowing for longitudinal analyses and outcome assessments over time.

 

  • Clinician Free-Text Notes

    • Concepts derived from clinical notes through Natural Language Processing (NLP) are available.

    • N3C NLP Guide

  • Centers for Medicare & Medicaid Services (CMS) data

 

Data Levels

Three levels of data are available for analysis. You will request access to a data level for each project.

Level

Data Description

Eligible Users

Access Requirements

Appropriate Projects (e.g.)

Level 3

Limited Data Set (LDS)

Patient data that retain the following protected health information:

  • dates of service

  • patient ZIP code

Zip codes are truncated to the first 3 digits, and

Researchers from U.S.-based institutions

  • N3C registration

  • N3C Data Enclave account

  • Data Use Agreement (DUA) executed with NCATS

  • NIH IT training completion

  • Approved Data Use Request (DUR)

  • Human Subjects Research Protection training completion

  • USC (Local Human Research Protection Program) IRB determination letter

Example: Studies considering absolute timing, such as determining if a patient's primary COVID-19 infection occurred during the Delta wave.

Level 2

De-identified Data Set

Patient data from the LDS with the following changes:

  • Dates of service are shifted randomly by ±180 days to protect patient privacy. Each patient's dates are shifted by the same unknown, random amount.

  • Patient ZIP codes are truncated to the first three digits or removed entirely if the ZIP code represents fewer than 20,000 individuals.

Researchers from U.S.-based and foreign institutions.

  • N3C registration

  • N3C Data Enclave account

  • DUA executed with NCATS

  • NIH IT training completion

  • Approved DUR

  • Human Subjects Research Protection training completion

Example: Analyzing comorbidity patterns in COVID-19 patients by examining general trends and demographic factors.

Level 1

Synthetic Data Set

(available in Education Enclave)

Data computationally derived from the LDS that statistically resemble the Level 3 patient data but are not actual patient data.

Citizen scientists and researchers from U.S.-based and foreign institutions.

  • N3C registration

  • N3C Data Enclave account

  • DUA executed with NCATS

  • NIH IT training completion

  • Approved DUR

Not available for research; used for data science training in the Education Enclave.

PPRL Data

Privacy-Preserving Record Linkage Data Set

Restricted external datasets that have been linked to N3C data using Privacy-Preserving Record Linkage.

More information (1) (2) (3)

Example: RECOVER Data Guide; CMS; Mortality Evidence

Researchers from U.S.-based institutions

Access to PPRL data follows special procedures and is requested as part of a Level 3 access request.

Example: Analyze the impact of COVID-19 on healthcare utilization among Medicaid and Medicare patients using CMS data alongside EHR data.

External Data Sets

Can be imported into

Publicly available data (e.g., U.S. Census and regional data) for use alongside EHR data.

Example:

 

No special requirements. Users can request ingestion of additional external datasets here.

Example: Analyze the correlation between socio-economic factors from U.S. Census data and COVID-19 health outcomes using de-identified EHR data.
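
The Level 2 de-identification rules in the table above (a single random date shift of up to 180 days per patient, and ZIP codes reduced to three digits or removed for areas with fewer than 20,000 people) can be sketched as follows. This is an illustration under stated assumptions, not N3C's implementation: the function names, the seed parameter, and the population lookup are hypothetical.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 180          # dates are shifted by up to +/-180 days
MIN_ZIP3_POPULATION = 20_000  # smaller 3-digit ZIP areas are removed


def draw_patient_offsets(person_ids, seed=None):
    """Draw one random day offset per patient (hypothetical helper)."""
    rng = random.Random(seed)
    return {pid: rng.randint(-MAX_SHIFT_DAYS, MAX_SHIFT_DAYS) for pid in person_ids}


def shift_date(person_id, service_date, offsets):
    """Shift a date by the patient's own offset, so intervals within a patient survive."""
    return service_date + timedelta(days=offsets[person_id])


def generalize_zip(zip_code, zip3_population):
    """Keep the first three digits, or return None when the area is too small.

    `zip3_population` maps 3-digit prefixes to population counts (assumed input).
    """
    zip3 = zip_code[:3]
    if zip3_population.get(zip3, 0) < MIN_ZIP3_POPULATION:
        return None
    return zip3


# Made-up values, for illustration only.
offsets = draw_patient_offsets([1, 2], seed=0)
admit, discharge = date(2021, 7, 1), date(2021, 7, 9)
shifted_admit = shift_date(1, admit, offsets)
shifted_discharge = shift_date(1, discharge, offsets)

# The length of stay is unchanged even though the calendar dates moved.
print(shifted_discharge - shifted_admit == discharge - admit)
```

Because every date for a patient moves by the same amount, durations and event ordering within that patient are preserved, while absolute timing (for example, which pandemic wave an infection fell in) is not. That is why studies of absolute timing call for Level 3 data.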

Dashboards

Visit the N3C Dashboard, which provides detailed visualizations and insights into COVID-19 patient data, including demographics, mortality, comorbidities, medication usage, and regional distribution. It also features tools for exploring institutional collaborations, data contributions, and publications resulting from N3C research. Some helpful dashboards include:

Related Enclave Applications

Data Catalog

Browse the most commonly accessed data tables, such as Level 1 and Level 2 data, notional data for learning, and commonly used external data sets.

Once your project workspace is created, the dataset from the Data Catalog that you requested in your DUR will be linked to the workspace for dataset creation and analysis.

 

→ Request External Datasets

Publicly available datasets (PADs) can be recommended for ingestion into the N3C Data Enclave. To ensure security and privacy, N3C and NCATS have established a formal policy for incorporating external data. After a Request for Use and Access form is submitted, the dataset is evaluated and, if approved, made accessible to researchers.

More information:

More Resources: