Research Workbench
Workspaces and Creating a New Project
Create A Workspace
Each project begins with the creation of a new workspace, where you will store and analyze data within a cloud-based platform. These workspaces act as sandboxes where the files and analytical environments associated with one project can be accessed, stored, and shared.
Workspace Ownership: Each workspace has one or more owners who manage permissions for viewing and analyzing data within the workspace. An owner may grant access to additional users with their own accounts, provided those users have the relevant authorization to view the data tier. Users cannot share workspaces by sharing accounts. Learn more about navigating workspaces. (1) (2)
Research Project Description: Workspace owners must declare a purpose and provide a summary of each workspace for public posting on the Research Hub and the All of Us website for transparency. All of Us provides this example.
Learn More here: User Support - Questions about Workspace and Workspace Optimization
Create a Data Set for Analysis
There are several options to query the All of Us database and create a dataset for analysis:
The first is to write your own SQL query against the database within a Jupyter Notebook, or to modify a query generated by the Dataset Builder, without using the Cohort and Dataset Builders directly; a minimal sketch follows below. This approach requires familiarity with the OMOP Common Data Model to understand how the All of Us dataset is structured. Please see the Analysis Environment section below for more information on how to open a notebook.
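As a rough illustration of this approach, the sketch below runs a standard OMOP query from a Python notebook cell and loads the result into a data frame. The environment variable name, the pandas_gbq call, and the dataset reference are assumptions for illustration; check your workspace documentation for the exact way the curated data repository is exposed.

```python
import os
import pandas_gbq

# Assumption: the curated data repository (CDR) is exposed to the notebook as a
# BigQuery dataset whose name is stored in an environment variable; confirm the
# variable name and access method in your workspace documentation.
cdr = os.environ.get("WORKSPACE_CDR", "your_cdr_dataset")

# Example OMOP query: count condition occurrences per standard concept.
query = f"""
SELECT c.concept_name,
       COUNT(*) AS n_occurrences
FROM `{cdr}.condition_occurrence` AS co
JOIN `{cdr}.concept` AS c
  ON co.condition_concept_id = c.concept_id
GROUP BY c.concept_name
ORDER BY n_occurrences DESC
LIMIT 20
"""

# read_gbq returns the query result as a pandas DataFrame for further analysis.
condition_counts = pandas_gbq.read_gbq(query, dialect="standard")
print(condition_counts.head())
```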
The second is to use the purpose-built tools in the Workbench to import your query and the resulting data frame into an analysis environment.
The following steps walk through creating and exporting a dataset using the All of Us tools.
1. Use the Cohort Builder to select participants for analysis. The Cohort Builder allows you to create and review cohorts and exclude/annotate participants in your study group. You can also build control cohorts. Learn how to use the Cohort Builder (1) (2) (3)
1. Select Cohorts + to begin creating a cohort for analysis.
2. Select the domain of your criteria to begin creating your cohort.
3. If your domain contains multiple medical concepts, you will be able to navigate the parent-child relationships of those concepts through a point-and-click interface. You can select the “hierarchical diagram” icon to make more specific concept selections. It can be helpful to preview the Data Browser before making your selections.
4. This is the page you will see when you navigate the concept hierarchies. Simply select the plus sign to add your criteria. There is a search box that you can use to navigate more quickly to the concepts you’re looking for.
5. You can add multiple inclusion and exclusion criteria using the same process. The tool will present an aggregate preview of your cohort. Select Create Cohort to use your selections to create a dataset.
6. Save and provide a name and description of your cohort for later use.
1a. After creating your cohort, you have the option to use the Review Set Creator to view row-level data and descriptive statistics for a random sample of your cohort. Note that you will be viewing synthetic data. You are able to take notes and make annotations within this tool. Learn more here.
1. Select Create Review Sets to review row-level data of your cohort.
2. Name your review set and select a number of participants to create a random sample.
3. Select Cohort Details to view synthetic descriptive statistics of your random sample.
4. This page depicts descriptive statistics of your review set.
2. Use the Dataset Builder to query the database by combining cohorts and concept sets, preview the results, and create analysis-ready datasets that can be exported directly into an analysis environment or notebook. Learn how to use the Dataset Builder (1)
1. Select the Dataset Builder within your Workspace.
2. Select the cohort, concepts, and values that you’d like to include in your dataset.
3. You can preview your datasets before proceeding to analysis.
4. Export the code to create your dataset directly into a Jupyter Notebook environment. From here you will be able to open an analysis environment directly.
3. Use the Concept Set Selector within the Dataset Builder to select variables and data types fit for your project. You can search for and save collections of concepts from a particular domain as a “Concept Set.” You may access and extract these concept sets from your cohort in an analysis environment or notebook.
1. Access the Concept Set Selector by clicking the plus button within the Dataset Builder. This will take you to a list of OMOP concepts to add to your dataset.
2. Navigate through domains and their concepts within the concept selector.
3. Name and save your concept set for later use.
4. Access your newly created concept sets within the Dataset Builder under the Select Concept Sets column. You will have to scroll to the bottom of the list to view your Workspace Concept Sets.
4. Use the Genomic Extraction Tool within the Dataset Builder to access short-read whole genome sequencing (WGS) data. Learn more here.
- In the Select Concept Sets table, select All whole genome variant data under Prepackaged Data Sets. This option will only be available if your workspace has access to controlled tier data.
5. Importing and Managing Datasets:
Create Datasets: Write SQL queries within a Jupyter Notebook to extract, merge, and query data from the database.
For example, you can retrieve data for your participant cohorts and then merge concept sets into a single data frame; see the sketch at the end of this list.
Import External Files: Upload external files or datasets directly into your environment as needed. Learn more here.
Import Datasets: Use the Dataset Builder to pull data (cohorts and concept sets) into your analysis environment.
- You can directly open datasets that you’ve created from the Dataset Builder under the Analysis tab in your Workspace.
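As a rough sketch of the merge step mentioned above, the example below combines a hypothetical cohort demographics frame with a measurement extract into one analysis-ready data frame using pandas. The frame names and columns are placeholders for illustration, not the exact output of the Dataset Builder.

```python
import pandas as pd

# Hypothetical data frames of the kind a dataset export might produce:
# one row per participant for demographics, one row per measurement event.
cohort_df = pd.DataFrame({
    "person_id": [1, 2, 3],
    "year_of_birth": [1950, 1968, 1985],
    "sex_at_birth": ["Female", "Male", "Female"],
})
measurement_df = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "measurement_concept_name": ["Body weight"] * 4,
    "value_as_number": [70.2, 69.8, 88.1, 61.5],
})

# Summarize the measurement concept set to one row per participant,
# then merge it onto the cohort to get a single analysis-ready frame.
mean_weight = (
    measurement_df
    .groupby("person_id", as_index=False)["value_as_number"]
    .mean()
    .rename(columns={"value_as_number": "mean_body_weight"})
)
analysis_df = cohort_df.merge(mean_weight, on="person_id", how="left")
print(analysis_df)
```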
Analysis Environment
The analysis environment is a versatile workspace that offers various processes, tools, and runtimes to facilitate your research. Below is a guide on how to use it effectively:
1. Open an Analysis Environment.
Data Processing Options:
Local Data Processing (Jupyter Notebook): Conduct small-scale data processing within your notebook using Python or R libraries such as pandas, dplyr, or NumPy.
Large-Scale Data Processing (Apache Spark): For larger datasets, load the data into Apache Spark using PySpark, SparkR, or Scala to perform distributed processing on large-scale data.
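For the Spark option, a minimal PySpark sketch is shown below. It assumes a Spark-enabled runtime is available in your workspace and uses a placeholder file path; cluster setup and data locations will differ in practice.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumption: a Spark-enabled environment is available; cluster sizing and
# configuration are handled by the workbench, so this builder call is a
# minimal sketch rather than a tuned setup.
spark = SparkSession.builder.appName("aou-demo").getOrCreate()

# A large extract (for example, a Parquet file produced from a dataset export)
# can be loaded as a distributed DataFrame. The path below is a placeholder.
df = spark.read.parquet("path/to/large_extract.parquet")

# Distributed aggregation: average value and row count per concept.
summary = (
    df.groupBy("measurement_concept_name")
      .agg(F.avg("value_as_number").alias("mean_value"),
           F.count("*").alias("n"))
)
summary.show(10)
```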
Available Environments:
Jupyter Notebook: Ideal for combining Python, R, and SQL in one place, making it easy to run multi-step analyses and visualize results in a single document.
User Support: Jupyter Notebook; SQL in the Jupyter Environment; GPUs
SAS Studio: A complementary tool for users familiar with SAS for statistical analysis; it can be used alongside Python and R.
RStudio: A robust environment for R development, offering an enhanced interface for coding, debugging, and markdown support.
User Support: RStudio
Data Analysis and Exploration:
Preliminary Processing: Clean and summarize your data before deeper analysis. Use packages like pandas or dplyr for initial exploration. Learn more here.
Advanced Analysis: Perform more detailed analysis using pre-installed libraries and tools for:
Statistical modeling (e.g., statsmodels, rstat).
Machine learning (e.g., scikit-learn, caret).
Visualization (e.g., matplotlib, ggplot2).
Here is a list of pre-installed tools and packages.
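As a small illustration of preliminary processing, the sketch below uses pandas to check missingness, handle missing values, and summarize a hypothetical analysis-ready frame; the column names are placeholders, not variables from the actual dataset.

```python
import pandas as pd

# Hypothetical analysis-ready frame (e.g., produced by a dataset export).
df = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "age": [72, 55, None, 38],
    "systolic_bp": [141.0, 128.0, 135.0, None],
})

# Preliminary processing: check completeness, handle missing values,
# and summarize the distributions before any modeling.
print(df.isna().mean())              # fraction missing per column
clean = df.dropna(subset=["age"])    # keep rows with a known age
clean = clean.assign(
    systolic_bp=clean["systolic_bp"].fillna(clean["systolic_bp"].median())
)
print(clean.describe())              # summary statistics of numeric columns
```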
Code Snippets for Efficiency:
Use Code Snippets: Pre-built code snippets for common functions are available to help streamline your workflow. These snippets can perform specific tasks like data cleaning, merging, and visualization.
Also note the following:
Users are logged out after 30 minutes of inactivity, but long-running notebooks will continue to execute even after you are logged out. If you wish to capture all notebook cell outputs, use this notebook to run your long-running notebook.
Clusters will automatically pause after 24 hours; log in and start any notebook to reset the autopause timer for jobs exceeding this duration.
The workbench automatically saves the current version of your notebooks.
1. You can open an analysis environment under the Analysis tab or by launching applications from the sidebar in your workspace. Please note that opening any environment will incur billing. Please see here for more information regarding billing.
2. Exporting Data and Results:
Download Outputs: You can download non-participant-level data and analysis outputs directly from the environment, in compliance with the Data User Code of Conduct.
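For example, an aggregate summary table (containing no participant-level rows) could be written to a file in the notebook's working directory and then downloaded through the environment's file browser. This is a minimal sketch of one export path, not the only supported option, and the table contents are placeholders.

```python
import pandas as pd

# Hypothetical aggregate result containing no participant-level rows.
summary = pd.DataFrame({"group": ["A", "B"], "n": [1200, 950]})

# Writing the table to the working directory makes it available to download
# through the environment's file browser, subject to the Data User Code of Conduct.
summary.to_csv("summary_counts.csv", index=False)
```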
Community Workspace
Researchers who want to share their workspace projects with all other registered Researcher Workbench users can do so through a community workspace.
Community workspaces open a new way to foster knowledge-sharing, collaboration, and learning within the community of researchers using the Researcher Workbench across the world. These workspaces are hosted in the Researcher Workbench in a new Community Workspaces section of the existing featured workspace library.
Learn more: All of Us User Support: Community Workspace