Research Workbench
Workspaces and Creating a New Project
Create A Workspace
Each project begins with the creation of a new workspace, where you will store and analyze data on a cloud-based platform. Workspaces act as sandboxes where the files and analytical environments associated with one project can be accessed, stored, and shared.
Workspace Ownership: Each workspace has one or more owners who manage permissions for viewing and analyzing data within the workspace. An owner may grant access to additional users with their own accounts, provided those users have the relevant authorization for the data tier. Users cannot share workspaces by sharing accounts. More about navigating workspaces. (1) (2)
Research Project Description: Workspace owners must declare a purpose and provide a summary of each workspace for public posting on the Research Hub and the All of Us website, for transparency. All of Us provides this example.
Learn More here: User Support - Questions about Workspace and Workspace Optimization
Create a Data Set for Analysis
There are several options to query the All of Us database and create a dataset for analysis:
The first is to build your own SQL query of the database, or modify a query obtained from the Dataset Builder, within a Jupyter Notebook without using the Cohort and Dataset Builders. This requires familiarity with the OMOP Common Data Model, which defines how the All of Us dataset is structured. Please see the Analysis Environment section below for more information on how to open a notebook.
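To illustrate what such a notebook query looks like, here is a minimal sketch. SQLite stands in for the Workbench's actual database backend, and the rows are invented; only the table and column names (person, condition_occurrence) follow the real OMOP Common Data Model.

```python
import sqlite3
import pandas as pd

# Illustrative only: SQLite stands in for the Workbench's database backend.
# Table and column names follow the OMOP Common Data Model; data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER, gender_concept_id INTEGER);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
INSERT INTO person VALUES (1, 1970, 8507), (2, 1985, 8532);
INSERT INTO condition_occurrence VALUES (1, 201826), (2, 201826);
""")

# Join participants to a condition, as you might in a notebook SQL query.
query = """
SELECT p.person_id, p.year_of_birth, c.condition_occurrence_id
FROM person p
JOIN condition_occurrence c ON p.person_id = c.person_id
"""
query = """
SELECT p.person_id, p.year_of_birth, c.condition_concept_id
FROM person p
JOIN condition_occurrence c ON p.person_id = c.person_id
WHERE c.condition_concept_id = 201826
"""
df = pd.read_sql_query(query, conn)
print(len(df))  # 2 matching participants
```

The same join-on-person_id pattern applies when querying the real OMOP tables from a notebook.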
The second is to use purpose-built tools in the hub to import your query and the resulting data frame into an analysis environment, as follows:
These steps follow the process of creating and exporting a dataset using the All of Us tools.
1. Use the Cohort Builder to select participants for analysis. The Cohort Builder allows you to create and review cohorts and exclude/annotate participants in your study group. You can also build control cohorts. Learn how to use the Cohort Builder (1) (2) (3)
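Conceptually, a cohort is a set of inclusion and exclusion criteria applied to the participant pool. A minimal pandas sketch, with hypothetical column names (the Cohort Builder does this through its point-and-click interface):

```python
import pandas as pd

# Hypothetical participant table; column names are invented for illustration.
participants = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "age": [34, 67, 52, 45],
    "has_hypertension": [True, True, False, True],
})

# Inclusion criteria: adults 40+ with hypertension; exclusion: over 65.
cohort = participants[
    (participants["age"] >= 40)
    & participants["has_hypertension"]
    & (participants["age"] <= 65)
]
print(sorted(cohort["person_id"]))  # [4]
```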
1a. After creating your cohort, you have the option to use the Review Set Creator to view row-level data and descriptive statistics for a random sample of your cohort. Note that you will be viewing synthetic data. You can also take notes and add annotations within this tool. Learn more here.
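In pandas terms, the review step amounts to drawing a random sample of rows and summarizing them. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical cohort data, mimicking what the Review Set Creator displays.
cohort = pd.DataFrame({
    "person_id": range(1, 101),
    "bmi": [20 + (i % 15) for i in range(100)],
})

sample = cohort.sample(n=10, random_state=42)   # random row-level sample
summary = sample["bmi"].describe()              # descriptive statistics
print(summary["count"])  # 10.0
```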
2. Use the Dataset Builder to combine cohorts and concept sets into a query of the database, preview the result, and create analysis-ready datasets that can be exported directly into an analysis environment or notebook. Learn how to use the Dataset Builder (1)
3. Use the Concept Set Selector within the Dataset Builder to select variables and data types that fit your project. You can search for and save collections of concepts from a particular domain as a “Concept Set.” You can then access and extract these concept sets for your cohort in an analysis environment or notebook.
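Saving a concept set amounts to selecting rows of the OMOP concept table for a given domain. A sketch with a hypothetical slice of that table (the concept IDs and names are illustrative):

```python
import pandas as pd

# Hypothetical slice of the OMOP concept table; values are illustrative.
concepts = pd.DataFrame({
    "concept_id": [201826, 3004249, 3000963],
    "concept_name": ["Type 2 diabetes", "Systolic BP", "Hemoglobin A1c"],
    "domain_id": ["Condition", "Measurement", "Measurement"],
})

# A "concept set": the saved concepts matching a search within one domain.
measurement_set = concepts[concepts["domain_id"] == "Measurement"]
print(len(measurement_set))  # 2
```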
4. Use the Genomic Extraction Tool within the Dataset Builder to access short-read WGS genomic data. Learn more here.
5. Importing and Managing Datasets:
Create Datasets: Write SQL queries within a Jupyter Notebook to extract, merge, and query data from the database.
For example, you can retrieve data from participant cohorts, and then merge concept sets into a single data frame.
Import External Files: Upload external files or datasets directly into your environment as needed. Learn more here.
Import Datasets: Use the Dataset Builder to pull data (cohorts and concept sets) into your analysis environment.
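The "merge concept sets into a single data frame" step above can be sketched with pandas. The frames here are hypothetical stand-ins for Dataset Builder exports, keyed by person_id as in the OMOP model:

```python
import pandas as pd

# Hypothetical exports: the cohort plus one frame per concept set.
cohort = pd.DataFrame({"person_id": [1, 2, 3]})
bp = pd.DataFrame({"person_id": [1, 2], "systolic_bp": [120, 135]})
labs = pd.DataFrame({"person_id": [2, 3], "hba1c": [5.4, 6.1]})

# Left-merge each concept set onto the cohort to build one analysis frame;
# participants missing a measurement get NaN rather than being dropped.
df = (
    cohort
    .merge(bp, on="person_id", how="left")
    .merge(labs, on="person_id", how="left")
)
print(df.shape)  # (3, 3)
```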
Analysis Environment
The notebook is a versatile analysis environment that offers various processes, tools, and runtimes to facilitate your research. Below is a guide on how to use this environment effectively:
1. Open an Analysis Environment.
Data Processing Options:
Local Data Processing (Jupyter Notebook): Conduct small-scale data processing within your notebook using Python or R libraries such as pandas, dplyr, or NumPy.
Large-Scale Data Processing (Apache Spark): For larger datasets, load the data into Apache Spark using PySpark, SparkR, or Scala to perform distributed processing.
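A small local-processing example with pandas (invented data); the comment notes how the same aggregation would look in PySpark for larger datasets:

```python
import pandas as pd

# Small-scale local processing: group and aggregate within the notebook.
df = pd.DataFrame({
    "cohort": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0],
})
means = df.groupby("cohort")["value"].mean()
print(means["a"], means["b"])  # 2.0 3.0

# For data too large for one machine, the same aggregation in PySpark
# would be roughly: spark_df.groupBy("cohort").mean("value")
```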
Available Environments:
Jupyter Notebook: Ideal for combining Python, R, and SQL in one place, making it easy to run multi-step analyses and visualize results in a single document.
User Support: Jupyter Notebook; SQL in the Jupyter Environment; GPUs
SAS Studio: A complementary tool for users familiar with SAS for statistical analysis, allowing it to be used alongside Python and R.
RStudio: A robust environment for R development, offering an enhanced interface for coding, debugging, and markdown support.
User Support: RStudio
Data Analysis and Exploration:
Preliminary Processing: Clean and summarize your data before deeper analysis. Use packages like pandas or dplyr for initial exploration. Learn more here.
Advanced Analysis: Perform more detailed analysis using pre-installed libraries and tools for:
Statistical modeling (e.g., statsmodels, rstat).
Machine learning (e.g., scikit-learn, caret).
Visualization (e.g., matplotlib, ggplot2).
Here is a list of pre-installed tools and packages.
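As a taste of the machine-learning tooling, a minimal scikit-learn sketch fitting a classifier on toy data (the data is invented, and real analyses would use dataset exports instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy example: predict a binary outcome from one cleanly separated feature.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
preds = model.predict([[2.0], [11.0]])
print(preds)  # [0 1]
```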
Code Snippets for Efficiency:
Use Code Snippets: Pre-built code snippets for common functions are available to help streamline your workflow. These snippets can perform specific tasks like data cleaning, merging, and visualization.
Also note the following:
Users are logged out after 30 minutes of inactivity, but long-running notebooks will continue to execute even after logout. If you wish to capture all notebook cell outputs, use this notebook to run your long-running notebook (or any other long-running notebook).
Clusters automatically pause after 24 hours; for jobs exceeding this duration, log in and start any notebook to reset the autopause timer.
The workbench automatically saves the current version of your notebooks.
2. Exporting Data and Results:
Download Outputs: You can download non-participant-level data and analysis outputs directly from the environment, in compliance with the Data User Code of Conduct.
Community Workspace
Researchers who want to share their workspace projects with all other registered Researcher Workbench users can do so via a community workspace.
Community workspaces offer a new way to foster knowledge-sharing, collaboration, and learning within the worldwide community of researchers using the Researcher Workbench. These workspaces are hosted in a new Community Workspaces section of the existing featured workspace library.
Learn more: All of Us User Support: Community Workspace