DataHub
The DataHub is a centralized resource within the Center for Health Data Science that connects researchers to curated health data, along with the infrastructure, policies, processes, and expert support needed to access and use it. Whether you’re just getting started or exploring advanced analytic strategies, the DataHub supports you at every stage of your data journey.
The DataHub includes a comprehensive catalog of key data resources available through CHDS, spanning both internally hosted data and external sources. It also provides curated metadata, guidelines, tools, and procedures to support compliant data access and use.
The DataHub is designed to streamline access to data and provide expert guidance throughout the research process. To learn more or discuss your project, contact bedacprp@bu.edu or explore the links below.
Request Services or Assistance
Collaborative Model & Rates
Our Inventory
Resources listed below include both publicly available datasets (Public) and restricted datasets (Restricted) that require credentialed access (e.g., through DUAs or required training).
| Resource | Access | Category | Description |
|---|---|---|---|
| Acumen CMS Data Linkage (NIA) | Restricted | Administrative/Claims | Links Medicare claims to NIA-funded studies (e.g., HRS, NHATS). Enables longitudinal analysis of older adults’ health, spending, and outcomes. Managed by Acumen LLC with curated, linked datasets. |
| ADVANCE Clinical Research Network (led by OCHIN) | Restricted | EHR/Clinical | PCORnet network aggregating EHR and claims data from safety-net providers, focusing on underserved populations. Uses PCORnet CDM. Strong for health equity and primary care research. |
| All of Us Research Program | Restricted | EHR/Clinical | NIH initiative collecting EHR, biospecimen, survey, and wearable data from 1M+ participants. Focus on diversity, longitudinal tracking, and personalized medicine via a secure Researcher Workbench. |
| Alzheimer’s Disease Research Centers (ADRC) | Restricted | Cohort | Originally established in 1984, the Alzheimer’s Disease Research Centers (ADRCs) are Congressionally designated Centers of Excellence. The ADRCs support the nation’s increased effort to address Alzheimer’s disease (AD) and Alzheimer’s disease-related dementia (ADRD) by providing information, local resources, support, and opportunities to participate in research. |
| American Community Survey (ACS) | Public | National survey | Annual U.S. Census survey covering 3.5M households. Offers tract-level demographic, housing, and economic data. Key for policy and spatial analysis. Estimates have margins of error for small areas. |
| Boston Biorepository, Recruitment, and Integrative Network (BBRAIN) | Restricted | EHR/Cohort | The Boston Biorepository, Recruitment, and Integrative Network (BBRAIN) is a national research resource created to advance the study of Gulf War Illness (GWI). |
| CDC Data Catalog | Public | Public health/community | CDC offers a variety of public and administrative health datasets measuring mortality (WONDER), food environment (mRFEI), local disease burden (PLACES), life expectancy (USALEEP), and other concepts. |
| CMS Medicaid | Restricted | EHR/Administrative/Claims | Massachusetts all-payer claims database covering ~80% of the population. Strong for health utilization and insurance analysis. |
| CMS Medicare | Restricted | Administrative/Claims | Claims and enrollment data for ~65M older or disabled Americans. Ideal for longitudinal studies of healthcare use in older adults. |
| CMS Transformed Medicaid Statistical Information System (T-MSIS) | Restricted | Administrative/Claims | Standardized Medicaid/CHIP data from all states. Includes claims, enrollment, and provider data. Designed for national comparability. |
| Data Axle | Restricted | Administrative/Claims | Business and consumer dataset covering 15M+ U.S. businesses and 300M consumers. Useful for market research and outreach. |
| Epic Electronic Health Record System (EHR) | Restricted | EHR/Clinical | Clinical EHR data from Epic Systems. Rich in labs, diagnoses, medications, and procedures. Granular and timely, but limited to care within Epic networks and varies by institution. |
| Framingham Heart Study (FHS) | Restricted | Cohort | Established by the National Heart, Lung, and Blood Institute (NHLBI) in 1948 in Framingham, Massachusetts, the study follows multiple generations of participants to understand the determinants of cardiovascular disease and related chronic conditions. |
| Health Resources and Services Administration (HRSA) | Public | National survey | Health access and provider shortage data, including HPSAs and FQHCs. Covers underserved regions and providers. Ideal for access-to-care studies; lacks clinical or patient-level detail. |
| Informatics for Integrating Biology & the Bedside (i2b2) | Public | EHR/Claims | Open-source platform for EHR cohort discovery at academic centers. Provides aggregate patient counts for exploratory work. Customizable, but no direct access to patient-level data. |
| International URBAN Alcohol Research Collaboration on HIV/AIDS (ARCH) | Restricted | Cohort | The URBAN ARCH consortium includes three cohorts of people living with HIV in Mbarara, Uganda; St. Petersburg, Russia; and Boston, Massachusetts. |
| MACS/WIHS Combined Cohort Study (MWCCS) | Restricted | EHR/Clinical | Combines MACS and WIHS, the longest-running observational cohorts of men and women with HIV in the US (both of which also include people without HIV), along with new recruits. https://statepi.jhsph.edu/mwccs/about-mwccs/ |
| Massachusetts Center for Health Information and Analysis (CHIA) All Payers Claims Database (APCD) | Restricted | EHR/Administrative/Claims | Massachusetts all-payer claims database covering ~80% of the population. Strong for health utilization and insurance analysis. |
| Merative MarketScan Research Database | Restricted | Administrative/Claims | Proprietary claims database with 200M+ covered lives. Detailed data for commercially insured individuals. Excellent for utilization and cost studies, but not generalizable to public/uninsured populations. |
| National Clinical Cohort Collaborative (N3C) | Restricted | EHR/Claims | NIH-sponsored data enclave with harmonized EHR from 80+ institutions. Initially focused on COVID-19, now expanded. Supports ML and advanced analytics with OMOP-standardized data. |
| Observational Health Data Sciences and Informatics (OHDSI) | Public | EHR/Clinical | Global open-science network using the OMOP data model to enable distributed observational research across EHR and claims data. Tools for comparative effectiveness and large-scale phenotyping. |
| OptumLabs Data Warehouse | Restricted | EHR/Clinical | Large dataset of claims + EHR for 200M+ patients, used in real-world evidence research. Strong linkage, cost/utilization analysis; access is proprietary and costly. |
| TriNetX | Restricted | EHR/Clinical | Federated EHR network used for real-time cohort building, feasibility assessment, and clinical trial recruitment. De-identified data from 120+ global healthcare organizations. |
| U.S. Census | Public | National survey | Decennial population count (~330M U.S. residents). Provides full coverage demographic and housing data. Used for policy, planning, and allocation. Conducted every 10 years. |
| U.S. Department of Agriculture (USDA) | Public | National survey | USDA offers data measuring the food environment, SNAP participation, and rurality measures. Common in food insecurity and rural health work. |
| VA Informatics and Computing Infrastructure (VINCI) | Restricted | Administrative/Claims | Secure VA-managed environment offering access to national EHR data for 9M+ veterans, including structured/unstructured data, pharmacy, imaging, and claims. Used in chronic disease and aging research. |