Broad Genome References

This dataset includes two human genome references assembled by the Genome Reference Consortium: hg19 and hg38. It also includes a specially curated version of the hg19 reference, known as the decoy genome. Together, these contain all the resource files needed to run GATK Best Practices workflows on sequencing data. AWS S3 has made this collection of reference data available free of charge so that anyone can use the AWS cloud platform to perform large-scale genomics analysis without worrying about the cost of downloading or hosting the data themselves.

Accessing Broad Genome References on AWS

The dataset is organized so that each set of reference files sits in its own sub-directory directly under the bucket s3://broad-references:

s3://broad-references/hg38/
s3://broad-references/hg19/
s3://broad-references/Homo_sapiens_assembly19_1000genomes_decoy/
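The top-level layout above can be explored programmatically. The following is a minimal sketch, not an official client: it assumes the boto3 package is installed and that the bucket permits anonymous (unsigned) reads, neither of which is stated in this document. The helper names `split_s3_uri` and `list_subdirectories` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_subdirectories(uri):
    """Return the immediate sub-directories under an S3 URI.

    Assumption: the bucket allows anonymous reads, so requests are unsigned.
    """
    # Imported lazily so split_s3_uri stays usable without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = client.get_paginator("list_objects_v2")
    found = []
    # Delimiter="/" asks S3 to group keys into directory-like common prefixes.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            found.append(f"s3://{bucket}/{entry['Prefix']}")
    return found
```

If the assumptions hold, calling `list_subdirectories("s3://broad-references/")` should return the reference directories listed above, and passing `s3://broad-references/hg19/` should return the version directories beneath that reference.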

Each reference directory includes:

* a README with details on how these files are generated
* *.dict, *.fasta, and other common auxiliary resource files
* vcf files for known indels and snps from various genomic projects, such as 1000Genomes, dbsnp, etc.
* an interval list containing contiguous regions that is used to analyze whole genome sequencing data in chunks

For example, to find the fasta files associated with the hg19 set of resources, start by listing the hg19 reference directory s3://broad-references/hg19/, which shows:

s3://broad-references/hg19/v0

The first directory under the reference type is a version. Versioning protects the files from mutating, and any change or update to an existing reference is explicitly managed. Listing the contents under the version directory and filtering by the term fasta yields:

s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.amb
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.ann
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.bwt
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.pac
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.64.sa
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.alt
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.amb
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.ann
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.bwt
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.pac
s3://broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.sa
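The filtering step described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the sample listing is a small hand-written subset in which the README and .dict entries are assumed names, not confirmed paths in the bucket.

```python
def filter_keys(keys, term):
    """Keep only the keys whose text contains `term`, preserving order."""
    return [key for key in keys if term in key]


def suffixes_after(keys, marker):
    """Return whatever follows the first `marker` in each key that contains it."""
    return [key.split(marker, 1)[1] for key in keys if marker in key]


# Hand-written sample of a version directory listing (not a real API response).
listing = [
    "hg19/v0/README",                                # assumed name
    "hg19/v0/Homo_sapiens_assembly19.dict",          # assumed name
    "hg19/v0/Homo_sapiens_assembly19.fasta",
    "hg19/v0/Homo_sapiens_assembly19.fasta.fai",
    "hg19/v0/Homo_sapiens_assembly19.fasta.64.bwt",
]

fasta_keys = filter_keys(listing, "fasta")
fasta_suffixes = suffixes_after(fasta_keys, ".fasta")

for key, suffix in zip(fasta_keys, fasta_suffixes):
    # An empty suffix marks the fasta file itself; the rest are companion files.
    print(f"s3://broad-references/{key}  suffix={suffix!r}")
```

Separating the suffix from the base name makes it easy to check that the fasta file and the companion files a given tool expects are all present before starting a workflow.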

Contact

This dataset is maintained and updated by the Broad Institute. For any help, please contact hensonc@broadinstitute.org.

License

This data is acquired from the NCBI GenBank site. The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank.