Downloading data from the TCGA database

 (2014-12-11 21:40:40)
Reposted

Tags: TCGA database

Category: biology

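TCGA (The Cancer Genome Atlas) is a project that systematically analyzes the genomic alterations found in different cancers. Website: http://cancergenome.nih.gov/
Click "Launch Data Portal" to reach the download page, then select the cancer data you need and download it. After you click "Download Data", four different ways of downloading and filtering the data are offered.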

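Next, consider how gene-level and isoform-level expression is computed in RNA-SeqV2. When you select RNA-SeqV2 level 3 data, note that there are two different methods for estimating gene expression and isoform expression; see URL 1, URL 2, and this URL. For details on RNA-SeqV2 itself, see URL 3.

In the .rsem.genes.results file, raw_counts is "the number of reads mapping to this gene" and scaled_estimate is the tau value. In the rsem.genes.normalized_results file, normalized_count is the upper quartile (75%) normalized RSEM count estimate. URL 4 may also be worth a look, since it describes TCGA's RNA-seq processing in detail.

URL 5 explains how RSEM computes its estimates, with the specifics at URL 6. The following two URLs also indicate that the scaled estimate value should be fine to use as a measure of gene or isoform expression: see URL 7 and URL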
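To make the two quantities above concrete, here is a minimal sketch in Python. It relies on the fact that RSEM's tau is the estimated fraction of all transcripts that belong to a gene or isoform, so multiplying scaled_estimate by 10^6 gives transcripts per million (TPM). The column names `gene_id`, `raw_count` and `scaled_estimate`, the restriction to nonzero counts, the nearest-rank percentile and the scale factor of 1000 are all assumptions made for illustration; check them against your own files and the pipeline description before relying on the numbers.

```python
import csv
import io
import math


def read_rsem_genes(text, count_column="raw_count", tau_column="scaled_estimate"):
    """Parse tab-separated .rsem.genes.results content.

    Returns a dict mapping gene_id -> (raw count, tau).
    The column names are assumptions; adjust them to match the file header.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["gene_id"]: (float(row[count_column]), float(row[tau_column]))
        for row in reader
    }


def tau_to_tpm(table):
    """Convert tau (a fraction of all transcripts) to transcripts per million."""
    return {gene: tau * 1e6 for gene, (_count, tau) in table.items()}


def upper_quartile_normalize(table, scale=1000.0):
    """Divide each raw count by the 75th percentile of the nonzero counts.

    The nearest-rank percentile and scale=1000 are illustrative assumptions,
    not a confirmed reimplementation of the TCGA pipeline.
    """
    nonzero = sorted(count for count, _tau in table.values() if count > 0)
    if not nonzero:
        raise ValueError("no nonzero counts to normalize against")
    rank = math.ceil(0.75 * len(nonzero))  # nearest-rank 75th percentile
    upper_quartile = nonzero[rank - 1]
    return {
        gene: count / upper_quartile * scale
        for gene, (count, _tau) in table.items()
    }


# Made-up example content, not real TCGA values.
example = (
    "gene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "GENE_A\t10\t0.000002\tuc_a.1\n"
    "GENE_B\t0\t0\tuc_b.1\n"
    "GENE_C\t40\t0.000005\tuc_c.1\n"
)
table = read_rsem_genes(example)
print(tau_to_tpm(table))
print(upper_quartile_normalize(table))
```

Keeping the raw count and tau together per gene makes it easy to compare the two measures for the same sample.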

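TCGA hosts a great deal of data, but the raw RNA-seq data apparently requires an access application, and applicants must meet certain conditions. URL 9 gives a view of what the data looks like. Reads are generally 48, 50, or 75 bp, paired-end.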

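In the data I downloaded from TCGA, the isoform IDs look like uc002icp.3, which should be a UCSC gene transcript ID. I then downloaded the hg19 GTF but could not find this ID in it. I suspected a difference in the hg19 annotation version (hg19 June 2011 build), but still could not find the transcript ID. Some useful data can be found at URL 9, and the information at URL 9 comes from URL 10.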
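One possible reason for such a miss is that the annotation carries the same transcript under a different version suffix. The sketch below searches GTF lines for any transcript_id that shares the base ID of the query, ignoring the trailing ".N". That the suffix is a version number, and that the GTF stores IDs in a `transcript_id "..."` attribute, are assumptions here; the function names are hypothetical helpers, not part of any TCGA or UCSC tool.

```python
import re

# Matches the transcript_id attribute in the ninth GTF column.
_TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')


def base_id(transcript_id):
    """Drop a trailing .N suffix, e.g. uc002icp.3 -> uc002icp."""
    return re.sub(r"\.\d+$", "", transcript_id)


def find_transcript_versions(gtf_lines, query_id):
    """Return every transcript_id in the GTF that shares the query's base ID."""
    wanted = base_id(query_id)
    found = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip comment and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = _TRANSCRIPT_RE.search(fields[8])
        if match and base_id(match.group(1)) == wanted:
            found.add(match.group(1))
    return found


# Made-up GTF records for illustration only.
gtf_example = [
    "# example annotation\n",
    'chr17\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "uc002icp.2";\n',
    'chr17\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g2"; transcript_id "uc010xyz.1";\n',
]
print(find_transcript_versions(gtf_example, "uc002icp.3"))
```

An empty result means the base ID is absent altogether, which points to a different annotation source rather than a version mismatch.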

URL 10 holds the file "hg19_M_rCRS.fa", which is the assembled chromosome sequence: the 24 hg19 chromosomes from UCSC plus one chrM sequence.

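For some Chinese-language introductions to TCGA, see URL 11.
Finally, the hg19 genome that TCGA maps against and the gene annotation file it uses are at URL 12. The GAF file for the hg19 June 2011 build at that URL contains all the genes (20806), the transcript IDs that correspond to UCSC genes, and of course the sequences between them.
For using TCGA data, you can refer to: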

See the tools provided by this site. A Chinese-language reference for it: http://www.howsci.com/integrative-analysis-of-complex-cancer-genomics-and-clinical-profiles-using-the-cbioportalal.html
http://www.cbioportal.org/public-portal/

The cBio Cancer Genomics Portal provides visualization, analysis and download of large-scale cancer genomics data sets.

The portal is developed and maintained by the Computational Biology Center at Memorial Sloan-Kettering Cancer Center.
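Tools that can currently extract data from the TCGA database include cBioPortal (http://www.cbioportal.org/public-portal/cgds_r.jsp), ICGC (http://dcc.icgc.org/download/current) and GenePattern (http://www.broadinstitute.org/cancer/software/genepattern/download/index). These tools still have their limitations: none of them makes it easy to obtain a two-dimensional data matrix for each cancer type (for example, genes as rows and samples as columns).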
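When the per-sample files are already downloaded, such a matrix can be assembled by hand. The following is a minimal sketch, assuming each sample has one tab-separated file with a `gene_id` column and a value column (here `normalized_count`); both column names and the helper name `build_expression_matrix` are assumptions for illustration.

```python
import csv
import io


def build_expression_matrix(sample_tables, value_column="normalized_count"):
    """Combine per-sample tables into a genes x samples matrix.

    sample_tables maps a sample barcode to the tab-separated content of
    that sample's file. Returns (genes, samples, matrix), where
    matrix[i][j] is the value of genes[i] in samples[j], or None if that
    gene is missing from the sample's file.
    """
    samples = sorted(sample_tables)
    values = {}
    for barcode in samples:
        reader = csv.DictReader(io.StringIO(sample_tables[barcode]), delimiter="\t")
        for row in reader:
            values.setdefault(row["gene_id"], {})[barcode] = float(row[value_column])
    genes = sorted(values)
    matrix = [[values[gene].get(barcode) for barcode in samples] for gene in genes]
    return genes, samples, matrix


# Made-up values for illustration only.
tables = {
    "TCGA-BH-A1FE-01": "gene_id\tnormalized_count\nGENE_A\t12.5\nGENE_B\t0\n",
    "TCGA-BH-A1FE-06": "gene_id\tnormalized_count\nGENE_A\t7.0\nGENE_C\t3.0\n",
}
genes, samples, matrix = build_expression_matrix(tables)
print(samples)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Missing genes are kept as None rather than 0 so that "not reported" stays distinguishable from "measured as zero".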

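Every sample in TCGA is independent. Two samples whose barcodes are identical except for the sample type, 06 versus 01, may still come from different tissues of the same patient. For example, for the two breast cancer samples TCGA-BH-A1FE-01 and TCGA-BH-A1FE-06, 01 means the primary tumor sample, taken from the breast, and 06 means a metastatic sample, which on checking turned out to come from the ovary. For how to look this up, see the email below: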

Each sample in TCGA is a separate sample.  I think the best place to look for site information of each sample is the pathology report.  You can find the pathology report file name (they are pdf files) in the biospecimen_sample file in the pathology_report_file_name column. Once you have the file name, you can search for it using the Bulk Download tool  at https://tcga-data.nci.nih.gov/tcga/findArchives.htm. You will want to copy and paste the pathology report file name in the File Name field and click the Find button. This will give you the directory that contains the pdf file.  You can then click the View Files link and it will display all the pathology files in that directory.  You will need to search for the pdf file you are interested in and you can open it for viewing there.

Let me give you an example.  Sample TCGA-D3-A1Q6-06A has pathology report TCGA-D3-A1Q6.5BA4EDD7-8462-4028-8CB9-8FE2DDC51D3E.pdf, and sample TCGA-D3-A1Q6-07A has pathology report TCGA-D3-A1Q6.7E74D698-CA50-40D9-8A06-6CECFD8580DA.pdf. Using the Bulk Download tool, you will find that both of these files are in directory nationwidechildrens.org_SKCM.pathology_reports.Level_1.180.9.0. You can View Files directly in the Bulk Download tool, or you can go to the Open-Access HTTP Directory at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ and drill down to the pathology reports for the disease you are interested in and find the pdf files in there.  To drill down you would select the disease you are interested in (e.g., skcm) and then select bcr, nationwidechildrens.org, pathology_reports, reports, and then select the directory that you found using the Bulk Download tool.  In that directory you will find both pdf files of interest.

If you look at the pdf file for sample TCGA-D3-A1Q6-06A, you will see in the hand-written notes at the top right, “Site: subcutaneous tissue.”  This matches diagnosis A, found at right upper arm.  If you look at the pdf file for sample TCGA-D3-A1Q6-07A, you will see the hand-written notes “Site: lymph nodes, axillary.”  This matches diagnosis D, lymph nodes.

This is probably the best place to find any details about the location of the sample in question.  Note however, that normal samples (such as the BRCA normal sample TCGA-BH-A18V-11) do not have pathology reports. Normal solid tissue samples are typically normal tissue collected adjacent to the tumor sample.
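The barcodes discussed above can be split programmatically to group samples by patient and read off the sample type. This is a small sketch covering only the sample type codes that appear in this post (01, 06, 07, 11); the helper names are hypothetical, and the label mapping is a partial table rather than the full TCGA code list.

```python
# Partial mapping, limited to the codes mentioned in this post.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "06": "Metastatic",
    "07": "Additional Metastatic",
    "11": "Solid Tissue Normal",
}


def parse_barcode(barcode):
    """Split a TCGA sample barcode such as TCGA-D3-A1Q6-06A into its parts."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    sample_field = parts[3]
    code = sample_field[:2]
    return {
        "participant": "-".join(parts[:3]),
        "sample_type_code": code,
        "sample_type": SAMPLE_TYPES.get(code, "unknown"),
        "vial": sample_field[2:] or None,  # e.g. "A"; None when absent
    }


def group_by_participant(barcodes):
    """Map each participant ID to the sorted sample type codes seen for it."""
    groups = {}
    for barcode in barcodes:
        info = parse_barcode(barcode)
        groups.setdefault(info["participant"], []).append(info["sample_type_code"])
    return {participant: sorted(codes) for participant, codes in groups.items()}


barcodes = [
    "TCGA-BH-A1FE-01",
    "TCGA-BH-A1FE-06",
    "TCGA-D3-A1Q6-06A",
    "TCGA-D3-A1Q6-07A",
    "TCGA-BH-A18V-11",
]
print(parse_barcode("TCGA-D3-A1Q6-06A"))
print(group_by_participant(barcodes))
```

A participant with more than one code is a candidate for checking the pathology reports as described in the email, since the barcode alone does not say which tissue site a metastatic sample came from.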