In this tutorial, you will analyse a set of NGS samples to find mutations in the region of the TP53 gene.
Some of the steps described here might be unnecesarry for such a simple task, but in general, they are usually required in a WGS/WES analysis pipeline.
Slides (PDF): here
Many of the files used in this notebook have already been prepared to save time and the most time consuming commands will only be run on a single file to demonstrate their usage. The intermediate files generated purely for demonstration purposes will be saved to the testResults
directory. Please leave other directories untouched, unless stated otherwise. In line with this, some of the code cells have been converted to Raw NBConvert to simply show the code, but not actually run it. If you do want to run these codes, convert these cells back to Code and run them. (This usually requires quite a bit of time.)
Prepared files and analysis tools are located in the ../../course/genomeSequencing/
folder which is not writable.
The amount of time it takes for a cell to run is indicated by the number of 🐌 icons whenever some patience is needed. (No snails means a relatively short waiting time.)
In [ ]
on their left side.In [*]
on their left side. WAIT IT OUT!In [12]
on the left side. Now it is okay to continue.#
comments).Do not hesitate to ask if you are unsure!
storageDirectory = '../../course/genomeSequencing/'
The FASTQ files of the analysed samples are located in the ../../course/genomeSequencing/FASTQ
directory. You can list these with:
!ls -ls $storageDirectory""FASTQ
We have altogether 12 FASTQ files. (The ".gz" extension means that they are compressed.) The filenames suggest, that we have 3 CRC samples and 3 adenoma (AD) samples in the dataset with their respective NAT (normal tissue adjacent to tumor) pairs.
These are not whole genome sequences, the data for a specific genomic region was extracted to make the workflow faster. In general cases, the whole genome (or whole exome) should be considered.
Before any variant calling can be performed, FASTQ files must be aligned to the appropriate reference genome. There are many versions of the human reference genome, in this tutorial, we will use the version "GRCh38" (or "hg38"). (This is the most recent assembly available, but for specific tasks it can be practical to use older versions as well.)
The FASTA format means that we are looking for a text file in which nucleotide bases are simply listed next to each other as letters. Don't get discouraged if the file you found has a ".fa.gz" extension, it simply means that once downloaded, it should also be extracted. Copy the URL of the file you found into the next cell:
# copy the URL here, make sure to keep the quote marks
URL = 'https://example.org/refgenomes/hg38/chromosomeN.fa.gz'
Now that you located the file, all is left is downloading. This can be done by running the cell below. However, given that it takes a few minutes to download, we'll skip this step for now. An already downloaded version can be found in the folder refgenome
.
(If you do want to download the file, change the cell type from Raw NBConvert to Code for the following cell and run it.)
!ls -ls refgenome
As this file also has a ".fa.gz" extension, it has to be extracted first by running:
!gunzip refgenome/refgenome.fa.gz
!ls -ls refgenome/
Before the actual alignment can happen, we have to create an index file for the reference genome with the appropriate command of samtools
.
!samtools
#!samtools [samtools_command_for_indexing_FASTA_and_other_arguments]
If everything went according to plan, you should see a refgenome.fa.fai
file in the refgenome
folder when running:
!ls -ls refgenome/
🐌 Takes about 90 seconds...
# command necessary before bwa can be run
Now the actual alignment can be performed for a single file with:
🐌 Takes about a minute...
!bwa mem refgenome/refgenome.fa $storageDirectory""FASTQ/1_NAT.fq.gz > testResults/1_NAT.sam
This command creates the alignment file in SAM format. This is a text-based format, so it's easy to look into it:
!head -4 testResults/1_NAT.sam
Most applications require the binary version of the SAM file as their input, the BAM format. One can convert a SAM file to a BAM file with:
!samtools view -bS testResults/1_NAT.sam > testResults/1_NAT.bam
Given that the alignment for all the files in our dataset would take a few minutes, we'll skip this step and use the already prepared BAM files in the ../../course/genomeSequencing/BAM
directory.
However, (if you are patient) the alignment for all files can be easily performed with: (🐌🐌🐌 takes around 10-15 minutes)
!ls -ls $storageDirectory""BAM
!samtools sort -o sortedBAM/1_NAT_sorted.bam $storageDirectory""BAM/1_NAT.bam
🐌🐌 Takes about 2-3 minutes...
# command(s) and code for sorting all BAM files
GATK is one of the most optimized state of art tools for both germline and somatic mutation calling. However, running it on many samples and whole genomes requires a lot of computer memory and time and also quite a bit of practice. Below we run the mutation detection pipelines as suggested by GATK Best practices.
The pipeline for calling germline mutations with GATK consists of many steps.
The absolute first one is to create a sequence dictionary for the reference genome with:
!java -jar $storageDirectory""tools/picard.jar CreateSequenceDictionary R=refgenome/refgenome.fa O=refgenome/refgenome.dict
It is also generally a good idea to check if our BAM files are appropriate for the HaplotypeCaller
tool before actually trying to run it. (Otherwise we might get fairly counterintuitive error messages that are hard to interpret.) This can be done with:
!java -jar $storageDirectory""tools/picard.jar ValidateSamFile \
I=sortedBAM/1_NAT_sorted.bam \
MODE=SUMMARY
Apparently all of the reads in the BAM file miss a "read group". Keep in mind, that most of the pipelines used for NGS analysis are reasonably complicated and require a lot of tweaking before they can be run without any error messages appearing. Keep working on them and use Google if something seems unclear.
This specific problem can be remedied with adding read groups to all of our BAM files by running the following cell:
🐌🐌 Takes about 3 minutes...
fnames = !ls sortedBAM/
for fn in fnames:
fnb = '_'.join(fn.split('_')[0:2])
fnrg = fnb + '_RG.bam'
!java -jar $storageDirectory""tools/picard.jar AddOrReplaceReadGroups \
INPUT=sortedBAM/$fn \
OUTPUT=RGBAM/$fnrg \
RGLB=lib1 \
RGPL=illumina \
RGPU=unit1 \
RGSM=$fnb
And finally we need to index the BAM files we wish to process:
fnames = !ls RGBAM/
for fn in fnames:
!samtools index RGBAM/$fn
When running the cell below, you should see openjdk version "1.8.*"
!java -version
Now that the preparations are (hopefully) done, the next step is to create an individual GVCF file for each sample that contains information on basically all sites of the genomic region.
Due to the extremely long runtime, we'll skip running it on all of the samples and simply use the prepared GVCF files in the ../../course/genomeSequencing/HC_GVCFs
directory:
!ls -ls $storageDirectory""HC_GVCFs/
Now we have to combine these GVCF files using the GenomicsDBImport
tool:
🐌 Takes about a minute...
!$storageDirectory""tools/gatk-4.0.10.1/gatk GenomicsDBImport \
-V $storageDirectory""HC_GVCFs/1_NAT.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/1_AD.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/2_NAT.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/2_CRC.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/3_NAT.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/3_AD.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/4_NAT.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/4_CRC.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/5_NAT.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/5_AD.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/6_NAT.raw.g.vcf.gz \
-V $storageDirectory""HC_GVCFs/6_CRC.raw.g.vcf.gz \
--genomicsdb-workspace-path my_database \
-L chr17
And finally, we can proceed with the joint genotyping of our samples. This code writes a list of variants to the testResults/jointgenotyping.vcf.gz
file by collecting information from all the analysed samples. It is also listed in the ../../course/genomeSequencing/VarCallResults
folder for further reference. (At this point, we will leave it at that, but we will come back to the interpretation of the results in a later section.)
🐌 Takes about a minute...
!$storageDirectory""tools/gatk-4.0.10.1/gatk GenotypeGVCFs \
-R refgenome/refgenome.fa \
-V gendb://my_database \
-O testResults/jointgenotyping.vcf.gz
MuTect2 essentially compares two samples (a tumor and a normal one) with each other, while also doing a comparison with a set of unrelated normal samples (referred to as a Panel of Normals or "PoN"). So if you do not already have a PoN available, you should start by creating it. You can include any number of samples in it, but your results will largely depend on how many and what type of samples you chose. It is good practice to use samples in the PoN that were sequenced and prepared in a similar manner to the samples you would like to analyse to get rid of consistent sequencing noise.
A reliable, albeit fairly slow method is to always use all normal samples in the PoN, except for the NAT pair of the investigated tumor sample. This naturally means, that a separate PoN has to be created for each of our tumor samples. (This takes a lot of time and generally, if you have many samples available, it is better to use separate normal samples for the PoN and analyse each tumor-normal pair with that to get consistent results. However, as we have a very limited number of normal samples, we will generate a separate PoN for each of our sample pairs.)
PoN generation for sample pair 1
: as we decided to use all normal samples in the PoN, except for the normal sample of the given pair, this PoN will consist of samples 2_NAT
, 3_NAT
, 4_NAT
, 5_NAT
and 6_NAT
.
The first step is to run MuTect2
in tumor_only
mode for each of the samples included in the PoN. Given that all of our NAT samples are included in 5 PoNs (out of the total 6), we should perform this step for all of them.
🐌🐌🐌🐌 As it would take more than an hour to run the cell below, we'll simply use the prepared files in the ../../course/genomeSequencing/PoN_VCFs
folder.
However, in theory, this code could be used:
!ls -ls $storageDirectory""PoN_VCFs/
Now we can generate the appropriate PoN for sample pair 1
by combining the above created VCF files for samples 2_NAT
, 3_NAT
, 4_NAT
, 5_NAT
and 6_NAT
:
!$storageDirectory""tools/gatk-4.0.10.1/gatk CreateSomaticPanelOfNormals \
-vcfs $storageDirectory""PoN_VCFs/2_NAT_pon.vcf.gz \
-vcfs $storageDirectory""PoN_VCFs/3_NAT_pon.vcf.gz \
-vcfs $storageDirectory""PoN_VCFs/4_NAT_pon.vcf.gz \
-vcfs $storageDirectory""PoN_VCFs/5_NAT_pon.vcf.gz \
-vcfs $storageDirectory""PoN_VCFs/6_NAT_pon.vcf.gz \
-O testResults/pair1_PoN.vcf.gz
Naturally, this step can be performed for all 6 sample pairs, but all the PoNs can be found already prepared in directory ../../course/genomeSequencing/PoNs
:
!ls -ls $storageDirectory""PoNs/
And finally, somatic mutation detection for all of the CRC/AD samples can be run with the following code, but output files have already been prepared to the ../../course/genomeSequencing/VarCallResults
folder to save some time.
🐌🐌🐌🐌🐌 Takes around 2 hours...
!ls -ls $storageDirectory""VarCallResults/
As a last step, MuTect2 requires to run the code below to filter out any false positives in the variant set:
(The output files are also listed in the ../../course/genomeSequencing/VarCallResults
directory for further reference.)
fnames = !ls $storageDirectory""VarCallResults/*_m.vcf.gz
for fn in fnames:
fnf = 'testResults/'+fn.split('/')[-1].split('.vcf.gz')[0] + '_filtered.vcf.gz'
!$storageDirectory""tools/gatk-4.0.10.1/gatk FilterMutectCalls \
-V $fn \
-O $fnf
Even though the above described pipelines (GATK tools) operate in a highly optimized manner, they also suffer from great computational costs. A commonly used alternative is VarScan2, that can analyse whole genomes much faster, while admittedly sacrificing precision. Here we demonstrate how somatic variants can be called using VarScan2.
The pipeline here is much more straightforward than with MuTect2, only two steps are needed to be performed: the generation of a joint pileup file for each sample pair and the actual variant calling.
Here we generate a joint pileup file for sample pair 1
with:
🐌 Takes around 90 seconds...
!samtools mpileup \
-f refgenome/refgenome.fa \
-q 30 -Q 30 -d 8000 \
RGBAM/1_NAT_RG.bam RGBAM/1_AD_RG.bam > testResults/1_joint_PUP.pup
!ls -ls RGBAM/
The rest of the pileup files are prepared in the ../../course/genomeSequencing/PUP
folder, but if necessary can be generated with:
!ls -ls $storageDirectory""PUP/
And the somatic mutations in sample 1_AD
can be recovered with:
🐌 Takes around 50 seconds...
!java -jar $storageDirectory""tools/VarScan2/VarScan.v2.4.3.jar somatic \
$storageDirectory""PUP/1_joint_PUP.pup \
testResults/1_somatic_varscan \
--mpileup 1
The rest of the sample pairs can be analysed with the following code, but output files are already stored in ../../course/genomeSequencing/VarCallResults
.
!ls -ls $storageDirectory""VarCallResults/
Some of the above described mutation detection tools produce VCF files. These have essentially similar structures, but contain different levels of information, depending on the analysis pipeline. In python, the pyvcf
package can be used to further analyse VCF files. To get access to the functions of the package, we have to import it first with:
import vcf
To have a quick look at the germline mutations found by GATK, we don't even need to use the pyvcf
package, it can be achieved by:
!zcat $storageDirectory""VarCallResults/jointgenotyping.vcf.gz | head
All of these lines start with "#" and contain general information about both the pipeline and the abbreviations used in the actual data lines. The structure of the actual variant lines is described in the last line starting with "#":
!zcat $storageDirectory""VarCallResults/jointgenotyping.vcf.gz | grep "^#" | tail -1
So each variant line contains the genomic position of the variant (#CHROM
, POS
), the reference and the alternate alleles in the position (REF
, ALT
), the quality assigned to the variant by GATK (QUAL
), the reasons for filtering the variant out if there are any (FILTER
), information about the position in all samples (INFO
), the format of the following columns (FORMAT
), and columns describing the values indicated by FORMAT
for each of the samples separately (1_AD
, 1_NAT
, 2_CRC
, 2_NAT
, 3_AD
, 3_NAT
, 4_CRC
, 4_NAT
, 5_AD
, 5_NAT
, 6_CRC
, 6_NAT
).
The actual lines containing the variants can be looked at by ignoring lines starting with "#":
!zcat $storageDirectory""VarCallResults/jointgenotyping.vcf.gz | grep -v "^#" | head
Whenever the given position is not covered in a given sample, the sample's column only contains "empty" infomation. (Basically zeros and "."-s.)
# importing plotting packages
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# importing numpy for calculations
import numpy as np
# TP53 gene location on chr17: https://ghr.nlm.nih.gov/gene/TP53#location
TP53_min = 7668402
TP53_max = 7687550
# sample names
sample_names = ['1_NAT', '1_AD', '2_NAT', '2_CRC', '3_NAT', '3_AD', '4_NAT', '4_CRC', '5_NAT', '5_AD', '6_NAT', '6_CRC']
# genotype dictionary
gt_dict = {'0/0': 1, '0/1': 2, '1/1': 3}
# preparing data matrix with NaNs
data_matrix = np.empty((len(sample_names), TP53_max-TP53_min+1))
data_matrix[:] = np.nan
# iterating through variants with the pyvcf package
vcf_reader = vcf.Reader(filename=storageDirectory+'VarCallResults/jointgenotyping.vcf.gz')
for record in vcf_reader:
# only consider position if it is in the TP53 region
if record.POS >= TP53_min and record.POS <= TP53_max:
# get genotype info for all samples
for sID, s in enumerate(sample_names):
data_matrix[sID][record.POS-TP53_min] = gt_dict.get(record.genotype(name=s)['GT'], np.nan)
# preparing colormap for plotting
from matplotlib.colors import LinearSegmentedColormap
colors = ['#E9967A', '#D8BFD8', '#66CDAA']
cmap = LinearSegmentedColormap.from_list('my_cmap', colors, N=3)
# plotting the matrix with a horizontal colorbar and annotating arrows
grid_kws = {"height_ratios": (.9, .05), "hspace": .3}
f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws, figsize=(15,6))
ax = sns.heatmap(data_matrix, ax=ax,
cbar_ax=cbar_ax,
xticklabels=False,
yticklabels=sample_names,
cbar_kws={"orientation": "horizontal", 'ticks': [1+1/3,2,3-1/3]},
cmap= cmap)
cbar_ax.set_xticklabels(['0/0', '0/1', '1/1'], fontsize=14)
ax.set_xlabel('\nTP53', fontsize=16)
cbar_ax.tick_params(size=0)
ax.annotate('chr17:7674797', xy=(7674797-TP53_min, 3), xytext=(7674797-TP53_min-2100,2),
arrowprops=dict(facecolor='black',
width=0.5,
headwidth=7,
headlength=10));
ax.annotate('chr17:7676325', xy=(7676325-TP53_min, 6), xytext=(7676325-TP53_min-2100,5),
arrowprops=dict(facecolor='black',
width=0.5,
headwidth=7,
headlength=10));
It's apparent that in many of the positions, there is either no variation in any of the samples or no information is available about these. On the other hand, in some of the positions, all samples seem to have some kind of variation instead of being strictly reference.
# command for checking the ../../share/genomeSequencing/VarCallResultsjointgenotyping.vcf.gz file
# for position chr17:7674797
# command for checking the ../../share/genomeSequencing/VarCallResultsjointgenotyping.vcf.gz file
# for position chr17:7676325
The VCF files produced by MuTect2 are essentially similar to the one analyzed above. To take a quick look at the variant lines, we can run (here we are ignoring lines starting with a "#"):
!zcat $storageDirectory""VarCallResults/1_somatic_m_filtered.vcf.gz | grep -v "^#" | head -2
VarScan2 uses a slightly different format and also saves SNVs and indels to separate files:
!cat $storageDirectory""VarCallResults/1_somatic_varscan.snp | head -3
!cat $storageDirectory""VarCallResults/1_somatic_varscan.indel | head -3
To make further analysis easier, we will load both MuTect2 and VarScan2 files to dataframes, only keeping the most important information about the variants. These will be: chromosome, position, reference allele, variant allele, normal coverage, variant allele frequency in normal, tumor coverage, variant allele frequency in tumor and "quality".
MuTect2 labels mutations that are supposedly false positives with one or more reasons for filtering in their "FILTER" field. We will only load those variants into the dataframe, where the "FILTER" field is empty.
VarScan2 labels variants either as "LOH", "Germline" or "Somatic". We will only use variants deemed to be "Somatic".
As for the "quality" field, both variant callers characterize the detected variants with some type of measure for quality. These are generally connected to probabilities of falsely detecting a somatic variant when it's merely either sequencing or mapping noise or a miscategorized germline mutation. These quality values are not directly comparable between variant callers, but serve as a guide to distinguish high-quality variants from low-quality variants among the output of a given detector.
For the dataframe formalism, we will be using the pandas
python package.
import pandas as pd
def load_mutect2(sample_pair):
lines = []
vcf_reader = vcf.Reader(filename=storageDirectory+'VarCallResults/'+sample_pair+'_somatic_m_filtered.vcf.gz')
for record in vcf_reader:
if len(record.FILTER) == 0:
lines.append([record.CHROM, record.POS, record.REF, record.ALT[0],
sum(record.samples[1]['AD']), record.samples[1]['AD'][1]/sum(record.samples[1]['AD']),
sum(record.samples[0]['AD']), record.samples[0]['AD'][1]/sum(record.samples[0]['AD']),
record.INFO['TLOD'][0]])
df = pd.DataFrame(lines, columns = ['chr', 'pos', 'ref', 'alt', 'norm_cov', 'norm_alt_freq',
'tum_cov', 'tum_alt_freq', 'qual'])
return df.sort_values(by='pos').reset_index(drop=True)
def load_varscan2(sample_pair):
tmp1 = pd.read_csv(storageDirectory+'VarCallResults/'+sample_pair+'_somatic_varscan.snp', sep='\t')
tmp2 = pd.read_csv(storageDirectory+'VarCallResults/'+sample_pair+'_somatic_varscan.indel', sep='\t')
tmp = pd.concat([tmp1, tmp2])
tmp = tmp[tmp['somatic_status'] == 'Somatic']
df = pd.DataFrame()
df['chr'] = tmp['chrom']
df['pos'] = tmp['position']
df['ref'] = tmp['ref']
df['alt'] = tmp['var']
df['norm_cov'] = tmp['normal_reads1']+tmp['normal_reads2']
df['norm_alt_freq'] = tmp['normal_reads2']/df['norm_cov']
df['tum_cov'] = tmp['tumor_reads1']+tmp['tumor_reads2']
df['tum_alt_freq'] = tmp['tumor_reads2']/df['tum_cov']
df['qual'] = -10*np.log10(tmp['somatic_p_value'])
return df.sort_values(by='pos').reset_index(drop=True)
Now we can use the above functions to load the necessary data and postprocess the results.
# list of tumor samples
tumor_samples = [k for k in sample_names if "NAT" not in k]
# lists to store mutation counts
mutect2_mut_counts = []
varscan2_mut_counts = []
for sp in [k[0] for k in tumor_samples]:
# loading mutect2 results for given tumor sample
mt_df = load_mutect2(sp)
# loading varscan2 results for given tumor sample
vs_df = load_varscan2(sp)
# mutation count equals the number of lines in the dataframe
mutect2_mut_counts.append(mt_df.shape[0])
varscan2_mut_counts.append(vs_df.shape[0])
# plotting mutation counts on a barchart
fig, ax = plt.subplots(figsize=(7,6))
index = np.arange(len(tumor_samples))
bar_width = 0.35
opacity = 1
rects1 = ax.bar(index, mutect2_mut_counts, bar_width,
alpha=opacity, color=colors[0],
label='MuTect2')
rects1 = ax.bar(index+bar_width, varscan2_mut_counts, bar_width,
alpha=opacity, color=colors[1],
label='VarScan2')
ax.set_xlabel('\nTumor sample', fontsize=14)
ax.set_ylabel('Number of somatic mutations found', fontsize=14)
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(tumor_samples)
ax.tick_params(axis='x', length=0)
ax.legend()
plt.grid(color='#f1f1f1')
plt.show()
Let's investigate the somatic mutations in sample 3_AD
in more detail:
mt_3 = load_mutect2('3')
vs_3 = load_varscan2('3')
Let's calculate the number of mutations found by both variant callers:
print('Number of mutations found by both variant callers: ' + str(len(set(mt_3['pos']).intersection(set(vs_3['pos'])))))
Let's check which mutation this is:
mutpos = list(set(mt_3['pos']).intersection(set(vs_3['pos'])))[0]
mt_3[mt_3['pos'] == mutpos]
vs_3[vs_3['pos'] == mutpos]
This single mutation is the mutation with the highest quality value in both datasets, suggesting that both MuTect2 and VarScan2 can reliably find "obvious" mutations, but disagree on ambiguous ones. It is interesting to note that VarScan2 reports a much higher coverage for this position than MuTect2. This is because GATK tools perform local realignment in active regions, which can influence local coverage.
mt_3[mt_3['pos'] != mutpos]
In some cases, a general, statistical analysis of the mutations is preferred. On the other hand, for some applications, the exact consequences of the detected mutations must be determined. For this, we will be using the ANNOVAR tool which helps with the labelling of specific mutation sites with available information from different databases.
The tool itself is located in the tools/annovar
folder. However, the databases used for annotation must be first selected from the list of available resources and then downloaded.
First we need to download the appropriate databases with the following commands:
🐌🐌🐌 Given that it can take up to 10 minutes to download, we skip these steps and use the database files located in the folder ../../course/genomeSequencing/tools/annovar/DBdownload
.
Now that all the necessary databases have been downloaded, we can prepare our input files for ANNOVAR. The general structure to be followed is shown below:
1 948921 948921 T C
1 1404001 1404001 G T
1 5935162 5935162 A T
1 162736463 162736463 C T
1 84875173 84875173 C T
The columns are: chromosome, starting position, end position, reference allele, alternate allele. We will use the following function to convert our dataframes to ANNOVAR-compatible text files:
def prepare_ANNOVAR_input(df, filename):
df['pos2'] = df['pos']
df[['chr', 'pos', 'pos2', 'ref', 'alt']].to_csv('ANNOVARinput/' + filename, sep='\t', index=False, header=False)
And now we can generate the input files with:
for ts in tumor_samples:
df = load_mutect2(sample_pair=ts[0])
prepare_ANNOVAR_input(df, ts + '_ANNOVAR_input.txt')
Finally, ANNOVAR can be run on these files with:
🐌🐌 Takes about 5-6 minutes, but output files have been already prepared to the folder ../../course/genomeSequencing/ANNOVARresults
.
We can load the resulting tables to a single dataframe with:
all_dfs = []
for ts in tumor_samples:
df = pd.read_csv(storageDirectory+'ANNOVARresults/'+ts+'.hg38_multianno.csv')
df.insert(0, 'sample', [ts]*df.shape[0])
all_dfs.append(df)
df = pd.concat(all_dfs).reset_index(drop=True)
Now we can check if there were any somatic mutations found in the TP53 gene:
df[df['Gene.refGene']=='TP53']
Apparently, samples 3_AD
, 5_AD
and 6_CRC
have a mutation in the TP53 region. All of these mutations are listed in both the dbSNP and the COSMIC datasets. The occurences of these mutations in different tissues (in the COSMIC database) can be listed with:
for p, o in zip(list(df[df['Gene.refGene']=='TP53']['Start']),
[k.split('OCCURENCE=')[1].split(',') for k in list(df[df['Gene.refGene']=='TP53']['cosmic70'])]):
print(p,o,'\n')
# code for counting mutations not categorized to specific genes
# if you have no idea about the code, count them by hand and write the answer here