An NGS analysis pipeline: investigating TP53 mutations in adenoma and CRC samples

In this tutorial, you will analyse a set of NGS samples to find mutations in the region of the TP53 gene.

Some of the steps described here might be unnecesarry for such a simple task, but in general, they are usually required in a WGS/WES analysis pipeline.

Slides (PDF): here

Warning: This is a html version of the notebook only uploaded to `https://dlab.elte.hu/index.php/bitechnologia-msc-2019-10-21-25/` for documentation purposes. All the tools and resources necessary to run the code cells are located at `https://kooplex-edu.elte.hu/`. Head over there to complete the tasks. (A short tutorial on how to do this can be found on the first few slides.)

Notebook structure

Technical note: Lines starting with a `!` are operation system (linux bash) commands (that usually call external programs). Other code lines and code cells are run in python.

Many of the files used in this notebook have already been prepared to save time and the most time consuming commands will only be run on a single file to demonstrate their usage. The intermediate files generated purely for demonstration purposes will be saved to the testResults directory. Please leave other directories untouched, unless stated otherwise. In line with this, some of the code cells have been converted to Raw NBConvert to simply show the code, but not actually run it. If you do want to run these codes, convert these cells back to Code and run them. (This usually requires quite a bit of time.)

Prepared files and analysis tools are located in the ../../course/genomeSequencing/ folder which is not writable.

Technical note: Both code and markdown cells can be run with the Shift+Enter combination.

The amount of time it takes for a cell to run is indicated by the number of 🐌 icons whenever some patience is needed. (No snails means a relatively short waiting time.)


How to use this notebook

  1. Read the text and run cells in the appropriate order. (Mixing up the order will cause errors.)
    • The cells that need to be run have In [ ] on their left side.
    • Run these cells with the Shift+Enter combination.
    • While the cell is still running, you will see In [*] on their left side. WAIT IT OUT!
    • When the cell has finished running, you will see a number in the brackets, e.g. In [12] on the left side. Now it is okay to continue.
  1. Answer/solve the tasks. The ones that require coding must be solved before you can continue.
    • Write the solution code to the cells indicated with lightblue text (# comments).
    • Once you have the solution, run the code cell. If you skip this step, you won't be able to continue running the notebook.
    • When the task requires a text as an answer (and no code), write it into the Markdown cell below the text of the task. (Double click on the Markdown cell to edit its text.)

Do not hesitate to ask if you are unsure!


Overview

  1. Basic exploration of samples, downloading the necessary data
  2. Aligning FASTQ files to a reference genome
  3. Postprocessing alignment files
  4. Variant calling with GATK and VarScan2
  5. Understanding variant caller outputs
  6. Annotating mutations

Basic exploration of samples, downloading the necessary data

Setting the path to external files

All prepared files and tools are located in different subfolders of the ../../course/genomeSequencing/ folder. Here we set this as the value of a variable for further reference:

In [ ]:
storageDirectory = '../../course/genomeSequencing/'

A quick look at the data

The FASTQ files of the analysed samples are located in the ../../course/genomeSequencing/FASTQ directory. You can list these with:

In [ ]:
!ls -ls $storageDirectory""FASTQ

We have altogether 12 FASTQ files. (The ".gz" extension means that they are compressed.) The filenames suggest, that we have 3 CRC samples and 3 adenoma (AD) samples in the dataset with their respective NAT (normal tissue adjacent to tumor) pairs.

These are not whole genome sequences, the data for a specific genomic region was extracted to make the workflow faster. In general cases, the whole genome (or whole exome) should be considered.

Downloading a reference genome

Before any variant calling can be performed, FASTQ files must be aligned to the appropriate reference genome. There are many versions of the human reference genome, in this tutorial, we will use the version "GRCh38" (or "hg38"). (This is the most recent assembly available, but for specific tasks it can be practical to use older versions as well.)

Task: Research where the TP53 gene is located on the human genome (which chromosome). Try to find the reference genome for that chromosome online and copy its URL below. Make sure to look for the correct version (GRCh38).
Hint: Use google. Always.

The FASTA format means that we are looking for a text file in which nucleotide bases are simply listed next to each other as letters. Don't get discouraged if the file you found has a ".fa.gz" extension, it simply means that once downloaded, it should also be extracted. Copy the URL of the file you found into the next cell:

In [ ]:
# copy the URL here, make sure to keep the quote marks
URL = 'https://example.org/refgenomes/hg38/chromosomeN.fa.gz'

Now that you located the file, all is left is downloading. This can be done by running the cell below. However, given that it takes a few minutes to download, we'll skip this step for now. An already downloaded version can be found in the folder refgenome.

(If you do want to download the file, change the cell type from Raw NBConvert to Code for the following cell and run it.)

!wget $URL
In [ ]:
!ls -ls refgenome

As this file also has a ".fa.gz" extension, it has to be extracted first by running:

In [ ]:
!gunzip refgenome/refgenome.fa.gz
In [ ]:
!ls -ls refgenome/

Indexing the reference genome

Before the actual alignment can happen, we have to create an index file for the reference genome with the appropriate command of samtools.

Task: Find the `samtools` command for the indexing of reference FASTA files and index the reference genome.
Hint: The help for `samtools` can be printed by running the `!samtools` command in a cell:
In [ ]:
!samtools
In [ ]:
#!samtools [samtools_command_for_indexing_FASTA_and_other_arguments]

If everything went according to plan, you should see a refgenome.fa.fai file in the refgenome folder when running:

In [ ]:
!ls -ls refgenome/

Aligning FASTQ files to the reference genome

BWA

The general tool for aligning raw sequencing data to a reference genome is bwa.

Task: Print the help for `bwa` in the next cell.

Preparations for alignment with BWA

Task: What needs to be done before using `bwa` according to the manual? Perform this task in the cell below for file `refgenome/refgenome.fa`.

🐌 Takes about 90 seconds...

In [ ]:
# command necessary before bwa can be run

Aligning a single file to the reference genome:

Now the actual alignment can be performed for a single file with:

🐌 Takes about a minute...

In [ ]:
!bwa mem refgenome/refgenome.fa $storageDirectory""FASTQ/1_NAT.fq.gz > testResults/1_NAT.sam

This command creates the alignment file in SAM format. This is a text-based format, so it's easy to look into it:

In [ ]:
!head -4 testResults/1_NAT.sam
Task: What do the letters after the sequence column mean *exactly*? Is is better to have a value of `A` or a value of `E`?

Converting SAM to BAM

Most applications require the binary version of the SAM file as their input, the BAM format. One can convert a SAM file to a BAM file with:

In [ ]:
!samtools view -bS testResults/1_NAT.sam > testResults/1_NAT.bam
Task: What does `-bS` mean in the above command?

BAM file generation for all FASTQ files

Given that the alignment for all the files in our dataset would take a few minutes, we'll skip this step and use the already prepared BAM files in the ../../course/genomeSequencing/BAM directory.

However, (if you are patient) the alignment for all files can be easily performed with: (🐌🐌🐌 takes around 10-15 minutes)

!mkdir BAM fnames = !ls $storageDirectory""FASTQ for fn in fnames: bn = fn.split('.')[0]+'.bam' !bwa mem refgenome/refgenome.fa $storageDirectory""FASTQ/$fn | samtools view -bS - > BAM/""$bn !ls -ls BAM
In [ ]:
!ls -ls $storageDirectory""BAM

Postprocessing alignment files

Sorting BAM files

Most downstream tools expect to get sorted BAM files as their input. This can be achieved with:

In [ ]:
!samtools sort -o sortedBAM/1_NAT_sorted.bam $storageDirectory""BAM/1_NAT.bam
Task: Sort the BAM files for the rest of the samples as well.
Hint: If you have no other idea, copy the above command 11-times into the next cell, change sample names and run the cell.

🐌🐌 Takes about 2-3 minutes...

In [ ]:
# command(s) and code for sorting all BAM files

Variant calling with HaplotypeCaller, MuTect2 and VarScan2

GATK

GATK is one of the most optimized state of art tools for both germline and somatic mutation calling. However, running it on many samples and whole genomes requires a lot of computer memory and time and also quite a bit of practice. Below we run the mutation detection pipelines as suggested by GATK Best practices.

Task: What is the difference between germline and somatic mutations?

GATK - HaplotypeCaller for germline mutations

The pipeline for calling germline mutations with GATK consists of many steps.

The absolute first one is to create a sequence dictionary for the reference genome with:

In [ ]:
!java -jar $storageDirectory""tools/picard.jar CreateSequenceDictionary R=refgenome/refgenome.fa O=refgenome/refgenome.dict
Hint: Check for possible errors in the BAM files beforehand!

It is also generally a good idea to check if our BAM files are appropriate for the HaplotypeCaller tool before actually trying to run it. (Otherwise we might get fairly counterintuitive error messages that are hard to interpret.) This can be done with:

In [ ]:
!java -jar $storageDirectory""tools/picard.jar ValidateSamFile \
      I=sortedBAM/1_NAT_sorted.bam \
      MODE=SUMMARY

Apparently all of the reads in the BAM file miss a "read group". Keep in mind, that most of the pipelines used for NGS analysis are reasonably complicated and require a lot of tweaking before they can be run without any error messages appearing. Keep working on them and use Google if something seems unclear.

This specific problem can be remedied with adding read groups to all of our BAM files by running the following cell:

🐌🐌 Takes about 3 minutes...

In [ ]:
fnames = !ls sortedBAM/
for fn in fnames:
    fnb = '_'.join(fn.split('_')[0:2])
    fnrg = fnb + '_RG.bam'
    !java -jar $storageDirectory""tools/picard.jar AddOrReplaceReadGroups \
        INPUT=sortedBAM/$fn \
        OUTPUT=RGBAM/$fnrg \
        RGLB=lib1 \
        RGPL=illumina \
        RGPU=unit1 \
        RGSM=$fnb

And finally we need to index the BAM files we wish to process:

In [ ]:
fnames = !ls RGBAM/
for fn in fnames:
    !samtools index RGBAM/$fn
Hint: If you run into any problems when using GATK tools, make sure you have java **version 8** installed on your computer. Anything newer than that can cause unexpected errors.

When running the cell below, you should see openjdk version "1.8.*"

In [ ]:
!java -version

Now that the preparations are (hopefully) done, the next step is to create an individual GVCF file for each sample that contains information on basically all sites of the genomic region.

Note: 🐌🐌🐌🐌 Running the cell below would take more than an hour for all of our samples. Even for a single sample it takes around 10 minutes (on this computer) and an awful lot of memory. Imagine running it on a whole genome for many samples! Using GATK is *not always practical or even feasible*.
# GVCF generation for all BAM files !mkdir HC_GVCF fnames = !ls RGBAM/ for fn in fnames: fnb = '_'.join(fn.split('_')[:2]) !$storageDirectory""tools/gatk-4.0.10.1/gatk HaplotypeCaller \ -R refgenome/refgenome.fa \ -I RGBAM/$fn \ -O HC_GVCFs/$fnb"".raw.g.vcf.gz \ -ERC GVCF

Due to the extremely long runtime, we'll skip running it on all of the samples and simply use the prepared GVCF files in the ../../course/genomeSequencing/HC_GVCFs directory:

In [ ]:
!ls -ls $storageDirectory""HC_GVCFs/

Now we have to combine these GVCF files using the GenomicsDBImport tool:

Note: Even if you analyse the whole genome, you must run the following code for each chromosome separately and adjust the `-L` option accordingly.

🐌 Takes about a minute...

In [ ]:
!$storageDirectory""tools/gatk-4.0.10.1/gatk GenomicsDBImport \
    -V $storageDirectory""HC_GVCFs/1_NAT.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/1_AD.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/2_NAT.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/2_CRC.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/3_NAT.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/3_AD.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/4_NAT.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/4_CRC.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/5_NAT.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/5_AD.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/6_NAT.raw.g.vcf.gz \
    -V $storageDirectory""HC_GVCFs/6_CRC.raw.g.vcf.gz \
    --genomicsdb-workspace-path my_database \
    -L chr17

And finally, we can proceed with the joint genotyping of our samples. This code writes a list of variants to the testResults/jointgenotyping.vcf.gz file by collecting information from all the analysed samples. It is also listed in the ../../course/genomeSequencing/VarCallResults folder for further reference. (At this point, we will leave it at that, but we will come back to the interpretation of the results in a later section.)

🐌 Takes about a minute...

In [ ]:
!$storageDirectory""tools/gatk-4.0.10.1/gatk GenotypeGVCFs \
   -R refgenome/refgenome.fa \
   -V gendb://my_database \
   -O testResults/jointgenotyping.vcf.gz

GATK - MuTect2 for somatic mutations

MuTect2 essentially compares two samples (a tumor and a normal one) with each other, while also doing a comparison with a set of unrelated normal samples (referred to as a Panel of Normals or "PoN"). So if you do not already have a PoN available, you should start by creating it. You can include any number of samples in it, but your results will largely depend on how many and what type of samples you chose. It is good practice to use samples in the PoN that were sequenced and prepared in a similar manner to the samples you would like to analyse to get rid of consistent sequencing noise.

A reliable, albeit fairly slow method is to always use all normal samples in the PoN, except for the NAT pair of the investigated tumor sample. This naturally means, that a separate PoN has to be created for each of our tumor samples. (This takes a lot of time and generally, if you have many samples available, it is better to use separate normal samples for the PoN and analyse each tumor-normal pair with that to get consistent results. However, as we have a very limited number of normal samples, we will generate a separate PoN for each of our sample pairs.)

PoN generation for sample pair 1: as we decided to use all normal samples in the PoN, except for the normal sample of the given pair, this PoN will consist of samples 2_NAT, 3_NAT, 4_NAT, 5_NAT and 6_NAT.

The first step is to run MuTect2 in tumor_only mode for each of the samples included in the PoN. Given that all of our NAT samples are included in 5 PoNs (out of the total 6), we should perform this step for all of them.

🐌🐌🐌🐌 As it would take more than an hour to run the cell below, we'll simply use the prepared files in the ../../course/genomeSequencing/PoN_VCFs folder.

However, in theory, this code could be used:

!mkdir PoN_VCFs fnames = !ls RGBAM/ for fn in fnames: if 'NAT' not in fn: continue fnb = '_'.join(fn.split('_')[:2]) !tools/gatk-4.0.10.1/gatk Mutect2 \ -R refgenome/refgenome.fa \ -I RGBAM/$fn \ -tumor $fnb \ -O PoN_VCFs/$fnb""_pon.vcf.gz
In [ ]:
!ls -ls $storageDirectory""PoN_VCFs/

Now we can generate the appropriate PoN for sample pair 1 by combining the above created VCF files for samples 2_NAT, 3_NAT, 4_NAT, 5_NAT and 6_NAT:

In [ ]:
!$storageDirectory""tools/gatk-4.0.10.1/gatk CreateSomaticPanelOfNormals \
    -vcfs $storageDirectory""PoN_VCFs/2_NAT_pon.vcf.gz \
    -vcfs $storageDirectory""PoN_VCFs/3_NAT_pon.vcf.gz \
    -vcfs $storageDirectory""PoN_VCFs/4_NAT_pon.vcf.gz \
    -vcfs $storageDirectory""PoN_VCFs/5_NAT_pon.vcf.gz \
    -vcfs $storageDirectory""PoN_VCFs/6_NAT_pon.vcf.gz \
    -O testResults/pair1_PoN.vcf.gz

Naturally, this step can be performed for all 6 sample pairs, but all the PoNs can be found already prepared in directory ../../course/genomeSequencing/PoNs:

In [ ]:
!ls -ls $storageDirectory""PoNs/

And finally, somatic mutation detection for all of the CRC/AD samples can be run with the following code, but output files have already been prepared to the ../../course/genomeSequencing/VarCallResults folder to save some time.

🐌🐌🐌🐌🐌 Takes around 2 hours...

!mkdir VarCallResults fnames = !ls RGBAM/*.bam | grep -v "NAT" for fn in fnames: tfn = '_'.join(fn.split('/')[1].split('_')[:2]) fc = tfn[0] !tools/gatk-4.0.10.1/gatk Mutect2 \ -R refgenome/refgenome.fa \ -I RGBAM/""$tfn""_RG.bam \ -I RGBAM/""$fc""_NAT_RG.bam \ -tumor $tfn \ -normal $fc""_NAT \ -pon $storageDirectory""PoNs/pair""$fc""_PoN.vcf.gz \ -L chr17 \ -O VarCallResults/""$fc""_somatic_m.vcf.gz
In [ ]:
!ls -ls $storageDirectory""VarCallResults/

As a last step, MuTect2 requires to run the code below to filter out any false positives in the variant set:

(The output files are also listed in the ../../course/genomeSequencing/VarCallResults directory for further reference.)

In [ ]:
fnames = !ls $storageDirectory""VarCallResults/*_m.vcf.gz
for fn in fnames:
    fnf = 'testResults/'+fn.split('/')[-1].split('.vcf.gz')[0] + '_filtered.vcf.gz'
    !$storageDirectory""tools/gatk-4.0.10.1/gatk FilterMutectCalls \
       -V $fn \
       -O $fnf

VarScan2 for somatic mutations

Even though the above described pipelines (GATK tools) operate in a highly optimized manner, they also suffer from great computational costs. A commonly used alternative is VarScan2, that can analyse whole genomes much faster, while admittedly sacrificing precision. Here we demonstrate how somatic variants can be called using VarScan2.

The pipeline here is much more straightforward than with MuTect2, only two steps are needed to be performed: the generation of a joint pileup file for each sample pair and the actual variant calling. Here we generate a joint pileup file for sample pair 1 with:

🐌 Takes around 90 seconds...

In [ ]:
!samtools mpileup \
    -f refgenome/refgenome.fa \
    -q 30 -Q 30 -d 8000 \
    RGBAM/1_NAT_RG.bam RGBAM/1_AD_RG.bam > testResults/1_joint_PUP.pup
In [ ]:
!ls -ls RGBAM/
Task: What is the difference between options `-q 30` and `-Q 30`?

The rest of the pileup files are prepared in the ../../course/genomeSequencing/PUP folder, but if necessary can be generated with:

!mkdir PUP fnames = !ls RGBAM/*.bam | grep -v "NAT" for fn in fnames: fnb = '_'.join(fn.split('/')[1].split('_')[:2]) fc = fnb[0] !samtools mpileup \ -f refgenome/refgenome.fa \ -q 30 -Q 30 -d 8000 \ RGBAM/""$fc""_NAT_RG.bam RGBAM/""$fnb""_RG.bam > PUP/""$fc""_joint_PUP.pup
In [ ]:
!ls -ls $storageDirectory""PUP/

And the somatic mutations in sample 1_AD can be recovered with:

🐌 Takes around 50 seconds...

In [ ]:
!java -jar $storageDirectory""tools/VarScan2/VarScan.v2.4.3.jar somatic \
    $storageDirectory""PUP/1_joint_PUP.pup \
    testResults/1_somatic_varscan \
    --mpileup 1

The rest of the sample pairs can be analysed with the following code, but output files are already stored in ../../course/genomeSequencing/VarCallResults.

fnames = !ls $storageDirectory""PUP/ for fn in fnames: fc = fn[0] !java -jar $storageDirectory""tools/VarScan2/VarScan.v2.4.3.jar somatic \ $storageDirectory""PUP/""$fc""_joint_PUP.pup \ VarCallResults/""$fc""_somatic_varscan \ --mpileup 1
In [ ]:
!ls -ls $storageDirectory""VarCallResults/

Understanding variant caller outputs

Dealing with VCF files

Some of the above described mutation detection tools produce VCF files. These have essentially similar structures, but contain different levels of information, depending on the analysis pipeline. In python, the pyvcf package can be used to further analyse VCF files. To get access to the functions of the package, we have to import it first with:

In [ ]:
import vcf

Germline mutations of GATK HaplotypeCaller

To have a quick look at the germline mutations found by GATK, we don't even need to use the pyvcf package, it can be achieved by:

In [ ]:
!zcat $storageDirectory""VarCallResults/jointgenotyping.vcf.gz | head

All of these lines start with "#" and contain general information about both the pipeline and the abbreviations used in the actual data lines. The structure of the actual variant lines is described in the last line starting with "#":

In [ ]:
!zcat $storageDirectory""VarCallResults/jointgenotyping.vcf.gz | grep "^#" | tail -1

So each variant line contains the genomic position of the variant (#CHROM, POS), the reference and the alternate alleles in the position (REF, ALT), the quality assigned to the variant by GATK (QUAL), the reasons for filtering the variant out if there are any (FILTER), information about the position in all samples (INFO), the format of the following columns (FORMAT), and columns describing the values indicated by FORMAT for each of the samples separately (1_AD, 1_NAT, 2_CRC, 2_NAT, 3_AD, 3_NAT, 4_CRC, 4_NAT, 5_AD, 5_NAT, 6_CRC, 6_NAT).

The actual lines containing the variants can be looked at by ignoring lines starting with "#":

In [ ]:
!zcat $storageDirectory""VarCallResults/jointgenotyping.vcf.gz | grep -v "^#" | head

Whenever the given position is not covered in a given sample, the sample's column only contains "empty" infomation. (Basically zeros and "."-s.)

Task: What does not covered mean above?
Example: Let's create a figure that shows in each genomic position if the given sample is "homozygous reference" (0/0), "homozygous alternate" (1/1), "heterozygous" (0/1) or "no information is available" for all samples. Let's confine the investigation to the TP53 gene.
In [ ]:
# importing plotting packages
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# importing numpy for calculations
import numpy as np
In [ ]:
# TP53 gene location on chr17: https://ghr.nlm.nih.gov/gene/TP53#location
TP53_min = 7668402
TP53_max = 7687550
In [ ]:
# sample names
sample_names = ['1_NAT', '1_AD', '2_NAT', '2_CRC', '3_NAT', '3_AD', '4_NAT', '4_CRC', '5_NAT', '5_AD', '6_NAT', '6_CRC']
# genotype dictionary
gt_dict = {'0/0': 1, '0/1': 2, '1/1': 3}
In [ ]:
# preparing data matrix with NaNs
data_matrix = np.empty((len(sample_names), TP53_max-TP53_min+1))
data_matrix[:] = np.nan

# iterating through variants with the pyvcf package
vcf_reader = vcf.Reader(filename=storageDirectory+'VarCallResults/jointgenotyping.vcf.gz')
for record in vcf_reader:
    # only consider position if it is in the TP53 region
    if record.POS >= TP53_min and record.POS <= TP53_max:
        # get genotype info for all samples
        for sID, s in enumerate(sample_names):
            data_matrix[sID][record.POS-TP53_min] = gt_dict.get(record.genotype(name=s)['GT'], np.nan)
In [ ]:
# preparing colormap for plotting
from matplotlib.colors import LinearSegmentedColormap
colors = ['#E9967A', '#D8BFD8', '#66CDAA']
cmap = LinearSegmentedColormap.from_list('my_cmap', colors, N=3)
In [ ]:
# plotting the matrix with a horizontal colorbar and annotating arrows
grid_kws = {"height_ratios": (.9, .05), "hspace": .3}
f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws, figsize=(15,6))
ax = sns.heatmap(data_matrix, ax=ax,
                  cbar_ax=cbar_ax,
                  xticklabels=False,
                  yticklabels=sample_names,
                  cbar_kws={"orientation": "horizontal", 'ticks': [1+1/3,2,3-1/3]},
                    cmap= cmap)
cbar_ax.set_xticklabels(['0/0', '0/1', '1/1'], fontsize=14)
ax.set_xlabel('\nTP53', fontsize=16)
cbar_ax.tick_params(size=0)
ax.annotate('chr17:7674797', xy=(7674797-TP53_min, 3), xytext=(7674797-TP53_min-2100,2), 
            arrowprops=dict(facecolor='black', 
                            width=0.5, 
                            headwidth=7,
                            headlength=10));
ax.annotate('chr17:7676325', xy=(7676325-TP53_min, 6), xytext=(7676325-TP53_min-2100,5), 
            arrowprops=dict(facecolor='black', 
                            width=0.5, 
                            headwidth=7,
                            headlength=10));

It's apparent that in many of the positions, there is either no variation in any of the samples or no information is available about these. On the other hand, in some of the positions, all samples seem to have some kind of variation instead of being strictly reference.

Task: In the position chr17:7674797 (highlighted with an arrow on the figure above), all of the samples seem to be either heterozygous or homozygous non-reference. Find the reference and alternate alleles in this position. Can you find any traces of the presence of this variation in other sample sets?
Hint: Try using the UCSC Genome Browser.
In [ ]:
# command for checking the ../../share/genomeSequencing/VarCallResultsjointgenotyping.vcf.gz file 
# for position chr17:7674797
Task: What about chr17:7676325?
In [ ]:
# command for checking the ../../share/genomeSequencing/VarCallResultsjointgenotyping.vcf.gz file 
# for position chr17:7676325

Comparison of somatic mutations found by MuTect2 and VarScan2

The VCF files produced by MuTect2 are essentially similar to the one analyzed above. To take a quick look at the variant lines, we can run (here we are ignoring lines starting with a "#"):

In [ ]:
!zcat $storageDirectory""VarCallResults/1_somatic_m_filtered.vcf.gz | grep -v "^#" | head -2

VarScan2 uses a slightly different format and also saves SNVs and indels to separate files:

In [ ]:
!cat $storageDirectory""VarCallResults/1_somatic_varscan.snp | head -3
In [ ]:
!cat $storageDirectory""VarCallResults/1_somatic_varscan.indel | head -3

To make further analysis easier, we will load both MuTect2 and VarScan2 files to dataframes, only keeping the most important information about the variants. These will be: chromosome, position, reference allele, variant allele, normal coverage, variant allele frequency in normal, tumor coverage, variant allele frequency in tumor and "quality".

MuTect2 labels mutations that are supposedly false positives with one or more reasons for filtering in their "FILTER" field. We will only load those variants into the dataframe, where the "FILTER" field is empty.

VarScan2 labels variants either as "LOH", "Germline" or "Somatic". We will only use variants deemed to be "Somatic".

As for the "quality" field, both variant callers characterize the detected variants with some type of measure for quality. These are generally connected to probabilities of falsely detecting a somatic variant when it's merely either sequencing or mapping noise or a miscategorized germline mutation. These quality values are not directly comparable between variant callers, but serve as a guide to distinguish high-quality variants from low-quality variants among the output of a given detector.

For the dataframe formalism, we will be using the pandas python package.

In [ ]:
import pandas as pd
In [ ]:
def load_mutect2(sample_pair):
    lines = []
    vcf_reader = vcf.Reader(filename=storageDirectory+'VarCallResults/'+sample_pair+'_somatic_m_filtered.vcf.gz')
    for record in vcf_reader:
        if len(record.FILTER) == 0:
            lines.append([record.CHROM, record.POS, record.REF, record.ALT[0], 
                         sum(record.samples[1]['AD']), record.samples[1]['AD'][1]/sum(record.samples[1]['AD']),
                         sum(record.samples[0]['AD']), record.samples[0]['AD'][1]/sum(record.samples[0]['AD']),
                         record.INFO['TLOD'][0]])

    df = pd.DataFrame(lines, columns = ['chr', 'pos', 'ref', 'alt', 'norm_cov', 'norm_alt_freq', 
                                      'tum_cov', 'tum_alt_freq', 'qual'])  
    return df.sort_values(by='pos').reset_index(drop=True)
In [ ]:
def load_varscan2(sample_pair):
    tmp1 = pd.read_csv(storageDirectory+'VarCallResults/'+sample_pair+'_somatic_varscan.snp', sep='\t')
    tmp2 = pd.read_csv(storageDirectory+'VarCallResults/'+sample_pair+'_somatic_varscan.indel', sep='\t')
    tmp = pd.concat([tmp1, tmp2])
    tmp = tmp[tmp['somatic_status'] == 'Somatic']
    df = pd.DataFrame()
    df['chr'] = tmp['chrom']
    df['pos'] = tmp['position']
    df['ref'] = tmp['ref']
    df['alt'] = tmp['var']
    df['norm_cov'] = tmp['normal_reads1']+tmp['normal_reads2']
    df['norm_alt_freq'] = tmp['normal_reads2']/df['norm_cov']
    df['tum_cov'] = tmp['tumor_reads1']+tmp['tumor_reads2']
    df['tum_alt_freq'] = tmp['tumor_reads2']/df['tum_cov']
    df['qual'] = -10*np.log10(tmp['somatic_p_value'])
    return df.sort_values(by='pos').reset_index(drop=True)

Now we can use the above functions to load the necessary data and postprocess the results.

Example: Let's create a figure that shows the number of detected somatic mutations by both MuTect2 and VarScan2 for each CRC and AD sample.
In [ ]:
# list of tumor samples
tumor_samples = [k for k in sample_names if "NAT" not in k]
# lists to store mutation counts
mutect2_mut_counts = []
varscan2_mut_counts = []

for sp in [k[0] for k in tumor_samples]:
    # loading mutect2 results for given tumor sample
    mt_df = load_mutect2(sp)
    # loading varscan2 results for given tumor sample
    vs_df = load_varscan2(sp)
    # mutation count equals the number of lines in the dataframe
    mutect2_mut_counts.append(mt_df.shape[0])
    varscan2_mut_counts.append(vs_df.shape[0])
In [ ]:
# plotting mutation counts on a barchart
fig, ax = plt.subplots(figsize=(7,6))

index = np.arange(len(tumor_samples))
bar_width = 0.35
opacity = 1

rects1 = ax.bar(index, mutect2_mut_counts, bar_width,
                alpha=opacity, color=colors[0],
                label='MuTect2')

rects1 = ax.bar(index+bar_width, varscan2_mut_counts, bar_width,
                alpha=opacity, color=colors[1],
                label='VarScan2')

ax.set_xlabel('\nTumor sample', fontsize=14)
ax.set_ylabel('Number of somatic mutations found', fontsize=14)
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(tumor_samples)
ax.tick_params(axis='x', length=0)
ax.legend()
plt.grid(color='#f1f1f1')
plt.show()
Task: What tendency can you detect from the figure?

Let's investigate the somatic mutations in sample 3_AD in more detail:

In [ ]:
mt_3 = load_mutect2('3')
vs_3 = load_varscan2('3')

Let's calculate the number of mutations found by both variant callers:

In [ ]:
print('Number of mutations found by both variant callers: ' + str(len(set(mt_3['pos']).intersection(set(vs_3['pos'])))))

Let's check which mutation this is:

In [ ]:
mutpos = list(set(mt_3['pos']).intersection(set(vs_3['pos'])))[0]
mt_3[mt_3['pos'] == mutpos]
In [ ]:
vs_3[vs_3['pos'] == mutpos]

This single mutation is the mutation with the highest quality value in both datasets, suggesting that both MuTect2 and VarScan2 can reliably find "obvious" mutations, but disagree on ambiguous ones. It is interesting to note that VarScan2 reports a much higher coverage for this position than MuTect2. This is because GATK tools perform local realignment in active regions, which can influence local coverage.

A very important note: Any result you get largely depends on the variant caller you used. Comparing results of different pipelines is rarely wise and requires a great deal of caution. It is also incredibly difficult to keep track of all filtering steps used by a specific mutation caller, thus you have no way of knowing how the final list of mutations were truly produced.
Task: The mutations only found by MuTect2 (and not VarScan2) can be listed in a table when you run the cell below. Why do you think VarScan2 missed these mutations?
Hint: VarScan2 uses hard filtering for parameter values. You can check the actual filters used by scrolling back to the section where we ran VarScan2 on the samples.
In [ ]:
mt_3[mt_3['pos'] != mutpos]

Annotating mutations

In some cases, a general, statistical analysis of the mutations is preferred. On the other hand, for some applications, the exact consequences of the detected mutations must be determined. For this, we will be using the ANNOVAR tool which helps with the labelling of specific mutation sites with available information from different databases.

The tool itself is located in the tools/annovar folder. However, the databases used for annotation must be first selected from the list of available resources and then downloaded.

Example: Let's annotate somatic mutations found by MuTect2 for all CRC/AD samples with the COSMIC database and dbSNP build 150, while also labelling them with gene information.

First we need to download the appropriate databases with the following commands:

🐌🐌🐌 Given that it can take up to 10 minutes to download, we skip these steps and use the database files located in the folder ../../course/genomeSequencing/tools/annovar/DBdownload.

!perl $storageDirectory""tools/annovar/annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene $storageDirectory""tools/annovar/DBdownload/!perl $storageDirectory""tools/annovar/annotate_variation.pl -buildver hg38 -downdb -webfrom annovar cosmic70 $storageDirectory""tools/annovar/DBdownload/!perl $storageDirectory""tools/annovar/annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp150 $storageDirectory""tools/annovar/DBdownload/

Now that all the necessary databases have been downloaded, we can prepare our input files for ANNOVAR. The general structure to be followed is shown below:

1 948921 948921 T C 1 1404001 1404001 G T 1 5935162 5935162 A T 1 162736463 162736463 C T 1 84875173 84875173 C T

The columns are: chromosome, starting position, end position, reference allele, alternate allele. We will use the following function to convert our dataframes to ANNOVAR-compatible text files:

In [ ]:
def prepare_ANNOVAR_input(df, filename):
    df['pos2'] = df['pos']
    df[['chr', 'pos', 'pos2', 'ref', 'alt']].to_csv('ANNOVARinput/' + filename, sep='\t', index=False, header=False)

And now we can generate the input files with:

In [ ]:
for ts in tumor_samples:
    df = load_mutect2(sample_pair=ts[0])
    prepare_ANNOVAR_input(df, ts + '_ANNOVAR_input.txt')

Finally, ANNOVAR can be run on these files with:

🐌🐌 Takes about 5-6 minutes, but output files have been already prepared to the folder ../../course/genomeSequencing/ANNOVARresults.

!mkdir ANNOVARresults for ts in tumor_samples: !perl $storageDirectory""tools/annovar/table_annovar.pl \ ANNOVARinput/""$ts""_ANNOVAR_input.txt \ $storageDirectory""tools/annovar/DBdownload/ \ -buildver hg38 \ -out ANNOVARresults/""$ts"" \ -remove \ -protocol refGene,avsnp150,cosmic70 \ -operation gx,f,f \ -nastring . \ -csvout -polish -xref $storageDirectory""tools/annovar/example/gene_xref.txt

We can load the resulting tables to a single dataframe with:

In [ ]:
all_dfs = []
for ts in tumor_samples:
    df = pd.read_csv(storageDirectory+'ANNOVARresults/'+ts+'.hg38_multianno.csv')
    df.insert(0, 'sample', [ts]*df.shape[0])
    all_dfs.append(df)
df = pd.concat(all_dfs).reset_index(drop=True)

Now we can check if there were any somatic mutations found in the TP53 gene:

In [ ]:
df[df['Gene.refGene']=='TP53']

Apparently, samples 3_AD, 5_AD and 6_CRC have a mutation in the TP53 region. All of these mutations are listed in both the dbSNP and the COSMIC datasets. The occurences of these mutations in different tissues (in the COSMIC database) can be listed with:

In [ ]:
for p, o in zip(list(df[df['Gene.refGene']=='TP53']['Start']), 
                [k.split('OCCURENCE=')[1].split(',') for k in list(df[df['Gene.refGene']=='TP53']['cosmic70'])]):
    print(p,o,'\n')
Task: How many of the somatic mutations were not categorized to a specific gene?
In [ ]:
# code for counting mutations not categorized to specific genes
# if you have no idea about the code, count them by hand and write the answer here