NGS one-liner to call variants

22 Oct 2023 — Barış Salman

Links

1. Introduction

⊕

We can sequence genomes chunk by chunk. This sequencing creates millions of these “short reads” which we need to assembly back into a genome or align back to a reference if we already done the assembly before. There are different approaches to both of this and this post concerned with the latter.

This process is performed by various tools and in many steps with a lot of commands. There have been a lot of work put in to making this process more streamlined with workflow managers. This is not one of them. Purpose of this exercise is to investigate shell pipes and is purely educational.

It is made for capture/amplicon short read sequencing in mind for human DNA and tested with reference exome sequencing data described here. As I start building and testing the pipeline I had a few problems as well as questions but first lets look at the command itself.

2. One-liner

2.1. Pipeline

# fastp --in1 "$R1" --in2 "$R2" --stdout --html $OUTPUT.fastp.html --json $OUTPUT.fastp.json 2>$OUTPUT.$$.fastp.log |
bwa mem -t "$THREADS" -R "@RG\tID:$SAMPLE\tSM:$SAMPLE\tPL:illumina\tLB:lib1\tPU:foo" "$REFERENCE" "$R1" "$R2" 2>"$OUTPUT.$$.bwa.log" |
│   # pv -cN bwa -s "$(gzip -l "$R1" "$R2" | awk '{print $2}' | tail -n1)" |
│   samtools collate -@ "$THREADS" -O - |
│   samtools fixmate -@ "$THREADS" -m - - |
│   samtools sort -@ "$THREADS" - 2>"$OUTPUT.$$.samtools.log" |
│   # pv -cN samtools_sort |
│   ([ "$AMPLICON" = "NO" ] && samtools markdup -@ "$THREADS" - - || cat) |
│   # pv -cN samtools_markdup |
│   samtools view -C -T "$REFERENCE" - |
│   tee "$OUTPUT.cram" |
│   bcftools mpileup --threads "$THREADS" -Ou -A -T "$TARGET" -d 10000 -L 10000 -a "FORMAT/AD,FORMAT/DP" -f "$REFERENCE" - 2>>"$OUTPUT.$$.bcftools.log" |
│   # pv -cN bcftools_mpileup |
│   bcftools call --threads "$THREADS" -Ou --ploidy "$ASSEMBLY" -mv |
│   # pv -cN bcftools_call |
│   bcftools view -i 'FORMAT/DP>5&&QUAL>5' |
│   bcftools norm --threads "$THREADS" -Ou -m-any --check-ref w -f "$REFERENCE" 2>>"$OUTPUT.$$.bcftools.log" |
│   bcftools +fill-tags -Ou -- -t all 2>>"$OUTPUT.$$.bcftools.log" |
│   bcftools +setGT -Ou -- -t q -n c:'0/1' -i 'VAF>=.1' 2>>"$OUTPUT.$$.bcftools.log" |
│   bcftools +setGT -Ov -- -t q -n c:'1/1' -i 'VAF>=.75' 2>>"$OUTPUT.$$.bcftools.log" |
│   /home/bar/ensembl-vep/vep --everything --force_overwrite --vcf --pick --format vcf \
│   │   --fork $THREADS \
│   │   --stats_file "$OUTPUT"_summary.html \
│   │   --warning_file "$OUTPUT"_warnings.txt \
│   │   --output_file STDOUT --cache 2>"$OUTPUT.$$.vep.log" |
│   # pv -cN vep |
│   bcftools +split-vep -c SYMBOL,gnomADg_AF:Float,IMPACT,Existing_variation 2>>"$OUTPUT.$$.bcftools.log" |
│   bcftools filter --threads "$THREADS" -Ou -m+ -s 'onTarget' -M "$TARGET" |
│   bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowQual' -g3 -G10 -e 'FORMAT/DP<=15 || QUAL<=20' |
│   bcftools filter --threads "$THREADS" -Ou -m+ -s 'highFreq' -e 'gnomADg_AF>0.001' |
│   bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowIMPACT' -i 'IMPACT~"HIGH" || IMPACT~"MODERATE"' |
│   bcftools filter --threads "$THREADS" -Ou -m+ -s 'HOMrare' -e 'GT="1/1" && (gnomADg_AF <= 0.001 || (Existing_variation="." && gnomADg_AF="." && ID="."))' |
│   bcftools filter --threads "$THREADS" -Ob -m+ -s 'HETnovel' -e 'GT="0/1" && Existing_variation="." && gnomADg_AF="." && ID="."' |
│   tee "$OUTPUT.bcf" |
│   bcftools +split-vep -f "$columns" -d -i 'CANONICAL~"YES"' 2>>"$OUTPUT.$$.bcftools.log" |
│   awk -v header="$header" 'BEGIN {printf  header} 1' |
│   gzip -c >"$OUTPUT.tsv.gz"

2.2. Breakdown

Preprocessing fastq

fastp with the --stdout parameter writes interleaved output. We are just using default operations. Because of different results its commented out and not being used.
Alignment

bwa can take interleaved input from stdin with `-p` option. This was used to get the input from fastp stdout. We are just starting from bwa by giving the both fastq files as inputs.
Processing aligned reads

Samtools collate, fixmate, sort, markdup submodules are used. We use samtools to mark duplicate reads. http://www.htslib.org/doc/samtools-markdup.html markdup has a conditional check beforehand for a variable named AMPLICON which if true skips the markdup and just cats the alignments.
Writing out the alignment file

samtools view is used to convert the bam file to cram. We can use tee to write out the cram file while passing it down the pipeline.
Variant calling

mpileup has -d 10000 -L 10000 parameters in case there are high depth amplicon sequencing. Beware the 4.4 commands before the call.
Post processing variants

Check out this section to understand rationale behind this view command. Variant normalization is performed by bcftools norm. In fill-tags command the specific tag I need is the variant allele fraction (VAF). We later use the VAF with setGT heterozygous and setGT homozygous commands.

We are using -Ov before the VEP command because I had problem with VEP reading compressed from stdin. https://github.com/Ensembl/ensembl-vep/issues/1502
annotating variants

A bare bone vep command. This brings a lot of annotation with the --everything= flag.
filtering variants

Here we use split-vep and add some of the columns we’re interested in into the INFO column before filtering. Filtering expressions depends on what we’re trying to achieve, I just put the ones I like. 28 29 30 31 32 33

Here no variant is excluded but the tags we give with the -s parameters are appended to the FILTER column. After the filtering we write bcf with the tee command.
writing out a tsv file

We define the columns we want to have in the final table columns and use split-vep to format it into a table. We can add these columns as the first line with awk after we format it as a header. We can than use gzip to keep it compressed.

3. Results and highlights

As result this report is created with multiqc. It would be better make a hard filtered vcf and create the report that one but since this test already has small number of variants it wasn’t required.

Test the command by running it multiple times.
Count the data in intermediary steps (md5sum, mapped and paired reads, number of variants etc.).

4. Auxiliary tasks and Embellishments

4.1. setting up the environment

These are the variable we later use in the pipeline. We can change R1 and R2 and sample for changing the input. We can edit columns to customize the output.

Assembly is used in the VEP command and if we want to use GRCh37 we need to install the GRCh37 cache and also need to change the reference genome.

BASEDIR=$(dirname "$(realpath "$0")")

THREADS=16
ASSEMBLY="GRCh38"
REFERENCE="$HOME/reference/GRCh38/Homo_sapiens_assembly38.fasta"
TARGET="$HOME/reference/GRCh38/hg38.refGene.exon.bed"
AMPLICON="NO"

R1="$HOME/sample/U0a_CGATGT_L001_R1_005.fastq.gz"
R2="$HOME/sample/U0a_CGATGT_L001_R2_005.fastq.gz"
SAMPLE="U0a"

OUTPUT_DIR="results.$$"

OUTPUT="$OUTPUT_DIR/$SAMPLE"
columns="[%SAMPLE]\t%CHROM\t%POS\t%REF\t%ALT\t%ID\t%FILTER[\t%GT\t%VAF\t%AD\t%DP]\t%Consequence\t%IMPACT\t%SYMBOL\t%Feature\t%EXON\t%INTRON\t%HGVSc\t%HGVSp\t%cDNA_position\t%CDS_position\t%Protein_position\t%Amino_acids\t%Codons\t%Existing_variation\t%MANE_SELECT\t%MANE_PLUS_CLINICAL\t%GENE_PHENO\t%SIFT\t%PolyPhen\t%DOMAINS\t%AF\t%gnomADe_AF\t%gnomADg_AF\t%MAX_AF\t%MAX_AF_POPS\t%CLIN_SIG\t%PHENO\t%PUBMED\t%CANONICAL\n"
header=$(echo "$columns" | sed "s/%//g;s/\[//g;s/\]//g")

mkdir -p "$OUTPUT_DIR"

cp "$0" "$OUTPUT_DIR"

Here we copy the script to result directory to log the command we are running.

4.2. software versions

Multiqc can show these in a neat table. We just need some formating.

{
│   printf 'oneliner: "%s"\n' "$VERSION";
│   printf 'fastp: "%s"\n' "$(fastp 2>&1 | grep version | cut -d " " -f 2)"
│   printf 'bwa: "%s"\n' "$(bwa 2>&1 | grep Version | cut -d: -f2)"
│   printf 'samtools: "%s"\n' "$(samtools version | sed 1q | cut -d " " -f 2)"
│   printf 'bcftools: "%s"\n' "$(bcftools version | sed 1q | cut -d " " -f 2)"
│   printf 'ensembl-vep: "%s"\n' "$(~/ensembl-vep/vep | grep ensembl-vep | cut -d : -f 2)"
│   printf 'bedtools: "%s"\n' "$(bedtools --version | cut -d " " -f2)"
│   (
│   │   echo "annotation_sources:"
│   │   ~/ensembl-vep/vep --show_cache_info | sed 's/\s/: "/;s/$/"/;s/^/    /'
│   )
} > "$OUTPUT_DIR"/oneliner_mqc_versions.yaml

4.3. Getting the stats and creating reports

IGV-reports create easy to browse variant list with alignments. We can add this html to our final html report. At last the multiqc brings all together with custom plots and software versions.

samtools index -@ $THREADS "$OUTPUT.cram"
bcftools index "$OUTPUT.bcf"

. ~/venv/bin/activate

create_report "$OUTPUT".bcf \
│   http://igv-genepattern-org.s3.amazonaws.com/genomes/seq/hg38/hg38.fa \
│   --genome hg38 --flanking 1000 \
│   --sample-columns GT AD DP VAF \
│   --info-columns SYMBOL gnomADg_AF IMPACT Existing_variation \
│   --tracks "$OUTPUT".cram --output "$OUTPUT"_mqc.html
│
samtools stats --reference "$REFERENCE" "$OUTPUT.cram" >"$OUTPUT.cram.stats"
samtools idxstats "$OUTPUT.cram" >"$OUTPUT.cram.idxstats"
samtools flagstat "$OUTPUT.cram" >"$OUTPUT.cram.flagstat"
bcftools stats "$OUTPUT.bcf" >"$OUTPUT.bcf.stats"
"$BASEDIR"/plot_resource_usage.R "$OUTPUT_DIR"
multiqc -f -s -o "$OUTPUT_DIR" "$OUTPUT_DIR"

4.4. pv

pipe viewer is a command line utility that can show a progress bar as data pipe through it. In its manual it uses -s (size) option with `du` command, in our case since our data is compressed we use `gzip -l`. After the first `pv`, it is harder to find out the amount of data passing through so others won’t be accurate but it can still be used to show how much time is passed. Its -S parameter should not be used since it will stop the input prematurely.

They’re commented out because my shell acts weird after the command finishes.

4.5. Monitoring the resource usage

We can start this process in background before the pipeline starts. This gets cpu, memory percentages with `ps` command and file sizes with `du` command every 5 seconds.

monitor_resources() {
│   while ps $$ >/dev/null; do
│   │   du -b $OUTPUT_DIR/* | sed "s#^#$(date +%Y/%m/%d/%H:%M:%S) #" >>"$OUTPUT_DIR/file_sizes.$$.log"
│   │   ps --ppid $$ --forest --no-heading -o %cpu,%mem,cmd 2>/dev/null |
│   │   │   cut -d " " -f 1-6 |
│   │   │   sed "s#^#$(date +%Y/%m/%d/%H:%M:%S) #" |
│   │   │   grep -v "CMD\|pv\|ps" |
│   │   │   awk '{print $4"_"$5"\t"$1"\t"$2"\t"$3}' >>"$OUTPUT_DIR/resources.$$.log"
│   │   sleep 5
│   done
}
monitor_resources &

4.5.1. Plotting the resource usage and file sizes

This script creates the resource usage file sizes plots using the metrics created at above section. These images can later be included in the multiqc report.

library(ggplot2)

args = commandArgs(trailingOnly=TRUE)

if (length(args)==0) {
│ run_dir <- "."
} else if (length(args)==1) {
│ run_dir <- args[1]
}

num_x_ticks <- 66

plot_resource_usage <- function(log_path) {
│ data <- read.table(log_path)
│ mem <- data[c("V1", "V2", "V4")]
│ mem$V5 <- "MEM"
│ cpu <- data[c("V1", "V2", "V3")]
│ cpu$V5 <- "CPU"
│
│ colnames(mem) <- c("cmd", "time", "percent", "type")
│ colnames(cpu) <- c("cmd", "time", "percent", "type")
│
│ data <- rbind(cpu, mem)
│
│ major_tasks <- c(
│   "bwa_mem",
│   "samtools_sort",
│   "bcftools_mpileup",
│   "samtools_markdup",
│   "vep"
│ )
│ data <- data[grepl(paste(major_tasks, collapse = "|"), data$cmd), ]
│
│ data$time <- as.POSIXct(data$time, format = "%Y/%m/%d/%H:%M:%S")
│
│ date_breaks <- paste(
│   signif(
│   │ as.numeric(
│   │   difftime(max(data$time), min(data$time), units = "secs") / num_x_ticks
│   │ ),
│   │ 2
│   ),
│   "sec"
│ )
│
│ ggplot(
│   data,
│   aes(x = time, y = percent, color = cmd, group = cmd, linetype = cmd)
│ ) +
│   facet_wrap(~type, nrow = 2, scales="free_y") +
│   geom_line() +
│   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
│   ggtitle(log_path) +
│   scale_linetype_manual(values = rep(c(
│   │ "solid", "longdash", "twodash",
│   │ "dashed", "dotdash", "dotted", "solid"
│   ), 3)) +
│   scale_x_datetime(date_breaks = date_breaks)
}


plot_file_sizes <- function(log_path) {
│ num_x_ticks <- 67
│ data <- read.table(log_path)
│ colnames(data) <- c("time", "size", "file")
│ data$time <- as.POSIXct(data$time, format = "%Y/%m/%d/%H:%M:%S")
│
│ major_files <- c("cram$", "bcf$", "tsv$")
│
│ data <- data[grepl(paste(major_files, collapse = "|"), data$file), ]
│
│ date_breaks <- paste(
│   signif(
│   │ as.numeric(
│   │   difftime(max(data$time), min(data$time), units = "secs") / num_x_ticks
│   │ ),
│   │ 2
│   ),
│   "sec"
│ )
│
│ ggplot(
│   data,
│   aes(x = time, y = size, color = file, group = file, linetype = file)
│ ) +
│   geom_line() +
│   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
│   ggtitle(log_path) +
│   scale_linetype_manual(
│   │ values = rep(c(
│   │   "solid", "longdash", "twodash",
│   │   "dashed", "dotdash", "dotted", "solid"
│   │ ), 3)
│   ) +
│   scale_x_datetime(date_breaks = date_breaks)
}


my_files <- list.files(path = run_dir, pattern = "^resources.*\\.log$", full.names = T)
for (i in my_files) {
│ plot_resource_usage(i)
│ ggsave(paste(i, "_mqc.png", sep = ""), width = 14, height = 7)
}

my_files <- list.files(path = run_dir, pattern = "^file_sizes.*\\.log$", full.names = T)
for (i in my_files) {
│ plot_file_sizes(i)
│ ggsave(paste(i, "_mqc.png", sep = ""), width = 14, height = 7)
}

5. Running and testing the pipeline

Each tool of the pipe should be tested individually and commands parameters should be given attention in detail. In particular, samtools’ stdin and stdout can be tricky to get right in the first try.

Comparing the output created running the commands in distinct steps and as pipe makes sure pipeline itself not causing any side effects.

Running the pipeline multiple times is also an indispensable test to make sure it is reliable. I noticed different variant counts in earlier tests because of the wrong parameter of pv (I used -S that would end the buffer prematurely).

I did the initial setting up and making sure that pipeline runs with super small subset of the FASTQ files.

zcat U0a_CGATGT_L001_R1_005.fastq.gz | sed 200q > Test_CGATGT_L001_R1_005.fastq.gz
zcat U0a_CGATGT_L001_R2_005.fastq.gz | sed 200q > Test_CGATGT_L001_R2_005.fastq.gz

I did the test runs ten at a time.

cd /home/bar/runs/
for i in {1..10}; do ../oneliner.sh; done

5.1. Should this be a single pipeline?

I would assume no. We have random reads coming from the fastq file and we should wait for alignment process to complete before going on with the variant calling. However, this doesn’t happen because I assume `samtools sort` buffers the whole thing while sorting.

While testing with a smaller subset running in pipe and distinct steps gives the same number of resulting variants. However, when we remove the `samtools sort` I got no variants which in concordance with my initial hunch. This would need more thorough testing to be sure like showing bwa and bcftools running at the same time (but i didn’t keep those stats).

bar@debiantestin:~/runs$ ls -1 results.*/Test.*.bcftools.log | xargs grep '^Lines'

Seperate Steps
results.234225/Test.234225.bcftools.log:Lines   total/split/joined/realigned/skipped:   20/0/0/0/0

Pipe no samtools sort
results.235225/Test.235225.bcftools.log:Lines   total/split/joined/realigned/skipped:   0/0/0/0/0

Pipe
results.70416/Test.70416.bcftools.log:Lines   total/split/joined/realigned/skipped:     20/0/0/0/0

5.2. Why is it taking too long?

This is not a pipe related concern but after setting up the pipe and running with relatively larger data which is still around 500Mb it took 15 hours.

BWA completes about less than 10 minutes. In this time the first chunk of the alignments are written about 60-70 Mb, then alignments are written incrementally until sort and mpileup finishes.

Investigating further I found we are calling anything and everything; half a million variants in the file. Target intervals only includes 13007 of them which only 7 has depth and quality bigger than 5.

bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf -H | wc -l
532845
bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf -i 'FORMAT/DP>1' -H | wc -l
70674
bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf -i 'FORMAT/DP>20' -H | wc -l
2408
bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf -i 'FORMAT/DP>20&&QUAL>20' -H | wc -l
2080
bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf -i 'FORMAT/DP>20&&QUAL>100' -H | wc -l
1288
bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf  -H -T /home/bar/reference/GRCh38/hg38.refGene.exon_padding.bed |wc -l
13007
bar@debiantestin:~/runs/results.89097$ bcftools view U0a.bcf -i 'FORMAT/DP>5&&QUAL>5' -H -T /home/bar/reference/GRCh38/hg38.refGene.exon_padding.bed |wc -l
7

After filtering for target region and the depth and quality we get it down to 15 minutes. For some reason target region overlaps low quality variants.

How much of the variants filtered effects the time it tooks to complete the pipeline. Biggest time concern after eliminating variant count would be the VEP. It takes around ~15 mins to annote 7 variants. Runing the VEP by itself takes only around 15 secs. Meaning there are some other bottlenecks.

We utilize more of the system resources passing around the $THREADS variable but it would be more interesting challenge to scale this pipeline horizontally.

5.3. Different number of variants

⊕

bwa mem -t "$THREADS" -R "@RG\tID:$SAMPLE\tSM:$SAMPLE\tPL:illumina\tLB:lib1\tPU:foo" "$REFERENCE" "$R1" "$R2" 2>"$OUTPUT.$$.bwa.log" > "$OUTPUT.sam"
│   # pv -cN bwa |
samtools collate -@ "$THREADS" -o "$OUTPUT.collated.bam" "$OUTPUT.sam"
samtools fixmate -@ "$THREADS" -m "$OUTPUT.collated.bam" "$OUTPUT.fixmate.bam"
samtools sort -@ "$THREADS" "$OUTPUT.fixmate.bam" -o "$OUTPUT.sorted.bam"  -O bam
samtools markdup -@ "$THREADS" "$OUTPUT.sorted.bam" "$OUTPUT.markdup.bam"
│   # ([ "$AMPLICON" = "NO" ] && samtools markdup -@ "$THREADS" - - || cat) |
│   # pv -cN samtools_markdup |
│
bcftools mpileup --threads "$THREADS" -Ou -A -T "$TARGET" -d 10000 -L 10000 -a "FORMAT/AD,FORMAT/DP" -f "$REFERENCE" "$OUTPUT.markdup.bam" 2>>"$OUTPUT.$$.bcftools.log" |

Disassembled commands. tee can also be used.

Even after going through this section I had different number of variants while running the pipeline. It would result in 6, 7 or 8 variants in every ten batches of my test runs. Debuging this was like disassembling an engine and testing each part individually. Variants are counted by bcftools norm step. So whatever was happening before this step. I worked my way up the ladder.

I had 10 different runs with various number of variants. I first make sure bcftools mpileup and bcftools call worked correctly by running it ten times for different number of variant called runs. It produced the same output from every time whether it was 6, 7 or 8 variants. So it was something upstream.

I separated each command before and checked the stats.

for bam in $(find -type f -name "*.*am"); do echo $bam; samtools stats $bam > $bam.stats; done
find -type f -name "*.stats" | xargs grep -Ri "reads mapped and paired"

./results.148485/U0a.sorted.bam.stats:SN        reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.148485/U0a.markdup.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.148485/U0a.sam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.148485/U0a.collated.bam.stats:SN      reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.148485/U0a.fixmate.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.159805/U0a.sorted.bam.stats:SN        reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.159805/U0a.markdup.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.159805/U0a.sam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.159805/U0a.collated.bam.stats:SN      reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.159805/U0a.fixmate.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.155247/U0a.sorted.bam.stats:SN        reads mapped and paired:        4229236 # paired-end technology bit set + both mates mapped
./results.155247/U0a.markdup.bam.stats:SN       reads mapped and paired:        4229236 # paired-end technology bit set + both mates mapped
./results.155247/U0a.sam.stats:SN       reads mapped and paired:        4229236 # paired-end technology bit set + both mates mapped
./results.155247/U0a.collated.bam.stats:SN      reads mapped and paired:        4229236 # paired-end technology bit set + both mates mapped
./results.155247/U0a.fixmate.bam.stats:SN       reads mapped and paired:        4229236 # paired-end technology bit set + both mates mapped
./results.150020/U0a.sorted.bam.stats:SN        reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.150020/U0a.markdup.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.150020/U0a.sam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.150020/U0a.collated.bam.stats:SN      reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.150020/U0a.fixmate.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.162839/U0a.sorted.bam.stats:SN        reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.162839/U0a.markdup.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.162839/U0a.sam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.162839/U0a.collated.bam.stats:SN      reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.162839/U0a.fixmate.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.156764/U0a.sorted.bam.stats:SN        reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped
./results.156764/U0a.markdup.bam.stats:SN       reads mapped and paired:        4229234 # paired-end technology bit set + both mates mapped

samtools outputs where consistent in itself as it showed same counts in each run but 155247 in particular had totally different counts. Pointing there was something before or during the alignment.

Next I checked the fastq files produced by fastp.

│find -type f -name "*.fq.gz"  | xargs md5sum

a90dd1e71c9e5432e8b67a54e65424a1  ./results.148485/U0a.fq.gz
a90dd1e71c9e5432e8b67a54e65424a1  ./results.159805/U0a.fq.gz
8f716b9f65748efdb3f30ba7d7794d4d  ./results.155247/U0a.fq.gz
a90dd1e71c9e5432e8b67a54e65424a1  ./results.150020/U0a.fq.gz
a90dd1e71c9e5432e8b67a54e65424a1  ./results.162839/U0a.fq.gz
8b4619fd39555cee4d5cca430d999379  ./results.156764/U0a.fq.gz
8b4619fd39555cee4d5cca430d999379  ./results.153722/U0a.fq.gz
a90dd1e71c9e5432e8b67a54e65424a1  ./results.158278/U0a.fq.gz
a90dd1e71c9e5432e8b67a54e65424a1  ./results.151555/U0a.fq.gz
a90dd1e71c9e5432e8b67a54e65424a1  ./results.161314/U0a.fq.gz

It creates three different outputs. Searching the github I found related issues: https://github.com/OpenGene/fastp/issues/506. Removing the fastp step gave the consistent alignment and variant call results.

Appendix I: The one-liner in one line

Marquee element support is deprecated so there is a custom javascript.

You can stop by clicking here.

div

bwa mem -t "$THREADS" -R "@RG\tID:$SAMPLE\tSM:$SAMPLE\tPL:illumina\tLB:lib1\tPU:foo" "$REFERENCE" "$R1" "$R2" 2>"$OUTPUT.$$.bwa.log" | samtools collate -@ "$THREADS" -O - | samtools fixmate -@ "$THREADS" -m - - | samtools sort -@ "$THREADS" - 2>"$OUTPUT.$$.samtools.log" | ([ "$AMPLICON" = "NO" ] && samtools markdup -@ "$THREADS" - - || cat) | samtools view -C -T "$REFERENCE" - | tee "$OUTPUT.cram" | bcftools mpileup --threads "$THREADS" -Ou -A -T "$TARGET" -d 10000 -L 10000 -a "FORMAT/AD,FORMAT/DP" -f "$REFERENCE" - 2>>"$OUTPUT.$$.bcftools.log" | bcftools call --threads "$THREADS" -Ou --ploidy "$ASSEMBLY" -mv | bcftools view -i 'FORMAT/DP>5&&QUAL>5' | bcftools norm --threads "$THREADS" -Ou -m-any --check-ref w -f "$REFERENCE" 2>>"$OUTPUT.$$.bcftools.log" | bcftools +fill-tags -Ou -- -t all 2>>"$OUTPUT.$$.bcftools.log" | bcftools +setGT -Ou -- -t q -n c:'0/1' -i 'VAF>=.1' 2>>"$OUTPUT.$$.bcftools.log" | bcftools +setGT -Ov -- -t q -n c:'1/1' -i 'VAF>=.75' 2>>"$OUTPUT.$$.bcftools.log" | /home/bar/ensembl-vep/vep --everything --force_overwrite --vcf --pick --format vcf --fork $THREADS --stats_file "$OUTPUT"_summary.html --warning_file "$OUTPUT"_warnings.txt --output_file STDOUT --cache 2>"$OUTPUT.$$.vep.log" | bcftools +split-vep -c SYMBOL,gnomADg_AF:Float,IMPACT,Existing_variation 2>>"$OUTPUT.$$.bcftools.log" | bcftools filter --threads "$THREADS" -Ou -m+ -s 'onTarget' -M "$TARGET" | bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowQual' -g3 -G10 -e 'FORMAT/DP<=15 || QUAL<=20' | bcftools filter --threads "$THREADS" -Ou -m+ -s 'highFreq' -e 'gnomADg_AF>0.001' | bcftools filter --threads "$THREADS" -Ou -m+ -s 'lowIMPACT' -i 'IMPACT~"HIGH" || IMPACT~"MODERATE"' | bcftools filter --threads "$THREADS" -Ou -m+ -s 'HOMrare' -e 'GT="1/1" && (gnomADg_AF <= 0.001 || (Existing_variation="." && gnomADg_AF="." && ID="."))' | bcftools filter --threads "$THREADS" -Ob -m+ -s 'HETnovel' -e 'GT="0/1" && Existing_variation="." && gnomADg_AF="." && ID="."' | tee "$OUTPUT.bcf" | bcftools +split-vep -f "$columns" -d -i 'CANONICAL~"YES"' 2>>"$OUTPUT.$$.bcftools.log" | awk -v header="$header" 'BEGIN {printf  header} 1' | gzip -c >"$OUTPUT.tsv.gz"

Appendix II: Setting up tools and data

We are going to need fastqc, fastp, bwa, samtools, bcftools, tabix, bedtools, ensembl-vep and multiqc. I created small scripts with dependencies as I was trying to create containers with some of them. Some of them are just installed from debian repos.

fastqc bedtools tabix

apt update -y
apt install -y fastqc tabix bedtools

ggplot2
Debian packages lots of R packages so we don’t have to compile it.
```
apt update -y
apt install -y r-cran-ggplot2
```
multiqc and igv-reports
Debian also packages multiqc but it is version 1.4 which doesn’t have software version or custom image module we are using. In order to get the latest version of multiqc we need to spin up a virtual environment. IGV-reports is also a python package we can use the same environment while installing.
```
apt update -y
apt install -y python3-virtualenv
virtualenv -p python3 venv
source venv/bin/activate
pip install -U multiqc igv-reports
```

fastp

apt update -y
apt install -y wget
wget http://opengene.org/fastp/fastp -O /usr/bin/fastp
chmod a+x /usr/bin/fastp

BWA

apt update -y
apt install -y git gcc zlib1g-dev make
git clone https://github.com/lh3/bwa
cd bwa
make
cp ./bwa /usr/local/bin/
cd ..
rm -rf bwa

samtools

apt update -y
apt install -y \
│ git gcc zlib1g-dev autoconf make \
│ liblzma-dev libbz2-dev libcurl4-openssl-dev
git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/samtools

cd samtools
autoheader
autoconf -Wno-syntax
./configure --without-curses
make
make install
cd ..
rm -rf samtools
rm -rf htslib

bcftools

apt update -y
apt install -y \
│ git gcc zlib1g-dev autoconf make \
│ liblzma-dev libbz2-dev libperl-dev \
│ libgsl-dev libcurl4-openssl-dev
git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/bcftools

cd bcftools
autoheader && autoconf && ./configure --enable-libgsl --enable-perl-filters
make
make install
cd ..
rm -rf bcftools
rm -rf htslib

ensembl-vep

Installing the vep cache takes time…

apt install -y \
│   zlib1g-dev libbz2-dev liblzma-dev gcc \
│   libmodule-build-perl libjson-perl libdbi-perl \
│   libset-intervaltree-perl build-essential make \
│   automake git unzip autoconf libdbd-mysql-perl \
│
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl -a acf -s homo_sapiens -y GRCh38 (vep_install)

downloading the reference genome

apt update -y
apt install -y wget

cd
mkdir -p reference/GRCh38
cd reference/GRCh38
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai
cd

downloading the sample data and target file

cd
mkdir -p sample
cd sample
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_005.fastq.gz
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_005.fastq.gz
cd "$HOME"/reference/GRCh38
wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/resources/hg38.refGene.exon.bed.gz

Preprocess the target file

We add some padding to the panel file

PADDING=100
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.chrom.sizes

bedtools slop -i hg38.refGene.exon.bed -g hg38.chrom.sizes -b $PADDING > hg38.refGene.exon_padding.bed

Appendix III: History and Acknowledgments

<2023-10-22 Sun>

Changed the title from NGSoneliner to NGS one-liner to call variants and made additions in order to make the subject clear thanks to GenoMax’s suggestions.

NGS one-liner to call variants

1. Introduction

2. One-liner

2. One-liner

2.1. Pipeline

2.2. Breakdown

3. Results and highlights

4. Auxiliary tasks and Embellishments

4.1. setting up the environment

4.2. software versions

4.3. Getting the stats and creating reports

4.4. pv

4.5. Monitoring the resource usage

4.5.1. Plotting the resource usage and file sizes

5. Running and testing the pipeline

5.1. Should this be a single pipeline?

5.2. Why is it taking too long?

5.3. Different number of variants

Appendix I: The one-liner in one line

div

Appendix II: Setting up tools and data

Appendix III: History and Acknowledgments

<2023-10-22 Sun>

Comments