Anna Syme

Click name ↑ to return to homepage

Assembly workflow for bacteria - example

Note - this is probably a bit out of date now

Basic commands for assembling a bacterial genome (from 2017-2018). Needs Canu, Circlator, Pilon and other tools.

#get the sequencing reads
#joined Pacbio reads: subreads.fastq
#illumina reads: R1.fastq.gz, R2.fastq.gz

#run canu to assemble
canu -p <output prefix> -d <output dir> genomeSize=2.8m -pacbio-raw subreads.fastq
#option: use sensitivity settings after outdir: corMhapSensitivity=high corMinCoverage=0
#output: prefix.contigs.fasta
infoseq prefix.contigs.fasta

#run circlator to trim and orient
circlator all --threads 8 --verbose contigs.fasta corrected.reads.fastq.gz <output dir>
#option: use branch lengths option after verbose: --b2r_length_cutoff <2 x av. length)
less 04.merge.circularise.log #to see if circularised
less 06.fixstart.log #to see where oriented
#output: 06.fixstart.fasta; rename as circ_contigs.fa

#if some contigs not circularised, try berokka

#option: separate out contigs from multifasta file
samtools faidx contigs.fa #index the file
samtools faidx contigs.fa contig1 > contig1.fa
samtools faidx contigs.fa contig2 > contig2.fa

#find smaller plasmids

#align illumina to Pacbio contigs
bwa index circ_contigs.fa
bwa mem -t 8 circ_contigs.fa R1.fastq.gz R2.fastq.gz | samtools sort > aln.bam
#output is aln.bam

#extract unmapped reads
samtools index aln.bam
samtools fastq -f 4 -1 unmapped.R1.fastq -2 unmapped.R2.fastq -s unmapped.RS.fastq aln.bam

#assemble these reads with spades
spades.py -1 unmapped.R1.fastq -2 unmapped.R2.fastq -s unmapped.RS.fastq --careful --cov-cutoff auto -o <output dir>
#save any long contigs; e.g. as contig3.fa

#option: assemble all illumina reads with spades
spades.py -1 R1 -2 R2 --careful --cov-cutoff auto -o output_dir

#find and trim overhang
head -n 10 contig3b.fa > contig3b.fa.head
makeblastdb -in contig3b.fa -dbtype nucl
blastn -query contig3b.fa.head -db contig3b.fa -evalue 1e-3 -dust no -out contig3b.bls
less contig3b.bls #find spot where overhang starts
samtools faidx contig3b.fa
samtools faidx contig3b.fa contig3b:1-2252 > contig3b.fa.trimmed
#open contig3b.fa in nano and shorten header name

#join all contigs: all_contigs.fa

#correct with pilon
bwa index all_contigs.fa
bwa mem -t 32 all_contigs.fa R1 R2 | samtools sort > aln.bam
samtools index aln.bam
samtools faidx all_contigs.fa
pilon --genome all_contigs.fa --frags aln.bam --output pilon1 --fix bases --mindepth 0.5 --changes --threads 32 --verbose

#repeat correction using Illumina reads - use output from above - pilon1.fasta - as the contigs file

#output e.g.: corrected genome assembly of Staphylococcus aureus in .fasta format, containing three contigs: chromosome, large plasmid and small plasmid.