Kmer counting

Summary (example)

jellyfish count -m 21 -s 100M -t 10 -C <(zcat R1.fq.gz) <(zcat R2.fq.gz)
jellyfish histo -t 10 --high=1000000 mer_counts.jf > reads.histo
# then load histogram in genomescope with k = m, read length = 150, max kmer cov = 1000000

Theory

Theory:

Kmer counting tutorial: https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/#

Jellyfish

Note: Aug 2019: conda install seems to work only with: conda install jellyfish=2.2.3

Steps:

jellyfish count -m 21 -s 100M -t 10 -C reads.fasta

To use gzipped files: and paired-end reads:

jellyfish count -m 21 -s 100M -t 10 -C <zcat R1.fq.gz) <(zcat R2.fq.gz)

Plot the histogram:

jellyfish histo -t 10 mer_counts.jf > reads.histo

Plot the histogram with an x axis of one million instead of default 10,000:

jellyfish histo -t 10 --high=1000000 mer_counts.jf > reads.histo

Genomescope

Genomescope takes in the histogram (the kmer profile) from jellyfish and estimates other information.

Note: Genomescope is for short reads only (e.g. Illumina), and is only for diploid genomes (although may be developed to work with polyploids also).

Theory:

Steps:

Or, to run on command line: genomescope.R [histogram file] [ k ] [read length] [outdir] [max kmer cov]

Results: typically, the graph shows:

The graph is also shown with logarithmic axes.

There are some statistics reported above the graph, and some in a ‘results’ box below.

My interpretation of these terms (based a lot on the supplementary paper info) is: