Jellyfish - count kmers to estimate genome size

Quick example

jellyfish count -m 21 -s 100M -t 10 -C <(zcat R1.fq.gz) <(zcat R2.fq.gz)
jellyfish histo -t 10 --high=1000000 mer_counts.jf > reads.histo
# Then load reads.histo into GenomeScope with kmer length = 21 (the -m value), read length = 150, max kmer coverage = 1000000 (the --high value)

Theory

k = the chosen kmer length in bases (e.g. 21)
kmer = a sequence of k consecutive bases (e.g. 21 bases)

If k is large enough so that each kmer found is unique in the genome,
and if the genome length (e.g. 1,000,000) is much larger than the kmer length (e.g. 21),
and if there are no PCR or sequencing errors,
then the number of distinct kmers will be approximately equal to the length of the genome
(exactly: genome length - k + 1 for a single linear sequence).
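
A toy check of the statement above, as a minimal sketch: the sequence and k = 4 are made up for illustration, and reverse complements are ignored here (so this is not what -C does). Sliding a window of width k along a sequence of length 10 should give 10 - 4 + 1 = 7 kmers, all distinct in this case.

```shell
# Hypothetical toy sequence; k is kept small so the kmers are easy to read.
seq="ACGTTGCATG"
k=4

# Slide a window of width k along the sequence and print one kmer per line.
kmers=$(awk -v s="$seq" -v k="$k" \
  'BEGIN { for (i = 1; i <= length(s) - k + 1; i++) print substr(s, i, k) }')

# Count all kmers, then only the distinct ones.
total=$(printf '%s\n' "$kmers" | wc -l | tr -d ' ')
distinct=$(printf '%s\n' "$kmers" | sort -u | wc -l | tr -d ' ')

echo "length=${#seq} k=$k total_kmers=$total distinct_kmers=$distinct"
```

With real reads the total is far larger than the distinct count, because each genomic kmer is sequenced many times; that ratio is what the steps further down exploit.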

Kmer counting tutorial: https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/#

Jellyfish

Jellyfish: https://github.com/gmarcais/Jellyfish
Further jellyfish steps: https://github.com/gmarcais/Jellyfish/tree/master/doc

Note (Aug 2019): conda install seems to work only with a pinned version: conda install jellyfish=2.2.3

Steps:

Go through the sequencing reads.
For each new kmer seen, add it to a table with a count of 1.
If the kmer has been seen before, increment its count.
Find the average kmer frequency (the kmer coverage, which is slightly lower than the sequencing depth), e.g. 50.
Exclude kmers with a count of ~1, as these are likely from sequencing errors.
Sum the counts of all the other kmers and divide by the average kmer frequency => this is the approximate genome length.
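
The last three steps can be sketched with awk on a histogram in the two-column format that jellyfish histo writes (count, number of distinct kmers with that count). This is a rough sketch only: the file toy.histo and its numbers are invented, the error cutoff is taken as the first dip in the curve, and the peak position stands in for the average kmer frequency. GenomeScope fits a proper model and handles heterozygosity and repeats, which this does not.

```shell
# Invented toy histogram: an error spike at low counts, then a peak near 8.
cat > toy.histo <<'EOF'
1 5000
2 800
3 100
4 50
5 200
6 900
7 2500
8 4000
9 2500
10 900
11 200
12 50
EOF

est=$(awk '
{ c[NR] = $1; n[NR] = $2 }          # c = kmer count, n = number of distinct kmers
END {
    # Error trough: walk down the initial error spike until the curve rises again.
    t = 1
    while (t < NR && n[t + 1] < n[t]) t++
    # Main peak: the most frequent count at or after the trough.
    p = t
    for (i = t; i <= NR; i++) if (n[i] > n[p]) p = i
    # Total kmer occurrences at or after the trough, divided by the peak count.
    for (i = t; i <= NR; i++) total += c[i] * n[i]
    printf "trough=%d peak=%d genome_size=%d\n", c[t], c[p], total / c[p]
}' toy.histo)

echo "$est"
```

To try it on real output, replace toy.histo with reads.histo; expect the result to differ from GenomeScope, especially for heterozygous or repeat-rich genomes.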

jellyfish count -m 21 -s 100M -t 10 -C reads.fasta

To use gzipped, paired-end reads (the <(...) process substitution needs bash):

jellyfish count -m 21 -s 100M -t 10 -C <(zcat R1.fq.gz) <(zcat R2.fq.gz)

-m: kmer length, 21 is commonly used
-s: initial size of the hash table: should be about genome size + extra kmers from sequencing errors. However, the Jellyfish documentation says the hash size will be increased automatically if needed.
-C: canonical. Reverse complement kmers are considered to be identical and are counted as the same thing. This is recommended.
-t: number of threads
output: mer_counts.jf (the default name; set another with -o)
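
To illustrate what canonical counting means, here is a small sketch that does not use jellyfish itself. The kmer is made up, and the canonical form is taken as the lexicographically smaller of the kmer and its reverse complement, so both strands end up counted under one key.

```shell
# Hypothetical 5-mer for illustration.
kmer="GACTT"

# Reverse complement: reverse the string, then swap A<->T and C<->G.
rc=$(printf '%s\n' "$kmer" | rev | tr 'ACGT' 'TGCA')

# Canonical form: whichever of the two sorts first (byte order via LC_ALL=C).
canonical=$(printf '%s\n%s\n' "$kmer" "$rc" | LC_ALL=C sort | head -n 1)

echo "kmer=$kmer revcomp=$rc canonical=$canonical"
```

Without -C, GACTT and AAGTC would be counted separately, roughly halving the apparent coverage of each.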

Compute the kmer count histogram (a text table, ready for plotting):

jellyfish histo -t 10 mer_counts.jf > reads.histo

Compute the histogram with a maximum count of one million instead of the default 10,000:

jellyfish histo -t 10 --high=1000000 mer_counts.jf > reads.histo

Discussion on setting to 1 million: https://github.com/schatzlab/genomescope/issues/22
There may be very high counts of some kmers due to chloroplast sequences, spike-in sequences, etc.