Note - see Jellyfish first
Genomescope
Genomescope takes in the histogram (the kmer profile) from jellyfish and estimates other information.
Note: Genomescope is for short reads only (e.g. Illumina), and is only for diploid genomes (although may be developed to work with polyploids also).
- Code: https://github.com/schatzlab/genomescope
- Paper: https://academic.oup.com/bioinformatics/article/33/14/2202/3089939
- Note: the supplementary information is very informative.
Theory:
- The shape of the the kmer profile is affected by the level of heterozygosity and duplication.
- Compare the profile found with models, and estimate statistics.
Steps:
- Go to: http://qb.cshl.edu/genomescope/
- Upload the reads.histo file
- Settings: kmer length - same as used in jellyfish (e.g. 21), read length - e.g. 150, max kmer depth: default is 1000 or match to end of histogram’s x axis (e.g. 1000000)
Or, to run on command line:
genomescope.R [histogram file] [ k ] [read length] [outdir] [max kmer cov]
Results: typically, the graph shows:
- Main central peak: homozygous alleles at average kmer coverage. (Smaller kmers will have higher coverage).
- Left-hand peak: heterozygous alleles. Note that this peak gets higher than the main peak when heteorzygosity is only ~ 1.2%.
- Right-hand peak(s): repeats
The graph is also shown with logarithmic axes.
There are some statistics reported above the graph, and some in a ‘results’ box below.
My interpretation of these terms (based a lot on the supplementary paper info) is:
- Genome length (haploid)
- Genome unique length: from single copy homozygous and heterozygous alleles (under the main and left peak).
- Genome repeat length: from repeat copies (under the graph to the right of the main peak).
- Heterozygosity rate: the percentage of bases that are specific to one of the homologous chromosomes. So, if 1%, every 100th base will differ between a chromosome pair.
- Duplication rate: the percentage read duplication, ostensibly artificially caused. (Although I’m not sure how this gets calculated, as how do you tell these apart from real duplicate reads from repeat regions?). So, if 1%, I take this to mean 1 in every 100 reads is duplicated. I’m not sure what this is used for but I suppose it affects the shape of the kmer profile (and therefore its fit with a model).
- Error %: based on the model inferred for the graph, this is where the lines cross (between 0 cov and heterozygous peak). This is a number of erroneous kmers: this is then used to estimate the number of erronous bases in the reads (becasue 1 erroneous base can => many erroneous kmers).