ForBio k-mer workshop: "Oh-know"


Teachers: Kamil S. Jaron, José Cerca, Rishi De-Kayne, Lucía Campos, Siavash Mirarab


K-mers are an extremely powerful tool for addressing many genomics questions, especially in species without high-quality reference genomes and annotations, which is the case for the vast majority of life and for all sequencing projects of as yet undescribed species. Although k-mer-based approaches are becoming more and more popular, many genomicists struggle with a lack of good resources for understanding what is going on under the hood.


Objectives: The aim of the workshop is to train and inspire genomics researchers to understand and use k-mer-based approaches for their sequencing projects in an online, hands-on course. The workshop consists of several blocks, detailed below, covering different types of k-mer analyses, interleaved with talks on applications of the approaches discussed.


Prerequisites: Basic knowledge of the command line (bash, UNIX). Previous experience with genomics data is recommended, but the workshop is open to everyone. A “pre-workshop” will be held on the basics of genomics and the command line.


Detailed program


12:00 - 15:00 CET

Mo, 13.9.: Welcome & Participant Intro
Tu, 14.9.: Introduction to K-mer spectra analysis
We, 15.9.: Characterization of genomes using k-mer spectra analysis
Th, 16.9.: Separate sub-genomes of an allopolyploid

16:00 - 19:00 CET

Tu, 14.9.: Separating chromosomes by comparison of sequencing libraries
We, 15.9.: Introduction to k-mers for analyzing skimming data
Th, 16.9.: Advanced use of k-mers for analyzing skimming data


Bash Refresher

If you want to join the workshop but are just getting started with bioinformatics, or are worried your bioinformatic skills might be a bit rusty, there's no need to exclaim 'Oh-know'! At the start of the workshop we will run a short bash refresher module covering the basics of the command line and the use of a computer cluster. We will discuss the different types of sequencing technologies available, what the output from these technologies looks like, and how to get from your raw data to the k-mer analysis steps we will cover later on. This will include unpacking, exploring, and manipulating sequence data, doing basic quality control, and preparing summary statistics, which will set you up for the more advanced bioinformatics tools and analyses discussed later in the workshop.

K-mer spectra analysis

Most of the genomes sequenced are Pandora's boxes: completely undescribed genomes. While cytological studies and flow cytometry are the best way to gain general insights into genome structure, they are unfortunately hard to scale and require very different expertise. K-mer spectra analysis is an alternative way to infer basic genomic properties directly from sequencing data. It provides an elegant way to estimate heterozygosity, genome size and the repetitive fraction prior to genome assembly. In the Introduction to K-mer spectra analysis module we will first understand the logic behind decomposing reads into k-mers and explore the basic properties of k-mer spectra on a variety of genomes. In the follow-up module, Characterization of genomes using k-mer spectra analysis, we will learn how to apply k-mer spectra analysis to more complicated genomes, including polyploids.
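As a rough illustration of what decomposing reads into k-mers means, the sketch below counts every k-mer in a handful of toy reads and then tabulates the k-mer spectrum, i.e. how many distinct k-mers occur at each coverage. The function names, the toy reads and the `min_coverage` parameter are assumptions made for this example and are not taken from the workshop material. The genome size calculation is only the naive "total k-mers divided by peak coverage" estimate, and the sketch ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each coverage value to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def naive_genome_size(spectrum, min_coverage=1):
    """Naive estimate: total k-mer count divided by the coverage of the main peak.

    min_coverage (hypothetical parameter) lets the caller skip low-coverage
    k-mers, which in real data are dominated by sequencing errors.
    """
    usable = {cov: n for cov, n in spectrum.items() if cov >= min_coverage}
    peak = max(usable, key=usable.get)  # coverage with the most distinct k-mers
    total_kmers = sum(cov * n for cov, n in usable.items())
    return total_kmers / peak


# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "CGTACGTA", "GTACGTAC"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)

print(sorted(spectrum.items()))  # [(3, 1), (4, 3)]
print(naive_genome_size(spectrum))
```

On real data the spectrum is usually plotted as a histogram, and its peaks are what carry the information about coverage, heterozygosity and repeats.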

Separating sub-genomes

Many species show exciting karyotype variation: sex chromosomes differ between sexes, germline-restricted chromosomes differ between the soma and the germ line, and accessory (B) chromosomes differ between lineages or populations. As a consequence, we are able to separate chromosomes by comparison of sequencing libraries. We will show how sequencing libraries can be compared and how we can identify k-mers belonging to individual chromosomes or sub-genomes. Sometimes, however, it is not possible to sequence two such libraries, for example if the chromosomes to separate are sister chromosomes of a polyploid species that evolved by hybridization of two different species (an allopolyploid) and the two parental species are unknown or even extinct. In the Separate sub-genomes of an allopolyploid block we will show how to use k-mers related to transposable element fossils to tease apart the two parental sub-genomes.
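The idea of comparing two sequencing libraries can be sketched as a set operation on k-mer counts: k-mers that are well supported in one library and absent from the other are candidates for sequence that only the first library contains. The example below is a minimal sketch under stated assumptions: the function names, the toy "male" and "female" reads and the `min_count` threshold are all hypothetical, and real analyses must also deal with coverage differences, sequencing errors and reverse complements.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def library_specific_kmers(counts_a, counts_b, min_count=2):
    """Return k-mers seen at least min_count times in library A and never in B.

    min_count is an assumed threshold that drops k-mers seen only once,
    which are more likely to come from sequencing errors.
    """
    return {
        kmer
        for kmer, n in counts_a.items()
        if n >= min_count and kmer not in counts_b
    }


# Toy libraries, invented for illustration: the "male" library carries one
# extra sequence that the "female" library lacks.
male_reads = ["ACGTTGCA", "ACGTTGCA", "GGGCCCAT", "GGGCCCAT"]
female_reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA"]

male_counts = count_kmers(male_reads, k=5)
female_counts = count_kmers(female_reads, k=5)

male_specific = library_specific_kmers(male_counts, female_counts)
print(sorted(male_specific))
```

Reads or assembled contigs that are rich in such library-specific k-mers can then be assigned to the chromosome that distinguishes the two libraries.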

Analyzing skimming data

Genome skimming, the practice of sequencing genomes at low coverage (e.g., 1X), is increasingly gaining popularity as a way of characterizing biodiversity. The resulting data, which could simply be called a bag of reads, cannot be assembled, and in many applications a reference genome that would allow mapping does not exist. Given these limitations, how are we to analyze genome skims? The traditional approach is to assemble only the organelle genomes and thus use only a small fraction of the data. K-mer-based methods, however, allow a wider range of analyses. The quintessential computational question for many downstream analyses is the following: given two bags of reads covering the genome at low coverage, can we compute the distance between the genomes that generated those bags of reads? The answer is yes. Using k-mers, such distances can be computed with high accuracy. However, there are several adjacent opportunities and challenges that should be considered. Challenges include dealing with contamination and estimating genomic parameters (e.g., length and repeat spectra). Opportunities include phylogenetic placement of samples using skimming data and identification of mixtures. In this module, we present a suite of tools for analyzing genome skims: Skmer for distance calculation, APPLES and MISA for phylogenetic placement using such distances, CONSULT for elimination of contamination, and RESPECT for estimation of genomic parameters.
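To give a feel for how a distance can be computed from k-mers alone, the sketch below computes the Jaccard index of two k-mer sets and converts it to a distance with the textbook relation D = 1 - (2J / (1 + J))^(1/k). This relation assumes complete, error-free sequences in which a k-mer is shared only if none of its k positions differ. It is not the procedure used by the tools named above, which are designed for low-coverage, error-prone skims. The function names and toy sequences are assumptions made for this example.

```python
def kmer_set(reads, k):
    """Collect the set of distinct k-mers found in the reads."""
    return {
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    }


def jaccard(a, b):
    """Jaccard index: shared k-mers divided by all k-mers in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_to_distance(j, k):
    """Convert a Jaccard index to a distance, assuming full coverage and no errors."""
    if j == 0:
        return 1.0  # nothing shared: treat as maximally distant
    return 1.0 - (2.0 * j / (1.0 + j)) ** (1.0 / k)


# Toy sequences, invented for illustration: they differ at the last base.
k = 3
kmers_a = kmer_set(["ACGTAC"], k)
kmers_b = kmer_set(["ACGTAG"], k)

j = jaccard(kmers_a, kmers_b)
d = jaccard_to_distance(j, k)
print(j, d)
```

With genome skims, many k-mers are missing from each sample simply because of low coverage, so the raw Jaccard index underestimates similarity. Correcting for that effect is what makes the problem interesting.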


NB: Due to corona restrictions, the course will be held fully online, with talks on Zoom and the practicals run via Slack and GitHub.




Application deadline is August 15th, 2021.

Click here to apply now.

Contact Hugo de Boer or Quentin Mauvisseau for more information.

Published June 14, 2021 4:55 PM - Last modified July 10, 2021 10:15 PM