Our bioinformatics projects are:
- Visualization in the cloud
- Scalable metagenomics
- Leishmania genomics
- Genome assembly with mixed technologies
- Parallel scalable genome assembly
- Semantic web for biology
- Structured output learning meets combinatorial chemistry
- Arctic metagenome: communities and systems
- String kernels and beyond
One of the problems in genomics is that large datasets must be delivered to end users before any visualization experience can be delivered. Most end users are not equipped to handle such large data, and in many ways it makes little sense to move large datasets around at all.
Ray Cloud Browser is a genome graph browser in the cloud. Its purpose is to visualize genome assemblies, and the main claim is that it will help researchers understand a growing array of problems (such as assembly algorithms).
The basic design is the same as that found in the Ray assembler: message passing and granularity are the two pillars of the Ray technologies. Although currently limited to assembly graph visualization, the Ray technologies and ideas put forward are freely reusable through products like RayPlatform, Ray Cloud Browser and Ray.
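The message-passing design can be illustrated with a toy sketch: workers own partitions of the k-mer graph and communicate only through small routed messages. This is illustrative only (the class names, message tag and routing function are invented for the example, not the RayPlatform API):

```python
from collections import deque

class Worker:
    """Owns one partition of the distributed k-mer table."""
    def __init__(self, name):
        self.name = name
        self.kmers = {}

    def handle(self, message):
        tag, kmer = message
        if tag == "STORE_KMER":
            self.kmers[kmer] = self.kmers.get(kmer, 0) + 1

workers = {0: Worker("rank-0"), 1: Worker("rank-1")}
queue = deque()  # stands in for the network message queue

def route(kmer):
    """Deterministically assign each k-mer to one worker."""
    return sum(map(ord, kmer)) % len(workers)

for kmer in ["ACGT", "CGTA", "ACGT"]:
    queue.append((route(kmer), ("STORE_KMER", kmer)))

while queue:  # the message loop: deliver until no messages remain
    rank, message = queue.popleft()
    workers[rank].handle(message)
```

Because every k-mer is routed deterministically, no worker needs global state, which is what lets the real system scale across many processes.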
Student: Sébastien Boisvert
Metagenomics is concerned with the integrative study of systems composed of organisms from communities and their biological functions. This work aims to devise methods for scalable analysis.
Student: Sébastien Boisvert
The recent sequencing of several species of Leishmania, a dimorphic intracellular parasite, sets the stage for new approaches to studying this parasite. We created whole-genome microarrays targeting all genes of this parasite, along with an online analysis platform tailored to this microarray. This makes it possible to study the gene expression of the parasite in different contexts (see Ubeda et al. and Rochette et al.). We also sequenced and annotated a non-pathogenic Leishmania species, L. tarentolae. Comparative genomics using all sequenced species allows a better understanding of the biology of this parasite.
An accurate and complete genome sequence of a desired species or phylogenetically close relative is now a basic prerequisite for advanced genomics research. A crucial step in obtaining a high-quality genome sequence is the ability to correctly assemble short individual sequence reads into longer contiguous sequences accurately representing genomic regions that are much longer than any single contributing read. Current sequencing technologies continue to offer increases in throughput and corresponding reductions in cost and time. Unfortunately, the benefit of obtaining very large numbers of reads is complicated by a non-trivial presence of sequence errors, with different types of errors and biases being observed with the different sequencing systems. Although software systems exist for assembling reads from each individual system, no comprehensive procedure has been proposed for high-quality genome assembly based on mixes of reads from different technologies. We describe an open source software program called OpenAssembler which has been specifically developed to assemble reads obtained from a combination of sequencing systems, and compare its performance to other assembly packages on simulated and real datasets. To illustrate the value of OpenAssembler, we used a combination of Roche/454 and Illumina reads to assemble the 3.6 Mb Acinetobacter baylyi ADP1 genome (NCBI/Genbank accession CR543861) into 119 contigs containing 26 mismatches and 7 indels. The Newbler assembler, using only the Roche/454 reads (for which it was designed), assembled the genome into 118 contigs with 64 mismatches and 356 indels.
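The core idea behind assembling short reads into longer contigs can be sketched with a toy de Bruijn graph, where nodes are (k-1)-mers and overlapping reads vote for the same edges. This is a minimal illustration of the general technique, not the OpenAssembler code:

```python
from collections import defaultdict

def build_graph(reads, k):
    """De Bruijn graph: nodes are (k-1)-mers, each k-mer adds an edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Greedily walk unambiguous edges to grow a contig."""
    contig, node = start, start
    visited = {start}
    while len(set(graph[node])) == 1:  # exactly one distinct successor
        node = graph[node][0]
        if node in visited:  # stop on a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig

# Three overlapping (error-free) reads reconstruct the original sequence.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph = build_graph(reads, k=4)
print(extend_contig(graph, "ACG"))  # → ACGTAC
```

Real assemblers stop extension at ambiguous branches (repeats, sequencing errors), which is why mixing technologies with different error profiles changes contig counts and accuracy.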
(text from Robert Cedergren Bioinformatics Colloquium 2009)
The site http://denovoassembler.sf.net hosts the Ray project -- a massively parallel open source genome assembler for sequencers such as Roche 454 sequencers, Illumina sequencers, SOLiD sequencers, Pacific Biosciences sequencers, Helicos Biosciences sequencers, and exciting Ion Torrent semiconductor-based sequencers.
Student: Sébastien Boisvert
Collaborator: François Laviolette
Poster: pdf:Boisvert-IMII-2010.pdf (in french)
Presentation: pdf:Boisvert-ULaval-2010.pdf (in french)
Paper: Journal of Computational Biology (ahead of print) doi:10.1089/cmb.2009.0238
Bio2RDF is a project that creates linked data from the existing data of genomic data providers. By linking these data into a mashup, queries that span all of these datasets become possible. Now that this web of data has been created, does the web itself contain knowledge that can be extracted from its structure? Using graph theory, artificial intelligence and statistical analysis, I hope to find hidden knowledge in this semantic web graph.
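Linked data reduces cross-dataset queries to traversals over a graph of (subject, predicate, object) triples. A toy sketch of the idea (the identifiers below mimic Bio2RDF-style namespaces but are chosen for illustration, not taken from the actual datasets):

```python
# Toy triple store: a cross-dataset query becomes a graph traversal.
# Identifiers are illustrative, not real Bio2RDF URIs.
triples = [
    ("geneid:348",     "encodes",     "uniprot:P02649"),
    ("uniprot:P02649", "involved_in", "go:0006869"),
    ("go:0006869",     "label",       "lipid transport"),
]

def objects(subject, predicate):
    """Return all objects matching the pattern (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# "Which process labels are linked to gene 348?" spans three datasets:
for protein in objects("geneid:348", "encodes"):
    for process in objects(protein, "involved_in"):
        print(objects(process, "label"))  # → ['lipid transport']
```

In practice such queries are written in SPARQL against the Bio2RDF endpoints, but the underlying operation is the same pattern-matching walk over triples.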
Student: Marc-Alexandre Nolin
Collaborator: Michel Dumontier
Paper: Journal of Biomedical Informatics (2008) doi:10.1016/j.jbi.2008.03.004
Our aims are: to develop an in silico approach for drug screening that would employ novel machine learning methodologies coupled to a powerful combinatorial chemistry process to validate the generated model. The process would be iterative, improving the model at each round. In addition, we would use new instrumentation to provide affinity measurements for the selected target and compounds, further enhancing the model's predictive power by providing data to qualify and stratify the binding data. The novelty of the proposal resides in using the massive computational space for predicting molecular binding and coupling it with the equally massive combinatorial chemistry space, striving to maximize the intersection between them.
By means of metagenomics and high-throughput pyrosequencing, we addressed the hypothesis that cyanobacterial mats in polar aquatic ecosystems maintain a nutrient-rich microenvironment via decomposition and scavenging processes. Analysis of 592,554 genomic DNA reads (a total of 11.5 million base pairs) showed that the ribosomal and protein-coding genes of two high Arctic ice shelf mat communities were dominated by Proteobacteria, not Cyanobacteria, which implies a broad range of bacterial decomposition and nutrient recycling processes in addition to phototrophy. Principal component analysis of genes for light-, nitrogen-, and phosphorus-related processes provided evidence of partitioning of mat function among taxonomically different constituents of the mat consortia. Viruses were also present (notably alpha-, beta-, and gamma-proteobacteria phages and cyanophages), which likely contribute to cellular lysis and recycling, as well as other Bacteria, Archaea, and microbial eukaryotes. Our results show that microbial mats are sites of intense mineralization, with nitrogen metabolism dominated by ammonium-related systems, while nitrification genes were absent. Nutrient scavenging systems were detected, including genes for transport proteins and enzymes converting larger molecules into more readily assimilated inorganic forms (allantoin degradation, cyanate hydrolysis, exophosphatases, phosphonatases). These results based on metagenomic profiling underscore the rich diversity of microbial life even in extreme polar habitats, and the capability of mat consortia to retain and recycle nutrients in the benthic microenvironment.
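The principal component analysis step used above can be sketched on a made-up gene-category count matrix (the numbers and sample layout are invented for illustration, not the study's data):

```python
import numpy as np

# Toy matrix: rows = mat samples, columns = counts of light-,
# nitrogen- and phosphorus-related genes (made-up numbers).
counts = np.array([
    [120.0, 40.0, 15.0],
    [115.0, 45.0, 12.0],
    [ 30.0, 90.0, 60.0],
    [ 25.0, 95.0, 65.0],
])

# PCA via SVD of the column-centered matrix.
centered = counts - counts.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T            # sample coordinates on the PCs
explained = s**2 / np.sum(s**2)     # fraction of variance per PC

print(scores[:, 0])   # PC1 separates the two groups of samples
print(explained[0])   # most of the variance lies on PC1
```

Samples that separate along the leading components correspond to communities whose functional gene content differs, which is how partitioning of mat function was inferred.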
Student: Thibault Varin
Collaborators: Connie Lovejoy; Warwick Vincent
Poster: pdf:VarinT_2008.pdf Thibault Varin. Metagenomic analysis of arctic microbial mats communities. Polar and Alpine Microbiology conference. Banff, Alberta, Canada. May, 2008.
Papers: Limnology & Oceanography (2010) doi:10.4319/lo.2010.55.5.1901
Varin, T., Lovejoy, C., Jungblut, A., Vincent, W., Corbeil, J. Metagenomic analysis of stress genes in microbial mat communities from Antarctica and the extreme High Arctic. Applied and environmental microbiology 2012 Jan;78(2):549-59. doi: 10.1128/AEM.06354-11.
HIV type 1 infects human cells through interactions between ligands and receptors. This retrovirus uses the CD4 receptor in conjunction with a chemokine receptor, either CCR5 or CXCR4 in vivo, to penetrate target cells. Bioinformatic methods have been described to predict coreceptor usage, but they all rely on sequence alignments, making sequences with too many indels impossible to process. To overcome this drawback, we developed an alignment-free approach using string kernels and support vector machines. The SVM has strong theoretical support and is very robust to noise. We created a new string kernel, the distant segments kernel, and compared it to existing string kernels in the literature, such as the local alignment kernel and the blended spectrum kernel.
With the distant segments kernel, we obtained an accuracy (1 - empirical risk) of 94.80% on a testing set of 1425 examples with a classifier trained on a set of 1425 examples. Our algorithm outperforms the current state-of-the-art method for this classification task. Of the 1425 training examples, only 577 were used as support vectors by the support vector machine, which indicates that a large-margin linear classifier exists in a large feature space. Our method allows the fast and accurate prediction of all allowed coreceptor usages, namely CCR5, CXCR4 and CCR5-and-CXCR4. We implemented a web server to perform automatic classification through a CGI interface. This web server is available at http://genome.ulaval.ca/hiv-dskernel .
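To give a feel for string kernels, here is the blended spectrum kernel mentioned above, which compares two sequences by counting shared substrings of every length from 1 to p (this is the standard baseline kernel, not the distant segments kernel, and all weights are set to 1; the example sequences are arbitrary):

```python
from collections import Counter

def blended_spectrum_kernel(x, y, p):
    """Sum over substring lengths 1..p of the inner product of
    substring-count vectors: k(x, y) = sum_s count_x(s) * count_y(s)."""
    total = 0
    for length in range(1, p + 1):
        cx = Counter(x[i:i + length] for i in range(len(x) - length + 1))
        cy = Counter(y[i:i + length] for i in range(len(y) - length + 1))
        total += sum(cx[s] * cy[s] for s in cx)
    return total

# Two amino-acid fragments differing at one position:
# 4 shared 1-mers (C, T, R, P) + 3 shared 2-mers (CT, TR, RP) = 7.
print(blended_spectrum_kernel("CTRPN", "CTRPG", p=2))  # → 7
```

Because the kernel only counts substrings, no alignment is required, which is precisely what lets this family of methods handle sequences with many indels.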
Support vector machines and string kernels have broad applicability in bioinformatics, for tasks such as remote protein homology detection, gene finding, and clustering. Furthermore, kernels are not limited to bioinformatics; they can also be applied to many tasks in chemoinformatics, such as virtual screening trials.
(text from Robert Cedergren Bioinformatics Colloquium 2008)
Student: Sébastien Boisvert
Collaborators: Mario Marchand and François Laviolette
Paper: Retrovirology (2008) doi:10.1186/1742-4690-5-110