Some Notes on SOLiD Data Analysis: ChIP-seq

ChIP-seq is probably the only technique available today to ‘map’ the genome-wide occupancy of a protein. There are some important considerations in these studies. The most important concerns are well discussed in a recent paper by Teytelman et al. (August, 2009) (here). The key finding of this paper is that the mere presence of chromatin can bias the coverage and the detected reads, thus skewing the final interpretation. This work shows that any study of such global patterns will be affected by the inherent chromatin state.

Having said this, let us walk through the steps in ChIP-seq analysis. The analysis of ChIP-seq data has two components:

  1. Determining the ‘locations’ of the reads. This can be extended by generating meta-data for each location, such as the nearest genes, the nearest structural and/or functional segments of the genome, etc.
  2. Generating a ‘motif’ for the putative binding site.
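The ‘nearest gene’ meta-data from the first component can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene table, the gene names and the helper functions `build_index` and `nearest_gene` are all hypothetical, and distance is measured to the gene start only.

```python
import bisect

# Hypothetical gene table: (chromosome, start, end, name).
# In practice this would come from an annotation file.
GENES = [
    ("chr1", 1000, 5000, "geneA"),
    ("chr1", 20000, 26000, "geneB"),
    ("chr2", 500, 900, "geneC"),
]

def build_index(genes):
    """Group gene start positions by chromosome and sort them."""
    index = {}
    for chrom, start, _end, name in genes:
        index.setdefault(chrom, []).append((start, name))
    for chrom in index:
        index[chrom].sort()
    return index

def nearest_gene(index, chrom, pos):
    """Return (name, distance) for the gene start closest to pos, or None."""
    starts = index.get(chrom)
    if not starts:
        return None
    # Find the insertion point, then compare the neighbours on either side.
    i = bisect.bisect_left(starts, (pos, ""))
    candidates = starts[max(i - 1, 0): i + 1]
    start, name = min(candidates, key=lambda s: abs(s[0] - pos))
    return name, abs(start - pos)

index = build_index(GENES)
print(nearest_gene(index, "chr1", 18000))
```

A real annotation step would also consider gene ends and strand, but the lookup idea stays the same.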

The basic steps in the analysis of ChIP-seq data are:

  1. Determining the locations of the reads and converting this information into a format that can be visualized. As discussed in the previous post, the tools for this are part of the tool-kit developed by ABI for SOLiD. This information can be loaded as a ‘Custom Track’ in the UCSC Genome Browser (for a comprehensive tutorial on how to use the UCSC Genome Browser, please visit their home page at http://genome.ucsc.edu).
  2. The putative binding sites needed for motif detection are identified with the MACS tool box (MACS itself finds enriched regions, or peaks, rather than motifs). This tool box is described by Zhang et al., Genome Biology (here). The paper describes the tool in detail, and the help available with the tool is also very good.
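To give an idea of what a ‘Custom Track’ looks like, here is a minimal Python sketch that writes read locations in BED format with a UCSC track line. It is not the ABI tool itself. The function name `to_bed_track`, the track name and the input tuples are assumptions made for illustration, and the input coordinates are assumed to be 1-based and inclusive, as in GFF.

```python
def to_bed_track(hits, name="SOLiD_ChIP", description="ChIP-seq read locations"):
    """Build the text of a UCSC custom track in BED format.

    hits: iterable of (chrom, start, end, strand) tuples with
    1-based, inclusive coordinates (GFF style).
    """
    # Track line: visibility=1 means 'dense' in the UCSC browser.
    lines = ['track name="%s" description="%s" visibility=1' % (name, description)]
    for n, (chrom, start, end, strand) in enumerate(hits, 1):
        # BED is 0-based and half-open, so only the start is shifted by one.
        fields = [chrom, str(start - 1), str(end), "read%d" % n, "0", strand]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"

print(to_bed_track([("chr1", 101, 135, "+"), ("chr1", 220, 254, "-")]))
```

The resulting text can be pasted into, or uploaded through, the custom track page of the browser.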


Some Notes on SOLiD Data Analysis

This post will try to enumerate a few common steps carried out during SOLiD data analysis. I will try to be as detailed as possible, but this is going to be a long post and will be updated often.

SOLiD data is read in ‘color-space’. For most common analyses we need the sequences in base-space (as nucleotides). The reading of the color-space data and its matching to the reference genome is carried out on the on-machine cluster. It is an absolute requirement that the reference sequence be given at the start of the SOLiD run. It is also possible to install the corona_lite tool box, available here. The corona_lite tool box contains many tools and accessory scripts that make your life easier. They also come with extensive documentation, although reading it takes a lot of time. It is usually most helpful to have a Linux machine for running these tool boxes.
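To make the color-space idea concrete, here is a small Python sketch of how a color-space read translates to bases. It is not part of corona_lite; the function name `decode_colorspace` is my own. It relies on the standard two-base encoding, in which each color (0 to 3) describes the transition from one base to the next, and on the read starting with a known primer base (usually T) that is not part of the sequence itself.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3, so that next_base = previous_base XOR color

def decode_colorspace(read):
    """Translate a color-space read such as 'T0123' into bases.

    The leading primer base is used to start decoding and is then dropped.
    """
    prev = BASES.index(read[0].upper())
    out = []
    for ch in read[1:]:
        if ch not in "0123":
            # A missing color ('.') makes every following base undetermined.
            out.append("N" * (len(read) - 1 - len(out)))
            break
        prev ^= int(ch)
        out.append(BASES[prev])
    return "".join(out)

print(decode_colorspace("T0123"))
```

Note that with this naive decoding a single wrong color changes every base after it, which is one reason the matching to the reference is done in color-space rather than after translation.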

Let us now run through the basic steps of the analysis pipeline:

  1. The data is obtained as a .csfasta file.
  2. The color-space data is matched to the reference genome in color-space to obtain the .ma.<readlength>.<allowedmismatches> file.
  3. There is a tool that converts the .ma file into a GFF file. This file format contains the location information but not the sequence in base-space.
  4. There is another tool to ‘annotate’ the GFF file. This tool converts the GFF file into an ‘annotated’ GFF file, which now also contains the sequence information in base-space.
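Once the GFF file is available, the location information is easy to pull out with a short script. The Python sketch below assumes the usual nine tab-separated GFF columns and `key=value` attributes separated by semicolons. The function name `parse_gff_line`, the example line and its attribute keys (`id`, `b`) are made up for illustration, so check them against the real files.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)

def parse_gff_line(line):
    """Parse one tab-separated GFF line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    attrs = {}
    if len(fields) > 8:
        # Assumed attribute layout: key=value pairs separated by ';'
        for item in fields[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key.strip()] = value.strip()
    return GffRecord(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], fields[7], attrs)

# Hypothetical example line, not taken from a real SOLiD run.
rec = parse_gff_line("chr1\tsolid\tread\t101\t135\t.\t+\t.\tid=r1;b=ACGT")
print(rec.seqname, rec.start, rec.end, rec.strand, rec.attributes)
```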

Up to and including these four steps the analysis is identical; after that there are specific tools for the analysis of ChIP-seq, whole-transcriptome, small RNA data, etc.

We will discuss the tools for ChIP-seq next.