pCal -- An Efficient Method for Statistical Significance Calculation of Transcription Factor Binding Sites

Authors: Ziliang Qian <zlqian@sibs.ac.cn>; Lingyi Lu <lylu@sibs.ac.cn>

Bioinformatics Center

Key Laboratory of Molecular System Biology

Shanghai Institutes for Biological Sciences

Chinese Academy of Sciences

Version: 0.1
Date: 02.02.2007

·   Overview

Various statistical models have been developed to describe the DNA-binding preferences of transcription factors. These models assign scores by which putative transcription factor binding sites (TFBS) are identified. The statistical significance of these scores, usually expressed as a p-value, plays a critical role in this identification. We developed an efficient algorithm that calculates the statistical significance precisely and reduces the time complexity from exponential to linear. The algorithm applies to a wide range of models, from the widely used position weight matrix (PWM) models to more complicated Bayesian network models. pCal is an implementation of this algorithm in C++.

First of all, pCal produces a null distribution of scores for a given PWM. Based on this null distribution, pCal can either calculate the p-value for each nucleotide of an input DNA sequence or provide a score cutoff for a given p-value. For a more detailed discussion of the algorithm, please refer to the paper: An Efficient Method for Statistical Significance Calculation of Transcription Factor Binding Sites, Ziliang Qian, Lingyi Lu, Liu Qi, Yixue Li, 2007.

·   Binaries

The program is free for scientific use. Please contact me if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. If you use pCal in your scientific work, please cite it as:

·         An Efficient Method for Statistical Significance Calculation of Transcription Factor Binding Sites, Ziliang Qian, Lingyi Lu, Liu Qi, Yixue Li, 2007.

I would also appreciate it if you sent me (a link to) your papers so that I can learn about your research. The implementation was developed on Linux with gcc. The binaries are available at the following location for different platforms, with examples also included:

·   How to use

pCal consists of a formatter module (formatter), a null distribution generator module (genDist), and a final report module that offers a choice of two executable programs (pValue or cutoff). The formatter module and the null distribution generator module must be run, in that order, before the final report module is executed.
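The order of the steps can be sketched as a small Python driver that uses only the command lines documented in the rest of this section. The function names `pcal_commands` and `run_pipeline` are hypothetical, the binaries are assumed to be in the current directory, and the default values `-t 0 -f 0` are chosen only for illustration.

```python
import subprocess


def pcal_commands(matrix_in, sequence_fasta, matrix_type=0, score_scheme=0):
    """Build the argument lists for formatter, genDist and pValue, in run order."""
    matrix_pcal = "matrix_pcal"
    return [
        ["./formatter", "-i", matrix_in, "-o", matrix_pcal,
         "-t", str(matrix_type), "-f", str(score_scheme)],
        ["./genDist", "-i", matrix_pcal, "-o", "matrix.pcal.dp"],
        ["./pValue", "-i", sequence_fasta, "-m", matrix_pcal],
    ]


def run_pipeline(commands):
    """Run each step in turn; check=True stops at the first failing step."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = pcal_commands("matrix.jaspar", "binding_site.fasta")
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline(commands)` would execute the three programs; to obtain a score cutoff instead, the last entry would be replaced by the `./cutoff -p pvalue -m matrix_pcal` call.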

  • formatter converts a TRANSFAC/JASPAR style count matrix into the file format accepted by pCal. It is called with the following parameters:

./formatter -i inputmatrix_file -o matrix_pcal -t [0,1] -f [0,1]

The input profile inputmatrix_file may contain raw counts or normalized frequencies (each column sums to 1), but it must not already be in the form of a PWM. It can be given in the JASPAR matrix format, in which each line holds the profile of one nucleotide (see http://mordor.cgb.ki.se/jaspar2005//TEMPLATES/help.htm#browse ):

A  |        0  3 79 40 66 48 65 11 65  0
C  |       94 75  4  3  1  2  5  2  3  3
G  |        1  0  3  4  1  0  5  3 28 88
T  |        2 19 11 50 29 47 22 81  1  6
 
or in the TRANSFAC matrix format, in which each column holds the profile of one nucleotide ( see http://www.gene-regulation.com/pub/databases/transfac/doc/toc.html ):
 
   A   C   G   T
   0  97   1   2
   3  77   0  20
  81   4   3  12
  41   3   4  52
  68   1   1  30
  49   2   0  49
  67   5   5  23
  12   2   3  83
  67   3  29   1
   0   3  91   6

Example input files are provided as matrix.transfac and matrix.jaspar.
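To make the difference between the two layouts concrete, the following Python sketch reads either one into a list of per-position count dictionaries. It is only an illustration: the function names `parse_jaspar` and `parse_transfac` are hypothetical and not part of pCal, and the sketch assumes the plain layouts shown above (a `|` after the nucleotide letter in JASPAR rows, an `A C G T` header line in TRANSFAC files).

```python
def parse_jaspar(text):
    """JASPAR layout: one row per nucleotide, one column per position."""
    rows = {}
    for line in text.strip().splitlines():
        base, counts = line.split("|")
        rows[base.strip()] = [float(x) for x in counts.split()]
    length = len(rows["A"])
    return [{b: rows[b][i] for b in "ACGT"} for i in range(length)]


def parse_transfac(text):
    """TRANSFAC layout: header 'A C G T', then one row per position."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, map(float, line.split()))) for line in lines[1:]]


# Toy data: the first three positions of the JASPAR example, in both layouts.
jaspar = """
A  |  0  3 79
C  | 94 75  4
G  |  1  0  3
T  |  2 19 11
"""
transfac = """
A  C  G  T
0 94  1  2
3 75  0 19
79  4  3 11
"""
assert parse_jaspar(jaspar) == parse_transfac(transfac)
```

Both parsers return the same structure, so any later step can work with one representation regardless of the `-t` choice.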

The output file matrix_pcal is in the file format accepted by pCal: each line gives the score for one position and one nucleotide. For example, the first four lines of the table below show that the scores at the first position (number 0) are -10 when the nucleotide is ‘A’, -0.0314 when the nucleotide is ‘C’, -4.5747 when the nucleotide is ‘G’, and -3.8816 when the nucleotide is ‘T’.

0               A       -10
0               C       -0.0314
0               G       -4.5747
0               T       -3.8816
1               A       -3.4761
1               C       -0.2572
1               G       -10
1               T       -1.6303
2               A       -0.2053
2               C       -3.1884
2               G       -3.4761
2               T       -2.1768
3               A       -0.8858
3               C       -3.4761
3               G       -3.1884
3               T       -0.6627
4               A       -0.3851
4               C       -4.5747
4               G       -4.5747
4               T       -1.2074
5               A       -0.7035
5               C       -3.8816
5               G       -10
5               T       -0.7246
6               A       -0.4003
6               C       -2.9653
6               G       -2.9653
6               T       -1.4837
7               A       -2.1768
7               C       -3.8816
7               G       -3.4761
7               T       -0.1803
8               A       -0.4003
8               C       -3.4761
8               G       -1.2425
8               T       -4.5747
9               A       -10
9               C       -3.4761
9               G       -0.0974
9               T       -2.7830

An example output file is provided as matrix.pcal.

 

The -t and -f options of formatter are:

   -t   - type of input matrix:
            0: JASPAR type of matrix
            1: TRANSFAC type of matrix

   -f   - type of score scheme: (see [Joachims, 1999c], [Joachims, 2002a])
            0: MATCH score scheme
            1: TRANSFAC type of matrix
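The example scores in the output table above are numerically consistent with the natural logarithm of each column-normalized frequency of the JASPAR example matrix, with zero counts given a score of -10 (for instance, ln(94/97) is about -0.0314). The Python sketch below reproduces the first eight rows under that assumption. This is inferred from the example numbers only; which -f scheme produced them and how formatter treats other inputs is not stated here, and the function name `log_frequency_scores` is hypothetical.

```python
import math


def log_frequency_scores(columns, floor=-10.0):
    """columns: list of {'A': count, ...}; returns (position, base, score) rows."""
    rows = []
    for pos, col in enumerate(columns):
        total = sum(col.values())
        for base in "ACGT":
            count = col[base]
            # Assumption: a zero count gets a fixed floor score instead of log(0).
            score = math.log(count / total) if count > 0 else floor
            rows.append((pos, base, score))
    return rows


# First two positions of the JASPAR example matrix.
columns = [
    {"A": 0, "C": 94, "G": 1, "T": 2},
    {"A": 3, "C": 75, "G": 0, "T": 19},
]
for pos, base, score in log_frequency_scores(columns):
    print(f"{pos}\t{base}\t{score:.4f}")
```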
  • genDist generates the null distribution of scores for the above matrix. It is called with the following parameters:

                  

            ./genDist -i matrix_pcal  -o matrix.pcal.dp 

 

The input file matrix_pcal is the file produced by the previous program, formatter. The output file matrix.pcal.dp contains the null distribution of scores for the input matrix matrix_pcal. You will find an example output file at: matrix.pcal.dp.s
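One common way to obtain such a null distribution is dynamic programming over the matrix positions: scores are rounded to a fixed grid, and the distribution after k positions is built from the distribution after k-1 positions by adding each nucleotide's score, weighted by its background probability. The Python sketch below illustrates this general technique. It is not taken from the pCal source; the uniform background (0.25 per nucleotide), the grid step, and the function name `null_distribution` are all assumptions.

```python
from collections import defaultdict


def null_distribution(scores, background=None, step=0.001):
    """scores: one {'A': s, 'C': s, 'G': s, 'T': s} dict per matrix position.
    Returns {total score in grid units: probability}."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumed uniform background
    dist = {0: 1.0}  # before any position: score 0 with probability 1
    for col in scores:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for base, s in col.items():
                nxt[total + round(s / step)] += prob * background[base]
        dist = dict(nxt)
    return dist


# First two positions of the example matrix_pcal table.
scores = [
    {"A": -10.0, "C": -0.0314, "G": -4.5747, "T": -3.8816},
    {"A": -3.4761, "C": -0.2572, "G": -10.0, "T": -1.6303},
]
dist = null_distribution(scores)
assert abs(sum(dist.values()) - 1.0) < 1e-12
```

Each position is processed once, so the cost of this sketch grows with the matrix length times the number of distinct rounded scores, rather than with the 4^L possible sequences of length L.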

  • The final report module consists of two alternative executable programs, pValue and cutoff.

     pValue calculates the p-value for each nucleotide of an input DNA sequence. It is called with the following parameters:
	
	./pValue -i binding_site.fasta -m matrix_pcal

  The input file binding_site.fasta contains a binding site sequence in FASTA format, for example:

>MA0001 AGL3 7
ccCCATAAATAGgaatatcgggatga 
    An example input file is provided as binding_site.fasta.
    The other input file, matrix_pcal, is the file produced by the previous program, formatter.
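As a reference for what such a p-value means, the Python sketch below scores every matrix-length window of a sequence and, for each window, reports the fraction of all possible sequences that score at least as high. It enumerates all 4^L sequences by brute force, which is only feasible for short matrices and is exactly the exponential cost that pCal is designed to avoid. The uniform background, the window-based interpretation of "p-value of each nucleotide", and the function names are assumptions, not a description of pValue's actual output.

```python
from itertools import product


def window_score(scores, window):
    """Sum the per-position scores of one window (case-insensitive)."""
    return sum(col[base] for col, base in zip(scores, window.upper()))


def p_values(scores, sequence):
    """Return (start, window, score, p_value) for every full-length window."""
    width = len(scores)
    # Brute-force null: score all 4**width sequences (uniform background assumed).
    null = [window_score(scores, "".join(w)) for w in product("ACGT", repeat=width)]
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        s = window_score(scores, window)
        p = sum(1 for x in null if x >= s) / len(null)
        results.append((start, window, s, p))
    return results


# First three positions of the example matrix_pcal table.
scores = [
    {"A": -10.0, "C": -0.0314, "G": -4.5747, "T": -3.8816},
    {"A": -3.4761, "C": -0.2572, "G": -10.0, "T": -1.6303},
    {"A": -0.2053, "C": -3.1884, "G": -3.4761, "T": -2.1768},
]
for start, window, s, p in p_values(scores, "ccCCATAAATAGgaatatcgggatga"):
    print(start, window, round(s, 4), p)
```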
  • cutoff provides a score cutoff for a given p-value. It is called with the following parameters:
 
	./cutoff -p pvalue -m matrix_pcal
 
     The input parameter pvalue is a real number greater than 0 and less than 1, and the input file matrix_pcal is the file produced by the previous program, formatter.
          Finally, the program reports on the terminal the score cutoff that corresponds to the given p-value and matrix.
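Given a null distribution, a cutoff for a p-value can be found by accumulating probability from the highest score downwards. The Python sketch below shows this idea on a small hand-made distribution. It assumes the cutoff is the lowest score whose upper-tail probability does not exceed the requested p-value; the function name `score_cutoff` and the toy numbers are hypothetical and do not come from pCal.

```python
def score_cutoff(dist, p_value):
    """dist: {score: probability}. Returns the lowest score t such that
    P(score >= t) <= p_value, or None if even the best score is too frequent."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be greater than 0 and less than 1")
    cutoff = None
    tail = 0.0
    for score in sorted(dist, reverse=True):  # walk from the best score down
        tail += dist[score]
        if tail > p_value:
            break
        cutoff = score
    return cutoff


# Toy null distribution; the probabilities sum to 1.
dist = {-0.5: 0.015625, -1.25: 0.046875, -3.0: 0.1875, -9.0: 0.75}
print(score_cutoff(dist, 0.05))
print(score_cutoff(dist, 0.1))
```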
 
 

·   Disclaimer

This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.

·   References

Last modified Feb 2nd, 2007 by Lingyi Lu  lylu@sibs.ac.cn