Bioinformatics Techniques and Resources
The Problems
1. How to cluster sequences on the basis of sequence similarity?
We have the set of TF binding sites in this file here. We need to make sequence alignments. This will be done using EMBOSS tools, which I installed on my mac. Pairwise alignments between two sequences can be done in several different ways.
1. Dotplots
2. Global alignments. Compare two sequences over their entire lengths, appropriate where sequences are expected to share similarity over whole length. Needle (EMBOSS) implements the Needleman-Wunsch algorithm for global alignment.
3. Local alignment, searches for regions of local similarity and need not include the entire length of the sequences. The EMBOSS program water is a rigorous implementation of the Smith Waterman algorithm for local alignments.
To read a sequence into EMBOSS, it must be in a particular file format. Fasta is the easiest for writing a file with multiple sequences. So first I convert the RegulonDB transcription factor binding sequences into a Fasta file using Perl, see below..
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
open (DATA, "TFBindingSites.txt") || die "Couldnt open TFBindingSites.txt";
open (MYFILE, ">TFBS.fasta");
open (MYFILE2, ">names.txt");
while (<DATA>) { #Go through OperonSetText file
chomp;
my (undef,$TF, $name, undef, undef,undef, undef,undef, undef, $seq) = split(/\t+/);
if($seq){
print MYFILE "\n\>$name\t$TF\n$seq";
print MYFILE2 "\n$name";
}
}
close(MYFILE);
close(MYFILE2);
This sort of simple perl file is invaluable. Now, the water algorithm in EMBOSS can be called with this file and the ID of sequences in this file as follows. Using the line.
water TFBS.fasta:ECK120015699 TFBS.fasta:ECK120015692
The following Perl program runs water to get all pairwise alignments and reads the output file to extract the alignment scores.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
#Call water several times and process the data files produced.
open (DATA, "names.txt") || die "Couldnt open names.txt";
open (DATA2, "names.txt") || die "Couldnt open names.txt";
open (MYFILE, ">distances.txt") || die "Couldnt open distances.txt";
my $a;
my $b;
while (<DATA>) {
$a = $_;
chomp($a);
open (DATA2, "names.txt") || die "Couldnt open names.txt";
while (<DATA2>){
$b = $_;
#print "testing $a and $b\n";
chomp($b);
my $command = "water TFBS.fasta:$a TFBS.fasta:$b -gapopen 10 -gapextend 0.5 -outfile output.txt\n";
system($command);
open (DATA3, "output.txt") || die "Couldnt open output.txt";
while (<DATA3> ) {
if(/# Length:([\s\d\.]+)\s+/){
print MYFILE "$a\t$b\t";
print MYFILE "$1\t";
}
if(/# Identity:([\s\d\/]+)\s+/){
print MYFILE "$1\t";
}
if(/# Similarity:([\s\d\/]+)\s+/){
print MYFILE "$1\t";
}
if(/# Gaps:([\s\d\/]+)\s+/){
print MYFILE "$1\t";
}
if(/# Score:([\s\d\.]+)\s+/){
print MYFILE "$1\n";
}
}
close(DATA3);
}
close(DATA2);
}
close(DATA);
close(MYFILE);
This takes a long time because the ALL-to-ALL matrix is very large. The above program takes some of the sequence at the sides of the TFBS present in the regulonDB file. The program below strictly only takes the capitalized seqence from the RegulonDB file of TFBSs.
We are left with a matrix of alignment scores. Distance between two sequences is proportional to 1/Alignment Score. This gives a distance matric when can then be visualized using multidimensional scaling methods.
References
1. Important background. Much of GRN evolution seems to be by gene duplication. See Sarah Teichmann's work and data here.
2. Babu seems to have done a huge amount of work on the evolution of trancription networks, see his webpage here for a complete publication list of his work. [Refer to 'Papers' notes on evolution of evolvability for further information.]