PART 1: ARE GRN MOTIFS ADAPTATIONS FOR EVOLVABILITY?
In this page I explore the issue of evolvability in bacteria. Evolvability means the ability to produce adaptive varients. Are there adaptations in bacteria that increase their probability of producing adaptive varients. What are the general principles of evolvability?
One possibility is that network topology may be optimized for evolvability, as well as for developmental reasons. Various gene regulatory topological motifs in bacteria have been discovered to be over-expressed in comparison to 'random' controls (S Mangan, U Alon) Are there any statistically significant differences in the types of genes present at the ends of FFLs (feed forward loops) and BIFANs? If so, are those genes at the end of these motifs subject to variable directed selection to a greater extent than genes elsewhere?
Similar questions (relating gene ontology to gene regulatory network topology) have been asked previously and so we should see how those authors went about things. For example, Hahn, Conant and Wagner (2003) claim to have found "no correlation between highly connected proteins and evolutionary rate... in the E. coli metabolic network and ...only a weak correlation in the yeast protein-interaction network". Daniel Promislow has written "A Regulatory Network Aalysis of Phenotypic Plasticity in Yeast" looking at the interaction between gene regulatory network topology and phenotypic plasticity".
Techniques and Methodology
I would like to obtain the topology and gene ontology data in an easy to use format, such that I can get a feeling for the system.
1. The gene ontology database. See below for how to use various databases to get GO information about particular genes in operons.
2. Motif detection software from Uri Alon and R. Milo's team. mFinder and mDraw are very easy to use motif detection tools. The Mfinder tool is used to get a list of the nodes in motifs, using the -omem command line extension to generate a MEMBERS graph. The following Python & Perl code can be used to extract the relavent information from the two files to generate a list of nodes at the ends of various motifs obtained from the OUT and MEMBERS files of Mfinder. [Click here to see my Perl programming notes.].On this page you can find a Perl program to take the Motif data from mfinder and geneate random controls from this data.
Bacterial Adaptation to Changing Envionments
There are two mechanisms by which bacteria can adapt to changing environments,
1. Changing gene expression. Repertoire of programmed responses is limited so how does an organism adapt to novel challanges for which there is no evolved sensing mechanism? A signaling system may be implausible for some kinds of enviornmental change.
2. Genotype change.
SSRs located in the reading frames of genes or their promotors, mediate high frequency, reversible, genotypic switching by slipped strand mispairing. This might cause a frameshift if th SSR is in the reading frame. SSRs in promotors alter the binding of the RNA polymerase e.g. bifA and bifB pilus biogeneis genes. (96). This is more like a tuning mechnaism rather than an ON-OFF switch. There is also evidance that TF binding can be altered (51)!!!!

Moxon, Bayliss and Hood, 2006. list the genes that have these SSRs, and propose they increase fitness in fluctuating environments. These genes are called CONTINGENCY genes, often doing something like attachment. Second order selection is required for the evolution of DNA repair, or replication processes, otherwise there is no selection pressure for these to evolve, since they do not benefit the individual. Is second order selection the same as the evolution of evolvability?
Stochastic switching at these loci can be combinatorial. Phase variation of cell surface determinents in bacterial pathogens has been observed (74, 86, 100), on off reversible switching of antigenic varients in flagella for example. (4), or lipopolysaccaride biosynthesis, iron aquasition, restriction modification systems (105) that may allow aquasition of foreign DNA!!!

The genetic basis of the rapid switching is recombination dependent inversion of a DNA element (82), gene conversion,, transposition, methylation. SSRs are a novel explanation for phase variation.
Why isnt the gene switched off by SSRs lost by neutral drift? How long would this take to happen? When would the change become irreversible?
"In most bacteria there is positive selection against th presence of SSRs (1). "
Switching rate can be adusted by changing the length of the SSR.
The evolution of SSRs is very interesting. There is good theoretical work showing what the optimal mutation rate should be in a fluctuating environment.


The last reference above talks of how SSRs CAN alter TFs. This is really interesting. Are other mutations at promotors too slow???
Using GO Databases
Uri Alon's E.Coli motif data
This data can be downloaded here. See the sample below...
1 aceBAK
2 acnA
3 acrAB
4 acrR
5 acs
6 ada_alkB
7 adhE
8 adiA
9 adiA_adiY
10 ahpCF
11 aidB
12 alaWX
13 aldB
14 alkA
15 alpA
16 ansB
17 appCBA
18 appY
19 araBAD
20 araC
This contains the names of all the operons that Alon's team have used to detect motifs in the E.Coli GRN. We wish now to associate each of these operon names with a LIST of functional annotations, either the MultiFunc annotations or the GO annotations. We have the following different means of getting the gene ontology data for each of these operons.
Method 1: Ecocyc website based manual retrieval
You could go through the transcription units in ECOCYC here searching for the entries in Alon's data above. For example, here is the entry for aceBAK. This page itself does not contain the GO data for the genes contained in the operon. To get that you have to click on the little picture of the gene, e.g. aceB, aceA or aceK. This page is like the one below...

This shows the synonyms of the gene, the MultiFunc superclasses, and the Go clases. Note there are multiple possible functions for one gene. If one were doing this process manually, presumably one would produce ones OWN flat database, e.g. in Excel, like this one here. This seems terribly tedious, and there must be a better way. For example, it would be splendid if we could get some database with the information contained in the above two pages, and extract it using a Perl program. Lets try to find these files now.
Method 2: Ecocyc flatfile databases
The following flat files (here) are available for download after online registration. This comes in a very large directory called ecocyc-flatfiles.tar, which contains the files. One of the easiest file types to use may be the tabular (tab-delimited) files here. For example, genes.col lists the the gene names AND the gene ontology for that gene. Let us see if we can find the genes aceBAK in this file! We search the file in BBEdit (a text editor).

A search for the operon fails of-course, since this is a database of genes not operons, but the individual genes turn up in a search for aceB, aceA, and aceK. The above snippet shows the line associated with each of these genes. This is the order of entries in each line.
UNIQUE-ID .......... this is EG10023 in the case of aceB.
BLATTNER-ID (E. coli PGDB only; column omitted otherwise) ...... b4014 in the case of aceB.
COMMON-NAME ...... aceB in the case of aceB.
PRODUCT-NAME ....... malate synthase A (an enzyme) in the case of aceB.
SWISS-PROT-ID ....... P08997 in th case of aceB.
REPLICON UNIQUE-ID ......COLI-K12 (which is where the gene is found in the case of aceB).
START-BASE ....... 4213501
END-BASE ....... 4215102
SYNONYMS (4)..... Alternative names ECK4006 b4014 in the case of aceB.
GENE TYPE (4) ...... (I think two tabs seperate GENE TYPE from SYNONYMS) glyoxylate bypass cytoplasm (I think there is tab delimitation between the different types within GENE TYPE).
What is this gene type classification? It seems to be the standard GO catagorization. The mapping from the multiFun to the GO classifications can be downloaded here. For example, aceB has the GO classification cytoplasm, which appears here
MultiFun:7.1 Cytoplasm > GO:cytoplasm ; GO:0005737
in the above file. It is the same as MultiFun:7.1 It also has a glyoxylate bypass classification which looks like this
MultiFun:1.7.2 Glyoxylate bypass > GO:glyoxylate cycle ; GO:0006097
in the file here. We now need to annotate Alon's file in as complete a way as possible, automatically, by using the above ECOCYC tab delimited file. To do this we need a list of all the genes in each operon mentioned in Alon's list. The first part of the Perl program should go through each operon, storing the genes present in that operon in a data structure. RegulonDB is the thing to use here. It contains the following file. This file has the following structure.

Of-course, there is some gumph at the top, but this can be removed by cutting out of the text file with a text editor. The operon name is on the left, see aceBAK appears! Next is the number of genes in that operon, followed by which way it is encoded on the strand, and then the gene names in two formats in the form x1|y1,x2|y2, etc.... x1 being the name we are interested in.
The Perl script below takes the above file and constructs a hash table from an operon key to a list of genes, and writes this to a file. Refer to my Perl notes here. Actually, this Perl script was written by stevexf, a programmer at tek-tips in response to a question I sent. This is an excellent site if you get stuck on something.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %operons;
while (<>) {
chomp;
my ($operon, undef, undef, $genes) = split(/\s+/);
my @items = split(/,/, $genes);
foreach (@items) {
my $gene = (split(/\|/))[0];
push @{$operons{$operon}}, $gene;
}
}
print "$_:", join(',', @{$operons{$_}}), "\n" foreach (sort keys %operons);
print Dumper(%operons);
How does it work? In the main loop, the file is chomped, and variables $operon and $genes are assigned the strings returned from the positions 0 and 3 in the split output. This is possible due to the regularity of the first three columns in the array data. The genes will always appear in the 4th column. Nice regularity. The 4th column is then split again using comma delimiters and the individual gene|gene names stored in the items array. Then steve goes through the items array splitting it at | and keeping the first name, which is the gene name we want. Then he pushes the $operon and the $gene into the operon hash. Finally, the output is made in two ways, one with a print, and one with a data dumper module.
Here is another way of doing it, submitted by GrandFather at Perl Monks, another site for getting help on Perl. These guys are amazing. And its a great way to learn. Try to write your code, if you have difficulties with it, post it for help. This time a next unless with a different kind of regEx is used to test for the right line structure consiting of spaces and NOT spaces in the right order, anchored to the start and end of a line. Memory parenthesise are used and transfered to named variables. The array @genes is assigned the value of split applied to the second memory paranthesis string. The hash table $operons is made for this operon, with empty value, with an or= ???? Whats happening with the operons hash with two consecutive curley brackets???
use strict;
use warnings;
my %operons;
while (<>) {
chomp;
next unless /^(\S+)\s+\S+\s+\S+\s+(\S+)$/;
print " in loop \n";
my ($operon, $targets) = ($1, $2);
my @genes = split /,/, $targets;
$operons{$operon} ||= {};
for (@genes) {
my ($subOperon, $gene) = /([^|]*)\|(.*)/;
$operons{$operon}{$subOperon} = $gene;
}
}
for my $operon (sort keys %operons) {
print "$operon: ",
join (', ', map {"$_ ($operons{$operon}{$_})"} sort keys %{$operons{$operon}}),
"\n";
}
And here is yet another way of doing it, by theFox at PerlMonks, but which seems to be problem with the first split.
#!/perl/bin/perl -w
use strict;
use Data::Dumper;
my $hash = {};
for (<DATA>) {
chomp;
my @fields = split(/\s{4}/, $_);
my @genes = split(',', $fields[3])
;
@genes = map { (split('\|'))[1] } @genes;
push(@{$hash->{$fields[0]}}, \@genes);
}
print Dumper($hash);
Next Make a Hash Table with Key = Gene, Value List = GO Ontology.
We read the file into excell as a tab delimited file. We found the gene name is always in column C, and the GOs are always after column L. This makes reading it possible. The Perl script below makes a hash table containing the ontology data as values of gene names.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %ontologyHash = ();
my $gene;
my @line;
my @sliceOntology;
while (<>) {
chomp;
@line = split(/\t+/); #Get tab deliminted components into an array.
my (undef, undef, $gene) = @line; #Get gene name from line array.
#print $gene . " ";
@sliceOntology = @line;
@sliceOntology = @sliceOntology[12..$#sliceOntology]; #Keep only those after the 10th one.
#print " $#sliceOntology @sliceOntology \n ";
push @{$ontologyHash{$gene}}, @sliceOntology;
#$ontologyHash{$gene} = \@sliceOntology;
}
print "$_:", join(',', @{$ontologyHash{$_}}), "\n" foreach (sort keys %ontologyHash);
#print Dumper(%ontologyHash);
Making an Operon => Ontology Hash on Alon's Data
Now we need to combine the above two hash tables using Uri Alon's file of operons that takes a reference number for an operon and gives the GO ontologies of all genes in that operon, i.e. associated with that reference number. Go through Uri Alon's file, for each operon get the genes, for each gene get the GOs, make a new hash table.
The final perl code is this...
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %operons;
my %ontologyHash = ();
my $gene;
my @line;
my @sliceOntology;
open (DATA, "OperonSetText") || die "Couldnt open operonSet";
while (<DATA>) {
chomp;
my ($operon, undef, undef, $genes) = split(/\s+/);
my @items = split(/,/, $genes);
foreach (@items) {
my $gene = (split(/\|/))[0];
push @{$operons{$operon}}, $gene;
}
}
#print "$_:", join(',', @{$operons{$_}}), "\n" foreach (sort keys %operons);
#print Dumper(%operons);
open DATA2, "ontology.txt";
while (<DATA2>) {
chomp;
@line = split(/\t+/); #Get tab deliminted components into an array.
my (undef, undef, $gene) = @line; #Get gene name from line array.
#print $gene . " ";
@sliceOntology = @line;
@sliceOntology = @sliceOntology[12..$#sliceOntology]; #Keep only those after the 10th one.
#print " $#sliceOntology @sliceOntology \n ";
push @{$ontologyHash{$gene}}, @sliceOntology;
#$ontologyHash{$gene} = \@sliceOntology;
}
#print "$_:", join(',', @{$ontologyHash{$_}}), "\n" foreach (sort keys %ontologyHash);
#print Dumper(%ontologyHash);
open (DATA3, "coliInterFullNames.txt") || die "Couldnt open 3";
my %opToGo;
while (<DATA3>) {
chomp;
my ($num, $op) = split(/\s+/);
#print "$num $op\n";
if($operons{$op}){
my @genes = @{$operons{$op}};
foreach (@genes){
if($ontologyHash{$_}){
#print " @{$ontologyHash{$_}} , " ;
push @{$opToGo{$op}}, "@{$ontologyHash{$_}} " ;
}
}
}
}
my $k;
foreach $k (sort (keys (%opToGo)))
{
print $k, "=> ", @{$opToGo{$k}}, "\n";
}
Here is an example of the file we can obtain from the above Operon => GO hash using the dumper function.
$VAR1 = 'pheV';
$VAR2 = [];
$VAR3 = 'slp';
$VAR4 = [];
$VAR5 = 'malZ';
$VAR6 = [
'cytoplasm'
];
$VAR7 = 'cynR';
$VAR8 = [
'operon',
'activator'
];
$VAR9 = 'sdaA';
$VAR10 = [];
$VAR11 = 'gltBDF';
$VAR12 = [
'glutamate',
'nitrogen metabolism',
'glutamate',
'nitrogen metabolism',
'nitrogen metabolism',
'operon',
'periplasm'
];
Each operon e.g. gltBDF is associated with a set of GOs. However, I prefer to convert the GO ontologies to the highest level of the MultiFun Hierarchy, using the file here which is of the form...
MultiFun:4.S.143 Ni++ > GO:nickel ion transport ; GO:0015675
MultiFun:4.S.143 Ni++ > GO:nickel ion transporter activity ; GO:0015099
MultiFun:4.S.144 nicotinamide mononucleotide > GO:nicotinamide mononucleotide transport ; GO:0015890
MultiFun:4.S.144 nicotinamide mononucleotide > GO:nicotinamide mononucleotide transporter activity ; GO:0015663
MultiFun:4.S.145 nitrite > GO:nitrite transport ; GO:0015707
MultiFun:4.S.145 nitrite > GO:nitrite transporter activity ; GO:0015113
MultiFun:4.S.146 nucleoside > GO:nucleoside transport ; GO:0015858
MultiFun:4.S.146 nucleoside > GO:nucleoside transporter activity ; GO:0005337
MultiFun:4.S.146 nucleoside > GO:nucleoside transport ; GO:0015858
MultiFun:4.S.147 nucleoside/H+ > GO:nucleoside transport ; GO:0015858
MultiFun:4.S.147 nucleoside/H+ > GO:nucleoside transporter activity ; GO:0005337
MultiFun:4.S.147 nucleoside/H+ > GO:nucleoside transport ; GO:0015858
MultiFun:4.S.148 oligopeptide > GO:oligopeptide transport ; GO:0006857
MultiFun:4.S.148 oligopeptide > GO:oligopeptide transporter activity ; GO:0015198
to convert the current GO catagories to the highest level of the MultiFun catagories listed here. The following perl script is added to the end of the above perl script to achieve this. This program now takes operons and outputs the multifun top catagories associated with them.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %operons;
my %ontologyHash = ();
my $gene;
my @line;
my @sliceOntology;
open (DATA, "OperonSetText") || die "Couldnt open operonSet";
while (<DATA>) {
chomp;
my ($operon, undef, undef, $genes) = split(/\s+/);
my @items = split(/,/, $genes);
foreach (@items) {
my $gene = (split(/\|/))[0];
push @{$operons{$operon}}, $gene;
}
}
#print "$_:", join(',', @{$operons{$_}}), "\n" foreach (sort keys %operons);
#print Dumper(%operons);
open DATA2, "ontology.txt";
while (<DATA2>) {
chomp;
@line = split(/\t+/); #Get tab deliminted components into an array.
my (undef, undef, $gene) = @line; #Get gene name from line array.
#print $gene . " ";
@sliceOntology = @line;
@sliceOntology = @sliceOntology[12..$#sliceOntology]; #Keep only those after the 10th one.
#print " $#sliceOntology @sliceOntology \n ";
push @{$ontologyHash{$gene}}, @sliceOntology;
#$ontologyHash{$gene} = \@sliceOntology;
}
#print "$_:", join(',', @{$ontologyHash{$_}}), "\n" foreach (sort keys %ontologyHash);
#print Dumper(%ontologyHash);
open (DATA3, "coliInterFullNames.txt") || die "Couldnt open 3";
my %opToGo;
while (<DATA3>) {
chomp;
my ($num, $op) = split(/\s+/);
#print "$num $op\n";
if($operons{$op}){
my @genes = @{$operons{$op}};
foreach (@genes){
if($ontologyHash{$_}){
#print " @{$ontologyHash{$_}} , " ;
push @{$opToGo{$num}}, @{$ontologyHash{$_}} ;
}
}
}
}
my $k;
my $multi;
my $cata;
my $cata2;
my %opToCat;
foreach $k (sort (keys (%opToGo)))
{
my @a = @{$opToGo{$k}} ;
foreach my $mat (@a){
#print " *$mat* ";
#Search the multifun2go file for the multifun catgaory associated with this word.
open (DATA4, "multifun2go.txt") || die "Couldnt open 3";
my @line2 = <DATA4>;
foreach $multi (@line2) {
#print " $multi ";
if ($multi =~ /\b$mat\b/) {
$cata = $`;
if($cata) {
if( $cata =~/MultiFun:(\d).+/) {
#print "$k => $1 \n";
push @{$opToCat{$k}}, $1 ;
}
}
}
}
}
#print "\n";
}
print Dumper(%opToCat);
#OpToCat contains a hash of operon number to go Top Level MultiFun Number;
Let us check first how many operons we have been able to find MultiFun catagories for. We do this by writing a spreadsheet file viewable in excel (tab delimited) of the following structure.
(1) Operon number (2) Operon name (3) MultiFunc Top Catagories
This is irritatingly incomplete, even after making alterations to the Perl program to improve the accuracy of regular expression matching. Download the file here. This final version of the code that generated this file is available here.
The above code results in many incomplete entries. Let us check the mappings systematically, First, how good is the mapping between Uri Alon's Operon list and genes? Let us make this mapping work properly first. AFTER SEVERAL RE-WORKINGS THE FINAL CODE CAN BE DOWNLOADED HERE.
This code produces the following file (here) of randomized ontologies based shuffling pairs of operons into and out of the original set that contained FFL motifs identified using mfinder. For example, the original motifs in FFLs are..

These numbers represent operons, each row is one FFL. Each number has been swapped with a randomly chosen number (operon) from the list of Uri Alon. There are 10 of these randomizations in the above file, however, we could produce many more.
Statistical Analysis
Batch Test
Basically, each MultiFun X.Y catagory can be considered to be a ball of a different color. The set of FFL motifs can be considered to be a bucket that contains a set of these balls. Imagine there is a large bucket containing various colors of ball. We want to know how likely it is that the real distirbution does not come from the same distribution os N randomly drawn samples of size 123: (41 x 3) balls (MultiFun catagories) from the bucket. The bucket contains the full list of operons in the network described by Alon, each listed by ontology.The following file gives the frequencies of ontologies found in the original FFL data in the format.
(1) Ontology (2) Frequency
The file downloadable here generates ontology frquencies for the original + N shuffled random models, using the datafile downloadable here.
Which statistical tests should be used?
One approach is to calculate the mean and sd of appearence of a each MultiFun catagory that appears in the original network, in the randomized networks, and see which of these appears with a frequency significantly higher than in randomized networks. A Z-score can be calculated as (Nreal-Nrandom)/SD as a test of statistical significance. This is done for each type of MultiFun catagory. Load these files into EXCEL. The mean and standard deviation of motif appearences in the random files can be calculated directly in the Perl file.
The following mathematica file (here) plots BoxWhiskerPlots with the actual data superimposed. See the figure below.

The ontology codes are as follows. The Z scores are > 3 for 11, 13, 14, which corresponds to 2.2. , 3.1, and 3.3. MultiFun catagories. This corresponds to Regulation Type, RNA related (Transcription related) and Genetic Unit Regulated. It is not unexpected that genes in the GRN of E.Coli should fall into these catagories. (column 20) Transporters 4.9 had a Z-score of 2.7. Doing a chi square test is going to be hard because the number of catagories in actual and expected classes is unlikely to be the same since when drawing randomly, the number of multifun catagories can differ.

Recently a superior dataset has become available, downloadable here. This is likely to be more useful because it gives the multifun catagories directly referenced from the bXXXX gene name. These bXXXX gene names can be obtained from the operon to gene file. The following perl code (download here) uses this new data to conduct an analysis identical to the above.

Again, as with the previous results, only 2.2, 3.1 and 3.3 are overrepresented in the actual data. From the above Z-scores we can get the p-values, i.e. the probability that this sample would be chosen by chance if drawn from the same distribution as the population.
Problems with the Above Statistics
1. We need to use a FDR method (False discovery rate method) to adjust for the fact that Z-scores are being calculated multiple times. This adjusts for the fact that we could get false positives. The following software can be used in R to calculate appropriate Q-values, download here. A tutorial on entering data into R is found here.
The Q values are shown below... (1)
(1)
(2)
Figure (1) Q-values for whole motifs. (2) Q-values for downstream operons in motifs.
This indicates that even taking FDR into account, the transcription related ontologies are overrepresented significantly in FFLs. This is not surprising considering at least two of the nodes in the FFL must be transcriptional!!! We should really only consider the downstream node.
2. We have been using a z-test however, this assumes we know the mean and standard deviation of the population, but we only have a sample of the population (1000 random models). Should a t-test be used?
Analysis of the Downstram Operon in FFLs (Operon Z)
The next investigation looks at only the downstream node in the FFL. The Q-values above (2), are for the downstream gene ontologies only. Again,the TF related ontologies are significantly overrepresented in FFL at the downstream gene, however, in addition, 19 (MultiFun:4.8) is also significantly overrepresented. This is interesting, 4.8. is as follows.
4.8.A Accessory Factors Involved in Transport
4.8.A.1 The Membrane Fusion Protein (MFP) Family
4.8.A.2 The Secretin Auxiliary Lipoprotein (SAL) Family
4.8.A.3 MPA1 Family auxillary transport protein
4.8.A.7 The Phosphotransferase System Enzyme I (EI) Family
4.8.A.8 The Phosphotransferase System HPr (HPr) Family
Why should accessory factors involved in transport be significantly overrepresented in the downstream operons of FFLs? ALSO, it is interesting that even in the downstream component of FFLs there is an overrepresentation of transcription related genes compared with the random model. The BoxWhiskerPlot with superimposed actual multifun frequencies shows the results for the downstream gene, see 11, 13,14 and 19 have significant Q-values. However, there are only 4 instances of 4.8 in the set of genes in the FFL motifs of E.Coli. Even if this is significantly overrepresented according to the NULL model, it does not account for most of the FFL instances.

The overrepresentation of 4.8 is due to the fact that RpoN appears at the end of 4 different FFLs. RpoN contains a gene that has a 4.8 classification. There are only 8 such genes in the genome. This anomoly accounts for why 4.8 is an overrepresented MultiFun catagory in the FFL data. Nothing more. It is interesting that 2.2, 3.1 and 3.3. are overrepresented in the z component. This means that the downstream element itself tends to be itself a factor involved in regulation.
A further control is required...
PART 2 WHAT IS THE ROLE OF WEAK INTERACTIONS IN EVOLVABILITY?
Introduction
We are interested in evolvability, this is a 'variational property' of the phenotype. This is a study of the ORIGIN OF VARIATION. In evolutionary algorithms the 'representation problem' is well known. The 'black art' of evolutionary optimization is getting the representation right. The representation problem is how to code a solution in such as way that random variation and selection can lead to a solution. What are the evolutionary forces that shape the GP map?
Variability is under genetic control
For example, genetic variation can be reduced by artificial stabilizing selection (Rendel, 1967) and increased by artificial directional selection (Lazebnyi et al 1991). What is the influence of selection on the genotype-phenotype mapping function? How can the GP map evolve? Some outcomes and processes producing these outcomes are proposed below.
Limiting Pleiotropy
Is necessary for the accumulation of adaptations. Development for example is organized into semi-autonomous processes. Dissociability (Needham, 1933). Evolution of complex adaptations requries "a match" between the functional relationships of the phenotypic characters and their genetic representation. (Rupert Riedl, 1975, The Imitatory epigenotype).
Modularity
Has typically been considered from the evol. of evolv. perspective. Modularity can arise through either parcellation (suppression of pleiotropic effects) or integration (creation of within group pleiotropic effects). Stabilizing selection on all characters simultaneously favours suppression of all mutational effects (Wagner ), so is unlikely to lead to modularity. The combination of directional and stabilizing selection leads to the differential suppression of pleiotropic effects (Wagner, 1995b).
Selection for Breaking Developmental Constraints and Increasing the Degrees of Freedom of the Phenotype.
Has been demonstrated ( Vermeij, 1971, 1973, 1974).
Selection for adaptation rate
may be another way in which the GP map can evolve. (Rechenberg, 1973; Riedl 1975). This seems to be the explanation for SSRs. Right?
Constructional Selection
This is a model of how gene duplication affects the evolution of the genotype-phenotype map. (Altenberg, 1985, 1994, 1995b). The existing genes are a subset of all possible genes. Those preserved by selection as functioning genes are those which least perturb functions under stabilizing selection, while supplying variation under directional selection. See Altenberg for simulations of selective genome growth, (Altenberg, 1995b). Let us consider some of Altenberg's models.
Altenberg "Evolving Better Representations through Selective Gene Growth."

Altenberg has n binary valued genes that exert control over f phenotypic functions each of which contributes to fitness. A gene controls a subset of functions. The GP map can be represented by a binary matrix, M = ||mij ||, i = 1..n, j = 1..f, The columns of M are the polygeny vectors gj = ||mij||, i = 1..n; gives the genes controlling each fitness component j. The rows of M are the pleiotropy vectors, giving the fitness components controlled by each gene i. Each fitness component isa uniform pseudo-random function of the genotype (like KAuffman's NK model). Thus, any change in gi gives a new output uncorrelated with the old output. If a fitness component is affected by no genes it is zero. The total fitness is the normalized sum of the fitness components.
New genes are added with some plieotropy vector, and randomly assigned allelic value (0,1). If fitness increases, it is kept, otherwise it is removed. A 1 mutant adaptive walk is used to optimize the allelic values of the new genome, and the process repeated.
Using constructional selection allowed more rapid evolution, to higher fitness, with fewer resources, and lower plieotropy. Adding gene duplication allowed the GP map to get structured in such a way as to speed up evolution. So, even in a fixed environment modularity in the GENOTYPE-PHENOTYPE MAP can come about if one is able to evolve the genetic representation of the phenotype.
Weak Interactions
The simplest model possible should be produced to demonstrate the importance of weak interactions for evolvability. The aim of the model is to evolve a representation that has high evolvability and see how weak connectivity features in the solution.
Evolution in Changing Environments
Uri Alon's experiment with NAND gates and phenotypic modularity is a nice example here. Modularity evolves in environments with changing fitness functions. Alon does not look at MODULARITY in the genotype to phenotype map here, but rather modularity in the phenotype. Can a simpler model of a changing environment be devised that uses changing NK fitness landscapes for example?
Weak Interactions in Gene Regulatory Networks: A bioinformatics investigation of transcrption factor binding sites.
Another approach is to look at the kinds of directed variation that might result from mutations acting on the gene regulatory network. This is a heavy duty bioinformatics piece of research, but it might work. It makes several assumptions however. Is there clustering in the space of TF binding sites? How is random variation in TF binding site space likely to change the topology of the GRN? BioPerl can be used to do these investigations, alternatively many other sequence alignment packages are available. TF binding sites have been very difficult to find because they are quite short, 6-20 bp only! They evolve more slowly than the corresponding flanking inter-genic regions.
Interestingly, there is considerable position specific variation in evolutionary rates WITHIN TFBSs.One paper argues that the evolutionary rate at each position is a function of the selectivity of the factor for bases at that position. (BMC Evolutionary Biology 2003, 3 Moses et al. ). However, could it not be the case that this variability was related to evolvability, i.e. one binding site changing to accomodate another TF at short notice? The models of binding sites assume linear interaction between binding sites and the tightness of binding of the TF to the site. The energy contribution of the sequence to the TF binding is assumed ot be linear. This is strange. Isn't it likely that one sites' identity may alter the effect that another site has on the binding of the TF? A structural model may be required to make this more explicit.
We have a list of TFBSs from this file here found on regulonDB. We extract each sequence and label it with the TF name that binds to that sequence. The following software can be used to carry out an ALL-to-ALL alignment. For details of how to do this sort of bioinformatics, see my bioinformatics page.
A simulation of memeory in dynamic network evolution.
I will develop a computer model of the co-evolution of transcription factors and transcription factor binding sites. I will try to demonstrate that the system can be optimized to allow directed phenotypic variation in the face of merely random mutation, and MORE GENERALLY to structure the properties of the GP map between (TF + TFBS space) and (GRN space), in a manner analogous to the mapping from RNA sequence space to RNA secondary structure space.
Details of program development can be found here.
[6] Discuss the concept of Memory in Viral Quasispecies. They exposed the viral population to environment A, then to B, then to A agian, and found that adaptation to A was more rapid in the species exposed to A in the past.
How do TF-TFBS pairs evolve in real organisms?
[1] Gives examples of evolution by TF protein alteration + TF binding site alteration. TFBS evolution may allow less pleiotropy than TF evolution. Duplication of TFs is another mechanism apart from mutation at TFBSs. [Andolfatto, 2005 [3]] show that untranslated regions of Drosophila genome may contribute to adaptations just as much as codon regions. They worked this out on the basis of 'patterns of nucleotide variation' [How did they do this?]
There really are some amazing comparative genomic techniques that I would like to learn about. They can be very helpful in indicating which parts of the genome have been subjected to various kinds of selection. It is worth us spending some time to understand Andolfatto's methods so we can use similar methods in examining E.Coli. How do cis-regulatory regions come out on measures of polymorphism and divergence. Need to read more about molecular evolution, neutral theory of evolution, comparative genomics, etc....
The validity of the above techniques may be reduced because we know that even functionally conserved enhancers can undergo rapid sequence divergence [4]. If this is so, it is difficult to identify functional sites on the basis of sequence.
[5] Wagner et al 2007 JTB argue that TFBSs are poorly conserved within and between species. Rather they argue that binding site number is selectively maintained, by a stochastic birth death process in the cis-regulatory region. They argue that over evolutionary time, novel TFBSs come into existance, are bound to, and others die. They make a stochastic model and observe the number of binding sites that come and go, and calculate the time dynamics of the mean and variance in binding site number produced in this process. Note, they don't look at binding site position. Their model would predict a random distirbution of binding site positions as well I expect. Is this observed in the data? Also, their fitness function gives equal fixation rates for any mutation that increases or does not reduce the number of binding sites. There is no more complex fitness function re: ability to alter binding sites rapidly, e.g. from one binding site type to another, i.e. they do not explore the demands created by variable selection conditions. It is likely that a species over 20My would experience some differences in fitness function. These results appear to contradict Andolfatto's results above which suggest that UTRs in Drosophila DO show unusual amounts of conservation. Does the assumption of random mutation with selection for no decrease in binding site number predict a distribution of Hamming distances to actual TFBSs? In the extreme case the sequence may be 1 Hamming distance away from the TFBS number, so that mutations in just one nucleotide can easily alter the number of binding sites required. An interesting fitness function would be to select for variable numbers of TFBSs over time, and to examine the sequences resulting. The simplest experiment is as follows.
Experiment 1: In arm 1 select for variable binding site number = {20, 10, 20, 10, 20, 10}. In arm 2 select for fixed binding site number = 15, with periods ranging from 1-N generations, for a given mutation rate. How do the sequences differ in both cases? Does weak connectivity arise as a side-effect of variable selection?
I think the above experiment may be similar Kauffman's NK Landscapes, depending on the scoring function one uses to classify closeness to the optimal sequence motif. WRITE A C++ OR PERL SCRIPT TO EVOLVE STRINGS TO DO THE ABOVE TASKS.
Our investigation in which we compare alignments amoungt all known TFBSs in the RegulonDB database does not assume that binding site positions are conserved. However, we assume that there may be more complex fitness functions acting on binding sites such as the requirement that some TFBSs turn into other TFBSs quickly, in variable environments.
Weak Interactions can Decrease Variance in Signaling
References
[1] H. Hoekstra and J.A. Coyne . The locus of evolution: Evo-devo and the genetics of adaptation. Evolution May 2007. (61): 995-1016.
[2] G.A. Wray. The evolutionary signiicance of cis-regulatory mutations. Nature Rev, Gen. 8:206-216. [Review of some examples of cis-regulatory mutations in animals. They happen to decrease pleiotropy, because they are co-dominant because each allele is regulated by its own associated cis-element, and so heterozygotes can experience selection pressure, whereas protein mutations that were recessive would experience less selection. Also, timing, dynamics and fine continuous control can be achieved with cis-regulation compared to mutations in protein coding regions.]
[3] Andolfatto, P. Adaptive evolution in non-coding DNA in Drosophila. Nature 437:1149-1152.
[4]. A stochastic model for the evolution of transcription factor binding site abundance. JTB 2007. Wagner, Otto, Lynch, Stadler. and Dermitzakis, E.T., Clark, A.G., 2002. Evolutionof transcription factor binding sites in mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 19, 1114–1121.
[5] A stochastic model for the evolution of transcription factor binding site abundance. Wagner, Otto, Lynch, Stadler. 2007 JTB.
[6] Memory in Viral Quasispecies.
Weak interactions in Signaling References
H. Buc, W.R. McClure. "Kinetics of open complex formation between Escherichia coli RNA polymerase and the lac UV5 promoter. Evidence for a sequential mechanism involving three steps."
Biochemistry. 1985 24(11): 2712–2723.
W.R. McClure. "Mechanism and control of transcription initiation in prokaryotes." Annu Rev Biochem. 1985;54:171–204.
S. Roy, S. Garges, S. Adhya. "Activation and repression of transcription by differential contact: two sides of a coin."
J Biol Chem. 1998, 273(23): 14059-62.Click here to read
Sneppen K, Dodd IB, Shearwin KE, Palmer AC, Schubert RA, Callen BP, Egan JB. "A mathematical model for transcription interference"
J. Mol. Biol. 346 399 (2005).
I. Golding, J. Paulsson, S.M.Zawilski and E.C. Cox, "Real time kinetics of gene activity in individual bacteria"
Cell 123, 1025 (2005)
G. Bar Nuham and E. Nudler. "Isolation and characterization of sigma(70)-retaining transcription elongation complexes from E.coli."
Cell 106, 443-451 (2001)
G. Dieci and A. sentenac. "Detours and shortcuts to transcription reinitiation".
trends Biochem. Sci. 28, 202-209 (2003).