Help Pages

Introduction to YeTFaSCo

YeTFaSCo stands for Yeast Transcription Factor Specificity Compendium. It is a collection of all available TF specificities for the yeast Saccharomyces cerevisiae in Position Frequency Matrix (PFM) or Position Weight Matrix (PWM) formats. With it, you can scan sequences with the motifs to find where potential binding sites lie, inspect precomputed genome-wide binding sites, find which TFs have a motif similar to one you have found, and download the collection of motifs.

Cart

To scan sequences with motifs of your choice, use the cart to keep track of your TFs of interest. For instance, if you wish to scan a sequence with only the PBM-derived motifs for RAP1 and FHL1, add these motifs to your cart from the Motifs page and then go to the scan page to scan your sequences with them. The columns in the cart are the same as those in the Motifs table.

Premade TF Sets

These are sets of specificities from the database.

All Motifs

This set includes all motifs in the database.

Expert Curated

This set includes the single "best" motif for each TF as determined by expert curation.

Expert Curated - no dubious

This set includes the single "best" motif for each TF as determined by expert curation, with all "dubious" TFs left out.


What's in the Database?

The database is split up into six main sections, described here.

The tables in the database share common functionality. Clicking a column header link changes the sort order. The second row of each table contains fields for filtering the results: enter your filter criteria in the appropriate fields and click "Filter!".

Genes

This is a table of all genes in Saccharomyces cerevisiae. The columns include:
Systematic Name: The systematic name. Links to SGD.
Gene Name: The standard gene name, if there is one.
Is a Complex?: Whether or not this entry represents a protein complex, and not a single gene. 1=yes, 0=no.
DBD: Lists the DNA-binding domains if present. Only DBDs thought to be sequence-specific are included. Links to DBD description.
Dubious?: Whether this gene is dubious as a transcription factor, by expert curation. Links to the expert curation page for this gene.
Shown DNA-binding?: Whether or not there is an experiment that shows DNA-binding activity in the collection. 1=yes, 0=no, blank=no motifs in collection
#Motif in Collection: How many motifs are included in the database for this gene. Links to the motif view, showing the particular gene.

Motifs

This table gives the basic details about the motifs in the database. Also included at the top is a button to change to the detailed motif view. The columns include:
Gene Name: The standard gene name, if there is one.
Systematic Name: The systematic name. Links to gene entry.
Motif ID: The ID of the motif in the database.
Sub Motif: The sub motif ID. This only applies to variable-length motifs, for example CGGN{3,6}GCC. Each possible length gets its own submotif under the same Motif ID.
Logo: Displays specificity logo for each motif. Links to the long motif description.
Total Score: The total score for the particular motif.
DBDs: Lists the DNA-binding domains of the TF, if present. Links to DBD description.
Expert Confidence: The confidence value of this motif, by expert curation. For each TF, only the best motif is given a confidence value.
Method: A brief description of what method was used to derive the motif.
Reference: The PMID of the paper which derived this motif. Links to references page.
Add to cart: Adds this protein to your cart.
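
The relationship between a variable-length motif and its submotifs can be illustrated with a short sketch. This is only an illustration of the notation described under Sub Motif, not the code the database uses; the function name expand_variable_motif is hypothetical.

```python
import re

def expand_variable_motif(pattern):
    """Expand a motif such as 'CGGN{3,6}GCC' into one fixed-length string per spacer length."""
    match = re.search(r"([A-Za-z])\{(\d+),(\d+)\}", pattern)
    if match is None:
        return [pattern]  # fixed-length motif: a single submotif
    base = match.group(1)
    low, high = int(match.group(2)), int(match.group(3))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for n in range(low, high + 1):
        # Recurse on the suffix in case it holds further variable regions
        for tail in expand_variable_motif(suffix):
            expanded.append(prefix + base * n + tail)
    return expanded

submotifs = expand_variable_motif("CGGN{3,6}GCC")
for index, submotif in enumerate(submotifs):
    print(index, submotif)
```

For CGGN{3,6}GCC this yields four fixed-length strings, one for each spacer length from 3 to 6.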

Motifs - Detailed View

This table gives more details about the scoring metrics used for each motif. Also included at the top is a button to change to the simple motif view. The columns include:
Gene Name: The standard gene name, if there is one.
Systematic Name: The systematic name. Links to gene entry.
Motif ID: The ID of the motif in the database.
Sub Motif: The sub motif ID. This only applies to variable-length motifs, for example CGGN{3,6}GCC. Each possible length gets its own submotif under the same Motif ID.
Logo: Displays specificity logo for each motif. Links to a breakdown of the motif, including a PFM.
Total Score: The total score for the particular motif.
DBDs: Lists the DNA-binding domains of the TF, if present. Links to DBD description.
Expert Confidence: The confidence value of this motif, by expert curation. For each TF, only the best motif is given a confidence value. Links to the expert curation page for this motif.
Method: A brief description of what method was used to derive the motif.
Reference: The PMID of the paper which derived this motif. Links to references page.
Dubious?: Whether this gene is dubious as a transcription factor, by expert curation. Links to the expert curation page for this gene.
Ranksum GO P: The log(P value) for the GO term enrichment, calculated with the ranksum test.
Ranksum GO AUROC: The Area Under the ROC curve for the GO term enrichment. Values greater than 0.5 indicate the motif is enriched in promoters of genes annotated with this process, while less than 0.5 indicates depletion.
Ranksum GO Term: The GO Term which is enriched/depleted for the GO term enrichment criterion.
Mean ChIP chip Enrichment: The mean ChIP-chip enrichment which links to the detailed breakdown of the ChIP-chip score for the given motif.
Mean Expression Score: The mean expression enrichment score which links to the detailed breakdown of the expression enrichment for the given motif.
Concurrence: The score for deriving the same motif through independent means.
Add to cart: Adds this protein to your cart.

Breakdown of Expression/ChIP-chip Enrichment Data

This table includes a breakdown of all the correlation measures between ChIP-chip and gene expression experiments and motif occurrence. The columns include:
Gene Name: The standard gene name, if there is one.
Systematic Name: The systematic name. Links to gene entry.
Motif ID: The ID of the motif in the database.
Sub Motif: The sub motif ID. This only applies to variable-length motifs, for example CGGN{3,6}GCC. Each possible length gets its own submotif under the same Motif ID.
Source Study: The original study in which the expression/ChIP-chip data was derived. Links to pubmed.
Mutant/ Condition: The conditions of the ChIP experiment, or the conditions and genotype of the mutant for expression data.
ChIP-chip or Expression?: Whether the entry represents ChIP-chip or Expression data.
-log(P-value): The negative log10 P value of the Pearson correlation between motif occurrences and expression changes or ChIP signal.
R: The R value of the Pearson correlation between motif occurrences and expression changes or ChIP signal.

Expert Curation

This table includes the expert curation data for the database. For each TF, a single motif is selected and given a confidence value, or is marked dubious. In rare cases, multiple motifs are chosen when a TF can have multiple binding modes. The columns include:
Systematic Name: The systematic name. Links to gene entry.
Gene Name: The standard gene name, if there is one.
Motif ID: The ID of the motif in the database.
Expert Confidence: The confidence value of this motif, by expert curation. For each TF, only the best motif is given a confidence value.
Dubious?: Whether this gene is dubious as a transcription factor, by expert curation.
Notes: Notes on expert curation.

References

This table includes all the original publications which derived the motifs in the database. The columns include:
Reference: The reference.
Year: The year of publication.
Description: A short description of the methods used for motif derivation.
PMID: The PMID for the current reference. Links to PubMed.
#Motifs in DB: The number of motifs in the database from this reference. Links to the Motif View, displaying only motifs from that paper.

Notes on Methods

Here follows a brief description of some of the more common motif derivation methods listed in the database:
EMSA: Electrophoretic mobility shift assay. Tests for DNA-binding by running the protein on a gel with radio-labelled DNA of a specific sequence, and unlabelled competitor DNA.
ChIP: Chromatin ImmunoPrecipitation (Including ChIP-chip and ChIP-seq). In vivo experiment that uses an antibody to isolate DNA associated with a protein of interest.
PBM: Protein Binding Microarrays. In vitro experiment where DNA-binding protein is labelled, and bound to the DNA on a microarray slide. The protein is then detected using an antibody connected to a fluorophore.
In vitro selection: Scheme where DNA is bound to the DNA-binding protein, washed, and bound DNA is eluted and amplified. This process is repeated to enrich for high-affinity binders.
Few-site analysis: An analysis of a few binding sites in genes with something in common.
Expression enrichment (Microarray enrichment): Finding motifs common to genes which are co-regulated. For Microarray enrichment, expression is determined by microarray.
DNA protection: Footprinting of DNA-binding protein on DNA with one of several methods including methyl-protection, and DNase protection.
DIP-chip: In vitro experiment where protein is incubated with naked genomic DNA, immunoprecipitated on the DNA, and the bound DNA is quantified to see binding sites.
One-hybrid: DBD of transcription factor is bound to a trans-activating domain in front of a gene to be selected for, and by using a library of potential binding sites, binding sites bound by the DBD are selected for.
MITOMI: In vitro binding of DNA-binding proteins on a microfluidics chip.
Site-turns-on-gene: Simple assay of detecting a positive(or negative)-acting cis-regulatory element.
Mutational analysis: Mutation of a cis-element, combined with monitoring for expression changes to show which bases affect gene regulation.
Conservation: Detecting conserved regions between related yeasts, which are often indicative of cis-regulatory elements.


Motif Evaluation Criteria

Each evaluation criterion yields a score (-log(P value)) for a certain test. At the end, these scores are summed to get the total score.
For the first three criteria, the probability of each TF binding is calculated as the probability that at least one binding site is bound, assuming independence between sites.
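
The "at least one site bound" calculation can be sketched as follows. This is a minimal illustration under the stated independence assumption; how the per-site binding probabilities are obtained from a motif is not described here, so they are simply taken as input, and the function name is hypothetical.

```python
import math

def prob_any_site_bound(site_probs):
    """P(at least one site bound) = 1 - prod(1 - p_i), assuming independent sites."""
    if any(p >= 1.0 for p in site_probs):
        return 1.0  # a certainly-bound site makes the whole region certainly bound
    # Sum log(1 - p) rather than multiplying, which is more stable for many weak sites
    log_all_unbound = sum(math.log1p(-p) for p in site_probs)
    return -math.expm1(log_all_unbound)

# Two sites, each bound with probability 0.5
promoter_prob = prob_any_site_bound([0.5, 0.5])
print(promoter_prob)
```

With two independent sites that are each bound half the time, the region is unbound only a quarter of the time, so the result is 0.75.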

Correlation to ChIP-chip data

This evaluation criterion tests the correlation between the motif occurrences for a TF and the binding sites observed in vivo by ChIP-chip. Many of the motifs were derived from this same ChIP-chip data, so this criterion is knowingly biased in favour of those motifs. The bias is smaller than one might expect, however, because motifs not derived from ChIP-chip frequently correlate better with the ChIP-chip data than the ChIP-chip-derived motifs do. The score for this measure is the mean of the -log(P values) of the Spearman correlations between probe intensity and the probability of the TF binding that probe, given the motif.

Correlation to Expression Changes in TF Mutants

This evaluation measure relies on the fact that when you perturb a TF, the genes that are regulated by that TF are likely to have their expression levels altered. Thus, when TF perturbation data is available, the expression changes are compared to the probability of the perturbed TF binding at each gene's promoters, with the given motif. The score represents the mean of the -log(P values) of Spearman correlations for all available TF perturbation experiments.

GO Term Enrichment

This metric is based on the tendency of TFs to regulate genes involved in a particular process. For each GO-slim term, a ranksum test compares the probability of the TF binding the promoters of genes annotated with that term against the promoters of genes without it. Three values are given: the AUROC score for how well the probability of binding predicts the GO term, the -log(P value) for the ranksum test, and the GO term with the most significant enrichment.

Concurrence

This criterion gives a positive score for how similar the motif is to the other motifs derived by other studies for the same TF. There are instances where different publications find motifs in different ways but using the same raw data and these cannot get a positive score here by matching each other. Here, the score represents the -log(P value) of finding a motif this similar for the same TF in the collection of motifs.


What tools are available?

Scan Sequences with Motifs

Using this feature, you can scan a given sequence for hits of a particular motif. You can scan using: a motif set, all motifs for a given transcription factor, or the contents of your cart.

Two additional parameters can be set to alter the motif search. To limit how many results are shown, you can set a threshold for the minimum allowable percentage of the maximum possible score. Here, 0% shows all potential binding sites that are more like the motif than the background distribution, whereas 100% shows only optimal binding sites. There is also an option to change the background base frequency. Only the %A/T can be altered because it is assumed that %A=%T and %G=%C=50%-%A.
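
The two parameters can be illustrated with a small scanner. This is a sketch under several assumptions that are not stated on this page: scores are log-odds of motif versus background, a small pseudocount is added to the PFM, and only the forward strand is scanned. The function names, the pseudocount, and the example %A value of 31 are all hypothetical.

```python
import math

def background(pct_a):
    """Background frequencies from %A alone: A = T, and G = C = 50% - %A."""
    if not 0.0 < pct_a < 50.0:
        raise ValueError("pct_a must lie strictly between 0 and 50")
    a = pct_a / 100.0
    return {"A": a, "T": a, "C": 0.5 - a, "G": 0.5 - a}

def scan(sequence, pfm, pct_a=31.0, min_pct_of_max=50.0):
    """Return (position, site, score) for windows scoring at or above the threshold.

    pfm is a list of dicts, one per motif position, mapping base -> frequency.
    """
    bg = background(pct_a)
    pseudo = 0.01  # hypothetical pseudocount so that zero frequencies stay finite
    log_odds = [
        {b: math.log((col.get(b, 0.0) + pseudo) / (1 + 4 * pseudo) / bg[b]) for b in "ACGT"}
        for col in pfm
    ]
    max_score = sum(max(col.values()) for col in log_odds)
    threshold = max_score * min_pct_of_max / 100.0
    width = len(pfm)
    hits = []
    for i in range(len(sequence) - width + 1):
        site = sequence[i:i + width].upper()
        if any(b not in "ACGT" for b in site):
            continue  # skip windows containing ambiguous bases
        score = sum(col[b] for col, b in zip(log_odds, site))
        # score > 0 means the site looks more like the motif than like background
        if score > 0 and score >= threshold:
            hits.append((i, site, score))
    return hits

example_pfm = [{"A": 1.0}, {"C": 1.0}, {"G": 1.0}]
hits = scan("TTACGTT", example_pfm, min_pct_of_max=100.0)
print(hits)
```

At a threshold of 0% the only remaining filter is the positive log-odds requirement, and at 100% only windows matching the best-scoring base at every position are kept.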

The results include a summary of what sequence was submitted (length and start and end bases), the base content being used, a graphical representation of the motif occurrences, a table of all the binding sites found, and a table of all the motifs that were searched for, but were not found.

Precomputed Genome-Wide Binding Sites - Genome Browser

This feature includes genome-wide TF binding site predictions for the "Expert Curated - No Dubious" set. This way, you can browse to your favourite gene and see what predicted binding sites lie in its promoter. The intensity of the colour is proportional to the binding score, with the darker sites being stronger than the lighter sites.

In addition, there are tracks for predicted/in vivo/in vitro nucleosome occupancy from Lee et al. 2007, Kaplan et al. 2009, and Tillo and Hughes 2009, as well as a conservation track representing the conservation between 7 related yeast species (from UCSC).

Since different TFs have different numbers of molecules in the nucleus (under different conditions), it is difficult to define a threshold where you can declare that a binding site will be bound or unbound. For this reason, we intentionally overestimated the numbers of binding sites in the genome in order to minimize the number of missed binding sites. For the current version, each TF has at least 2000 binding sites, and possibly many more if there are ties. For instance, if a motif has 70000 perfect motif instances genome-wide, all 70000 will be included.
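
The "at least 2000 sites, plus ties" rule can be sketched as below. This is an illustration of the cutoff logic only; the input format (a list of (location, score) pairs) and the function name are assumptions.

```python
def top_sites_with_ties(scored_sites, minimum=2000):
    """Keep at least `minimum` best-scoring sites, plus any sites tied with the last one kept.

    scored_sites is a list of (location, score) pairs; higher scores are better.
    """
    ranked = sorted(scored_sites, key=lambda site: site[1], reverse=True)
    if len(ranked) <= minimum:
        return ranked
    cutoff = ranked[minimum - 1][1]  # score of the last site that must be kept
    return [site for site in ranked if site[1] >= cutoff]

sites = [("chrI:100", 0.0), ("chrI:250", -1.0), ("chrII:40", -1.0), ("chrII:90", -2.0)]
kept = top_sites_with_ties(sites, minimum=2)
print(kept)
```

In this toy case a minimum of 2 returns 3 sites, because the second and third sites share the same score.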

To get more information about an individual binding site, click on the site in the genome browser. It will show the exact location and sequence of that binding site, its log probability, and its rank relative to the other binding sites across the genome. Log probabilities are scaled by the maximum possible score such that 0 represents a perfect binding site (ie. log(Prob) = log(P(BS=bs)) - log(max(P(BS)))). Thus, the closer to 0, the better the binding site, but the lower bound depends on the PFM and can be quite negative. For this reason, a rank is also given, where a rank of "15 tied with 5" means that it is the 15th best binding site in the genome, but has the same score as 5 other sites. The next best-scoring binding site would have a rank of 21. Together, the score and the rank should give you a good idea of how likely the binding site is to be bound in vivo.
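
The scaled log probability and the tie-aware rank can be sketched as follows. The sketch assumes the PFM contains no zero frequencies (for example because pseudocounts were already applied) and that P(BS=bs) is the product of the per-position frequencies; the function names are hypothetical.

```python
import math

def scaled_log_prob(site, pfm):
    """log P(site) - log P(best possible site); 0 means a perfect match."""
    log_p = sum(math.log(col[base]) for col, base in zip(pfm, site))
    log_max = sum(math.log(max(col.values())) for col in pfm)
    return log_p - log_max

def rank_with_ties(scores):
    """Map each distinct score to (rank, number of other sites sharing that score)."""
    counts = {}
    for score in scores:
        counts[score] = counts.get(score, 0) + 1
    ranks = {}
    next_rank = 1
    for score in sorted(counts, reverse=True):
        ranks[score] = (next_rank, counts[score] - 1)
        next_rank += counts[score]  # tied sites all consume a rank position
    return ranks

column = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
print(scaled_log_prob("AA", [column, column]))  # perfect site
print(scaled_log_prob("AC", [column, column]))  # one mismatch, negative score

# 14 better sites, then 6 sites sharing one score, then 1 worse site
ranks = rank_with_ties([0.0] * 14 + [-1.0] * 6 + [-2.0])
print(ranks[-1.0], ranks[-2.0])
```

The last example reproduces the case described above: the six tied sites are each reported as rank 15 tied with 5, and the next site has rank 21.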

Find Similar Motifs

This feature can be used to find the TF that corresponds to a given motif. For example, if you find a certain motif enriched upstream of certain genes and want to know if a TF is known to bind it, you can use your motif as a query against the database.

You can submit the motif in either IUPAC form, PFM form, or you can provide an alignment of sequences. The IUPAC frequency table is shown below.

               Frequency
Letter Code    A      C      G      T
A              1      0      0      0
C              0      1      0      0
G              0      0      1      0
T              0      0      0      1
M              1/2    1/2    0      0
R              1/2    0      1/2    0
W              1/2    0      0      1/2
S              0      1/2    1/2    0
Y              0      1/2    0      1/2
K              0      0      1/2    1/2
V              1/3    1/3    1/3    0
H              1/3    1/3    0      1/3
D              1/3    0      1/3    1/3
B              0      1/3    1/3    1/3
X              1/4    1/4    1/4    1/4
N              1/4    1/4    1/4    1/4
a              1/2    1/6    1/6    1/6
c              1/6    1/2    1/6    1/6
g              1/6    1/6    1/2    1/6
t              1/6    1/6    1/6    1/2
x              1/4    1/4    1/4    1/4
n              1/4    1/4    1/4    1/4
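
The frequency table above can be expressed as a small lookup that converts an IUPAC string into a PFM. This is a sketch of the table only, not the site's own parser; the names IUPAC_SETS, iupac_column, and iupac_to_pfm are hypothetical. Columns are keyed by base, so the result does not depend on column order.

```python
from fractions import Fraction

BASES = "ACGT"

# Bases allowed by each upper-case code in the table
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

def iupac_column(code):
    """Return the base frequencies for one letter code from the table."""
    if code in ("a", "c", "g", "t"):
        # Lower-case bases: the named base gets 1/2, the other three 1/6 each
        favoured = code.upper()
        return {b: Fraction(1, 2) if b == favoured else Fraction(1, 6) for b in BASES}
    if code in ("x", "n"):
        code = code.upper()
    allowed = IUPAC_SETS[code]  # raises KeyError for codes not in the table
    return {b: Fraction(1, len(allowed)) if b in allowed else Fraction(0) for b in BASES}

def iupac_to_pfm(motif):
    """Convert an IUPAC string into a list of per-position frequency dicts."""
    return [iupac_column(code) for code in motif]

pfm = iupac_to_pfm("CGGNNNgCC")
for position in pfm:
    print({base: str(freq) for base, freq in position.items()})
```

Exact fractions are used so that every position sums to exactly 1, which makes the table easy to check.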

Alternatively, you can submit your motif as an alignment of sequences of the form:

ATGCGCTATGC
TCAGTCATGAC
ACTGTAGTGAA
TATGTCGTGAA

where each line represents one of the aligned sequences. Any lines starting with a ">" will be ignored and "-" indicates a gap in the alignment. From this, the position frequency matrix will be calculated and queried against the database.
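
The conversion from an alignment to a position frequency matrix can be sketched as follows. The treatment of gaps is an assumption: here a "-" simply does not contribute to the counts for its column. The function name is hypothetical.

```python
def alignment_to_pfm(text):
    """Build a PFM (list of base -> frequency dicts) from aligned sequences, one per line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # blank lines and ">" header lines are ignored
        rows.append(line.upper())
    if not rows:
        return []
    width = max(len(row) for row in rows)
    pfm = []
    for i in range(width):
        counts = {base: 0 for base in "ACGT"}
        for row in rows:
            if i < len(row) and row[i] in counts:
                counts[row[i]] += 1  # gaps ("-") and other characters are skipped
        total = sum(counts.values())
        # An all-gap column carries no information, so fall back to uniform (assumption)
        pfm.append({b: (counts[b] / total if total else 0.25) for b in "ACGT"})
    return pfm

alignment = """>example
ATGCGCTATGC
TCAGTCATGAC
ACTGTAGTGAA
TATGTCGTGAA"""
pfm = alignment_to_pfm(alignment)
print(pfm[0])
```

For the four example sequences, the first column contains two A and two T, so its frequencies are 0.5 for A and 0.5 for T.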

Entering motifs as a PFM is as simple as entering the frequencies for each base in each position. Blanks are treated as 0, and the whole thing is normalized such that the sum of frequencies at every position will be 1. Alternatively, you can paste in an entire PFM in whitespace delimited format (as shown in the example provided in the search form).
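
The blank handling and per-position normalization can be sketched like this. The input layout (one list of form-field strings per base) and the function name are assumptions made for the example.

```python
def normalize_pfm(fields):
    """Normalize raw PFM entries so that each position sums to 1.

    fields maps each base to a list of strings, one per position; blanks count as 0.
    """
    width = max(len(values) for values in fields.values())
    pfm = []
    for i in range(width):
        raw = {}
        for base in "ACGT":
            values = fields.get(base, [])
            text = values[i].strip() if i < len(values) else ""
            raw[base] = float(text) if text else 0.0
        total = sum(raw.values())
        if total == 0:
            raise ValueError("position %d has no non-zero entries" % (i + 1))
        pfm.append({base: raw[base] / total for base in "ACGT"})
    return pfm

entered = {"A": ["8", ""], "C": ["2", "1"], "G": ["", "1"], "T": ["", "2"]}
normalized = normalize_pfm(entered)
print(normalized)
```

Because each position is divided by its own total, counts and frequencies can be mixed freely between positions.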

The results are output as a table of all the significant (P<0.05) hits.

Find Regulators of Your Favourite Genes

This tool allows you to discover potential regulators of your favourite genes. This works by using data in the database to calculate how well each transcription factor correlates with your data. Your data can take the form of either a gene list (eg. genes in a pathway), or a dataset of gene-value pairs (eg. expression data).

Your data are compared against one of the following: gene expression data for TF mutants, ChIP-chip data of TFs in promoters, or predicted binding sites of TFs in promoters. The most direct of these is the expression data, since this represents transcriptional changes upon perturbation of a transcription factor. Next is ChIP-chip data since this represents actual binding of TFs in promoter regions. Finally, by comparing your data to the occurrences of motifs in the promoters of genes, motifs which are over/under represented in your data can be identified.

Three types of tests can be performed: the rank sum (Mann-Whitney-Wilcoxon) test, and Pearson and Spearman correlation.

Rank Sum

For the rank sum test, you give it your "query" list of genes, and possibly a list of "background" genes. If you do not provide a "background" set, all genes for which there is data are used. For the ChIP data, this can include tRNAs and anything for which there was an upstream probe, so using your own background set is recommended. Any systematic names in both the query and background sets are removed from the background.

As an example, say we are interested in phosphate metabolism genes, so we take all genes annotated with the appropriate GO terms and use these as our query set. This tool can then identify which TFs are likely to regulate these genes by finding which TFs tend to be bound in the promoters of these genes (ChIP), which TFs' motifs tend to occur more often in the promoters of these genes, and which TFs, when perturbed, result in expression changes in these genes.

The rank sum test checks whether the data for your genes is significantly different than the data for the rest of the genes. It yields two values for each dataset: the P-value and the area under the ROC curve. The P-value represents how significant the association is and the ROC value represents whether your gene list had on average higher (ROC>0.5) or lower (ROC<0.5) values in the data.
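
The two values can be sketched for a single dataset as follows. This is an illustration rather than the tool's implementation: the AUROC is computed as U / (n_query * n_background) with ties counted as one half, and the P-value uses a two-sided normal approximation without tie correction, which is an assumption. The function name and the gene names in the example are hypothetical.

```python
import math

def rank_sum_test(values, query, background=None):
    """Return (AUROC, approximate two-sided P-value) for query versus background genes.

    values maps systematic gene names to the data being compared.
    """
    query = {gene for gene in query if gene in values}
    if background is None:
        background = set(values)  # default: every gene for which there is data
    # Genes present in both sets are removed from the background
    background = {gene for gene in background if gene in values} - query
    n_q, n_b = len(query), len(background)
    if n_q == 0 or n_b == 0:
        raise ValueError("query and background must each contain genes with data")
    u = 0.0
    for q in query:
        for b in background:
            if values[q] > values[b]:
                u += 1.0
            elif values[q] == values[b]:
                u += 0.5
    auroc = u / (n_q * n_b)
    # Normal approximation to the distribution of U (no tie correction)
    sd = math.sqrt(n_q * n_b * (n_q + n_b + 1) / 12.0)
    z = (u - n_q * n_b / 2.0) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return auroc, p_value

data = {"YFG1": 3.0, "YFG2": 2.0, "YFG3": 1.0, "YFG4": 0.5}
auroc, p_value = rank_sum_test(data, query=["YFG1", "YFG2"])
print(auroc, p_value)
```

The pairwise loop is quadratic and is kept only for clarity; a rank-based computation gives the same U for larger gene sets.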

The data being compared depends on what kind is selected (eg. Motif occurrence, ChIP-chip, TF mutant expression). For motifs, the data compared is the probability of the TF binding corresponding promoters, given the motif, where larger values indicate a higher probability of binding. For ChIP data, the data compared is the microarray signal (generally log ChIP-signal) in the promoter regions of corresponding genes. For expression data of TF mutants, the data compared is the expression data of genes in the microarray conditions (generally mutant/WT expression).

The interpretation of the ROC depends both on the kind of data being compared and on the nature of the experiment (if there was one). For motif occurrences, an ROC>0.5 means that your genes are enriched for binding sites, while an ROC<0.5 means they are depleted in binding sites. For ChIP data, an ROC>0.5 for a standard ChIP experiment indicates an enrichment of binding events in the promoters of your genes, relative to background. This is summarized in the following table.

                 Your genes are...
Data type        ROC>0.5                                                   ROC<0.5
ChIP-chip*       enriched in genes whose promoters are bound by this TF    depleted in genes whose promoters are bound by this TF
Binding sites    enriched in genes whose promoters contain this motif      depleted in genes whose promoters contain this motif

For expression data, the ROC interpretation depends on the nature of the mutant. Generally, an ROC>0.5 means that your genes are upregulated in the condition, but this corresponds to different things depending on whether the mutant is activating (eg. OE or constitutive) or inactivating (eg. deletion, downregulation, most other mutations) and whether the TF is an activator or a repressor. Briefly, if the TF is an activator, the mutant activating, and ROC>0.5, then this TF activates your genes. If the TF is a repressor, the mutant activating, and ROC<0.5, then this TF represses your genes. If the TF is an activator, the mutant inactivating, and ROC<0.5, then this TF activates your genes. If the TF is a repressor, the mutant inactivating, and ROC>0.5, then this TF represses your genes. This is summarized in the following table.

Expression data (TF Mutants)                        Your genes are...
Activator/Repressor?    TF Mutant Type              ROC>0.5                                    ROC<0.5
Activator               Activating (eg. OE)         enriched for genes activated by this TF    depleted in genes activated by this TF
Activator               Inactivating (eg. del)      depleted in genes activated by this TF     enriched in genes activated by this TF
Repressor               Activating (eg. OE)         depleted in genes repressed by this TF     enriched in genes repressed by this TF
Repressor               Inactivating (eg. del)      enriched in genes repressed by this TF     depleted in genes repressed by this TF

Pearson and Spearman Correlation

If you have quantitative data in gene-value form, you can test association with potential regulators using either Pearson or Spearman correlation. This test yields two values: the P-value and the R. The P-value represents the significance of the association (though only with normally distributed data*), and the R value represents the degree and direction of the association, where R>0 represents a positive association, and R<0 represents a negative association.

*The motif occurrences and some of the ChIP data deviate significantly from a normal distribution, so Spearman correlation would be more appropriate for these if P-values are important.
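
The difference between the two correlation measures can be sketched in plain Python. Only the R values are computed here; P-values are omitted. Spearman correlation is taken as the Pearson correlation of the ranks, with tied values given their average rank. The function names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# A monotonic but non-linear relationship
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson_r(xs, ys), spearman_r(xs, ys))
```

Because Spearman correlation depends only on rank order, it gives the maximum R for any strictly increasing relationship, whereas Pearson correlation is reduced by the curvature.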

As an example, say we looked at gene expression in phosphate-limited conditions vs. standard conditions. We get gene-value pairs for each gene, where the values represent the fold expression change in the condition vs. normal. We can then use this tool to identify which TFs might be responsible for the expression changes. To do this, we can use the entire dataset and compare using Pearson or Spearman correlation. If we are only interested in the most strongly upregulated genes, however, the rank sum test is more appropriate, comparing the upregulated genes to the rest of the assayed genes.

In general, TF mutant expression data are expressed as mutant/wild type ratios, so if the expression of your genes correlates positively with a TF-mutant expression dataset for an inactivating mutant (eg. a deletion), it generally means that the TF is less active in your condition. ChIP-chip data is expressed as one of: log(ChIP-signal/background), log(ChIP-signal), or -log(P-value). So for ChIP data, a positive association means that genes which are upregulated in your condition tend to have that TF bound in the promoter region. Finally, binding sites are expressed as a log(probability of binding), so larger values mean more likely binding. Thus, here too, a positive association means that genes which are upregulated in your condition tend to have more binding sites in their promoter regions. This is summarized in the table below, which assumes your data is condition/standard-style expression data. The interpretation will change with your data type.

                                             Correlation direction
Data type                                    Positive                                            Negative
Expression data (TF less active, eg. del)    TF less active in condition                         TF more active in condition
Expression data (TF more active, eg. OE)     TF more active in condition                         TF less active in condition
ChIP-chip                                    TF binds promoters of upregulated genes             TF binds promoters of down-regulated genes
Binding sites                                Binding sites more common in up-regulated genes     Binding sites more common in down-regulated genes

One final note: It is possible that some of the ChIP-chip and TF-mutant datasets have the signals mixed up since this is difficult to determine (so that it is WT/mut or background/signal), so interpret with care.


References

  1. Chen, X., Hughes, T.R. and Morris, Q. (2007) RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics, 23, i72-79.
  2. Gene Ontology Consortium
  3. Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C. (2007). A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet. 2007 Oct;39(10):1235-44. Epub 2007 Sep 16.
  4. Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, LeProust EM, Hughes TR, Lieb JD, Widom J, Segal E. (2009). The DNA-encoded nucleosome organization of a eukaryotic genome. Nature. 2009 Mar 19;458(7236):362-6. Epub 2008 Dec 17.
  5. Tillo, D., Hughes, T.R. (2009). G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics. 2009 Dec 22;10:442.
  6. Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S., Goldman, M., Barber, G.P., Clawson, H., Coelho, A. et al. (2011) The UCSC Genome Browser database: update 2011. Nucleic Acids Res, 39, D876-882.

Citation

If you use YeTFaSCo, please cite the following:

de Boer, C.G., Hughes, T.R. (2011) YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. NAR, 2011 Nov. 18 (DB Issue).