What is ProDom |
||||||||||||||||||||||||||||||||||||||||||
|
ProDom is a protein domain family database constructed automatically by clustering homologous segments. The ProDom building procedure, MKDOM2, is based on recursive PSI-BLAST searches [ALTS2]. The source protein sequences are non-fragmentary sequences derived from UniProtKB (SWISS-PROT and TrEMBL databases). ProDom was first established in 1993 [SONN] and was maintained by the Laboratoire de Génétique Cellulaire and the Laboratoire des Interactions Plantes-Microorganismes (INRA/CNRS) in Toulouse. It is now maintained by the PRABI (bioinformatics center of Rhône-Alpes). The ProDom database consists of domain family entries. Each entry provides a multiple sequence alignment of homologous domains and a family consensus sequence. |
||||||||||||||||||||||||||||||||||||||||||
Conventions used in the database |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
A ProDom entry is characterised by a unique accession number. The purpose of accession numbers is to provide a stable way of identifying entries across releases. As ProDom is rebuilt from scratch for every new version, we have developed a tool that transfers accession numbers stably from one version to the next by searching for domain family overlaps between the two versions. If an entry is split into two or more entries, the accession number of the parent entry is assigned to one of the child entries, and new accession numbers are created for the other child entries. If two or more entries are merged, the accession number of one of the parent entries is assigned to the child entry; the other parent accession numbers are stored as "obsolete" accession numbers that refer to the child entry. If an entry is deleted from the database, its accession number is stored as a "deleted" accession number in the ProDom database. |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
Each domain is a segment derived from a protein sequence. Such a sub-sequence is identified by the name of the protein in the SWISS-PROT or TrEMBL database, followed by the start and end points of the domain in the whole amino acid sequence (domain boundaries). SWISS-PROT sequence identifiers have two parts. The first part is the name of the entry (maximum four letters), and the second part is a code for the organism from which the sequence is extracted (maximum five letters). TrEMBL identifiers are the accession numbers of the proteins in the TrEMBL database. ProDom appends the same five-letter organism code as used in SWISS-PROT to the TrEMBL identifier when the OS line of the TrEMBL entry makes it possible to identify the source organism. Otherwise, the first word of the OC line is used to build a "rough" organism code reflecting the domain of life, which is appended to the TrEMBL identifier as described in the following table: |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
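As an illustration of the naming convention above, the following Python sketch splits a domain name into its parts. The helper name `parse_domain_id` and the `NAME_ORGANISM BEGIN-END` layout (taken from the "SEQUENCE CLOSEST TO CONSENSUS" comment in the sample entry) are assumptions made for this example and are not part of ProDom itself.

```python
from typing import NamedTuple


class DomainId(NamedTuple):
    name: str      # SWISS-PROT entry name or TrEMBL accession number
    organism: str  # organism code appended after the underscore
    begin: int     # first residue of the domain in the protein
    end: int       # last residue of the domain in the protein


def parse_domain_id(text: str) -> DomainId:
    """Parse a domain name such as 'Q8ZEL9_YERPE 5-78' (hypothetical helper)."""
    identifier, bounds = text.split()
    # The organism code is the part after the last underscore.
    name, organism = identifier.rsplit("_", 1)
    begin, end = bounds.split("-")
    return DomainId(name, organism, int(begin), int(end))
```

For instance, `parse_domain_id("Q8ZEL9_YERPE 5-78")` separates the TrEMBL accession number `Q8ZEL9` from the organism code `YERPE` and the boundaries 5 and 78.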
|
The entries in the ProDom database are structured so as to be usable by human readers
as well as by computer programs. The comments and keywords are in ordinary English.
Each family entry is composed of lines. Different types of lines, each with its own format,
are used to record the data that make up the entry. |
||||||||||||||||||||||||||||||||||||||||||
ID 20167 p2002.1 10 seq.
AC PD266930
KW FADR(2) Y586(1) // COMPLETE PROTEOME DNA-BINDING FATTY TRANSCRIPTION REGULATION METABOLISM REGULATOR ACID ACTIVATOR
LA 74
ND 10
CC -!- DIAMETER: 119 PAM
CC -!- RADIUS OF GYRATION: 53 PAM
CC -!- SEQUENCE CLOSEST TO CONSENSUS: Q8ZEL9_YERPE 5-78 (distance:15 PAM)
DC This family was generated by psi-blast, with a profile built from the seed aligment of the following SCOP FAMILY
DC a.4.5.6
AL P09371|FADR_ECOLI 4 77 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8ZP15|Q8ZP15_SALTY 5 78 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8ZEL9|Q8ZEL9_YERPE 5 78 0.22 AQSPAGFAEEYIIESIWNNRFPPGSILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8Z685|Q8Z685_SALTI 5 78 0.35 AQSPAGFAEEYIIESIWNNCFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q9KQU8|Q9KQU8_VIBCH 5 78 0.62 AKSPAGFAEKYIIESIWNGRFPPGSILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNQFMETS
AL Q9CPJ0|Q9CPJ0_PASMU 10 83 0.77 AQSPAGLAEEYIVRSIWNNHFPPGSDLPAERELAEKIGVTRTTLREVLQRLARDGWLNIQHGKPTKVNNIWETS
AL P44705|FADR_HAEIN 10 81 1.08 AQSPAALAEEYIVKSIWQDVFPAGSNLPSERDLADKIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNIWD..
AL O07792|Y586_MYCTU 17 77 2.08 .........EQIATDVLTGEMPPGEALPSERRLAELLGVSRPAVREALKRLSAAGLVEVRQGDVTTVRDF....
AL Q11159|Y494_MYCTU 27 77 2.21 ...........IADAILDGVFPPGSTLPPERDLAERLGVNRTSLRQGLARLQQMGLIEVRHG............
AL Q8XFI2|Q8XFI2_SALTY 59 109 2.23 ...........IIKLINDNIFPPGTFLPPERELAKQLGVSRASLREALIVLEISGWIVIQSG............
CO AQSPAGFAEEYIVKSIWDGVFPPGSTLPPERELAERLGVSRTSLREALQRLERDGWIEIQHGKPTKVNNFWETS
DR INTERPRO; IPR000524 "Bacterial regulatory proteins, GntR"
DR PfamA; PF00392 gntR
DR PROSITE; PS00043 PDOC00042 HTH_GNTR_FAMILY (27-51)
DR PDB; 1H9T chain B (5-78) Q8ZP15_SALTY (5-78),1HW1 chain A (5-78),1HW1 chain B (5-78)
DR PDB; 1H9T chain A (5-78) Q8ZP15_SALTY (5-78)
...
//
|
||||||||||||||||||||||||||||||||||||||||||
|
Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below. |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
As shown in the above table, some line types are found in all entries, while others are optional. Some line types occur many times in a single entry. Each entry must begin with an IDentification line (ID) and end with a Termination line (//). A detailed description of each line type is given in the next section of this document. |
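The line-code structure described above lends itself to a simple reader. The following Python sketch groups the lines of a flat file into entries keyed by their two-character line code. The function name `split_entries` is hypothetical, and the sketch assumes one record line per text line, as in the sample entry.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def split_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping each line code to its line contents."""
    entry: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        code, content = line[:2], line[2:].strip()
        if code == "//":
            # Termination line: emit the entry collected so far.
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        if not entry and code != "ID":
            raise ValueError(f"entry must start with an ID line, got {code!r}")
        entry[code].append(content)
```

Line types that occur several times in an entry (for example CC, AL or DR) are collected in order in the list stored under their code.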
||||||||||||||||||||||||||||||||||||||||||
The different line types |
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The ID (IDentification) line is always the first line of an entry.
The general form of the ID line is: |
||||||||||||||||||||||||||||||||||||||||||
Entry number |
||||||||||||||||||||||||||||||||||||||||||
|
The entry number identifies a family within a given ProDom version; it is not stable across successive ProDom releases. It is equal to the rank of the family after sorting ProDom by decreasing number of domains per family. |
||||||||||||||||||||||||||||||||||||||||||
Release |
||||||||||||||||||||||||||||||||||||||||||
|
The release number indicates the ProDom release of the current entry. As the database is built de novo at each release, we strongly advise users to reload ProDom completely at each new release. |
||||||||||||||||||||||||||||||||||||||||||
Number of domains |
||||||||||||||||||||||||||||||||||||||||||
|
The number of domains is the number of homologous sub-sequences in the multiple alignment of the family. A protein may have several homologous domains of the same family; each occurrence of the domain is counted. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
ID 20167 p2002.1 10 seq. |
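As a minimal sketch, the three fields of the ID line can be extracted with a regular expression. The pattern below is inferred from the single example above (entry number, release, number of domains, then the literal `seq.`); the names `ID_PATTERN` and `parse_id_line` are hypothetical.

```python
import re
from typing import NamedTuple

# Layout inferred from the example line "ID 20167 p2002.1 10 seq." (an assumption).
ID_PATTERN = re.compile(r"^ID\s+(\d+)\s+(\S+)\s+(\d+)\s+seq\.\s*$")


class IdLine(NamedTuple):
    entry_number: int  # rank of the family in this release (not stable)
    release: str       # ProDom release, e.g. "p2002.1"
    n_domains: int     # number of domains in the family


def parse_id_line(line: str) -> IdLine:
    match = ID_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    return IdLine(int(match.group(1)), match.group(2), int(match.group(3)))
```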
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The AC (ACcession number) line gives the accession number, a stable and
unique key associated with each ProDom entry and used to access the database.
The format of the accession number is: the two letters PD followed
by exactly six digits. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
AC PD266930
AC CG266930 |
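The accession number format stated above can be checked with a short regular expression. This sketch implements only the rule given in the text (PD followed by exactly six digits); the CG-prefixed accession number shown in the example would therefore not match it. The function name `is_pd_accession` is hypothetical.

```python
import re

# "PD" followed by exactly six digits, as stated in the text.
PD_ACCESSION = re.compile(r"PD\d{6}")


def is_pd_accession(accession: str) -> bool:
    """Return True if the string has the PD + six digits form."""
    return PD_ACCESSION.fullmatch(accession) is not None
```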
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The KW line contains keywords which can help to identify the domain
family characteristics of the ProDom entry. |
||||||||||||||||||||||||||||||||||||||||||
Frequent name and occurrence |
||||||||||||||||||||||||||||||||||||||||||
|
The frequent name is one of the three most frequent sequence names in the family. A sequence name is the sequence identifier in a UniProt entry without the five-letter organism code. The occurrence is the number of times this name appears in the family. |
||||||||||||||||||||||||||||||||||||||||||
Keyword |
||||||||||||||||||||||||||||||||||||||||||
|
A keyword is one of the 10 most frequent words found in the KW and DE lines of the
UniProt entries of all the domain family members. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
KW FADR(2) Y586(1) // COMPLETE PROTEOME DNA-BINDING FATTY TRANSCRIPTION REGULATION METABOLISM REGULATOR ACID
|
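As a sketch of how the two parts of a KW line might be separated, the following Python function splits the line on the `//` separator seen in the example, then reads the `NAME(count)` pairs on the left and the keywords on the right. The function name `parse_kw_line` is hypothetical, and the layout is inferred from the example rather than from a formal specification.

```python
import re
from typing import List, Tuple


def parse_kw_line(line: str) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return ([(frequent_name, occurrence), ...], [keyword, ...])."""
    if not line.startswith("KW"):
        raise ValueError(f"not a KW line: {line!r}")
    body = line[2:].strip()
    # Frequent names come before "//", keywords after it.
    names_part, _, keywords_part = body.partition("//")
    names = [
        (match.group(1), int(match.group(2)))
        for match in re.finditer(r"(\S+?)\((\d+)\)", names_part)
    ]
    return names, keywords_part.split()
```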
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The LA (Length of Alignment) line gives
the length of each domain sequence, gaps included, once aligned
with the other homologous domains of the family. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
LA 74
|
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The ND (Number of Domains) line gives the number of homologous sub-sequences in the family. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
ND 10
|
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The NM line gives the NorMD value [THOM] computed for this family. The quality of an alignment is generally considered "good" if its NorMD value is higher than 0.4. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
NM 0.506
|
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The CC lines are parsable comments about the
ProDom family entry. They are used to record some family
consistency indicators, and the name of the domain closest to
consensus. |
||||||||||||||||||||||||||||||||||||||||||
The diameter |
||||||||||||||||||||||||||||||||||||||||||
|
The diameter is the maximal distance between two domains in the family. The distances are computed in PAM. In some cases, the distance between those domains cannot be computed, and the default value "1001 PAM" is given instead. |
||||||||||||||||||||||||||||||||||||||||||
The radius of gyration |
||||||||||||||||||||||||||||||||||||||||||
|
The radius of gyration is the weighted root mean square distance between each domain and the family consensus sequence. The distances are also computed in PAM. |
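To make the definition concrete, here is a minimal Python sketch of a weighted root mean square distance. The exact formula and weighting scheme used by ProDom are not given in this document, so the normalisation by the sum of the weights is an assumption, and the function name `radius_of_gyration` is hypothetical.

```python
import math
from typing import Sequence


def radius_of_gyration(distances: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted RMS of the domain-to-consensus distances (in PAM).

    Assumes the weights are normalised by their sum; this is not
    confirmed by the ProDom documentation.
    """
    if len(distances) != len(weights) or not distances:
        raise ValueError("need one weight per distance, and at least one domain")
    total_weight = sum(weights)
    weighted_squares = sum(w * d * d for d, w in zip(distances, weights))
    return math.sqrt(weighted_squares / total_weight)
```

With equal weights this reduces to the ordinary root mean square of the distances.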
||||||||||||||||||||||||||||||||||||||||||
The sequence closest to consensus |
||||||||||||||||||||||||||||||||||||||||||
|
The sequence closest to consensus is the sub-sequence whose distance to the family consensus sequence is the smallest. This information can help to select the domain that best represents the family. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
CC -!- DIAMETER: 119 PAM
CC -!- RADIUS OF GYRATION: 53 PAM
CC -!- SEQUENCE CLOSEST TO CONSENSUS: Q8ZEL9_YERPE 5-78 (distance:15 PAM)
|
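Since the CC lines are meant to be parsable, the three comment topics in the example above can be read with regular expressions. The sketch below is based only on that example; the function name `parse_cc_lines` and the dictionary keys are hypothetical. It maps the default value 1001 PAM to `None`, because that value indicates a diameter that could not be computed.

```python
import re
from typing import Dict, Iterable

UNKNOWN_DIAMETER = 1001  # default value given when the distance cannot be computed

_DIAMETER = re.compile(r"^CC\s+-!-\s+DIAMETER:\s*(\d+)\s+PAM")
_RADIUS = re.compile(r"^CC\s+-!-\s+RADIUS OF GYRATION:\s*(\d+)\s+PAM")
_CLOSEST = re.compile(
    r"^CC\s+-!-\s+SEQUENCE CLOSEST TO CONSENSUS:\s*(\S+)\s+(\d+)-(\d+)"
    r"\s+\(distance:\s*(\d+)\s+PAM\)"
)


def parse_cc_lines(lines: Iterable[str]) -> Dict[str, object]:
    """Collect the consistency indicators found in the CC lines of one entry."""
    info: Dict[str, object] = {}
    for line in lines:
        match = _DIAMETER.match(line)
        if match:
            value = int(match.group(1))
            info["diameter_pam"] = None if value == UNKNOWN_DIAMETER else value
            continue
        match = _RADIUS.match(line)
        if match:
            info["radius_pam"] = int(match.group(1))
            continue
        match = _CLOSEST.match(line)
        if match:
            name, begin, end, distance = match.groups()
            info["closest"] = (name, int(begin), int(end), int(distance))
    return info
```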
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The DC line indicates the query used by PSI-BLAST to build this family: a UniProt sequence or a SCOP domain. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
DC This family was generated by psi-blast, with a profile built from the seed aligment of the following SCOP FAMILY
DC a.4.5.6
|
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
Each AL (Alignment Line) represents a domain
aligned with all the homologous domains of the family.
The general form of an AL line is:
AL SWISS-PROT_AC|SWISS-PROT_ID BEGIN END WEIGHT ALIGNED_SEQUENCE
or:
AL TREMBL_AC|TREMBL_ID_SPECIES BEGIN END WEIGHT ALIGNED_SEQUENCE
|
||||||||||||||||||||||||||||||||||||||||||
The SWISS-PROT and the TrEMBL accession numbers |
||||||||||||||||||||||||||||||||||||||||||
|
The SWISS-PROT or TrEMBL accession number is the accession number of the protein sequence in the SWISS-PROT or TrEMBL database, respectively. |
||||||||||||||||||||||||||||||||||||||||||
The SWISS-PROT and the TREMBL identifiers |
||||||||||||||||||||||||||||||||||||||||||
|
The SWISS-PROT identifier is the sequence identifier of the protein in the SWISS-PROT database. The TrEMBL identifier is the accession number of the sequence in the TrEMBL database, modified as described in the section "Conventions used in the database". |
||||||||||||||||||||||||||||||||||||||||||
The domain begin and end |
||||||||||||||||||||||||||||||||||||||||||
|
The begin and end numbers give the boundaries of the domain in the whole protein sequence. The amino acid numbering is the same as in SWISS-PROT and TrEMBL. |
||||||||||||||||||||||||||||||||||||||||||
The weight and the aligned sequence |
||||||||||||||||||||||||||||||||||||||||||
|
In ProDom families, the multiple sequence
alignment and the weights are computed by Multalin [CORP1]. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
AL P09371|FADR_ECOLI 4 77 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8ZP15|Q8ZP15_SALTY 5 78 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
|
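As a sketch, an AL line in the form given above can be split into its fields on whitespace. The names `AlignedDomain` and `parse_al_line` are hypothetical, and the `residues` property assumes that `.` is the gap character, as in the sample entry.

```python
from typing import NamedTuple


class AlignedDomain(NamedTuple):
    accession: str   # SWISS-PROT or TrEMBL accession number
    identifier: str  # SWISS-PROT identifier, or TrEMBL accession plus organism code
    begin: int       # first residue of the domain in the protein
    end: int         # last residue of the domain in the protein
    weight: float    # weight computed by Multalin
    aligned: str     # aligned sequence, gaps included

    @property
    def residues(self) -> str:
        """Domain sequence without gaps (assumes '.' marks a gap)."""
        return self.aligned.replace(".", "")


def parse_al_line(line: str) -> AlignedDomain:
    fields = line.split()
    if len(fields) != 6 or fields[0] != "AL":
        raise ValueError(f"not a valid AL line: {line!r}")
    _, ids, begin, end, weight, aligned = fields
    accession, identifier = ids.split("|", 1)
    return AlignedDomain(accession, identifier, int(begin), int(end), float(weight), aligned)
```

In the sample entry, the number of residues left after removing the gaps equals `end - begin + 1`, which gives a simple consistency check on a parsed line.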
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The CO (COnsensus) line contains the consensus sequence of the domain family. It is computed by Multalin from the family multiple alignment. For each column of the multiple alignment, external gaps are not taken into account when calculating the consensus amino acid. Thus, there is no external gap in the consensus sequence; only internal gaps are allowed. |
||||||||||||||||||||||||||||||||||||||||||
Example |
||||||||||||||||||||||||||||||||||||||||||
CO AQSPAGFAEEYIVKSIWDGVFPPGSTLPPERELAERLGVSRTSLREALQRLERDGWIEIQHGKPTKVNNFWETS
|
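The treatment of external gaps can be illustrated with a small Python sketch. It is not Multalin's method: it uses a plain unweighted majority vote per column, purely to show how leading and trailing gaps are excluded while internal gaps still count. The function names and the use of `.` as the gap character (taken from the sample entry) are assumptions.

```python
from collections import Counter
from typing import List, Optional, Sequence

GAP = "."  # gap character as seen in the sample entry


def mask_external_gaps(aligned: str) -> List[Optional[str]]:
    """Replace leading and trailing gaps by None; keep internal gaps."""
    first = len(aligned) - len(aligned.lstrip(GAP))  # index of first residue
    last = len(aligned.rstrip(GAP))                  # index after last residue
    return [c if first <= i < last else None for i, c in enumerate(aligned)]


def majority_consensus(rows: Sequence[str]) -> str:
    """Unweighted majority vote per column, ignoring external gaps."""
    masked = [mask_external_gaps(row) for row in rows]
    consensus = []
    for column in zip(*masked):
        counts = Counter(c for c in column if c is not None)
        # A column with no contributing sequence is left as a gap.
        consensus.append(counts.most_common(1)[0][0] if counts else GAP)
    return "".join(consensus)
```

For example, with the rows `ACD.`, `AC.E` and `.CDE`, the external gaps at the ends of the first and third rows are skipped, so every column still receives a residue.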
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
The DR (Database cross-Reference) lines are
used as pointers to information related to the ProDom entry and
found in data collections other than ProDom.
The general form of a DR line is: |
||||||||||||||||||||||||||||||||||||||||||
The database identifier |
||||||||||||||||||||||||||||||||||||||||||
|
ProDom families are currently cross-referenced to the following databases: |
||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||
The cross-reference information |
||||||||||||||||||||||||||||||||||||||||||
|
The cross-reference information
consists of an unambiguous pointer to the relevant
entry in the target database, plus some extra information
such as the name of the relevant domain or its position in
the sequence.
DR GO; GO:0006810 P 0.275 1.00 "transport"
|
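As a minimal sketch, a DR line can be split into the database identifier and the cross-reference information at the first semicolon. This layout is inferred from the DR lines shown in this document; the function name `parse_dr_line` is hypothetical, and the cross-reference information is returned as raw text because its fields differ from one database to another.

```python
from typing import Tuple


def parse_dr_line(line: str) -> Tuple[str, str]:
    """Return (database identifier, cross-reference information)."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    database, separator, info = line[2:].strip().partition(";")
    if not separator:
        raise ValueError(f"missing ';' after the database identifier: {line!r}")
    return database.strip(), info.strip()
```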
||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||
|
This line signals the end of the ProDom entry. |
||||||||||||||||||||||||||||||||||||||||||
|
Acknowledgements |
||||||||||||||||||||||||||||||||||||||||||
|
Florence Corpet at Laboratoire de Génétique Cellulaire, Jérôme Gouzy, Daniel Kahn and Florence Servant at Laboratoire des Interactions Plantes-Microorganismes (LIPM), INRA/CNRS in Toulouse |
||||||||||||||||||||||||||||||||||||||||||
|
If you want to cite ProDom in a publication, please use the reference [BRU]. |
||||||||||||||||||||||||||||||||||||||||||
|
© The ProDom database is copyrighted by INRA and CNRS |
||||||||||||||||||||||||||||||||||||||||||