ProDom, the Protein Domain Database
 

The ProDom data file description

 
| What is ProDom | Conventions used in the database | The different line types | Citation |
 

What is ProDom

 

ProDom is a protein domain family database constructed automatically by clustering homologous segments. The ProDom building procedure, MKDOM2, is based on recursive PSI-BLAST searches [ALTS2]. The source protein sequences are non-fragmentary sequences derived from UniProtKB (the SWISS-PROT and TrEMBL databases). ProDom was first established in 1993 [SONN] and was originally maintained by the Laboratoire de Génétique Cellulaire and the Laboratoire des Interactions Plantes-Microorganismes (INRA/CNRS) in Toulouse. It is now maintained by the PRABI (the bioinformatics center of Rhône-Alpes). The ProDom database consists of domain family entries. Each entry provides a multiple sequence alignment of homologous domains and a family consensus sequence.

 

Conventions used in the database

 

Domain Family Accession Numbers

A ProDom entry is characterised by a unique accession number. The purpose of accession numbers is to provide a stable way of identifying entries across releases. As ProDom is built anew for every version, we have developed a tool that transfers accession numbers stably from one version to the next by searching for domain family overlaps between the two versions. If an entry is split into two or more entries, the accession number of the parent entry is assigned to one of the child entries, and new accession numbers are created for the other child entries. If two or more entries are merged, the accession number of one of the parent entries is assigned to the child entry; the other parent accession numbers are stored as "obsolete" accession numbers, and they refer to the child accession number. If an entry is deleted from the database, its accession number is stored as a "deleted" accession number in the ProDom database.
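The lookup implied by these rules can be sketched in Python. This is only an illustration of the bookkeeping described above, not the ProDom transfer tool itself: the function name, the three containers and the accession numbers in the usage example are all hypothetical.

```python
def resolve_accession(accession, current, obsolete, deleted):
    """Return the accession number to use in the current release.

    current  : set of accession numbers present in the release (hypothetical)
    obsolete : dict mapping an obsolete accession number to the child
               accession number it now refers to (hypothetical)
    deleted  : set of accession numbers stored as "deleted" (hypothetical)

    Returns None when the entry has been deleted.
    """
    if accession in current:
        return accession
    if accession in obsolete:
        # A merged parent refers to the child entry that replaced it.
        return obsolete[accession]
    if accession in deleted:
        return None
    raise KeyError(f"unknown accession number: {accession}")


# Made-up accession numbers, for illustration only.
current = {"PD000001", "PD000002"}
obsolete = {"PD000003": "PD000001"}
deleted = {"PD000004"}
print(resolve_accession("PD000003", current, obsolete, deleted))
```

Keeping obsolete and deleted accession numbers in separate containers mirrors the distinction made in the text: an obsolete number still leads to an entry, a deleted one does not.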

Domain sequence identifiers

Each domain is a segment derived from a protein sequence. Such a sub-sequence is identified by the name of the protein in the SWISS-PROT or TrEMBL database, followed by the start and end points of the domain in the whole amino acid sequence (domain boundaries). The SWISS-PROT sequence identifiers have two parts. The first is the name of the entry (maximum 4 letters), and the second is a code for the organism from which the sequence originates (maximum 5 letters). The TrEMBL identifier is the accession number of the protein in the TrEMBL database. ProDom appends the same 5-letter organism code as used in SWISS-PROT to the TrEMBL identifier when the OS line of the TrEMBL entry makes it possible to determine the source organism. Otherwise, the first word of the OC line is used to build a "rough" organism code reflecting the domain of life, which is appended to the TrEMBL identifier as described in the following table:

 

Code used in ProDom    Domain of life
EEEEE                  Eukaryota
BBBBB                  Bacteria
AAAAA                  Archaea
VVVVV                  Viruses
XXXXX                  Unknown
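The fallback rule for TrEMBL identifiers can be sketched in Python as follows. This is a minimal illustration, not ProDom code: the function names are hypothetical, the underscore separator is taken from identifiers such as Q8ZP15_SALTY in the sample entry, and treating any unrecognised first word of the OC line as "Unknown" is an assumption.

```python
# First word of the OC line -> "rough" organism code (from the table above).
ROUGH_CODES = {
    "Eukaryota": "EEEEE",
    "Bacteria": "BBBBB",
    "Archaea": "AAAAA",
    "Viruses": "VVVVV",
}


def rough_organism_code(oc_text):
    """Return the rough organism code for the text of an OC line."""
    words = oc_text.replace(";", " ").split()
    first_word = words[0] if words else ""
    # Assumption: anything not listed in the table is reported as unknown.
    return ROUGH_CODES.get(first_word, "XXXXX")


def trembl_identifier(accession, organism_code):
    """Join a TrEMBL accession number and an organism code."""
    return f"{accession}_{organism_code}"


print(trembl_identifier("Q8ZP15", "SALTY"))
print(trembl_identifier("Q8ZP15", rough_organism_code("Bacteria; Proteobacteria")))
```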

 

Structure of a domain family entry

The entries in the ProDom database are structured so as to be usable by human readers as well as by computer programs. The comments and keywords are in ordinary English. Each family entry is composed of lines. Different types of lines, each with their own format, are used to record the data that make up the entry.
A sample domain family entry is shown below:

 
ID 20167 p2002.1                           10 seq.
AC   PD266930
KW   FADR(2) Y586(1) // COMPLETE PROTEOME DNA-BINDING FATTY TRANSCRIPTION REGULATION METABOLISM REGULATOR ACID ACTIVATOR 
LA   74
ND   10
CC   -!- DIAMETER:      119 PAM
CC   -!- RADIUS OF GYRATION:    53 PAM
CC   -!- SEQUENCE CLOSEST TO CONSENSUS: Q8ZEL9_YERPE 5-78 (distance:15 PAM)
DC   This family was generated by psi-blast, with a profile built from the seed aligment of the following SCOP FAMILY
DC   a.4.5.6
AL P09371|FADR_ECOLI            4    77 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8ZP15|Q8ZP15_SALTY          5    78 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8ZEL9|Q8ZEL9_YERPE          5    78 0.22 AQSPAGFAEEYIIESIWNNRFPPGSILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8Z685|Q8Z685_SALTI          5    78 0.35 AQSPAGFAEEYIIESIWNNCFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q9KQU8|Q9KQU8_VIBCH          5    78 0.62 AKSPAGFAEKYIIESIWNGRFPPGSILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNQFMETS
AL Q9CPJ0|Q9CPJ0_PASMU         10    83 0.77 AQSPAGLAEEYIVRSIWNNHFPPGSDLPAERELAEKIGVTRTTLREVLQRLARDGWLNIQHGKPTKVNNIWETS
AL P44705|FADR_HAEIN           10    81 1.08 AQSPAALAEEYIVKSIWQDVFPAGSNLPSERDLADKIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNIWD..
AL O07792|Y586_MYCTU           17    77 2.08 .........EQIATDVLTGEMPPGEALPSERRLAELLGVSRPAVREALKRLSAAGLVEVRQGDVTTVRDF....
AL Q11159|Y494_MYCTU           27    77 2.21 ...........IADAILDGVFPPGSTLPPERDLAERLGVNRTSLRQGLARLQQMGLIEVRHG............
AL Q8XFI2|Q8XFI2_SALTY         59   109 2.23 ...........IIKLINDNIFPPGTFLPPERELAKQLGVSRASLREALIVLEISGWIVIQSG............
CO                                           AQSPAGFAEEYIVKSIWDGVFPPGSTLPPERELAERLGVSRTSLREALQRLERDGWIEIQHGKPTKVNNFWETS
DR   INTERPRO;    IPR000524 "Bacterial regulatory proteins, GntR"
DR   PfamA;       PF00392 gntR
DR   PROSITE;     PS00043 PDOC00042 HTH_GNTR_FAMILY (27-51)
DR   PDB;         1H9T chain B  (5-78) Q8ZP15_SALTY (5-78),1HW1 chain A (5-78),1HW1 chain B (5-78)
DR   PDB;         1H9T chain A  (5-78) Q8ZP15_SALTY (5-78)
...
//
 

Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below.

 

Line code   Content                     Occurrence in an entry
ID          IDentification              Once; starts the entry
AC          ACcession number            Once
KW          KeyWords                    Once
LA          Length of Alignment         Once
ND          Number of Domains           Once
NM          NorMD value                 Once
CC          Parsable comments           Three times
DC          Database Comments           Optional
AL          Domain Alignment Line       One or more
CO          Consensus sequence          Once
DR          Database cross-References   Optional
CT          Copyright notice            Required (2 lines)
//          Termination line            Once; ends the entry

 

As shown in the above table, some line types are found in all entries, while others are optional. Some line types occur many times in a single entry. Each entry must begin with an IDentification line (ID) and end with a Termination line (//). A detailed description of each line type is given in the next section of this document.
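Because every line starts with a two-character code and every entry ends with //, a data file can be read with very little code. The Python sketch below shows one possible approach; the function names are illustrative, and the choice to ignore blank lines and any trailing lines not closed by // is an assumption.

```python
from collections import defaultdict


def split_entries(text):
    """Yield each entry as a list of lines, without its '//' line.

    Blank lines are skipped; lines after the last '//' are ignored.
    """
    current = []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                yield current
            current = []
        elif line.strip():
            current.append(line)


def group_by_line_code(entry_lines):
    """Map each two-character line code to the list of its line contents."""
    if not entry_lines or not entry_lines[0].startswith("ID"):
        raise ValueError("an entry must begin with an ID line")
    grouped = defaultdict(list)
    for line in entry_lines:
        grouped[line[:2]].append(line[2:].strip())
    return dict(grouped)


sample = """ID   20167 p2002.1                           10 seq.
AC   PD266930
LA   74
ND   10
CC   -!- DIAMETER:      119 PAM
CC   -!- RADIUS OF GYRATION:    53 PAM
//
"""
for entry in split_entries(sample):
    print(group_by_line_code(entry)["AC"])
```

Grouping by line code keeps repeated line types (CC, DC, AL, DR) in file order, so each type can then be handed to its own parser.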

 

The different line types

The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:
ID ENTRY_NUMBER pRELEASE NUMBER_OF_DOMAINS seq.

 

Entry number

The entry number identifies a family within a single ProDom version; it is not stable across successive ProDom releases. It is equal to the rank of the family after sorting ProDom by decreasing number of domains per family.

 

Release

The release number indicates the ProDom release of the current entry. As the database is built de novo at each release, we strongly advise users to reload ProDom completely at each new release.

 

Number of domains

The number of domains is the number of homologous sub-sequences in the multiple alignment of the family. A protein may contain several homologous domains of the same family; each occurrence of the domain is counted.

 

Example

ID   20167 p2002.1                           10 seq.
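
The general form above can be matched with a regular expression. The Python sketch below is one possible reading of the format, assuming that fields are separated by arbitrary whitespace and that the release is everything between the leading p and the next space; the function and key names are illustrative.

```python
import re

# ID ENTRY_NUMBER pRELEASE NUMBER_OF_DOMAINS seq.
ID_LINE = re.compile(r"^ID\s+(\d+)\s+p(\S+)\s+(\d+)\s+seq\.\s*$")


def parse_id_line(line):
    """Split an ID line into entry number, release and number of domains."""
    match = ID_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    return {
        "entry_number": int(match.group(1)),
        "release": match.group(2),
        "number_of_domains": int(match.group(3)),
    }


print(parse_id_line("ID   20167 p2002.1                           10 seq."))
```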
	   

The AC line

The AC (ACcession number) line gives the accession number, a stable and unique key associated with each ProDom entry and used to access the database. The format of the accession number is: the 2 letters PD followed by exactly 6 digits.
For ProDom-CG, the format of the accession number is: the 2 letters CG followed by the same digits.

Example

AC   PD266930
AC   CG266930
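
A format check for accession numbers can be sketched in Python as follows. The pattern assumes that ProDom-CG accession numbers also carry exactly 6 digits, as in the example above; the function name and the returned keys are illustrative.

```python
import re

# "PD" or "CG" followed by exactly 6 digits (6 digits for CG is an assumption).
AC_LINE = re.compile(r"^AC\s+(PD|CG)(\d{6})\s*$")


def parse_ac_line(line):
    """Return the accession number and the database it belongs to."""
    match = AC_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid AC line: {line!r}")
    prefix, digits = match.groups()
    database = "ProDom" if prefix == "PD" else "ProDom-CG"
    return {"accession": prefix + digits, "database": database}


print(parse_ac_line("AC   PD266930"))
print(parse_ac_line("AC   CG266930"))
```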

The KW line

The KW line contains keywords that help to characterise the domain family described by the ProDom entry.
The general form of the KW line is:
KW [FREQUENT_NAME(OCCURRENCE)...] // KEYWORD [KEYWORD ...]

 

Frequent name and occurrence

The frequent name is one of the three most frequent sequence names in the family. A sequence name is the sequence identifier in a UniProt entry without the 5-letter organism code. The occurrence is the number of times this name appears in the family.

 

Keyword

A keyword is one of the 10 most frequent words found in the KW and DE lines of the UniProt entries of all the domain family members.
Up to 10 keywords may be listed on the KW line, sorted by decreasing frequency. The procedure used to build this automatic comment may be improved in future ProDom releases.

 

Example

KW   FADR(2) Y586(1) // COMPLETE PROTEOME DNA-BINDING FATTY TRANSCRIPTION REGULATION METABOLISM REGULATOR ACID
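
The two halves of a KW line can be separated on the // delimiter. The Python sketch below is an illustration with hypothetical function names; it splits the keyword part on whitespace, so a keyword made of several words (if any) would come back as separate tokens.

```python
import re

# FREQUENT_NAME(OCCURRENCE), e.g. FADR(2)
NAME_COUNT = re.compile(r"^(.+)\((\d+)\)$")


def parse_kw_line(line):
    """Return ([(frequent_name, occurrence), ...], [keyword, ...])."""
    if not line.startswith("KW"):
        raise ValueError(f"not a KW line: {line!r}")
    names_part, _, keywords_part = line[2:].strip().partition("//")
    frequent_names = []
    for token in names_part.split():
        match = NAME_COUNT.match(token)
        if match:
            frequent_names.append((match.group(1), int(match.group(2))))
    return frequent_names, keywords_part.split()


names, keywords = parse_kw_line(
    "KW   FADR(2) Y586(1) // COMPLETE PROTEOME DNA-BINDING FATTY "
    "TRANSCRIPTION REGULATION METABOLISM REGULATOR ACID"
)
print(names)
print(keywords)
```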
      

The LA line

The LA (Length of Alignment) line gives the length of a domain sequence, gaps included, once aligned with the other homologous domains of the family.

Example

LA   74
      

The ND line

The ND (Number of Domains) line gives the number of homologous sub-sequences in the family.

Example

ND   10
      

The NM line

The NM line gives the NorMD value [THOM] computed for this family. The quality of an alignment is generally considered "good" if its NorMD value is higher than 0.4.

Example

NM   0.506
      

The CC lines

The CC lines are parsable comments about the ProDom family entry. They record family consistency indicators and the name of the domain closest to the consensus.
The general form of a CC line is:
CC   -!- TOPIC: INFORMATION
There are three topic types: DIAMETER, RADIUS OF GYRATION, and SEQUENCE CLOSEST TO CONSENSUS.

 

The diameter

The diameter is the maximal distance between two domains in the family. The distances are computed in PAM. In some cases the distance between those domains cannot be computed, and the default value "1001 PAM" is given instead.

 

The radius of gyration

The radius of gyration is the weighted root mean square distance between each domain and the family consensus sequence. The distances are also computed in PAM.

 

The sequence closest to consensus

The sequence closest to consensus is the sub-sequence whose distance to the family consensus sequence is the smallest. This information can help to select the domain that best represents the family.
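
The three indicators can be illustrated with a short Python sketch. It follows the definitions above literally and is not the ProDom implementation: the function names are hypothetical, the distances are assumed to be already available in PAM, and normalising the weighted mean by the sum of the weights is an assumption.

```python
import math


def diameter(pairwise_distances):
    """Largest distance between two domains; maps (i, j) -> PAM distance."""
    return max(pairwise_distances.values())


def radius_of_gyration(distances_to_consensus, weights):
    """Weighted root mean square distance to the consensus.

    Assumption: the weighted mean is normalised by the sum of the weights.
    """
    total_weight = sum(weights)
    weighted_squares = sum(
        w * d * d for d, w in zip(distances_to_consensus, weights)
    )
    return math.sqrt(weighted_squares / total_weight)


def closest_to_consensus(distance_by_domain):
    """Identifier of the domain with the smallest distance to the consensus."""
    return min(distance_by_domain, key=distance_by_domain.get)


# Made-up distances, for illustration only.
print(diameter({("a", "b"): 10, ("a", "c"): 119, ("b", "c"): 40}))
print(closest_to_consensus({"dom1": 15, "dom2": 20}))
```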

 

Example

CC   -!- DIAMETER:      119 PAM
CC   -!- RADIUS OF GYRATION:    53 PAM
CC   -!- SEQUENCE CLOSEST TO CONSENSUS: Q8ZEL9_YERPE 5-78 (distance:15 PAM)
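
Since the CC lines are meant to be parsable, the three topics can be read with regular expressions. The Python sketch below is based only on the example lines above; the function name, the returned structure and the strictness of the patterns are assumptions.

```python
import re

CC_LINE = re.compile(r"^CC\s+-!-\s+([^:]+):\s*(.*?)\s*$")
PAM_VALUE = re.compile(r"^(\d+)\s+PAM$")
CLOSEST = re.compile(r"^(\S+)\s+(\d+)-(\d+)\s+\(distance:\s*(\d+)\s+PAM\)$")


def parse_cc_line(line):
    """Return (topic, value) for one CC line."""
    match = CC_LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid CC line: {line!r}")
    topic, info = match.group(1).strip(), match.group(2)
    if topic in ("DIAMETER", "RADIUS OF GYRATION"):
        value = PAM_VALUE.match(info)
        if value is None:
            raise ValueError(f"expected a PAM value: {info!r}")
        # A diameter of 1001 is the default used when it cannot be computed.
        return topic, int(value.group(1))
    if topic == "SEQUENCE CLOSEST TO CONSENSUS":
        closest = CLOSEST.match(info)
        if closest is None:
            raise ValueError(f"unexpected information: {info!r}")
        return topic, {
            "identifier": closest.group(1),
            "begin": int(closest.group(2)),
            "end": int(closest.group(3)),
            "distance": int(closest.group(4)),
        }
    raise ValueError(f"unknown CC topic: {topic!r}")


print(parse_cc_line("CC   -!- DIAMETER:      119 PAM"))
```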
      

The DC lines

The DC (Database Comments) lines indicate the query used by PSI-BLAST to build this family: a UniProt sequence or a SCOP domain.

Example

DC   This family was generated by psi-blast, with a profile built from the seed aligment of the following SCOP FAMILY
DC   a.4.5.6
      

The AL lines

Each AL (Alignment Line) represents a domain aligned with all the homologous domains of the family.
The general form of the AL line is:

AL   SWISS-PROT_AC|SWISS-PROT_ID BEGIN END WEIGHT ALIGNED_SEQUENCE
      
or:
AL   TREMBL_AC|TREMBL_ID_SPECIES BEGIN END WEIGHT ALIGNED_SEQUENCE
      
 

The SWISS-PROT and the TrEMBL accession numbers

The SWISS-PROT or TrEMBL accession number is the accession number of the protein sequence in the SWISS-PROT or TrEMBL database, respectively.

 

The SWISS-PROT and the TREMBL identifiers

The SWISS-PROT identifier is the sequence identifier of the protein in the SWISS-PROT database. The TrEMBL identifier is the accession number of the sequence in the TrEMBL database, modified as described under "Conventions used in the database".

 

The domain begin and end

The begin and end numbers give the boundaries of the domain in the whole protein sequence. The amino acid numbering is the same as in SWISS-PROT and TrEMBL.

 

The weight and the aligned sequence

In ProDom families, the multiple sequence alignment and the weights are computed by Multalin [CORP1].
Sequence weights make it possible to down-weight overly similar sequences in the alignment. The smaller the weight, the more very similar domains the current domain has in the family.
The aligned sequences are given with two types of gaps: . for gaps at domain extremities (external gaps), and - for gaps inside the domain sequence (internal gaps).

 

Example

AL P09371|FADR_ECOLI            4    77 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
AL Q8ZP15|Q8ZP15_SALTY          5    78 0.22 AQSPAGFAEEYIIESIWNNRFPPGTILPAERELSELIGVTRTTLREVLQRLARDGWLTIQHGKPTKVNNFWETS
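
An AL line has five whitespace-separated fields after the line code, so it can be split directly. The Python sketch below uses hypothetical class and function names; it assumes that the aligned sequence never contains spaces and that removing both gap characters gives back the domain sequence.

```python
from dataclasses import dataclass


@dataclass
class AlignedDomain:
    accession: str
    identifier: str
    begin: int
    end: int
    weight: float
    aligned_sequence: str

    @property
    def residues(self):
        """Domain sequence without external ('.') and internal ('-') gaps."""
        return self.aligned_sequence.replace(".", "").replace("-", "")


def parse_al_line(line):
    """Parse 'AL AC|ID BEGIN END WEIGHT ALIGNED_SEQUENCE'."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "AL":
        raise ValueError(f"not a valid AL line: {line!r}")
    accession, _, identifier = fields[1].partition("|")
    return AlignedDomain(
        accession=accession,
        identifier=identifier,
        begin=int(fields[2]),
        end=int(fields[3]),
        weight=float(fields[4]),
        aligned_sequence=fields[5],
    )


# Made-up line with both gap types, for illustration only.
domain = parse_al_line("AL X00000|TEST_ECOLI 3 6 0.50 ..AB-CD.")
print(domain.residues, domain.end - domain.begin + 1)
```

Comparing the number of residues with END - BEGIN + 1 gives a simple consistency check on a parsed line.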
      

The CO line

The CO (COnsensus) line contains the consensus sequence of the domain family. It is computed by Multalin from the family multiple alignment. For each column of the multiple alignment, external gaps are not taken into account when calculating the consensus amino acid. Thus, there are no external gaps in the consensus sequence; only internal gaps are allowed.

Example

CO                                           AQSPAGFAEEYIVKSIWDGVFPPGSTLPPERELAERLGVSRTSLREALQRLERDGWIEIQHGKPTKVNNFWETS
      

The DR lines

The DR (Database cross-References) lines are used as pointers to information related to the ProDom entry and found in data collections other than ProDom. The general form of a DR line is:
DR   DATABASE_IDENTIFIER; INFORMATION

 

The database identifier

ProDom families are currently cross-referenced to the following databases:

 

Identifier   Database description
GO           GENE ONTOLOGY database
INTERPRO     INTERPRO protein families database
PROSITE      PROSITE protein domains and families database
PFAMA        Pfam-A protein domain database
PDB          Brookhaven Protein Data Bank

 

The cross-reference information

The cross-reference information consists of an unambiguous pointer to the entry in the target database, plus some extra information such as the name of the relevant domain or the position in the sequence.

  • GO: the cross-reference information is the accession number of the GO entry, the corresponding ontology (biological Process, molecular Function, Cellular component), a precision indicator (from 0 to 1; the higher the value, the more precise the term in the Gene Ontology tree, i.e. the nearer to a leaf), a probability of assignment, and the entry name.
    DR GO; GO:0006810 P 0.275 1.00 "transport"
     
  • INTERPRO: the cross-reference information is the accession number of the INTERPRO entry and the name of the entry.
    DR INTERPRO; IPR000524 "Bacterial regulatory proteins, GntR"
     
  • PROSITE: the cross-reference information is the accession number of the PROSITE entry, the accession number of the associated documentation (PDOC), the identifier of the PROSITE entry and the position of the pattern match on the consensus sequence.
    DR PROSITE; PS00043 PDOC00042 HTH_GNTR_FAMILY (27-51)
     
  • PFAM-A: the cross-reference information is the accession number and the identifier of the Pfam-A entry.
    DR PfamA; PF00392 gntR
     
  • PDB: the cross-reference information is the PDB code of the three-dimensional structure, the chain identifier, the position of the match in the structure, the name of the relevant sequence and the position of the match in the sequence. As several chains or several PDB codes generally match the same SWISS-PROT entry, those other PDB entries are listed with a comma (,) as separator.
    DR PDB; 1H9T chain B (5-78) Q8ZP15_SALTY (5-78),1HW1 chain A (5-78),1HW1 chain B (5-78)
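
The DR lines can be read in two steps: split off the database identifier at the semicolon, then interpret the information according to the database. The Python sketch below handles the generic split and the GO case only. The function names are illustrative, the identifier is upper-cased because the examples use both PfamA and PFAMA, and the single letters P, F and C for the three ontologies are inferred from the description above.

```python
import re

# accession, ontology letter, precision, probability, quoted entry name
GO_INFO = re.compile(r'^(GO:\d+)\s+([PFC])\s+(\S+)\s+(\S+)\s+"(.*)"$')


def parse_dr_line(line):
    """Return (database_identifier, information) for one DR line."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    database, separator, info = line[2:].strip().partition(";")
    if not separator:
        raise ValueError(f"missing ';' in DR line: {line!r}")
    return database.strip().upper(), info.strip()


def parse_go_info(info):
    """Interpret the information part of a GO cross-reference."""
    match = GO_INFO.match(info)
    if match is None:
        raise ValueError(f"unexpected GO information: {info!r}")
    return {
        "accession": match.group(1),
        "ontology": match.group(2),
        "precision": float(match.group(3)),
        "probability": float(match.group(4)),
        "name": match.group(5),
    }


database, info = parse_dr_line('DR GO; GO:0006810 P 0.275 1.00 "transport"')
print(database, parse_go_info(info))
```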
     

The // line

This line signals the end of the ProDom entry.

Acknowledgements

Florence Corpet at Laboratoire de Génétique Cellulaire, Jérôme Gouzy, Daniel Kahn and Florence Servant at Laboratoire des Interactions Plantes-Microorganismes (LIPM), INRA/CNRS in Toulouse.

 

Citation

If you want to cite ProDom in a publication, please use the reference [BRU].

© The ProDom database is copyrighted by INRA and CNRS
© UniProtKB copyright (c) 2002-2011 UniProt Consortium
ProDom - Server maintained by Dominique Guyot, on behalf of the ProDom team
Graphics design Sandrine Dalmar
Last updated on December 22nd, 2011.