ProDom, the Protein Domain Database

Fetchdom, release 3.00

With fetchdom, release 3, you may dig into the ProDom or Xdom files, extracting data as needed. Using other Unix utilities (nawk, perl, grep, etc.) you may execute rather complex queries.

Most switches used by fetchdom 2.0 are still usable in fetchdom 3, with the same meaning and nearly the same output; thus, you should not have to change your scripts too much. However, you are encouraged to use the new fetchdom syntax, which is much easier to understand.

Downloading and installing the program

fetchdom, release 3 is covered by the same license as ProDom: there is no fee for academics, but private companies are charged. Go to the download page.

You may download the source version as well as a binary program, statically linked for a linux-type machine.

Selecting the database

If the environment variable $PRODOM is set, its value must be the directory in which fetchdom looks for the files. If it is not set, fetchdom looks for them in the current directory (".").
If the database is called prodom, fetchdom reads its data from files named after the database, such as prodom.srs and prodom.xdom (both described in the query sections below).

The database may be selected with the switch -b, using the following rules:

No -b switch at all
The database is called prodom
-b name
The database is called name

In the rest of this document, it will be supposed that the database is called prodom, so that the -b switch will not be used.
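
The lookup rules above can be sketched in shell. This is only an illustration of the documented behaviour, not fetchdom's own code: the helper name db_file is hypothetical, and the srs and xdom extensions are taken from the query sections of this document. The sketch treats an empty $PRODOM like an unset one, which is an assumption.

```sh
# Hypothetical helper: print the path where a database file would be looked for.
# Directory: $PRODOM if set (and non-empty), otherwise the current directory.
# Database name: "prodom" unless another name is given (as with -b name).
db_file() {
    ext=$1                 # file extension, e.g. srs or xdom
    name=${2:-prodom}      # database name; defaults to prodom (no -b switch)
    dir=${PRODOM:-.}       # search directory
    printf '%s/%s.%s\n' "$dir" "$name" "$ext"
}

# Default database, looked up in $PRODOM or in "." when PRODOM is unset.
db_file srs

# Another database name in an example directory (both values are made up).
(PRODOM=/data/prodom; db_file xdom mydb)
```

The second call runs in a subshell so that the example value of PRODOM does not leak into the rest of the session.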

Indexing the data

The data must be indexed for better efficiency. This can be done in several ways:

Automatic indexation

The indexation is automatically started with any request, if the necessary index files are not present, or if they are older than the data files.
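
The rule "index missing or older than the data file" can be sketched with the shell's file tests. This is a rough illustration, not fetchdom's implementation: the helper needs_reindex and the index file name prodom.srs.idx are hypothetical, since the real index file names are not given in this document. The -nt test is supported by bash, ksh and dash.

```sh
# Hypothetical helper: succeed (exit status 0) when the index should be rebuilt,
# i.e. when the index file is missing or the data file is newer than the index.
needs_reindex() {
    data=$1
    index=$2
    [ ! -e "$index" ] || [ "$data" -nt "$index" ]
}

# Example with assumed file names.
if needs_reindex prodom.srs prodom.srs.idx; then
    echo "index missing or out of date"
else
    echo "index is up to date"
fi
```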

Manual indexation

A manual indexation can be forced with the command:

fetchdom -i Y or fetchdom -makeidx Y

The -v switch

The command fetchdom -makeidx Y -v 1 or fetchdom -a PD39 -v 1 displays a progress counter during the indexation process.

Choosing a query

Three different queries may be used, with the switches -d, -D, -a, -A, -s, -S:

Query by domain number

The data are taken from the prodom.srs file. The switches -d or -D are used for this query:

fetchdom -d 12 or fetchdom -domain 12
return the family number 12
fetchdom -d 10-100
return the families 10 to 100
fetchdom -d all
return all the families
fetchdom -D family_numbers
open the file family_numbers and read the requests from this file

Query by accession number

The data are taken from the prodom.srs file.
The switches -a or -A are used for this query:

fetchdom -a PD000045 or fetchdom -acc PD000045
return the family PD000045
fetchdom -a all
return all the families
fetchdom -A accession_numbers
open the file accession_numbers and read the requests from this file
fetchdom -A --
read the requests from the standard input.

Query by sequence AC or ID

The data are taken from the prodom.xdom file.
You may provide fetchdom with Swissprot AC identifiers (e.g. P19084) as well as with ID identifiers (e.g. 11S3_HELAN).
The switches -s or -S are used for this query:

fetchdom -s 11S3_HELAN or fetchdom -sequence 11S3_HELAN
return the decomposition in domains of the protein 11S3_HELAN
fetchdom -s P19084
return the decomposition in domains of the protein P19084
fetchdom -s all
return the decomposition in domains of all the proteins.
fetchdom -s _HUMAN
return the decomposition in domains of all the proteins which belong to the species HUMAN
fetchdom -s FIXJ_
return the decomposition in domains of all the proteins whose function code is FIXJ
fetchdom -s _ECOL*
return the decomposition in domains of all the proteins which belong to ECOLI, ECOL6, etc. (* may replace any number of characters)
fetchdom -S sequences
open the file sequences and read the requests from this file
fetchdom -S --
read the requests from the standard input.
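
The ID patterns listed above can be mimicked with shell globbing, which may help when preparing or checking a request file. The function matches_request is hypothetical and only imitates the documented rules for ID requests; it does not handle AC requests such as P19084, and it is not how fetchdom itself performs the matching.

```sh
# Illustration only: succeed when a SwissProt ID fits a request written
# with the conventions described above.
matches_request() {
    id=$1
    req=$2
    case $req in
        all)  return 0 ;;            # every protein
        _*)   pat="*$req" ;;         # _HUMAN, _ECOL*: species part of the ID
        *_)   pat="$req*" ;;         # FIXJ_: function code part of the ID
        *)    pat=$req ;;            # a full ID such as 11S3_HELAN
    esac
    # $pat is left unquoted on purpose so that * acts as a wildcard.
    case $id in
        $pat) return 0 ;;
        *)    return 1 ;;
    esac
}

# IDs taken from the examples in this document.
for id in 11S3_HELAN 12S1_ARATH 12S2_ARATH 140U_DROME; do
    if matches_request "$id" '_ARATH'; then
        echo "$id"
    fi
done
```

With these four IDs, only 12S1_ARATH and 12S2_ARATH are printed.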

Choosing a data-type representation

fetchdom may return a complete record from the data files (a ProDom family or the decomposition in domains of a protein), but it may also return only part of the information. The -t switch is used to control this. Several -t switches may be used in the same request, and a single -t switch may take several values separated by commas:
fetchdom -a all -t ac,id,la,nd

Data types to use with -dDaA switches

-t prodom
return as prodom formatted record(s)
-t axes
return as prodom formatted record(s) with a horizontal axis (default type)
-t msf
return as msf records
-t fasta
return as fasta records (no gaps)
-t consensus
return as fasta records, consensus only
-t closest
return the domain closest to the consensus, as a fasta formatted file.
-t blastpgp
return the alignment in a format suitable for use with blastpgp, for generating a profile from this alignment
-t ac
return the ac number.
-t id
return the id number.
-t kw
return the keyword line.
-t la
return the length of alignment.
-t nd
return the number of domains in the family.
-t nm
return the NormD value.
-t dia
return the diameter of the family.
-t rad
return the radius of gyration of the family.
-t dc
return the dc lines, concatenated in one line.
-t pdb
return the list of pdb links of the family.
-t interpro
return the list of interpro links of the family.
-t pfama
return the list of PfamA links of the family.
-t prosite
return the list of prosite links of the family.
-t spac
return the list of proteins (swissprot AC) which take part in the family.
-t spid
return the list of proteins (swissprot ID) which take part in the family.
-t anyword
return anyword itself (a word after -t that is not a known data type is printed as is)

Data types to use with -sS switches

-t xdom
return as xdom formatted record(s) (default type)
-t spac
return the SwissProt AC of the proteins
-t spid
return the SwissProt ID of the proteins
-t splen
return the length of the proteins
-t arch
return the architecture of the proteins, i.e. the list of domains.
-t anyword
return anyword itself (a word after -t that is not a known data type is printed as is)

Field separator

When you provide fetchdom with several data types, the answers are separated by a space character. But this default behaviour can be changed with the -fs switch, as in the following example:

> fetchdom -a all -t ac,id -fs ';' -t la,nd

PD000200 1 232;1274;
PD104964 2 77;291;
PD000588 3 93;156;
PD695519 4 172;3;
PD619762 5 82;1;

As can be seen from this example, the -fs switch changes the field separator only for the -t switches that follow it. Consequently, the order of the switches is meaningful in this program.
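
Output that mixes two separators, as in the example above, can still be split by a downstream script. The following sketch assumes awk is available and uses a hypothetical helper name, split_fields; the sample lines are copied from the output above.

```sh
# Sample lines copied from the example output (ac id la;nd;).
sample='PD000200 1 232;1274;
PD104964 2 77;291;
PD000588 3 93;156;'

# Split on either a space or a semicolon, then print the accession number,
# the alignment length and the number of domains.
split_fields() {
    awk -F'[ ;]' '{ print $1, $3, $4 }'
}

printf '%s\n' "$sample" | split_fields
```

The first line printed is "PD000200 232 1274". Using a single separator for all fields, by placing -fs before the first -t, avoids the need for a bracket expression in -F.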

Compatibility with fetchdom version 2

The following switches were added to fetchdom only for compatibility reasons:

-dmsf 12
same as -d 12 -t msf
-amsf PD000039
same as -a PD000039 -t msf
-x Y or -noaxis
same as -t prodom
-c 12 or -consensus 12
same as -d 12 -t consensus

Lost functionalities

The -k switch is not yet implemented in fetchdom. You may simulate this request with:

> fetchdom -a all -t ac,kw | grep word

Besides, fetchdom cannot read the .mul and .cons files anymore.

Complex queries

Complex queries may be executed by combining fetchdom with other Unix utilities such as grep, awk, perl... Here is a short tutorial:

Let's execute a request that answers the following question: "Find the proteins which have fewer than 5 domains, and for each of those domains, print the number of domains of the corresponding ProDom family."

Requesting all the proteins and displaying the number of domains and the architecture

> ./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch 
##104K_THEPA;#P15711;1;PD855994;
##108_LYCES;#Q43495;1;PD015531;
##10KD_VIGUN;#P18646;2;PD420943 PD051267;
##11S3_HELAN;#P19084;7;PD000759 PD069806 PD000743 PD000784 PD186049 PD000438 PD000688;
##11SB_CUCMA;#P13744;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##120K_RICRI;#P14914;8;PD040036 PD013944 PD017598 PD039724 PD186791 PD387304 PD533245 PD010115;
##128U_DROME;#P32234;6;PD004042 PD000414 PD556426 PD002918 PD697807 PD471591;
##12KD_FRAAN;#Q05349;2;PD352332 PD010539;
##12S1_ARATH;#P15455;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##12S2_ARATH;#P15456;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##13S1_FAGES;#O23878;7;PD000759 PD069808 PD000743 PD000784 PD186049 PD000438 PD000688;
##13S2_FAGES;#O23880;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##13S3_FAGES;#Q9XFM4;7;PD000759 PD069808 PD000743 PD000784 PD186049 PD000438 PD000688;
##13SB_FAGES;#P83004;3;PD186049 PD000438 PD000688;
##140U_DROME;#P81928;2;PD212699 PD837228;
...

The switch -fs ';' is used to separate the fields with a ; rather than with a space: this will be useful for the next stage, because field number 4 is a list of ProDom domains, separated by spaces. The ## and # characters just before spid and spac are explained below.

Filtering the results, keeping only proteins with less than 5 domains

> ./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch | awk -F';' '$3<5{print$1,$2,$4}'
##104K_THEPA #P15711 PD855994
##108_LYCES #Q43495 PD015531
##10KD_VIGUN #P18646 PD420943 PD051267
##12KD_FRAAN #Q05349 PD352332 PD010539
##13SB_FAGES #P83004 PD186049 PD000438 PD000688
##140U_DROME #P81928 PD212699 PD837228
...

The awk program is used to discard the proteins with 5 or more domains: the field separator is declared to be ; (see the switch -F';'), the number of domains is the variable $3 (field number 3) in awk parlance, and only the lines with $3 lower than 5 are printed. The only printed fields are the first, the second and the fourth.
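
The same filter can be wrapped in a small function that takes the threshold as a parameter, which makes it easy to try on a few lines without running fetchdom. The helper name filter_max_domains is hypothetical; the sample lines are copied from the fetchdom output shown earlier.

```sh
# Hypothetical helper: keep the records with fewer than MAX domains.
# Input lines have the form ##ID;#AC;number_of_domains;architecture;
filter_max_domains() {
    awk -F';' -v max="$1" '$3 < max { print $1, $2, $4 }'
}

# Keep only the proteins with fewer than 3 domains.
filter_max_domains 3 <<'EOF'
##104K_THEPA;#P15711;1;PD855994;
##10KD_VIGUN;#P18646;2;PD420943 PD051267;
##11S3_HELAN;#P19084;7;PD000759 PD069806 PD000743 PD000784 PD186049 PD000438 PD000688;
##13SB_FAGES;#P83004;3;PD186049 PD000438 PD000688;
EOF
```

With a threshold of 3, only 104K_THEPA and 10KD_VIGUN are kept; calling filter_max_domains 5 reproduces the filter used in this tutorial.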

Requesting the number of domains for each ProDom family returned

> ./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch | awk -F';' '$3<5{print$1,$2,$4}' | ./fetchdom -A -- -t ac,nd
104K_THEPA P15711 PD855994 1 
108_LYCES Q43495 PD015531 58 
10KD_VIGUN P18646 PD420943 3 
10KD_VIGUN P18646 PD051267 63 
12KD_FRAAN Q05349 PD352332 23 
12KD_FRAAN Q05349 PD010539 13 
13SB_FAGES P83004 PD186049 110 
13SB_FAGES P83004 PD000438 122 
13SB_FAGES P83004 PD000688 111 
140U_DROME P81928 PD212699 7 
140U_DROME P81928 PD837228 1 
1431_ARATH P42643 PD000600 182 
...

The swissprot ID and the swissprot AC are NOT considered as requests by the second occurrence of fetchdom: when fetchdom reads a word starting with ##, it puts this word in a special variable. When it reads another word starting with only one #, it completes this variable with the word. The variable is then written just before the next requests: in the previous example, the swissprot ID and the swissprot AC are thus repeated for each domain appearing in their architecture (see 10KD_VIGUN for example). The variable is reset when the next swissprot ID (preceded by ##) is read, which is correct, as we are now treating another protein.
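
The handling of the ## and # prefixes can be emulated with a few lines of awk, which may make the mechanism easier to follow. This is an emulation of the behaviour described above, not fetchdom's source code; the function name expand_requests is hypothetical, and the emulation only repeats the context before each request (it does not look up any family data such as nd).

```sh
# Emulation of the documented behaviour:
# - a word starting with ## starts a new context (the previous one is reset),
# - a word starting with a single # is appended to the context,
# - any other word is a request, printed after the current context.
expand_requests() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^##/)
                ctx = substr($i, 3)
            else if ($i ~ /^#/)
                ctx = ctx " " substr($i, 2)
            else
                print ctx, $i
        }
    }'
}

# Lines copied from the filtered output shown earlier.
expand_requests <<'EOF'
##104K_THEPA #P15711 PD855994
##10KD_VIGUN #P18646 PD420943 PD051267
EOF
```

The second input line produces two output lines, "10KD_VIGUN P18646 PD420943" and "10KD_VIGUN P18646 PD051267", which matches the repetition visible in the fetchdom output above.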