With fetchdom, release 3, you may dig into the ProDom or Xdom files, extracting data as needed.
Using other Unix utilities (nawk, perl, grep, etc.) you may execute rather complex
queries.
Most switches used by fetchdom 2.0 are still usable in fetchdom 3, with the same meaning and nearly the same output: thus, you should not have to change your scripts too much. However, you are encouraged to use the new fetchdom syntax, which is much easier to understand.
fetchdom, release 3 is covered by the same license as ProDom:
there is no fee for academics, but private companies are charged.
To obtain fetchdom, go to the download page.
You may download the source version as well as a binary program, statically linked for a linux-type machine.
If the environment variable $PRODOM is set, its value must be a directory
in which the files are searched for. If it is not set, the files are searched for in the current directory (".").
If the database is called prodom, the following files will be used by fetchdom:
$PRODOM/prodom.srs
$PRODOM/prodom.xdom
$PRODOM/prodom.idx (this directory is created by fetchdom if necessary).
The database may be selected with the switch -b, using the following rules:
prodomname
In the rest of this document, it will be supposed that the database is called prodom, so that the -b switch will not be used.
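The lookup rule described above can be mirrored in a wrapper script, for example to check that the data files are in place before calling fetchdom. This is only a minimal sketch: the function name prodom_files is hypothetical, and treating an empty $PRODOM like an unset one is an assumption, not documented fetchdom behaviour.

```shell
# Hypothetical helper (not part of fetchdom): print the paths that
# fetchdom is described as using for a database named $1.
prodom_files() {
    db="${1:-prodom}"       # database name, "prodom" by default
    dir="${PRODOM:-.}"      # assumption: empty is treated like unset
    for ext in srs xdom idx; do
        printf '%s/%s.%s\n' "$dir" "$db" "$ext"
    done
}

prodom_files
```

With $PRODOM unset, the three paths are printed relative to the current directory.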
The data must be indexed for better efficiency. This can be done in several ways:
Indexing starts automatically with any request if the necessary index files are missing, or if they are older than the data files.
Manual indexing can be forced with the command:
fetchdom -i W or fetchdom -makeidx Y
The command fetchdom -makeidx Y -v 1 or fetchdom -a PD39 -v 1 displays a progress counter during indexing.
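A script can apply the same "missing or older than the data files" rule to decide whether to force a manual indexing. The sketch below is an approximation under stated assumptions: index_is_stale is a hypothetical name, the check compares the modification time of the prodom.idx directory itself rather than the files inside it, and the -nt test operator is a common shell extension rather than strict POSIX.

```shell
# Hypothetical helper (not part of fetchdom): succeed (exit status 0)
# when the index of database $1 is missing or older than a data file.
index_is_stale() {
    db="${1:-prodom}"
    dir="${PRODOM:-.}"
    idx="$dir/$db.idx"
    [ -e "$idx" ] || return 0           # no index at all
    for data in "$dir/$db.srs" "$dir/$db.xdom"; do
        # -nt is true when the left file is newer than the right one
        if [ "$data" -nt "$idx" ]; then
            return 0
        fi
    done
    return 1                            # index looks up to date
}

if index_is_stale prodom; then
    echo "index missing or out of date"
fi
```

Where the message is printed, a script could instead run one of the manual indexing commands given above.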
Three different queries may be used, using the switches -d, -D, -a, -A, -s, -S:
The data are taken from the prodom.srs file. The switches -d or -D are used for this query:
family_numbers and read the requests from this file
The data are taken from the prodom.srs file.
The switches -a or -A are used for this query:
accession_numbers and read the requests from this file
The data are taken from the prodom.xdom file.
You may provide fetchdom with Swissprot AC identifiers (e.g. P19084)
as well as with ID identifiers (e.g. 11S3_HELAN).
The switches -s or -S are used for this query:
11S3_HELAN
P19084
HUMAN
FIXJ
ECOLI, ECOL6, etc. (* may replace any number of characters)
sequences and read the requests from this file
fetchdom may return a complete record from the data files (a ProDom family or the decomposition in domains of a protein), but it may also return only part of the information. The -t switch is used to control this. Several -t switches may be used with the same request, or several values for a -t switch, separated by a comma:
fetchdom -a all -t ac,id,la,nd
-dDaA switches
-sS switches
When you provide fetchdom with several data types, the answers are separated by a space character. But this default behaviour can be changed with the -fs switch, as in the following example:
> fetchdom -a all -t ac,id -fs ';' -t la,nd
PD000200 1 232;1274;
PD104964 2 77;291;
PD000588 3 93;156;
PD695519 4 172;3;
PD619762 5 82;1;
As can be seen from this example, the -fs switch changes the field separator for the following -t switches only. Consequently, the order of the switches matters in this program.
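Because such output mixes two separators, a downstream tool has to accept both. The following sketch shows one way to do this with awk, using a regular expression as the field separator. The sample lines are copied from the output above. The function name sample_output is hypothetical and only stands in for a real fetchdom call.

```shell
# Stand-in for the fetchdom command above: a few lines of its output.
sample_output() {
    cat <<'EOF'
PD000200 1 232;1274;
PD104964 2 77;291;
PD000588 3 93;156;
EOF
}

# Treat both the space and the ';' as field separators, then print
# the ac value (field 1) and the nd value (field 4).
sample_output | awk -F'[ ;]' '{ print $1, $4 }'
```

The first line printed is `PD000200 1274`. Choosing a single separator with -fs before the first -t switch would avoid the need for a regular expression.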
The following switches were added to fetchdom only for compatibility reasons:
The -k switch is not yet implemented in fetchdom. You may simulate this request with:
> fetchdom -a all -t ac,kw | grep word
Besides, fetchdom cannot read the .mul and .cons files anymore.
Complex queries may be executed by combining fetchdom with other Unix utilities such as grep, awk or perl. Here is a short tutorial:
Let's execute a request answering the following question: "Find the proteins which have fewer than 5 domains, and for each of those domains, print the number of domains of the corresponding ProDom family."
> ./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch
##104K_THEPA;#P15711;1;PD855994;
##108_LYCES;#Q43495;1;PD015531;
##10KD_VIGUN;#P18646;2;PD420943 PD051267;
##11S3_HELAN;#P19084;7;PD000759 PD069806 PD000743 PD000784 PD186049 PD000438 PD000688;
##11SB_CUCMA;#P13744;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##120K_RICRI;#P14914;8;PD040036 PD013944 PD017598 PD039724 PD186791 PD387304 PD533245 PD010115;
##128U_DROME;#P32234;6;PD004042 PD000414 PD556426 PD002918 PD697807 PD471591;
##12KD_FRAAN;#Q05349;2;PD352332 PD010539;
##12S1_ARATH;#P15455;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##12S2_ARATH;#P15456;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##13S1_FAGES;#O23878;7;PD000759 PD069808 PD000743 PD000784 PD186049 PD000438 PD000688;
##13S2_FAGES;#O23880;6;PD000759 PD000743 PD000784 PD186049 PD000438 PD000688;
##13S3_FAGES;#Q9XFM4;7;PD000759 PD069808 PD000743 PD000784 PD186049 PD000438 PD000688;
##13SB_FAGES;#P83004;3;PD186049 PD000438 PD000688;
##140U_DROME;#P81928;2;PD212699 PD837228;
...
The switch -fs ';' is used to separate the fields with a ; rather than with a space: this
will be useful for the next stage, because field number 4 is a list of ProDom domains separated by spaces. The ##
and # characters just before spid and spac are explained further below.
> ./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch | awk -F';' '$3<5{print$1,$2,$4}'
##104K_THEPA #P15711 PD855994
##108_LYCES #Q43495 PD015531
##10KD_VIGUN #P18646 PD420943 PD051267
##12KD_FRAAN #Q05349 PD352332 PD010539
##13SB_FAGES #P83004 PD186049 PD000438 PD000688
##140U_DROME #P81928 PD212699 PD837228
...
The awk program discards the proteins with 5 domains or more: the field separator
is declared to be ; (see the switch -F';'), the number of domains is the variable $3
(field number 3) in awk parlance, and only the lines with $3 lower than 5 are printed. The only printed
fields are the first, the second and the fourth.
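The filter can be tried without a ProDom installation by feeding awk a few of the lines shown earlier. This is a small sketch. The function name sample_architectures is hypothetical and merely replaces the fetchdom call with fixed sample data.

```shell
# Stand-in for "./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch":
# four lines copied from the output shown earlier.
sample_architectures() {
    cat <<'EOF'
##104K_THEPA;#P15711;1;PD855994;
##10KD_VIGUN;#P18646;2;PD420943 PD051267;
##11S3_HELAN;#P19084;7;PD000759 PD069806 PD000743 PD000784 PD186049 PD000438 PD000688;
##13SB_FAGES;#P83004;3;PD186049 PD000438 PD000688;
EOF
}

# Keep only the proteins whose number of domains (field 3) is below 5,
# and print the ID, the AC and the list of domains.
sample_architectures | awk -F';' '$3 < 5 { print $1, $2, $4 }'
```

The line for 11S3_HELAN, which has 7 domains, is dropped. The three other lines are printed with their fields separated by spaces.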
> ./fetchdom -s all -fs ';' -t ##,spid,#,spac,nd,arch | awk -F';' '$3<5{print$1,$2,$4}' | ./fetchdom -A -- -t ac,nd
104K_THEPA P15711 PD855994 1
108_LYCES Q43495 PD015531 58
10KD_VIGUN P18646 PD420943 3
10KD_VIGUN P18646 PD051267 63
12KD_FRAAN Q05349 PD352332 23
12KD_FRAAN Q05349 PD010539 13
13SB_FAGES P83004 PD186049 110
13SB_FAGES P83004 PD000438 122
13SB_FAGES P83004 PD000688 111
140U_DROME P81928 PD212699 7
140U_DROME P81928 PD837228 1
1431_ARATH P42643 PD000600 182
...
The Swissprot ID and the Swissprot AC are NOT considered as requests by the second occurrence of fetchdom:
when fetchdom reads a word starting with ##, it puts this word in a special variable. When it reads another
word starting with only one #, it appends this word to the variable. The variable is then written just
before the answer to each following request: in the previous example, the Swissprot ID and the Swissprot AC are thus repeated
for each domain appearing in their architecture (see 10KD_VIGUN for example). The variable is reset when
the next Swissprot ID (preceded by ##) is read, which is correct, as we are then treating another protein.
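The handling of the ## and # words can be illustrated with a short awk emulation. It is only a sketch of the behaviour described above, not fetchdom's own code: the function name prefix_requests is hypothetical, and the emulation echoes each request after the prefix instead of looking anything up in ProDom.

```shell
# Emulation only: mimic how words starting with ## and # are collected
# into a prefix that is repeated before every following request.
prefix_requests() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^##/)
                prefix = substr($i, 3)              # new protein: reset the prefix
            else if ($i ~ /^#/)
                prefix = prefix " " substr($i, 2)   # complete the prefix
            else
                print prefix, $i                    # an actual request
        }
    }'
}

printf '%s\n' '##10KD_VIGUN #P18646 PD420943 PD051267' | prefix_requests
```

This prints `10KD_VIGUN P18646 PD420943` and `10KD_VIGUN P18646 PD051267`, matching the first three columns of the fetchdom output shown above for that protein.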