mkdom2 and xdom: the documentation
- What is mkdom2/xdom?
- Installation
- Starting a new project
- Citing the mkdom2 program
What is mkdom2/xdom?
mkdom2 is the program we routinely use to build each new release of ProDom. The algorithm is described elsewhere (Gouzy et al., 1999); briefly, it relies on the assumption that the shortest amino acid sequence corresponds to a single domain and may therefore be used as a query to screen the database with the psi-blast program, in order to cluster homologous domains. To build ProDom, we run this program on the whole swissprot/trembl database, but it can be run on any set of protein sequences, provided you have them in a fasta file.
xdom is a graphical program that helps you analyze the domains detected by mkdom2 by visualising all domain arrangements in the protein set.
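As a rough illustration of the starting assumption only (this is not the mkdom2 implementation, which is described in Gouzy et al., 1999), the following Python sketch reads fasta-formatted text and picks the shortest sequence, which would be the candidate query. The function names, identifiers and toy sequences are hypothetical.

```python
def read_fasta(text):
    """Parse fasta-formatted text into a list of (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def shortest_sequence(records):
    """Return the (identifier, sequence) pair with the shortest sequence."""
    return min(records, key=lambda rec: len(rec[1]))


# Toy input: identifiers and sequences are invented for illustration.
EXAMPLE = """>seqA
MKTAYIAKQRQISFVKSHFSRQ
>seqB
MKTAYIAK
>seqC
MKTAYIAKQRQISF
"""

query = shortest_sequence(read_fasta(EXAMPLE))
print(query[0], len(query[1]))
```

In the real program the query is then used to screen the database with psi-blast; that step is not reproduced here.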
Installation
It may be considered good practice to install the package in the directory /usr/local, since all users will then be able to run mkdom2/xdom2. However, this is not required, and you may prefer to install the package in your home directory, which does not require root privileges.
Prerequisites
You need a reasonably recent computer with enough memory. The program was tested on a Pentium 3 computer with 512 MB of memory (Linux) and on a Sun UltraSPARC 480 MHz machine (SunOS). The programs currently run only on Linux/Intel x86 or SunOS/SPARC based systems. However, you may install the package in a directory shared by several machines of both architectures.
You must have perl installed, version 5.6.1 or later. No modules other than those found in the standard perl distribution are required, except for the Df module written by Ian Guthrie. For convenience, this module is included in the mkdom2/xdom2 distribution, so you should not have to worry about it.
You should use the csh or tcsh shell.
Unpacking the distribution and editing .login
Unpack the distribution with:
gunzip < xdom2.0-tar.gz | tar xvf -
This should create a directory called Xdom2.0, whose content looks like this:
ls -l Xdom2.0
total 32
drwxr-xr-x 4 manu prodom 4096 Dec 16 12:32 bin
drwxrwxr-x 3 manu prodom 4096 Dec 15 14:01 doc
drwxrwxr-x 5 manu prodom 4096 Dec 15 12:14 lib
-rw-rw-r-- 1 manu prodom 3658 Dec 17 15:53 mkdom2setup.pl
-rw-rw-r-- 1 manu prodom 1479 Dec 18 08:54 README
-rw-rw-r-- 1 manu prodom 229 Dec 17 16:36 setup.csh
drwxr-xr-x 2 manu prodom 4096 Dec 18 14:30 Test
Please run the script mkdom2_install.pl using the command:
perl Xdom2.0/mkdom2_install.pl
This will create a csh file called setup_Linux.csh or setup_SunOS.csh. You will have to source this file before executing mkdom2 or xdom2. You may also source the file called setup.csh, which automatically calls the appropriate setup file for the machine's architecture.
It may be useful to add the following line at the end of your .login file:
cd Xdom2.0; source setup.csh; cd
Testing the program
Before starting, it is important to test the program to be sure that everything works perfectly well. This can be done simply with the command:
mkdom2test.pl
This script:
- calls cfg.pl to configure a directory suitable for running mkdom2 with a test file as input.
- changes to this directory.
- calls mkdom2.
- calls the Unix command diff to check the differences between the obtained result and a reference file.
Please note that the results of mkdom2 differ when the program is run on different architectures, because of differences in the implementation of the sort routines. We therefore provide a reference file for each supported architecture.
Should the two files differ, the program reports that the test did not succeed, in which case you can look at the file Xdom2.0/Test/Test.51 and at the locally generated Test_lcl.51 to investigate the problem.
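The final comparison step of the test can be sketched in Python as follows. This is not the code of mkdom2test.pl, which calls the Unix diff command; the sketch uses Python's difflib module on two in-memory texts, and the function name and sample lines are hypothetical.

```python
import difflib


def compare_results(obtained, reference):
    """Return the unified-diff lines between two texts; an empty list means they match."""
    return list(
        difflib.unified_diff(
            reference.splitlines(),
            obtained.splitlines(),
            fromfile="reference",
            tofile="obtained",
            lineterm="",
        )
    )


# Invented sample content, standing in for a reference file and a local result.
reference_text = "Set # 1:\n[SEQ_A 1 32\n[SEQ_B 22 54\n"
same_text = reference_text
changed_text = "Set # 1:\n[SEQ_A 1 32\n[SEQ_B 23 54\n"

for label, text in (("same", same_text), ("changed", changed_text)):
    diff = compare_results(text, reference_text)
    print(label, "->", "test succeeded" if not diff else "test did not succeed")
```

When the two texts differ, the returned diff lines show which entries changed, which is the same information you would look for by comparing the two files by hand.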
Starting a new project
A whole domain analysis with mkdom2/xdom2 includes the following operations:
- Set up an environment
- Execute mkdom2
- Check the result files to detect possible problems
- Post-process the data
- View, and possibly print, the data with the xdom program
Setting up an environment
Before starting the mkdom2 program, you have to create a working environment (i.e. some directories and files that will be used by the programs). This is done by executing the script mkdom2cfg.pl. You will have to answer some questions:
- A name for the project.
- A version number (any string will do); the default is a string containing today's date.
- Whether you want to use some expert domains at the beginning of the clustering process (see later).
- The name of the fasta-formatted input file.
Let's say your project is called organism, and the version number is 20031225: a directory called organism-20031225 is then created, with some files and directories inside, as shown below:
$ ls -l organism-20031225
total 12
drwxrwxr-x 2 manu prodom 4096 Dec 11 15:37 checkpoint
drwxrwxr-x 2 manu prodom 4096 Dec 11 15:37 data
-rw-rw-r-- 1 manu prodom 48 Dec 11 15:37 mkdom2.conf
Starting the mkdom2 script
You have to change directory to data, then mkdom2 may be started with the command:
mkdom2 IN=organism.fasta LOG=mkdom2.log &
The LOG switch is not required; however, if it is not specified, mkdom2 logs to the standard output, which may be inconvenient if the execution lasts a long time (typically several hours or days for big fasta files).
The time stamps
From time to time, and especially for each blastpgp execution, a time stamp is computed and formatted as follows:
#03#12#09#07#27#54#11#
This stamp is the coded value of a date, here December 9th, 2003 at 7:27:54. Stamps may be generated at a relatively high rate, and the last number (11) ensures that the stamps are different, even if less than a second has elapsed. These stamps are used to check the synchronization between the many created files.
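To make the layout of a stamp concrete, here is a minimal Python sketch that decodes one. It assumes that every field is exactly two digits wide, as in the examples shown in this document, and that the two-digit year is counted from 2000; both points are assumptions, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Seven two-digit fields: year, month, day, hour, minute, second, counter.
STAMP_RE = re.compile(r"^#(\d{2})#(\d{2})#(\d{2})#(\d{2})#(\d{2})#(\d{2})#(\d{2})#$")


def parse_stamp(stamp):
    """Split a stamp into (datetime, counter).

    Assumption: the two-digit year is counted from 2000.
    """
    match = STAMP_RE.match(stamp)
    if match is None:
        raise ValueError(f"not a valid stamp: {stamp!r}")
    yy, month, day, hour, minute, second, counter = (int(g) for g in match.groups())
    return datetime(2000 + yy, month, day, hour, minute, second), counter


when, counter = parse_stamp("#03#12#09#07#27#54#11#")
print(when.isoformat(), counter)  # prints "2003-12-09T07:27:54 11"
```

Because the result is a (date, counter) pair, two stamps generated within the same second still compare in the order in which they were produced.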
Checkpointing the data during the execution
The data are checkpointed from time to time so that, in case of an unexpected interruption, as little data as possible is lost. The important temporary files are automatically copied to the directory checkpoint/<stamp>, where <stamp> is the stamp generated at the moment of the checkpoint. However, only 2 subdirectories are kept under the checkpoint directory, in order to avoid filling the disk.
Interrupting mkdom2 in an orderly manner
mkdom2 may run for a very long time, depending on the data. It may thus be useful to be able to interrupt the program without losing the work already done. This can be done by creating an empty file called MKD.stop:
touch MKD.stop
The program checks from time to time for the existence of this file, so it may keep running for a few minutes before stopping. The data are then checkpointed and saved in the directory called current (this is in fact a symbolic link to a directory named 1, 2, ...).
Retrieving data after an unpredicted interruption
Should the program be interrupted unexpectedly (after a power failure, a system crash, etc.), you have to retrieve the last checkpointed files before resuming the operation: identify (using the time stamp) the most recent subdirectory in the checkpoint directory, then change to this directory and type the following command:
cp * ../../data/current
This copies every file found in this directory to the current results directory for later reference. However, please note that the computations performed between this checkpoint and the time of interruption will be lost.
Resuming the execution
Resuming the process after an orderly interruption, or after an unpredicted interruption followed by a successful retrieval of the data, is an easy task:
cp current/organism.fasta.SL .
mkdom2 IN=organism.fasta.SL LOG=mkdom2.log
Please note that the input file is now organism.fasta.SL, that is, the original fasta file sorted by sequence length and purged of the domains already found.
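When retrieving data after an unpredicted interruption, the most recent checkpoint subdirectory has to be identified from its time stamp. A minimal Python sketch of that selection is given below. It assumes that the subdirectory names are exactly the stamps (as the path checkpoint/<stamp> suggests, but this should be verified on a real run); the function names and the sample names are hypothetical.

```python
def stamp_key(stamp):
    """Turn '#YY#MM#DD#hh#mm#ss#nn#' into a tuple of integers usable as a sort key."""
    fields = stamp.strip("#").split("#")
    if len(fields) != 7 or not all(f.isdigit() for f in fields):
        raise ValueError(f"not a valid stamp: {stamp!r}")
    return tuple(int(f) for f in fields)


def latest_checkpoint(names):
    """Return the most recent stamp among checkpoint subdirectory names."""
    return max(names, key=stamp_key)


# Hypothetical content of the checkpoint directory (for example from os.listdir).
subdirs = ["#03#12#11#15#45#49#00#", "#03#12#11#15#45#54#02#"]
print(latest_checkpoint(subdirs))
```

The fields are compared from the year down to the trailing counter, so two checkpoints taken within the same second are still ordered correctly.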
Quality checks:
Looking at the log and result files
The process may be monitored during the mkdom2 execution, mainly by looking at the clustering log file and at the temporary results file, called Mkdom2.tmp.LogMKD and Mkdom2.tmp.prodom.51 respectively. The following shows some lines from the clustering log file: a few families are generated, then from time to time the database is reorganized (some domains are taken out of the database, the database is sorted again, and the utility formatdb is run). When this occurs, the program checks the remaining disk space, because a full disk could lead to incorrect results and data loss. Should the free disk space drop too low, the program is interrupted in an orderly manner; you should then remove some files to recover disk space, then resume the program.
#03#12#11#15#45#49#00# _M_ PSIBLAST OK - FAM 477
#03#12#11#15#45#49#00# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#49#00# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37137
#03#12#11#15#45#53#00# _M_ PSIBLAST OK - UNIQ 478
#03#12#11#15#45#53#01# _M_ PSIBLAST OK - UNIQ 479
#03#12#11#15#45#54#00# _M_ PSIBLAST OK - UNIQ 480
#03#12#11#15#45#54#01# _M_ PSIBLAST OK - UNIQ 481
#03#12#11#15#45#54#02# _M_ PSIBLAST OK - FAM 482
#03#12#11#15#45#54#02# _M_ NOW REORGANIZING DATABASE
#03#12#11#15#45#54#02# _M_ DISK SPACE (Ko) = 6681744 - NEEDED = 37140
The following shows the corresponding lines, extracted from the intermediate results file:
Cluster #477: ---------------------------------------------
// STAMP #03#12#11#15#45#49#00#
// QUERY GSTEN:00003660:P:001#1#32
Set # 477:
[GSTEN:00003660:P:001 1 32
[GSTEN:00021931:P:001 22 54
Cluster #478: ---------------------------------------------
// STAMP #03#12#11#15#45#53#00#
// QUERY GSTEN:00021931:P:001#1#21
Set # 478:
[GSTEN:00021931:P:001 1 21 S=115
Cluster #479: ---------------------------------------------
// STAMP #03#12#11#15#45#53#01#
// QUERY GSTEN:00005831:P:001#1#32
Set # 479:
[GSTEN:00005831:P:001 1 32 S=163
Cluster #480: ---------------------------------------------
// STAMP #03#12#11#15#45#54#00#
// QUERY GSTEN:00005952:P:001#1#32
Set # 480:
[GSTEN:00005952:P:001 1 32 S=200
Cluster #481: ---------------------------------------------
// STAMP #03#12#11#15#45#54#01#
// QUERY GSTEN:00006207:P:001#1#32
Set # 481:
[GSTEN:00006207:P:001 1 32 S=182
Cluster #482: ---------------------------------------------
// STAMP #03#12#11#15#45#54#02#
// QUERY GSTEN:00006690:P:001#1#32
Set # 482:
[GSTEN:00006690:P:001 1 32
[GSTEN:00010495:P:001 44 75
Executing mkdom2ck.pl
In order to verify the consistency of the log files, the mkdom2ck.pl script tests for synchronization problems which may cause data loss or data corruption: this could happen in particular if the program is interrupted during the process and incorrectly resumed.
mkdom2ck.pl performs the following checks:
- Do the result directories 1 2 3... cover the whole process, without overlap and without any gap?
- Are the log files and the result files in each directory 1 2 3... synchronized? (The time stamps are used for this purpose.)
- For each result directory, the script looks for sequences which were withdrawn from the database: these sequences must be found in the result file.
- The last check tries to find each sequence of the source database in one of the result files. Please note that, in case of interruptions, the previous checks make sense even if the whole process is not completed. This last check, however, makes sense only after the whole process is completed. You may skip it by calling mkdom2ck.pl with the switch --no_db_check in order to check an incomplete process.
The check may be done with:
mkdom2ck.pl --db organism.fasta [--no_db_check]
Postprocessing and data analysis
Executing mkdom2pp.pl
The data must now be postprocessed, which involves:
- Concatenation of all the result files (they are for now split between directories 1 2 3...).
- Transformation of those files to a standard ProDom file and to an xdom format file.
- Computation of the multiple alignments of domains in each family (this rather long step may be skipped).
- Creation of a project file ready to be read by the xdom visualization program.
The postprocessing may be done with:
mkdom2pp.pl --db organism.fasta [--no_alignment]
Viewing the data with the xdom program
You can now admire, print, and think about your data with the xdom2 program:
xdom2 organism.prj
Citing the mkdom2 program
Should you use this program for a publication, please cite the following reference: Gouzy J., Corpet F. & Kahn D. (1999). Whole genome protein domain analysis using a new method for domain clustering, Computers and Chemistry. 23:333-340.