d2-vlmc is an effective pipeline for clustering metatranscriptomic samples that uses Variable Length Markov Chains (VLMC) to model the underlying background genomes. For this pipeline, we developed a VLMC model applicable to high-throughput sequencing data, based on d2Tool.
CAL-tuple is a handy pipeline that is implemented and run on Linux-like platforms. It lets users run the processing application for each step with simple, friendly commands. A more detailed manual of the pipeline is given below, together with a detailed example.
Pre-required running environment
Processing procedure
python ./TupleCount10bp.py -l samplelist.txt -k $i -t TuplecountFiles
1. -l: List file of sample names. Write the file names of the sequencing data to a text file (sample_list.txt) with the following format:
2. -t: Output path for the k-tuple frequency files. The path can be absolute (like "/home/Meta/TestData/TupleCountFiles") or relative (like "./TupleCount_Files/").
3. -k: The length of the k-tuple (1-10 for short k-tuples).
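As a rough illustration of what k-tuple counting means here, the following Python sketch counts overlapping k-tuples in a single sequence. It is not the code of TupleCount10bp.py: the function name count_k_tuples and the decision to skip tuples containing characters other than A, C, G, T are assumptions made for this example only.

```python
from collections import Counter

def count_k_tuples(sequence, k):
    """Count overlapping k-tuples (k-mers) in one DNA sequence.

    Hypothetical helper for illustration; tuples containing characters
    other than A, C, G, T (for example N) are skipped.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

# Toy input, not taken from the example data.
counts = count_k_tuples("ACGTACGN", 2)
print(sorted(counts.items()))
```

The trailing "GN" tuple is ignored in this toy run because it contains an ambiguous base.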
Note: The tuple-count files generated for k = 1 to 10 should be concatenated into one file for each sample.
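One possible way to join the per-k files of a sample is sketched below in Python. It assumes that "linked together" means plain concatenation in order of k; the file names (sampleX_k1.txt and so on) and the file contents are made up for the demonstration and do not reflect the actual output format of TupleCount10bp.py.

```python
import tempfile
from pathlib import Path

def concatenate_tuple_counts(part_paths, merged_path):
    """Write the contents of part_paths, in order, into merged_path."""
    with open(merged_path, "w") as merged:
        for part in part_paths:
            text = Path(part).read_text()
            merged.write(text)
            # Make sure the next file starts on a new line.
            if text and not text.endswith("\n"):
                merged.write("\n")

# Small demonstration with made-up file names and contents.
with tempfile.TemporaryDirectory() as workdir:
    parts = []
    for k, content in [(1, "A 5\nC 3\n"), (2, "AC 2\nCA 1")]:
        part = Path(workdir) / f"sampleX_k{k}.txt"  # hypothetical naming
        part.write_text(content)
        parts.append(part)
    merged_file = Path(workdir) / "sampleX_merged.txt"  # hypothetical naming
    concatenate_tuple_counts(parts, merged_file)
    merged_text = merged_file.read_text()
```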
python ./VLMC327proliulincontext.py -i sampleXtuplecount.txt -t TuplecountFiles -K 120.0 -p MarkovProbability_Files
1. -i: File name of the generated tuple-count file for each sample.
2. -t: Path where the k-tuple frequency files are stored. The path can be absolute (like "/home/Meta/TestData/TupleCountFiles") or relative (like "./TupleCount_Files/").
3. -K: The threshold value for pruning.
4. -p: Output path for the probability files of the VLMC model. The path can be absolute (like "/home/Meta/TestData/TupleCountFiles") or relative (like "./TupleCount_Files/").
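To give a feel for what a pruning threshold does in a VLMC, the sketch below uses a commonly used Kullback-Leibler-style statistic: a context is kept only if its next-symbol distribution differs enough from that of its shorter suffix. This is an assumption about the general technique, not a description of what the pipeline script computes; the function names and the toy counts are hypothetical.

```python
import math

ALPHABET = "ACGT"

def transition_probs(counts, context):
    """Estimate P(next symbol | context) from tuple counts."""
    total = sum(counts.get(context + a, 0) for a in ALPHABET)
    if total == 0:
        return None
    return {a: counts.get(context + a, 0) / total for a in ALPHABET}

def pruning_statistic(counts, context):
    """KL-style gain of keeping `context` instead of its shorter suffix.

    delta = sum_a N(context + a) * log(P(a | context) / P(a | parent)),
    where parent drops the oldest (leftmost) symbol of the context.
    """
    parent = context[1:]
    p_child = transition_probs(counts, context)
    p_parent = transition_probs(counts, parent)
    if p_child is None or p_parent is None:
        return 0.0
    delta = 0.0
    for a in ALPHABET:
        n = counts.get(context + a, 0)
        # Skip unseen symbols and guard against inconsistent counts.
        if n > 0 and p_parent[a] > 0:
            delta += n * math.log(p_child[a] / p_parent[a])
    return delta

def keep_context(counts, context, threshold):
    """Keep the context only if its statistic reaches the threshold."""
    return pruning_statistic(counts, context) >= threshold

# Toy counts (hypothetical): 1-tuples are uniform, context "A" is skewed,
# context "C" is almost uniform.
toy_counts = {
    "A": 50, "C": 50, "G": 50, "T": 50,
    "AA": 40, "AC": 4, "AG": 3, "AT": 3,
    "CA": 13, "CC": 12, "CG": 12, "CT": 13,
}
print(keep_context(toy_counts, "A", 5.0), keep_context(toy_counts, "C", 5.0))
```

Under this criterion a larger threshold prunes more contexts and yields a smaller model.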
python ./calculatedissimiliraty.py -l samplelist.txt -k $i -d d2 -m MarkovProbabilityFiles -o DissmilarityMatrixFiles/d2/output_d2
1. -l: List file of sample names. Write the file names of the sequencing data to a text file (sample_list.txt) with the following format:
2. -k: The length of the k-tuple (1-10 for short k-tuples).
3. -d: The dissimilarity measure to use: d2, Eu, Ma, Ch, d2S, or d2Star.
4. -m: The directory of the probability files produced by MarkovProbabilityZeroToThree.py.
5. -o: The directory in which to keep the dissimilarity matrix produced for the pairs of input datasets.
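For orientation, the following Python sketch shows how the d2 and d2S dissimilarities are usually defined on two k-tuple count vectors. It follows the standard published definitions and is not taken from calculatedissimiliraty.py; the function names, the dictionary-based inputs, and the toy numbers are assumptions. In the pipeline, the k-tuple probabilities would come from the VLMC probability files rather than from hand-written dictionaries.

```python
import math

def d2_dissimilarity(x, y):
    """d2 = 0.5 * (1 - <x, y> / (|x| * |y|)) on raw k-tuple counts."""
    words = set(x) | set(y)
    dot = sum(x.get(w, 0) * y.get(w, 0) for w in words)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return 0.5 * (1.0 - dot / (norm_x * norm_y))

def d2s_dissimilarity(x, y, prob_x, prob_y):
    """d2S on counts centered by their expectation under a background model.

    prob_x and prob_y map each k-tuple to its probability under the
    background model of the respective sample.
    """
    n_x = sum(x.values())
    n_y = sum(y.values())
    words = set(x) | set(y) | set(prob_x) | set(prob_y)
    numerator = sum_x = sum_y = 0.0
    for w in words:
        xc = x.get(w, 0) - n_x * prob_x.get(w, 0.0)  # centered count in x
        yc = y.get(w, 0) - n_y * prob_y.get(w, 0.0)  # centered count in y
        scale = math.sqrt(xc * xc + yc * yc)
        if scale == 0.0:
            continue  # this k-tuple contributes nothing
        numerator += xc * yc / scale
        sum_x += xc * xc / scale
        sum_y += yc * yc / scale
    return 0.5 * (1.0 - numerator / (math.sqrt(sum_x) * math.sqrt(sum_y)))

# Toy vectors (hypothetical) over two k-tuples with a uniform background.
uniform = {"AA": 0.5, "AC": 0.5}
sample_a = {"AA": 3, "AC": 1}
sample_b = {"AA": 1, "AC": 3}
print(d2_dissimilarity(sample_a, sample_b))
print(d2s_dissimilarity(sample_a, sample_b, uniform, uniform))
```

Both measures lie between 0 and 1, with 0 for identical profiles. The sketch does not handle the degenerate case in which every centered count of a sample is zero.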
Example data:
Four test samples are used in this example. The files are small (only 10 KB each) and in "fa" format. They are compressed into a zip file (d2-vlmc-example-data), which can be downloaded from here.
Concrete steps of d2-vlmc pipeline:
The d2-vlmc pipeline comprises the following steps. The results of each step can be found here.
for((i=1;i<=10;i++));
do
echo $i
./TupleCount10bp.py -l pipeline.txt -k $i -t vlmc-pipeline-test/tuplecount/
done
python ./VLMC327proliulin.py -i pipeline1.txt -t ./vlmc-pipeline-test/tuplecount/ -K 5.0 -p ./vlmc-pipeline-test/markovfile/
python ./VLMC327proliulin.py -i pipeline2.txt -t ./vlmc-pipeline-test/tuplecount/ -K 5.0 -p ./vlmc-pipeline-test/markovfile/
python ./VLMC327proliulin.py -i pipeline3.txt -t ./vlmc-pipeline-test/tuplecount/ -K 5.0 -p ./vlmc-pipeline-test/markovfile/
python ./VLMC327proliulin.py -i pipeline4.txt -t ./vlmc-pipeline-test/tuplecount/ -K 5.0 -p ./vlmc-pipeline-test/markovfile/
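The four per-sample calls above differ only in the input file, so they could also be generated in a loop. The Python sketch below only builds and prints the command lines; the helper name build_vlmc_commands is hypothetical, while the script name, flags, and paths are copied from the commands above.

```python
import shlex

def build_vlmc_commands(sample_files, tuple_dir, prob_dir, threshold):
    """Return one argument list per sample for the VLMC step."""
    commands = []
    for sample in sample_files:
        commands.append([
            "python", "./VLMC327proliulin.py",
            "-i", sample,
            "-t", tuple_dir,
            "-K", str(threshold),
            "-p", prob_dir,
        ])
    return commands

samples = [f"pipeline{n}.txt" for n in range(1, 5)]
commands = build_vlmc_commands(
    samples,
    "./vlmc-pipeline-test/tuplecount/",
    "./vlmc-pipeline-test/markovfile/",
    5.0,
)
for argv in commands:
    print(shlex.join(argv))
```

To actually run the step, each argument list could be passed to subprocess.run(argv, check=True) from the directory that contains the script.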
for((i=2;i<=9;i++));
do
echo $i
#d2S command
./calculatedissimiliraty.py -l pipeline.txt -k $i -r 0 -d d2S -m vlmc-pipeline-test/markovfile/ -o vlmc-pipeline-test/dissfile/d2S/pipelinek"$i"d2S
#d2Star command
./calculatedissimiliraty.py -l pipeline.txt -k $i -r 0 -d d2Star -m vlmc-pipeline-test/markovfile/ -o vlmc-pipeline-test/dissfile/d2Star/pipelinek"$i"d2Star
done
1: Write a clustering.r file:
source('/home/yingwang/weinan/vlmc/ClusterTreeby_upgma.R');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek2d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek3d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek4d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek5d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek6d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek7d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek8d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek9d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2S/pipelinek10d2S.dissimilaritymatrix.txt','pipelined2S.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek2d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek3d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek4d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek5d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek6d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek7d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek8d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek9d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
ClusterTreebyupgma('../vlmc-pipeline-test/dissfile/d2Star/pipelinek10d2Star.dissimilaritymatrix.txt','pipelined2Star.tree.nwk');
2: Go to the folder in which the clustering trees are to be stored and run:
Rscript ../clustering.r
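For readers unfamiliar with UPGMA, the clustering step can be sketched in Python as follows. This is a minimal, generic UPGMA on a small in-memory matrix, not a port of ClusterTreeby_upgma.R; the function name upgma_newick, the sample labels, and the distances are hypothetical, and the on-disk format of the dissimilarity matrix files is not modeled.

```python
def upgma_newick(labels, dist):
    """Build a UPGMA tree from a symmetric distance matrix; return Newick text."""
    n = len(labels)
    # Active clusters: id -> (newick string, number of leaves, node height)
    clusters = {i: (labels[i], 1, 0.0) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        name_a, size_a, height_a = clusters.pop(a)
        name_b, size_b, height_b = clusters.pop(b)
        height = d_ab / 2.0
        newick = (f"({name_a}:{height - height_a:.4f},"
                  f"{name_b}:{height - height_b:.4f})")
        # Distance to every other cluster: size-weighted average.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = (newick, size_a + size_b, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"

# Toy dissimilarity matrix for four hypothetical samples.
labels = ["s1", "s2", "s3", "s4"]
matrix = [
    [0.0, 0.1, 0.5, 0.5],
    [0.1, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.2],
    [0.5, 0.5, 0.2, 0.0],
]
tree = upgma_newick(labels, matrix)
print(tree)
```

In this toy case s1 and s2 are joined first, then s3 and s4, and the two pairs are merged last.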
Version Release Notes
• Version 1.0 – 2015 October 10th
1. This is the original version of the d2-vlmc pipeline, implemented to run on Linux-like machines with 16 GB of memory or more.
Downloads
• The source code of the pipeline.
• The results of applying this pipeline to real metatranscriptomic data.
Development Team
The whole pipeline was designed and implemented by Ying Wang's group, Automation Department, Xiamen University, P.R. China. Questions and suggestions are more than welcome; please send them to wangying@xmu.edu.cn or lwn19931224@gmail.com.