Manual for the CAL-tuple pipeline on the Linux platform

  1. Linux operating system;
  2. Python 2.7 or higher;
  3. R 2.7 or higher;
  4. GCC compiler;
  5. A k-tuple counting tool (e.g., DSK, Jellyfish);
  6. A de novo sequence assembler (e.g., MEGAHIT)

Step 1: Download the source code and test dataset for this pipeline to your workspace directory.

Step 2: Count the k-length sequence tuples (k-tuples) for each metagenomic sample by scanning the short reads with DSK. The commands are as follows:

  1. Compiling:      make omp=1 k=40      This compile command builds the parallel (OpenMP) version; k is the tuple length. For the serial version, simply omit the "omp" option.
  2. Tuple counting:      ./dsk ControlGroup_S1.fa 40
  3. Format transformation:      ./parse ControlGroup_S1.solid_kmers_binary > ControlGroup_S1_40_tuple.txt      The file "ControlGroup_S1.solid_kmers_binary" is an intermediate file generated by DSK.
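For reference, the parsed output is a plain-text list of tuple sequences and their counts. The short sketch below (not part of the pipeline, and not DSK itself) illustrates the counting idea only; a real counter such as DSK also handles reverse complements and disk-based storage, which are omitted here:

```python
from collections import Counter

def count_tuples(reads, k=40):
    """Count every k-length substring (k-tuple) across a set of reads.
    Simplified illustration: no reverse-complement canonicalization."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Tiny example with k=3 instead of 40, for readability.
reads = ["ACGTACG", "CGTAC"]
counts = count_tuples(reads, k=3)
```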

Step 3: Filter out rarely occurring tuples for each sample.

   ./CAL_tuple_filtering.py -f ControlGroup_S1_40_tuple.txt -n x

where "x" is the occurrence threshold: only tuples occurring more than x times are retained. In our study, x = 1. A new file "ControlGroup_S1_40_tuple_filtered.txt" is generated, and the file "ControlGroup_S1_40_tuple.txt" is removed automatically after this step.
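The filtering amounts to keeping only tuples whose occurrence exceeds the threshold. A minimal sketch of that logic (the actual CAL_tuple_filtering.py may differ in details such as file handling):

```python
def filter_tuples(tuple_counts, threshold=1):
    """Keep only tuples whose count is strictly greater than the
    threshold, mirroring the -n option of step 3 (a sketch only)."""
    return {t: c for t, c in tuple_counts.items() if c > threshold}

counts = {"AAAA": 1, "CCCC": 3, "GGGG": 2}
filtered = filter_tuples(counts, threshold=1)
```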

Step 4: Sort each filtered sample by the tuple sequence signature.

  ./CAL-tuple_quickSort.py -f ControlGroup_S1_40_tuple_filtered.txt

where the -f option specifies the filtered file from step 3. The default output file name is "ControlGroup_S1_40_tuple_filtered_sorted.txt".
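The sort is a plain lexicographic sort on the tuple sequence; step 5 depends on it because GNU join requires both inputs to be sorted on the join field. A minimal sketch of the same operation:

```python
def sort_tuple_lines(lines):
    """Sort 'tuple count' lines lexicographically by the tuple
    sequence (the first whitespace-separated field), as GNU join
    in step 5 requires of its inputs."""
    return sorted(lines, key=lambda line: line.split()[0])

lines = ["TTT 4", "AAA 2", "CCC 7"]
sorted_lines = sort_tuple_lines(lines)
```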

Step 5: Integrate the k-tuple files of all samples into a matrix.

       The commands for all ten samples (joined pairwise) are as follows:

  1.  join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 ControlGroup_S1_40_tuple_filtered_sorted.txt ControlGroup_S2_40_tuple_filtered_sorted.txt > tmp_001.txt
    join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 ControlGroup_S3_40_tuple_filtered_sorted.txt ControlGroup_S4_40_tuple_filtered_sorted.txt > tmp_002.txt
    join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 ControlGroup_S5_40_tuple_filtered_sorted.txt CaseGroup_S1_40_tuple_filtered_sorted.txt > tmp_003.txt
    join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 CaseGroup_S2_40_tuple_filtered_sorted.txt CaseGroup_S3_40_tuple_filtered_sorted.txt > tmp_004.txt
    join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 CaseGroup_S4_40_tuple_filtered_sorted.txt CaseGroup_S5_40_tuple_filtered_sorted.txt > tmp_005.txt
  2. join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,1.3,2.2,2.3 tmp_001.txt tmp_002.txt > tmp_101.txt
    join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,1.3,2.2,2.3 tmp_003.txt tmp_004.txt > tmp_102.txt
  3. join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5 tmp_101.txt tmp_102.txt > tmp_201.txt
    join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.2,2.3 tmp_201.txt tmp_005.txt > AllSample_TupleMatrix.txt      

After that, continue joining pairs of intermediate files until all tuples are integrated into one matrix. For more detailed usage of the "join" command, see its manual page ("man join"). The commands used for the test data and the resulting output are available for download.
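The effect of each join command can be pictured as a full outer join on the tuple sequence that fills missing counts with "0". The sketch below emulates `join -a1 -a2 -e '0' -o 0,1.2,2.2` for two two-column files; the later rounds are the same idea with more count columns carried along:

```python
def outer_join(lines_a, lines_b):
    """Emulate `join -a1 -a2 -e '0' -o 0,1.2,2.2` on two sorted
    two-column 'tuple count' files: full outer join on the tuple,
    with '0' substituted for counts missing from either side."""
    a = dict(line.split() for line in lines_a)
    b = dict(line.split() for line in lines_b)
    keys = sorted(set(a) | set(b))
    return ["%s %s %s" % (k, a.get(k, "0"), b.get(k, "0")) for k in keys]

s1 = ["AAA 2", "CCC 7"]   # sample 1, sorted
s2 = ["AAA 5", "TTT 1"]   # sample 2, sorted
merged = outer_join(s1, s2)
```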

Step 6: Normalize each sample to reduce the impact of sequencing coverage depth.

./CAL-tuple_normalization.py -f AllSample_TupleMatrix.txt

 where "AllSample_TupleMatrix.txt" is the complete 40-tuple frequency matrix from step 5: each row is a 40-tuple, each column is a sample, and each value is the frequency of that 40-tuple in that sample. Each frequency is normalized by the number of reads (in millions) in the corresponding sample. The default output file name is "AllSample_normalized_TupleMatrix.txt".
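The normalization is therefore a counts-per-million-reads scaling. A minimal sketch, assuming the total read count of each sample is known:

```python
def normalize_matrix(matrix, reads_per_sample):
    """Divide each sample's tuple frequencies by that sample's read
    count in millions (counts per million reads), a sketch of the
    step-6 scaling; CAL-tuple_normalization.py may differ in detail."""
    scale = [r / 1e6 for r in reads_per_sample]
    return [[count / scale[j] for j, count in enumerate(row)]
            for row in matrix]

matrix = [[10.0, 40.0],          # rows: 40-tuples
          [2.0, 8.0]]            # columns: samples
reads = [2000000, 4000000]       # total reads per sample
norm = normalize_matrix(matrix, reads)
```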

Step 7: Feature filtering with Signal-to-Noise Ratio (SNR).

./CAL-tuple_SNR.py -f AllSample_normalized_TupleMatrix.txt -n 5 -p 5 -SNR SNR_Filtered_Tuples.txt -GLD Tuples_for_GLD.txt

-f is followed by the normalized frequency matrix from step 6; -n and -p are the numbers of samples in the two groups; -SNR and -GLD are followed by the output file of the SNR filtering and the file passed on to the subsequent Chi-square filtering, respectively.
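The exact SNR statistic computed by CAL-tuple_SNR.py is not spelled out here; a commonly used definition (the Golub-style signal-to-noise ratio) divides the difference of the two group means by the sum of the two group standard deviations. A sketch for a single tuple feature, under that assumption:

```python
import math

def snr(group1, group2):
    """Signal-to-noise ratio of one feature across two sample groups:
    (mean1 - mean2) / (sd1 + sd2). Assumed formula; the statistic used
    by CAL-tuple_SNR.py may differ. Undefined if both sds are zero."""
    def mean(xs):
        return sum(xs) / len(xs)
    def sd(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return (mean(group1) - mean(group2)) / (sd(group1) + sd(group2))

# Normalized frequencies of one tuple in case vs. control samples.
value = snr([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```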

Step 8: Feature filtering with Chi-Square Testing.

 ./CAL-tuple_Chi-Square.py -f Tuples_for_GLD.txt -n 5 -p 5 -v 0.01 -c 0.95 -o Chi-Square_Filtered_Tuples.txt

where -f specifies the normalized frequency matrix already selected by SNR; -n and -p are the numbers of samples in the two groups; -v is the p-value threshold of the Chi-square test; -c is the confidence level; and -o is followed by the output file name.
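How CAL-tuple_Chi-Square.py constructs its test is not documented here; one common setup is a 2x2 contingency table of tuple presence/absence in case versus control samples, whose Chi-square statistic (1 degree of freedom) is sketched below. This is an illustrative assumption, not necessarily the script's exact procedure:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], e.g. tuple present/absent in case vs. control.
    Assumed construction; the pipeline's own table may differ."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return float(num) / den

# For 1 degree of freedom, a statistic above 6.635 means p < 0.01.
stat = chi_square_2x2(20, 5, 5, 20)
```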

Step 9: Merge the tuple features retained in steps 7 and 8 as candidate group-specific features.

       cat SNR_Filtered_Tuples.txt Chi-Square_Filtered_Tuples.txt >> Candidate_Group-Specific_Tuples.txt
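Note that plain `cat` keeps duplicates when a tuple passes both filters (and `>>` appends to any pre-existing file). If a unique candidate list is preferred, a small sketch such as the following merges the two lists while dropping duplicates:

```python
def merge_unique(features_a, features_b):
    """Union of two feature lists, preserving first-seen order and
    dropping duplicates; a deduplicating variant of the step-9 cat."""
    seen = set()
    merged = []
    for f in features_a + features_b:
        if f not in seen:
            seen.add(f)
            merged.append(f)
    return merged

merged = merge_unique(["AAA", "CCC"], ["CCC", "TTT"])
```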