Microsoft Research

eScience Research Group

 

 
Computational Biology Tools

Create Epitome

Overview

This tool models genetic diversity by summarizing a large input dataset into an epitome, a short sequence capturing many overlapping subsequences from the dataset.

For example, when applied to the diversity of HIV, the tool produces relatively small vaccine immunogens that cover a large number of immune system targets, known as epitopes. Our experiments have shown that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains.

The tool optimizes greedily: it iteratively lengthens the epitome by appending a patch from the data (possibly overlapping the sequence built so far), choosing the patch that maximally increases the ratio of the summed weights of the included patches to the length of the epitome. The process can be stopped once the desired length is reached, rather than only when the entire set of patches is included, as in the superstring problem.

Input Format

The tool accepts input in two formats.

First, a text table:

Patch Weight
 NKIVRMYSP 167
 LNKIVRMYS 167
 PQDLNTMLN 166
 QDLNTMLNT 166
 GATPQDLNT 166
 EGATPQDLN 166
 ATPQDLNTM 165
 TPQDLNTML 165

Separate the columns with a space, tab, or comma. The header row is required.

The first column contains patches for possible inclusion in the epitome. The second column gives their relative weights.

This format is easily created with a spreadsheet program such as Excel. Transfer data to the tool either by copying (Ctrl-C) and pasting (Ctrl-V), or by saving the spreadsheet in text format and using the tool's "Upload File" button.
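For readers who want to prepare or check such a table programmatically, here is a minimal parsing sketch. It assumes only what is stated above (two columns, a required header row, and space, tab, or comma separators); the function name `parse_patch_table` is hypothetical, and summing the weights of repeated patches is an assumption rather than documented behavior.

```python
import re


def parse_patch_table(text):
    """Parse a two-column Patch/Weight table into a {patch: weight} dict."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    # The first non-blank line is the required header row.
    if len(re.split(r"[,\s]+", lines[0])) != 2:
        raise ValueError("expected a two-column header row")
    weights = {}
    for ln in lines[1:]:
        fields = re.split(r"[,\s]+", ln)  # space, tab, or comma
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {ln!r}")
        patch, weight = fields
        # Assumption: a patch listed more than once has its weights summed.
        weights[patch] = weights.get(patch, 0.0) + float(weight)
    return weights
```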

The tool also accepts a second format: free text without weights. For example,

Twinkle, twinkle, little star;
How I wonder what you are.
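This page does not state how free text is split into patches. The sample output for this rhyme is consistent with treating each word as a patch of weight 1, uppercased and with punctuation dropped, so the sketch below makes that assumption. The function name `free_text_to_weights` is hypothetical.

```python
import re
from collections import Counter


def free_text_to_weights(text):
    """Turn free text into {patch: weight}, one unit of weight per word.

    Assumption: words are runs of letters, compared case-insensitively.
    """
    words = re.findall(r"[A-Za-z]+", text.upper())
    return dict(Counter(words))
```

Under this reading, the two lines of the rhyme yield ten words in total, with TWINKLE carrying a weight of 2.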

Output Format

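When "Show Only Last" is unchecked, the tool shows the sequence of epitomes created. This output is tab-delimited and suitable for copying (Ctrl-A, Ctrl-C) and pasting (Ctrl-V) into a spreadsheet such as Excel. The two tables below show the output for the text-table example and for the free-text example, respectively.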

Method AminoAcidLength numComponents coverage Vaccine
Greedy 9 1 0.125753 LNKIVRMYS
Greedy 10 1 0.251506 LNKIVRMYSP
Greedy 17 1 0.376506 PQDLNTMLNKIVRMYSP
Greedy 18 1 0.500753 TPQDLNTMLNKIVRMYSP
Greedy 19 1 0.625 ATPQDLNTMLNKIVRMYSP
Greedy 20 1 0.75 GATPQDLNTMLNKIVRMYSP
Greedy 21 1 0.875 EGATPQDLNTMLNKIVRMYSP
Greedy 30 2 1 EGATPQDLNTMLNKIVRMYSP,QDLNTMLNT

 

Method AminoAcidLength numComponents coverage Vaccine
Greedy 7 1 0.3 TWINKLE
Greedy 7 1 0.3 TWINKLE
Greedy 10 2 0.4 TWINKLE,HOW
Greedy 13 2 0.5 WHATWINKLE,HOW
Greedy 12 1 0.5 HOWHATWINKLE
Greedy 15 2 0.6 HOWHATWINKLE,YOU
Greedy 18 3 0.7 HOWHATWINKLE,YOU,ARE
Greedy 20 3 0.8 HOWHATWINKLE,YOU,STARE
Greedy 26 4 0.9 HOWHATWINKLE,YOU,STARE,LITTLE
Greedy 32 5 1 HOWHATWINKLE,YOU,STARE,LITTLE,WONDER
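The coverage column in both tables is consistent with the fraction of the total patch weight whose patches occur inside at least one component of the epitome. The sketch below computes that quantity. It is an interpretation inferred from the example numbers, not a documented formula, and the function name `coverage` is hypothetical.

```python
def coverage(vaccine, weights):
    """Fraction of total patch weight found inside the epitome.

    `vaccine` is the comma-separated component string from the Vaccine
    column; `weights` maps each patch to its weight (assumed format).
    """
    components = vaccine.split(",")
    total = sum(weights.values())
    covered = sum(w for patch, w in weights.items()
                  if any(patch in c for c in components))
    return covered / total
```

For instance, in the first table the total weight is 1328, and the epitome LNKIVRMYS contains only the patch of weight 167, giving 167/1328, which rounds to the listed 0.125753.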