Documents

  1. What is the ATC code?
  2. What is the purpose of SPACE?
  3. How is the prediction model underlying SPACE constructed?
  4. How are GSP and GSN sets constructed, which are used to train the prediction model underlying SPACE?
  5. How is the performance of the prediction model?
  6. Compared with other methods, how is the prediction performance of our method?
  7. How are the 6 similarity scores computed?
  8. What is the purpose and principle of the enrichment analysis?
  9. What is the difference between “DrugBank golden standard set” and “DrugBank+KEGG golden standard set” in the enrichment analysis?
  10. What is the Likelihood Ratio (LR)?
  11. Nomenclature in SPACE

1. What is the ATC code?

The Anatomical Therapeutic Chemical (ATC) classification system, developed and maintained by the World Health Organization (WHO) Collaborating Center for Drug Statistics Methodology (WHOCC), is currently the most widely recognized classification system for drugs. It divides drug substances into groups according to the organ or system on which they act and their therapeutic, pharmacological and chemical properties. The system has five levels of increasingly fine classification: the first level consists of 14 anatomical groups, the second level of pharmacological/therapeutic subgroups, the third and fourth levels of chemical/pharmacological/therapeutic subgroups, and the fifth level is the chemical substance. For example, the complete classification of acetylsalicylic acid is B (blood and blood forming organs, 1st level), B01 (anti-thrombotic agents, 2nd level), B01A (anti-thrombotic agents, 3rd level), B01AC (platelet aggregation inhibitors, 4th level) and B01AC06 (acetylsalicylic acid, 5th level). A drug can be assigned more than one ATC code when it has several different therapeutic uses. For example, besides its use as a platelet aggregation inhibitor (B01AC06), acetylsalicylic acid is also used for “local oral treatment” (A01AD05) and as an “analgesic and antipyretic” (N02BA01).
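The five levels correspond to prefixes of the full code with 1, 3, 4, 5 and 7 characters. As a minimal illustration (the function name atc_levels is hypothetical and not part of SPACE), a complete code can be split into its levels like this:

```python
def atc_levels(code: str) -> dict:
    """Split a complete (5th-level) ATC code into its five hierarchical levels."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError("a complete (5th-level) ATC code has 7 characters")
    # Prefix lengths for levels 1 to 5 are 1, 3, 4, 5 and 7 characters.
    prefix_lengths = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}
    return {level: code[:n] for level, n in prefix_lengths.items()}


print(atc_levels("B01AC06"))
# {1: 'B', 2: 'B01', 3: 'B01A', 4: 'B01AC', 5: 'B01AC06'}
```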

2. What is the purpose of SPACE?

SPACE (Similarity-based Predictor of ATC CodE) is designed to predict drug-ATC code associations. For each submitted compound, SPACE returns its predicted candidate ATC codes, ranked by decreasing probability_score, together with the corresponding supporting evidence. Predicting the ATC classification of drugs is helpful not only for studying drug utilization and understanding the therapeutic, pharmacological and chemical properties of drugs, but also for discovering drug side-effects and for drug repositioning studies. ATC classification prediction for chemical compounds also contributes to new drug development.

3. How is the prediction model underlying SPACE constructed?

The basic hypothesis underlying the prediction model is that if two drugs are similar in some respect, such as chemical structure, targets or cellular transcriptional response, they might share therapeutic, pharmacological or chemical properties and thus belong to the same ATC class. That is, the potential ATC codes of a drug of interest can be predicted from the known ATC codes of another drug that is similar to it.

Based on the golden standard positive (GSP) and negative (GSN) datasets, we use a logistic regression model to integrate 6 similarity-based features (FP2 fingerprint similarity, functional group similarity, target profile similarity, side-effect profile similarity, drug-induced gene expression similarity and STITCH chemical-chemical association score) into a prediction model for drug-ATC code associations. For each drug-ATC code pair, the prediction model gives a probability_score measuring the likelihood that the drug belongs to the ATC code.
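To make the scoring step concrete, the sketch below shows how a fitted logistic regression turns the 6 feature values of one drug-ATC code pair into a probability_score. The feature keys, weights and intercept are hypothetical placeholders, not the coefficients fitted for SPACE, and the handling of missing feature values is not covered.

```python
import math

# Hypothetical keys for the 6 similarity-based features.
FEATURES = ["fp2", "functional_group", "target",
            "side_effect", "expression", "stitch"]


def probability_score(features: dict, weights: dict, intercept: float) -> float:
    """Logistic model: sigmoid of the weighted sum of the feature values."""
    z = intercept + sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative (made-up) coefficients and feature values.
weights = {name: 1.0 for name in FEATURES}
pair = {"fp2": 0.8, "functional_group": 0.7, "target": 0.5,
        "side_effect": 0.3, "expression": 0.1, "stitch": 0.6}
print(probability_score(pair, weights, intercept=-2.0))
```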

Taking the FP2 fingerprint similarity-based feature as an example, for a given drug-ATC code pair the feature value is the largest FP2 similarity score between the drug of interest and any drug known to belong to the ATC code of interest, that is, the score of its most similar drug in that class (please refer to the answer to Q11.4 in this document for more details).

4. How are GSP and GSN sets constructed, which are used to train the prediction model underlying SPACE?

Here for each level (1~4) of the ATC system, we construct a prediction model, and thus the GSP and GSN sets are constructed for each level.

The GSP set is composed of known drug-ATC code associations downloaded from DrugBank (version: July 7, 2012) and KEGG (version: July 5, 2012) databases, including 4211, 4465, 4596 and 5662 drug-code pairs between 3433 drugs and 14 ATC codes of level 1, 87 codes of level 2, 205 codes of level 3 and 617 codes of level 4.

For a certain level, suppose D and C are respectively drug and code spaces of the GSP set, and n is the number of drug-code pairs in the GSP set. To construct the GSN set, we removed the GSP set from D×C drug-code pairs, and then randomly picked n ones from the remaining pairs as the GSN set.
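The sampling procedure above can be sketched as follows. This is a minimal illustration rather than the code used for SPACE; the function name build_gsn and the fixed random seed are assumptions.

```python
import random
from itertools import product


def build_gsn(gsp_pairs, seed=0):
    """Sample as many negative drug-code pairs as there are positive ones."""
    gsp = set(gsp_pairs)
    drugs = sorted({drug for drug, _ in gsp})   # drug space D
    codes = sorted({code for _, code in gsp})   # code space C
    # All pairs in D x C that are not known associations.
    candidates = [pair for pair in product(drugs, codes) if pair not in gsp]
    rng = random.Random(seed)
    # random.sample raises ValueError if fewer than n candidates remain.
    return rng.sample(candidates, len(gsp))


gsp = [("drug1", "A"), ("drug2", "B"), ("drug3", "A")]
print(build_gsn(gsp))
```

With real data, D×C is much larger than the GSP set, so the sample contains only a small part of the remaining pairs.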

5. How is the performance of the prediction model?

Using known drug-ATC code associations only from DrugBank database (version: July 7, 2012) as the golden standard (positive) set, the ROC AUC of the prediction model based on 10-fold cross-validation is 0.8932 for the model of level 1, 0.9356 for level 2, 0.9457 for level 3 and 0.9469 for level 4. Using known drug-ATC code associations from KEGG database (version: July 5, 2012) as the independent test (positive) set, our prediction model also obtains good performance (Table 1). The results in the following table indicate that the performance of our method is excellent for the prediction of not only ATC codes of “unclassified” drug (whose ATC codes are unknown) but also new ATC codes of “classified” drug (whose ATC codes are known in part).

Please refer to our future paper for more details.

Table 1 ROC AUCs of the prediction model based on the independent test set

a The independent test set is independent of the GSP and GSN sets at the level of drug-code pairs; the independent test set-drug is independent at the level of drugs; the independent test set-code is independent at the level of drug-code pairs but overlaps completely with the GSP/GSN sets in terms of drugs.
b Because the independent test positive set-code is small, we constructed the corresponding negative set 100 times to reduce the influence of random factors; the ROC AUCs reported in the table are the mean ± standard deviation (SD) over these 100 repetitions.
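For readers unfamiliar with the metric, the ROC AUC equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The sketch below computes it directly from that definition. It is illustrative only, and the scores are made up.

```python
def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


print(roc_auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```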

6. Compared with other methods, how is the prediction performance of our method?

To our knowledge, the prediction performance of our method is better than that of all previous methods for drug-ATC code association prediction. Please see the detailed comparison in our future paper.

7. How are the 6 similarity scores computed?

We used 6 similarity scores to measure drug-drug similarity: two based on chemical structure (FP2 fingerprint and functional group similarity), and one each based on target proteins, side-effects, drug-induced gene expression and the text mining score of chemical-chemical associations.

The FP2 fingerprint and functional group similarity scores were both computed from chemical structures. FP2 is a hash-based binary fingerprint generated by indexing all possible linear fragments of the molecular structure with a length of 1 to 7 atoms (J Cheminform. 2011, 3:33). The functional groups of a drug molecule are also represented as a binary vector, each dimension giving the presence (1) or absence (0) of a particular functional group in the molecule. Chemical structures represented by InChI were downloaded from DrugBank (downloaded on July 7, 2012) for the drugs in the golden standard set and from the KEGG database (version: July 5, 2012) for those in the independent test set. Based on these structures, FP2 fingerprints were produced by Open Babel (J Cheminform. 2011, 3:33), and functional group vectors by the Checkmol program (version: 29-Apr-2013), which can recognize a total of 204 functional groups (Molecules. 2010, 15(8):5079-92). We used the Tanimoto coefficient of the FP2 fingerprints/functional group vectors of a pair of drugs as their FP2/functional group similarity score. The Tanimoto coefficient is computed as $T = N_{ab} / (N_a + N_b - N_{ab})$, where $N_a$ and $N_b$ are the numbers of 1s in the two fingerprints/vectors and $N_{ab}$ is the number of 1s common to both.
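The Tanimoto coefficient (the number of bits set in both vectors divided by the number of bits set in at least one of them) can be sketched as below. Each fingerprint is represented as the set of its on-bit positions, which is an illustrative choice and not the Open Babel data format. Returning 0.0 when both vectors are empty is also an assumption.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two binary vectors given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    n_ab = len(a & b)                 # bits set in both vectors
    denom = len(a) + len(b) - n_ab    # bits set in at least one vector
    return n_ab / denom if denom else 0.0  # assumed convention for two empty vectors


print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```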

Similar to the functional group vector, the target profile of a drug is defined as a binary vector denoted by $x = (x_1, x_2, \ldots, x_K)$. Each dimension of the vector represents a protein, and its value is set to 1 if the protein is targeted by the drug and to 0 otherwise. Drug-target relationships were extracted from the DrugBank database (downloaded on Mar. 24, 2013). In total, we obtained drug-target relationships between 4139 small molecule drugs and 1924 human genes, and thus defined the target profile as a 1924-dimension vector (i.e. K=1924). The cosine correlation coefficient is used to measure the target profile similarity of two drugs x and y, and is defined as $\cos(x, y) = \frac{\sum_{i=1}^{K} x_i y_i}{\sqrt{\sum_{i=1}^{K} x_i^2}\,\sqrt{\sum_{i=1}^{K} y_i^2}}$.
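For binary profiles, the cosine coefficient reduces to the number of shared targets divided by the square root of the product of the two target counts. A minimal sketch follows, with each profile given as a set of target identifiers. The identifiers are made up, and returning 0.0 for a drug without targets is an assumption.

```python
import math


def target_cosine(targets_x, targets_y):
    """Cosine similarity of two binary target profiles given as sets of targets."""
    x, y = set(targets_x), set(targets_y)
    if not x or not y:
        return 0.0  # assumed convention when a drug has no known target
    return len(x & y) / math.sqrt(len(x) * len(y))


print(target_cosine({"P1", "P2"}, {"P2", "P3"}))  # 0.5
```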

The side-effect profile of a drug is defined in a similar way. Side-effect information for drugs was downloaded from the SIDER database (version: released on October 17, 2012). The processed drug-side effect relationship dataset involves 3209 side-effects represented by MedDRA preferred terms, and we therefore defined the side-effect profile as a 3209-dimension vector. The side-effect profile similarity score between drugs x and y is calculated by the weighted cosine correlation coefficient $\frac{\sum_{i} w_i x_i y_i}{\sqrt{\sum_{i} w_i x_i^2}\,\sqrt{\sum_{i} w_i y_i^2}}$, where $w_i$ is the weight function for the ith side-effect in the side-effect profile. $w_i$ is defined as $w_i = \exp\left(-d_i^2 / (\sigma^2 h^2)\right)$, where $d_i$ is the frequency of the ith side-effect in the dataset, i.e. the number of drugs having this side-effect in the dataset, σ is the mean of $d_i$ and h is set to 1, following Takarabe et al. (Bioinformatics. 2012, 28(18):i611-i618).
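The weighting gives rare side-effects more influence than common ones. The sketch below assumes the Gaussian-type weight w_i = exp(-d_i^2 / (σ^2 h^2)) used by Takarabe et al., with σ taken as the mean frequency and h = 1 as stated above. The function names and the toy frequencies are hypothetical.

```python
import math


def side_effect_weights(frequency, h=1.0):
    """Weight per side-effect; frequency maps a term to its number of drugs."""
    sigma = sum(frequency.values()) / len(frequency)  # mean frequency, as stated above
    # Assumed weight form: exp(-d_i^2 / (sigma^2 * h^2)).
    return {term: math.exp(-(d * d) / (sigma * sigma * h * h))
            for term, d in frequency.items()}


def weighted_cosine(effects_x, effects_y, weights):
    """Weighted cosine of two binary side-effect profiles given as sets of terms."""
    x, y = set(effects_x), set(effects_y)
    numerator = sum(weights[t] for t in x & y)
    denominator = (math.sqrt(sum(weights[t] for t in x))
                   * math.sqrt(sum(weights[t] for t in y)))
    return numerator / denominator if denominator else 0.0


freq = {"nausea": 3, "rash": 1}   # toy frequencies
w = side_effect_weights(freq)
print(weighted_cosine({"nausea", "rash"}, {"rash"}, w))
```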

Gene expression profile similarity scores between 1144 drugs were obtained directly from Cheng et al. (Pac Symp Biocomput. 2013:5-16). These scores were calculated from gene expression profiles in response to drug treatment, downloaded from the Connectivity Map (CMAP), using the Batch DMSO Control (BDC) data pre-processing method and the Xtreme cosine (XCos) similarity score (with 100 probes). Cheng et al.’s study indicated that, compared with other methods, the similarity score obtained by the XCos_BDC_100 method performed best at predicting whether or not a pair of drugs shares an ATC code (see Cheng et al.'s paper for more details).

The text mining scores of chemical-chemical associations were downloaded from the STITCH database (v3.1). They are computed based both on co-occurrence in the literature and on natural language processing (Nucleic Acids Res. 2008, 36(Database issue):D684-8).

8. What is the purpose and principle of the enrichment analysis?

The enrichment analysis is designed to analyze the potential therapeutic/pharmacological/chemical properties of a drug composed of multiple constituent compounds, typically a Traditional Chinese Medicine (TCM).

To identify the ATC codes that are significantly enriched among the query compounds (the predicted candidate codes), we use the GSP set as the control group and compute a P-value with the Fisher exact test. If the fraction of query compounds belonging to an ATC code is significantly larger than the corresponding fraction in the “control” set, we regard the ATC code as significantly enriched among the query compounds. Two datasets are available as the control set, “DrugBank golden standard set” and “DrugBank+KEGG golden standard set” (see the answer to Q9 in this document for more information).
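A one-sided Fisher exact test of this kind can be sketched with the hypergeometric distribution. The code below is an illustration under stated assumptions, not the SPACE implementation: the parameter names are hypothetical, the query and control sets are treated as disjoint, and the alternative hypothesis is that the query fraction is larger.

```python
from math import comb


def fisher_enrichment_p(query_hits, query_size, control_hits, control_size):
    """One-sided Fisher exact test P-value for enrichment of a code in the query.

    query_hits / query_size: query compounds with the code / all query compounds.
    control_hits / control_size: the same counts in the control set.
    """
    total = query_size + control_size
    with_code = query_hits + control_hits
    upper = min(query_size, with_code)
    # Hypergeometric tail: probability of at least query_hits compounds with the
    # code among query_size compounds drawn from the pooled set.
    tail = sum(comb(with_code, k) * comb(total - with_code, query_size - k)
               for k in range(query_hits, upper + 1))
    return tail / comb(total, query_size)


print(fisher_enrichment_p(3, 3, 0, 3))  # 0.05
```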

9. What is the difference between “DrugBank golden standard set” and “DrugBank+KEGG golden standard set” in the enrichment analysis?

We provide two datasets as the “control” set of the enrichment analysis - “DrugBank golden standard set” and “DrugBank+KEGG golden standard set”. “DrugBank golden standard set” is composed of 1333 drugs and their corresponding ATC codes from the DrugBank database (version: July 7, 2012), and is used as the GSP set to evaluate the performance of our model (please refer to the answer to Q5 in this document). “DrugBank+KEGG golden standard set” includes known drug-ATC code associations from the DrugBank (version: July 7, 2012) and KEGG (version: July 5, 2012) databases, and is used as the GSP set to train the prediction model underlying SPACE (please refer to the answer to Q4 in this document).

10. What is the Likelihood Ratio (LR)?

The Likelihood Ratio is defined as the ratio of the probability of observing feature f in the GSP set to the probability of observing it in the GSN set. LR is used to assess the predictive power of features. In general, LR>1 means that the feature has predictive power, that is, the feature can be regarded as a piece of supporting evidence for a drug-ATC code association.
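As a rough illustration, the LR of a feature value can be estimated by comparing how often values in the same range occur among GSP pairs and among GSN pairs. Binning the similarity score into an interval [low, high) is an assumption made for this sketch; this document does not state how SPACE discretizes the features.

```python
def likelihood_ratio(gsp_values, gsn_values, low, high):
    """LR of observing a feature value in [low, high): P(f | GSP) / P(f | GSN)."""
    p_pos = sum(low <= v < high for v in gsp_values) / len(gsp_values)
    p_neg = sum(low <= v < high for v in gsn_values) / len(gsn_values)
    if p_neg == 0:
        # Never seen among negatives: unbounded if seen among positives.
        return float("inf") if p_pos > 0 else float("nan")
    return p_pos / p_neg


gsp_scores = [0.9, 0.8, 0.85, 0.2]   # made-up feature values of GSP pairs
gsn_scores = [0.1, 0.2, 0.9, 0.3]    # made-up feature values of GSN pairs
print(likelihood_ratio(gsp_scores, gsn_scores, 0.8, 1.0))  # 3.0
```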

11. Nomenclature in SPACE

11.1 Predicted ATC code (probability_score)

For each query compound, SPACE will give its predicted candidate ATC codes of user-specified level ranked according to the decreasing probability_score given by the prediction model. For each pair of drug-ATC code, the prediction model can give a probability_score measuring the possibility that the drug belongs to the ATC code. When the query compound is a “classified” drug (that is, it is a member of the golden standard set used to construct the prediction model (A of Q4), and thus its codes are at least partly known), its known ATC codes will be given first, followed by predicted “new” ATC codes.
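The ordering described above can be sketched as a single sort. The function name and the example scores are hypothetical, and ordering the known codes among themselves by score is an assumption.

```python
def rank_codes(scores, known_codes=()):
    """Order ATC codes: known codes first, then by decreasing probability_score."""
    known = set(known_codes)
    # False sorts before True, so known codes come first; the negated score
    # gives decreasing order within each group.
    return sorted(scores, key=lambda code: (code not in known, -scores[code]))


scores = {"B01A": 0.90, "N02B": 0.70, "A01A": 0.95}   # made-up scores
print(rank_codes(scores, known_codes={"N02B"}))
# ['N02B', 'A01A', 'B01A']
```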

11.2 Enriched ATC code, M, N, Control_M, Control_N, P_value of Fisher exact test

Please refer to A of Q8 in this document for details.

11.3 Similarity-based feature

We consider 6 similarity score-based features in total: FP2 fingerprint similarity, functional group similarity, target profile similarity, side-effect profile similarity, drug-induced gene expression profile similarity and STITCH text mining score. Taking the FP2 fingerprint similarity-based feature as an example, for a given drug-ATC code pair the feature value is the largest FP2 similarity score between the drug of interest and any drug known to belong to the ATC code of interest, that is, the score of its most similar drug in that class (please refer to the answer to Q11.4 in this document for more details).

11.4 The most similar drug, the largest similarity score

For a drug-ATC code association, the most similar drug is the drug that, among all drugs known to belong to the ATC code of interest in the golden standard set (see the answer to Q4 in this document), is most similar to the drug of interest. The largest similarity score is the similarity score between the drug of interest and this most similar drug.
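This definition amounts to taking a maximum over the members of the ATC class. In the sketch below, the function name, the similarity callback and the toy scores are hypothetical, and skipping the query drug itself when it is already a member of the class is an assumption.

```python
def most_similar_drug(query, class_members, similarity):
    """Return (most similar drug, largest similarity score) within one ATC class."""
    best_drug, best_score = None, None
    for drug in class_members:
        if drug == query:
            continue  # assumed: a drug is not compared with itself
        score = similarity(query, drug)
        if best_score is None or score > best_score:
            best_drug, best_score = drug, score
    return best_drug, best_score


# Toy similarity scores between a query compound and three class members.
toy = {("q", "d1"): 0.31, ("q", "d2"): 0.76, ("q", "d3"): 0.52}
print(most_similar_drug("q", ["d1", "d2", "d3"], lambda a, b: toy[(a, b)]))
# ('d2', 0.76)
```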

11.5 FP2_LR

Likelihood Ratio (LR) of FP2 fingerprint similarity-based feature.

11.6 Functional_group_LR

Likelihood Ratio (LR) of functional group similarity-based feature.

11.7 Target_LR

Likelihood Ratio (LR) of target profile similarity-based feature.

11.8 Side-effect_LR

Likelihood Ratio (LR) of side-effect similarity-based feature.

11.9 Expression_LR

Likelihood Ratio (LR) of drug-induced gene expression profile similarity-based feature.

11.10 STITCH_score_LR

Likelihood Ratio (LR) of STITCH text mining score of chemical-chemical associations.

11.11 Target protein

The target proteins of the query drug are from DrugBank database (version: July 7, 2012).

11.12 Side-effect

The side-effects of the query drug are from SIDER database (version: released on October 17, 2012).