DASEV

Introduction

DASEV is developed for zero-inflated mass spectrometry data, specifically metabolomics and proteomics data. The data are assumed to follow a mixture distribution in which zero values are either biological or technical; these are called BPMVs and TPMVs, respectively. A Bayes shrinkage method is applied to improve variance estimation and thereby the performance of the differential abundance analysis. An R package is available to download here.

Sample R code

Simulation

R code to perform the simulation analysis in our manuscript can be downloaded here.

The dataset "simpool" is available in the R package. We applied DASEV to the HUPD dataset (described in the HUPD data analysis section) and saved the four parameters needed for the simulation to "simpool". The four parameters are DL (detection limit), mu0 (group mean), p0 (group BPMV proportion), and sd (standard deviation).
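As a rough illustration of how the four parameters could drive a simulation, the sketch below generates one feature for one group. It is not the code from the package or the manuscript. The function name simulate_feature, the parameter values, and two modelling choices are assumptions made here: non-zero abundances are normal on the log scale, and DL is on that same log scale, with values below it recorded as zero.

```r
# Minimal sketch: simulate one feature in one group under a mixture model.
# Assumptions (not taken from the DASEV package): log-scale abundances are
# normal with mean mu0 and standard deviation sd, and DL is on the log scale.
simulate_feature <- function(n, DL, mu0, p0, sd) {
  is_bpmv   <- rbinom(n, size = 1, prob = p0) == 1  # biological zeros
  log_abund <- rnorm(n, mean = mu0, sd = sd)        # latent log abundance
  is_tpmv   <- !is_bpmv & (log_abund < DL)          # technical zeros (below DL)
  observed  <- ifelse(is_bpmv | is_tpmv, 0, exp(log_abund))
  data.frame(observed = observed, bpmv = is_bpmv, tpmv = is_tpmv)
}

set.seed(1)
# Hypothetical parameter values, chosen only for illustration.
sim <- simulate_feature(n = 200, DL = 2, mu0 = 3, p0 = 0.2, sd = 1)
mean(sim$observed == 0)  # overall proportion of zeros (BPMVs plus TPMVs)
```

In the observed data the two kinds of zero are indistinguishable; the bpmv and tpmv columns are kept here only so the simulated truth can be inspected.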

The first part (Line 1) of the R code (Function for simulation) defines a function (SIMULATION) that generates simulated datasets and performs differential abundance analysis using DASEV, TLK, AFT, and 2T. TLK and AFT are the comparison methods proposed by Taylor et al. (2013). The R code for TLK and AFT was downloaded from the publication website (https://www.degruyter.com/view/j/sagmb.2013.12.issue-6/sagmb-2013-0021/Appendix2_corrected.pdf). The original R code for 2T is from Taylor and Pollard (2009). The function Mixture is used for the differential abundance analysis. Because the original function only tests the null hypothesis that both the non-BPMV mean and the BPMV proportion are the same between groups, it was modified to add two tests that assess the non-BPMV mean and the BPMV proportion separately. In addition, the original function does not adjust for multiple comparisons. To make the method more comparable with DASEV, we added the Benjamini-Hochberg procedure to control the false discovery rate. The modified code can be downloaded here. The parameters, the simulated dataset, the DASEV results, and the TLK results are saved for each simulation. File paths in the script need to be changed to match your own system.
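The Benjamini-Hochberg step can be sketched with base R's p.adjust(). This is only an illustration of the adjustment itself, not the modified code referred to above; the p-values and the 5% threshold are hypothetical.

```r
# Hypothetical raw p-values for six features (illustrative values only).
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)

# Benjamini-Hochberg adjustment using base R's p.adjust().
p_bh <- p.adjust(p_raw, method = "BH")

# Features declared differentially abundant at a 5% FDR threshold
# (the threshold is an assumption made for this example).
significant <- p_bh < 0.05
```

Note that the third and fourth features have raw p-values below 0.05 but are no longer significant after adjustment, which is the behaviour that makes results comparable across methods that control the false discovery rate.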

The second part (Line 163) of the R code (Perform Simulations) calls the function and performs the differential abundance analysis.

There are six simulation scenarios in the second part. Each scenario performs 100 simulations with sample sizes of 10, 20, 100, and 200 per group.

    1. #Simulation scenario one with difference in group means. (Line 171)

    2. #Simulation scenario two with difference in BPMV proportions. (Line 183)

    3. #Simulation scenario three with difference in group means and BPMV proportions. (Line 195)

    4. #Simulation scenario four with more dissonant features and difference in group means and BPMV proportions. (Line 207)

    5. #Simulation scenario five with more consonant features and difference in group means and BPMV proportions. (Line 223)

    6. #Simulation scenario six with no TPMVs and difference in group means. (Line 239)
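The overall shape of such a simulation run can be sketched as a loop over sample sizes and replicates. The sketch assumes 100 replicates at each sample size. The function run_one_simulation is a hypothetical placeholder, not the SIMULATION function from the manuscript, whose arguments are not described here, and the seeding scheme is likewise an assumption.

```r
# Placeholder standing in for the manuscript's SIMULATION function; it only
# sets a seed and returns the settings it was given.
run_one_simulation <- function(n_per_group, sim_id) {
  set.seed(1000 * n_per_group + sim_id)  # distinct, reproducible seed per run
  list(n_per_group = n_per_group, sim_id = sim_id)
}

sample_sizes <- c(10, 20, 100, 200)
n_sim <- 100

# One row per (replicate, sample size) combination: 100 x 4 = 400 runs.
design <- expand.grid(sim_id = seq_len(n_sim), n_per_group = sample_sizes)

# Run every combination; each element of results holds one run's output.
results <- Map(run_one_simulation, design$n_per_group, design$sim_id)
length(results)
```

Building the design table first keeps the sample size and replicate index attached to each result, which is convenient when results are saved and reloaded later for plotting.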

Plot

Single Plot

R code to generate the plots comparing DASEV and TLK in our manuscript, using the saved parameter datasets and results, can be downloaded here.

The following list contains the components of the analysis:

    1. #Load results for a simulation with 200 observations per group. This is to import datasets and results for a single simulation. (Line 4)

    2. #Figure 1. This is to produce Figure 1 in the manuscript (single simulation). (Line 56)

    3. #Figure 2. This is to produce Figure 2 in the manuscript (single simulation). (Line 99)

    4. #Figure 3. This is to produce Figure 3 in the manuscript (single simulation). (Line 139)

    5. #Figure 4. This is to load multiple simulations and to produce Figure 4 in the manuscript. (Line 178)

    6. #Load results for a simulation with 100 observations per group. This is to import datasets and results for a single simulation. (Line 667)

    7. #Figure S2. This is to produce Figure S2 in the manuscript (single simulation). (Line 719)

    8. #Figure S3. This is to produce Figure S3 in the manuscript (single simulation). (Line 763)

    9. #Figure S4. This is to produce Figure S4 in the manuscript (single simulation). (Line 803)

    10. #Figure S5. This is to load multiple simulations and to produce Figure S5 in the manuscript. (Line 841)

Average TPR and FDR

R code to obtain aggregated results over 100 simulations in our manuscript can be downloaded here.

The first part (Line 1) of the R code (Function to get average results) defines a function (Figure4f) that aggregates results over 100 simulations.

The second part (Line 287) of the R code (Call Function to get average results) calls the function and saves the results as Rdata.

The third part (Line 392) of the R code (Plots) generates plots of the average results.

The following list contains the components of the Plots:

    1. #Figure S8. This is to produce Figure S8 in the manuscript. (Line 402)

    2. #Figure S9. This is to produce Figure S9 in the manuscript. (Line 452)

    3. #Figure S10. This is to produce Figure S10 in the manuscript. (Line 502)

    4. #Figure S11. This is to produce Figure S11 in the manuscript. (Line 552)

    5. #Figure S12. This is to produce Figure S12 in the manuscript. (Line 602)

    6. #Figure S13. This is to produce Figure S13 in the manuscript. (Line 652)
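The quantities averaged in this section, the true positive rate (TPR) and the false discovery rate (FDR), can be sketched as follows. This is not the Figure4f function; the helper tpr_fdr, the 0.05 threshold, and the toy p-values are assumptions made for illustration.

```r
# TPR and FDR for one simulation, given adjusted p-values and the true status
# of each feature (TRUE = truly differentially abundant).
tpr_fdr <- function(p_adj, is_true_da, alpha = 0.05) {
  called <- p_adj < alpha
  tp <- sum(called & is_true_da)
  fp <- sum(called & !is_true_da)
  c(TPR = tp / max(sum(is_true_da), 1),  # guard against no true features
    FDR = fp / max(sum(called), 1))      # FDR taken as 0 when nothing is called
}

# Two hypothetical simulations with four features each (illustrative values).
truth <- c(TRUE, TRUE, FALSE, FALSE)
sims <- list(c(0.01, 0.20, 0.03, 0.60),
             c(0.02, 0.04, 0.50, 0.70))

# One column per simulation, rows named TPR and FDR.
per_sim <- sapply(sims, tpr_fdr, is_true_da = truth)
avg <- rowMeans(per_sim)  # average TPR and FDR across simulations
```

With these toy inputs the first simulation has one true and one false call and the second has two true calls, so the averages are a TPR of 0.75 and an FDR of 0.25.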

Extra plots

R code to obtain other selected plots in our manuscript can be downloaded here.

The following list contains the components of the Plots:

    1. #Figure S16. This is to produce Figure S16 in the manuscript. (Line 1)

References:

Taylor, S. L., Leiserowitz, G. S. & Kim, K. Accounting for undetected compounds in statistical analyses of mass spectrometry 'omic studies. Statistical Applications in Genetics and Molecular Biology 12, 703-722, DOI: 10.1515/sagmb-2013-0021 (2013).

Taylor, S. L. & Pollard, K. Hypothesis tests for point-mass mixture data with application to 'omics data with many zero values. Statistical Applications in Genetics and Molecular Biology 8, DOI: 10.2202/1544-6115.1425 (2009).