pairwisecomparisons_simulateddata

Last updated: 2021-10-21

Checks: 6 1

Knit directory: pair_con_select/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190211)

The command set.seed(20190211) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 69efefb

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 69efefb. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rproj.user/
    Ignored:    analysis/Archive/.DS_Store
    Ignored:    analysis/Archive/shinyapp/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    code/archive/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/depmap_alkati/.DS_Store
    Ignored:    data/depmap_alkati/Data_Raw/.DS_Store
    Ignored:    data/depmap_alkati/Data_Raw/CCLE/CCLE_RNAseq_ExonUsageRatio_20180929.gct
    Ignored:    data/tcga_brca_expression/
    Ignored:    data/tcga_luad_expression/
    Ignored:    data/tcga_skcm_expression/
    Ignored:    output/.DS_Store
    Ignored:    output/alkati_filtercutoff_allfilters.csv

Untracked files:
    Untracked:  README.html
    Untracked:  analysis/Archive/ks_results_forshiny.csv
    Untracked:  analysis/pairwisecomparisons_simulateddata.Rmd
    Untracked:  data/archive/
    Untracked:  data/depmap_alkati/archive/

Unstaged changes:
    Modified:   README.md
    Modified:   analysis/index.Rmd
    Deleted:    analysis/ks_results_forshiny.csv
    Deleted:    analysis/updated_resampling_strategy2.Rmd
    Deleted:    data/All_Data_V2.csv
    Deleted:    data/alkati_simulations_compiled_200_51121.csv
    Deleted:    data/all_data.csv
    Deleted:    data/depmap_alkati/To_do_depmap.docx
    Deleted:    data/skmel28_sos1_mekq56p_vemurafenib.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

# library(ensembldb) #Loading this with Dplyr commands seems to throw an error in Rmd
# library(EnsDb.Hsapiens.v86) #Loading this with Dplyr commands seems to throw an error in Rmd
# source("../code/mut_excl_genes_generator3.R")
# source("../code/mut_excl_genes_datapoints.R")

source("code/contab_maker.R")
source("code/contab_simulator.R")
source("code/contab_downsampler.R")
source("code/alldata_compiler.R")
source("code/mut_excl_genes_generator.R")

# source("../code/contab_maker.R")
# source("../code/contab_simulator.R")
# source("../code/contab_downsampler.R")
# source("../code/alldata_compiler.R")
# source("../code/mut_excl_genes_generator.R")

Pairwise Comparisons using Simulated Genes

This is the part of the code where I performed pairwise comparisons on gene pairs having different abundances, odds ratios, and cohort sizes.

# rm(list=ls())
or_pair1=c(.01,.05,.1,.2,.3,.4,.5,.6,.7,.8,.9,1)
or_pair2=seq(.01,.2,by=.02)
incidence=seq(4,36,by=2)
cohort_size=seq(100,1000,by=200)

# or_pair1=c(.01,.05)
# or_pair2=seq(.01,.1,by=.05)
# incidence=seq(4,36,by=20)
# cohort_size=seq(100,1000,by=400)

    # or_pair1=.05
    # or_pair2=.01
    # incidence=12
    # cohort_size=500
    # i=1
    # j=1
    # k=1
    # l=1

true_or_vals=or_pair1
cohort_size_vals=cohort_size
gene1_total_vals=incidence

simresults_compiled_alldata=as.list(length(or_pair2)*length(or_pair1)*length(incidence)*length(cohort_size))
simresults_compiled=matrix(,length(or_pair2)*length(or_pair1)*length(incidence)*length(cohort_size),ncol=16)
ct=1
for(j in 1:length(incidence)){
  tic()
for(l in 1:length(or_pair2)){
  for(k in 1:length(cohort_size_vals)){
    for(i in 1:length(true_or_vals)){
gene_pair_1=unlist(mut_excl_genes_generator(cohort_size[k],incidence[j],or_pair1[i],or_pair2[l])[1])
gene_pair_1_table=rbind(c(gene_pair_1[1],gene_pair_1[2]),c(gene_pair_1[3],gene_pair_1[4])) ###make sure these are the right indices
gene_pair_2=unlist(mut_excl_genes_generator(cohort_size[k],incidence[j],or_pair1[i],or_pair2[l])[2])
gene_pair_2_table=rbind(c(gene_pair_2[1],gene_pair_2[2]),c(gene_pair_2[3],gene_pair_2[4]))
# alldata_1=mut_excl_genes_datapoints(gene_pair_1)
# 
# alldata_2=mut_excl_genes_datapoints(gene_pair_2)
# alldata_comp_1=alldata_compiler(alldata_1,"gene2","gene3","gene1",'N',"N/A","N/A")[[2]]
# 
# genex_replication_prop_1=alldata_compiler(alldata_1,"gene2","gene3","gene1",'N',"N/A","N/A")[[1]]
# alldata_comp_2=alldata_compiler(alldata_2,"gene2","gene3","gene1",'N',"N/A","N/A")[[2]]
# genex_replication_prop_2=alldata_compiler(alldata_2,"gene2","gene3","gene1",'N',"N/A","N/A")[[1]]


###Calculating Odds ratios and GOI frequencies for the raw data###
# cohort_size_curr=length(alldata_comp$Positive_Ctrl1)
cohort_size_curr=cohort_size[k]
# pc1pc2_contab_counts=contab_maker(alldata_comp$Positive_Ctrl1,alldata_comp$Positive_Ctrl2,alldata_comp)[2:1, 2:1]
pc1pc2_contab_counts=gene_pair_2_table
# goipc1_contab_counts=contab_maker(alldata_comp$genex,alldata_comp$Positive_Ctrl1,alldata_comp)[2:1, 2:1]
goipc1_contab_counts=gene_pair_1_table


# goipc2_contab_counts=contab_maker(alldata_comp$genex,alldata_comp$Positive_Ctrl2,alldata_comp)[2:1, 2:1]
# pc1pc2_contab_probabilities=pc1pc2_contab_counts/cohort_size_curr
# goipc1_contab_probabilities=goipc1_contab_counts/cohort_size_curr
pc1pc2_contab_probabilities=pc1pc2_contab_counts
goipc1_contab_probabilities=goipc1_contab_counts
# goipc2_contab_probabilities=goipc2_contab_counts/cohort_size

or_pc1pc2=pc1pc2_contab_probabilities[1,1]*pc1pc2_contab_probabilities[2,2]/(pc1pc2_contab_probabilities[1,2]*pc1pc2_contab_probabilities[2,1])
or_goipc1=goipc1_contab_probabilities[1,1]*goipc1_contab_probabilities[2,2]/(goipc1_contab_probabilities[1,2]*goipc1_contab_probabilities[2,1])
# or_goipc2=goipc2_contab_probabilities[1,1]*goipc2_contab_probabilities[2,2]/(goipc2_contab_probabilities[1,2]*goipc2_contab_probabilities[2,1])

goi_freq=goipc1_contab_probabilities[1,1]+goipc1_contab_probabilities[1,2]
# goi_freq=.01
# class(goi_freq)

###

###Downsampling PC1 to the probability of GOI without changing ORs###
###The function below converts contingency table data to a new contingency table in which the data is downsampled to the desired frequency, aka the frequency of the GOI in this case###
pc1new_pc2_contab=contab_downsampler(pc1pc2_contab_probabilities,goi_freq)
goinew_pc1_contab=contab_downsampler(goipc1_contab_probabilities,goi_freq)
# goinew_pc2_contab=contab_downsampler(goipc2_contab_probabilities,goi_freq)
###original contab:
# head(pc1pc2_contab_probabilities) 
###downsampled contab:
# head(pc1new_pc2_contab) 

pc1rawpc2_contabs_sims=contab_simulator(pc1pc2_contab_probabilities,1000,cohort_size_curr)
pc1pc2_contabs_sims=contab_simulator(pc1new_pc2_contab,1000,cohort_size_curr)
goipc1_contabs_sims=contab_simulator(goinew_pc1_contab,1000,cohort_size_curr)
# goipc2_contabs_sims=contab_simulator(goinew_pc2_contab,1000,cohort_size)
# head(pc1pc2_contabs_sims) #each row in this dataset is a new contab
pc1rawpc2_contabs_sims=data.frame(pc1rawpc2_contabs_sims)
  pc1rawpc2_contabs_sims=pc1rawpc2_contabs_sims%>%
  mutate(or=p11*p00/(p10*p01))

pc1pc2_contabs_sims=data.frame(pc1pc2_contabs_sims)
  pc1pc2_contabs_sims=pc1pc2_contabs_sims%>%
  mutate(or=p11*p00/(p10*p01))
  
goipc1_contabs_sims=data.frame(goipc1_contabs_sims)
  goipc1_contabs_sims=goipc1_contabs_sims%>%
  mutate(or=p11*p00/(p10*p01))
  

  
# goipc2_contabs_sims=data.frame(goipc2_contabs_sims)
#   goipc2_contabs_sims=goipc2_contabs_sims%>%
#   mutate(or=p11*p00/(p10*p01))
pc1rawpc2_contabs_sims$comparison="pc1rawpc2"
pc1pc2_contabs_sims$comparison="pc1pc2"
goipc1_contabs_sims$comparison="goipc1"
# goipc2_contabs_sims$comparison="goipc2"
or_median_raw=quantile(pc1rawpc2_contabs_sims$or,na.rm = T)[3]
or_uq_raw=quantile(pc1rawpc2_contabs_sims$or,na.rm = T)[4]
or_median_downsampled=quantile(pc1pc2_contabs_sims$or,na.rm = T)[3]
or_uq_downsampled=quantile(pc1pc2_contabs_sims$or,na.rm = T)[4]



pc1rawpc2_contabs_sims=pc1rawpc2_contabs_sims%>%
  mutate(isgreater_raw_median=case_when(or>=or_median_raw~1,
                             TRUE~0),
         isgreater_raw_uq=case_when(or>or_uq_raw~1,
                             TRUE~0),
         isgreater_median=case_when(or>or_median_downsampled~1,
                             TRUE~0),
         isgreater_uq=case_when(or>or_uq_downsampled~1,
                             TRUE~0)
         )
pc1pc2_contabs_sims=pc1pc2_contabs_sims%>%
  mutate(isgreater_raw_median=case_when(or>=or_median_raw~1,
                             TRUE~0),
         isgreater_raw_uq=case_when(or>or_uq_raw~1,
                             TRUE~0),
         isgreater_median=case_when(or>or_median_downsampled~1,
                             TRUE~0),
         isgreater_uq=case_when(or>or_uq_downsampled~1,
                             TRUE~0)
         )
goipc1_contabs_sims=goipc1_contabs_sims%>%
  mutate(isgreater_raw_median=case_when(or>or_median_raw~1,
                             TRUE~0),
         isgreater_raw_uq=case_when(or>or_uq_raw~1,
                             TRUE~0),
         isgreater_median=case_when(or>or_median_downsampled~1,
                             TRUE~0),
         isgreater_uq=case_when(or>or_uq_downsampled~1,
                             TRUE~0)
         )
# pc1pc2_contabs_sims=pc1pc2_contabs_sims%>%
#   mutate(isgreater=case_when(or>=or_pc1pc2~1,
#                              TRUE~0))
# goipc1_contabs_sims=goipc1_contabs_sims%>%
#   mutate(isgreater=case_when(or>=or_pc1pc2~1,
#                              TRUE~0))
# goipc2_contabs_sims=goipc2_contabs_sims%>%
#   mutate(isgreater=case_when(or>=or_pc1pc2~1,
#                              TRUE~0))
pc1rawpc2_isgreater_raw_median=sum(pc1rawpc2_contabs_sims$isgreater_raw_median)
pc1rawpc2_isgreater_raw_uq=sum(pc1rawpc2_contabs_sims$isgreater_raw_uq)
pc1rawpc2_isgreater_median=sum(pc1rawpc2_contabs_sims$isgreater_median)
pc1rawpc2_isgreater_uq=sum(pc1rawpc2_contabs_sims$isgreater_uq)

pc1pc2_isgreater_raw_median=sum(pc1pc2_contabs_sims$isgreater_raw_median)
pc1pc2_isgreater_raw_uq=sum(pc1pc2_contabs_sims$isgreater_raw_uq)
pc1pc2_isgreater_median=sum(pc1pc2_contabs_sims$isgreater_median)
pc1pc2_isgreater_uq=sum(pc1pc2_contabs_sims$isgreater_uq)

goipc1_isgreater_raw_median=sum(goipc1_contabs_sims$isgreater_raw_median)
goipc1_isgreater_raw_uq=sum(goipc1_contabs_sims$isgreater_raw_uq)
goipc1_isgreater_median=sum(goipc1_contabs_sims$isgreater_median)
goipc1_isgreater_uq=sum(goipc1_contabs_sims$isgreater_uq)


# pc1rawpc2_isgreater=sum(pc1rawpc2_contabs_sims$isgreater)
# pc1pc2_isgreater=sum(pc1pc2_contabs_sims$isgreater)
# goipc1_isgreater=sum(goipc1_contabs_sims$isgreater)

simresults=c(cohort_size[k],
             incidence[j],
             or_pair1[i],
             or_pair2[l],
             pc1rawpc2_isgreater_raw_median,
             pc1rawpc2_isgreater_raw_uq,
             pc1rawpc2_isgreater_median,
             pc1rawpc2_isgreater_uq,
             pc1pc2_isgreater_raw_median,
             pc1pc2_isgreater_raw_uq,
             pc1pc2_isgreater_median,
             pc1pc2_isgreater_uq,
             goipc1_isgreater_raw_median,
             goipc1_isgreater_raw_uq,
             goipc1_isgreater_median,
             goipc1_isgreater_uq)

simresults_alldata=c(cohort_size[k],
             incidence[j],
             or_pair1[i],
             or_pair2[l],
    list(goipc1_contabs_sims$or),
    list(pc1pc2_contabs_sims$or))

######Comment out the following section if you wish to not gather all the 1,000 Odds ratios for each simulation####
simresults_compiled_alldata[[ct]]=simresults_alldata
# a=simresults_compiled_alldata[[1]]

simresults_compiled[ct,]=simresults

ct=ct+1
    }
  }
}
  toc()
}

12.996 sec elapsed
10.755 sec elapsed
12.917 sec elapsed
11.37 sec elapsed
9.556 sec elapsed
9.618 sec elapsed
8.819 sec elapsed
8.47 sec elapsed
8.485 sec elapsed
8.326 sec elapsed
8.582 sec elapsed
8.46 sec elapsed
9.803 sec elapsed
9.421 sec elapsed
9.728 sec elapsed
9.299 sec elapsed
9.627 sec elapsed

simresults_compiled=data.frame(simresults_compiled)
colnames(simresults_compiled)=c("cohort_size",
                                "incidence",
                                "or1",
                                "or2",
                                "pc1rawpc2_isgreater_raw_median",
                                "pc1rawpc2_isgreater_raw_uq",
                                "pc1rawpc2_isgreater_median",
                                "pc1rawpc2_isgreater_uq",
                                "pc1pc2_isgreater_raw_median",
                                "pc1pc2_isgreater_raw_uq",
                                "pc1pc2_isgreater_median",
                                "pc1pc2_isgreater_uq",
                                "goipc1_isgreater_raw_median",
                                "goipc1_isgreater_raw_uq",
                                "goipc1_isgreater_median",
                                "goipc1_isgreater_uq")
simresults_compiled$delta_median=simresults_compiled$goipc1_isgreater_raw_median-simresults_compiled$goipc1_isgreater_median
simresults_compiled$delta_uq=simresults_compiled$goipc1_isgreater_raw_uq-simresults_compiled$goipc1_isgreater_uq

simresults_compiled=simresults_compiled%>%
  mutate(fp_corrected_95=case_when(
    goipc1_isgreater_raw_uq<950~-1,
     (goipc1_isgreater_raw_uq>=950&(goipc1_isgreater_raw_uq-delta_uq)<=950)~1,
    TRUE~0),
    fp_corrected_99=case_when(
      goipc1_isgreater_raw_uq<990~-1,
    (goipc1_isgreater_raw_uq>=990&(goipc1_isgreater_raw_uq-delta_uq)<=990)~1,
    TRUE~0))

# simresults_compiled$goipc1_isgreater_percent=simresults_compiled$goipc1_isgreater*100/1000
simresults_subset=simresults_compiled%>%filter(cohort_size%in%500)

Here I am plotting the score obtained for all simulations. The score is defined as the percentage of GOI vs PC1 trials that fall in the median of the odds ratios of the frequency corrected PC1 vs PC2 simulations.

ggplot(simresults_compiled%>%filter(or1<=.4),aes(x=incidence,y=or2))+geom_tile(aes(fill=goipc1_isgreater_median/10))+facet_grid(cohort_size~or1)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")

Here I am plotting the score obtained for all simulations. The score is defined as the percentage of GOI vs PC1 trials that fall in the upper quartile of the odds ratios of the frequency corrected PC1 vs PC2 simulations. i.e. it is a less stringent test

ggplot(simresults_compiled%>%filter(or1<=.4),aes(x=incidence,y=or2))+geom_tile(aes(fill=goipc1_isgreater_uq/10))+facet_grid(cohort_size~or1)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Percent succesful trials")

A few more ways to look at the plots above:

ggplot(simresults_compiled%>%filter(or1<=.4,cohort_size%in%500),aes(x=incidence,y=or2))+geom_tile(aes(fill=goipc1_isgreater_median/10))+facet_wrap(~or1)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")

ggplot(simresults_compiled%>%filter(or1<=.4,cohort_size%in%500),aes(x=incidence,y=or2))+geom_tile(aes(fill=goipc1_isgreater_median/10))+facet_wrap(~or1,ncol=6)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")

ggplot(simresults_compiled%>%filter(or1%in%c(.05,0.1,.5),cohort_size%in%500),aes(x=incidence,y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_median/10))+facet_wrap(~or1,ncol=6)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_continuous(expand = c(0,0),name="Incidence")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1vsPC2")+ggtitle("Odds ratio of GOIvsPC1")+
  theme(plot.title = element_text(hjust = 0.5))

# ggsave("score_heatmap.pdf",width=8,heigh=3,units="in",useDingbats=F)

ggplot(simresults_compiled%>%filter(or1%in%c(.05,0.1,.5),cohort_size%in%c(100,500,900)),aes(x=incidence,y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_median/10))+facet_grid(cohort_size~or1)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_continuous(expand = c(0,0),name="Incidence")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1vsPC2")+ggtitle("Odds ratio of GOIvsPC1")+
  theme(plot.title = element_text(hjust = 0.5))

# ggsave("score_heatmap_supplement.pdf",width=8,heigh=7,units="in",useDingbats=F)

ggplot(simresults_compiled%>%filter(incidence%in%c(6,16,26,36),cohort_size%in%c(100,300,500,900)),aes(x=factor(or1),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_uq/10))+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Odds ratio of GOI and PC1")+facet_grid(cohort_size~incidence)+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(plot.title = element_text(hjust = 0.5))

ggplot(simresults_compiled%>%filter(incidence%in%c(6,16,26,36),cohort_size%in%500),aes(x=factor(or1),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_uq/10))+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Odds ratio of GOI and PC1")+facet_wrap(~incidence,ncol=4)+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(plot.title = element_text(hjust = 0.5),axis.text.x=element_text(angle=90,hjust=.5,vjust=.5),axis.text=element_text(face="bold",size="9",color="black"))

# ggsave("score_heatmap_bestoption1.pdf",width=8,heigh=2.5,units="in",useDingbats=F)
# sort(unique(simresults_compiled$or2))

ggplot(simresults_compiled%>%filter(cohort_size%in%500),aes(x=factor(or1),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_median/10))+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Odds ratio of GOI and PC1")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(plot.title = element_text(hjust = 0.5))

# ggsave("score_heatmap_bestoption.pdf",width=4,heigh=3,units="in",useDingbats=F)

Here, I am plotting how the score worsens as we get a worse positive control gene pair (worse in this case means having a higher odds ratio)

ggplot(simresults_compiled%>%filter(or2%in%.05,cohort_size%in%c(500)),aes(x=factor(or1),y=goipc1_isgreater_median/10))+geom_boxplot()+scale_x_discrete(name="Odds ratio of PC1vsPC2")+
  scale_y_continuous(name="Score")+
  theme(plot.title = element_text(hjust = 0.5))+cleanup

# ggsave("score_plot.pdf",width=8,heigh=3,units="in",useDingbats=F)

Here, I am looking at whether frequency correction of the positive control genes is in fact a more fair test of a comparison with gene of interest than no frequency correction. i.e. does it matter if the positive control genes are downsampled to the frequency of the gene of interest? Answer: yes. If the positive control genes were present at a lower frequency, it would be tougher to tell PC1 vs PC2 apart from GOI vs PC1.

ggplot(simresults_compiled%>%filter(or1<=.7,or1>=.2),aes(x=incidence,y=or2))+geom_tile(color="black",aes(fill=fp_corrected_95))+facet_grid(cohort_size~or1)+scale_fill_gradient2(low ="red" ,mid ="white",high ="blue",name="Regions where downsampling helps")

ggplot(simresults_compiled%>%filter(or1<=.7,or1>=.2),aes(x=incidence,y=or2))+geom_tile(color="black",aes(fill=fp_corrected_99))+facet_grid(cohort_size~or1)+scale_fill_gradient2(low ="red" ,mid ="white",high ="blue",name="Regions where downsampling helps")

The following plots went in Fig 1C, 1D, and Fig S2

ggplot(simresults_compiled%>%filter(cohort_size%in%500),aes(x=factor(or1),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_uq/10))+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Odds ratio of GOI and PC1")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(plot.title = element_text(hjust = 0.5))+theme(plot.title = element_text(hjust = 0.5),axis.text.x=element_text(angle=90,hjust=.5,vjust=.5),axis.text=element_text(face="bold",size="9",color="black"))

# ggsave("score_heatmap_bestoption.pdf",width=4,heigh=3,units="in",useDingbats=F)

ggplot(simresults_compiled%>%filter(or1%in%c(.05,.3,.6),incidence%in%c(4,8,12,16,20,24,28,32,36),cohort_size%in%500),aes(x=factor(incidence),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_median/10))+facet_wrap(~or1,ncol=6)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Incidence")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(strip.text=element_blank(),plot.title = element_text(hjust = 0.5),axis.text.x=element_text(angle=90,hjust=.5,vjust=.5),axis.text=element_text(face="bold",size="9",color="black"),legend.position = "none")

# ggsave("score_heatmap_bestoption1.pdf",width=6,heigh=2.5,units="in",useDingbats=F)

ggplot(simresults_compiled%>%filter(or1%in%c(.1,.2,.5),incidence%in%c(4,8,12,16,20,24,28,32,36),cohort_size%in%500),aes(x=factor(incidence),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_raw_median/10))+facet_wrap(~or1,ncol=6)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Incidence")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(strip.text=element_blank(),plot.title = element_text(hjust = 0.5),axis.text.x=element_text(angle=90,hjust=.5,vjust=.5),axis.text=element_text(face="bold",size="9",color="black"),legend.position = "none")

# ggsave("score_heatmap_bestoption1.pdf",width=6,heigh=2.5,units="in",useDingbats=F)

ggplot(simresults_compiled%>%filter(cohort_size%in%500,incidence%in%c(4,8,12,16,20,24,28,32,36)),aes(x=factor(incidence),y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_raw_median/10))+facet_wrap(~or1,ncol=4)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Incidence")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")+theme(plot.title = element_text(hjust = 0.5),axis.text.x=element_text(angle=90,hjust=.5,vjust=.5),axis.text=element_text(face="bold",size="9",color="black"),legend.position = "none")

# ggsave("score_heatmap_bestoption1_supplement.pdf",width=6,heigh=6,units="in",useDingbats=F)

ggplot(simresults_compiled%>%filter(or1%in%c(.05,.1,.5),cohort_size%in%c(100,300,500)),aes(x=incidence,y=or2))+geom_tile(color="black",aes(fill=goipc1_isgreater_median/10))+facet_grid(cohort_size~or1)+scale_fill_gradient2(low ="red" ,mid ="white",midpoint=50,high ="blue",name="Score")+scale_x_discrete(expand = c(0,0),name="Incidence")+
  scale_y_continuous(expand = c(0,0),name="Odds ratio of PC1 and PC2")

# a=simresults_compiled%>%filter(cohort_size%in%500,or1%in%.7,or2%in%.05,incidence%in%)

# simresults_subset=simresults_compiled%>%filter(or2%in%0.01,or1%in%c(0.01,.1,1))
simresults_subset=unlist(simresults_compiled_alldata)

simresults_unlisted=data.frame(unlist(lapply(simresults_compiled_alldata,'[[',1)),
             unlist(lapply(simresults_compiled_alldata,'[[',2)),
             unlist(lapply(simresults_compiled_alldata,'[[',3)),
             unlist(lapply(simresults_compiled_alldata,'[[',4)))
simresults_unlisted$list1=(lapply(simresults_compiled_alldata,'[[',5))
simresults_unlisted$list2=(lapply(simresults_compiled_alldata,'[[',6))

colnames(simresults_unlisted)=c("cohort_size","incidence","or1","or2","or1_list","or2_list")

simresults_unlisted=simresults_unlisted%>%filter(or2%in%0.01,or1%in%c(0.05,.1,1),cohort_size%in%500,incidence%in%c(4,8,12,16,20))
# library(reshape2)
a=simresults_unlisted%>%filter(incidence%in%8,or2%in%0.01,or1%in%.05)
# b=unnest(a)
median(b$or2_list)>median(b$or1_list)

# library(tidyr)
simresults_melted=unnest(simresults_unlisted)
simresults_melted2=melt(simresults_melted,
                        id.vars = c("cohort_size","or1","or2","incidence"),
                        measure.vars =c("or1_list","or2_list"),
                        variable.name = "Comparison",
                        value.name = "OR"
                        )
ggplot(simresults_melted2,aes(x=factor(incidence),y=OR,fill=Comparison))+facet_wrap(~or1)+geom_boxplot()+scale_y_continuous(trans="log2")+cleanup

plotly=ggplot(simresults_melted2,aes(x=factor(incidence),y=OR,fill=Comparison))+facet_wrap(~or1)+geom_boxplot()+cleanup
ggplotly(plotly)




simresults_unlisted=data.frame(unlist(lapply(simresults_compiled_alldata,'[[',1)),
             unlist(lapply(simresults_compiled_alldata,'[[',2)),
             unlist(lapply(simresults_compiled_alldata,'[[',3)),
             unlist(lapply(simresults_compiled_alldata,'[[',4)))
simresults_unlisted$list1=(lapply(simresults_compiled_alldata,'[[',5))
simresults_unlisted$list2=(lapply(simresults_compiled_alldata,'[[',6))

colnames(simresults_unlisted)=c("cohort_size","incidence","or1","or2","or1_list","or2_list")
simresults_unlisted=simresults_unlisted%>%filter(or1%in%.5,or2%in%c(0.05,.11,.19),cohort_size%in%500,incidence%in%c(4,8,12,16,20))

simresults_melted=unnest(simresults_unlisted)
simresults_melted2=melt(simresults_melted,
                        id.vars = c("cohort_size","or1","or2","incidence"),
                        measure.vars =c("or1_list","or2_list"),
                        variable.name = "Comparison",
                        value.name = "OR"
                        )

ggplot(simresults_melted2,aes(x=factor(incidence),y=OR,fill=Comparison))+facet_wrap(~or2)+geom_boxplot()+scale_y_continuous(trans="log2")+cleanup

plotly=ggplot(simresults_melted2,aes(x=factor(incidence),y=OR,fill=Comparison))+facet_wrap(~or2)+geom_boxplot()+cleanup
ggplotly(plotly)
# sort(unique(simresults_unlisted$or2))

#I used the following website to make sure that my Gaussian elimination was working properly
# https://www.emathhelp.net/calculators/linear-algebra/gauss-jordan-elimination-calculator/?i=%5B%5B1%2C1%2C0%2C0%2C.4836%5D%2C%5B1%2C1%2C1%2C1%2C1%5D%2C%5B0%2C1%2C0%2C-1%2C0%5D%2C%5B1%2C0%2C-0.01%2C0%2C0%5D%5D&reduced=on

sessionInfo()

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocManager_1.30.10 plotly_4.9.2.1      ggsignif_0.6.0     
 [4] devtools_2.3.0      usethis_1.6.1       RColorBrewer_1.1-2 
 [7] reshape2_1.4.4      ggplot2_3.3.3       doParallel_1.0.15  
[10] iterators_1.0.12    foreach_1.5.0       dplyr_1.0.6        
[13] VennDiagram_1.6.20  futile.logger_1.4.3 workflowr_1.6.2    
[16] tictoc_1.0          knitr_1.28         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6         tidyr_1.1.3          prettyunits_1.1.1   
 [4] ps_1.3.3             assertthat_0.2.1     rprojroot_1.3-2     
 [7] digest_0.6.25        utf8_1.1.4           R6_2.4.1            
[10] plyr_1.8.6           futile.options_1.0.1 backports_1.1.7     
[13] evaluate_0.14        httr_1.4.2           pillar_1.6.1        
[16] rlang_0.4.11         lazyeval_0.2.2       data.table_1.12.8   
[19] callr_3.7.0          rmarkdown_2.8        labeling_0.3        
[22] desc_1.2.0           stringr_1.4.0        htmlwidgets_1.5.1   
[25] munsell_0.5.0        compiler_4.0.0       httpuv_1.5.2        
[28] xfun_0.22            pkgconfig_2.0.3      pkgbuild_1.0.8      
[31] htmltools_0.4.0      tidyselect_1.1.0     tibble_3.1.2        
[34] codetools_0.2-16     viridisLite_0.3.0    fansi_0.4.1         
[37] crayon_1.4.1         withr_2.4.2          later_1.0.0         
[40] jsonlite_1.7.2       gtable_0.3.0         lifecycle_1.0.0     
[43] DBI_1.1.0            git2r_0.27.1         magrittr_2.0.1      
[46] formatR_1.7          scales_1.1.1         cli_2.5.0           
[49] stringi_1.4.6        farver_2.0.3         fs_1.4.1            
[52] promises_1.1.0       remotes_2.1.1        testthat_2.3.2      
[55] ellipsis_0.3.2       generics_0.0.2       vctrs_0.3.8         
[58] lambda.r_1.2.4       tools_4.0.0          glue_1.4.1          
[61] purrr_0.3.4          processx_3.5.2       pkgload_1.0.2       
[64] yaml_2.2.1           colorspace_1.4-1     sessioninfo_1.1.1   
[67] memoise_1.1.0