Introduction
We are simulating the combination of MRI datasets across multiple sites with different scanners and patient populations. In this notebook we test which calibration method works best for a logistic regression that classifies patients with multiple sclerosis (MS) into low and high severity groups.
The three calibration methods are:
Protocol: A nominal variable (representing the site of acquisition) is added to the model:
Group ~ Age + BV + Protocol
Age-calibrated: Brain volumes are corrected by the healthy control population, and age dependence is removed:
Group ~ BV(HC)
Percentile-calibrated: Brain volumes of patients are converted to percentile scores based on the brain volumes of healthy controls (a code sketch of both calibration transforms follows this list).
Group ~ pBV(HC) + Age
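The actual calibration routines live in utils (run_calib and run_p_calib, used later in this notebook) and are not shown here. As a rough illustration of the two transforms described above, a minimal sketch might look like the following, where the array names (bv_hc, age_hc, bv_ms, age_ms) are hypothetical:
import numpy as np
from scipy import stats

def age_calibrate(bv_ms, age_ms, bv_hc, age_hc):
    # Sketch of the age calibration: fit BV ~ Age on the healthy controls,
    # then express patient volumes as z-scores of the residuals from that fit.
    slope, intercept, _, _, _ = stats.linregress(age_hc, bv_hc)
    resid_hc = bv_hc - (intercept + slope * age_hc)   # HC residuals set the scale
    resid_ms = bv_ms - (intercept + slope * age_ms)   # remove age dependence from patients
    return resid_ms / np.std(resid_hc)

def percentile_calibrate(bv_ms, bv_hc):
    # Sketch of the percentile calibration: each patient volume becomes its
    # percentile within the healthy-control distribution.
    return np.array([stats.percentileofscore(bv_hc, b) for b in bv_ms])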
In previous analyses, we found that in biased conditions, the age-calibrated method results in more powerful and accurate coefficient estimates for modelling brain volume as a function of age and EDSS (a clinical measure of MS severity):
BV ~ Age + EDSS
Because calibration methods depend on unbiased healthy control populations, 14 multisite healthy controls will be used to test for bias in the healthy control populations of each site. Previously, we found that the percentile calibrated method is more sensitive to bias in healthy control populations. On the other hand, despite small biases in the healthy controls, the age calibrated method is still more powerful than the protocol method in EDSS coefficient estimates.
%pylab inline
from utils import *  # simulation and plotting helpers: sim_data, run_calib, run_p_calib, grid, plot_grid_box, get_table, run_model, run_combined_log_models, ...
from IPython.html.widgets import interact, interactive, fixed
from IPython.html import widgets
from IPython.display import clear_output, display, HTML
import sklearn.linear_model as lm
import time
from mpld3 import enable_notebook, disable_notebook
from mpld3 import plugins
params = {'n_hc':50, #number of controls
'n_ms':[250,250], #number of MS patients at each site
'age_hc_range':(20,52), # Age range of the controls
'age_ms_range':[(20,45),(20,45)], # Age range of MS patients at each site
'd_dist':[.65,.65], # Distribution weighting of EDSS scores for each site
'alpha_a':-3, # Real coefficient of age
'alpha_d':[-16,-16], # Real coefficient of EDSS score
'b0':1644, #Brain Volume Intercept (in ccs)
'sig_b':[57,57], # Brain variation across subjects (in ccs) for each site
'sig_s':[16,16], # Variability in brain volume measurement across scanners for each site
'alpha':[0.7,1.3], # Scaling of each scanner
'gamma':[20,-20], # Offset of each scanner
'lo_mean':[2,2], # EDSS distribution parameters at both sites
'hi_mean':[6,6], #EDSS distribution parameters at both sites
'lo_sig': [1.1,1.1], #EDSS distribution parameters at both sites
'hi_sig': [0.8,0.8], #EDSS distribution parameters at both sites
}
sim = sim_data(**params)
Assigning Groups: Log Likelihood Ratio¶
We first need to convert EDSS scores into a binary variable that represents two severity groups. We do this by calculating the log-likelihood ratio.
The Log Likelihood Ratio (LLR) is defined as:
$LLR = \log\left(\frac{Pr_{\theta_L}(E)}{Pr_{\theta_H}(E)}\right)$
Where $Pr_{\theta_L}$ is the probability density function (PDF) of a normal distribution centered at an EDSS of 2 with a standard deviation of 1.1 (the low group), and $Pr_{\theta_H}$ is the PDF of a normal distribution centered at an EDSS of 6 with a standard deviation of 0.8 (the high group).
We define the groups as (see the code sketch below):
Low: $LLR > 0$
High: $LLR < 0$
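As a concrete illustration, the LLR and the group assignment could be computed like this (a minimal sketch using scipy; the notebook's own LLR values come from sim_data):
import numpy as np
from scipy.stats import norm

def edss_llr(edss, lo_mean=2.0, lo_sig=1.1, hi_mean=6.0, hi_sig=0.8):
    # Log-likelihood ratio of the low-severity model vs. the high-severity model
    return norm.logpdf(edss, lo_mean, lo_sig) - norm.logpdf(edss, hi_mean, hi_sig)

edss = np.array([1.5, 3.0, 4.5, 6.5])
llr = edss_llr(edss)
low_group = llr > 0   # LLR > 0 -> "Low" severity; LLR < 0 -> "High"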
Below we plot the distribution of log likelihood ratios at each site:
_,x = subplots(1,2,figsize=(10,5))
[ppl.remove_chartjunk(i,["top","right"]) for i in x]
x[1].set_xlabel("EDSS LLR")
x[1].set_title("Site 2",size=14)
x[0].set_xlabel("EDSS LLR")
x[0].set_title("Site 1",size=14)
ppl.hist(x[0],sim["llr_1"],color=ppl.set1[0],alpha=0.7);
ppl.hist(x[1],sim["llr_2"],color=ppl.set1[1],alpha=0.7);
suptitle("EDSS Lo/Hi Log-Likelihood Ratio", size=16);
Now we can look at how brain volume, EDSS, and age relate, with patients colored by their log-likelihood group.
_,x=subplots(2,2,figsize=(12,12))
x = x.ravel()
llr = sim["llr_1"] > 0  # boolean mask: True = low-severity group at site 1
ppl.plot(x[0],sim["age_ms_1"][llr],sim["BV_ms_1"][llr],marker='o',linewidth=0,
markersize=10,color=ppl.set2[0],label="Lo",mew=0.5,mec=ppl.almost_black,alpha=0.9);
ppl.plot(x[0],sim["age_ms_1"][llr==False],sim["BV_ms_1"][llr==False],marker='o',linewidth=0,
markersize=10,color=ppl.set2[1],label="Hi",mew=0.5,mec=ppl.almost_black,alpha=0.9);
ppl.plot(x[1],sim["dis_1"][llr],sim["BV_ms_1"][llr],marker='o',linewidth=0,
markersize=10,color=ppl.set2[0],label="Lo",mew=0.5,mec=ppl.almost_black,alpha=0.9);
ppl.plot(x[1],sim["dis_1"][llr==False],sim["BV_ms_1"][llr==False],marker='o',linewidth=0,
markersize=10,color=ppl.set2[1],label="Hi",mew=0.5,mec=ppl.almost_black,alpha=0.9);
x[0].set_ylabel("Brain Volume",size=14)
x[0].set_xlabel("Age",size=14)
x[1].set_xlabel("EDSS",size=14)
llr = sim["llr_2"] > 0  # boolean mask: True = low-severity group at site 2
ppl.plot(x[2],sim["age_ms_2"][llr],sim["BV_ms_2"][llr],marker='o',linewidth=0,
markersize=10,color=ppl.set2[0],label="Lo",mew=0.5,mec=ppl.almost_black,alpha=0.9);
ppl.plot(x[2],sim["age_ms_2"][llr==False],sim["BV_ms_2"][llr==False],marker='o',linewidth=0,
markersize=10,color=ppl.set2[1],label="Hi",mew=0.5,mec=ppl.almost_black,alpha=0.9);
ppl.plot(x[3],sim["dis_2"][llr],sim["BV_ms_2"][llr],marker='o',linewidth=0,
markersize=10,color=ppl.set2[0],label="Lo",mew=0.5,mec=ppl.almost_black,alpha=0.9);
ppl.plot(x[3],sim["dis_2"][llr==False],sim["BV_ms_2"][llr==False],marker='o',linewidth=0,
markersize=10,color=ppl.set2[1],label="Hi",mew=0.5,mec=ppl.almost_black,alpha=0.9);
x[2].set_ylabel("Brain Volume",size=14)
x[2].set_xlabel("Age",size=14)
x[3].set_xlabel("EDSS",size=14)
ppl.legend(x[2]);
x[2].set_title("Site 2", size=14)
x[3].set_title("Site 2", size=14)
x[1].set_title("Site 1", size=14)
x[0].set_title("Site 1", size=14)
ppl.legend(x[3]);
ppl.legend(x[0]);
#suptitle("Site 1",size=14)
ppl.legend(x[1]);
suptitle("EDSS Severity Groups",size=16);
Here we see that the two severity groups split at an EDSS value of around 4.5. The next step is to classify the patients into two groups based on brain volume measures and age.
Classification via Logistic Regression¶
We can run a logistic regression on the data and look at the accuracy of both sites:
Site 1:
import sklearn.linear_model as lm
a = lm.LogisticRegression()
X1 = np.vstack((sim["BV_ms_1"],sim["age_ms_1"])).T
y1 = sim["llr_1"]>0
X1 = sm.add_constant(X1)
a.fit(X1,y1)
acc1 = a.score(X1,y1)
print "Accuracy:", acc1
Site 2:
a = lm.LogisticRegression()
X2 = np.vstack((sim["BV_ms_2"],sim["age_ms_2"])).T
X2 = sm.add_constant(X2)
y2 = sim["llr_2"]>0
a.fit(X2,y2)
acc2 = a.score(X2,y2)
print "Accuracy:", acc2
Next we run the percentile calibration from earlier and visualize our logistic regression at each site separately:
run_p_calib(sim)
disable_notebook()
_,x = subplots(1,2,figsize=(12,4))
sns.set_style("darkgrid")
df1 = pd.DataFrame(index=range(params["n_ms"][0]))
df1["pBV"] = sim["pBV_ms_1"]
df1["age"] = sim["age_ms_1"]
df1["EDSS"] = sim["dis_1"]
df1["Low Group"] = sim["llr_1"] > 0
iax = sns.interactplot("age","pBV","Low Group", df1,
cmap="coolwarm", filled=True, levels=25,ax=x[0],
colorbar=True,logistic=True,
scatter_kws={"color": "black"},
contour_kws={"alpha": .8});
#iax.figure.colorbar(iax)
sns.set_style("darkgrid")
df2 = pd.DataFrame(index=range(params["n_ms"][1]))
df2["pBV"] = sim["pBV_ms_2"]
df2["age"] = sim["age_ms_2"]
df2["EDSS"] = sim["dis_2"]
df2["Low Group"] = sim["llr_2"] > 0
sns.interactplot("age","pBV","Low Group", df2,
cmap="coolwarm", filled=True, levels=25,ax=x[1],
colorbar=True,logistic=True,
scatter_kws={"color": "black"},
contour_kws={"alpha": .8});
x[1].set_title("Site 2\nAccuracy = %2.2f"%(acc2),size=16);
x[0].set_title("Site 1\n Accuracy = %2.2f"%acc1,size=16);
def get_ROC(X, y):
    from sklearn.metrics import roc_curve, auc
    from sklearn.utils import shuffle
    import sklearn.linear_model as lm
    # Shuffle and split the data in half: train on one half, test on the other
    X, y = shuffle(X, y)
    half = int(len(y) / 2)
    X_train, X_test = X[:half], X[half:]
    y_train, y_test = y[:half], y[half:]
    # Run classifier
    classifier = lm.LogisticRegression()
    classifier.fit(X_train, y_train)
    probas_ = classifier.predict_proba(X_test)
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
    roc_auc = auc(fpr, tpr)
    return fpr, tpr, roc_auc
roc_info = map(get_ROC,[X1]*100,[y1]*100)
roc_info2 = map(get_ROC,[X2]*100,[y2]*100)
roc_info = np.asarray(roc_info)
roc_info2 = np.asarray(roc_info2)
fpr = roc_info[:,0]
tpr = roc_info[:,1]
auc = roc_info[:,2]
fpr2 = roc_info2[:,0]
tpr2 = roc_info2[:,1]
auc2 = roc_info2[:,2]
We can also look at the ROC curves of the two sites, and the areas under the curve to get a better estimate of accuracy:
_,x=subplots(1,2,figsize=(12,4))
x[0].errorbar(np.mean(fpr),np.mean(tpr),xerr = np.std(fpr),yerr=np.std(tpr),alpha=0.15,color=ppl.set1[0],linewidth=3)
x[0].plot(np.mean(fpr),np.mean(tpr),color=ppl.set1[0],linewidth=3,label="Site 1")
x[0].errorbar(np.mean(fpr2),np.mean(tpr2),xerr = np.std(fpr2),yerr=np.std(tpr2),color=ppl.set1[1],alpha=0.15,linewidth=3)
x[0].plot(np.mean(fpr2),np.mean(tpr2),color=ppl.set1[1],linewidth=3,label="Site 2")
x[0].set_xlim([0,1])
x[0].set_ylim([0,1])
ppl.plot(x[0],[0, 1], [0, 1],color=ppl.set1[2],linewidth=2,linestyle='--')
x[1].set_title("Area Under the Curve",size=16)
x[0].set_title("ROC Curve",size=16)
x[0].set_xlabel("False Positive Rate")
x[0].set_xlabel("True Positive Rate")
x[0].legend(loc=0,fontsize=14)
ppl.boxplot(x[1],[auc,auc2]);
x[1].set_xticklabels(["Site 1","Site 2"]);
df_all = pd.DataFrame(index=range(sum(params["n_ms"])))
df_all["age"] = sim["age_all_ms"]
df_all["pBV"] = np.hstack((sim["pBV_ms_1"],sim["pBV_ms_2"]))
df_all["llr"] = np.hstack((sim["llr_1"],sim["llr_2"]))>0
a = lm.LogisticRegression()
X = np.vstack((df_all["pBV"],df_all["age"])).T
X = sm.add_constant(X)
y = df_all["llr"]
a.fit(X,y)
acc = a.score(X,y)
roc_info_all = map(get_ROC,[X]*100,[y]*100)
roc_info_all = np.asarray(roc_info_all)
fpr_all = roc_info_all[:,0]
tpr_all = roc_info_all[:,1]
auc_all = roc_info_all[:,2]
# ROC curves from different bootstrap splits have different lengths, so truncate
# each to the shortest length before averaging (only an approximation)
minlen = min([len(x) for x in fpr_all])
fpr_all = np.asarray([x[:minlen] for x in fpr_all])
tpr_all = np.asarray([x[:minlen] for x in tpr_all])
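Truncating to the shortest curve is only an approximation, as the comment notes. A common alternative (sketched below, not what this notebook does) is to interpolate each bootstrap ROC curve onto a common false-positive-rate grid before averaging:
import numpy as np

def average_roc(fpr_list, tpr_list, n_points=100):
    # Interpolate each bootstrap ROC curve onto a shared false-positive-rate grid,
    # then average the true-positive rates pointwise (fpr from roc_curve is sorted).
    grid_fpr = np.linspace(0, 1, n_points)
    tprs = np.asarray([np.interp(grid_fpr, f, t) for f, t in zip(fpr_list, tpr_list)])
    return grid_fpr, tprs.mean(axis=0), tprs.std(axis=0)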
Here we combine the data by the percentile calibration method and compare the ROC curves:
# Plot ROC curve
_,x=subplots(1,2,figsize=(12,4))
sns.interactplot("age","pBV","llr", df_all,cmap="coolwarm",
filled=True, levels=20,colorbar=True,logistic=True,ax=x[0],
scatter_kws={"color": "black"},
contour_kws={"alpha": .8});
x[0].set_title("Logistic Regression\n %2.2f Accuracy"%(acc),size= 16)
x[1].errorbar(np.mean(fpr),np.mean(tpr),xerr = np.std(fpr),yerr=np.std(tpr),alpha=0.15,color=ppl.set1[0],linewidth=3)
x[1].plot(np.mean(fpr),np.mean(tpr),color=ppl.set1[0],linewidth=3,label="Site 1, AUC = %2.2f"%(np.mean(auc)))
x[1].errorbar(np.mean(fpr2),np.mean(tpr2),xerr = np.std(fpr2),yerr=np.std(tpr2),color=ppl.set1[1],alpha=0.15,linewidth=3)
x[1].plot(np.mean(fpr2),np.mean(tpr2),color=ppl.set1[1],linewidth=3,label="Site 2, AUC = %2.2f"%(np.mean(auc2)))
x[1].errorbar(np.mean(fpr_all,axis=0),np.mean(tpr_all,axis=0),xerr = np.std(fpr_all,axis=0),
yerr=np.std(tpr_all,axis=0),alpha=0.15,color=ppl.set1[2],linewidth=3)
x[1].plot(np.mean(fpr_all,axis=0),np.mean(tpr_all,axis=0),color=ppl.set1[2],linewidth=3,
label="Combined, AUC = %2.2f"%(np.mean(auc_all)))
ppl.plot(x[1],[0, 1], [0, 1],color=ppl.set1[3],linewidth=2,linestyle="--")
x[1].set_xlim([0.0, 1.0])
x[1].set_ylim([0.0, 1.0])
x[1].set_xlabel('False Positive Rate', size=14)
x[1].set_ylabel('True Positive Rate',size=14)
x[1].set_title('ROC',size=16)
x[1].legend(loc="lower right",fontsize=14);
Here we see that the overall accuracy increases when the two sites are calibrated, then combined.
Next, we will bootstrap the analysis (by repeatedly drawing from the simulated population) and look at the differences in coefficient power, accuracy, and ROC area for the three calibration methods (Protocol, Age (or "z"), and Percentile).
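The heavy lifting below is done by the grid() helper from utils (not shown here). As a rough sketch of one arm of the comparison, assuming the sim keys used elsewhere in this notebook (dis_1, BV_ms_1_z) and treating the OLS t-statistic as the coefficient z-score, a single bootstrap might look like:
import numpy as np
import statsmodels.api as sm
from utils import sim_data, run_calib   # simulation helpers used throughout this notebook

def bootstrap_edss_z(params, n_iter=100):
    # Rough sketch of one arm of the bootstrap: re-simulate the population n_iter
    # times and record the z-score of the EDSS coefficient for the age-calibrated
    # model at site 1 (grid() does this for all methods and sites at once).
    z = np.zeros(n_iter)
    for i in range(n_iter):
        sim = sim_data(**params)                  # fresh draw from the simulated population
        run_calib(sim)                            # age-calibrated brain volumes
        X = sm.add_constant(sim["dis_1"])         # EDSS scores at site 1
        fit = sm.OLS(sim["BV_ms_1_z"], X).fit()   # model: z(BV) ~ EDSS
        z[i] = fit.tvalues[1]                     # z-score (power) of the EDSS coefficient
    return z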
Coefficient Power Estimates¶
First, let's look at the coefficient power of the disease variable from the linear regression:
Linear Regression¶
n_ms_range = [[50,50],[100,100],[250,250],[500,500]]
a = deepcopy(params)
zcal,pcal,z_pro = grid(a,n_ms_range,'n_ms',100)
a["n_ms"] = ["??","??"]
foo = get_table(a)
fig,ax = subplots(1,4,figsize=(20,5))
diff_dis = zcal[-1][-1,:,:] + z_pro[-1][2,:,:]
doo = np.zeros((1,4,100))
doo[0,:,:] = diff_dis
ax0 = plot_grid_box(doo,[d[0] for d in n_ms_range],["Z Calibration - Protocol"],
"Difference Between Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[0]);
ax[-1].text(0,-0.2,get_table(a),size=16);
ax0[0].set_ylim([-5,5])
ax[-1].grid("off");
ax[-1].axis("off");
diff_dis = zcal[-1][-1,:,:] + pcal[-1][-1,:,:]
doo = np.zeros((1,4,100))
doo[0,:,:] = diff_dis
ax0 = plot_grid_box(doo,[d[0] for d in n_ms_range],["Z Calibration - P Calibration"],
"Difference Between Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[1]);
ax0[0].set_ylim([-1,10])
diff_dis = np.abs(pcal[-1][-1,:,:]) - np.abs(z_pro[-1][2,:,:])
doo = np.zeros((1,4,100))
doo[0,:,:] = diff_dis
ax0 = plot_grid_box(doo,[d[0] for d in n_ms_range],["P Calibration - Protocol"],
"Difference Between Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[2]);
We see here that the power of the percentile calibration is lower than that of both the age-calibrated and the protocol models. Given that the sites are identical except for the scanner scaling and offset, the age-calibrated and protocol methods are mostly equivalent. As the number of MS patients increases, the percentile calibration falls further behind the age-calibrated and protocol methods.
Next we will look at differences in coefficient power from the logistic regression:
Logistic Regression¶
n_ms_range = [[50,50],[100,100],[250,250],[500,500]]
a = deepcopy(params)
zcal,pcal,z_pro = grid(a,n_ms_range,'n_ms',100,model=sm.Logit,combine_type=run_combined_log_models)
a["n_ms"] = ["??","??"]
foo = get_table(a)
fig,ax = subplots(1,4,figsize=(20,5))
diff_dis = zcal[-1][-1,:,:] + z_pro[-1][2,:,:]
doo = np.zeros((1,4,100))
doo[0,:,:] = diff_dis
ax0 = plot_grid_box(doo,[d[0] for d in n_ms_range],["Z Calibration - Protocol"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[0]);
ax[-1].text(0,-0.2,get_table(a),size=16);
ax0[0].set_ylim([-2,2])
ax[-1].grid("off");
ax[-1].axis("off");
diff_dis = zcal[-1][-1,:,:] + pcal[-1][-1,:,:]
doo = np.zeros((1,4,100))
doo[0,:,:] = diff_dis
ax0 = plot_grid_box(doo,[d[0] for d in n_ms_range],["Z Calibration - P Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[1]);
ax0[0].set_ylim([-2,2])
diff_dis = np.abs(pcal[-1][-1,:,:]) - np.abs(z_pro[-1][2,:,:])
doo = np.zeros((1,4,100))
doo[0,:,:] = diff_dis
ax0 = plot_grid_box(doo,[d[0] for d in n_ms_range],["P Calibration - Protocol"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[2]);
When we run the logistic regression and look at the coefficient estimates, we see a similar trend to the linear regression: the percentile calibration is considerably worse than both the protocol and age-calibrated models. Why is this?
I think the age-calibrated model is the strongest because it has the fewest regressors (just brain volume), since age is already corrected for. The protocol model is stronger than the percentile calibration because patients whose brain volumes fall outside the range of the healthy controls are all mapped to the same percentile score; when that variance is lost, the model may be harder to fit.
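The loss of variance is easy to see in a toy example (a sketch using scipy.stats.percentileofscore, not the notebook's own run_p_calib): every patient whose brain volume falls below the smallest healthy-control volume is mapped to exactly the same percentile.
import numpy as np
from scipy import stats

hc_bv = np.array([1550., 1580., 1600., 1620., 1650.])   # toy healthy-control volumes (cc)
ms_bv = np.array([1400., 1450., 1500., 1560., 1610.])   # toy patient volumes (cc)
pct = np.array([stats.percentileofscore(hc_bv, b) for b in ms_bv])
# pct -> [0., 0., 0., 20., 60.]: the three most atrophied patients all collapse to
# percentile 0, so the variation between them no longer reaches the model.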
While the age-calibrated method was slightly superior to the protocol model for the linear regression, the protocol model looks slightly superior in the logistic regression case. I'm not entirely sure why this is.
ROC Curves¶
Each time a logistic regression is run, we want to calculate the area under the ROC curve. Then we will compare the two calibration models and study the effect of site asymmetry on classification.
def run_log_model(xi, y, add_intercept=True):
    a = lm.LogisticRegression()
    # Stack the list of regressors into a design matrix
    k = len(xi)
    X = xi[0]
    for i in range(1, k):
        X = np.column_stack((X, xi[i]))
    if add_intercept:
        X = sm.add_constant(X)
    a.fit(X, y)
    # ROC: bootstrap 100 train/test splits and average the AUC
    roc_info = map(get_ROC, [X]*100, [y]*100)
    roc_info = np.asarray(roc_info)
    fpr = roc_info[:, 0]
    tpr = roc_info[:, 1]
    auc = roc_info[:, 2]
    return np.mean(auc)
def run_all_roc(params):
    import sys
    sys.path.append("/home/akeshavan/software/bioe_rotation3/notebooks")
    from utils import sim_data, run_calib, run_p_calib
    from utils_3 import run_log_model
    import numpy as np
    sim = sim_data(**params)
    run_calib(sim)
    run_p_calib(sim)
    # AUC for each site alone, and for the combined data under each calibration
    site1 = run_log_model([sim["BV_ms_1"], sim["age_ms_1"]], (sim["llr_1"] > 0).astype(int))
    site2 = run_log_model([sim["BV_ms_2"], sim["age_ms_2"]], (sim["llr_2"] > 0).astype(int))
    allz = run_log_model([sim["BV_all_z"], sim["age_all_ms"]], (sim["llr_all"] > 0).astype(int))
    allp = run_log_model([sim["pBV_all_ms"], sim["age_all_ms"]], (sim["llr_all"] > 0).astype(int))
    return site1, site2, allz, allp
from IPython.parallel.client.client import Client
from itertools import repeat
c = Client()
view = c[:]
view.block = True
roc = view.map(run_all_roc, [params]*100)
roc = np.asarray(roc)
The graph below shows the distribution of areas under the ROC curves when the sites are identical:
fig, ax = subplots(1,2,figsize=(15,6))
sns.kdeplot(roc[:,0], label="Site 1",ax=ax[0],color=ppl.set2[0],linewidth=4)
sns.kdeplot(roc[:,1], label="Site 2",ax=ax[0],color=ppl.set2[1],linewidth=4)
sns.kdeplot(roc[:,2], label="Combined (age)",ax=ax[0],color=ppl.set2[2],linewidth=4)
sns.kdeplot(roc[:,3], label="Combined (percentile)",ax=ax[0],color=ppl.set2[3],linewidth=4);
ppl.bar(ax[1],range(4),np.mean(roc,axis=0),yerr = np.std(roc,axis=0),color=ppl.set2[:4],annotate=True,alpha=0.8)
ax[0].legend(fontsize=12,loc=0)
ax[0].set_xlabel("AUC", size=14)
ax[0].set_ylabel("Frequency", size=14)
ax[1].set_ylabel("AUC", size=14)
ax[1].set_xticks(arange(4)+0.5)
ax[1].set_xticklabels(["Site 1","Site 2","Combined\nAge","Combined\nPercentile"],size=14)
ax[0].set_title("Distribution",size=16);
ax[1].set_title("Average",size=16);
suptitle("Area under the Curve (AUC)",size=18);
The combined scores are higher than those of the individual sites because the larger sample size gives more power. We run a paired t-test between the age and percentile calibrations to see whether the age calibration consistently outperforms the percentile calibration:
t,p = stats.ttest_rel(roc[:,2],roc[:,3])
print "Paired T Test p value:", p
The test is inconclusive - perhaps the calibration methods are equivalent when the sites are very similar.
What happens if the two sites are different (i.e., different age ranges of patients, different numbers of patients, etc.)?
p = {'n_hc':50, #number of controls
'n_ms':[250,100], #number of MS patients at each site
'age_hc_range':(20,52), # Age range of the controls
'age_ms_range':[(20,45),(20,75)], # Age range of MS patients at each site
'd_dist':[.65,.88], # Distribution weighting of EDSS scores for each site
'alpha_a':-3, # Real coefficient of age
'alpha_d':[-16,-16], # Real coefficient of EDSS score
'b0':1644, #Brain Volume Intercept (in ccs)
'sig_b':[57,57], # Brain variation across subjects (in ccs) for each site
'sig_s':[16,16], # Variability in brain volume measurement across scanners for each site
'alpha':[0.7,1.3], # Scaling of each scanner
'gamma':[20,-20], # Offset of each scanner
'lo_mean':[2,2], # EDSS distribution parameters at both sites
'hi_mean':[6,6], #EDSS distribution parameters at both sites
'lo_sig': [1.1,1.1], #EDSS distribution parameters at both sites
'hi_sig': [0.8,0.8], #EDSS distribution parameters at both sites
}
p
roc2 = map(run_all_roc, [p]*100)
roc2 = np.asarray(roc2)
fig, ax = subplots(1,2,figsize=(15,6))
sns.kdeplot(roc2[:,0], label="Site 1",ax=ax[0],color=ppl.set2[0],linewidth=4)
sns.kdeplot(roc2[:,1], label="Site 2",ax=ax[0],color=ppl.set2[1],linewidth=4)
sns.kdeplot(roc2[:,2], label="Combined (age)",ax=ax[0],color=ppl.set2[2],linewidth=4)
sns.kdeplot(roc2[:,3], label="Combined (percentile)",ax=ax[0],color=ppl.set2[3],linewidth=4);
ppl.bar(ax[1],range(4),np.mean(roc2,axis=0),yerr = np.std(roc2,axis=0),color=ppl.set2[:4],annotate=True,alpha=0.8)
ax[0].legend(fontsize=12,loc=0)
ax[0].set_xlabel("AUC", size=14)
ax[0].set_ylabel("Frequency", size=14)
ax[1].set_ylabel("AUC", size=14)
ax[1].set_xticks(arange(4)+0.5)
ax[1].set_xticklabels(["Site 1","Site 2","Combined\nAge","Combined\nPercentile"],size=14)
ax[0].set_title("Distribution",size=16);
ax[1].set_title("Average",size=16);
suptitle("Area under the Curve (AUC)",size=18);
Here we see that the age calibration is slightly higher than the percentile calibration, but we must run the paired t-test to check whether the pairwise difference is significantly different from 0:
t,p = stats.ttest_rel(roc2[:,2],roc2[:,3])
print "Paired T Test p value:", p
Here we see that the age calibration consistently outperforms the percentile calibration method on the ROC area metric.
Finally, we will test coefficient accuracy by first calculating the ideal coefficient values and then testing how accuracy differs across calibration methods.
Coefficient Accuracy¶
First we need to figure out what the real coefficients are for the logistic regression. We do this by increasing the number of subjects and decreasing the scanner variance.
p = deepcopy(params)
p["alpha"] = [1,1]
p["gamma"] = [0,0]
p["sig_s"] = [0.000001,0.000001]
p["n_ms"] = [100000,100000]
p["d_dist"] = [0.65,0.65]
p["age_ms_range"] = [(20,75),(20,75)]
sim = sim_data(**p)
p
Using the parameters above, we first run the protocol model and calculate the ideal coefficients:
coefs = np.zeros((10,4))
for i in range(10):
    sim = sim_data(**p)
    run_calib(sim)
    run_p_calib(sim)
    l = run_model([sim["age_all_ms"], sim["BV_all_ms"], sim["protocol_ms"]], sim["llr_all"],
                  model=sm.Logit, xnames=["const", "Age", "BV", "Protocol"], yname="llr", write=False)
    coefs[i, :] = l.params
real_vals = np.mean(coefs,axis=0)
std_vals = np.std(coefs,axis=0)
print "Intercept:", real_vals[0],"+-", std_vals[0]
print "Age:", real_vals[1],"+-", std_vals[1]
print "Brain Volume:", real_vals[2],"+-", std_vals[2]
print "Protocol:", real_vals[3],"+-", std_vals[3]
Now we find the real coefficients for the age-calibrated case:
coefs_z = np.zeros((100,2))
for i in range(100):
    sim = sim_data(**p)
    run_calib(sim)
    l = run_model([sim["BV_ms_1_z"]], sim["llr_1"],
                  model=sm.Logit, xnames=["const", "BV"], yname="llr", write=False)
    coefs_z[i, :] = l.params
real_vals_z = np.mean(coefs_z,axis=0)
std_vals_z = np.std(coefs_z,axis=0)
print "Intercept:", real_vals_z[0],"+-", std_vals_z[0]
print "Brain Volume:", real_vals_z[1],"+-", std_vals_z[1]
For the percentile calibration, our coefficients should be:
coefs_p = np.zeros((100,3))
for i in range(100):
    sim = sim_data(**p)
    run_p_calib(sim)
    l = run_model([sim["pBV_ms_1"], sim["age_ms_1"]], sim["llr_1"],
                  model=sm.Logit, xnames=["const", "BV", "Age"], yname="llr", write=False)
    coefs_p[i, :] = l.params
real_vals_p = np.mean(coefs_p,axis=0)
std_vals_p = np.std(coefs_p,axis=0)
print "Intercept:", real_vals_p[0],"+-", std_vals_p[0]
print "Brain Volume:", real_vals_p[1],"+-", std_vals_p[1]
print "Age:", real_vals_p[2],"+-", std_vals_p[2]
Now it is time to check the accuracy of the logistic regression for each case:
Protocol Model: $Group$ ~ $Age + BV + Protocol$
Age Calibrated Model: $Group$ ~ $z(BV)$
- where z is the calibration function explained in the previous analysis, in which brain volume is corrected for age using the healthy controls.
Percentile Calibrated Model: $Group$ ~ $Age + p(BV)$
We first compare the methods when the scanners scale, offset and add noise to the brain volume measurements, but the sites are identical otherwise:
a["n_ms"] = ["??","??"]
foo = get_table(a)
fig,ax = subplots(1,4,figsize=(20,5))
ax0 = plot_grid_box(z_pro[0][[2],:,:],[d[0] for d in n_ms_range],["Protocol"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[0]);
ax0[0].hlines(real_vals[2],0,4.5)
ax[-1].text(0,-0.2,get_table(a),size=16);
#ax0[0].set_ylim([-2,2])
ax[-1].grid("off");
ax[-1].axis("off");
ax0 = plot_grid_box(zcal[0][[-1],:,:],[d[0] for d in n_ms_range],["Z Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[1]);
ax0[0].hlines(real_vals_z[1],0,4.5)
#ax0[0].set_ylim([-2,2])
ax0 = plot_grid_box(pcal[0][[-1],:,:],[d[0] for d in n_ms_range],["P Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[2]);
ax0[0].hlines(real_vals_p[1],0,4.5);
As shown above, the percentile calibration is the least accurate. Let us compare to the ideal case with very low scanner noise and no scanner scaling/offset:
n_ms_range = [[50,50],[100,100],[250,250],[500,500]]
a = deepcopy(p)
zcal,pcal,z_pro = grid(a,n_ms_range,'n_ms',100,model=sm.Logit,combine_type=run_combined_log_models)
a["n_ms"] = ["??","??"]
foo = get_table(a)
fig,ax = subplots(1,4,figsize=(20,5))
ax0 = plot_grid_box(z_pro[0][[2],:,:],[d[0] for d in n_ms_range],["Protocol"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[0]);
ax0[0].hlines(real_vals[2],0,4.5)
ax[-1].text(0,-0.2,get_table(a),size=16);
#ax0[0].set_ylim([-2,2])
ax[-1].grid("off");
ax[-1].axis("off");
ax0 = plot_grid_box(zcal[0][[-1],:,:],[d[0] for d in n_ms_range],["Z Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[1]);
ax0[0].hlines(real_vals_z[1],0,4.5)
#ax0[0].set_ylim([-2,2])
ax0 = plot_grid_box(pcal[0][[-1],:,:],[d[0] for d in n_ms_range],["P Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[2]);
ax0[0].hlines(real_vals_p[1],0,4.5);
ax0[0].set_ylim([0,20]);
In the ideal case with no added scanner variance, the percentile method outperforms the age-calibrated method. The age calibration is the least accurate in this low-noise case, but it probably converges to the ideal value as the number of patients increases.
Finally, we test the models in the biased case, where the populations at each site are different:
p = {'n_hc':50, #number of controls
'n_ms':[250,100], #number of MS patients at each site
'age_hc_range':(20,52), # Age range of the controls
'age_ms_range':[(20,45),(20,75)], # Age range of MS patients at each site
'd_dist':[.65,.88], # Distribution weighting of EDSS scores for each site
'alpha_a':-3, # Real coefficient of age
'alpha_d':[-16,-16], # Real coefficient of EDSS score
'b0':1644, #Brain Volume Intercept (in ccs)
'sig_b':[57,57], # Brain variation across subjects (in ccs) for each site
'sig_s':[16,16], # Variability in brain volume measurement across scanners for each site
'alpha':[0.7,1.3], # Scaling of each scanner
'gamma':[20,-20], # Offset of each scanner
'lo_mean':[2,2], # EDSS distribution parameters at both sites
'hi_mean':[6,6], #EDSS distribution parameters at both sites
'lo_sig': [1.1,1.1], #EDSS distribution parameters at both sites
'hi_sig': [0.8,0.8], #EDSS distribution parameters at both sites
}
p
n_ms_range = [[50,100],[100,100],[250,100],[500,100]]
a = deepcopy(p)
zcal,pcal,z_pro = grid(a,n_ms_range,'n_ms',100,model=sm.Logit,combine_type=run_combined_log_models)
a["n_ms"] = ["??","??"]
foo = get_table(a)
fig,ax = subplots(1,4,figsize=(20,5))
ax0 = plot_grid_box((z_pro[0][[2],:,:]-real_vals[2])/real_vals[2],[d[0] for d in n_ms_range],["Protocol"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[0]);
ax0[0].hlines(0,0,4.5)
ax[-1].text(0,-0.2,get_table(a),size=16);
ax0[0].set_ylim([-1,1])
ax[-1].grid("off");
ax[-1].axis("off");
ax0 = plot_grid_box((zcal[0][[-1],:,:]-real_vals_z[1])/real_vals_z[1],[d[0] for d in n_ms_range],["Z Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[1]);
ax0[0].hlines(0,0,4.5)
ax0[0].set_ylim([-1,1])
ax0 = plot_grid_box((pcal[0][[-1],:,:]-real_vals_p[1])/real_vals_p[1],[d[0] for d in n_ms_range],["P Calibration"],
"Difference Between Logistic Models ($\Delta$z)","Number of MS Patients at each site",ax=ax[2]);
ax0[0].hlines(0,0,4.5);
ax0[0].set_ylim([-1,1]);
Here we see that the percentile calibration and the protocol method get worse as the asymmetry increases across sites, but the age calibrated method is the closest to the ideal value.
Conclusion¶
The power of the brain volume coefficient estimate after age calibration is always higher than after percentile calibration. The power of the protocol method is also higher than that of the percentile calibration.
The accuracy of the coefficient estimates for the age-calibrated method under biased conditions is better than that of both the protocol and the percentile calibration methods.
The area under the ROC curve (auROC) is highest for the age-calibrated method when there are biases between sites.