Hilary Parker | Math 155 Project

This web page was produced as an assignment for a course on Statistical Analysis of Microarray Data at Pomona College.

The normalization procedure used in this project was the loess method, which is a robust local regression method. Given user-specified parameters, the algorithm fits a series of weighted linear regressions, each over a small subset of the data, repeats this across the range of the data, and joins the local fits into a smooth best-fit curve.
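The core step described above can be sketched in a few lines. This is only a minimal illustration, not the code used for this project: the function names, the tricube weighting, and the default `span` are my assumptions, and the robustness iterations (down-weighting points with large residuals) are left out to keep it short.

```python
def tricube(u):
    """Tricube weight: 1 at u = 0, decreasing to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_fit(x, y, span=0.4):
    """Return one fitted value per point from local weighted linear fits.

    span (hypothetical default) is the fraction of points in each local fit.
    Robustness iterations are NOT included in this sketch.
    """
    n = len(x)
    k = min(n, max(2, round(span * n)))  # points used per local regression
    fitted = []
    for x0 in x:
        # k nearest neighbours of x0, as (distance, x, y) triples
        near = sorted((abs(xi - x0), xi, yi) for xi, yi in zip(x, y))[:k]
        h = near[-1][0] or 1.0           # bandwidth; guard against h == 0
        w = [tricube(d / h) for d, _, _ in near]
        sw = sum(w)
        # weighted means, then weighted least-squares slope
        mx = sum(wi * xi for wi, (_, xi, _) in zip(w, near)) / sw
        my = sum(wi * yi for wi, (_, _, yi) in zip(w, near)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, (_, xi, _) in zip(w, near))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, (_, xi, yi) in zip(w, near))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Toy usage: noisy-free curved data, purely illustrative
xs = [i / 10 for i in range(30)]
ys = [v ** 2 for v in xs]
print(loess_fit(xs, ys, span=0.3)[:3])
```

A smaller `span` follows the data more closely, while a larger one gives a smoother curve.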

The MA plots below display data before and after loess normalization for two arrays (5 and 6). In the normalized plots, the M-value at each A-value has been adjusted using the loess fit described above. Notice that in both cases the data appear to be centered on the M=0 line after normalization, but not before. This is especially apparent in array 6, where a pronounced positive trend is essentially erased by normalization.
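For readers unfamiliar with MA plots, the quantities involved can be sketched as follows. This assumes the usual two-channel convention (M is the log ratio, A is the average log intensity) and that normalization subtracts the fitted trend from M. The names `red`, `green`, and `fitted` are placeholders of mine, not variables from this project.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists from two-channel intensities (assumed positive)."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def normalize_m(m, fitted):
    """Subtract the fitted trend (one value per spot) from the raw M values."""
    return [mi - fi for mi, fi in zip(m, fitted)]


# Toy usage: the first spot has a fourfold red/green ratio, so its M is 2
m, a = ma_values([400.0, 100.0], [100.0, 100.0])
# A constant trend of 1.0 stands in for a loess fit here (illustration only)
print(normalize_m(m, [1.0, 1.0]))
```

If the fitted curve captures the intensity-dependent trend, the normalized M-values scatter around zero at every A-value, which is what the "after" plots show.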

Boxplots for each of the arrays before and after normalization are displayed below.

Note that the y-axis scale is quite different for each plot. For arrays 26 and 34, normalization appears to have actually skewed the data more, although the MA plots do not indicate such a drastic change in M-values.

Boxplots of the background confirm what we saw visually with the microarray chip - the background data is very noisy.

Next, the normality of the normalized data was tested two ways (the Kolmogorov-Smirnov method and the Looney and Gulledge method).
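As a rough illustration of the first of these tests, a one-sample Kolmogorov-Smirnov test against a normal distribution might look like the sketch below. The function name is hypothetical, and two choices are assumptions of mine rather than details from this project: the reference normal uses the sample's own mean and standard deviation, and the p-value comes from a standard asymptotic approximation.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_normal(values):
    """One-sample KS test against a normal with the sample's mean and sd.

    Returns (D, approximate p-value). Because the mean and sd are estimated
    from the same data, the p-value is only approximate.
    """
    xs = sorted(values)
    n = len(xs)
    ref = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = ref.cdf(x)
        # largest gap between the empirical and reference CDFs
        d = max(d, f - i / n, (i + 1) / n - f)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0  # series below converges slowly; p is essentially 1
    # asymptotic Kolmogorov series for the tail probability
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Toy usage on made-up values for one gene across a handful of arrays
print(ks_normal([0.1, -0.4, 0.3, 1.2, -0.8, 0.05, -0.2, 0.6]))
```

A small p-value means the empirical distribution of a gene's values is farther from the reference normal than chance alone would readily explain.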

Using 50 randomly selected genes, the following p-values were obtained using the KS method:
0.177828
0.02952348*
0.5625915
0.516381
0.1233222
0.2360967
0.34369
0.09966977
0.1853652
0.3020463
0.04049711*
0.0702582
0.00320633*
3.040773e-05*
0.1848228
0.9039434
0.1461928
0.02026443*
0.03136127*
0.8024172
0.006344712*
0.5064732
0.1773753
0.6533653
0.1195844
0.009225614*
0.1996645
0.01036490*
0.5589412
0.675567
0.8987557
0.03379657*
0.1580227
0.0372678*
0.7973501
0.1790031
0.1819115
0.7938111
0.4737908
0.5437275
0.09287292
0.01194443*
0.06556916
0.004963614*
0.2068782
0.6062126
0.1652469
0.01839006*
0.07193951
0.2956484

Values marked with a star are significant at the 0.05 level, indicating that we can reject the null hypothesis that the corresponding gene's values are normal. Note that if all 50 genes were truly normal, we would expect approximately 2-3 type I errors (50 × 0.05 = 2.5), but we have 14 rejections. This may suggest that our data are not reliably normalized.
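The expected number of false rejections, and how unusual 14 rejections would be by chance, can be checked with a short calculation. This sketch assumes the 50 tests are independent, which the project does not establish; the function name is mine.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


n_genes, alpha, rejected = 50, 0.05, 14
expected = n_genes * alpha                    # expected type I errors: 2.5
tail = binom_tail(n_genes, rejected, alpha)   # chance of 14 or more
print(expected, tail)
```

Under the independence assumption the tail probability is far below 0.05, so 14 rejections are very unlikely to be type I errors alone.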

The next test for normality was the Looney & Gulledge test, which finds the correlation coefficient between the ordered data and the corresponding normal quantiles (the points of a qq-plot) and determines its significance. The correlations are listed below. Note that 34 arrays were used in total, so at the 0.05 level, the critical value from a Looney table is 0.968. Stars are placed next to correlations that fall below this critical value, i.e. genes that are significantly different from normal.

0.907135*
0.7911997*
0.8533497*
0.9820423
0.9657144*
0.9847417
0.8433906*
0.95118*
0.7972965*
0.9397885*
0.8361207*
0.9109652*
0.914329*
0.9211378*
0.8658665*
0.9493248*
0.920989*
0.9527418*
0.8242175*
0.8843698*
0.9065188*
0.756679*
0.7891022*
0.9142712*
0.848973*
0.9721124
0.9738254
0.9534558*
0.9300567*
0.7930633*
0.883229*
0.6168813*
0.9860473
0.774802*
0.901778*
0.9948245
0.7889048*
0.941132*
0.7807806*
0.740908*
0.819511*
0.9040048*
0.7649094*
0.9756437
0.7799284*
0.9495617*
0.9864484
0.9088602*
0.7422973*
0.9724674

Clearly, most genes fail the qq-plot test for normality. One factor that was not taken into consideration is that some of the genes above may have had "NA" entries in one or more of the arrays. For those genes the effective sample size is smaller than 34, so a different critical value should be used. However, I did not know of an efficient means of finding these critical values.
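The qq-plot correlation described above can be sketched as follows. This is not the code used for this project: the function name is hypothetical, and the choice of Blom's plotting positions, (i - 0.375) / (n + 0.25), is an assumption on my part. The sketch drops missing entries and returns the number of values actually used, so that the critical value for that sample size could then be looked up in a table.

```python
import math
from statistics import NormalDist


def qq_correlation(values):
    """Correlation between sorted data and normal quantiles.

    Missing entries (None or NaN) are dropped. Returns (r, n), where n is
    the number of non-missing values. Assumes Blom's plotting positions.
    """
    xs = sorted(v for v in values if v is not None and not math.isnan(v))
    n = len(xs)
    q = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
         for i in range(1, n + 1)]
    mx, mq = sum(xs) / n, sum(q) / n
    sxq = sum((x - mx) * (z - mq) for x, z in zip(xs, q))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((z - mq) ** 2 for z in q)
    return sxq / math.sqrt(sxx * sqq), n


# Toy usage: one made-up gene with a missing entry
r, n_used = qq_correlation([0.2, -0.5, None, 1.1, 0.4, -0.9, 0.0])
print(r, n_used)  # compare r with the table's critical value for n_used
```

A correlation below the critical value for the returned sample size would lead to rejecting normality for that gene.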