NESS: Abstracts for Contributed Sessions - WPI

# Abstracts for Contributed Sessions

## Biostatistics

Chair: Pushpa Gupta, University of Maine

### Statistical Inference for the Risk Ratio in 2x2 Binomial Trials with Structural Zero

Ramesh C. Gupta and Suzhong Tian
Department of Mathematics and Statistics
University of Maine, Orono

In some statistical analyses, researchers may encounter the problem of analyzing a correlated 2x2 table with a structural zero in one of the off-diagonal cells. Structural zeros arise in situations where it is theoretically impossible for a particular cell to be observed. For instance, Agresti (2002) provided an example involving a sample of 156 calves born in Okeechobee County, Florida. Calves are first classified according to whether they get a pneumonia infection within a certain time. They are then classified again according to whether they get a secondary infection within a period after the first infection clears up. Because subjects cannot, by definition, have a secondary infection without first having a primary infection, the cell of the summary table corresponding to no primary infection but a secondary infection is a structural void.

The risk ratio (RR) between the secondary infection, given the primary infection, and the primary infection may be a useful measure of the change in pneumonia infection rates between the primary and the secondary infection. In this paper, we first develop and evaluate large-sample confidence intervals for the RR. In addition to the three confidence intervals in the literature, we propose a confidence interval based on Rao's score test. The performance of these confidence intervals is studied by means of extensive simulation studies. We also investigate tests of hypothesis for the RR and the power of these tests. Simulation studies are carried out to examine the performance of these tests in terms of their power. An example from the literature is also provided to illustrate these procedures.
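As a rough illustration of the quantity being estimated, the sketch below computes the sample RR and a simple log-scale Wald interval from the three observable cell counts. The counts, the function name `risk_ratio_wald`, and the delta-method variance under trinomial sampling are assumptions made for illustration; this is not the score-based interval proposed in the talk.

```python
import math

def risk_ratio_wald(n11, n12, n22, z=1.959964):
    """RR = P(secondary | primary) / P(primary) with a log-scale Wald interval.

    n11: primary and secondary infection
    n12: primary infection only
    n22: neither infection (the 'secondary only' cell is a structural zero)
    """
    n = n11 + n12 + n22
    primary = n11 + n12
    rr = (n11 / primary) / (primary / n)
    # Delta method under trinomial sampling (assumption):
    # Var(log RR) is approximately (1 - p11) / (n * p11), with p11 = n11 / n.
    se_log = math.sqrt((n - n11) / (n * n11))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Hypothetical counts, not the calf data cited in the abstract.
rr, lower, upper = risk_ratio_wald(n11=20, n12=40, n22=40)
print(f"RR = {rr:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An RR below 1 in this setup would mean that a secondary infection, given a primary one, is less likely than the primary infection itself.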

### Models for Over-dispersed Binomial Data: Fitting a Beta-binomial Distribution to the Toxicological Data

Krishna K. Saha
Department of Mathematical Sciences
Central Connecticut State University

Binomial models are widely used in the analysis of binomial data. In toxicology, epidemiology and other areas of biostatistics research, it is well recognized that binomial data often display substantial extra-binomial variation, or over-dispersion, relative to a binomial model. Statistical analysis of these data based on the binomial model fails to accommodate this over-dispersion, which can lead to misleading inferences about the mean or the regression parameters. One way of dealing with this problem is to use a parametric model that is more general than the binomial, and one such model is the beta-binomial. The purpose of this talk is to review some possible models for over-dispersed binomial data and to discuss beta-binomial modeling (in R or S-Plus) in relation to toxicological data sets.
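To make the notion of over-dispersion concrete, the following sketch evaluates the beta-binomial probability mass function and compares its variance with the binomial variance for the same mean. It is written in Python rather than the R or S-Plus mentioned in the abstract, and the litter size and shape parameters are hypothetical values chosen for illustration.

```python
import math

def beta_binomial_pmf(y, m, a, b):
    """P(Y = y) when Y | p ~ Binomial(m, p) and p ~ Beta(a, b)."""
    log_choose = math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
    log_beta_num = math.lgamma(y + a) + math.lgamma(m - y + b) - math.lgamma(m + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_choose + log_beta_num - log_beta_den)

m, a, b = 10, 2.0, 6.0  # hypothetical litter size and shape parameters
pmf = [beta_binomial_pmf(y, m, a, b) for y in range(m + 1)]
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

pi = a / (a + b)           # mean response probability
rho = 1.0 / (a + b + 1.0)  # intra-litter correlation
binomial_var = m * pi * (1 - pi)

# The beta-binomial variance exceeds the binomial variance by 1 + (m - 1) * rho.
print(f"variance ratio       = {var / binomial_var:.3f}")
print(f"1 + (m - 1) * rho    = {1 + (m - 1) * rho:.3f}")
```

When `rho` is zero the inflation factor is 1 and the model reduces to the ordinary binomial, which is why the beta-binomial is a natural generalization for litter-type data.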

### Gamma Frailty Process for Longitudinal Count Data

Sourish Das and Dipak Dey
Department of Statistics, University of Connecticut

In this paper, we model longitudinal count data as a non-homogeneous Poisson process. The frailty process is taken to be a gamma process, so that the intensity is a scaled gamma process. We then suggest a form for the density of a new p-variate gamma distribution. We show that the univariate gamma distribution, and hence the chi-square distribution, is a special case of this generalized multivariate gamma distribution. We show the relationship between the multivariate gamma and the Wishart distribution. We also find the Fourier transform, or characteristic function, of the p-variate gamma distribution. We then study some properties of this newly proposed multivariate gamma distribution, and finally we derive a general multivariate gamma process.

### Wavelet Estimation in Censored Regression

Linyuan Li
Department of Mathematics and Statistics
University of New Hampshire

The Cox proportional hazards model has become the model of choice for analyzing the effects of covariates on survival data. However, the proportional hazards assumption places significant restrictions on the behavior of the conditional survival function. The accelerated failure time model, which models the survival time and covariates directly through regression, provides an alternative approach to interpreting the relationship between survival times and covariates. We consider here the estimation of the nonparametric regression function in the accelerated failure time model under random right censorship and investigate the asymptotic rates of convergence of estimators based on thresholding of empirical wavelet coefficients. We show that the estimators achieve nearly optimal minimax convergence rates, up to logarithmic terms, over a large range of Besov function classes, a feature not available for linear estimators. The performance of the estimators is tested via simulation and the method is applied to the Stanford Heart Transplant data.
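For readers unfamiliar with wavelet thresholding, the sketch below shows the generic hard and soft thresholding rules applied to a list of empirical coefficients. It ignores censoring entirely, and the coefficient values, the noise level, and the use of the universal threshold are assumptions for illustration rather than the estimator analyzed in the talk.

```python
import math

def hard_threshold(coef, lam):
    """Keep a coefficient unchanged if it exceeds the threshold, else zero it."""
    return coef if abs(coef) > lam else 0.0

def soft_threshold(coef, lam):
    """Zero small coefficients and shrink the rest toward zero by lam."""
    if abs(coef) <= lam:
        return 0.0
    return coef - lam if coef > 0 else coef + lam

def universal_threshold(sigma, n):
    """Threshold sigma * sqrt(2 log n) for n coefficients with noise level sigma."""
    return sigma * math.sqrt(2.0 * math.log(n))

# Hypothetical empirical wavelet coefficients and noise level.
coefs = [4.2, -0.3, 0.8, -2.9, 0.1, 0.05, -0.6, 3.1]
lam = universal_threshold(sigma=0.5, n=len(coefs))

hard = [hard_threshold(c, lam) for c in coefs]
soft = [soft_threshold(c, lam) for c in coefs]
print("threshold:", round(lam, 4))
print("hard:", hard)
print("soft:", [round(c, 4) for c in soft])
```

Because the rule depends on each coefficient's size, the resulting estimator is nonlinear in the data, which is the property that distinguishes it from the linear estimators mentioned in the abstract.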

### A Semiparametric Modeling Approach for the Development of Metabonomic Profile and Bio-Marker Discovery

Samiran Ghosh
University of Connecticut

The discovery and validation of biomarkers is an important step towards the development of criteria for early diagnosis of disease status. Recently, ESI and MALDI time-of-flight mass spectrometry have been used to identify biomarkers in both proteomics and metabonomics studies. Data sets generated from mass spectrometers are generally very large and thus require sophisticated statistical techniques to glean useful information. Traditionally, the different data processing steps are carried out separately, resulting in unsatisfactory propagation of signals to the final model. It is more intuitive to develop models for patterns rather than for discrete points. In the present study, a novel semiparametric approach has been developed to distinguish urinary metabolic profiles in a group of trauma patients from those of a control group consisting of normal individuals. To address instrument variability, all data sets have been analyzed in replicates, an important issue ignored by most past studies of a similar kind. We propose different models, with different choices of centering function, to capture the non-standard shape of the profiles. Model comparisons were performed to select the best model for each subject. The inherent assumption that trauma patients will show irregular patterns in their profiles was checked by comparing an intensity function with that of normal individuals. The m/z values in the window of the irregular pattern are then recommended for further biomarker discovery associated with trauma.

## Statistical Applications

Chair: Carlos Morales, State Street Global Advisors

### The Effect of Ramadan on the Automobile Accident Rate in Turkey and a Variation of the Wilcoxon Rank Test

Herman Chernoff
Harvard University

Monthly accident data were compiled over the 22-year period from 1984 through 2005 to investigate the conjecture that Ramadan leads to an increase in auto accidents, presumably as an effect of dehydration during the fasting period. Since the month of Ramadan is governed by a lunar calendar, the onset varies with respect to the calendar months, moving from September to April during the 22-year period. To test the conjecture, a variation of the Wilcoxon rank test is applied to the differences of the monthly residuals from a regression used to account for the effects of time and the calendar month. These differences are measured against "Ramadan Time", which is the time between the beginning of the nearest occurrence of Ramadan and the end of the current calendar month, and varies from -6 to 6.

### Dynamic Programming Applications for Semi-Markov Processes

Andrew C. Thomas
Harvard University

Semi-Markov processes, where time-independent transitions are Markovian but transition times are not, yield time-dependent transition probabilities that are difficult to calculate analytically. Simulation methods can provide only rough solutions over long time periods, suggesting that more exact methods should be used. I discuss a dynamic programming method for solving these types of problems and demonstrate several applications to show its usefulness.
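One standard way to set up such a computation in discrete time is a recursion on the interval transition probabilities: either no jump has occurred by time t, or the first jump goes to state k at time s and the process restarts from there. The sketch below implements that recursion; the function name, the two-state example, and the holding-time distributions are assumptions for illustration, and the abstract does not say that this is the method presented.

```python
def interval_transition_probs(P, H, T):
    """phi[t][i][j] = P(in state j at time t | state i entered at time 0).

    P[i][k]       : embedded (jump-chain) transition probability i -> k
    H[i][k][s - 1]: P(holding time = s | next jump is i -> k), s = 1, 2, ...
    """
    n = len(P)

    def survival(i, t):
        # Probability that the process has not yet left state i by time t.
        return 1.0 - sum(P[i][k] * sum(H[i][k][:t]) for k in range(n))

    # Base case: at time 0 the process is in its starting state.
    phi = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    for t in range(1, T + 1):
        layer = []
        for i in range(n):
            row = []
            for j in range(n):
                value = survival(i, t) if i == j else 0.0
                for k in range(n):
                    for s in range(1, min(t, len(H[i][k])) + 1):
                        # First jump to k at time s, then reuse the stored
                        # solution for the remaining t - s time units.
                        value += P[i][k] * H[i][k][s - 1] * phi[t - s][k][j]
                row.append(value)
            layer.append(row)
        phi.append(layer)
    return phi

# Hypothetical two-state process that alternates between states 0 and 1.
P = [[0.0, 1.0], [1.0, 0.0]]
H = [[[], [0.5, 0.5]],          # 0 -> 1 after 1 or 2 steps
     [[0.2, 0.3, 0.5], []]]     # 1 -> 0 after 1, 2 or 3 steps
phi = interval_transition_probs(P, H, T=6)
for t, layer in enumerate(phi):
    print(t, [round(x, 4) for x in layer[0]])
```

Each layer is built only from earlier layers, so the cost grows polynomially in the time horizon instead of requiring enumeration of sample paths.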

### How Well Can One Predict Stock Price Based on Quarterly Earnings Forecasts? An Application of the Ohlson Model and Bayesian Statistics

Huong N. Higgins
Worcester Polytechnic Institute

Over the past decade of accounting and finance research, the Ohlson model has often been examined as a framework for equity valuation. In this paper, we apply Bayesian statistics to the Ohlson model and evaluate the improvement in predictive power. Specifically, focusing on S&P 500 firms, we use 23 quarters of data starting in Q1 1999 to estimate the prediction models, which we then use to predict stock price in Q4 of 2004. We use two types of estimation approaches, maximum likelihood and Bayesian statistics. We find that Bayesian analyses generally result in smaller predictive errors than maximum likelihood analyses. We perform several transformations; however, transformations of the maximum likelihood models do not outweigh the usefulness of applying Bayesian statistics. We conclude that applying Bayesian statistics is a fruitful way to improve Ohlson's classical framework for equity valuation.

### Text Mining: Some Challenges of Analyzing Natural Language

Roger Bilisoly
Department of Mathematical Sciences
Central Connecticut State University

Natural languages such as English are complex. Basic tasks such as identifying words pose more difficulties than one would suppose. But even given flawless word identification, word frequencies are typically very low, and sample sizes are never large enough to assume that the limiting case is even approximately true. Hence the usual statistical techniques are often inapplicable to analyzing text. This talk will first discuss some of these problems posed by natural languages, and will then outline some techniques for transforming a text into a database, which can then be analyzed by either statistical or data mining techniques. Examples using literary texts such as Charles Dickens' A Christmas Carol will be given.
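A small sketch can show why even word identification is not trivial: two reasonable tokenizing rules give different word counts for the same sentence, and most word types occur only once. The sample sentence and both regular expressions are invented for illustration and are not taken from the talk.

```python
import re
from collections import Counter

# Invented sample sentence containing a possessive, a contraction and a hyphen.
text = "Scrooge's nephew didn't laugh; the good-humoured nephew laughed."

# Rule 1: a word is any run of letters.
naive = re.findall(r"[a-z]+", text.lower())

# Rule 2: letters, optionally joined by internal apostrophes or hyphens.
joined = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

counts = Counter(joined)
singletons = [word for word, k in counts.items() if k == 1]

print(len(naive), naive)
print(len(joined), joined)
print(f"{len(singletons)} of {len(counts)} word types occur exactly once")
```

The first rule splits "didn't" into "didn" and "t", while the second keeps it whole, so any downstream frequency table depends on a choice made before the statistics begin.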

### The Prediction of Business Statistics Course Grades Using Freshman Economics Course Grades: A Case in Point

Dr. Deborah J. Gougeon
University of Scranton

This study investigated whether student academic achievement in college Business Statistics courses could be predicted by grades in freshman-level Economics courses at a private university. Subjects were 381 college seniors from the School of Business at this institution. The data used were grades from two freshman-level Economics courses (Micro-Economics and Macro-Economics) and two sophomore-level Business Statistics courses (Introduction to Business Statistics and Intermediate Business Statistics) that these seniors had taken in their freshman and sophomore years. A correlation study was conducted with these data. Results of this study will be discussed.

## Statistical Methodology

Chair: Daniel Zelterman, Yale University

### Model-building: Statistical, Mathematical, Computerised, Scientific

G. Arthur Mihram and Danielle Mihram
University of Southern California

Our statistical model-building, particularly that undertaken by regression methodology, is often implemented by computerised programmes which "fit" mathematical curves to recorded data. Another approach to model-building is that of the applied mathematician, who seeks to conduct himself/herself as if miming the 'theorem-proving' of the (pure) mathematician, viz.: state your assumptions; then proceed through logical deductions; derive the result (the model, the mathematical equation or formula). This approach can sometimes also be computerised at the conclusion, by 'solving' numerically the resulting time-dependent differential equations once 'initial conditions (t = 0)' have been specified. Computer scientists, early on, saw the opportunity to model the dynamics of 'systems' of phenomena by authoring algorithms, one for each decision-making activity which can occur in the life of the modelled system. This model-building activity, though it must be as logically impeccable as that of either the statistician or the applied mathematician, does not require at the outset a statement of one's assumptions and does not progress as though one is attempting to "prove a theorem", as one would in geometry. We then compare and contrast the three model-building formats and ask how each may differ from the Scientific Method (after all, the biologist Charles Darwin seldom used numerals!). Our statistical methodologies, including hypothesis testing, are highlighted in this comparison.

### Do Bayesian Methods Overstate the Credibility of Phylogenetic Trees?

Edward M. Johnston
Bessel Consulting

Since 1996 a number of authors have shown how the evolutionary trees of species can be estimated by Bayesian analysis of their DNA sequences. In a 2002 paper, Suzuki and colleagues argue that Bayes methods "can be excessively liberal," and they contrast the Bayesian support values with those obtained by bootstrap methods, which they judge to be conservative. An attack on the correctness of Bayes' Theorem is unlikely to succeed, so other ways of explaining the anomaly must be considered. In a 2005 paper, Lewis et al. explain this overcredibility with what they call a 'fair coin paradox', where arbitrary resolutions of a tied case can be assigned posterior probabilities over 98% in situations that appear to 'deserve' a probability closer to 33%. We applaud the attempt at simplification begun by Lewis et al., but we do not share their conclusions. Their example shows little difference between the classical and Bayesian measures, thus failing to serve as an instance of Suzuki's problem, and the finding of a paradox can be questioned. We close by considering whether the problem found by Suzuki et al. (a) is real and needs a better explanation, or (b) is not real and at most leads to criticism of the priors.

### A Hierarchical Bayesian Approach for Estimating Origin of Mixed Population

Feng Guo and Dipak K. Dey, Department of Statistics
Kent Holsinger, Department of Ecology and Evolutionary Biology
University of Connecticut

A hierarchical Bayesian model is proposed to estimate the proportional contributions of the source populations to a colonized mixture population. The model uses genetic data from the mixture population and from the source populations that might contribute individuals to it, as well as environmental and demographic factors that might affect the colonizing process. The model is a mixture of multinomial distributions that reflects the colonizing process. The environmental and demographic information is incorporated into the model through a hierarchical prior structure. The model is applied to a gray seal data set. Markov chain Monte Carlo (MCMC) simulation is used to conduct inference on the posterior distribution of the model parameters. The effects of the covariates are also investigated using information measures. The proposed model is compared with a model existing in the literature. Using model comparison criteria, it is demonstrated that our proposed model outperforms the existing model in various ways.

### Statistical Based Material Design Values Obtained from Small Data Sets

Donald Neal, SRRC
Mark Vangel, MIT/Mass. General

This paper identifies an acceptable statistical procedure for obtaining design allowable values from a small set of material strength data. The allowable is a material design number defined as the 95% lower confidence bound on a specified percentile of the population of material strength data. The percentiles are the first and tenth for the A and B allowables, respectively. The proposed method reduces the penalties commonly associated with small-sample allowable computation by accurately maintaining the definition requirements and reducing variability in the estimate. Using very small samples will obviously reduce the cost of testing and of manufacturing test specimens, which is the primary motivation for this study.

In the evaluation process, five methods were considered for computing the design allowable. Three of these methods involved certain statistical distribution assumptions, while the other two were nonparametric methods. The latter methods introduced a pooling process in which the small sample was combined with a larger, previously obtained sample. The former methods used a reduced ratio method and the Weibull and Normal distributions.

Monte Carlo studies showed that the nonparametric procedures were the most desirable for computing small sample design allowables.
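The definition of the allowable explains why small samples are penalized. With no distributional assumption and no pooling, the sample minimum is a 95% lower confidence bound on the p-th percentile only when (1 - p)^n is at most 0.05. The sketch below computes the smallest such n; the function name is hypothetical, and this textbook order-statistic calculation is not one of the five methods compared in the paper.

```python
def min_sample_size(percentile, confidence=0.95):
    """Smallest n for which the sample minimum is a distribution-free lower
    confidence bound on the given population percentile."""
    n = 1
    # P(all n observations exceed the percentile point) = (1 - percentile) ** n,
    # and this must not exceed 1 - confidence.
    while (1.0 - percentile) ** n > 1.0 - confidence:
        n += 1
    return n

b_basis_n = min_sample_size(0.10)  # tenth percentile (B allowable)
a_basis_n = min_sample_size(0.01)  # first percentile (A allowable)
print("B allowable needs n >=", b_basis_n)
print("A allowable needs n >=", a_basis_n)
```

Sample sizes of this magnitude are costly to test, which is consistent with the study's interest in pooling a small sample with a larger, previously obtained one.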