Abstracts for Invited Sessions
Organizer and Chair: Jayson D. Wilbur, WPI
Cross-validating and Bagging Algorithms for Predicting Clinical Outcomes with Proteomic and Genomic Data
Division of Biostatistics, Yale University
Clinicians aim toward a more preventative model of attacking cancer by pinpointing and targeting specific early events in disease development. These early events can be measured as genomic, proteomic, epidemiologic, and/or clinical variables, using expression or Comparative Genomic Hybridization (CGH) microarrays, SELDI-TOF/mass spectra, patient histories, and pathology and histology reports. These measurements are then used to predict clinical outcomes such as primary occurrence, recurrence, metastasis, or mortality. There are numerous considerations when approaching such data. Most important is the ability to unearth biologically driven associations between variables and clinical outcomes. Additionally, statisticians must be able to quantify the interactions between different types of variables and the effects of those interactions on the clinical outcome. After a model is built which elucidates relevant patterns of association in a given data set, the next challenge is to assess how well this model will predict outcomes in an independent validation sample (i.e., in future data sets). We will propose a new algorithm in the framework of a general approach for comparing methods, selecting models, and assessing prediction error in the presence of censored outcomes. In addition to simulations, we will evaluate our approach and the varying methods in tissue microarray data on a breast cancer cohort.
Longitudinal Variability in Gene Expression
This talk will illustrate the use of genomic data in drug development. Wyeth Pharmaceuticals recently completed a longitudinal population-based study of peripheral blood gene expression in healthy volunteers. Approximately 400 healthy subjects in North America and Europe were enrolled. Each subject contributed up to seven blood samples for genomic analysis. Affymetrix U133A GeneChips were used to characterize the within-subject (longitudinal) and between-subject variability in gene expression in peripheral blood mononuclear cells. 1682 GeneChip profiles were included in the analysis data set. Enrollment in the study was stratified by sex, age, race and smoking history. The analysis of the data, including experiences with multiple testing, lab effects and data mining, will be discussed in light of the role of the study in biomarker and drug development.
A Link Free Motif Discovery Method Using Time Course Gene Expression Information
Department of Statistics, Harvard University
Identification of transcription factor binding motifs (TFBMs) is a crucial first step toward understanding the regulatory circuitries controlling the expression of genes. In this talk, we propose a novel procedure called multivariate sliced inverse regression (mSIR) for identifying TFBMs. mSIR follows the recent trend of combining information contained in both gene expression measurements and gene promoter sequences. mSIR is a generalization of the sliced inverse regression method to the case of a multivariate response. Compared with existing methods, mSIR is computationally efficient and stable for data with high dimensionality and high multicollinearity. Furthermore, by avoiding estimation of the link function between the gene expression measurements and the sequence information, mSIR can reduce the false discovery rate caused by inappropriate model specification.
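The abstract does not spell out the mSIR algorithm; as background, here is a minimal sketch of classical univariate sliced inverse regression, the method mSIR generalizes to multivariate responses. The function name, slicing scheme, and whitening details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sir_directions(X, y, n_slices=5, n_dirs=1):
    """Classical sliced inverse regression: estimate effective
    dimension-reduction directions without specifying the link
    function between the response y and the predictors X."""
    n, p = X.shape
    # Whiten the predictors
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(cov))
    Z = (X - mu) @ L  # Cov(Z) = I
    # Slice on the (sorted) response; average Z within each slice
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original scale
    vals, vecs = np.linalg.eigh(M)
    dirs = L @ vecs[:, ::-1][:, :n_dirs]
    return dirs / np.linalg.norm(dirs, axis=0)
```

Note that no model for the link function is ever fit: only the inverse regression curve (the slice means) is used, which is what makes the approach "link free."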
Organizer: Naitee Ting, Pfizer
Chair: Mingxiu Hu, Millennium Pharmaceutical
How To Prevent Bad Drugs From Ever Getting In A Human: Preclinical Screening For Off-Target Pharmacologic Effects
Pfizer Inc., Global Research & Development
Prior to ever dosing humans with a compound, many screens will be run in animals to identify possible "ancillary" (i.e. undesirable) pharmacology. From a statistical point of view, analysis of these experiments is interesting in several ways, including: since the onset and duration of the effect is not known in advance, repeated measures designs are used; the time resolution used to analyze the data can be decided based on both physiological and statistical considerations; since compelling evidence of "little or no effect" is desired, testing the null hypothesis of "no effect" is insufficient; since screens are run frequently, expert modeling of the covariance structure is not possible for each analysis, ... and many other issues! I will also mention parallel computing techniques used for simulations to evaluate our methodology.
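The point that testing the null hypothesis of "no effect" is insufficient is commonly addressed with equivalence testing. Below is a minimal sketch of the two one-sided tests (TOST) idea under a normal approximation; the margin, significance level, and function names are illustrative assumptions, not the actual screening procedure.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def tost_equivalence(estimate, se, margin, alpha=0.05):
    """Two one-sided tests: conclude 'little or no effect' only if
    the effect is significantly above -margin AND significantly
    below +margin (both one-sided nulls rejected at level alpha)."""
    z_lower = (estimate + margin) / se  # H0: true effect <= -margin
    z_upper = (estimate - margin) / se  # H0: true effect >= +margin
    p_lower = 1 - normal_cdf(z_lower)
    p_upper = normal_cdf(z_upper)
    return max(p_lower, p_upper) < alpha
```

A precisely estimated effect near zero passes; the same point estimate with a large standard error does not, which is exactly why a non-significant "no effect" test alone is not compelling evidence of safety.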
Statistical Issues in Combining Adverse Events Data
Shailendra S. Menjoge
Boehringer Ingelheim Pharmaceuticals Inc.
Most clinical trials are designed to assess the efficacy of a new treatment. Adverse events are collected in these trials to assess the safety of the new treatment. Usually individual trials are too small to provide adequate assurance of safety, so data from several clinical trials are combined to provide a better assessment. Many statistical issues arise when combining data from different trials. The simplest procedure is to pool the data from different trials and treat them as resulting from one large trial. When crude incidence rates of adverse events are computed from such pooled data, a statistical anomaly popularly known as Simpson's paradox can occur. Meta-analysis or some kind of stratified analysis usually resolves this problem. When trials of different durations are combined to determine the relationship of time to event, these options are not available. Early withdrawals pose additional problems, and a paradox similar to Simpson's can occur. This will be illustrated with a real-life example.
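A tiny numerical illustration of the pooling anomaly, with wholly hypothetical counts: in each individual trial the new drug has the lower adverse-event rate, yet naive pooling reverses the comparison.

```python
# Hypothetical adverse-event counts (events, n) from two trials whose
# arms have very different sizes.
trials = {
    "trial_1": {"drug": (10, 100), "control": (50, 400)},   # 10.0% vs 12.5%
    "trial_2": {"drug": (100, 400), "control": (30, 100)},  # 25.0% vs 30.0%
}

def rate(events, n):
    return events / n

# Within each trial, the drug has the LOWER adverse-event rate:
for arms in trials.values():
    assert rate(*arms["drug"]) < rate(*arms["control"])

# Naive pooling, as if the data came from one large trial:
drug_events = sum(a["drug"][0] for a in trials.values())      # 110
drug_n = sum(a["drug"][1] for a in trials.values())           # 500
ctrl_events = sum(a["control"][0] for a in trials.values())   # 80
ctrl_n = sum(a["control"][1] for a in trials.values())        # 500

# Pooled: drug 110/500 = 22% vs control 80/500 = 16% -- reversed.
print(rate(drug_events, drug_n), rate(ctrl_events, ctrl_n))
```

A stratified (per-trial) comparison or a meta-analysis keeps the trial as the unit of combination and avoids the reversal.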
General Statistical Issues in Pharmaceutical Industry
It takes over 10 years and costs hundreds of millions to over a billion dollars to bring a drug to its patients. The pharmaceutical/biotech industry is one of the biggest employers of statisticians. What roles do pharmaceutical statisticians play in the drug development process? What types of statistical issues do we encounter and what kinds of statistical tools do we use? The purpose of this presentation is to shed some light on these questions. I will focus on general statistical issues and common statistical tools used in the following areas of the drug development process: drug discovery, clinical pharmacology, clinical trial design, clinical data analysis, and outcomes research. All discussions will be general rather than in depth.
Organizer and Chair: Dominique Haughton, Bentley College
Uplift Modeling in Direct Marketing
Qizhi Wei, Epsilon Data Management
In the past, predictive models were typically evaluated based only on the ability to generate additional responses or sales compared to randomly targeted audiences. With growing demands for marketing accountability, statisticians are increasingly being asked to build uplift models that identify consumers who are most positively influenced by direct marketing campaigns and show the incremental impact of direct marketing programs so that ROI can be more accurately measured. This in turn leads to better decisions about how marketing dollars are spread across channels as well as better decisions about overall marketing budgets.
In this presentation, we will discuss the pros and cons of different approaches to build and validate the uplift models. In-market results will be used to illustrate how models can be enhanced over time. We will also propose a new method to implement the uplift models, which has been applied to a recent marketing campaign with great success.
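The abstract does not reveal the proposed method; as an illustration of the basic uplift idea, here is a sketch of the simplest segment-level estimator: uplift is the treated response rate minus the control (holdout) response rate, rather than the raw response rate itself. The function name and data layout are assumptions for illustration.

```python
from collections import defaultdict

def segment_uplift(records):
    """Estimate uplift per customer segment as the difference between
    treated and control (holdout) response rates.  `records` is an
    iterable of (segment, treated: bool, responded: bool) tuples."""
    counts = defaultdict(lambda: [0, 0, 0, 0])  # t_resp, t_n, c_resp, c_n
    for seg, treated, responded in records:
        c = counts[seg]
        if treated:
            c[0] += responded
            c[1] += 1
        else:
            c[2] += responded
            c[3] += 1
    return {
        seg: c[0] / c[1] - c[2] / c[3]
        for seg, c in counts.items()
        if c[1] and c[3]
    }
```

Targeting the segments with the largest uplift, rather than the largest response rate, is what ties the model to incremental campaign impact and hence to measurable ROI.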
Hierarchical Modeling Using Individual Patient Meta-Analyses: Imputing Methods for Comparisons in Multi-Site Studies
I. Elaine Allen, Babson College
Christopher A. Seaman, Human Services Institute
Meta-analyses are increasingly being used to provide evidence for decision-making. The imposition of strict inclusion criteria tends to diminish the database, and often leads to sparse data. Sites in a multi-site study may follow a common protocol, but include different controls and different treatment groups. In either case, the totality of evidence may proceed along two somewhat different tracks: (1) empty cells in certain categories are assumed to be missing at random; (2) certain categories are considered to have additional data. This is analogous to the glass being half-empty or half-full. Although both views are reasonable, they can lead to the use of different methods. We use a combination of meta-analysis and hierarchical modeling to generate treatment effects. An analysis that exemplifies the methodology is provided from a large multi-site trial used to make patient decisions in medical and mental health care.
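As one concrete instance of combining meta-analysis with hierarchical (random-effects) modeling, here is a sketch of the standard DerSimonian-Laird estimator, which lets the true treatment effect vary between sites. It is illustrative background, not necessarily the authors' method.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects meta-analysis: pool per-site treatment effects
    while estimating the between-site variance tau^2 by the method
    of moments (DerSimonian-Laird)."""
    effects = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)        # Cochran's Q
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    return pooled, tau2
```

When the sites are homogeneous the between-site variance estimate collapses to zero and the pooled effect reduces to the fixed-effect answer; heterogeneous sites inflate tau^2 and pull the weights toward equality.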
CSBIGS (Case Studies in Business, Industry and Government Statistics): A New Journal for Data Analysis Cases
We envision a journal with the following features:
- An international journal with an editor-in-chief (Dominique Haughton, US) and two co-editors, one in France (Christine Thomas, University of Toulouse I) and one in Hong Kong (K.W. Ng, University of Hong Kong), and an editorial team encompassing several countries
- A journal published electronically, and possibly in print version, with the data for each case study available for use in a variety of educational settings
- A journal associated with and promoted by a reputable publisher
- A journal that will serve at least three major constituencies: instructors who need good case studies ready with data to use in class, case writers who need a reputable refereed outlet for their work, and publishers who are eager to stay abreast of current developments in statistical applications, as well as students and practitioners
- A journal that will appear once or twice a year (and four times a year beginning with its second volume), with about 10 cases (and 7 or 8 cases beginning with the second volume) in each issue, each between ten and twenty pages long
We will discuss the rationale for such a new journal, and invite researchers willing to serve on the Editorial Board or to submit cases. This presentation will be very interactive, and we look forward to your participation!
Organizer and Chair: Myron Katzoff, National Center for Health Statistics
Reengineering the 2010 Decennial Census - Challenges and Opportunities
Preston Jay Waite
Associate Director for Decennial Census, U. S. Census Bureau
The census of population and housing is conducted every ten years. It is the largest peacetime activity conducted by the Federal Government. Because of the cost and sheer volume of the effort, a great deal of research is needed to find efficiencies in the collection effort. The 2000 Census was generally considered to be a quality and operational success: all activities were conducted on time and within budget. As we look to the future, some important improvements are needed.
This paper will discuss the census reengineering efforts currently underway, including a major restructuring of the field work enabled by dramatically expanded use of electronics in the nonresponse follow up phase of the census. This will provide the context for our research efforts, as well as several specific research efforts underway. Some of the specific topics to be discussed include:
- coverage improvement and coverage measurement methodologies;
- cognitive research into questionnaire wording;
- split-panel experimental designs for mail treatments;
- testing effects of mandatory messaging on questionnaires;
- estimation issues associated with the 5-year moving averages of the American Community Survey, which is designed to replace the once-a-decade census long form.
Characterization of Cost Structures, Perceived Value and Optimization Issues in Small Domain Estimation
John L. Eltinge
Office of Survey Methods Research, U.S. Bureau of Labor Statistics
In recent years, government statistical agencies have encountered many requests from stakeholders for production of estimates covering a large number of relatively small subpopulations. Due to resource constraints, agencies generally are not able to satisfy these requests through additional data collection and subsequent production of standard direct estimates. Instead, agencies attempt to meet some of the stakeholders' requests with estimators that combine information from sample data and auxiliary sources. In essence, the agencies are substituting technology (i.e., modeling and related methodological work) for data-collection labor, and in exchange the agencies and data users incur additional risks related to potential model lack of fit and potential misinterpretation of published results.
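As a minimal illustration of "combining information from sample data and auxiliary sources," here is the textbook composite (shrinkage) estimator underlying Fay-Herriot-type small area methods. It is generic background, not a description of any particular agency's production system.

```python
def composite_estimate(direct, var_direct, synthetic, var_model):
    """Composite estimator for a small domain: shrink the unstable
    direct survey estimate toward a model-based synthetic estimate.
    The weight on the direct estimate falls as its sampling variance
    grows relative to the model (between-domain) variance -- the idea
    behind Fay-Herriot small area estimators."""
    w = var_model / (var_model + var_direct)
    return w * direct + (1 - w) * synthetic
```

The trade-off the abstract describes is visible in the formula: as direct data collection shrinks, var_direct grows, the weight shifts toward the synthetic component, and the published figure inherits more model risk.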
This presentation characterizes some of the resulting trade-offs among cost structures, data quality, perceived value and optimization issues in small domain estimation. Four topics receive principal attention. First, we highlight several classes of direct and indirect costs incurred by the producers and users of small domain estimates. This leads to consideration of possible cost optimization for small domain estimation programs, which may include the costs of sample design features, access to auxiliary data sources, analytic resources and dissemination efforts. Second, we use the Brackstone (1999) framework of six components of data quality to review some statistical properties of direct design-based and model-based estimators for small domains, and to link these properties with related components of risk. Quality issues related to exploratory analysis and implicit multiple comparisons receive special attention. Third, we explore data users' perceptions of the value of published small domain estimates, and of costs incurred through decisions not to publish estimates for some subpopulations. We suggest that the data users' perceptions are similar to those reported in the general literature on adoption and diffusion of technology, and that this literature can offer some important insights into efficient integration of efforts by researchers, survey managers and data users. Fourth, we emphasize the importance of constraints in the administrative development and implementation of small domain estimation programs. We consider constraints on both the production processes and on the availability of information regarding costs and data quality. These constraints can often dominate the administrative decision process. This in turn suggests some mathematically rich classes of constrained optimization problems that would warrant further research.
Key Words: Adoption and diffusion of technology; Components of risk; Constraints; Data quality; Feedback loop; Multiple comparisons; Prospect theory; Small area estimation; Total survey error.
Research Highlights from the National Agricultural Statistics Service
Carol C. House
National Agricultural Statistics Service
The mission of the National Agricultural Statistics Service (NASS) is to provide timely, accurate and useful statistics in service to U. S. agriculture. Supporting this mission is a small but robust research and development division that tackles a diversity of complex problems. This paper provides highlights of ongoing research work including work in remote sensing and geographical information systems, cognitive studies of survey non-response and response incentives, and estimation enhancements for the upcoming Census of Agriculture.
Bayesian and Frequentist Methods for Provider Profiling Using Risk-Adjusted Assessments of Medical Outcomes
Michael Racz, New York State Department of Health
J. Sedransk, Department of Statistics, Case Western Reserve University
We propose a new method and compare conventional and Bayesian methodologies that are used or proposed for use for 'provider profiling,' an evaluation of the quality of health care. The conventional approaches to computing these provider assessments are to use likelihood-based frequentist methodologies, and the new Bayesian method is patterned after these. For each of three models we compare the frequentist and Bayesian approaches using the data employed by the New York State Department of Health for its annually released reports that profile hospitals permitted to perform coronary artery bypass graft surgery. Additional, constructed, data sets are used to sharpen our conclusions. With the advances of Markov chain Monte Carlo methods, Bayesian methods are easily implemented and are preferable to standard frequentist methods for models with a binary dependent variable since the latter always rely on asymptotic approximations.
Comparisons across methods associated with different models are important because of current proposals to use random effect (exchangeable) models for provider profiling. We also summarize and discuss important issues in the conduct of provider profiling such as inclusion of provider characteristics in the model and choice of criteria for determining unsatisfactory performance.
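For context, a conventional frequentist provider assessment of the kind compared in this work is often reported as a risk-adjusted mortality rate. The sketch below shows the usual observed-over-expected form, with the expected count taken from a patient-level risk model; it is illustrative, not the authors' exact computation.

```python
def risk_adjusted_rate(observed_deaths, expected_probs, statewide_rate):
    """Risk-adjusted mortality rate of the kind used in provider
    report cards: (observed / expected) x statewide rate, where the
    expected count is the sum of patient-level predicted mortality
    probabilities from a risk model (e.g., logistic regression)."""
    expected = sum(expected_probs)
    return (observed_deaths / expected) * statewide_rate
```

A hospital with twice as many deaths as its case mix predicts gets twice the statewide rate; the Bayesian alternative replaces the plug-in ratio and its asymptotic interval with a full posterior for each provider.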
Organizer and Chair: Brenda Ramirez, W.L. Gore & Associates
Producible, Robust Designs at GE Aviation
David Rumpf*, Gene Wiggs, Todd Williams and James Worachek
* GE Aviation
Until about 10 years ago, Design Engineering and Manufacturing at GE Aviation were separate organizations. Design engineers produced part designs for an integrated engine system and expected manufacturing to make and assemble the parts. Tolerance decisions were influenced by the arguing ability of each discipline along with historic precedent. Most form-fit-function characteristics were about 95% producible or 2 sigma designs. Non-conformance control by a Material Review Board (MRB) was used by Design Engineering to monitor manufacturing quality. Although this process demonstrated the ability to produce excellent engines, it depended on inspecting in quality, multiple rework loops and resulting high cost.
This paper will discuss the evolving process used at GE Aviation to design engines which are producible, robust to environmental and customer use variation and error proofed.
Discussion of the process will include: the organizational structure supporting the needed culture change; the impact of Six Sigma in providing common terminology and supporting data-driven decisions; the structured approach using manufacturing process capability data to facilitate producibility; the use of assembly defect and customer escape data to drive error-proofing early in the design process; the focus on standardized notes and automated characteristic accountability for error prevention; and the use of reliability and advanced optimization models to evaluate parameter-space tradeoffs and meet the multiple objectives of robust designs. Examples will demonstrate the significant improvements in quality and robustness accomplished by the new process.
Vapor Leak Detection for an Underground Storage Tank System
Department of Mathematical Sciences, Worcester Polytechnic Institute
In this talk, I shall present a mathematical model for the pressure inside the vapor space of an Underground Storage Tank (UST). Using mass balances based on dispensing activities, vapor recovery, evaporation, leaks, and the safety vent, the model results in a system of nonlinear differential equations with two unknown parameters, the leak rate and the evaporation rate. Given these rates, the pressure inside the UST can be obtained by solving the system of differential equations numerically. Conversely, given the pressure inside a UST over a period of time, one can find the evaporation and leak rates such that the calculated pressure from our model best fits the observed pressure data. The resulting optimal leak rate allows us to determine whether the UST system is leaking and whether the leak rate is above the EPA standard. Numerical results and statistical analysis of the model will also be presented.
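The actual system of nonlinear differential equations is not given in the abstract; the sketch below illustrates the fitting strategy described above on a deliberately simplified one-equation mass balance (a constant evaporation source and a leak proportional to gauge pressure), with forward-Euler integration and a grid search over the two rates. All names and the model form are assumptions for illustration.

```python
import numpy as np

def simulate_pressure(k_evap, k_leak, p0, t, p_atm=0.0):
    """Forward-Euler solution of a toy vapor-space mass balance:
    evaporation adds vapor at rate k_evap; a leak vents it at a
    rate proportional to the gauge pressure (p - p_atm)."""
    dt = t[1] - t[0]
    p = np.empty_like(t)
    p[0] = p0
    for i in range(1, len(t)):
        p[i] = p[i - 1] + dt * (k_evap - k_leak * (p[i - 1] - p_atm))
    return p

def fit_rates(t, observed, grid):
    """Grid search for the (k_evap, k_leak) pair whose simulated
    pressure trace best fits the observed record in least squares."""
    best, best_sse = None, np.inf
    for ke in grid:
        for kl in grid:
            sim = simulate_pressure(ke, kl, observed[0], t)
            sse = np.sum((sim - observed) ** 2)
            if sse < best_sse:
                best, best_sse = (ke, kl), sse
    return best
```

The recovered leak rate would then be compared against the regulatory threshold; in practice a proper optimizer and the full multi-equation model replace this grid search.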
Bayesian Predictive Inference for Multivariate Control Charting
Jai Won Choi*, Balgobin Nandram, and Brenda Ramírez
* Centers for Disease Control and Prevention (CDC)
Statistical process control has been applied with marginal success in the semiconductor industry. This may be, in part, due to a heavy reliance on univariate control chart practices when the quality of many production processes is characterized by a number of variables which may be highly correlated. When the data are correlated, the univariate charts are not as sensitive to out-of-control values as the multivariate charts which incorporate the correlation. It is, therefore, pertinent to use multivariate control charts to perform statistical process monitoring for such processes. Multivariate control charts can be used effectively to monitor the quality of complex processes with several critical variables simultaneously. However, when the covariance matrix has large dimensions in comparison to the number of runs available for parameter estimation, these charts can perform poorly. We incorporate prior information by using a covariance structure in which the number of parameters is reduced to just two. There are many semiconductor manufacturing processes that may induce a covariance structure that conforms to the parsimonious covariance that we are proposing (correlation that degrades with distance). Two examples are an LPCVD (low pressure chemical vapor deposition) reactor that is used for the deposition of polysilicon or nitride, and PECVD (plasma enhanced chemical vapor deposition) of nitride or oxide. We consider a passivation process for semiconductor manufacturing, where each of the variables represents a value at a specific location in a passivation tube, and because of the interaction between the plasma and the reactant gases flowing down the tube, the correlation among the variables might decay with distance between these locations. Moreover, the variability at the locations might be taken equal, further reducing the number of parameters.
We use a Bayesian method to construct the multivariate control chart, and a statistic, analogous to Hotelling's T2, is used for charting. The control limits are constructed using a Bayesian predictive inference, and the Metropolis-Hastings algorithm is used to perform the computations. Simulations show that our method can detect out-of-control observations more quickly than the classical multivariate approach. Also our method is robust against departures from the parsimonious covariance structure.
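A sketch of the two building blocks named above, under assumed forms: the two-parameter covariance with correlation decaying with distance between tube locations (taken here as an AR(1)-type structure) and a Hotelling-type T2 charting statistic. The Bayesian predictive control limits computed by Metropolis-Hastings are beyond this sketch.

```python
import numpy as np

def ar1_covariance(p, sigma2, rho):
    """Parsimonious two-parameter covariance: equal variances sigma2
    at all p locations, with correlation rho**d that decays with the
    distance d between locations."""
    idx = np.arange(p)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def t2_statistic(x, mean, cov):
    """Hotelling-type T^2 distance of one multivariate observation
    from the in-control mean, under covariance cov."""
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))
```

Observations whose T2 exceeds the control limit signal an out-of-control condition; the gain from the parsimonious covariance is that only two parameters, rather than p(p+1)/2, must be estimated from the limited number of runs.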
Quantitative Techniques to Evaluate Process Stability
Brenda Ramirez* and George Runger, W.L. Gore & Associates
Numerous measures have been proposed to quantify the capability of a process using statistics such as Cp, Cpk, and Cpm. There are even hypothesis tests to determine whether the calculated Cp or Cpk exceeds some constant. However, there are few quantitative summaries to use as a starting point in assessing the overall in-control performance of a process. In this paper, we propose three measures that can be used to assess the stability of a process and classify the process parameter as stable or unstable. Test criteria for each of the three measures are suggested, and their overall performance, under certain conditions, is studied. These concepts are demonstrated using three case studies.
Organizer: John McKenzie, Babson College
Chair: Katherine Halvorsen
What's Hot and What's Not in Statistical Education: A Panel Discussion
John McKenzie, Babson College
Katherine Halvorsen, Smith College
Joan Weinstein, Harvard University Extension School
This session will explain some recent developments in statistical education. First, John McKenzie will use the GAISE Project recommendations as the basis for what should be part of (and what shouldn't be part of) introductory statistics courses today. Then he will discuss recent advances in technology that enable students to better understand concepts and analyze data. He will also explain how technology allows instructors to do a better job of assessing their students. Second, Katherine Halvorsen will look at several proposed sets of national standards and make recommendations for the data analysis and probability that high school students need to learn. These standards are attempts at a coherent plan that integrates coursework from middle school through the last year of high school, because academic preparation for advanced study begins in middle school, as reported by the National Academy of Sciences in its 2002 report on Advanced Placement. Joan Weinstein will present an overview of distance education for statisticians. Among the topics she will discuss are how such courses are presented to students, how the instructor communicates with individual students, and how these students are assessed. She will compare such courses with traditionally delivered courses. There will be ample time for audience participation at the end of the session.
Organizer and Chair: Paul Gendron, Naval Research Laboratory
Bayesian Inference in Room-Acoustic Decay Analysis
School of Architecture, Rensselaer Polytechnic Institute
In architectural acoustics practice, a number of concert halls have incorporated secondary reverberation chambers coupled to the main floor because of their potential for creating desirable effects in meeting conflicting requirements between clarity and reverberance. As a result of their successful application, the acoustics of coupled spaces is drawing growing attention in the field of architectural acoustics. In contrast to essentially exponential decay in single-room spaces, sound energy in coupled spaces decays at multiple rates. As a result, it can often require considerable effort to analyze decay characteristics from acoustical measurements in these coupled spaces. Bayesian probability inference proves to be a powerful tool for evaluating decay times and relevant parameters in acoustically coupled spaces. Following a brief introduction to Bayesian probability inference, this talk will discuss recent applications of Bayesian parameter estimation and model selection in concert hall acoustics.
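As a concrete instance of the Bayesian parameter estimation described, the sketch below evaluates an unnormalized posterior over the two decay times of a double-exponential energy decay, the signature of a coupled space. The amplitudes, noise level, flat priors, and grid evaluation are illustrative assumptions, not the talk's actual machinery.

```python
import numpy as np

def double_decay(t, a1, tau1, a2, tau2):
    """Energy decay in a coupled space: two exponential decay rates
    instead of the single rate of an uncoupled room."""
    return a1 * np.exp(-t / tau1) + a2 * np.exp(-t / tau2)

def grid_posterior(t, data, tau1_grid, tau2_grid, a1=1.0, a2=0.1, noise=0.01):
    """Unnormalized posterior over the two decay times under a
    Gaussian likelihood and flat priors, evaluated on a grid --
    the simplest form of Bayesian decay-time estimation."""
    logp = np.empty((len(tau1_grid), len(tau2_grid)))
    for i, tau1 in enumerate(tau1_grid):
        for j, tau2 in enumerate(tau2_grid):
            r = data - double_decay(t, a1, tau1, a2, tau2)
            logp[i, j] = -0.5 * np.sum(r ** 2) / noise ** 2
    p = np.exp(logp - logp.max())
    return p / p.sum()
```

Model selection between single-rate and multi-rate decay proceeds in the same framework by comparing the evidence of the competing models.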
The Effect of Random Processes on Acoustic Propagation in the Ocean
Stefanie Wojcik*, William W. Durgin, Frank Weber, Tatiana Andreeva
*Worcester Polytechnic Institute
Using a predictive, ray-based methodology for received signal variation as a function of ocean perturbations, the effects of ocean internal waves as well as ocean turbulence on underwater acoustic wave propagation are analyzed. In the present formulation the eikonal equations are considered in the form of a second order, nonlinear ordinary differential equation with harmonic excitation due to an internal wave. The harmonic excitation is taken to be imperfect, i.e., with a random phase modulation due to Gaussian white noise, simulating both the chaotic and stochastic effects of internal waves. Mesoscale turbulence is represented using a potential theory two-wavenumber model for a vortex array. The focus of the paper is to numerically study the influence of these fluid velocity fluctuations and the role of initial ray angle on underwater acoustic propagation to provide a realistic characterization of acoustic arrivals. Predicted arrival behavior is analyzed using ray trace, time-front and phase plots for varying initial conditions. The regions of instability are identified using the bifurcation and phase diagrams. The purpose of this work is both to understand how turbulence and internal waves affect sound transmission and to utilize the statistics of received signals to identify fluid-mechanic phenomena that occurred along the sound channel. The simulations indicate that it is possible to distinguish the effects of internal waves from turbulence using spectra of travel time variations.
A computationally reduced estimator of mutual information for the underwater acoustic channel
Paul J. Gendron
Naval Research Laboratory
Estimates of upper and lower bounds on the information rates attainable for acoustic communications with white Gaussian sources under uncertain channel conditions are presented that account for finite-duration signaling and finite coherence time. The approach is useful in guiding the selection of coding strategies based on available probe data or on propagation modeling. For the case where the coherence time of the channel exceeds the packet duration, the estimators are computationally efficient via the Levinson recursion and provide a means of tuning the accuracy of the estimates to fit a given computational budget. The reduction in the information rate relative to the known-channel case is quantified in terms associated with the coherence time of the channel, the posterior covariance of the channel, and the prior covariance of the channel. For scenarios where the prior channel covariance is known, exact bounds on the information rates are computable; otherwise the estimators provide confidence intervals for these rates. The estimators are tested on 18-kHz at-sea data collected in shallow water north of Elba, Italy, on a 1.8-m aperture vertical array. (Work supported by the Office of Naval Research.)
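The paper's vector formulation cannot be reconstructed from the abstract; the sketch below illustrates, for a single-tap scalar channel, the standard way channel uncertainty reduces an information-rate lower bound: the channel-estimation error is treated as additional Gaussian noise. Function names and the scalar simplification are assumptions.

```python
import numpy as np

def known_channel_rate(h, power, noise_var):
    """Mutual information (bits per symbol) of a scalar Gaussian
    channel y = h*x + n when the gain h is known exactly."""
    return 0.5 * np.log2(1 + (abs(h) ** 2) * power / noise_var)

def uncertain_channel_lower_bound(h_hat, err_var, power, noise_var):
    """Lower bound when only an estimate h_hat is available: the
    estimation error (variance err_var, e.g. a posterior variance
    from probe data) acts as extra noise, shrinking the rate."""
    eff_noise = noise_var + err_var * power
    return 0.5 * np.log2(1 + (abs(h_hat) ** 2) * power / eff_noise)
```

As the posterior variance of the channel shrinks toward zero (long coherence time, good probes), the lower bound closes on the known-channel rate, which is the qualitative behavior the abstract quantifies.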