RMME/STAT Joint Colloquium
Modeling Coarsened Categorical Variables: Techniques and Software
Dr. Joseph L. Schafer
U.S. Census Bureau
Friday, March 24, at 11AM ET
Coarsened data can express intermediate states of knowledge between fully observed and fully missing. For example, when classifying survey respondents by cigarette smoking behavior as 1=never smoked, 2=former smoker, or 3=current smoker, we may encounter some who reported having smoked in the past but whose current activity is unknown (either 2 or 3, but not 1). Software for categorical data modeling typically provides codes for missing values but lacks convenient ways to convey states of partial knowledge. A new R package, cvam (Coarsened Variable Modeling), extends R’s implementation of categorical variables (factors) and fits log-linear and latent-class models to incomplete datasets containing coarsened and missing values. Methods include maximum likelihood estimation using an expectation-maximization algorithm, approximate Bayesian inference, and Bayesian inference via Markov chain Monte Carlo. Functions are also provided for comparing models, predicting missing values, creating multiple imputations, and generating partially or fully synthetic data. In the first major application of this software, data from the U.S. Decennial Census and administrative records were combined to predict citizenship status for 309 million residents of the United States.
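To make the smoking example concrete, the following is a minimal sketch in Python, not the cvam package or its API. It assumes that each response is recorded as the set of categories it could belong to, and it estimates the category probabilities for a single variable with a simple expectation-maximization loop. The function name em_coarsened and the counts in sample are hypothetical, chosen only for illustration.

```python
from collections import Counter

# 1=never smoked, 2=former smoker, 3=current smoker
CATEGORIES = (1, 2, 3)


def em_coarsened(observations, categories=CATEGORIES, n_iter=200, tol=1e-10):
    """Estimate category probabilities from coarsened observations.

    Each observation is the set of categories the respondent might
    belong to: {2} is fully observed, {2, 3} is coarsened, and
    {1, 2, 3} is fully missing. (Illustrative helper, not cvam.)
    """
    # Start from a uniform distribution over the categories.
    probs = {c: 1.0 / len(categories) for c in categories}
    # Group identical response patterns so each is processed once.
    patterns = Counter(frozenset(obs) for obs in observations)
    n = sum(patterns.values())

    for _ in range(n_iter):
        # E-step: split each pattern's count across its possible
        # categories in proportion to the current probabilities.
        expected = {c: 0.0 for c in categories}
        for pattern, count in patterns.items():
            mass = sum(probs[c] for c in pattern)
            for c in pattern:
                expected[c] += count * probs[c] / mass
        # M-step: the new probabilities are the expected proportions.
        new_probs = {c: expected[c] / n for c in categories}
        change = max(abs(new_probs[c] - probs[c]) for c in categories)
        probs = new_probs
        if change < tol:
            break
    return probs


# Hypothetical sample: 50 never, 30 former, 10 current, and
# 10 respondents known to have smoked but with unknown current status.
sample = [{1}] * 50 + [{2}] * 30 + [{3}] * 10 + [{2, 3}] * 10
estimates = em_coarsened(sample)
print(estimates)
```

In this sketch the ten coarsened responses are divided between categories 2 and 3 according to the current estimates rather than being discarded or treated as fully missing, so the information that those respondents are not in category 1 is retained.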