Title: | Variable Selection Using a Smooth Information Criterion |
---|---|
Description: | Implementation of the SIC epsilon-telescope method, using either single or distributional (multiparameter) regression. Includes classical regression with normally distributed errors and robust regression, where the errors are from the Laplace distribution. The "smooth generalized normal distribution" is used, where the estimation of an additional shape parameter allows the user to move smoothly between both types of regression. See O'Neill and Burke (2022), "Robust Distributional Regression with Automatic Variable Selection", for more details <arXiv:2212.07317>. This package also contains the data analyses from O'Neill and Burke (2023), "Variable selection using a smooth information criterion for distributional regression models" <doi:10.1007/s11222-023-10204-8>. |
Authors: | Meadhbh O'Neill [aut, cre], Kevin Burke [aut] |
Maintainer: | Meadhbh O'Neill <[email protected]> |
License: | GPL-3 |
Version: | 1.2.0 |
Built: | 2024-11-18 06:26:14 UTC |
Source: | https://github.com/meadhbh-oneill/smoothic |
Original data, which come from a study by Harrison Jr and Rubinfeld (1978), examining
the association between median house prices in a particular community and
various community characteristics. See bostonhouseprice2 for the corrected version,
which includes additional variables.
bostonhouseprice
A data frame with 506 rows and 9 variables:
crimes committed per capita
average number of rooms per house
index of accessibility to radial highways
average student-teacher ratio of schools in the community
percentage of the population that are "lower status"
log(annual average nitrogen oxide concentration (pphm))
log(property tax per $1000)
log(weighted distances to five employment centres in the Boston region)
log(median house price ($))
https://CRAN.R-project.org/package=wooldridge
Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81-102.
Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Cengage learning.
Corrected data, which come from a study by Harrison Jr and Rubinfeld (1978), examining
the association between median house prices in a particular community and
various community characteristics. See bostonhouseprice for the original version.
bostonhouseprice2
A data frame with 506 rows and 13 variables:
per capita crime rate by town
proportion of residential land zoned for lots over 25,000 sq.ft
proportion of non-retail business acres per town
average number of rooms per dwelling
proportion of owner-occupied units built prior to 1940
index of accessibility to radial highways
pupil-teacher ratio by town
log(nitric oxides concentration (parts per 10 million))
log(weighted distances to five Boston employment centres)
log(full-value property-tax rate per USD 10,000)
log(percentage of lower status of the population)
Charles River dummy variable (=1 if tract bounds river; 0 otherwise)
log(corrected median value of owner-occupied homes in USD 1000's)
https://CRAN.R-project.org/package=mlbench
Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81-102.
Leisch F, Dimitriadou E (2021). mlbench: Machine Learning Benchmark Problems. R package version 2.1-3.
Data relating to crime rates per one million residents in 50 U.S. cities, taken from Thomas (1990).
citycrime
A data frame with 50 rows and 7 variables:
reported violent crime rate per 100,000 residents
annual police funding per resident ($)
percentage of people 25 years+ with 4 years of high school
percentage of 16 to 19 year-olds not in high school and not high school graduates
percentage of 18 to 24 year-olds in college
percentage of people 25 years+ with at least 4 years of college
total overall reported crime rate per 1 million residents
https://hastie.su.domains/StatLearnSparsity_files/DATA/crime.txt
Thomas, G.S., 1990. The Rating Guide to Life in America's Small Cities. Prometheus Books, 59 John Glenn Drive, Amherst, NY 14228-2197.
Hastie, T., Tibshirani, R. and Wainwright, M., 2015. Statistical learning with sparsity: the lasso and generalizations. CRC press.
Data relating to a study of disease progression one year after baseline.
diabetes
A data frame with 442 rows and 11 variables:
age of the patient
sex of the patient
body mass index of the patient
blood pressure of the patient
blood serum measurement 1
blood serum measurement 2
blood serum measurement 3
blood serum measurement 4
blood serum measurement 5
blood serum measurement 6
quantitative measure of disease progression one year after baseline
https://CRAN.R-project.org/package=lars
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004). Least angle regression. The Annals of Statistics.
Data, which come from a study by Stamey et al. (1989), examining the correlation between the level of prostate-specific antigen (PSA) and various clinical measures in men who were about to receive a radical prostatectomy.
pcancer
A data frame with 97 rows and 9 variables:
log(cancer volume (cm^3))
log(prostate weight (g))
age of the patient
log(amount of benign prostatic hyperplasia (cm^2))
presence of seminal vesicle invasion (1=yes, 0=no)
log(capsular penetration (cm))
Gleason score
percentage of Gleason scores four or five
log(PSA (ng/mL))
https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data
Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., and Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. ii. radical prostatectomy treated patients. The Journal of urology, 141(5):1076-1083.
This function plots the model-based conditional density curves for
different effect combinations. For example, take a particular covariate that is selected
in the final model. The other selected covariates are fixed at their median values by default
(see covariate_fix to fix them at other values), and the plotted red and blue densities
then correspond to setting the chosen covariate to “low” (25th percentile by default) and
“high” (75th percentile by default) values.
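The covariate settings described above can be sketched in base R. This is only an illustration of the idea, not the package's internal code: the data frame `covs`, the name `chosen` and the `scenarios` table are hypothetical, and `stats::quantile()` with its default type is assumed for the "low" and "high" values.

```r
# Hypothetical toy data standing in for the covariates selected in the final model
covs <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  x2 = c(10, 20, 30, 40, 50, 60, 70, 80, 90)
)

p <- c(0.25, 0.75)          # same default probabilities as plot_effects()
chosen <- "x1"              # covariate whose effect is examined
others <- setdiff(names(covs), chosen)

# Fix the remaining covariates at their medians
fixed <- vapply(covs[others], median, numeric(1))

# "Low" and "high" values of the chosen covariate
low_high <- quantile(covs[[chosen]], probs = p, names = FALSE)

# One row per density curve that would be drawn
scenarios <- data.frame(setting = c("low", "high"))
scenarios[[chosen]] <- low_high
for (nm in others) scenarios[[nm]] <- fixed[[nm]]
scenarios
```

Each row of `scenarios` is one covariate profile; the fitted model would then supply the location and scale of the density drawn for that profile.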
plot_effects(
  obj,
  what = "all",
  show_average_indiv = TRUE,
  p = c(0.25, 0.75),
  covariate_fix,
  density_range
)
obj |
An object of class “smoothic”. |
what |
The covariate effects to be plotted, default is "all". |
show_average_indiv |
Should a “baseline” or “average” individual be shown?
Default is TRUE. |
p |
The probabilities given to the |
covariate_fix |
Optional values to fix the covariates at that are chosen in the final model. When not supplied, the covariates are fixed at their median values. See the example for more detail. |
density_range |
Optional range for which the density curves should be plotted. |
A plot of the conditional density curves.
Meadhbh O'Neill
# Sniffer Data --------------------
# MPR Model ----
results <- smoothic(
  formula = y ~ .,
  data = sniffer,
  family = "normal",
  model = "mpr"
)
plot_effects(results)

# Only plot gastemp and gaspres
# Do not show the average individual plot
# Plot the lower and upper density curves using the 10th percentile (lower)
# and the 90th percentile (upper)
# Fix gastemp to 70 and gaspres to 4
plot_effects(results,
  what = c("gastemp", "gaspres"),
  show_average_indiv = FALSE,
  p = c(0.1, 0.9),
  covariate_fix = c("gastemp" = 70, "gaspres" = 4)
)

# The curves for the gastemp variable are computed by fixing gaspres = 4 (as is specified
# in the input). The remaining variables that are not specified in covariate_fix are fixed
# to their median values (i.e., tanktemp is fixed at its median). gastemp is then modified
# to be low (10th percentile) and high (90th percentile), as specified by p in the function.
ε-telescope coefficient paths
This function plots the standardized coefficient values with respect
to the ε-telescope for the location (and dispersion) components.
plot_paths(
  obj,
  log_scale_x = TRUE,
  log_scale_x_pretty = TRUE,
  facet_scales = "fixed"
)
obj |
An object of class “smoothic”. |
log_scale_x |
Default is TRUE. |
log_scale_x_pretty |
Default is TRUE. |
facet_scales |
Default is "fixed". |
A plot of the standardized coefficient values through the ε-telescope.
Meadhbh O'Neill
# Sniffer Data --------------------
# MPR Model ----
results <- smoothic(
  formula = y ~ .,
  data = sniffer,
  family = "normal",
  model = "mpr"
)
plot_paths(results)
predict method for class “smoothic”
## S3 method for class 'smoothic'
predict(object, newdata, ...)
object |
an object of class “smoothic”. |
newdata |
new data object |
... |
further arguments passed to or from other methods. |
a matrix containing the predicted values for the location mu and scale s
Meadhbh O'Neill
# Sniffer Data --------------------
# MPR Model ----
results <- smoothic(
  formula = y ~ .,
  data = sniffer,
  family = "normal",
  model = "mpr"
)
predict(results)
Implements the SIC ε-telescope method, using either
single or multiparameter regression. Returns estimated coefficients, estimated
standard errors and the value of the penalized likelihood function.
Note that the function scales the predictors to have unit variance; however,
the final estimates are converted back to their original scale.
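The scaling and back-conversion mentioned above can be illustrated with a small base R sketch. It uses ordinary least squares via `lm()` as a stand-in for the penalized fit and simulated data, so it shows the general idea only; the exact standardization performed inside `smoothic()` is an assumption here beyond "unit variance".

```r
set.seed(1)
n <- 50

# Simulated predictors on very different scales (hypothetical data)
x <- cbind(x1 = rnorm(n, sd = 5), x2 = rnorm(n, sd = 0.2))
y <- 1 + 2 * x[, "x1"] - 3 * x[, "x2"] + rnorm(n)

# Scale each predictor to unit variance (no centering in this sketch)
sds <- apply(x, 2, sd)
x_scaled <- sweep(x, 2, sds, "/")

# Fit on the scaled predictors, then convert slopes back to the original scale
fit_scaled <- lm(y ~ x_scaled)
beta_scaled <- coef(fit_scaled)[-1]
beta_original <- beta_scaled / sds

# Direct fit on the unscaled predictors for comparison
fit_direct <- lm(y ~ x)
cbind(back_converted = unname(beta_original),
      direct = unname(coef(fit_direct)[-1]))
```

Dividing each slope by the standard deviation of its predictor recovers the coefficient on the original scale, which is why the user can supply unstandardized data and still read the estimates in the original units.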
smoothic(
  formula,
  data,
  family = "sgnd",
  model = "mpr",
  lambda = "log(n)",
  epsilon_1 = 10,
  epsilon_T = 1e-04,
  steps_T = 100,
  zero_tol = 1e-05,
  max_it = 10000,
  kappa,
  tau,
  max_it_vec,
  stepmax_nlm
)
formula |
An object of class |
data |
A data frame containing the variables in the model; the data frame should be unstandardized. |
family |
The family of the model, default is "sgnd". |
model |
The type of regression to be implemented, either |
lambda |
Value of penalty tuning parameter. Suggested values are
|
epsilon_1 |
Starting value for the ε-telescope. Defaults to 10. |
epsilon_T |
Final value for the ε-telescope. Defaults to 1e-04. |
steps_T |
Number of steps in the ε-telescope. Defaults to 100. |
zero_tol |
Coefficients below this value are treated as being zero.
Defaults to 1e-05. |
max_it |
Maximum number of iterations to be performed before the
optimization is terminated. Defaults to 10000. |
kappa |
Optional user-supplied positive kappa value (> 0.2 to avoid
computational issues) if |
tau |
Optional user-supplied positive smoothing parameter value in the
"Smooth Generalized Normal Distribution" if |
max_it_vec |
Optional vector of length |
stepmax_nlm |
Optional maximum allowable scaled step length (positive scalar) to be passed to
|
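To give a feel for how epsilon_1, epsilon_T and steps_T interact, the following base R sketch builds a decreasing sequence of ε values and evaluates a smooth approximation of the L0 norm along it. Two things are assumptions made for illustration rather than statements about the package internals: the geometric (log-linear) spacing of the sequence, and the penalty form θ²/(θ² + ε²) for the smooth L0 term. The helper `smooth_l0` and the vector `theta` are hypothetical names.

```r
# Default telescope settings taken from the smoothic() usage
epsilon_1 <- 10
epsilon_T <- 1e-04
steps_T <- 100

# Assumption: the telescope decreases geometrically from epsilon_1 to epsilon_T
eps_seq <- exp(seq(log(epsilon_1), log(epsilon_T), length.out = steps_T))

# Assumption: smooth approximation of the L0 norm for a coefficient theta
smooth_l0 <- function(theta, eps) theta^2 / (theta^2 + eps^2)

# Hypothetical coefficient vector containing one exact zero
theta <- c(0, 0.5, 2)

# Approximate count of non-zero coefficients at the start and end of the telescope
penalty_start <- sum(smooth_l0(theta, eps_seq[1]))
penalty_end <- sum(smooth_l0(theta, eps_seq[steps_T]))
c(start = penalty_start, end = penalty_end)
```

With a large ε the penalty is nearly flat and easy to optimize; as ε shrinks, the sum approaches the number of non-zero coefficients, which is what makes a warm-started sequence of fits a natural way to reach a sparse solution.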
A list with estimates and estimated standard errors.
coefficients - vector of coefficients.
see - vector of estimated standard errors.
model - the matched type of model which is called.
plike - value of the penalized likelihood function.
kappa - value of the estimated/fixed shape parameter kappa if family = "sgnd".
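As a rough illustration of the returned list, the sketch below fits the model from the package example and reads the components named above. It assumes the smoothic package is installed (the guard skips the code otherwise); the object names `est`, `se` and `pl` are chosen here for illustration only.

```r
# Run only if the smoothic package is available
if (requireNamespace("smoothic", quietly = TRUE)) {
  library(smoothic)

  # Same model as in the package example
  results <- smoothic(
    formula = y ~ .,
    data = sniffer,
    family = "normal",
    model = "mpr"
  )

  est <- results$coefficients  # estimated coefficients
  se <- results$see            # estimated standard errors
  pl <- results$plike          # penalized likelihood value

  print(est)
  print(se)
  print(pl)
  print(results$model)         # matched model type
}
```

Inspecting these components directly can be handy when the estimates feed into further computations rather than a printed summary.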
Meadhbh O'Neill
O'Neill, M. and Burke, K. (2023) Variable selection using a smooth information criterion for distributional regression models. <doi:10.1007/s11222-023-10204-8>
O'Neill, M. and Burke, K. (2022) Robust Distributional Regression with Automatic Variable Selection. <arXiv:2212.07317>
# Sniffer Data --------------------
# MPR Model ----
results <- smoothic(
  formula = y ~ .,
  data = sniffer,
  family = "normal",
  model = "mpr"
)
summary(results)
Data examining the factors that impact the amount of hydrocarbon vapour released when gasoline is pumped into a tank.
sniffer
A data frame with 125 rows and 5 variables:
initial tank temperature (degrees F)
temperature of the dispensed gasoline (degrees F)
initial vapour pressure in the tank (psi)
vapour pressure of the dispensed gasoline (psi)
hydrocarbons emitted (g)
https://CRAN.R-project.org/package=alr4
Bedrick, E.J. (2000). Checking for lack of fit in linear models with parametric variance functions. Technometrics 42 (3), 226–236.
Weisberg, S. (2014). Applied Linear Regression, 4th edition. Hoboken NJ: Wiley.
summary method for class “smoothic”
## S3 method for class 'smoothic'
summary(object, ...)
object |
an object of class “smoothic”. |
... |
further arguments passed to or from other methods. |
A list containing the following components:
model - the matched model from the smoothic object.
coefmat - a typical coefficient matrix whose columns are the
estimated regression coefficients, estimated standard errors (SEE) and p-values.
plike - value of the penalized likelihood function.
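The components listed above can be pulled out of the summary object as in the sketch below. It assumes the smoothic package is installed (the guard skips the code otherwise) and that the components are accessible with `$`, as is usual for a list; the name `sm` is hypothetical.

```r
# Run only if the smoothic package is available
if (requireNamespace("smoothic", quietly = TRUE)) {
  library(smoothic)

  results <- smoothic(
    formula = y ~ .,
    data = sniffer,
    family = "normal",
    model = "mpr"
  )

  sm <- summary(results)

  print(sm$coefmat)  # coefficient matrix: estimates, SEE and p-values
  print(sm$plike)    # penalized likelihood value
  print(sm$model)    # matched model
}
```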
Meadhbh O'Neill
# Sniffer Data --------------------
# MPR Model ----
results <- smoothic(
  formula = y ~ .,
  data = sniffer,
  family = "normal",
  model = "mpr"
)
summary(results)