| Title: | Survey Data Cleaning, Weighting and Analysis |
|---|---|
| Description: | Provides utilities for cleaning survey data, computing weights, and performing descriptive statistical analysis. Methods follow Lohr (2019, ISBN:978-0367272454) "Sampling: Design and Analysis" and Lumley (2010) <doi:10.1002/9780470580066>. |
| Authors: | Muhammad Ali [aut, cre] |
| Maintainer: | Muhammad Ali <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.3 |
| Built: | 2026-05-23 07:08:59 UTC |
| Source: | https://github.com/cran/SurveyStat |

This function applies survey weights by creating a weighted version of the dataset. The weights are normalized to sum to the sample size for computational stability.

apply_weights(data, weight_col)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| weight_col | Character string specifying column name containing weights |

A data.frame with normalized weights

data <- data.frame(age = c(25, 30, 35), weight = c(1.2, 0.8, 1.0))
weighted_data <- apply_weights(data, "weight")
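
The normalization described for apply_weights can be illustrated in base R. This is a minimal sketch and not the package's own code: `normalize_weights` is a hypothetical helper, and the check for non-negative weights is an assumption.

```r
# Hypothetical helper (not part of SurveyStat): rescale weights so that
# they sum to the number of observations.
normalize_weights <- function(w) {
  stopifnot(is.numeric(w), all(w >= 0), sum(w) > 0)
  w * length(w) / sum(w)
}

w <- c(2, 1, 1)
w_norm <- normalize_weights(w)
w_norm       # 1.50 0.75 0.75
sum(w_norm)  # 3, the sample size
```

Rescaling by a constant leaves weighted means and proportions unchanged; only the scale of weighted totals is affected.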

This function handles missing values using the specified imputation method. It supports mean, median, and mode imputation for numeric variables.

clean_missing(data, col, method = c("mean", "median", "mode"))

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col | Character string specifying column name to clean |
| method | Character string specifying imputation method ("mean", "median", or "mode") |

A data.frame with missing values imputed

data <- data.frame(age = c(25, NA, 30, NA, 35))
clean_data <- clean_missing(data, "age", method = "mean")
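
As a rough illustration of single-value imputation, the sketch below fills missing entries of a numeric vector with the mean, median, or mode of the observed values. `impute_value` is a hypothetical helper, and the tie-breaking rule for the mode (smallest value wins) is an assumption; the package may behave differently.

```r
# Hypothetical helper (not part of SurveyStat): fill NA values in a
# numeric vector with a single summary of the observed values.
impute_value <- function(x, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  observed <- x[!is.na(x)]
  fill <- switch(method,
    mean = mean(observed),
    median = median(observed),
    mode = {
      counts <- table(observed)
      # Assumption: ties go to the smallest of the most frequent values.
      as.numeric(names(counts)[which.max(counts)])
    }
  )
  x[is.na(x)] <- fill
  x
}

age <- c(25, NA, 30, NA, 35)
age_mean <- impute_value(age, "mean")      # 25 30 30 30 35
age_median <- impute_value(age, "median")  # 25 30 30 30 35
```

Single-value imputation shrinks the variance of the imputed column, so it suits quick descriptive work better than inference.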

This function creates a cross-tabulation between two categorical variables and performs a chi-square test of independence. It can optionally incorporate survey weights.

cross_tabulation(data, col1, col2, weight_col = NULL)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col1 | Character string specifying first categorical variable |
| col2 | Character string specifying second categorical variable |
| weight_col | Character string specifying column name containing weights (optional) |

A list containing cross-tabulation and chi-square test results

data <- data.frame(gender = c("M", "F", "M", "F"), education = c("HS", "College", "HS", "College"))
cross_tab <- cross_tabulation(data, "gender", "education")
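
The building blocks of a cross-tabulation with a chi-square test are available in base R. The sketch below shows an unweighted table, a weighted table, and a Pearson test on the unweighted counts. The data are made up for illustration, and how cross_tabulation itself treats weights in the test is not stated here, so none of this should be read as its implementation.

```r
# Illustrative data (invented for this sketch).
survey <- data.frame(
  gender = c("M", "F", "M", "F", "M", "F", "M", "F"),
  education = c("HS", "College", "HS", "HS",
                "College", "College", "HS", "College"),
  weight = c(1, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7)
)

# Unweighted counts and weighted counts (sum of weights per cell).
unweighted <- table(survey$gender, survey$education)
weighted <- xtabs(weight ~ gender + education, data = survey)

# Pearson chi-square test of independence on the unweighted counts.
# Small expected counts trigger a warning, suppressed here.
test <- suppressWarnings(chisq.test(unweighted))
test$p.value
```

A Pearson test applied directly to weighted counts ignores the sampling design; design-based corrections such as Rao-Scott (for example `svychisq()` in the survey package) are the usual alternative.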

This function provides a comprehensive description of survey data including sample size, variable types, missing value patterns, and basic statistics. It incorporates survey weights when a weight column is provided.

describe_survey(data, weight_col = NULL)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| weight_col | Character string specifying column name containing weights (optional) |

A list containing descriptive statistics

data <- data.frame(
  age = c(25, 30, 35),
  gender = c("M", "F", "M"),
  weight = c(1.2, 0.8, 1.0)
)
desc <- describe_survey(data)
desc_weighted <- describe_survey(data, "weight")
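
A few of the quantities mentioned above (sample size, variable types, missing values per column) can be gathered with base R. The list layout below is an assumption made for illustration; the structure actually returned by describe_survey is not documented here.

```r
# Illustrative data with some missing values.
survey <- data.frame(
  age = c(25, NA, 35),
  gender = c("M", "F", NA),
  weight = c(1.2, 0.8, 1.0)
)

# Hypothetical summary layout (not the package's return value).
summary_list <- list(
  n = nrow(survey),
  types = vapply(survey, function(col) class(col)[1], character(1)),
  n_missing = colSums(is.na(survey))
)
summary_list$n_missing  # age: 1, gender: 1, weight: 0
```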
A small example dataset used to demonstrate SurveyStat functions.
example_survey
A data frame with 10 rows and 5 variables:
Numeric age of respondent
Gender of respondent (Male/Female)
Education level (High School/Bachelor/Graduate)
Numeric income value
Survey weight
Simulated data for demonstration purposes

This function creates a frequency table for a categorical variable, optionally incorporating survey weights.

frequency_table(data, col, weight_col = NULL)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col | Character string specifying column name for categorical variable |
| weight_col | Character string specifying column name containing weights (optional) |

A data.frame with frequency statistics

data <- data.frame(gender = c("M", "F", "M", "F"), weight = c(1, 1.2, 0.8, 1.1))
freq_table <- frequency_table(data, "gender")
weighted_freq <- frequency_table(data, "gender", "weight")
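
A weighted frequency is the sum of the weights in each category, and the weighted percentage is that sum divided by the total weight. The sketch below shows this in base R; `weighted_freq_sketch` and its output column names are hypothetical and need not match what frequency_table returns.

```r
# Hypothetical helper (not part of SurveyStat): frequencies as sums of
# weights per category; unit weights give plain counts.
weighted_freq_sketch <- function(x, w = rep(1, length(x))) {
  totals <- tapply(w, x, sum)
  data.frame(
    category = names(totals),
    frequency = as.numeric(totals),
    percent = 100 * as.numeric(totals) / sum(totals),
    row.names = NULL
  )
}

gender <- c("M", "F", "M", "F")
weight <- c(1, 1.2, 0.8, 1.1)
counts <- weighted_freq_sketch(gender)            # F: 2, M: 2
weighted <- weighted_freq_sketch(gender, weight)  # F: 2.3, M: 1.8
```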

This function creates a clean, publication-quality box plot for numeric variables, optionally grouped by a categorical variable.

plot_boxplot(data, col, group_col = NULL, add_points = TRUE)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col | Character string specifying column name for numeric variable |
| group_col | Character string specifying column name for grouping variable (optional) |
| add_points | Logical; whether to add individual data points (default: TRUE) |

A ggplot object

data <- data.frame(age = c(25, 30, 35, 40, 45), gender = c("M", "F", "M", "F", "M"))
box_plot <- plot_boxplot(data, "age")
grouped_box <- plot_boxplot(data, "age", "gender")

This function creates a clean, publication-quality histogram for numeric variables using ggplot2, with a minimal theme and an optional density overlay.

plot_histogram(data, col, bins = 30, add_density = TRUE)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col | Character string specifying column name for numeric variable |
| bins | Number of bins for histogram (default: 30) |
| add_density | Logical; whether to add density curve (default: TRUE) |

A ggplot object

data <- data.frame(age = rnorm(100, 35, 10))
hist_plot <- plot_histogram(data, "age")
print(hist_plot)

This function creates a bar plot for categorical variables, optionally using survey weights to show weighted frequencies.

plot_weighted_bar(data, col, weight_col = NULL, show_percentages = TRUE)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col | Character string specifying column name for categorical variable |
| weight_col | Character string specifying column name containing weights (optional) |
| show_percentages | Logical; whether to show percentage labels (default: TRUE) |

A ggplot object

data <- data.frame(gender = c("M", "F", "M", "F"), weight = c(1, 1.2, 0.8, 1.1))
bar_plot <- plot_weighted_bar(data, "gender")
weighted_bar <- plot_weighted_bar(data, "gender", "weight")

This function implements simple raking (iterative proportional fitting) to adjust survey weights to match known population marginal totals. Assumes two-dimensional raking for simplicity.

rake_weights(data, population_targets, weight_col = "weight")

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| population_targets | Named list with population totals for each variable |
| weight_col | Character string specifying initial weight column name |

A data.frame with raked weights

# Assuming we have gender and education population totals
targets <- list(
  gender = c(Male = 1000000, Female = 1050000),
  education = c(HighSchool = 800000, Bachelor = 900000, Graduate = 350000)
)
data <- data.frame(
  gender = c("Male", "Female", "Male", "Female", "Male"),
  education = c("HighSchool", "Bachelor", "Bachelor", "HighSchool", "Graduate"),
  weight = c(1, 1, 1, 1, 1)
)
raked_data <- rake_weights(data, targets, "weight")
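
Raking alternates between the margins: for each variable in turn, every weight is multiplied by the ratio of the target total to the current weighted total of its category, and the cycle repeats until the weights stop changing. The base R sketch below follows that idea. `rake_sketch`, `max_iter`, and `tol` are hypothetical names and defaults, not part of rake_weights.

```r
# Hypothetical helper (not part of SurveyStat): iterative proportional
# fitting over the variables named in `targets`.
rake_sketch <- function(data, targets, weight_col = "weight",
                        max_iter = 100, tol = 1e-10) {
  w <- as.numeric(data[[weight_col]])
  for (iter in seq_len(max_iter)) {
    w_old <- w
    for (v in names(targets)) {
      groups <- as.character(data[[v]])
      stopifnot(all(groups %in% names(targets[[v]])))
      current <- tapply(w, groups, sum)
      # Scale each category so its weighted total matches the target.
      ratio <- as.numeric(targets[[v]][groups]) / as.numeric(current[groups])
      w <- w * ratio
    }
    # Stop once a full cycle changes the weights only negligibly.
    if (max(abs(w - w_old)) < tol * max(w)) break
  }
  data[[weight_col]] <- w
  data
}

targets <- list(
  gender = c(Male = 1000000, Female = 1050000),
  education = c(HighSchool = 800000, Bachelor = 900000, Graduate = 350000)
)
survey <- data.frame(
  gender = c("Male", "Female", "Male", "Female", "Male"),
  education = c("HighSchool", "Bachelor", "Bachelor", "HighSchool", "Graduate"),
  weight = c(1, 1, 1, 1, 1)
)
raked <- rake_sketch(survey, targets)
gender_totals <- tapply(raked$weight, raked$gender, sum)
education_totals <- tapply(raked$weight, raked$education, sum)
```

The targets for each variable should sum to the same population total, as they do here (2,050,000); otherwise the margins cannot all be matched at once.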

This function identifies and removes duplicate rows based on all columns. Preserves the first occurrence of each duplicate.

remove_duplicates(data)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |

A data.frame with duplicates removed

data <- data.frame(id = c(1, 2, 2, 3), age = c(25, 30, 30, 35))
clean_data <- remove_duplicates(data)
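
The same behaviour, keeping the first occurrence of each fully identical row, can be reproduced with base R's `duplicated()`. This is an equivalent sketch under that reading of the description, not the function's source.

```r
survey <- data.frame(id = c(1, 2, 2, 3), age = c(25, 30, 30, 35))

# duplicated() flags rows that repeat an earlier row across all columns.
is_dup <- duplicated(survey)                # FALSE FALSE TRUE FALSE
deduped <- survey[!is_dup, , drop = FALSE]  # keeps original rows 1, 2, 4
nrow(deduped)                               # 3
```

The surviving rows keep their original row names; `rownames(deduped) <- NULL` resets them if a clean index is needed.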

This function standardizes categorical variables by mapping values to standardized categories. Useful for consolidating different representations of the same category.

standardize_categories(data, col, mapping)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| col | Character string specifying column name to standardize |
| mapping | Named list or vector mapping old values to new values |

A data.frame with standardized categories

data <- data.frame(gender = c("M", "Male", "F", "Female", "m"))
mapping <- list("M" = "Male", "Male" = "Male", "F" = "Female", "Female" = "Female", "m" = "Male")
clean_data <- standardize_categories(data, "gender", mapping)
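
Recoding with a named mapping amounts to a lookup by name. In the sketch below, `recode_values` is a hypothetical helper, and leaving unmapped values unchanged is an assumption; the documentation above does not say how standardize_categories treats them.

```r
# Hypothetical helper (not part of SurveyStat): replace each value that
# appears as a name in `mapping` with the mapped value.
recode_values <- function(x, mapping) {
  lookup <- unlist(mapping)
  key <- as.character(x)
  hit <- key %in% names(lookup)
  # Assumption: values absent from the mapping are left as they are.
  key[hit] <- unname(lookup[key[hit]])
  key
}

gender <- c("M", "Male", "F", "Female", "m", "Other")
mapping <- list(M = "Male", Male = "Male", F = "Female",
                Female = "Female", m = "Male")
recoded <- recode_values(gender, mapping)
recoded  # "Male" "Male" "Female" "Female" "Male" "Other"
```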

This function calculates the weighted mean of a numeric variable. It uses the standard weighted mean formula: sum(x * w) / sum(w).

weighted_mean(data, target_col, weight_col)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| target_col | Character string specifying column name for target variable |
| weight_col | Character string specifying column name containing weights |

Numeric weighted mean

data <- data.frame(income = c(50000, 75000, 100000), weight = c(1.2, 0.8, 1.0))
weighted_income <- weighted_mean(data, "income", "weight")
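
The formula can be checked by hand against base R's `weighted.mean()`. The sketch uses the same numbers as the example above and does not depend on the package.

```r
income <- c(50000, 75000, 100000)
weight <- c(1.2, 0.8, 1.0)

# sum(x * w) / sum(w) = (60000 + 60000 + 100000) / 3
manual <- sum(income * weight) / sum(weight)
builtin <- weighted.mean(income, weight)
all.equal(manual, builtin)  # TRUE
```

Both give about 73333.33, below the unweighted mean of 75000 because the lowest income carries the largest weight.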

This function calculates the weighted total of a numeric variable. Useful for estimating population totals from survey data.

weighted_total(data, target_col, weight_col)

| Argument | Description |
|---|---|
| data | A data.frame containing survey data |
| target_col | Character string specifying column name for target variable |
| weight_col | Character string specifying column name containing weights |

Numeric weighted total

data <- data.frame(income = c(50000, 75000, 100000), weight = c(1000, 800, 1200))
total_income <- weighted_total(data, "income", "weight")
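
A weighted total is the sum of each value multiplied by its weight, which estimates a population total when each weight is the number of population units a respondent represents. The sketch below assumes the example's weights are expansion weights of that kind.

```r
income <- c(50000, 75000, 100000)
weight <- c(1000, 800, 1200)  # assumed: population units per respondent

# 50000 * 1000 + 75000 * 800 + 100000 * 1200
total_income <- sum(income * weight)
total_income  # 2.3e+08

# Weights rescaled to sum to the sample size give a total on the
# sample scale, not the population scale.
scaled_weight <- weight * length(weight) / sum(weight)
sample_scale_total <- sum(income * scaled_weight)
sample_scale_total  # 230000
```

The second figure shows that weights normalized to the sample size are unsuitable for population totals, even though they yield the same weighted mean.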