Title: | Functions to Compute the Biweight Mean Vector and Covariance and Correlation Matrices |
---|---|
Description: | The base functions compute multivariate location, scale, and correlation estimates based on Tukey's biweight M-estimator. Using the base function, the computations can be applied to a large number of observations to create either a matrix of biweight distances or biweight correlations. |
Authors: | Johanna Hardin [aut, cre] |
Maintainer: | Johanna Hardin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2024-10-25 05:49:44 UTC |
Source: | https://github.com/hardin47/biwt |
A function to compute the biweight mean vector and covariance matrix
biwt_cor(x, r, median = TRUE, full_init = TRUE)
biwt_cor(x, r, median = TRUE, full_init = TRUE)
x |
an |
r |
breakdown ( |
median |
a logical command to determine whether the initialization is done using the coordinate-wise median and MAD (TRUE) or using the minimum covariance determinant (MCD) (FALSE). Using the MCD is substantially slower. |
full_init |
a logical command to determine whether the initialization is done for each pair separately (FALSE) or only one time at the beginning using the entire data matrix (TRUE). Initializing for each pair separately is substantially slower. |
Using biwt_est
to estimate the robust covariance matrix, a robust measure of correlation is computed using Tukey's biweight M-estimator. The biweight correlation is essentially a weighted correlation where the weights are calculated based on the distance of each measurement to the data center with respect to the shape of the data. The correlations are computed pair-by-pair because the weights should depend only on the pairwise relationship at hand and not the relationship between all the observations globally. The biwt functions compute many pairwise correlations and create distance matrices for use in other algorithms (e.g., clustering).
In order for the biweight estimates to converge, a reasonable initialization must be given. Typically, using TRUE for the median and full_init arguments will provide acceptable initializations. With particularly irregular data, the MCD should be used to give the initial estimate of center and shape. With data sets in which the observations are orders of magnitudes different, full_init=FALSE should be specified.
Returns a list with components:
biwt_corr |
a vector consisting of the lower triangle of the correlation matrix stored by columns in a vector, say bwcor. If |
biwt_NAid |
a vector which is indexed in the same way as |
# note that biwt_cor() takes data that is nxg where the # goal is to find correlations between each of the g items samp_data <- MASS::mvrnorm(30,mu=c(0,0,0),Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3)) r <- 0.2 # breakdown # To compute the 3 pairwise correlations from the sample data: samp_bw_cor <- biwt_cor(samp_data,r) samp_bw_cor
# note that biwt_cor() takes data that is nxg where the # goal is to find correlations between each of the g items samp_data <- MASS::mvrnorm(30,mu=c(0,0,0),Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3)) r <- 0.2 # breakdown # To compute the 3 pairwise correlations from the sample data: samp_bw_cor <- biwt_cor(samp_data,r) samp_bw_cor
Compute a multivariate location and scale estimate based on Tukey's biweight weight function.
biwt_cor_matrix(x, r, median = TRUE, full_init = TRUE)
biwt_cor_matrix(x, r, median = TRUE, full_init = TRUE)
x |
an |
r |
breakdown ( |
median |
a logical command to determine whether the initialization is done using the coordinate-wise median and MAD (TRUE) or using the minimum covariance determinant (MCD) (FALSE). Using the MCD is substantially slower. |
full_init |
a logical command to determine whether the initialization is done for each pair separately (FALSE) or only one time at the beginning using the entire data matrix (TRUE). Initializing for each pair separately is substantially slower. |
Using biwt_est
to estimate the robust covariance matrix, a robust measure of correlation is computed using Tukey's biweight M-estimator. The biweight correlation is essentially a weighted correlation where the weights are calculated based on the distance of each measurement to the data center with respect to the shape of the data. The correlations are computed pair-by-pair because the weights should depend only on the pairwise relationship at hand and not the relationship between all the observations globally. The biwt functions compute many pairwise correlations and create distance matrices for use in other algorithms (e.g., clustering).
In order for the biweight estimates to converge, a reasonable initialization must be given. Typically, using TRUE for the median and full_init arguments will provide acceptable initializations. With particularly irregular data, the MCD should be used to give the initial estimate of center and shape. With data sets in which the observations are orders of magnitudes different, full_init=FALSE should be specified.
Returns a list with components:
biwt_corr_matrix |
a matrix of the biweight correlations. |
biwt_NAid_matrix |
a matrix representing whether the biweight correlation was possible to compute (will be NA if too much data is missing or if the initializations are not accurate). 0 if computed accurately, 1 if NA. |
Hardin, J., Mitani, A., Hicks, L., VanKoten, B.; A Robust Measure of Correlation Between Two Genes on a Microarray, BMC Bioinformatics, 8:220; 2007.
# note that biwt_cor_matrix() takes data that is nxg where the # goal is to find correlations between each of the g items samp_data <- MASS::mvrnorm(30,mu=c(0,0,0),Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3)) r <- 0.2 # breakdown # To compute the 3 pairwise correlations in matrix form: samp_bw_cor_mat <- biwt_cor_matrix(samp_data,r) samp_bw_cor_mat
# note that biwt_cor_matrix() takes data that is nxg where the # goal is to find correlations between each of the g items samp_data <- MASS::mvrnorm(30,mu=c(0,0,0),Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3)) r <- 0.2 # breakdown # To compute the 3 pairwise correlations in matrix form: samp_bw_cor_mat <- biwt_cor_matrix(samp_data,r) samp_bw_cor_mat
Compute a multivariate location and scale estimate based on Tukey's biweight weight function.
biwt_dist_matrix(x, r, median = TRUE, full_init = TRUE, absval = TRUE)
biwt_dist_matrix(x, r, median = TRUE, full_init = TRUE, absval = TRUE)
x |
an |
r |
breakdown ( |
median |
a logical command to determine whether the initialization is done using the coordinate-wise median and MAD (TRUE) or using the minimum covariance determinant (MCD) (FALSE). Using the MCD is substantially slower. |
full_init |
a logical command to determine whether the initialization is done for each pair separately (FALSE) or only one time at the beginning using the entire data matrix (TRUE). Initializing for each pair separately is substantially slower. |
absval |
a logical command to determine whether the distance should be measured as 1 minus the absolute value of the correlation (TRUE) or simply 1 minus the correlation (FALSE) |
Using biwt_est
to estimate the robust covariance matrix, a robust measure of correlation is computed using Tukey's biweight M-estimator. The biweight correlation is essentially a weighted correlation where the weights are calculated based on the distance of each measurement to the data center with respect to the shape of the data. The correlations are computed pair-by-pair because the weights should depend only on the pairwise relationship at hand and not the relationship between all the observations globally. The biwt functions compute many pairwise correlations and create distance matrices for use in other algorithms (e.g., clustering).
In order for the biweight estimates to converge, a reasonable initialization must be given. Typically, using TRUE for the median and full_init arguments will provide acceptable initializations. With particularly irregular data, the MCD should be used to give the initial estimate of center and shape. With data sets in which the observations are orders of magnitudes different, full_init=FALSE should be specified.
Returns a list with components:
biwt_dist_matrix |
a matrix of the biweight distances (default is 1 minus absolute value of the biweight correlation). |
biwt_NAid_matrix |
a matrix representing whether the biweight correlation was possible to compute (will be NA if too much data is missing or if the initializations are not accurate). 0 if computed accurately, 1 if NA. |
Hardin, J., Mitani, A., Hicks, L., VanKoten, B.; A Robust Measure of Correlation Between Two Genes on a Microarray, BMC Bioinformatics, 8:220; 2007.
# note that biwt_dist_matrix() takes data that is nxg where the # goal is to find distances between each of the g items samp_data <- MASS::mvrnorm(30,mu=c(0,0,0),Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3)) r <- 0.2 # breakdown # To compute the 3 pairwise distances in matrix form: samp_bw_dist_mat <- biwt_dist_matrix(samp_data, r) samp_bw_dist_mat # To convert the distances into an element of class 'dist' as.dist(samp_bw_dist_mat$biwt_dist_mat)
# note that biwt_dist_matrix() takes data that is nxg where the # goal is to find distances between each of the g items samp_data <- MASS::mvrnorm(30,mu=c(0,0,0),Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3)) r <- 0.2 # breakdown # To compute the 3 pairwise distances in matrix form: samp_bw_dist_mat <- biwt_dist_matrix(samp_data, r) samp_bw_dist_mat # To convert the distances into an element of class 'dist' as.dist(samp_bw_dist_mat$biwt_dist_mat)
Compute a multivariate location and scale estimate based on Tukey's biweight weight function.
biwt_est(x, r, med.init)
biwt_est(x, r, med.init)
x |
an |
r |
breakdown ( |
med.init |
a (robust) initial estimate of the center and shape of the data. form is a list with components center and cov. |
A robust measure of center and shape is computed using Tukey's biweight M-estimator. The biweight estimates are essentially weighted means and covariances where the weights are calculated based on the distance of each measurement to the data center with respect to the shape of the data. The estimates should be computed pair-by-pair because the weights should depend only on the pairwise relationship at hand and not the relationship between all the observations globally.
Returns a list with components:
biwt_mu |
the final estimate of location |
biwt_sig |
the final estimate of scatter |
biwt_NAid |
a logical of whether the given initialization was used (coded as 0) or whether a more precise initialization was used (namely, the pair by pair MCD, coded as 1). |
Hardin, J., Mitani, A., Hicks, L., VanKoten, B.; A Robust Measure of Correlation Between Two Genes on a Microarray, BMC Bioinformatics, 8:220; 2007.
# note that biwt_est() takes data that is nx2 where the # goal is to find the correlation between the 2 items samp_data <- MASS::mvrnorm(30,mu=c(0,0),Sigma=matrix(c(1,.75,.75,1),ncol=2)) r <- 0.2 # breakdown samp_mcd <- robustbase::covMcd(samp_data) samp_bw <- biwt_est(samp_data, r, samp_mcd) samp_bw_var1 <- samp_bw$biwt_sig[1,1] samp_bw_var2 <- samp_bw$biwt_sig[2,2] samp_bw_cov <- samp_bw$biwt_sig[1,2] samp_bw_corr <- samp_bw_cov / sqrt(samp_bw_var1 * samp_bw_var2) # or: samp_bw_corr <- samp_bw$biwt_sig[1,2] / sqrt(samp_bw$biwt_sig[1,1]*samp_bw$biwt_sig[2,2]) samp_bw_corr ############## # to speed up the calculations, use the median/mad for the initialization: ############## samp_init <- list() samp_init$cov <- diag(apply(samp_data, 2, stats::mad, na.rm=TRUE)) samp_init$center <- apply(samp_data, 2, median, na.rm=TRUE) samp_bw <- biwt_est(samp_data, r, samp_init) samp_bw_corr <- samp_bw$biwt.sig[1,2] / sqrt(samp_bw$biwt.sig[1,1]*samp_bw$biwt.sig[2,2])
# note that biwt_est() takes data that is nx2 where the # goal is to find the correlation between the 2 items samp_data <- MASS::mvrnorm(30,mu=c(0,0),Sigma=matrix(c(1,.75,.75,1),ncol=2)) r <- 0.2 # breakdown samp_mcd <- robustbase::covMcd(samp_data) samp_bw <- biwt_est(samp_data, r, samp_mcd) samp_bw_var1 <- samp_bw$biwt_sig[1,1] samp_bw_var2 <- samp_bw$biwt_sig[2,2] samp_bw_cov <- samp_bw$biwt_sig[1,2] samp_bw_corr <- samp_bw_cov / sqrt(samp_bw_var1 * samp_bw_var2) # or: samp_bw_corr <- samp_bw$biwt_sig[1,2] / sqrt(samp_bw$biwt_sig[1,1]*samp_bw$biwt_sig[2,2]) samp_bw_corr ############## # to speed up the calculations, use the median/mad for the initialization: ############## samp_init <- list() samp_init$cov <- diag(apply(samp_data, 2, stats::mad, na.rm=TRUE)) samp_init$center <- apply(samp_data, 2, median, na.rm=TRUE) samp_bw <- biwt_est(samp_data, r, samp_init) samp_bw_corr <- samp_bw$biwt.sig[1,2] / sqrt(samp_bw$biwt.sig[1,1]*samp_bw$biwt.sig[2,2])
A function to compute the biweight mean vector and covariance matrix
biwt.cor( x, r = 0.2, output = "matrix", median = TRUE, full.init = TRUE, absval = TRUE )
biwt.cor( x, r = 0.2, output = "matrix", median = TRUE, full.init = TRUE, absval = TRUE )
x |
a |
r |
breakdown ( |
output |
a character string specifying the output format. Options are "matrix" (default), "vector", or "distance". See value below. |
median |
a logical command to determine whether the initialization is done using the coordinate-wise median and MAD^2 (TRUE, default) or using the minimum covariance determinant (MCD) (FALSE). Using the MCD is substantially slower. The MAD is the median of the absolute deviations from the median. See R help file on |
full.init |
a logical command to determine whether the initialization is done for each pair separately (FALSE) or only one time at the beginning using the entire data matrix (TRUE, default). Initializing for each pair separately is substantially slower. |
absval |
a logical command to determine whether the distance should be measured as 1 minus the absolute value of the correlation (TRUE, default) or as 1 minus the correlation (FALSE). |
Specifying "vector" for the output argument returns a vector consisting of the lower triangle of the correlation matrix stored by columns in a vector, say . If
is the number of observations and
is the correlation vector, then for
, the biweight correlation between (rows)
and
is
. The length of the vector is
, i.e., of order
.
Specifying "matrix" for the output argument returns a matrix of the biweight correlations.
Specifying "distance" for the output argument returns a matrix of the biweight distances (default is 1 minus absolute value of the biweight correlation).
If there is too much missing data or if the initialization is not accurate, the function will compute the MCD for a given pair of observations before computing the biweight correlation (regardless of the initial settings given in the call to the function).
The "vector" output option is given so that correlations can be stored as vectors which are less computationally intensive than matrices.
Returns a list with components:
corr |
a vector consisting of the lower triangle of the correlation matrix stored by columns in a vector, say bwcor. If |
corr.mat |
a matrix consisting of the lower triangle of the correlation matrix stored by columns in a vector, say bwcor. If |
dist.mat |
a matrix consisting of the correlations converted to distances (either 1 - correlation or 1 - abs(correlation)). |
# note that biwt.cor() takes data that is gxn where the # goal is to find correlations or distances between each of the g items samp.data <- t(MASS::mvrnorm(30,mu=c(0,0,0), Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3))) # To compute the 3 pairwise correlations from the sample data: samp.bw.cor <- biwt.cor(samp.data, output="vector") samp.bw.cor # To compute the 3 pairwise correlations in matrix form: samp.bw.cor.mat <- biwt.cor(samp.data) samp.bw.cor.mat # To compute the 3 pairwise distances in matrix form: samp.bw.dist.mat <- biwt.cor(samp.data, output="distance") samp.bw.dist.mat # To convert the distances into an object of class `dist' as.dist(samp.bw.dist.mat)
# note that biwt.cor() takes data that is gxn where the # goal is to find correlations or distances between each of the g items samp.data <- t(MASS::mvrnorm(30,mu=c(0,0,0), Sigma=matrix(c(1,.75,-.75,.75,1,-.75,-.75,-.75,1),ncol=3))) # To compute the 3 pairwise correlations from the sample data: samp.bw.cor <- biwt.cor(samp.data, output="vector") samp.bw.cor # To compute the 3 pairwise correlations in matrix form: samp.bw.cor.mat <- biwt.cor(samp.data) samp.bw.cor.mat # To compute the 3 pairwise distances in matrix form: samp.bw.dist.mat <- biwt.cor(samp.data, output="distance") samp.bw.dist.mat # To convert the distances into an object of class `dist' as.dist(samp.bw.dist.mat)
Compute a multivariate location and scale estimate based on Tukey's biweight weight function.
biwt.est(x, r = 0.2, med.init = robustbase::covMcd(x))
biwt.est(x, r = 0.2, med.init = robustbase::covMcd(x))
x |
a |
r |
breakdown ( |
med.init |
a (robust) initial estimate of the center and shape of the data. The format is a list with components center and cov (as in the output of |
A list with components
biwt.mu |
the final estimate of center |
biwt.sig |
the final estimate of shape |
Hardin, J., Mitani, A., Hicks, L., VanKoten, B.; A Robust Measure of Correlation Between Two Genes on a Microarray, BMC Bioinformatics, 8:220; 2007.
# note that biwt.est() takes data that is 2xn where the # goal is to find the correlation between the 2 items samp.data <- t(MASS::mvrnorm(30,mu=c(0,0), Sigma=matrix(c(1,.75,.75,1),ncol=2))) samp.bw <- biwt.est(samp.data) samp.bw samp.bw.var1 <- samp.bw$biwt.sig[1,1] samp.bw.var2 <- samp.bw$biwt.sig[2,2] samp.bw.cov <- samp.bw$biwt.sig[1,2] samp.bw.cor <- samp.bw.cov / sqrt(samp.bw.var1 * samp.bw.var2) samp.bw.cor # or: samp.bw.cor <- samp.bw$biwt.sig[1,2] / sqrt(samp.bw$biwt.sig[1,1]*samp.bw$biwt.sig[2,2]) samp.bw.cor ############## # to speed up the calculations, use the median/mad for the initialization: ############## samp.init <- list() samp.init$cov <- diag(apply(samp.data, 1, stats::mad, na.rm=TRUE)) samp.init$center <- apply(samp.data, 1, median, na.rm=TRUE) samp.init samp.bw <- biwt.est(samp.data, med.init = samp.init) samp.bw.cor <- samp.bw$biwt.sig[1,2] / sqrt(samp.bw$biwt.sig[1,1]*samp.bw$biwt.sig[2,2]) samp.bw.cor
# note that biwt.est() takes data that is 2xn where the # goal is to find the correlation between the 2 items samp.data <- t(MASS::mvrnorm(30,mu=c(0,0), Sigma=matrix(c(1,.75,.75,1),ncol=2))) samp.bw <- biwt.est(samp.data) samp.bw samp.bw.var1 <- samp.bw$biwt.sig[1,1] samp.bw.var2 <- samp.bw$biwt.sig[2,2] samp.bw.cov <- samp.bw$biwt.sig[1,2] samp.bw.cor <- samp.bw.cov / sqrt(samp.bw.var1 * samp.bw.var2) samp.bw.cor # or: samp.bw.cor <- samp.bw$biwt.sig[1,2] / sqrt(samp.bw$biwt.sig[1,1]*samp.bw$biwt.sig[2,2]) samp.bw.cor ############## # to speed up the calculations, use the median/mad for the initialization: ############## samp.init <- list() samp.init$cov <- diag(apply(samp.data, 1, stats::mad, na.rm=TRUE)) samp.init$center <- apply(samp.data, 1, median, na.rm=TRUE) samp.init samp.bw <- biwt.est(samp.data, med.init = samp.init) samp.bw.cor <- samp.bw$biwt.sig[1,2] / sqrt(samp.bw$biwt.sig[1,1]*samp.bw$biwt.sig[2,2]) samp.bw.cor
Internal function
chi.int(p, a, c1)
chi.int(p, a, c1)
p |
a number |
a |
a number |
c1 |
a number |
a number
chi.int(2,3,4)
chi.int(2,3,4)
Title
chi.int.p(p, a, c1)
chi.int.p(p, a, c1)
p |
a number |
a |
a number |
c1 |
a number |
a number
chi.int.p(1,2,3)
chi.int.p(1,2,3)
Internal function
chi.int2(p, a, c1)
chi.int2(p, a, c1)
p |
a number |
a |
a number |
c1 |
a number |
a number
chi.int(2,3,4)
chi.int(2,3,4)
Internal function
chi.int2.p(p, a, c1)
chi.int2.p(p, a, c1)
p |
a number |
a |
a number |
c1 |
a number |
a number
chi.int2.p(2,3,4)
chi.int2.p(2,3,4)
Internal function
erho.bw(p, c1)
erho.bw(p, c1)
p |
a number |
c1 |
a cutoff |
a number, The expected value of rho
erho.bw(2,3)
erho.bw(2,3)
Internal function
erho.bw.p(p, c1)
erho.bw.p(p, c1)
p |
a number |
c1 |
a cutoff |
a number, The derivative of the expected value of rho
erho.bw.p(2,3)
erho.bw.p(2,3)
Internal function
ksolve(d, p, c1, b0)
ksolve(d, p, c1, b0)
d |
a vector |
p |
a number |
c1 |
a number |
b0 |
a number |
a number
ksolve(rnorm(20,.1,2),1,3,1)
ksolve(rnorm(20,.1,2),1,3,1)
Internal function
psibw(x, c1)
psibw(x, c1)
x |
a vector |
c1 |
a cutoff |
a vector
psibw(rnorm(10),3)
psibw(rnorm(10),3)
Internal function
rejpt.bw(p, r)
rejpt.bw(p, r)
p |
a number |
r |
the breakdown |
the asymptotic rejection point
rejpt.bw(2,3)
rejpt.bw(2,3)
Internal function
rhobw(x, c1)
rhobw(x, c1)
x |
a number |
c1 |
a cutoff |
a number
rhobw(2,3)
rhobw(2,3)
Internal function
vbw(x, c1)
vbw(x, c1)
x |
a vector |
c1 |
a cutoff |
a vector
vbw(rnorm(10),3)
vbw(rnorm(10),3)
Internal function
vect2diss(v)
vect2diss(v)
v |
a vector |
a vector
vect2diss(rnorm(10))
vect2diss(rnorm(10))
Internal function
vect2diss_hop(v)
vect2diss_hop(v)
v |
a vector |
a vector
vect2diss_hop(rnorm(10))
vect2diss_hop(rnorm(10))
Internal function
wtbw(x, c1)
wtbw(x, c1)
x |
a vector |
c1 |
a number |
a vector
wtbw(rnorm(10),3)
wtbw(rnorm(10),3)