Title: | Meta Clustering with Similarity Network Fusion |
---|---|
Description: | Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in <doi:10.1038/nmeth.2810>. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in <doi:10.1109/ICDM.2006.103>. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning. |
Authors: | Prashanth S Velayudhan [aut, cre], Xiaoqiao Xu [aut], Prajkta Kallurkar [aut], Ana Patricia Balbon [aut], Maria T Secara [aut], Adam Taback [aut], Denise Sabac [aut], Nicholas Chan [aut], Shihao Ma [aut], Bo Wang [aut], Daniel Felsky [aut], Stephanie H Ameis [aut], Brian Cox [aut], Colin Hawco [aut], Lauren Erdman [aut], Anne L Wheeler [aut, ths] |
Maintainer: | Prashanth S Velayudhan <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.0.5 |
Built: | 2025-03-11 21:25:54 UTC |
Source: | https://github.com/branchlab/metasnf |
A randomly shuffled and anonymized copy of anxiety data from the NIMH Data Archive. The original file used was pdem02.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cbcl_anxiety.
abcd_anxiety
abcd_anxiety
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of depression data from the NIMH Data Archive. The original file used was pdem02.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cbcl_depress.
The data was transformed into categorical colour values to demonstrate the Chi-squared test capabilities of extend_solutions.
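The general idea of testing a categorical feature against cluster membership can be sketched in base R. This is only an illustration with made-up data: the `cluster` and `colour` vectors below are hypothetical, and the sketch does not show how extend_solutions performs the test internally.

```r
set.seed(1)

# Made-up cluster assignments and colour categories for 275 observations.
cluster <- sample(c(1, 2), size = 275, replace = TRUE)
colour <- sample(c("red", "green", "blue"), size = 275, replace = TRUE)

# Contingency table of cluster by category, followed by Pearson's
# Chi-squared test of independence between the two.
counts <- table(cluster, colour)
result <- stats::chisq.test(counts)
p_value <- result$p.value
```

A small p-value would suggest that the categorical feature is not distributed independently of cluster membership. With randomly generated labels like these, no real association is expected.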
abcd_colour
abcd_colour
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Categorical transformation of cbcl_depress.
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of cortical surface area data from the NIMH Data Archive. The original file used was mrisdp10201.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cort_t.
abcd_cort_sa
abcd_cort_sa
A data frame with 188 rows and 152 columns:
The unique identifier of the ABCD dataset
Cortical surface areas of various ROIs (mm^2, I think)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of cortical thickness data from the NIMH Data Archive. The original file used was mrisdp10201.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cort_t.
abcd_cort_t
abcd_cort_t
A data frame with 188 rows and 152 columns:
The unique identifier of the ABCD dataset
Cortical thicknesses of various ROIs (mm, I think)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of depression data from the NIMH Data Archive. The original file used was pdem02.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_cbcl_depress.
abcd_depress
abcd_depress
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Ordinal value of impairment on CBCL depression, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Like abcd_income, but with no NAs in patient column
abcd_h_income
abcd_h_income
A data frame with 300 rows and 2 columns:
The unique identifier of the ABCD dataset
Household income in 3 category levels (low = 1, medium = 2, high = 3)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of income data from the NIMH Data Archive. The original file used was pdem02.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_income.
abcd_income
abcd_income
A data frame with 300 rows and 2 columns:
The unique identifier of the ABCD dataset
Household income in 3 category levels (low = 1, medium = 2, high = 3)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of pubertal status data from the NIMH Data Archive. The original files used were abcd_ssphp01.txt and abcd_ssphy01.txt. The files were pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_pubertal_status.
abcd_pubertal
abcd_pubertal
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Average reported pubertal status between child and parent (1-5 categorical scale)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
A randomly shuffled and anonymized copy of subcortical volume data from the NIMH Data Archive. The original file used was smrip10201.txt. The file was pre-processed by the abcdutils package (https://github.com/BRANCHlab/abcdutils) function get_subc_v.
abcd_subc_v
abcd_subc_v
A data frame with 174 rows and 31 columns:
The unique identifier of the ABCD dataset
Subcortical volumes of various ROIs (mm^3, I think)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Add rows to a settings_df
add_settings_df_rows(
    sdf,
    n_solutions = 0,
    min_removed_inputs = 0,
    max_removed_inputs = sum(startsWith(colnames(sdf), "inc_")) - 1,
    dropout_dist = "exponential",
    min_alpha = NULL,
    max_alpha = NULL,
    min_k = NULL,
    max_k = NULL,
    min_t = NULL,
    max_t = NULL,
    alpha_values = NULL,
    k_values = NULL,
    t_values = NULL,
    possible_snf_schemes = c(1, 2, 3),
    clustering_algorithms = NULL,
    continuous_distances = NULL,
    discrete_distances = NULL,
    ordinal_distances = NULL,
    categorical_distances = NULL,
    mixed_distances = NULL,
    dfl = NULL,
    snf_input_weights = NULL,
    snf_domain_weights = NULL,
    retry_limit = 10,
    allow_duplicates = FALSE
)
sdf |
The existing settings data frame |
n_solutions |
Number of rows to generate for the settings data frame. |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. By default, 0. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. By default, this is 1 less than the number of input data frames provided in the data list. |
dropout_dist |
Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default). |
min_alpha |
The minimum value that the alpha hyperparameter can have. The randomly assigned value of alpha for each row will be obtained by uniformly sampling numbers between |
max_alpha |
The maximum value that the alpha hyperparameter can have.
See |
min_k |
The minimum value that the k hyperparameter can have. The randomly assigned value of k for each row will be obtained by uniformly sampling numbers between |
max_k |
The maximum value that the k hyperparameter can have.
See |
min_t |
The minimum value that the t hyperparameter can have. The randomly assigned value of t for each row will be obtained by uniformly sampling numbers between |
max_t |
The maximum value that the t hyperparameter can have.
See |
alpha_values |
A number or numeric vector of a set of possible values
that alpha can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
k_values |
A number or numeric vector of a set of possible values
that k can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
t_values |
A number or numeric vector of a set of possible values
that t can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
possible_snf_schemes |
A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "twostep" scheme. |
clustering_algorithms |
A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms. |
continuous_distances |
A vector of continuous distance metrics to use when a custom dist_fns_list is provided. |
discrete_distances |
A vector of discrete distance metrics to use when a custom dist_fns_list is provided. |
ordinal_distances |
A vector of ordinal distance metrics to use when a custom dist_fns_list is provided. |
categorical_distances |
A vector of categorical distance metrics to use when a custom dist_fns_list is provided. |
mixed_distances |
A vector of mixed distance metrics to use when a custom dist_fns_list is provided. |
dfl |
List containing distance metrics to vary over. See ?generate_dist_fns_list. |
snf_input_weights |
Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights) |
snf_domain_weights |
Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights) |
retry_limit |
The maximum number of attempts to generate a novel row.
This function does not return matrices with identical rows. As the range of
requested possible settings tightens and the number of requested rows
increases, the risk of randomly generating a row that already exists
increases. If a new random row has matched an existing row |
allow_duplicates |
If TRUE, enables creation of a settings data frame with duplicate non-feature weighting related hyperparameters. This option should only be used when paired with a custom weights matrix that has non-duplicate rows. |
A settings data frame
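The sampling behaviour described for the alpha, k, and t arguments, together with the retry logic for avoiding duplicate rows, can be sketched in base R. This is an illustration only and not the package's implementation: the helpers `sample_hyperparameter` and `draw_unique_rows` are hypothetical, the ranges and candidate values are arbitrary, and treating k and t as whole numbers is an assumption.

```r
# Hypothetical helper: draw one value either from an explicit vector of
# candidate values or uniformly from a [min, max] range, but never both.
sample_hyperparameter <- function(min = NULL, max = NULL, values = NULL,
                                  whole_number = FALSE) {
    if (!is.null(values) && (!is.null(min) || !is.null(max))) {
        stop("Provide either a range or a vector of values, not both.")
    }
    if (!is.null(values)) {
        # Index-based sampling avoids sample()'s special handling of a
        # length-one numeric vector.
        return(values[sample.int(length(values), 1)])
    }
    if (whole_number) {
        return(min + sample.int(max - min + 1, 1) - 1)
    }
    stats::runif(1, min = min, max = max)
}

# Hypothetical helper: build n rows of settings, redrawing whenever a new
# row duplicates an existing one, up to retry_limit attempts per row.
draw_unique_rows <- function(n, retry_limit = 10) {
    rows <- data.frame(alpha = numeric(0), k = numeric(0), t = numeric(0))
    for (i in seq_len(n)) {
        attempts <- 0
        repeat {
            new_row <- data.frame(
                alpha = sample_hyperparameter(values = c(0.3, 0.5, 0.8)),
                k = sample_hyperparameter(min = 10, max = 30, whole_number = TRUE),
                t = sample_hyperparameter(min = 10, max = 20, whole_number = TRUE)
            )
            existing_keys <- paste(rows$alpha, rows$k, rows$t)
            new_key <- paste(new_row$alpha, new_row$k, new_row$t)
            if (!(new_key %in% existing_keys)) break
            attempts <- attempts + 1
            if (attempts >= retry_limit) {
                stop("Could not generate a novel row within retry_limit.")
            }
        }
        rows <- rbind(rows, new_row)
    }
    rows
}

set.seed(42)
settings <- draw_unique_rows(5)
```

As the allowed ranges narrow or the number of requested rows grows, the duplicate check fails more often, which is why a retry limit is useful.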
Defunct function to create an ARI heatmap. Please use meta_cluster_heatmap() instead.
adjusted_rand_index_heatmap(
    aris,
    order = NULL,
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    log_graph = FALSE,
    scale_diag = "none",
    min_colour = "#282828",
    max_colour = "firebrick2",
    col = circlize::colorRamp2(c(min(aris), max(aris)), c(min_colour, max_colour)),
    ...
)
aris |
Matrix of adjusted rand indices from |
order |
Numeric vector containing row order of the heatmap. |
cluster_rows |
Whether rows should be clustered. |
cluster_columns |
Whether columns should be clustered. |
log_graph |
If TRUE, log transforms the graph. |
scale_diag |
Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
col |
Colour ramp to use for the heatmap. |
... |
Additional parameters passed to |
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise adjusted Rand indices (similarities) between the cluster solutions of the provided solutions data frame.
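The three scale_diag options described above can be illustrated with a small base R sketch. The helper `rescale_diagonal` is hypothetical and written only to mirror the documented behaviour. It is not the package's own code, and the toy matrix stands in for a real matrix of adjusted Rand indices.

```r
# Hypothetical helper mirroring the three documented scale_diag options.
rescale_diagonal <- function(m, scale_diag = c("none", "mean", "zero")) {
    scale_diag <- match.arg(scale_diag)
    if (scale_diag == "mean") {
        # Replace the diagonal with the average of the off-diagonal values.
        off_diagonal <- m[row(m) != col(m)]
        diag(m) <- mean(off_diagonal)
    } else if (scale_diag == "zero") {
        diag(m) <- 0
    }
    m
}

# Toy symmetric matrix standing in for pairwise adjusted Rand indices.
aris <- matrix(
    c(1.0, 0.2, 0.4,
      0.2, 1.0, 0.6,
      0.4, 0.6, 1.0),
    nrow = 3, byrow = TRUE
)

aris_none <- rescale_diagonal(aris, "none")
aris_mean <- rescale_diagonal(aris, "mean")
aris_zero <- rescale_diagonal(aris, "zero")
```

Rescaling the diagonal can help when the self-similarity values of 1 would otherwise dominate the colour range of the heatmap.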
Mock age data
age_df
age_df
A data frame with 200 rows and 2 columns:
Random three-digit number uniquely identifying the patient
Mock age feature
This data came from the SNFtool package, with slight modifications.
This alluvial plot shows how observations in a similarity matrix could have been clustered over a set of clustering functions.
alluvial_cluster_plot(
    cluster_sequence,
    similarity_matrix,
    dl = NULL,
    data = NULL,
    key_outcome,
    key_label = key_outcome,
    extra_outcomes = NULL,
    title = NULL
)
cluster_sequence |
A list of clustering algorithms. |
similarity_matrix |
A similarity matrix. |
dl |
A data list. |
data |
A data frame that contains any features to include in the plot. |
key_outcome |
The name of the feature that determines how each patient stream is coloured in the alluvial plot. |
key_label |
Name of key outcome to be used for the plot legend. |
extra_outcomes |
Names of additional features to add to the plot. |
title |
Title of the plot. |
An alluvial plot (class "gg" and "ggplot") showing the distribution of a feature across cluster solutions with varying numbers of clusters.
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)
sc <- snf_config(input_dl, n_solutions = 1)
sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
sim_mats <- sim_mats_list(sol_df)
clust_fn_sequence <- list(spectral_two, spectral_four)
alluvial_cluster_plot(
    cluster_sequence = clust_fn_sequence,
    similarity_matrix = sim_mats[[1]],
    dl = input_dl,
    key_outcome = "gender", # the name of the feature of interest
    key_label = "Gender", # how the feature of interest should be displayed
    extra_outcomes = "diagnosis", # more features to plot but not colour by
    title = "Gender Across Cluster Counts"
)
Like the mock data frame "abcd_anxiety", but with "unique_id" as the "uid".
anxiety
anxiety
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Ordinal value of impairment on CBCL anxiety, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Arrange rows in an object
arrange(.data, ...)
.data |
The object whose rows are to be arranged. |
... |
Additional arguments for arranging. |
Object with arranged rows.
This function coerces non-ari_matrix class objects into ari_matrix class objects.
as_ari_matrix(x)
x |
The object to convert into an ARI matrix. |
An ari_matrix class object.
This function coerces non-data_list class objects into data_list class objects.
as_data_list(x)
x |
The object to convert into a data list. |
A data_list class object.
This function coerces non-settings_df class objects into settings_df class objects.
as_settings_df(x)
x |
The object to convert into a settings data frame. |
A settings_df class object.
This function converts non-sim_mats_list class objects into sim_mats_list class objects.
as_sim_mats_list(x)
x |
The object to convert into a |
A sim_mats_list class object.
This function coerces non-snf_config class objects into snf_config class objects.
as_snf_config(x)
x |
The object to convert into an SNF config. |
An snf_config class object.
This function converts non-weights_matrix class objects into weights_matrix class objects.
as_weights_matrix(x)
x |
The object to convert into a weights matrix. |
A weights_matrix class object.
Coerce a data_list class object into a data.frame class object
Horizontally joins data frames within a data list into a single data frame, using the uid attribute as the joining key.
## S3 method for class 'data_list'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x |
A |
row.names |
Additional parameter passed to |
optional |
Additional parameter passed to |
... |
Additional parameter passed to |
dl_df A data.frame class object with all the features and observations of dl.
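To make the "horizontal join on uid" idea concrete, here is a minimal base R sketch. It does not use metasnf; the data frames and column names are invented for illustration, and the package's own method may differ in details such as row ordering.

```r
# Two hypothetical single-feature data frames sharing a "uid" key column
income <- data.frame(uid = c("p1", "p2", "p3"), household_income = c(1, 3, 2))
anxiety <- data.frame(uid = c("p3", "p1", "p2"), anxiety = c(0, 2, 1))

# Join every data frame in the list on the shared key, one after another
joined <- Reduce(
    function(x, y) merge(x, y, by = "uid"),
    list(income, anxiety)
)

print(joined)
```

Each observation ends up on a single row with one column per feature, regardless of the row order in the inputs.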
Coerce an ext_solutions_df class object into a data.frame class object
Coerce an ext_solutions_df class object into a data.frame class object.
## S3 method for class 'ext_solutions_df' as.data.frame( x, row.names = NULL, optional = FALSE, keep_attributes = FALSE, ... )
x |
A |
row.names |
Additional parameter passed to |
optional |
Additional parameter passed to |
keep_attributes |
If TRUE, resulting data frame includes settings data frame and weights matrix. |
... |
Additional parameter passed to |
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a solutions_df class object into a data.frame class object
Coerce a solutions_df class object into a data.frame class object.
## S3 method for class 'solutions_df' as.data.frame( x, row.names = NULL, optional = FALSE, keep_attributes = FALSE, ... )
x |
A |
row.names |
Additional parameter passed to |
optional |
Additional parameter passed to |
keep_attributes |
If TRUE, resulting data frame includes settings data frame and weights matrix. |
... |
Additional parameter passed to |
A data.frame class object with all the columns of x and its contained solutions data frame.
Coerce a clust_fns_list class object into a list class object
Coerce a clust_fns_list class object into a list class object.
## S3 method for class 'clust_fns_list' as.list(x, ...)
x |
A |
... |
Additional parameter passed to |
A list class object with all the functions of x.
Coerce a data_list class object into a list class object
Coerce a data_list class object into a list class object.
## S3 method for class 'data_list' as.list(x, ...)
x |
A |
... |
Additional parameter passed to |
A list class object with all the objects of x.
Coerce a dist_fns_list class object into a list class object
Coerce a dist_fns_list class object into a list class object.
## S3 method for class 'dist_fns_list' as.list(x, ...)
x |
A |
... |
Additional parameter passed to |
A list class object with all the functions of x.
Coerce a weights_matrix class object into a matrix class object
Coerce a weights_matrix class object into a matrix class object.
## S3 method for class 'weights_matrix' as.matrix(x, ...)
x |
A |
... |
Additional parameter passed to |
A matrix and array class object.
Collapse a data frame and/or a data list into a single data frame
assemble_data(data, dl)
data |
A data frame. |
dl |
A nested list of input data from |
A class "data.frame" object containing all the features of the provided data frame and/or data list.
Heatmap of pairwise associations between features
assoc_pval_heatmap( correlation_matrix, scale_diag = "max", cluster_rows = TRUE, cluster_columns = TRUE, show_row_names = TRUE, show_column_names = TRUE, show_heatmap_legend = FALSE, confounders = NULL, out_of_models = NULL, annotation_colours = NULL, labels_colour = NULL, split_by_domain = FALSE, dl = NULL, significance_stars = TRUE, slice_font_size = 8, ... )
correlation_matrix |
Matrix containing all pairwise association p-values. The recommended way to obtain this matrix is through the calc_assoc_pval_matrix function. |
scale_diag |
Parameter that controls how the diagonals of the correlation_matrix are adjusted in the heatmap. For best viewing, this is set to "max", which will match the diagonals to whichever pairwise association has the highest p-value. |
cluster_rows |
Parameter for ComplexHeatmap::Heatmap. Will be ignored if split_by_domain is also provided. |
cluster_columns |
Parameter for ComplexHeatmap::Heatmap. Will be ignored if split_by_domain is also provided. |
show_row_names |
Parameter for ComplexHeatmap::Heatmap. |
show_column_names |
Parameter for ComplexHeatmap::Heatmap. |
show_heatmap_legend |
Parameter for ComplexHeatmap::Heatmap. |
confounders |
A named list where the elements are columns in the correlation_matrix and the names are the corresponding display names. |
out_of_models |
Like confounders, but a named list of out-of-model measures (which are also present as columns in the correlation_matrix). |
annotation_colours |
Named list of heatmap annotations and their colours. |
labels_colour |
Vector of colours to use for the columns and rows of the heatmap. |
split_by_domain |
The results of |
dl |
A nested list of input data from |
significance_stars |
If TRUE (default), plots significance stars on heatmap cells. |
slice_font_size |
Font size for domain separating labels. |
... |
Additional parameters passed into ComplexHeatmap::Heatmap. |
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise associations between features from the provided correlation_matrix.
# data_list <- data_list(
#     list(income, "household_income", "demographics", "ordinal"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     list(fav_colour, "favourite_colour", "demographics", "categorical"),
#     list(anxiety, "anxiety", "behaviour", "ordinal"),
#     list(depress, "depressed", "behaviour", "ordinal"),
#     uid = "unique_id"
# )
#
# assoc_pval_matrix <- calc_assoc_pval_matrix(data_list)
# ap_heatmap <- assoc_pval_heatmap(assoc_pval_matrix)
Given a single row of a solutions data frame and data provided through a data list, this function will return a series of bar and/or jitter plots based on feature types.
auto_plot( sol_df_row = NULL, dl = NULL, cluster_df = NULL, return_plots = TRUE, save = NULL, jitter_width = 6, jitter_height = 6, bar_width = 6, bar_height = 6, verbose = FALSE )
sol_df_row |
A single row of a solutions data frame. |
dl |
A data list containing data to plot. |
cluster_df |
Directly provide a cluster_df rather than a solutions matrix. Useful if plotting data from label propagated results. |
return_plots |
If |
save |
If a string is provided, plots will be saved and this string will be used to prefix plot names. |
jitter_width |
Width of jitter plots if save is specified. |
jitter_height |
Height of jitter plots if save is specified. |
bar_width |
Width of bar plots if save is specified. |
bar_height |
Height of bar plots if save is specified. |
verbose |
If TRUE, output progress to console. |
By default, returns a list of plots (class "gg", "ggplot") with one plot for every feature in the provided data list and/or target list. If return_plots is FALSE, will instead return a single "data.frame" object containing every provided feature for every observation in long format.
Bar plot separating a feature by cluster
bar_plot(df, feature)
df |
A data.frame containing a cluster column and the feature to plot. |
feature |
The feature to plot. |
A bar plot (class "gg", "ggplot") showing the distribution of a feature across clusters.
This is the core function of the metasnf package. Using the information stored in a settings_df (see ?settings_df) and a data list (see ?data_list), run repeated complete SNF pipelines to generate a broad space of post-SNF cluster solutions.
batch_snf(dl, sc, processes = 1, return_sim_mats = FALSE, sim_mats_dir = NULL)
dl |
A nested list of input data from |
sc |
An |
processes |
Specify number of processes used to complete SNF iterations
|
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
sim_mats_dir |
If specified, this directory will be used to save all generated similarity matrices. |
By default, returns a solutions data frame (class "data.frame"): a data frame containing one row for every row of the provided settings data frame, all the original columns of that settings data frame, and new columns containing the assigned cluster of each observation from the cluster solution derived by that row's settings. If return_sim_mats is TRUE, the function will instead return a list containing the solutions data frame as well as a list of the final similarity matrices (class "matrix") generated by SNF for each row of the settings data frame. If suppress_clustering is TRUE, the solutions data frame will not be returned in the output.
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 3)

# A solutions data frame without similarity matrices:
sol_df <- batch_snf(input_dl, sc)

# A solutions data frame with similarity matrices:
# sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)
# sim_mats_list(sol_df)
Run SNF clustering pipeline on a list of subsampled data lists.
batch_snf_subsamples( dl_subsamples, sc, processes = 1, return_sim_mats = FALSE, sim_mats_dir = NULL, verbose = TRUE )
dl_subsamples |
A list of subsampled data lists. This object is
generated by the function |
sc |
An |
processes |
Specify number of processes used to complete SNF iterations
|
return_sim_mats |
If TRUE, function will return a list where the first element is the solutions data frame and the second element is a list of similarity matrices for each row in the sol_df. Default FALSE. |
sim_mats_dir |
If specified, this directory will be used to save all generated similarity matrices. |
verbose |
If TRUE, output progress to console. |
By default, returns a one-element list: cluster_solutions, which is itself a list of cluster solution data frames corresponding to each of the provided data list subsamples. Setting the parameters return_sim_mats and return_solutions to TRUE will turn the result of the function into a three-element list containing the corresponding solutions data frames and final fused similarity matrices of those cluster solutions, should you require these objects for your own stability calculations.
# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
#
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
#
# batch_subsample_results <- batch_snf_subsamples(
#     my_dl_subsamples,
#     sc,
#     verbose = TRUE
# )
This function constructs an ari_matrix class object from a solutions_df class object. The ARI matrix stores pairwise adjusted Rand indices for all cluster solutions as well as a numeric order for the solutions data frame based on the hierarchical clustering of the ARI matrix.
calc_aris( sol_df, processes = 1, verbose = FALSE, dist_method = "euclidean", hclust_method = "complete" )
sol_df |
Solutions data frame containing cluster solutions to calculate pairwise ARIs for. |
processes |
Specify number of processes used to complete calculations
|
verbose |
If TRUE, output progress to console. |
dist_method |
Distance method to use when calculating the sorting order of the matrix. Argument is directly passed into stats::dist. Options include "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski". |
hclust_method |
Agglomerative method to use when calculating sorting
order by |
om_aris ARIs between clustering solutions of a solutions data frame
dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(dl, n_solutions = 3)

sol_df <- batch_snf(dl, sc)

calc_aris(sol_df)
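For intuition about what each cell of an ARI matrix represents, the following base R sketch computes the adjusted Rand index between two cluster assignment vectors from their contingency table. The helper name adjusted_rand_index is hypothetical and not part of metasnf; the package may rely on a different implementation internally.

```r
# Hypothetical helper: adjusted Rand index between two labelings of the
# same observations. Returns NaN if both labelings put every observation
# in a single cluster, because the denominator is then zero.
adjusted_rand_index <- function(a, b) {
    stopifnot(length(a) == length(b))
    tab <- table(a, b)
    # Pairs of observations that share a cluster in both labelings
    pairs_both <- sum(choose(tab, 2))
    # Pairs that share a cluster within each labeling separately
    pairs_a <- sum(choose(rowSums(tab), 2))
    pairs_b <- sum(choose(colSums(tab), 2))
    # Agreement expected by chance, and the maximum attainable agreement
    expected <- pairs_a * pairs_b / choose(length(a), 2)
    max_index <- (pairs_a + pairs_b) / 2
    (pairs_both - expected) / (max_index - expected)
}

# Identical partitions with different label names
adjusted_rand_index(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))
# Partitions that disagree on every pair
adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Because the index depends only on which observations are grouped together, relabelling the clusters does not change it.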
Calculate p-values for all pairwise associations of features in a data list
calc_assoc_pval_matrix(dl, verbose = FALSE, cat_test = "chi_squared")
dl |
A nested list of input data from |
verbose |
If TRUE, output progress to the console. |
cat_test |
String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test. |
A "matrix" class object containing pairwise association p-values between the features in the provided data list.
data_list <- data_list(
    list(income, "household_income", "demographics", "ordinal"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

assoc_pval_matrix <- calc_assoc_pval_matrix(data_list)
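The two cat_test options correspond to standard tests that are available in base R. The sketch below applies both to one pair of made-up categorical features to show what a single categorical-categorical cell of the matrix could look like. The data are invented, and how metasnf calls these tests internally is an assumption rather than something stated here.

```r
# Invented categorical features for 40 observations
income_group <- rep(c("low", "high"), each = 20)
anxiety <- c(rep("yes", 16), rep("no", 4), rep("yes", 5), rep("no", 15))

# Cross-tabulate the two features
counts <- table(income_group, anxiety)

# Chi-squared test of independence (the "chi_squared" style of test)
p_chi_squared <- stats::chisq.test(counts)$p.value

# Fisher's exact test (the "fisher_exact" style of test)
p_fisher_exact <- stats::fisher.test(counts)$p.value

c(chi_squared = p_chi_squared, fisher_exact = p_fisher_exact)
```

Fisher's exact test is usually preferred when some cells of the cross-tabulation have small expected counts, where the chi-squared approximation becomes unreliable.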
Normalized mutual information scores can be used to indirectly measure how important a feature may have been in producing a cluster solution. This function will calculate the normalized mutual information between cluster solutions in a solutions data frame as well as cluster solutions created by including only a single feature from a provided data list, but otherwise using all the same hyperparameters as specified in the original SNF config. Note that NMIs can be calculated between two cluster solutions regardless of what features were actually used to create those cluster solutions. For example, a feature that was not involved in producing a particular cluster solution may still have a high NMI with that cluster solution (typically because it was highly correlated with a different feature that was used).
calc_nmis( dl, sol_df, transpose = TRUE, ignore_inclusions = TRUE, verbose = FALSE )
dl |
A nested list of input data from |
sol_df |
Result of |
transpose |
If TRUE, will transpose the output data frame. |
ignore_inclusions |
If TRUE, will ignore the inclusion columns in the solutions data frame and calculate NMIs for all features. If FALSE, will give NAs for features that were dropped on a given settings_df row. |
verbose |
If TRUE, output progress to console. |
A "data.frame" class object containing one row for every feature in the provided data list and one column for every solution in the provided solutions data frame. Populated values show the calculated NMI score for each feature-solution combination.
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 2)

sol_df <- batch_snf(input_dl, sc)

calc_nmis(input_dl, sol_df)
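As a rough illustration of the quantity involved, here is a base R sketch of normalized mutual information between two cluster assignment vectors. Several normalizations of mutual information are in use; this sketch assumes the arithmetic-mean form, 2 I(A;B) / (H(A) + H(B)), which may not match the one used by the package. The helper name normalized_mutual_info is hypothetical.

```r
# Hypothetical helper: NMI with arithmetic-mean normalization.
# Returns NaN if both labelings consist of a single cluster.
normalized_mutual_info <- function(a, b) {
    stopifnot(length(a) == length(b))
    # Joint and marginal distributions of the two labelings
    joint <- unclass(table(a, b)) / length(a)
    pa <- rowSums(joint)
    pb <- colSums(joint)
    # Mutual information, summed over non-empty cells only
    nz <- joint > 0
    mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
    entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
    2 * mi / (entropy(pa) + entropy(pb))
}

# One labeling fully determines the other
normalized_mutual_info(c(1, 1, 2, 2), c("b", "b", "a", "a"))
# Statistically independent labelings
normalized_mutual_info(c(1, 1, 2, 2), c(1, 2, 1, 2))
```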
Calculate coclustering data.
calculate_coclustering(subsample_solutions, sol_df, verbose = FALSE)
subsample_solutions |
A list containing cluster solutions from
distinct subsamples of the data. This object is generated by the function
|
sol_df |
A solutions data frame. This object is generated by the
function |
verbose |
If TRUE, output time remaining estimates to console. |
A list containing the following components:
cocluster_dfs: A list of data frames, one per cluster solution, that shows the number of times that every pair of observations in the original cluster solution occurred in the same subsample, the number of times that every pair clustered together in a subsample, and the corresponding fraction of times that every pair clustered together in a subsample.
cocluster_ss_mats: The number of times every pair of observations occurred in the same subsample, formatted as a pairwise matrix.
cocluster_sc_mats: The number of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.
cocluster_cf_mats: The fraction of times every pair of observations occurred in the same cluster, formatted as a pairwise matrix.
cocluster_summary: Specifically among pairs of observations that clustered together in the original full cluster solution, what fraction of those pairs remained clustered together throughout the subsample solutions. This information is formatted as a data frame with one row per cluster solution.
# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
#
# sol_df <- batch_snf(my_dl, sc)
#
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
#
# batch_subsample_results <- batch_snf_subsamples(
#     my_dl_subsamples,
#     sc,
#     verbose = TRUE
# )
#
# coclustering_results <- calculate_coclustering(
#     batch_subsample_results,
#     sol_df,
#     verbose = TRUE
# )
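The pairwise counts described above can be sketched in base R as follows. This is an illustrative reimplementation under simplifying assumptions (each subsample solution is a named vector of cluster labels, with observation IDs as names), not the package's code; cocluster_counts and the toy data are hypothetical.

```r
# Hypothetical helper: for every pair of observations, count how often
# they were drawn into the same subsample and how often they were then
# assigned to the same cluster.
cocluster_counts <- function(subsample_solutions, uids) {
    n <- length(uids)
    same_subsample <- matrix(0, n, n, dimnames = list(uids, uids))
    same_cluster <- same_subsample
    for (solution in subsample_solutions) {
        present <- names(solution)
        same_subsample[present, present] <-
            same_subsample[present, present] + 1
        # 1 where two observations share a cluster label, else 0
        together <- outer(solution, solution, "==") * 1
        same_cluster[present, present] <-
            same_cluster[present, present] + together
    }
    list(
        same_subsample = same_subsample,
        same_cluster = same_cluster,
        # NaN for pairs that never appeared in the same subsample
        fraction = same_cluster / same_subsample
    )
}

# Three toy subsamples of four observations
toy_solutions <- list(
    c(a = 1, b = 1, c = 2),
    c(a = 1, b = 2, d = 2),
    c(b = 1, c = 1, d = 2)
)
toy_counts <- cocluster_counts(toy_solutions, c("a", "b", "c", "d"))
toy_counts$fraction
```

In this toy example, observations a and b appear together in two subsamples and share a cluster in one of them, so their co-clustering fraction is 0.5.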
This is the same data as diagnosis_df, with renamed features and columns.
cancer_diagnosis_df
cancer_diagnosis_df
A data frame with 200 rows and 2 columns:
Random three-digit number uniquely identifying the patient
Mock cancer diagnosis feature (1, 2, or 3)
This data came from the SNFtool package, with slight modifications.
This is an internal function meant to be used by the assoc_pval_heatmap function.
cell_significance_fn(data)
data |
The matrix containing the cells to base the significance stars on. |
cell_fn Another function that is well-formatted for usage as the cell_fun argument in ComplexHeatmap::Heatmap.
Helper function to stop annotation building when no data was provided
check_dataless_annotations(annotation_requests, data)
annotation_requests |
A list of requested annotations |
data |
A data frame with data to build annotations |
Does not return any value. This function just raises an error when annotations are requested without any provided data for a heatmap.
Check for ComplexHeatmap and circlize dependencies
check_hm_dependencies()
Does not return any value. This function just checks that the ComplexHeatmap and circlize packages are installed.
Check to see if similarity matrices in a list have the following properties:
The maximum value in the entire matrix is 0.5
Every value in the diagonal is 0.5
check_similarity_matrices(similarity_matrices)
similarity_matrices |
A list of similarity matrices |
valid_matrices Boolean indicating if properties are met by all similarity matrices
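A check of the two properties listed above could look like the following base R sketch. The function name has_expected_scale and the numerical tolerance are assumptions made for illustration; the package's own check may compare values differently.

```r
# Hypothetical helper: TRUE only if every matrix in the list has a
# maximum value of 0.5 and a diagonal made up entirely of 0.5.
has_expected_scale <- function(similarity_matrices, tol = 1e-8) {
    all(vapply(
        similarity_matrices,
        function(m) {
            abs(max(m) - 0.5) < tol && all(abs(diag(m) - 0.5) < tol)
        },
        logical(1)
    ))
}

good <- matrix(c(0.5, 0.1, 0.1, 0.5), nrow = 2)
bad <- matrix(c(0.5, 0.7, 0.7, 0.5), nrow = 2)

has_expected_scale(list(good))
has_expected_scale(list(good, bad))
```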
These functions can be used when building a metasnf clustering functions list. Each function converts a similarity matrix (matrix class object) to a cluster solution (numeric vector). Note that these functions (or custom clustering functions) cannot accept number of clusters as a parameter; this value must be built into the function itself if necessary.
spectral_eigen(similarity_matrix) spectral_rot(similarity_matrix) spectral_eigen_classic(similarity_matrix) spectral_rot_classic(similarity_matrix) spectral_two(similarity_matrix) spectral_three(similarity_matrix) spectral_four(similarity_matrix) spectral_five(similarity_matrix) spectral_six(similarity_matrix) spectral_seven(similarity_matrix) spectral_eight(similarity_matrix) spectral_nine(similarity_matrix) spectral_ten(similarity_matrix)
similarity_matrix |
A similarity matrix. |
spectral_eigen: Spectral clustering where the number of clusters is based on the eigen-gap heuristic
spectral_rot: Spectral clustering where the number of clusters is based on the rotation-cost heuristic
spectral_(C): Spectral clustering for a C-cluster solution.
solution_data A vector of cluster assignments
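A custom clustering function only needs to follow the contract described above: take a similarity matrix and return a numeric vector of cluster assignments, with any cluster count fixed inside the function. The sketch below uses average-linkage hierarchical clustering from base R as a stand-in algorithm; hclust_three and the way similarity is turned into distance are illustrative choices, not something the package prescribes.

```r
# Hypothetical custom clustering function: always returns three clusters.
hclust_three <- function(similarity_matrix) {
    # Turn similarities into dissimilarities (larger similarity = closer)
    distance <- stats::as.dist(max(similarity_matrix) - similarity_matrix)
    tree <- stats::hclust(distance, method = "average")
    as.numeric(stats::cutree(tree, k = 3))
}

# Toy similarity matrix with three well-separated pairs of observations
groups <- c(1, 1, 2, 2, 3, 3)
toy_similarity <- ifelse(outer(groups, groups, "=="), 0.4, 0.05)
diag(toy_similarity) <- 0.5

hclust_three(toy_similarity)
```

A function like this could then be supplied as a named element of the clust_fns argument of clust_fns_list.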
This function can be used to specify custom clustering algorithms to apply to the final similarity matrices produced by each run of the batch_snf function.
clust_fns_list(clust_fns = NULL, use_default_clust_fns = FALSE)
clust_fns |
A list of named clustering functions |
use_default_clust_fns |
If TRUE, prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph) to clust_fns. |
A list of clustering algorithm functions that can be passed into the batch_snf and generate_settings_list functions.
# Using just the base clustering algorithms --------------------------------
# This will just contain spectral_eigen and spectral_rot
cfl <- clust_fns_list(use_default_clust_fns = TRUE)

# Adding algorithms provided by the package --------------------------------
# This will contain the base clustering algorithms (spectral_eigen,
# spectral_rot) as well as two pre-defined spectral clustering functions
# that force the number of clusters to be two or five
cfl <- clust_fns_list(
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    ),
    use_default_clust_fns = TRUE
)

# Adding your own algorithms -----------------------------------------------
# This will contain the base and user-provided clustering algorithms
my_clustering_algorithm <- function(similarity_matrix) {
    # your code that converts similarity matrix to clusters here...
}

# cfl <- clust_fns_list(
#     clust_fns = list("my_algorithm" = my_clustering_algorithm),
#     use_default_clust_fns = TRUE
# )

# Suppress the base algorithms ---------------------------------------------
# This will contain only user-provided clustering algorithms
cfl <- clust_fns_list(
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    )
)
This function creates a density plot that shows, for all pairs of observations that originally clustered together, the distribution of the fractions of subsamples in which those pairs clustered together.
cocluster_density(cocluster_df)
cocluster_df |
A data frame containing coclustering data for a single
cluster solution. This object is generated by the |
Density plot (class "gg", "ggplot") of the distribution of coclustering across pairs and subsamples of the data.
# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
#
# sol_df <- batch_snf(my_dl, sc)
#
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
#
# batch_subsample_results <- batch_snf_subsamples(
#     my_dl_subsamples,
#     sc,
#     verbose = TRUE
# )
#
# coclustering_results <- calculate_coclustering(
#     batch_subsample_results,
#     sol_df,
#     verbose = TRUE
# )
#
# cocluster_dfs <- coclustering_results$"cocluster_dfs"
#
# cocluster_density(cocluster_dfs[[1]])
Create a heatmap that shows the distribution of observation co-clustering across resampled data.
cocluster_heatmap( cocluster_df, cluster_rows = TRUE, cluster_columns = TRUE, show_row_names = FALSE, show_column_names = FALSE, dl = NULL, data = NULL, left_bar = NULL, right_bar = NULL, top_bar = NULL, bottom_bar = NULL, left_hm = NULL, right_hm = NULL, top_hm = NULL, bottom_hm = NULL, annotation_colours = NULL, min_colour = NULL, max_colour = NULL, ... )
cocluster_df |
A data frame containing coclustering data for a single
cluster solution. This object is generated by the |
cluster_rows |
Argument passed to |
cluster_columns |
Argument passed to |
show_row_names |
Argument passed to |
show_column_names |
Argument passed to |
dl |
See ?similarity_matrix_heatmap. |
data |
See ?similarity_matrix_heatmap. |
left_bar |
See ?similarity_matrix_heatmap. |
right_bar |
See ?similarity_matrix_heatmap. |
top_bar |
See ?similarity_matrix_heatmap. |
bottom_bar |
See ?similarity_matrix_heatmap. |
left_hm |
See ?similarity_matrix_heatmap. |
right_hm |
See ?similarity_matrix_heatmap. |
top_hm |
See ?similarity_matrix_heatmap. |
bottom_hm |
See ?similarity_matrix_heatmap. |
annotation_colours |
See ?similarity_matrix_heatmap. |
min_colour |
See ?similarity_matrix_heatmap. |
max_colour |
See ?similarity_matrix_heatmap. |
... |
Arguments passed to |
Heatmap (class "Heatmap" from ComplexHeatmap) object showing the distribution of observation co-clustering across resampled data.
# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
#
# sol_df <- batch_snf(my_dl, sc)
#
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
#
# batch_subsample_results <- batch_snf_subsamples(
#     my_dl_subsamples,
#     sc,
#     verbose = TRUE
# )
#
# coclustering_results <- calculate_coclustering(
#     batch_subsample_results,
#     sol_df,
#     verbose = TRUE
# )
#
# cocluster_dfs <- coclustering_results$"cocluster_dfs"
#
# cocluster_heatmap(
#     cocluster_dfs[[1]],
#     dl = my_dl,
#     top_hm = list(
#         "Income" = "household_income",
#         "Pubertal Status" = "pubertal_status"
#     ),
#     annotation_colours = list(
#         "Pubertal Status" = colour_scale(
#             c(1, 4),
#             min_colour = "black",
#             max_colour = "purple"
#         ),
#         "Income" = colour_scale(
#             c(0, 4),
#             min_colour = "black",
#             max_colour = "red"
#         )
#     )
# )
Defunct function for converting a data list into a data frame. Please
use as.data.frame() instead.
collapse_dl(data_list)
data_list |
A nested list of input data from |
A "data.frame"-formatted version of the provided data list.
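Conceptually, collapsing a data list amounts to joining its component data frames on the shared UID column. The base R sketch below illustrates that idea on a plain nested list standing in for a real data list. The helper merge_by_uid and the toy data are hypothetical and are not part of metasnf; the package's own as.data.frame() method may differ in details such as column naming.

```r
# Toy stand-in for a data list: each component holds a data frame keyed by "uid".
toy_dl <- list(
  list(data = data.frame(uid = c("a", "b", "c"), income = c(1, 2, 3))),
  list(data = data.frame(uid = c("a", "b", "c"), colour = c("red", "blue", "red")))
)

# Hypothetical helper (not a metasnf function): inner-join every component
# data frame on the UID column to obtain one wide data frame.
merge_by_uid <- function(dl, uid = "uid") {
  dfs <- lapply(dl, function(component) component$data)
  Reduce(function(x, y) merge(x, y, by = uid), dfs)
}

collapsed <- merge_by_uid(toy_dl)
print(collapsed)
```

The result has one row per observation and one column per feature, alongside the UID column.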
Given a numeric vector and min and max colour values, return a colour ramp
that assigns a colour to each element in the vector. This function is a
wrapper for circlize::colorRamp2.
colour_scale(data, min_colour, max_colour)
data |
Vector of numeric values. |
min_colour |
Minimum colour value. |
max_colour |
Maximum colour value. |
A "function" class object that can build a circlize-style colour ramp.
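To make the idea of a colour ramp concrete, the sketch below builds a comparable two-colour ramp with base R's grDevices::colorRamp rather than circlize. The function two_colour_ramp is a hypothetical analogue written for illustration; it is not how colour_scale is implemented, and it assumes the data contain at least two distinct values.

```r
# Hypothetical base R analogue of a two-colour ramp (not the metasnf code).
two_colour_ramp <- function(data, min_colour, max_colour) {
  value_range <- range(data, na.rm = TRUE)
  ramp <- grDevices::colorRamp(c(min_colour, max_colour))
  function(x) {
    # Rescale to [0, 1] and clamp values outside the observed range.
    scaled <- (x - value_range[1]) / (value_range[2] - value_range[1])
    scaled <- pmin(pmax(scaled, 0), 1)
    rgb_values <- ramp(scaled)
    grDevices::rgb(
      rgb_values[, 1],
      rgb_values[, 2],
      rgb_values[, 3],
      maxColorValue = 255
    )
  }
}

income_colours <- two_colour_ramp(c(0, 4), "black", "red")
print(income_colours(c(0, 2, 4)))
```

The returned object is itself a function, which is why it can be passed to heatmap annotations that expect a value-to-colour mapping.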
Create a heatmap where each row corresponds to a different set of hyperparameters in an SNF config object. Numeric parameters are scaled (normalized), and non-numeric parameters are added as heatmap annotations. Rows can be reordered to match prior meta clustering results.
config_heatmap( sc, order = NULL, hide_fixed = FALSE, show_column_names = TRUE, show_row_names = TRUE, rect_gp = grid::gpar(col = "black"), colour_breaks = c(0, 1), colours = c("black", "darkseagreen"), column_split_vector = NULL, row_split_vector = NULL, column_split = NULL, row_split = NULL, column_title = NULL, include_weights = TRUE, include_settings = TRUE, ... )
sc |
An |
order |
Numeric vector indicating row ordering of SNF config. |
hide_fixed |
Whether fixed parameters should be removed. |
show_column_names |
Whether to show column names. |
show_row_names |
Whether to show row names. |
rect_gp |
Graphic parameters for drawing rectangles (for heatmap body). The value should be specified by |
colour_breaks |
Numeric vector of breaks for the legend. |
colours |
Vector of colours to use for the heatmap. Should match the length of colour_breaks. |
column_split_vector |
Vector of indices to split columns by. |
row_split_vector |
Vector of indices to split rows by. |
column_split |
Split on columns. For heatmap splitting, please refer to https://jokergoo.github.io/ComplexHeatmap-reference/book/a-single-heatmap.html#heatmap-split . |
row_split |
Same as |
column_title |
Title on the column. |
include_weights |
If TRUE, includes feature weights of the weights matrix into the config heatmap. |
include_settings |
If TRUE, includes columns from the settings data frame into the config heatmap. |
... |
Additional parameters passed to |
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the scaled values of the provided SNF config.
    dl <- data_list(
        list(income, "household_income", "demographics", "ordinal"),
        list(pubertal, "pubertal_status", "demographics", "continuous"),
        list(fav_colour, "favourite_colour", "demographics", "categorical"),
        list(anxiety, "anxiety", "behaviour", "ordinal"),
        list(depress, "depressed", "behaviour", "ordinal"),
        uid = "unique_id"
    )

    sc <- snf_config(
        dl,
        n_solutions = 10,
        dropout_dist = "uniform"
    )

    config_heatmap(sc)
Like the mock data frame "abcd_cort_sa", but with "unique_id" as the "uid".
cort_sa
A data frame with 188 rows and 152 columns:
The unique identifier of the ABCD dataset
Cortical surface areas of various ROIs (mm^2, I think)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Like the mock data frame "abcd_cort_t", but with "unique_id" as the "uid".
cort_t
A data frame with 188 rows and 152 columns:
The unique identifier of the ABCD dataset
Cortical thicknesses of various ROIs (mm, I think)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
data_list class object
data_list() constructs a data list object which inherits from classes
data_list and list. This object is the primary way in which features
used throughout the metasnf clustering pipeline are stored. The data list is
fundamentally a 2-level nested list object where each inner list contains a
data frame and associated metadata for that data frame. The metadata
includes the name of the data frame, the 'domain' of that data frame (the
broader source of information that the input data frame is capturing,
determined by the user's domain knowledge), and the type of feature stored in
the data frame (continuous, discrete, ordinal, categorical, or mixed).
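The 2-level nesting described above can be pictured with a plain base R list. The sketch below is only a structural illustration: toy_component is a hypothetical helper, and the resulting object is an ordinary list, not a validated data_list. The element names data, name, domain, and type follow the explicit loading style shown in the examples for data_list().

```r
# Hypothetical helper that builds one inner list: a data frame plus metadata.
toy_component <- function(data, name, domain, type) {
  list(data = data, name = name, domain = domain, type = type)
}

# Outer list: one inner list per input data frame.
# NOTE: this is a plain list, NOT a real data_list object.
toy_dl <- list(
  toy_component(
    data.frame(uid = c("1", "2"), heart_rate = c(60, 72)),
    "heart_rate", "clinical", "continuous"
  ),
  toy_component(
    data.frame(uid = c("1", "2"), city = c("toronto", "montreal")),
    "city", "location", "categorical"
  )
)

# Tabulate the metadata stored alongside each data frame.
metadata <- do.call(rbind, lapply(toy_dl, function(x) {
  data.frame(
    name = x$name,
    domain = x$domain,
    type = x$type,
    n_features = ncol(x$data) - 1  # every column except the UID
  )
}))
print(metadata)
```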
data_list(..., uid)
... |
Any number of lists formatted as (df, "df_name", "df_domain", "df_type") and/or any number of lists of lists formatted as (df, "df_name", "df_domain", "df_type"). |
uid |
(character) The name of the uid column currently used in the data frames. |
    heart_rate_df <- data.frame(
        patient_id = c("1", "2", "3"),
        var1 = c(0.04, 0.1, 0.3),
        var2 = c(30, 2, 0.3)
    )

    personality_test_df <- data.frame(
        patient_id = c("1", "2", "3"),
        var3 = c(900, 1990, 373),
        var4 = c(509, 2209, 83)
    )

    survey_response_df <- data.frame(
        patient_id = c("1", "2", "3"),
        var5 = c(1, 3, 3),
        var6 = c(2, 3, 3)
    )

    city_df <- data.frame(
        patient_id = c("1", "2", "3"),
        var7 = c("toronto", "montreal", "vancouver")
    )

    # Explicitly (Name each nested list element):
    dl <- data_list(
        list(
            data = heart_rate_df,
            name = "heart_rate",
            domain = "clinical",
            type = "continuous"
        ),
        list(
            data = personality_test_df,
            name = "personality_test",
            domain = "surveys",
            type = "continuous"
        ),
        list(
            data = survey_response_df,
            name = "survey_response",
            domain = "surveys",
            type = "ordinal"
        ),
        list(
            data = city_df,
            name = "city",
            domain = "location",
            type = "categorical"
        ),
        uid = "patient_id"
    )

    # Compact loading
    dl <- data_list(
        list(heart_rate_df, "heart_rate", "clinical", "continuous"),
        list(personality_test_df, "personality_test", "surveys", "continuous"),
        list(survey_response_df, "survey_response", "surveys", "ordinal"),
        list(city_df, "city", "location", "categorical"),
        uid = "patient_id"
    )

    # Printing data list summaries
    summary(dl)

    # Alternative loading: providing a single list of lists
    list_of_lists <- list(
        list(heart_rate_df, "data1", "domain1", "continuous"),
        list(personality_test_df, "data2", "domain2", "continuous")
    )

    dl <- data_list(
        list_of_lists,
        uid = "patient_id"
    )
Like the mock data frame "abcd_depress", but with "unique_id" as the "uid".
depress
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Ordinal value of impairment on CBCL depression, either 0 (no impairment), 1 (borderline clinical), or 2 (clinically impaired)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
This is the same data as cancer_diagnosis_df, with renamed features and columns.
diagnosis_df
A data frame with 200 rows and 2 columns:
Random three-digit number uniquely identifying the patient
Mock diagnosis feature
This data came from the SNFtool package, with slight modifications.
These functions can be used when building a metasnf distance functions
list. Each function converts a data frame into a distance matrix.
euclidean_distance(df, weights_row) gower_distance(df, weights_row) sn_euclidean_distance(df, weights_row) sew_euclidean_distance(df, weights_row) hamming_distance(df, weights_row)
df |
Data frame containing at least 1 data column |
weights_row |
Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights. |
Functions that work for numeric data:
euclidean_distance: typical Euclidean distance
sn_euclidean_distance: Data frame is first standardized and normalized before typical Euclidean distance is applied
siw_euclidean_distance: Squared (including weights) Euclidean distance, where the weights are also squared
sew_euclidean_distance: Squared (excluding weights) Euclidean distance, where the weights are not also squared
Functions that work for binary data:
hamming_distance: typical Hamming distance
Functions that work for any type of data:
gower_distance: Gower distance (cluster::daisy)
A matrix class object containing pairwise distances.
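The difference between squaring and not squaring the weights can be shown with a small base R sketch. It assumes one plausible reading of the descriptions above: with weights w_j, the "including weights" variant computes sum_j (w_j * (x_ij - x_kj))^2, while the "excluding weights" variant computes sum_j w_j * (x_ij - x_kj)^2. The helper weighted_sq_dist is hypothetical and takes a plain numeric vector of weights rather than a single-row data frame; it is not the metasnf implementation.

```r
# Hypothetical helper (not a metasnf function): squared Euclidean distance
# with per-feature weights that are optionally squared as well.
weighted_sq_dist <- function(df, weights, square_weights = TRUE) {
  x <- as.matrix(df)
  w <- if (square_weights) weights^2 else weights
  # Scaling each column by sqrt(w) makes the squared Euclidean distance
  # equal to sum_j w_j * (x_ij - x_kj)^2.
  scaled <- sweep(x, 2, sqrt(w), "*")
  as.matrix(stats::dist(scaled, method = "euclidean"))^2
}

df <- data.frame(a = c(0, 3), b = c(0, 4))
w <- c(a = 1, b = 0.5)

# Weights squared: 1^2 * 3^2 + 0.5^2 * 4^2
siw <- weighted_sq_dist(df, w, square_weights = TRUE)
# Weights not squared: 1 * 3^2 + 0.5 * 4^2
sew <- weighted_sq_dist(df, w, square_weights = FALSE)

print(siw[1, 2])
print(sew[1, 2])
```

When every weight is 1 the two variants coincide, so the choice only matters once features are weighted unequally.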
The distance metrics list object (inherits classes dist_fns_list and list)
is a list that stores R functions which can convert a data
frame of features into a matrix of pairwise distances. The list is a nested
one, where the first layer of the list can hold up to 5 items (one for each
of the metasnf recognized feature types, continuous, discrete, ordinal,
categorical, and mixed), and the second layer can hold an arbitrary number
of distance functions for each of those types.
dist_fns_list( cnt_dist_fns = NULL, dsc_dist_fns = NULL, ord_dist_fns = NULL, cat_dist_fns = NULL, mix_dist_fns = NULL, automatic_standard_normalize = FALSE, use_default_dist_fns = FALSE )
cnt_dist_fns |
A named list of continuous distance metric functions. |
dsc_dist_fns |
A named list of discrete distance metric functions. |
ord_dist_fns |
A named list of ordinal distance metric functions. |
cat_dist_fns |
A named list of categorical distance metric functions. |
mix_dist_fns |
A named list of mixed distance metric functions. |
automatic_standard_normalize |
If TRUE, will automatically use standard normalization prior to calculation of any numeric distances. This parameter overrides all other distance functions list-related parameters. |
use_default_dist_fns |
If TRUE, prepend the base distance metrics (euclidean distance for continuous, discrete, and ordinal data and gower distance for categorical and mixed data) to the resulting distance metrics list. |
Call ?distance_metrics to see all distance metric functions provided in metasnf.
A distance metrics list object.
    # Using just the base distance metrics ------------------------------------
    dist_fns_list <- dist_fns_list()

    # Adding your own metrics --------------------------------------------------
    # This will contain only the user-provided distance function:
    cubed_euclidean <- function(df, weights_row) {
        # (your code that converts a data frame to a distance metric here...)
        weights <- diag(weights_row, nrow = length(weights_row))
        weighted_df <- as.matrix(df) %*% weights
        distance_matrix <- weighted_df |>
            stats::dist(method = "euclidean") |>
            as.matrix()
        distance_matrix <- distance_matrix^3
        return(distance_matrix)
    }

    dist_fns_list <- dist_fns_list(
        cnt_dist_fns = list(
            "my_cubed_euclidean" = cubed_euclidean
        )
    )

    # Using default base metrics------------------------------------------------
    # Call ?distance_metrics to see all distance metric functions provided in
    # metasnf. The code below will contain a mix of user-provided and built-in
    # distance metric functions.
    dist_fns_list <- dist_fns_list(
        cnt_dist_fns = list(
            "my_distance_metric" = cubed_euclidean
        ),
        dsc_dist_fns = list(
            "my_distance_metric" = cubed_euclidean
        ),
        ord_dist_fns = list(
            "my_distance_metric" = cubed_euclidean
        ),
        cat_dist_fns = list(
            "my_distance_metric" = gower_distance
        ),
        mix_dist_fns = list(
            "my_distance_metric" = gower_distance
        ),
        use_default_dist_fns = TRUE
    )
Defunct function to summarize a data list. Please use summary() with
argument scope = "feature" instead.
dl_variable_summary(dl)
dl |
A nested list of input data from |
variable_level_summary A data frame containing the name, type, and domain of every variable in a data list.
This function enables manipulating a data_list class object with lapply
syntax without removing that object's data_list class attribute. The
function will only preserve this attribute if the result of the apply call
has a valid data list structure.
dlapply(X, FUN, ...)
X |
A |
FUN |
The function to be applied to each data list component. |
... |
Optional arguments to |
If FUN applied to each component of X yields a valid data list, a data list. Otherwise, a list.
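The class-preserving behaviour described above can be sketched in base R with a toy class. Both class_preserving_lapply and is_valid_toy_dl are hypothetical names invented for this illustration, and the validity rule used here (every component is a list holding a data frame under "data") is a simplifying assumption rather than the package's actual data list validation.

```r
# Hypothetical validity check for a toy class: every component must be a list
# holding a data frame under "data".
is_valid_toy_dl <- function(x) {
  is.list(x) && all(vapply(
    x,
    function(component) is.list(component) && is.data.frame(component$data),
    logical(1)
  ))
}

# Hypothetical class-preserving lapply (not the metasnf implementation):
# restore the input's class only when the result still has a valid structure.
class_preserving_lapply <- function(X, FUN, ...) {
  result <- lapply(X, FUN, ...)
  if (is_valid_toy_dl(result)) {
    class(result) <- class(X)
  }
  result
}

toy_dl <- structure(
  list(list(data = data.frame(uid = c("A", "B"), x = 1:2))),
  class = c("toy_dl", "list")
)

# Structure kept, so the class is restored.
lowered <- class_preserving_lapply(toy_dl, function(component) {
  component$data$uid <- tolower(component$data$uid)
  component
})

# Structure broken (components become integers), so a plain list is returned.
counts <- class_preserving_lapply(
  toy_dl,
  function(component) nrow(component$data)
)

print(class(lowered))
print(class(counts))
```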
    # Convert all UID values to lowercase
    dl <- data_list(
        list(abcd_income, "income", "demographics", "discrete"),
        list(abcd_colour, "colour", "likes", "categorical"),
        uid = "patient"
    )

    dl_lower <- dlapply(
        dl,
        function(x) {
            x$"data"$"uid" <- tolower(x$"data"$"uid")
            return(x)
        }
    )
Function to extend dplyr to extended solutions data frame objects
dplyr_row_slice.ext_solutions_df(data, i, ...)
data |
An extended solutions data frame. |
i |
A vector of row indices. |
... |
Additional arguments. |
Row sliced object with appropriately preserved attributes.
Function to extend dplyr to solutions data frame objects
dplyr_row_slice.solutions_df(data, i, ...)
data |
A solutions data frame. |
i |
A vector of row indices. |
... |
Additional arguments. |
Row sliced object with appropriately preserved attributes.
Manhattan plot of feature-cluster association p-values
esm_manhattan_plot( esm, neg_log_pval_thresh = 5, threshold = NULL, point_size = 5, jitter_width = 0.1, jitter_height = 0.1, text_size = 15, plot_title = NULL, hide_x_labels = FALSE, bonferroni_line = FALSE )
esm |
Extended solutions data frame storing associations between features
and cluster assignments. See |
neg_log_pval_thresh |
Threshold for negative log p-values. |
threshold |
P-value threshold to plot dashed line at. |
point_size |
Size of points in the plot. |
jitter_width |
Width of jitter. |
jitter_height |
Height of jitter. |
text_size |
Size of text in the plot. |
plot_title |
Title of the plot. |
hide_x_labels |
If TRUE, hides x-axis labels. |
bonferroni_line |
If TRUE, plots a dashed black line at the Bonferroni-corrected equivalent of the p-value threshold. |
A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against each solution in the provided solutions data frame.
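The relationship between a p-value threshold and its Bonferroni-corrected equivalent can be worked through in a few lines of base R. This sketch makes two assumptions that the text above does not confirm: that the plot's y-axis uses a base-10 negative logarithm, and that the correction divides the threshold by the number of plotted p-values. The p-values themselves are made up for illustration.

```r
# Made-up association p-values for three features in one cluster solution.
pvals <- c(feature_a = 1e-8, feature_b = 0.003, feature_c = 0.2)
threshold <- 0.05

# ASSUMPTION: base-10 logarithm for the y-axis values.
neg_log_pvals <- -log10(pvals)

# ASSUMPTION: Bonferroni correction divides by the number of tests plotted.
bonferroni_threshold <- threshold / length(pvals)

# Heights at which the two dashed lines would be drawn.
threshold_line <- -log10(threshold)
bonferroni_line <- -log10(bonferroni_threshold)

passes_bonferroni <- pvals < bonferroni_threshold
print(round(neg_log_pvals, 2))
print(passes_bonferroni)
```

Because the corrected threshold is smaller, its line always sits above the uncorrected one on a negative-log scale.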
    # full_dl <- data_list(
    #     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    #     list(income, "household_income", "demographics", "continuous"),
    #     list(pubertal, "pubertal_status", "demographics", "continuous"),
    #     list(anxiety, "anxiety", "behaviour", "ordinal"),
    #     list(depress, "depressed", "behaviour", "ordinal"),
    #     uid = "unique_id"
    # )
    #
    # dl <- full_dl[1:3]
    # target_dl <- full_dl[4:5]
    #
    # set.seed(42)
    # sc <- snf_config(
    #     dl = dl,
    #     n_solutions = 20,
    #     min_k = 20,
    #     max_k = 50
    # )
    #
    # sol_df <- batch_snf(dl, sc)
    #
    # ext_sol_df <- extend_solutions(
    #     sol_df,
    #     dl = dl,
    #     target_dl = target_dl,
    #     min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
    # )
    #
    # esm_manhattan <- esm_manhattan_plot(
    #     ext_sol_df[1:5, ],
    #     neg_log_pval_thresh = 5,
    #     threshold = 0.05,
    #     point_size = 3,
    #     jitter_width = 0.1,
    #     jitter_height = 0.1,
    #     plot_title = "Feature-Solution Associations",
    #     text_size = 14,
    #     bonferroni_line = TRUE
    # )
Calculate eigengap and rotation-cost estimates of the number of clusters
to use when clustering a similarity matrix. This function was adapted
from SNFtool::estimateClustersGivenGraph, but scales up the Laplacian
operator prior to eigenvalue calculations to minimize the risk of
floating point-related errors.
estimate_nclust_given_graph(W, NUMC = 2:10)
W |
Similarity matrix to calculate number of clusters for. |
NUMC |
Range of cluster counts to consider when picking the best number of clusters. |
A list containing the top two eigengap and rotation-cost estimates for the number of clusters in a given similarity matrix.
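For readers unfamiliar with the eigengap heuristic, the sketch below shows its textbook form in base R: compute the eigenvalues of the normalized graph Laplacian, sort them in ascending order, and pick the candidate k with the largest gap between the k-th and (k+1)-th eigenvalues. The function eigengap_estimate is a hypothetical illustration. It returns only a single estimate, omits the rotation-cost heuristic and the Laplacian scaling step mentioned above, and should not be read as the code used by estimate_nclust_given_graph.

```r
# Hypothetical textbook eigengap estimate (not the metasnf implementation).
# `candidates` must all be smaller than nrow(W).
eigengap_estimate <- function(W, candidates = 2:5) {
  W <- (W + t(W)) / 2  # enforce symmetry
  diag(W) <- 0         # ignore self-similarity
  degree <- rowSums(W)
  d_inv_sqrt <- 1 / sqrt(degree)
  # Normalized Laplacian: L = I - D^(-1/2) W D^(-1/2)
  laplacian <- diag(nrow(W)) - (d_inv_sqrt * W) %*% diag(d_inv_sqrt)
  eigenvalues <- sort(
    eigen(laplacian, symmetric = TRUE, only.values = TRUE)$values
  )
  gaps <- diff(eigenvalues)  # gaps[k] = lambda_(k+1) - lambda_k
  candidates[which.max(gaps[candidates])]
}

# Toy similarity matrix: three tight groups of three observations each,
# with weak similarity (0.01) between groups.
W <- matrix(0.01, 9, 9)
for (b in 0:2) {
  idx <- (3 * b + 1):(3 * b + 3)
  W[idx, idx] <- 1
}

estimated_k <- eigengap_estimate(W)
print(estimated_k)
```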
    input_dl <- data_list(
        list(gender_df, "gender", "demographics", "categorical"),
        list(diagnosis_df, "diagnosis", "clinical", "categorical"),
        uid = "patient_id"
    )

    sc <- snf_config(input_dl, n_solutions = 1)

    sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)

    sim_mat <- sim_mats_list(sol_df)[[1]]

    estimate_nclust_given_graph(sim_mat)
Modification of SNFtool mock data frame "Data1"
expression_df
A data frame with 200 rows and 3 columns:
Mock gene expression feature
Mock gene expression feature
Random three-digit number uniquely identifying the patient
This data came from the SNFtool package, with slight modifications.
Extend a solutions data frame to include outcome evaluations
extend_solutions( sol_df, target_dl = NULL, dl = NULL, cat_test = "chi_squared", min_pval = 1e-10, processes = 1, verbose = FALSE )
sol_df |
Result of |
target_dl |
A data list with features to calculate p-values for. Features in the target list will be included during p-value summary measure calculations. |
dl |
A data list with features to calculate p-values for, but that should not be incorporated into p-value summary measure columns (i.e., min/mean/max p-value columns). |
cat_test |
String indicating which statistical test will be used to associate cluster with a categorical feature. Options are "chi_squared" for the Chi-squared test and "fisher_exact" for Fisher's exact test. |
min_pval |
If assigned a value, any p-value less than this will be replaced with this value. |
processes |
The number of processes to use for parallelization. Progress is only reported for sequential processing (processes = 1). |
verbose |
If TRUE, output progress to console. |
An extended solutions data frame (ext_sol_df class object)
that contains p-value columns for each outcome in the provided data lists.
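The kind of calculation behind a single p-value column can be sketched with base R's stats functions. The helper cat_pval below is hypothetical: it tests the association between cluster assignments and one categorical feature using either the Chi-squared test or Fisher's exact test, then applies the min_pval floor described above. How metasnf handles other feature types, and the exact test options it passes, are not shown here.

```r
# Hypothetical helper (not a metasnf function): association p-value between
# cluster assignments and a categorical feature, floored at min_pval.
cat_pval <- function(clusters,
                     feature,
                     cat_test = "chi_squared",
                     min_pval = 1e-10) {
  tab <- table(clusters, feature)
  pval <- switch(
    cat_test,
    # Warnings about small expected counts are suppressed for this toy data.
    chi_squared = suppressWarnings(stats::chisq.test(tab)$p.value),
    fisher_exact = stats::fisher.test(tab)$p.value,
    stop("Unknown cat_test: ", cat_test)
  )
  # Replace any p-value below min_pval with min_pval (skipped when NULL).
  if (!is.null(min_pval)) {
    pval <- max(pval, min_pval)
  }
  pval
}

clusters <- rep(c(1, 2), each = 10)
aligned <- rep(c("a", "b"), each = 10)     # perfectly tracks the clusters
unrelated <- rep(c("a", "b"), times = 10)  # evenly split within each cluster

print(cat_pval(clusters, aligned))
print(cat_pval(clusters, unrelated))
print(cat_pval(clusters, aligned, min_pval = 0.01))
```

Flooring very small p-values keeps later negative-log transformations and summary measures from being dominated by a few extreme values.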
    input_dl <- data_list(
        list(gender_df, "gender", "demographics", "categorical"),
        list(diagnosis_df, "diagnosis", "clinical", "categorical"),
        uid = "patient_id"
    )

    sc <- snf_config(input_dl, n_solutions = 2)

    sol_df <- batch_snf(input_dl, sc)

    ext_sol_df <- extend_solutions(sol_df, input_dl)
Like the mock data frame "abcd_colour", but with "unique_id" as the "uid".
fav_colour
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Categorical transformation of cbcl_depress.
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Return character vector of features stored in an object
features(x)
x |
The object to pull features from. |
A character vector of features in x.
Mock gender data
gender_df
A data frame with 200 rows and 2 columns:
Random three-digit number uniquely identifying the patient
Mock gene methylation feature
This data came from the SNFtool package, with slight modifications.
Deprecated function for building a clustering algorithms list. Please use
clust_fns_list() (or better yet, snf_config()) instead.
generate_clust_algs_list(..., disable_base = FALSE)
... |
An arbitrary number of named clustering functions |
disable_base |
If TRUE, do not prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph). |
A list of clustering algorithm functions that can be passed into the batch_snf and generate_settings_list functions.
Deprecated function for building a distance metrics list. Please use
dist_fns_list() (or better yet, snf_config()) instead.
generate_distance_metrics_list( continuous_distances = NULL, discrete_distances = NULL, ordinal_distances = NULL, categorical_distances = NULL, mixed_distances = NULL, keep_defaults = TRUE )
continuous_distances |
A named list of distance metric functions |
discrete_distances |
A named list of distance metric functions |
ordinal_distances |
A named list of distance metric functions |
categorical_distances |
A named list of distance metric functions |
mixed_distances |
A named list of distance metric functions |
keep_defaults |
If TRUE (default), prepend the base distance metrics (euclidean and standard normalized euclidean) |
A nested and named list of distance metrics functions.
Deprecated function for building a settings matrix. Please use
settings_df() instead.
generate_settings_matrix(...)
... |
Arguments used to generate a settings matrix. |
Raises a deprecated error.
Deprecated function for extracting cluster solutions from a
solutions data frame. Please use t() instead.
This function takes in a single row of a solutions data frame and returns a
data frame containing the cluster assignments for each uid. It is
similar to get_clusters(), which takes one solutions data frame row and
returns a vector of cluster assignments, and get_cluster_solutions(),
which takes a solutions data frame with any number of rows and returns a
data frame indicating the cluster assignments for each of those rows.
get_cluster_df(sol_df_row)
sol_df_row |
One row from a solutions data frame. |
cluster_df data frame of cluster and uid.
Deprecated function for extracting cluster solutions from a
solutions data frame. Please use t() instead.
This function takes in a solutions data frame and returns a data frame containing
the cluster assignments for each uid. It is similar to
get_clusters(), which takes one solutions data frame row and returns a vector
of cluster assignments, and get_cluster_df(), which takes a solutions
data frame with only one row and returns a data frame with two columns: "cluster"
and "uid" (the UID of the observation).
get_cluster_solutions(sol_df)
sol_df |
A solutions data frame (sol_df). |
A "data.frame" object where each row is an observation and each column (apart from the uid column) indicates the cluster that observation was assigned to for the corresponding solutions data frame row.
Deprecated function for extracting cluster solutions from a solutions data frame. Please use t() instead.
This function takes in a single row of a solutions data frame and returns a vector containing the cluster assignments for each observation. It is similar to get_cluster_df(), which takes a solutions data frame with only one row and returns a data frame with two columns: "cluster" and "uid" (the UID of the observation), and get_cluster_solutions(), which takes a solutions data frame with any number of rows and returns a data frame indicating the cluster assignments for each of those rows.
get_clusters(sol_df_row)
sol_df_row |
One row from a solutions data frame. |
clusters Vector of assigned clusters.
This function identifies all observations within a list of data frames that
have no missing data across all data frames. This function is useful when
constructing data lists of distinct feature sets from the same sample of
observations. As data_list() strips away observations with any missing
data, distinct sets of observations may be generated by building a data
list from the same group of observations over different sets of features.
Reducing the pool of observations to only those with complete UIDs first
will avoid downstream generation of data lists of differing sizes.
get_complete_uids(list_of_dfs, uid)
list_of_dfs |
List of data frames. |
uid |
Name of column across data frames containing UIDs |
A character vector of the UIDs of observations that have complete data across the provided list of data frames.
complete_uids <- get_complete_uids(
    list(income, pubertal, anxiety, depress),
    uid = "unique_id"
)

income <- income[income$"unique_id" %in% complete_uids, ]
pubertal <- pubertal[pubertal$"unique_id" %in% complete_uids, ]
anxiety <- anxiety[anxiety$"unique_id" %in% complete_uids, ]
depress <- depress[depress$"unique_id" %in% complete_uids, ]

input_dl <- data_list(
    list(income, "income", "demographics", "ordinal"),
    list(pubertal, "pubertal", "demographics", "continuous"),
    uid = "unique_id"
)

target_dl <- data_list(
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)
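The idea behind get_complete_uids() can be illustrated with a minimal base R sketch. The helper name complete_uids_sketch and the toy data frames below are hypothetical and are not part of the package; the package's own implementation may differ.

```r
# Hypothetical helper (not the package's implementation): keep the UIDs that
# have no missing values in any of the supplied data frames.
complete_uids_sketch <- function(list_of_dfs, uid) {
    uids_per_df <- lapply(list_of_dfs, function(df) {
        # UIDs of the rows of this data frame that contain no NA values
        as.character(df[stats::complete.cases(df), uid])
    })
    # Keep only UIDs that are complete in every data frame
    Reduce(intersect, uids_per_df)
}

# Toy data: p2 is missing income, p3 is missing anxiety
income_toy <- data.frame(
    unique_id = c("p1", "p2", "p3"),
    income = c(1, NA, 3)
)
anxiety_toy <- data.frame(
    unique_id = c("p1", "p2", "p3"),
    anxiety = c(2, 1, NA)
)

complete_uids <- complete_uids_sketch(
    list(income_toy, anxiety_toy),
    uid = "unique_id"
)
complete_uids
```

Only "p1" survives here, because it is the only observation with complete data in both toy data frames.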
Deprecated function for extracting UIDs from a data list. Please use uids() instead.
get_dl_uids(dl, prefix = FALSE)
dl |
A nested list of input data from data_list(). |
prefix |
If TRUE, preserves the "uid_" prefix added to UIDs when creating a data list. |
A character vector of the UID labels contained in a data list.
Return the row or column ordering present in a heatmap
get_heatmap_order(heatmap, type = "rows")
heatmap |
A heatmap object to collect ordering from. |
type |
The type of ordering to return. Either "rows" or "columns". |
A numeric vector of the ordering used within the provided ComplexHeatmap "Heatmap" object.
Return the hierarchical clustering order of a matrix
get_matrix_order(matrix, dist_method = "euclidean", hclust_method = "complete")
matrix |
Matrix to cluster. |
dist_method |
Distance method to use when calculating the sorting order of the matrix. Argument is directly passed into stats::dist. Options include "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski". |
hclust_method |
Agglomerative method to use when calculating the sorting order by stats::hclust. |
A numeric vector of the ordering derived by the specified hierarchical clustering method applied to the provided matrix.
# dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     list(anxiety, "anxiety", "behaviour", "ordinal"),
#     list(depress, "depressed", "behaviour", "ordinal"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(
#     dl = dl,
#     n_solutions = 20,
#     min_k = 20,
#     max_k = 50
# )
#
# sol_df <- batch_snf(dl, sc)
#
# ext_sol_df <- extend_solutions(
#     sol_df,
#     dl = dl,
#     min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
# )
#
# # Calculate pairwise similarities between cluster solutions
# sol_aris <- calc_aris(sol_df)
#
# # Extract hierarchical clustering order of the cluster solutions
# meta_cluster_order <- get_matrix_order(sol_aris)
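Since dist_method is passed to stats::dist and the ordering comes from hierarchical clustering, the ordering step can be approximated in base R as below. This is a sketch under those assumptions; matrix_order_sketch and the toy matrix are hypothetical names, and the package's internals may differ.

```r
# Hypothetical re-creation of the ordering step using only base R
matrix_order_sketch <- function(m,
                                dist_method = "euclidean",
                                hclust_method = "complete") {
    # Distances between the rows of the matrix
    row_dists <- stats::dist(m, method = dist_method)
    # Hierarchical clustering; `$order` is the leaf order of the dendrogram
    stats::hclust(row_dists, method = hclust_method)$order
}

# Toy ARI-like matrix: solutions 1 and 2 are similar, solution 3 is not
toy_aris <- matrix(
    c(1.0, 0.9, 0.1,
      0.9, 1.0, 0.2,
      0.1, 0.2, 1.0),
    nrow = 3,
    byrow = TRUE
)

toy_order <- matrix_order_sketch(toy_aris)
toy_order
```

In the resulting order, solutions 1 and 2 sit next to each other because they are merged first by the clustering.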
This function can be used to neatly format the p-values associated with an extended solutions data frame. It can also calculate the negative logs of those p-values to make it easier to interpret large-scale differences.
get_pvals(ext_sol_df, negative_log = FALSE, keep_summaries = TRUE)
ext_sol_df |
The output of extend_solutions(). |
negative_log |
If TRUE, will replace p-values with negative log p-values. |
keep_summaries |
If FALSE, will remove the mean, min, and max p-value. |
A "data.frame" class object containing only the p-value related columns of the provided ext_sol_df.
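The negative log transformation can be illustrated on a toy data frame. The base of the logarithm used by get_pvals() is not stated here, so base 10 is assumed purely for illustration; the column names below are also hypothetical.

```r
# Toy stand-in for the p-value columns of an extended solutions data frame
toy_pvals <- data.frame(
    solution = 1:2,
    age_pval = c(0.05, 1e-4),
    income_pval = c(0.5, 1e-10)
)

# Select the columns whose names end in "_pval"
pval_cols <- grep("_pval$", colnames(toy_pvals), value = TRUE)

# Replace each p-value with its negative log (base 10 is an assumption)
neg_log_pvals <- toy_pvals
neg_log_pvals[pval_cols] <- lapply(
    toy_pvals[pval_cols],
    function(p) -log10(p)
)

neg_log_pvals
```

On this scale, a difference between p = 1e-4 and p = 1e-10 becomes a difference between 4 and 10, which is easier to compare at a glance than the raw values.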
Following clustering with batch_snf, a matrix of pairwise ARIs that shows how related each cluster solution is to each other can be generated by the calc_aris function. Partitioning of the ARI matrix can be done by visual inspection of meta_cluster_heatmap() results or by shiny_annotator. Given the indices of meta cluster boundaries, this function will return a single representative solution from each meta cluster based on maximum average ARI to all other solutions within that meta cluster.
get_representative_solutions(aris, sol_df, filter_fn = NULL)
aris |
Matrix of adjusted Rand indices from calc_aris(). |
sol_df |
Output of |
filter_fn |
Optional function to filter the meta-cluster by prior to
maximum average ARI determination. This can be useful if you are explicitly
trying to select a solution that meets a certain condition, such as only
picking from the 4 cluster solutions within a meta cluster. An example
valid function could be |
The provided solutions data frame reduced to just one row per meta cluster defined by the split vector.
# dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     list(anxiety, "anxiety", "behaviour", "ordinal"),
#     list(depress, "depressed", "behaviour", "ordinal"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(
#     dl = dl,
#     n_solutions = 20,
#     min_k = 20,
#     max_k = 50
# )
#
# sol_df <- batch_snf(dl, sc)
#
# ext_sol_df <- extend_solutions(
#     sol_df,
#     dl = dl,
#     min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
# )
#
# # Calculate pairwise similarities between cluster solutions
# sol_aris <- calc_aris(sol_df)
#
# # Extract hierarchical clustering order of the cluster solutions
# meta_cluster_order <- get_matrix_order(sol_aris)
#
# # Identify meta cluster boundaries with shiny app or trial and error
# # ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
# # shiny_annotator(ari_hm)
#
# # Result of meta cluster examination
# split_vec <- c(2, 5, 12, 17)
#
# ext_sol_df <- label_meta_clusters(ext_sol_df, split_vec, meta_cluster_order)
#
# # Extracting representative solutions from each defined meta cluster
# rep_solutions <- get_representative_solutions(sol_aris, ext_sol_df)
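The selection rule described above (maximum average ARI to the other solutions in the same meta cluster) can be sketched in base R. The function representative_sketch, the toy ARI matrix, and the label vector are hypothetical; ties are broken here by taking the first solution, which may not match the package's behaviour.

```r
# Hypothetical sketch: for each meta cluster, pick the solution whose average
# ARI to the other solutions in that meta cluster is highest.
representative_sketch <- function(aris, mc_labels) {
    members <- split(seq_along(mc_labels), mc_labels)
    vapply(members, function(idx) {
        # A meta cluster with a single solution represents itself
        if (length(idx) == 1) {
            return(idx)
        }
        sub_aris <- aris[idx, idx, drop = FALSE]
        # Ignore each solution's ARI with itself
        diag(sub_aris) <- NA
        idx[which.max(rowMeans(sub_aris, na.rm = TRUE))]
    }, integer(1))
}

# Toy ARI matrix for 5 solutions in two meta clusters (A: 1-3, B: 4-5)
toy_aris <- matrix(0.1, nrow = 5, ncol = 5)
diag(toy_aris) <- 1
toy_aris[1, 2] <- toy_aris[2, 1] <- 0.9
toy_aris[1, 3] <- toy_aris[3, 1] <- 0.5
toy_aris[2, 3] <- toy_aris[3, 2] <- 0.8
toy_aris[4, 5] <- toy_aris[5, 4] <- 0.6

reps <- representative_sketch(toy_aris, c("A", "A", "A", "B", "B"))
reps
```

In meta cluster A, solution 2 has the highest average ARI to its neighbours (0.85, versus 0.70 and 0.65), so it is chosen as the representative.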
Like the mock data frame "abcd_h_income", but with "unique_id" as the "uid".
income
A data frame with 300 rows and 2 columns:
The unique identifier of the ABCD dataset
Household income in 3 category levels (low = 1, medium = 2, high = 3)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development℠ (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Given an object, returns TRUE if that object inherits from the data_list class.
is_data_list(x)
x |
An object. |
TRUE if the object inherits from the data_list class.
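A class check of this kind is typically a thin wrapper around base R's inherits(). The sketch below uses a hypothetical function name and a mock object built with structure(); it is not the package's own code.

```r
# Hypothetical equivalent of the check described above
is_data_list_sketch <- function(x) {
    inherits(x, "data_list")
}

# Mock object that merely carries the class attribute (not a real data list)
mock_dl <- structure(list(), class = c("data_list", "list"))

is_data_list_sketch(mock_dl)
is_data_list_sketch(list())
```

The first call returns TRUE and the second returns FALSE, since a plain list does not carry the data_list class.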
Jitter plot separating a feature by cluster
jitter_plot(df, feature)
df |
A data.frame containing cluster column and the feature to plot. |
feature |
The feature to plot. |
A jitter+violin plot (class "gg", "ggplot") showing the distribution of a feature across clusters.
Given a solutions data frame or extended solutions data frame class object and a numeric vector indicating which rows correspond to which meta clusters, assigns meta clustering information to the "meta_clusters" attribute of the data frame.
label_meta_clusters(sol_df, split_vector, order = NULL)
sol_df |
A solutions data frame or extended solutions data frame to assign meta clusters to. |
split_vector |
A numeric vector indicating which rows of sol_df should be the split points for meta cluster labeling. |
order |
An optional numeric vector indicating how the solutions data frame should be reordered prior to meta cluster labeling. This vector can be obtained by running get_matrix_order() on an ARI matrix. |
A solutions data frame with a populated "meta_clusters" attribute.
#dl <- data_list(
#    list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
#    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#    list(income, "household_income", "demographics", "continuous"),
#    list(pubertal, "pubertal_status", "demographics", "continuous"),
#    uid = "unique_id"
#)
#
#set.seed(42)
#my_sc <- snf_config(
#    dl = dl,
#    n_solutions = 20,
#    min_k = 20,
#    max_k = 50
#)
#
#sol_df <- batch_snf(dl, my_sc)
#
#sol_df
#
#sol_aris <- calc_aris(sol_df)
#
#meta_cluster_order <- get_matrix_order(sol_aris)
#
## `split_vec` found by iteratively plotting ari_hm or by ?shiny_annotator()
#split_vec <- c(6, 10, 16)
#ari_hm <- meta_cluster_heatmap(
#    sol_aris,
#    order = meta_cluster_order,
#    split_vector = split_vec
#)
#
#mc_sol_df <- label_meta_clusters(
#    sol_df,
#    order = meta_cluster_order,
#    split_vector = split_vec
#)
#
#mc_sol_df
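One way to picture how a split vector partitions an ordered set of solutions is sketched below. It assumes that each split index marks the first row of a new meta cluster and that meta clusters are labeled with consecutive letters; both are assumptions made for illustration, and split_to_labels is a hypothetical helper rather than a package function.

```r
# Hypothetical helper: convert split points into one label per ordered row.
# Assumption: each split index is the first row of a new meta cluster.
split_to_labels <- function(n_solutions, split_vector) {
    # findInterval() counts how many split points are <= each row index
    LETTERS[findInterval(seq_len(n_solutions), split_vector) + 1]
}

mc_labels <- split_to_labels(20, c(6, 10, 16))
mc_labels
table(mc_labels)
```

Under that assumption, the 20 ordered solutions fall into meta clusters A (rows 1 to 5), B (rows 6 to 9), C (rows 10 to 15), and D (rows 16 to 20).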
Given a solutions data frame containing clustered observations and a data list containing those clustered observations as well as additional to-be-clustered observations, this function will re-run SNF to generate a similarity matrix of all observations and use the label propagation algorithm to assign predicted clusters to the unclustered observations.
label_propagate(partial_sol_df, full_dl, verbose = FALSE)
partial_sol_df |
A solutions data frame derived from the training set. |
full_dl |
A data list containing observations from both the training and testing sets. |
verbose |
If TRUE, output progress to console. |
A data frame with one row per observation containing a column for UIDs, a column for whether the observation was in the train (original) or test (held out) set, and one column per row of the solutions data frame indicating the original and propagated clusters.
## Function to identify observations with complete data
#uids_with_complete_obs <- get_complete_uids(
#    list(subc_v, income, pubertal, anxiety, depress),
#    uid = "unique_id"
#)
#
## Dataframe assigning 80% of observations to train and 20% to test
#train_test_split <- train_test_assign(
#    train_frac = 0.8,
#    uids = uids_with_complete_obs
#)
#
## Pulling the training and testing observations specifically
#train_obs <- train_test_split$"train"
#test_obs <- train_test_split$"test"
#
## Partition a training set
#train_subc_v <- subc_v[subc_v$"unique_id" %in% train_obs, ]
#train_income <- income[income$"unique_id" %in% train_obs, ]
#train_pubertal <- pubertal[pubertal$"unique_id" %in% train_obs, ]
#train_anxiety <- anxiety[anxiety$"unique_id" %in% train_obs, ]
#train_depress <- depress[depress$"unique_id" %in% train_obs, ]
#
## Partition a test set
#test_subc_v <- subc_v[subc_v$"unique_id" %in% test_obs, ]
#test_income <- income[income$"unique_id" %in% test_obs, ]
#test_pubertal <- pubertal[pubertal$"unique_id" %in% test_obs, ]
#test_anxiety <- anxiety[anxiety$"unique_id" %in% test_obs, ]
#test_depress <- depress[depress$"unique_id" %in% test_obs, ]
#
## Find cluster solutions in the training set
#train_dl <- data_list(
#    list(train_subc_v, "subc_v", "neuroimaging", "continuous"),
#    list(train_income, "household_income", "demographics", "continuous"),
#    list(train_pubertal, "pubertal_status", "demographics", "continuous"),
#    uid = "unique_id"
#)
#
## We'll pick a solution that has good separation over our target features
#train_target_dl <- data_list(
#    list(train_anxiety, "anxiety", "behaviour", "ordinal"),
#    list(train_depress, "depressed", "behaviour", "ordinal"),
#    uid = "unique_id"
#)
#
#sc <- snf_config(
#    train_dl,
#    n_solutions = 5,
#    min_k = 10,
#    max_k = 30
#)
#
#train_sol_df <- batch_snf(
#    train_dl,
#    sc,
#    return_sim_mats = TRUE
#)
#
#ext_sol_df <- extend_solutions(
#    train_sol_df,
#    train_target_dl
#)
#
## Determining solution with the lowest minimum p-value
#lowest_min_pval <- min(ext_sol_df$"min_pval")
#top_idx <- which(ext_sol_df$"min_pval" == lowest_min_pval)[1]
#top_row <- ext_sol_df[top_idx, ]
#
## Propagate that solution to the observations in the test set
## data list below has both training and testing observations
#full_dl <- data_list(
#    list(subc_v, "subc_v", "neuroimaging", "continuous"),
#    list(income, "household_income", "demographics", "continuous"),
#    list(pubertal, "pubertal_status", "demographics", "continuous"),
#    uid = "unique_id"
#)
#
## Use the solutions data frame from the training observations and the data list
## from the training and testing observations to propagate labels to the test observations
#propagated_labels <- label_propagate(top_row, full_dl)
#
#propagated_labels_all <- label_propagate(ext_sol_df, full_dl)
#
#head(propagated_labels_all)
#tail(propagated_labels_all)
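For intuition, a generic label propagation scheme over a similarity matrix can be sketched in base R. This is a textbook-style illustration with clamping of the known labels, written under the assumption that similarities are non-negative; label_prop_sketch and its n_iter argument are hypothetical, and the package's own routine may differ in its details.

```r
# Hypothetical sketch of label propagation on a similarity matrix.
# `labels` holds known cluster labels and NA for unlabelled observations.
label_prop_sketch <- function(sim, labels, n_iter = 100) {
    known <- !is.na(labels)
    classes <- sort(unique(labels[known]))
    # Row-normalise similarities into transition probabilities
    trans <- sim / rowSums(sim)
    # One-hot matrix of the known labels (rows of zeros for unknown ones)
    y_init <- matrix(0, nrow = length(labels), ncol = length(classes))
    y_init[cbind(which(known), match(labels[known], classes))] <- 1
    y <- y_init
    for (i in seq_len(n_iter)) {
        # Spread label mass to neighbours, then clamp the known labels
        y <- trans %*% y
        y[known, ] <- y_init[known, ]
    }
    classes[max.col(y, ties.method = "first")]
}

# Toy similarity matrix: observations 1-2 are alike, as are 3-4
toy_sim <- matrix(
    c(1.0, 0.9, 0.1, 0.1,
      0.9, 1.0, 0.1, 0.1,
      0.1, 0.1, 1.0, 0.9,
      0.1, 0.1, 0.9, 1.0),
    nrow = 4,
    byrow = TRUE
)

# Observations 2 and 4 are "held out" and receive propagated labels
propagated <- label_prop_sketch(toy_sim, labels = c(1, NA, 2, NA))
propagated
```

Each held-out observation takes the label of the training observation it is most similar to, which is the behaviour label propagation is meant to generalize to larger networks.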
Given a data list to correct and another data list of categorical features to linearly adjust for, corrects the first data list based on the residuals of the linear model relating the numeric features in the first data list to the unwanted signal features in the second data list.
linear_adjust(dl, unwanted_signal_list, sig_digs = NULL)
dl |
A nested list of input data from data_list(). |
unwanted_signal_list |
A data list of categorical features that should have their mean differences removed in the first data list. |
sig_digs |
Number of significant digits to round the residuals to. |
A data list ("list") in which each data component has been converted to contain the residuals of the linear model built against the features in the unwanted_signal_list.
has_tutor <- sample(c(1, 0), size = 9, replace = TRUE)
math_score <- 70 + 30 * has_tutor + rnorm(9, mean = 0, sd = 5)
math_df <- data.frame(uid = paste0("id_", 1:9), math = math_score)
tutor_df <- data.frame(uid = paste0("id_", 1:9), tutor = has_tutor)

dl <- data_list(
    list(math_df, "math_score", "school", "continuous"),
    uid = "uid"
)

adjustment_dl <- data_list(
    list(tutor_df, "tutoring", "school", "categorical"),
    uid = "uid"
)

adjusted_dl <- linear_adjust(dl, adjustment_dl)
adjusted_dl[[1]]$"data"$"math"

# Equivalent to:
as.numeric(resid(lm(math_score ~ has_tutor)))
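The effect of this kind of adjustment can be checked with base R alone: after regressing a feature on a categorical variable, the residuals have the same mean in every group. The sketch below fixes the group sizes and seed for reproducibility; these values are illustrative choices, not package defaults.

```r
set.seed(1)

# Five tutored and four untutored students (illustrative group sizes)
has_tutor <- rep(c(1, 0), times = c(5, 4))
math_score <- 70 + 30 * has_tutor + rnorm(9, mean = 0, sd = 5)

# Raw scores differ strongly between the two groups
raw_means <- tapply(math_score, has_tutor, mean)

# Residuals of the linear model remove the group mean difference
adjusted <- as.numeric(resid(lm(math_score ~ has_tutor)))
adjusted_means <- tapply(adjusted, has_tutor, mean)

raw_means
adjusted_means
```

The adjusted group means are both zero up to floating point error, so the tutoring signal no longer separates the observations.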
Given a data frame of representative meta cluster solutions (see get_representative_solutions()), returns a Manhattan plot showing feature separation across all features in the provided data/target lists.
mc_manhattan_plot(
    ext_sol_df,
    dl = NULL,
    target_dl = NULL,
    variable_order = NULL,
    neg_log_pval_thresh = 5,
    threshold = NULL,
    point_size = 5,
    text_size = 20,
    plot_title = NULL,
    xints = NULL,
    hide_x_labels = FALSE,
    domain_colours = NULL
)
ext_sol_df |
A sol_df that contains "_pval" columns containing the values to be plotted. This object is the output of extend_solutions(). |
dl |
List of data frames containing data information. |
target_dl |
List of data frames containing target information. |
variable_order |
Order of features to be displayed in the plot. |
neg_log_pval_thresh |
Threshold for negative log p-values. |
threshold |
p-value threshold to plot horizontal dashed line at. |
point_size |
Size of points in the plot. |
text_size |
Size of text in the plot. |
plot_title |
Title of the plot. |
xints |
Either "outcomes" or a vector of numeric values to plot vertical lines at. |
hide_x_labels |
If TRUE, hides x-axis labels. |
domain_colours |
Named vector of colours for domains. |
A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against each solution in the provided solutions data frame, stratified by meta cluster label.
# dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     list(anxiety, "anxiety", "behaviour", "ordinal"),
#     list(depress, "depressed", "behaviour", "ordinal"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(
#     dl = dl,
#     n_solutions = 20,
#     min_k = 20,
#     max_k = 50
# )
#
# sol_df <- batch_snf(dl, sc)
#
# ext_sol_df <- extend_solutions(
#     sol_df,
#     dl = dl,
#     min_pval = 1e-10 # p-values below 1e-10 will be thresholded to 1e-10
# )
#
# # Calculate pairwise similarities between cluster solutions
# sol_aris <- calc_aris(sol_df)
#
# # Extract hierarchical clustering order of the cluster solutions
# meta_cluster_order <- get_matrix_order(sol_aris)
#
# # Identify meta cluster boundaries with shiny app or trial and error
# # ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
# # shiny_annotator(ari_hm)
#
# # Result of meta cluster examination
# split_vec <- c(2, 5, 12, 17)
#
# ext_sol_df <- label_meta_clusters(ext_sol_df, split_vec, meta_cluster_order)
#
# # Extracting representative solutions from each defined meta cluster
# rep_solutions <- get_representative_solutions(sol_aris, ext_sol_df)
#
# mc_manhattan <- mc_manhattan_plot(
#     rep_solutions,
#     dl = dl,
#     point_size = 3,
#     text_size = 12,
#     plot_title = "Feature-Meta Cluster Associations",
#     threshold = 0.05,
#     neg_log_pval_thresh = 5
# )
#
# mc_manhattan
This helper function combines all data frames in a single-level list into a single data frame.
merge_df_list(df_list, join = "inner", uid = "uid", no_na = FALSE)
df_list |
list of data frames. |
join |
String indicating if join should be "inner" or "full". |
uid |
Column name to join on. Default is "uid". |
no_na |
Whether to remove NA values from the merged data frame. |
A data frame containing the inner join (or the full join, if join = "full") of all data frames in the list.
merge_df_list(list(income, pubertal), uid = "unique_id")
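The difference between the "inner" and "full" joins, and the effect of no_na, can be sketched with base R's merge() and Reduce(). The helper merge_sketch and the toy data frames are hypothetical and only mirror the documented arguments; the package's implementation may differ.

```r
# Hypothetical base R sketch of merging a list of data frames on a UID column
merge_sketch <- function(df_list, join = "inner", uid = "uid", no_na = FALSE) {
    merged <- Reduce(
        function(x, y) merge(x, y, by = uid, all = (join == "full")),
        df_list
    )
    # Optionally drop rows that contain any NA after merging
    if (no_na) {
        merged <- stats::na.omit(merged)
    }
    merged
}

df_a <- data.frame(uid = c("1", "2", "3"), x = c(1, 2, 3))
df_b <- data.frame(uid = c("2", "3", "4"), y = c(10, 20, 30))

inner_merged <- merge_sketch(list(df_a, df_b))
full_merged <- merge_sketch(list(df_a, df_b), join = "full")
full_no_na <- merge_sketch(list(df_a, df_b), join = "full", no_na = TRUE)

nrow(inner_merged)
nrow(full_merged)
nrow(full_no_na)
```

The inner join keeps only the two UIDs shared by both data frames, the full join keeps all four and fills the gaps with NA, and no_na = TRUE reduces the full join back to the complete rows.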
Join two data lists with the same components (data frames) but separate observations. To instead merge two data lists that have the same observations but different components, simply use c().
## S3 method for class 'data_list'
merge(x, y, ...)
x |
The first data list to merge. |
y |
The second data list to merge. |
... |
Additional arguments passed into merge function. |
A data list ("list"-class object) containing the observations of both provided data lists.
Merge method for SNF config objects
## S3 method for class 'snf_config'
merge(x, y, reset_indices = TRUE, ...)
x |
SNF config to merge. |
y |
SNF config to merge. |
reset_indices |
If TRUE (default), re-labels the "solutions" indices in the config from 1 to the number of defined settings. |
... |
Additional arguments passed into merge function. |
An SNF config combining the rows of both prior configurations.
Heatmap of pairwise adjusted rand indices between solutions
meta_cluster_heatmap( aris, order = NULL, cluster_rows = FALSE, cluster_columns = FALSE, log_graph = FALSE, scale_diag = "none", min_colour = "#282828", max_colour = "firebrick2", col = circlize::colorRamp2(c(min(aris), max(aris)), c(min_colour, max_colour)), ... )
aris |
Matrix of adjusted rand indices from |
order |
Numeric vector containing row order of the heatmap. |
cluster_rows |
Whether rows should be clustered. |
cluster_columns |
Whether columns should be clustered. |
log_graph |
If TRUE, log transforms the graph. |
scale_diag |
Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
col |
Colour ramp to use for the heatmap. |
... |
Additional parameters passed to |
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the pairwise adjusted Rand indices (similarities) between the cluster solutions of the provided solutions data frame.
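For readers unfamiliar with the quantity being plotted, the sketch below computes adjusted Rand indices between a few toy cluster solutions in base R, using the standard pair-counting formula. The helper adjusted_rand_sketch and the toy label vectors are hypothetical, and this is not necessarily how the package computes its ARI matrix.

```r
adjusted_rand_sketch <- function(labels_1, labels_2) {
    tab <- table(labels_1, labels_2)
    pairs_both <- sum(choose(tab, 2))         # pairs grouped together in both solutions
    pairs_1 <- sum(choose(rowSums(tab), 2))   # pairs grouped together in solution 1
    pairs_2 <- sum(choose(colSums(tab), 2))   # pairs grouped together in solution 2
    expected <- pairs_1 * pairs_2 / choose(sum(tab), 2)
    (pairs_both - expected) / ((pairs_1 + pairs_2) / 2 - expected)
}

solutions <- rbind(
    s1 = c(1, 1, 1, 2, 2, 2),
    s2 = c(2, 2, 2, 1, 1, 1),  # same partition as s1 with labels swapped
    s3 = c(1, 1, 2, 2, 3, 3)
)

n <- nrow(solutions)
ari <- matrix(1, n, n, dimnames = list(rownames(solutions), rownames(solutions)))
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        ari[i, j] <- adjusted_rand_sketch(solutions[i, ], solutions[j, ])
    }
}
ari
```

Because the index ignores the label values themselves, s1 and s2 score 1 against each other, while s3 scores well below 1 against both.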
#dl <- data_list(
#    list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
#    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#    list(income, "household_income", "demographics", "continuous"),
#    list(pubertal, "pubertal_status", "demographics", "continuous"),
#    uid = "unique_id"
#)
#
#set.seed(42)
#my_sc <- snf_config(
#    dl = dl,
#    n_solutions = 20,
#    min_k = 20,
#    max_k = 50
#)
#
#sol_df <- batch_snf(dl, my_sc)
#
#sol_df
#
#sol_aris <- calc_aris(sol_df)
#
#meta_cluster_order <- get_matrix_order(sol_aris)
#
## `split_vec` found by iteratively plotting ari_hm or by ?shiny_annotator()
#split_vec <- c(6, 10, 16)
#ari_hm <- meta_cluster_heatmap(
#    sol_aris,
#    order = meta_cluster_order,
#    split_vector = split_vec
#)
Modification of SNFtool mock data frame "Data2"
methylation_df
A data frame with 200 rows and 3 columns:
Mock gene methylation feature
Mock gene methylation feature
Random three-digit number uniquely identifying the patient
This data came from the SNFtool package, with slight modifications.
Extract number of features stored in an object
n_features(x)
x |
The object to extract number of features from. |
The number of features in x.
Extract number of observations stored in an object
n_observations(x)
x |
The object to extract number of observations from. |
The number of observations in x.
Constructor for solutions_df class object
new_solutions_df(sol_dfl)
sol_dfl |
A solutions data frame-like object to be validated and converted into a solutions data frame. |
A solutions_df class object.
ari_matrix
Custom formatted print for ARI matrix objects.
## S3 method for class 'ari_matrix'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
clust_fns_list
Custom formatted print for clustering functions list objects that outputs information about the contained clustering functions to the console.
## S3 method for class 'clust_fns_list'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
data_list
Custom formatted print for data list objects that outputs information about the contained observations and components to the console.
## S3 method for class 'data_list'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
dist_fns_list
Custom formatted print for distance metrics list objects that outputs information about the contained distance metrics to the console.
## S3 method for class 'dist_fns_list'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
ext_solutions_df
Custom formatted print for extended solutions data frame class objects.
## S3 method for class 'ext_solutions_df'
print(x, n = NULL, ...)
x |
A |
n |
Number of rows to print, passed into |
... |
Other arguments passed to |
Function prints to console but does not return any value.
settings_df
Custom formatted print for settings data frame that outputs information about SNF hyperparameters to the console.
## S3 method for class 'settings_df'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
snf_config
Custom formatted print for SNF config
## S3 method for class 'snf_config'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
solutions_df
Custom formatted print for solutions data frame class objects.
## S3 method for class 'solutions_df'
print(x, n = NULL, tips = TRUE, ...)
x |
A |
n |
Number of rows to print, passed into |
tips |
If TRUE, include lines on how to print more rows / transposed. |
... |
Other arguments passed to |
Function prints to console but does not return any value.
t_ext_solutions_df
Custom formatted print for transposed extended solutions data frame class objects.
## S3 method for class 't_ext_solutions_df'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
t_solutions_df
Custom formatted print for transposed solutions data frame class objects.
## S3 method for class 't_solutions_df'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
weights_matrix
Custom formatted print for weights matrices that outputs information about feature weights functions to the console.
## S3 method for class 'weights_matrix'
print(x, ...)
x |
A |
... |
Other arguments passed to |
Function prints to console but does not return any value.
Like the mock data frame "abcd_pubertal", but with "unique_id" as the "uid".
pubertal
A data frame with 275 rows and 2 columns:
The unique identifier of the ABCD dataset
Average reported pubertal status between child and parent (1-5 categorical scale)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Heatmap of p-values
pval_heatmap( ext_sol_df, order = NULL, cluster_columns = TRUE, cluster_rows = FALSE, show_row_names = FALSE, show_column_names = TRUE, min_colour = "red2", max_colour = "white", legend_breaks = c(0, 1), col = circlize::colorRamp2(legend_breaks, c(min_colour, max_colour)), heatmap_legend_param = list(color_bar = "continuous", title = "p-value", at = c(0, 1)), rect_gp = grid::gpar(col = "black"), column_split_vector = NULL, row_split_vector = NULL, column_split = NULL, row_split = NULL, ... )
ext_sol_df |
An ext_solutions_df class object (produced from
the function |
order |
Numeric vector containing row order of the heatmap. |
cluster_columns |
Whether columns should be sorted by hierarchical clustering. |
cluster_rows |
Whether rows should be sorted by hierarchical clustering. |
show_row_names |
Whether row names should be shown. |
show_column_names |
Whether column names should be shown. |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
legend_breaks |
Numeric vector of breaks for the legend. |
col |
Colour function for |
heatmap_legend_param |
Legend function for |
rect_gp |
Cell border function for |
column_split_vector |
Vector of indices to split columns by. |
row_split_vector |
Vector of indices to split rows by. |
column_split |
Standard parameter of |
row_split |
Standard parameter of |
... |
Additional parameters passed to |
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the provided p-values.
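The default colour mapping runs from min_colour at a p-value of 0 to max_colour at a p-value of 1. The base R sketch below reproduces that idea with grDevices::colorRamp(); the helper name pval_colour_sketch is hypothetical, and circlize::colorRamp2() (used by the actual default) may interpolate intermediate colours differently, so treat this only as an approximation of how the legend should be read.

```r
pval_colour_sketch <- function(p, min_colour = "red2", max_colour = "white") {
    # colorRamp() returns a function mapping values in [0, 1] to RGB triplets.
    ramp <- grDevices::colorRamp(c(min_colour, max_colour))
    grDevices::rgb(ramp(p), maxColorValue = 255)
}

# Small p-values map close to red2; large p-values fade towards white.
pval_colour_sketch(c(0, 0.05, 0.5, 1))
```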
#dl <- data_list(
#    list(income, "household_income", "demographics", "ordinal"),
#    list(pubertal, "pubertal_status", "demographics", "continuous"),
#    list(fav_colour, "favourite_colour", "demographics", "categorical"),
#    list(anxiety, "anxiety", "behaviour", "ordinal"),
#    list(depress, "depressed", "behaviour", "ordinal"),
#    uid = "unique_id"
#)
#
#sc <- snf_config(
#    dl,
#    n_solutions = 4,
#    dropout_dist = "uniform",
#    max_k = 50
#)
#
#sol_df <- batch_snf(dl, sc)
#
#ext_sol_df <- extend_solutions(sol_df, dl)
#
#pval_heatmap(ext_sol_df)
These functions calculate conventional metrics of cluster solution quality.
calculate_silhouettes(sol_df)
calculate_dunn_indices(sol_df)
calculate_db_indices(sol_df)
sol_df |
A |
calculate_silhouettes: A wrapper for cluster::silhouette that calculates
silhouette scores for all cluster solutions in a provided solutions data
frame. Silhouette values range from -1 to +1 and indicate an overall ratio
of how close together observations within a cluster are to how far apart
observations across clusters are, with higher values indicating better
separated clusters. You can learn more about interpreting the results of
this function by calling ?cluster::silhouette.
calculate_dunn_indices: A wrapper for clv::clv.Dunn that calculates
Dunn indices for all cluster solutions in a provided solutions data
frame. Like silhouette scores, Dunn indices reflect similarity within
clusters and separation across clusters. You can learn more about
interpreting the results of this function by calling ?clv::clv.Dunn.
calculate_db_indices: A wrapper for clv::clv.Davies.Bouldin that
calculates Davies-Bouldin indices for all cluster solutions in a provided
solutions data frame. These values also reflect similarity within clusters
and separation across clusters, although lower Davies-Bouldin values
indicate better separated clusters. You can learn more about interpreting
the results of this function by calling ?clv::clv.Davies.Bouldin.
A list of silhouette class objects, a vector of Dunn indices, or a
vector of Davies-Bouldin indices depending on which function was used.
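To make the shared idea behind these metrics concrete, here is a minimal base R sketch of one simple form of the Dunn index: the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster. The helper name dunn_sketch and the toy data are hypothetical, and clv::clv.Dunn may use other definitions of inter- and intra-cluster distance, so its values need not match this version.

```r
dunn_sketch <- function(d, clusters) {
    d <- as.matrix(d)
    same_cluster <- outer(clusters, clusters, "==")
    off_diagonal <- row(d) != col(d)
    # Separation: closest pair of observations from different clusters.
    min_between <- min(d[!same_cluster])
    # Compactness: most distant pair of observations within a cluster.
    max_within <- max(d[same_cluster & off_diagonal])
    min_between / max_within
}

x <- c(0, 1, 2, 10, 11, 12)
distances <- stats::dist(x)

dunn_sketch(distances, c(1, 1, 1, 2, 2, 2))  # compact, well-separated clusters
dunn_sketch(distances, c(1, 2, 1, 2, 1, 2))  # interleaved clusters score much lower
```

Higher values of this ratio correspond to tighter, better separated clusters.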
input_dl <- data_list(
    list(gender_df, "gender", "demographics", "categorical"),
    list(diagnosis_df, "diagnosis", "clinical", "categorical"),
    uid = "patient_id"
)

sc <- snf_config(input_dl, n_solutions = 5)

sol_df <- batch_snf(input_dl, sc, return_sim_mats = TRUE)

# calculate Davies-Bouldin indices
davies_bouldin_indices <- calculate_db_indices(sol_df)

# calculate Dunn indices
dunn_indices <- calculate_dunn_indices(sol_df)

# calculate silhouette scores
silhouette_scores <- calculate_silhouettes(sol_df)
Helper function to contribute to rows within the settings data frame. Number of columns removed follows a uniform or exponential probability distribution.
random_removal( columns, min_removed_inputs, max_removed_inputs, dropout_dist = "exponential" )
columns |
Columns of the settings_df that are passed in |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. |
dropout_dist |
Indication of how input data frames should be dropped. Can be "none" (no dropout), "uniform" (uniformly draw a number between min and max removed inputs), or "exponential" (like uniform, but using an exponential distribution; default). |
inclusions_df data frame that can be rbind'ed to the settings_df
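The difference between the dropout distributions can be sketched as follows. The exact weighting the package uses for "exponential" is not stated here, so the weights below (each additional removal being e times less likely) are an assumption, and draw_n_removed_sketch is a hypothetical helper rather than part of the package.

```r
draw_n_removed_sketch <- function(min_removed, max_removed, dropout_dist = "exponential") {
    if (dropout_dist == "none") {
        return(0)
    }
    candidates <- seq(min_removed, max_removed)
    weights <- switch(
        dropout_dist,
        uniform = rep(1, length(candidates)),
        # Assumed weighting: each extra removal is e times less likely.
        exponential = exp(-(candidates - min_removed)),
        stop("Unknown dropout_dist: ", dropout_dist)
    )
    # Index with sample.int() so that a single candidate is returned as-is.
    candidates[sample.int(length(candidates), size = 1, prob = weights)]
}

set.seed(42)
uniform_draws <- replicate(1000, draw_n_removed_sketch(0, 3, "uniform"))
exponential_draws <- replicate(1000, draw_n_removed_sketch(0, 3, "exponential"))

table(uniform_draws)      # roughly even counts across 0 to 3
table(exponential_draws)  # mostly 0 or 1 inputs removed
```

Under an exponential-style scheme most runs keep nearly all inputs, while uniform dropout explores small input subsets just as often as large ones.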
Row-binding of extended solutions data frame class objects.
## S3 method for class 'ext_solutions_df'
rbind(..., reset_indices = FALSE)
... |
An arbitrary number of |
reset_indices |
If TRUE, re-labels the "solutions" indices in the solutions data frame from 1 to the number of defined settings. |
An ext_solutions_df class object.
Row-binding of solutions data frame class objects.
## S3 method for class 'solutions_df'
rbind(..., reset_indices = FALSE)
... |
An arbitrary number of |
reset_indices |
If TRUE, re-labels the "solutions" indices in the solutions data frame from 1 to the number of defined settings. |
A solutions_df class object.
Rename features in a data list
rename_dl(dl, name_mapping)
dl |
A nested list of input data from |
name_mapping |
A named vector where the values are the features to be renamed and the names are the new names for those features. |
A data list ("list"-class object) with adjusted feature names.
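The direction of the mapping (names are the new feature names, values are the existing ones) is easy to get backwards. This base R sketch applies the same convention to the columns of a single toy data frame; rename_features_sketch and the toy data are hypothetical and do not operate on real data list objects.

```r
rename_features_sketch <- function(df, name_mapping) {
    # Values of the mapping are current column names; names are the replacements.
    positions <- match(name_mapping, colnames(df))
    found <- !is.na(positions)
    colnames(df)[positions[found]] <- names(name_mapping)[found]
    df
}

toy_df <- data.frame(
    unique_id = c("a", "b"),
    cbcl_anxiety_r = c(0, 1),
    cbcl_depress_r = c(1, 2)
)

name_changes <- c(
    "anxiety_score" = "cbcl_anxiety_r",
    "depression_score" = "cbcl_depress_r"
)

renamed <- rename_features_sketch(toy_df, name_changes)
colnames(renamed)
```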
library(metasnf)

dl <- data_list(
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

summary(dl, "feature")

name_changes <- c(
    "anxiety_score" = "cbcl_anxiety_r",
    "depression_score" = "cbcl_depress_r"
)

dl <- rename_dl(dl, name_changes)

summary(dl, "feature")
Like sample, but when given a single value x, returns that single value instead of a random value from 1 to x.
resample(x, ...)
x |
Vector or single value to sample from |
... |
Remaining arguments for base::sample function |
Numeric vector result of running base::sample.
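The pitfall this helper avoids is that base::sample() treats a single number x as the range 1:x. A common idiom for a safer version, shown in the base R documentation for sample, is to sample positions rather than values. The name resample_sketch is used here to make clear that this is an illustration and may differ from the package's own definition.

```r
resample_sketch <- function(x, ...) {
    # Sample positions within x, so a length-one x can only return itself.
    x[sample.int(length(x), ...)]
}

set.seed(1)
sample(10, 1)                   # treats 10 as 1:10, so any value from 1 to 10 may appear
resample_sketch(10, 1)          # always 10
resample_sketch(c(3, 7, 9), 2)  # behaves like sample() for longer vectors
```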
Save a heatmap object to a file
save_heatmap(heatmap, path, width = 480, height = 480, res = 100)
heatmap |
The heatmap object to save. |
path |
The path to save the heatmap to. |
width |
The width of the heatmap. |
height |
The height of the heatmap. |
res |
The resolution of the heatmap. |
Does not return any value. Saves heatmap to file.
The settings_df is a data frame whose rows completely specify the hyperparameters and decisions required to transform individual input data frames (found in a data list, see ?data_list) into a single similarity matrix through SNF. The format of the settings data frame is as follows:
A column named "solution": This column is used to keep track of the rows and should have integer values only.
A column named "alpha": This column contains the value of the alpha hyperparameter that will be used on that run of the SNF pipeline.
A column named "k": Like above, but for the K (nearest neighbours) hyperparameter.
A column named "t": Like above, but for the t (number of iterations) hyperparameter.
A column named "snf_scheme": Which of 3 pre-defined schemes will be used to integrate the data frames of the data list into a final fused network. The purpose of varying these schemes is primarily to increase the diversity of the generated cluster solutions.
A value of 1 corresponds to the "individual" scheme, in which all data frames are directly merged by SNF into the final fused network. This scheme corresponds to the approach shown in the original SNF paper.
A value of 2 corresponds to the "two-step" scheme, in which all data frames within a domain are first merged into a domain-specific fused network. Next, domain-specific networks are fused once more by SNF into the final fused network. This scheme is useful for fairly re-weighting SNF pipelines with unequal numbers of data frames across domains.
A value of 3 corresponds to the "domain" scheme, in which all data frames within a domain are first concatenated into a single domain-specific data frame before being merged by SNF into the final fused network. This approach serves as an alternative way to re-weight SNF pipelines with unequal numbers of data frames across domains. You can learn more about this parameter here: https://branchlab.github.io/metasnf/articles/snf_schemes.html.
A column named "clust_alg": Specification of which clustering algorithm will be applied to the final similarity matrix. By default, this column can take on the integer values 1 or 2, which correspond to spectral clustering where the number of clusters is determined by the eigen-gap or rotation cost heuristic respectively. You can learn more about this parameter here: https://branchlab.github.io/metasnf/articles/clustering_algorithms.html.
A column named "cnt_dist": Specification of which distance metric will be used for data frames of purely continuous data. You can learn about this metric and its defaults here: https://branchlab.github.io/metasnf/articles/distance_metrics.html
A column named "dsc_dist": Like above, but for discrete data frames.
A column named "ord_dist": Like above, but for ordinal data frames.
A column named "cat_dist": Like above, but for categorical data frames.
A column named "mix_dist": Like above, but for mixed-type (e.g., both categorical and discrete) data frames.
One column for every input data frame in the corresponding data list which can either have the value of 0 or 1. The name of the column should be formatted as "inc_[]" where the square brackets are replaced with the name (as found in dl_summary(dl)$"name") of each data frame. When 0, that data frame will be excluded from that run of the SNF pipeline. When 1, that data frame will be included.
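The column layout described above can be pictured with a hand-built data frame. This is only an illustration of the format: the values are placeholders chosen for the example (including the assumption that distance columns hold integer indices), the two "inc_" columns assume a data list with components named household_income and pubertal_status, and a real settings data frame should be created with settings_df() rather than by hand.

```r
toy_settings <- data.frame(
    solution = 1:2,
    alpha = c(0.5, 0.8),
    k = c(20, 35),
    t = c(20, 20),
    snf_scheme = c(1, 2),
    clust_alg = c(1, 2),
    cnt_dist = c(1, 1),
    dsc_dist = c(1, 1),
    ord_dist = c(1, 1),
    cat_dist = c(1, 1),
    mix_dist = c(1, 1),
    inc_household_income = c(1, 0),
    inc_pubertal_status = c(1, 1)
)

# Read off which data frames each run of the pipeline would include.
inc_cols <- grep("^inc_", colnames(toy_settings), value = TRUE)
included <- lapply(seq_len(nrow(toy_settings)), function(i) {
    sub("^inc_", "", inc_cols[unlist(toy_settings[i, inc_cols]) == 1])
})
included
```

Here the first row uses both data frames, while the second row drops household_income and keeps only pubertal_status.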
settings_df( dl, n_solutions = 0, min_removed_inputs = 0, max_removed_inputs = length(dl) - 1, dropout_dist = "exponential", min_alpha = NULL, max_alpha = NULL, min_k = NULL, max_k = NULL, min_t = NULL, max_t = NULL, alpha_values = NULL, k_values = NULL, t_values = NULL, possible_snf_schemes = c(1, 2, 3), clustering_algorithms = NULL, continuous_distances = NULL, discrete_distances = NULL, ordinal_distances = NULL, categorical_distances = NULL, mixed_distances = NULL, dfl = NULL, snf_input_weights = NULL, snf_domain_weights = NULL, retry_limit = 10, allow_duplicates = FALSE )
dl |
A nested list of input data from |
n_solutions |
Number of rows to generate for the settings data frame. |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. By default, 0. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list. |
dropout_dist |
Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default). |
min_alpha |
The minimum value that the alpha hyperparameter can have.
Random assigned value of alpha for each row will be obtained by uniformly
sampling numbers between |
max_alpha |
The maximum value that the alpha hyperparameter can have.
See |
min_k |
The minimum value that the k hyperparameter can have.
Random assigned value of k for each row will be obtained by uniformly
sampling numbers between |
max_k |
The maximum value that the k hyperparameter can have.
See |
min_t |
The minimum value that the t hyperparameter can have.
Random assigned value of t for each row will be obtained by uniformly
sampling numbers between |
max_t |
The maximum value that the t hyperparameter can have.
See |
alpha_values |
A number or numeric vector of a set of possible values
that alpha can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
k_values |
A number or numeric vector of a set of possible values
that k can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
t_values |
A number or numeric vector of a set of possible values
that t can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
possible_snf_schemes |
A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "twostep" scheme. |
clustering_algorithms |
A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms. |
continuous_distances |
A vector of continuous distance metrics to use when a custom dist_fns_list is provided. |
discrete_distances |
A vector of discrete distance metrics to use when a custom dist_fns_list is provided. |
ordinal_distances |
A vector of ordinal distance metrics to use when a custom dist_fns_list is provided. |
categorical_distances |
A vector of categorical distance metrics to use when a custom dist_fns_list is provided. |
mixed_distances |
A vector of mixed distance metrics to use when a custom dist_fns_list is provided. |
dfl |
List containing distance metrics to vary over. See ?generate_dist_fns_list. |
snf_input_weights |
Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights) |
snf_domain_weights |
Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights) |
retry_limit |
The maximum number of attempts to generate a novel row.
This function does not return matrices with identical rows. As the range of
requested possible settings tightens and the number of requested rows
increases, the risk of randomly generating a row that already exists
increases. If a new random row has matched an existing row |
allow_duplicates |
If TRUE, enables creation of a settings data frame with duplicate non-feature weighting related hyperparameters. This option should only be used when paired with a custom weights matrix that has non-duplicate rows. |
A settings data frame
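The role of retry_limit can be illustrated with a small rejection loop: candidate rows are drawn at random, duplicates are discarded, and generation gives up after too many failures in a row. Everything in this sketch is hypothetical (the helper names, the toy row generator, and the choices to reset the counter after each success and to stop quietly at the limit); it shows the general technique rather than the package's actual behaviour when the limit is reached.

```r
build_unique_rows_sketch <- function(n_rows, draw_row, retry_limit = 10) {
    rows <- list()
    keys <- character(0)
    failed_attempts <- 0
    while (length(rows) < n_rows && failed_attempts < retry_limit) {
        candidate <- draw_row()
        key <- paste(unlist(candidate), collapse = "_")
        if (key %in% keys) {
            # Candidate duplicates an existing row: count the failure and retry.
            failed_attempts <- failed_attempts + 1
        } else {
            rows[[length(rows) + 1]] <- candidate
            keys <- c(keys, key)
            failed_attempts <- 0
        }
    }
    do.call(rbind, rows)
}

# Only 2 x 3 = 6 distinct rows exist, so asking for 10 must stop early.
draw_row <- function() {
    data.frame(k = c(20, 30)[sample.int(2, 1)], snf_scheme = sample.int(3, 1))
}

set.seed(42)
settings <- build_unique_rows_sketch(10, draw_row)
nrow(settings)
```

The narrower the range of allowed hyperparameters relative to the number of requested rows, the more often candidates collide and the more likely the limit is to be hit.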
This function calls the htShiny()
function from the package
InteractiveComplexHeatmap to assist users in identifying the indices of the
boundaries between meta clusters in a meta cluster heatmap. By providing a
heatmap of inter-solution similarities (obtained through
meta_cluster_heatmap()), users can click on positions within the heatmap
that appear to meaningfully separate major sets of similar cluster
solutions by visual inspection. The corresponding indices of the clicked
positions are printed to the console and also shown within the app. This
function can only run from an interactive session of R.
shiny_annotator(ari_heatmap)
ari_heatmap |
Heatmap of ARIs to divide into meta clusters. |
Does not return any value. Launches interactive shiny applet.
#dl <- data_list(
#    list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
#    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#    list(income, "household_income", "demographics", "continuous"),
#    list(pubertal, "pubertal_status", "demographics", "continuous"),
#    uid = "unique_id"
#)
#
#set.seed(42)
#my_sc <- snf_config(
#    dl = dl,
#    n_solutions = 20,
#    min_k = 20,
#    max_k = 50
#)
#
#sol_df <- batch_snf(dl, my_sc)
#
#sol_aris <- calc_aris(sol_df)
#
#meta_cluster_order <- get_matrix_order(sol_aris)
#
#ari_hm <- meta_cluster_heatmap(sol_aris, order = meta_cluster_order)
#
## Click on meta cluster boundaries to obtain `split_vec` values
#shiny_annotator(ari_hm)
#
#split_vec <- c(6, 10, 16)
#
#ari_hm <- meta_cluster_heatmap(
#    sol_aris,
#    order = meta_cluster_order,
#    split_vector = split_vec
#)
Create or extract a sim_mats_list class object
sim_mats_list(x)
x |
The object to create or extract a |
A sim_mats_list class object.
Plot heatmap of similarity matrix
similarity_matrix_heatmap( similarity_matrix, order = NULL, cluster_solution = NULL, scale_diag = "mean", log_graph = TRUE, cluster_rows = FALSE, cluster_columns = FALSE, show_row_names = FALSE, show_column_names = FALSE, data = NULL, left_bar = NULL, right_bar = NULL, top_bar = NULL, bottom_bar = NULL, left_hm = NULL, right_hm = NULL, top_hm = NULL, bottom_hm = NULL, annotation_colours = NULL, min_colour = NULL, max_colour = NULL, split_vector = NULL, row_split = NULL, column_split = NULL, ... )
similarity_matrix |
A similarity matrix |
order |
Vector of numbers to reorder the similarity matrix (and data if provided). Overwrites ordering specified by cluster_solution param. |
cluster_solution |
Row of a solutions data frame or column of a transposed solutions data frame. |
scale_diag |
Method of rescaling matrix diagonals. Can be "none" (don't change diagonals), "mean" (replace diagonals with average value of off-diagonals), or "zero" (replace diagonals with 0). |
log_graph |
If TRUE, log transforms the graph. |
cluster_rows |
Parameter for ComplexHeatmap::Heatmap. |
cluster_columns |
Parameter for ComplexHeatmap::Heatmap. |
show_row_names |
Parameter for ComplexHeatmap::Heatmap. |
show_column_names |
Parameter for ComplexHeatmap::Heatmap. |
data |
A data frame containing elements requested for annotation. |
left_bar |
Named list of strings, where the strings are features in data that should be used for a barplot annotation on the left of the plot and the names are the names that will be used to caption the plots and their legends. |
right_bar |
See left_bar. |
top_bar |
See left_bar. |
bottom_bar |
See left_bar. |
left_hm |
Like left_bar, but with a heatmap annotation instead of a barplot annotation. |
right_hm |
See left_hm. |
top_hm |
See left_hm. |
bottom_hm |
See left_hm. |
annotation_colours |
Named list of heatmap annotations and their colours. |
min_colour |
Colour used for the lowest value in the heatmap. |
max_colour |
Colour used for the highest value in the heatmap. |
split_vector |
A vector of partition indices. |
row_split |
Standard parameter of |
column_split |
Standard parameter of |
... |
Additional parameters passed into ComplexHeatmap::Heatmap. |
Returns a heatmap (class "Heatmap" from package ComplexHeatmap) that displays the similarities between observations in the provided matrix.
# my_dl <- data_list(
#     list(
#         data = expression_df,
#         name = "expression_data",
#         domain = "gene_expression",
#         type = "continuous"
#     ),
#     list(
#         data = methylation_df,
#         name = "methylation_data",
#         domain = "gene_methylation",
#         type = "continuous"
#     ),
#     uid = "patient_id"
# )
#
# sc <- snf_config(my_dl, n_solutions = 10)
#
# sol_df <- batch_snf(my_dl, sc, return_sim_mats = TRUE)
#
# sim_mats <- sim_mats_list(sol_df)
#
# similarity_matrix_heatmap(
#     sim_mats[[1]],
#     cluster_solution = sol_df[1, ]
# )
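The scale_diag and log_graph options described above can be illustrated without the package. The following base R sketch is not the package's implementation: the helper name rescale_diag is hypothetical, and the use of the natural logarithm for the log transform is an assumption.

```r
# Hypothetical helper illustrating the three documented scale_diag options.
rescale_diag <- function(sim, method = c("mean", "zero", "none")) {
    method <- match.arg(method)
    # Off-diagonal entries are those whose row and column indices differ.
    off_diag <- sim[row(sim) != col(sim)]
    if (method == "mean") {
        diag(sim) <- mean(off_diag)
    } else if (method == "zero") {
        diag(sim) <- 0
    }
    sim
}

sim <- matrix(
    c(1.0, 0.2, 0.4,
      0.2, 1.0, 0.6,
      0.4, 0.6, 1.0),
    nrow = 3,
    byrow = TRUE
)

scaled <- rescale_diag(sim, "mean")

# Log transform, as log_graph = TRUE would request (natural log assumed).
logged <- log(scaled)
```

Note that combining the "zero" option with a log transform would produce -Inf on the diagonal in this sketch, which is one reason the "mean" option pairs more naturally with log_graph = TRUE.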
Squared (including weights) Euclidean distance
siw_euclidean_distance(df, weights_row)
df |
data frame containing at least 1 data column. |
weights_row |
Single-row data frame where the column names contain the column names in df and the row contains the corresponding weights. |
distance_matrix A distance matrix.
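As a rough illustration, the base R sketch below shows one plausible reading of "squared (including weights)": each column is multiplied by its weight before the Euclidean distance is squared, so the weights end up squared as well. This interpretation, the toy data, and the variable names are assumptions rather than the package's confirmed implementation.

```r
# Toy data: three observations, two features.
df <- data.frame(a = c(0, 3, 0), b = c(0, 0, 4))
weights_row <- data.frame(a = 2, b = 0.5)

# Align the weights to the column order of df.
w <- unlist(weights_row[1, colnames(df)])

# Multiply each column by its weight, then square the Euclidean distances.
weighted <- sweep(as.matrix(df), 2, w, `*`)
distance_matrix <- as.matrix(stats::dist(weighted))^2
```

Under this reading, the distance between observations 1 and 2 is (2 * 3)^2 = 36 and between observations 1 and 3 is (0.5 * 4)^2 = 4.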
snf_config() constructs an SNF config object which inherits from classes snf_config and list. This object is used to store all settings required to transform data stored in a data_list class object into a space of cluster solutions by SNF. The SNF config object contains the following components:
1. A settings data frame (inherits from settings_df and data.frame). Data frame that stores SNF-specific hyperparameters and information about feature selection and weighting, SNF schemes, clustering algorithms, and distance metrics. Each row of the settings data frame corresponds to a distinct cluster solution.
2. A clustering algorithms list (inherits from clust_fns_list and list), which stores all clustering algorithms that the settings data frame can point to.
3. A distance metrics list (inherits from dist_metrics_list and list), which stores all distance metrics that the settings data frame can point to.
4. A weights matrix (inherits from weights_matrix, matrix, and array), which stores the feature weights to use prior to distance calculations. Each column of the weights matrix corresponds to a different feature in the data list and each row corresponds to a different row in the settings data frame.
snf_config( dl = NULL, sdf = NULL, dfl = NULL, cfl = NULL, wm = NULL, n_solutions = 0, min_removed_inputs = 0, max_removed_inputs = length(dl) - 1, dropout_dist = "exponential", min_alpha = NULL, max_alpha = NULL, min_k = NULL, max_k = NULL, min_t = NULL, max_t = NULL, alpha_values = NULL, k_values = NULL, t_values = NULL, possible_snf_schemes = c(1, 2, 3), clustering_algorithms = NULL, continuous_distances = NULL, discrete_distances = NULL, ordinal_distances = NULL, categorical_distances = NULL, mixed_distances = NULL, snf_input_weights = NULL, snf_domain_weights = NULL, retry_limit = 10, cnt_dist_fns = NULL, dsc_dist_fns = NULL, ord_dist_fns = NULL, cat_dist_fns = NULL, mix_dist_fns = NULL, automatic_standard_normalize = FALSE, use_default_dist_fns = FALSE, clust_fns = NULL, use_default_clust_fns = FALSE, weights_fill = "ones" )
dl |
A nested list of input data from |
sdf |
A |
dfl |
A |
cfl |
A |
wm |
A |
n_solutions |
Number of rows to generate for the settings data frame. |
min_removed_inputs |
The smallest number of input data frames that may be randomly removed. By default, 0. |
max_removed_inputs |
The largest number of input data frames that may be randomly removed. By default, this is 1 less than all the provided input data frames in the data list. |
dropout_dist |
Parameter controlling how the random removal of input data frames should occur. Can be "none" (no input data frames are randomly removed), "uniform" (uniformly sample between min_removed_inputs and max_removed_inputs to determine number of input data frames to remove), or "exponential" (pick number of input data frames to remove by sampling from min_removed_inputs to max_removed_inputs with an exponential distribution; the default). |
min_alpha |
The minimum value that the alpha hyperparameter can have. The randomly assigned value of alpha for each row will be obtained by uniformly sampling numbers between |
max_alpha |
The maximum value that the alpha hyperparameter can have.
See |
min_k |
The minimum value that the k hyperparameter can have. The randomly assigned value of k for each row will be obtained by uniformly sampling numbers between |
max_k |
The maximum value that the k hyperparameter can have.
See |
min_t |
The minimum value that the t hyperparameter can have. The randomly assigned value of t for each row will be obtained by uniformly sampling numbers between |
max_t |
The maximum value that the t hyperparameter can have.
See |
alpha_values |
A number or numeric vector of a set of possible values
that alpha can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
k_values |
A number or numeric vector of a set of possible values
that k can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
t_values |
A number or numeric vector of a set of possible values
that t can take on. Value will be obtained by uniformly sampling the
vector. Cannot be used in conjunction with the |
possible_snf_schemes |
A vector containing the possible snf_schemes to uniformly randomly select from. By default, the vector contains all 3 possible schemes: c(1, 2, 3). 1 corresponds to the "individual" scheme, 2 corresponds to the "domain" scheme, and 3 corresponds to the "twostep" scheme. |
clustering_algorithms |
A list of clustering algorithms to uniformly randomly pick from when clustering. When not specified, randomly select between spectral clustering using the eigen-gap heuristic and spectral clustering using the rotation cost heuristic. See ?clust_fns_list for more details on running custom clustering algorithms. |
continuous_distances |
A vector of continuous distance metrics to use when a custom dist_fns_list is provided. |
discrete_distances |
A vector of discrete distance metrics to use when a custom dist_fns_list is provided. |
ordinal_distances |
A vector of ordinal distance metrics to use when a custom dist_fns_list is provided. |
categorical_distances |
A vector of categorical distance metrics to use when a custom dist_fns_list is provided. |
mixed_distances |
A vector of mixed distance metrics to use when a custom dist_fns_list is provided. |
snf_input_weights |
Nested list containing weights for when SNF is used to merge individual input measures (see ?generate_snf_weights) |
snf_domain_weights |
Nested list containing weights for when SNF is used to merge domains (see ?generate_snf_weights) |
retry_limit |
The maximum number of attempts to generate a novel row.
This function does not return matrices with identical rows. As the range of
requested possible settings tightens and the number of requested rows
increases, the risk of randomly generating a row that already exists
increases. If a new random row has matched an existing row |
cnt_dist_fns |
A named list of continuous distance metric functions. |
dsc_dist_fns |
A named list of discrete distance metric functions. |
ord_dist_fns |
A named list of ordinal distance metric functions. |
cat_dist_fns |
A named list of categorical distance metric functions. |
mix_dist_fns |
A named list of mixed distance metric functions. |
automatic_standard_normalize |
If TRUE, will automatically use standard normalization prior to calculation of any numeric distances. This parameter overrides all other distance functions list-related parameters. |
use_default_dist_fns |
If TRUE, prepend the base distance metrics (euclidean distance for continuous, discrete, and ordinal data and gower distance for categorical and mixed data) to the resulting distance metrics list. |
clust_fns |
A list of named clustering functions |
use_default_clust_fns |
If TRUE, prepend the base clustering algorithms (spectral_eigen and spectral_rot, which apply spectral clustering and use the eigen-gap and rotation cost heuristics respectively for determining the number of clusters in the graph) to clust_fns. |
weights_fill |
String indicating what to populate generated rows with. Can be "ones" (default; fill matrix with 1), "uniform" (fill matrix with uniformly distributed random values), or "exponential" (fill matrix with exponentially distributed random values). |
An snf_config class object.
# Simple random config for 5 cluster solutions
input_dl <- data_list(
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5
)

# Specifying possible K range
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5,
    min_k = 20,
    max_k = 40
)

# Random feature weights drawn from a uniform distribution
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5,
    min_k = 20,
    max_k = 40,
    weights_fill = "uniform"
)

# Specifying custom pre-built clustering and distance functions
# - Random alternation between 2-cluster and 5-cluster solutions
# - When continuous or discrete data frames are being processed,
#   randomly alternate between standardized/normalized Euclidean
#   distance vs. regular Euclidean distance
my_sc <- snf_config(
    dl = input_dl,
    n_solutions = 5,
    min_k = 20,
    max_k = 40,
    weights_fill = "uniform",
    clust_fns = list(
        "two_cluster_spectral" = spectral_two,
        "five_cluster_spectral" = spectral_five
    ),
    cnt_dist_fns = list(
        "euclidean" = euclidean_distance,
        "std_nrm_euc" = sn_euclidean_distance
    ),
    dsc_dist_fns = list(
        "euclidean" = euclidean_distance,
        "std_nrm_euc" = sn_euclidean_distance
    )
)
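The interaction between the *_values arguments and retry_limit can be sketched in base R. The code below is only an illustration of the idea of sampling unique settings rows with a cap on consecutive duplicate draws; the helpers sample_one and build_settings are hypothetical, and resetting the retry counter after each accepted row is an assumption, since the documentation above does not specify it.

```r
set.seed(1)

# Hypothetical helper: uniformly pick one element (safe for length-1 vectors).
sample_one <- function(values) values[sample.int(length(values), 1)]

# Hypothetical helper: build a data frame of unique settings rows.
build_settings <- function(n_solutions, alpha_values, k_values, t_values,
                           retry_limit = 10) {
    settings <- data.frame(alpha = numeric(0), k = numeric(0), t = numeric(0))
    seen <- character(0)
    attempts <- 0
    while (nrow(settings) < n_solutions) {
        candidate <- data.frame(
            alpha = sample_one(alpha_values),
            k = sample_one(k_values),
            t = sample_one(t_values)
        )
        key <- paste(candidate$alpha, candidate$k, candidate$t)
        if (key %in% seen) {
            # Duplicate row: count the failed attempt and try again.
            attempts <- attempts + 1
            if (attempts >= retry_limit) {
                stop("Retry limit reached before generating a novel row.")
            }
        } else {
            settings <- rbind(settings, candidate)
            seen <- c(seen, key)
            attempts <- 0  # Assumption: counter resets after each new row.
        }
    }
    settings
}

settings <- build_settings(
    n_solutions = 5,
    alpha_values = c(0.3, 0.5, 0.8),
    k_values = c(20, 30, 40),
    t_values = c(10, 20)
)
```

The sketch also shows why a tight range combined with many requested rows is a problem: asking for more rows than there are distinct combinations can never succeed and ends in the retry-limit error.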
Helper function to determine which row and columns to split on
split_parser( row_split_vector = NULL, column_split_vector = NULL, row_split = NULL, column_split = NULL, n_rows, n_columns )
row_split_vector |
A vector of row indices to split on. |
column_split_vector |
A vector of column indices to split on. |
row_split |
Standard parameter of |
column_split |
Standard parameter of |
n_rows |
The number of rows in the data. |
n_columns |
The number of columns in the data. |
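To make the relationship between split indices and split labels concrete, here is a minimal base R sketch. It is not the package's code: the helper indices_to_split is hypothetical, and the convention that a new group begins at each split index (rather than after it) is an assumption.

```r
# Hypothetical helper: turn split indices into one group label per position.
indices_to_split <- function(split_indices, n) {
    # findInterval() counts how many split points are <= each position,
    # so a new group starts at each split index (assumed convention).
    as.character(findInterval(seq_len(n), sort(split_indices)) + 1)
}

# Eight rows, with new groups starting at rows 3 and 6.
row_split <- indices_to_split(c(3, 6), n = 8)
```

A label vector of this shape, with one entry per row or column, is the kind of input that the row_split and column_split arguments of ComplexHeatmap::Heatmap accept.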
"list"-class object containing row_split and column_split character vectors to pass into ComplexHeatmap::Heatmap.
Like the mock data frame "abcd_subc_v", but with "unique_id" as the "uid".
subc_v
A data frame with 174 rows and 31 columns:
The unique identifier of the ABCD dataset
Subcortical volumes of various ROIs (mm^3, I think)
Though this data is no longer "real" ABCD data, the reference for using ABCD as a data source is below:
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive DevelopmentSM (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
Given a data list, return a list of smaller data lists that are generated through random sampling (without replacement). The results of this function can be passed into batch_snf_subsamples() to obtain a list of resampled solutions data frames.
subsample_dl( dl, n_subsamples, subsample_fraction = NULL, n_observations = NULL )
dl |
A nested list of input data from |
n_subsamples |
Number of subsamples to create. |
subsample_fraction |
Fraction (between 0 and 1) of patients to include per subsample. |
n_observations |
Number of patients to include per subsample. |
A "list" class object containing n_subsamples number of data lists. Each of those data lists contains a random subsample_fraction fraction of the observations of the provided data list.
# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
#
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
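The core idea of subsampling without replacement can be shown on a plain vector of UIDs. This base R sketch does not operate on a real data list; the toy UIDs and the rounding rule used to convert subsample_fraction into a count are assumptions.

```r
set.seed(42)

# Toy UIDs standing in for the observations of a data list.
uids <- paste0("uid_", 1:20)
n_subsamples <- 3
subsample_fraction <- 0.85

# Number of observations kept per subsample (rounding rule is an assumption).
n_keep <- round(subsample_fraction * length(uids))

# Each subsample is an independent draw without replacement.
subsamples <- lapply(
    seq_len(n_subsamples),
    function(i) sample(uids, n_keep, replace = FALSE)
)
```

Because each draw is made without replacement, no observation appears twice within a subsample, although the same observation will usually appear in several different subsamples.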
Given a list of subsampled solutions data frames from batch_snf_subsamples(), this function calculates the adjusted Rand indices across all the subsamples of each solution. ARI calculation between two subsamples only factors in observations that were present in both subsamples.
subsample_pairwise_aris(subsample_solutions, verbose = FALSE)
subsample_solutions |
A list of solutions data frames from
subsamples of the data. This object is generated by the function
|
verbose |
If TRUE, output progress to console. |
A two-item list: "raw_aris", a list of inter-subsample pairwise ARI matrices (one for each full cluster solution) and "ari_summary", a data frame containing the mean and SD of the inter-subsample ARIs for each original cluster solution.
# my_dl <- data_list(
#     list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
#     list(income, "household_income", "demographics", "continuous"),
#     list(pubertal, "pubertal_status", "demographics", "continuous"),
#     uid = "unique_id"
# )
#
# sc <- snf_config(my_dl, n_solutions = 5, max_k = 40)
#
# my_dl_subsamples <- subsample_dl(
#     my_dl,
#     n_subsamples = 20,
#     subsample_fraction = 0.85
# )
#
# batch_subsample_results <- batch_snf_subsamples(
#     my_dl_subsamples,
#     sc,
#     verbose = TRUE
# )
#
# pairwise_aris <- subsample_pairwise_aris(
#     batch_subsample_results,
#     verbose = TRUE
# )
#
# # Visualize ARIs for the first cluster solution
# ComplexHeatmap::Heatmap(
#     pairwise_aris$"raw_aris"[[1]],
#     heatmap_legend_param = list(
#         color_bar = "continuous",
#         title = "Inter-Subsample\nARI",
#         at = c(0, 0.5, 1)
#     ),
#     show_column_names = FALSE,
#     show_row_names = FALSE
# )
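The restriction of the ARI to shared observations can be illustrated with a small, self-contained base R sketch. The function adjusted_rand_index below is a hypothetical helper implementing the standard contingency-table formula for the adjusted Rand index; it is not the package's internal code, and the toy cluster assignments are invented for illustration.

```r
# Hypothetical helper: adjusted Rand index from two label vectors.
adjusted_rand_index <- function(x, y) {
    tab <- table(x, y)
    n <- sum(tab)
    sum_cells <- sum(choose(tab, 2))
    sum_rows <- sum(choose(rowSums(tab), 2))
    sum_cols <- sum(choose(colSums(tab), 2))
    expected <- sum_rows * sum_cols / choose(n, 2)
    max_index <- (sum_rows + sum_cols) / 2
    (sum_cells - expected) / (max_index - expected)
}

# Cluster labels from two subsamples, named by observation.
sub_a <- c(p1 = 1, p2 = 1, p3 = 2, p4 = 2, p5 = 1)
sub_b <- c(p2 = 2, p3 = 1, p4 = 1, p5 = 2, p6 = 1)

# Only observations present in both subsamples are compared.
shared <- intersect(names(sub_a), names(sub_b))
ari <- adjusted_rand_index(sub_a[shared], sub_b[shared])
```

In this toy case the two subsamples group the shared observations identically up to relabelling, so the ARI is 1. The sketch returns NaN in the degenerate case where both partitions place every shared observation in a single cluster.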
Summarize a clust_fns_list object
summarize_clust_fns_list(cfl)
cfl |
A |
summary_df "data.frame" class object containing the name and index of each clustering algorithm in the provided clust_fns_list.
Summarize metrics contained in a dist_fns_list
summarize_dfl(dist_fns_list)
dist_fns_list |
A dist_fns_list. |
"data.frame"-class object summarizing items in a distance metrics list.
Defunct function for summarizing a data list. Please use summary() instead.
summarize_dl(data_list, scope = "component")
data_list |
A nested list of input data from |
scope |
The level of detail for the summary. Options are:
|
data.frame class object summarizing all components (or features if scope == "feature").
Pull features used to calculate summary p-values from an object
summary_features(x)
x |
The object to extract summary features from. |
A character vector of summary features.
data_list
Returns a data list summary (data.frame class object) containing information on components, features, variable types, domains, and component dimensions.
## S3 method for class 'data_list' summary(object, scope = "component", ...)
object |
A |
scope |
The level of detail for the summary. By default, this is set to "component", which returns a summary of the data list at the component level. Can also be set to "feature", resulting in a summary at the feature level. |
... |
Other arguments passed to |
A data.frame class object. If scope is "component", each row shows the name, variable type, domain, and dimensions of each component. If scope is "feature", each row shows the name, variable type, and domain of each feature.
Given a vector of UIDs and a training fraction, returns a list indicating which observations belong in the training set and which belong in the testing set. Assignment depends on whether the absolute value of Jenkins's one_at_a_time hash of each UID exceeds the maximum possible value (2147483647) multiplied by the training fraction.
train_test_assign(train_frac, uids, seed = 42)
train_frac |
The fraction (0 to 1) of observations for training |
uids |
A character vector of UIDs to be distributed into training and test sets. |
seed |
Seed used for Jenkins's one_at_a_time hash function. |
A named list containing the training and testing uid_ids.
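The benefit of hash-based assignment is that each UID's placement is deterministic and independent of the other UIDs in the vector. The base R sketch below illustrates that idea only. It deliberately substitutes a simple polynomial rolling hash for Jenkins's one_at_a_time hash, the helpers simple_hash and assign_split are hypothetical, and treating hashes below the cutoff as training observations is an assumption.

```r
# Stand-in hash (NOT Jenkins's one_at_a_time): a polynomial rolling hash
# whose values stay within [0, 2147483647).
simple_hash <- function(uid, seed = 42) {
    h <- seed
    for (code in utf8ToInt(uid)) {
        h <- (h * 31 + code) %% 2147483647
    }
    h
}

# Hypothetical helper: split UIDs by comparing hashes to a cutoff.
assign_split <- function(train_frac, uids, seed = 42) {
    hashes <- vapply(uids, simple_hash, numeric(1), seed = seed)
    # Assumption: hashes below the cutoff go to the training set.
    in_train <- hashes < train_frac * 2147483647
    list(train = uids[in_train], test = uids[!in_train])
}

uids <- paste0("uid_", 1:10)
assignment <- assign_split(0.8, uids)
```

Because the assignment depends only on the UID and the seed, adding or removing other observations does not move an existing observation between the training and testing sets. The realized training fraction will only approximate train_frac, especially for small samples.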
Pull UIDs from an object
uids(x)
x |
The object to extract UIDs from. |
A character vector of UIDs.
Validator for solutions_df class object
validate_solutions_df(sol_dfl)
sol_dfl |
A solutions data frame-like object to be validated and converted into a solutions data frame. |
If sol_dfl has a valid structure for a solutions_df class object, returns the input unchanged. Otherwise, raises an error.
Manhattan plot of feature-feature association p-values
var_manhattan_plot( dl, key_var, neg_log_pval_thresh = 5, threshold = NULL, point_size = 5, text_size = 20, plot_title = NULL, hide_x_labels = FALSE, bonferroni_line = FALSE )
dl |
List of data frames containing data information. |
key_var |
Feature for which the association p-values of all other features are plotted. |
neg_log_pval_thresh |
Threshold for negative log p-values. |
threshold |
p-value threshold to plot dashed line at. |
point_size |
Size of points in the plot. |
text_size |
Size of text in the plot. |
plot_title |
Title of the plot. |
hide_x_labels |
If TRUE, hides x-axis labels. |
bonferroni_line |
If TRUE, plots a dashed black line at the Bonferroni-corrected equivalent of the p-value threshold. |
A Manhattan plot (class "gg", "ggplot") showing the association p-values of features against one key feature in a data list.
dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

var_manhattan <- var_manhattan_plot(
    dl,
    key_var = "household_income",
    plot_title = "Correlation of Features with Household Income",
    text_size = 16,
    neg_log_pval_thresh = 3,
    threshold = 0.05
)
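The quantities behind such a plot can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the package's method: Pearson correlation tests stand in for whatever association tests the package applies, the base-10 logarithm is assumed, and treating neg_log_pval_thresh as a cap on the plotted values is an assumption.

```r
set.seed(7)

# Simulated key feature and three other features.
n <- 50
key <- rnorm(n)
features <- data.frame(
    f1 = key + rnorm(n, sd = 0.5),
    f2 = rnorm(n),
    f3 = rnorm(n)
)

# One association p-value per feature against the key feature.
p_values <- vapply(
    features,
    function(x) stats::cor.test(key, x)$p.value,
    numeric(1)
)

# Negative log p-values, capped so extreme values do not dominate the plot.
neg_log_pval_thresh <- 5
capped <- pmin(-log10(p_values), neg_log_pval_thresh)

# Bonferroni-corrected equivalent of the p-value threshold.
threshold <- 0.05
bonferroni_threshold <- threshold / length(p_values)
```

On the negative log scale, the Bonferroni line sits above the uncorrected threshold line, since dividing the threshold by the number of tests makes it smaller.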
Function for building a weights matrix independently of an SNF config. The weights matrix contains one row corresponding to each row of the settings data frame in an SNF config (one row for each resulting cluster solution) and one column for each feature in the data list used for clustering. Values of the weights matrix are passed to distance metrics functions during the conversion of input data frames to distance matrices. Typically, there is no need to use this function directly. Instead, users should provide weights matrix-building parameters to the snf_config() function.
weights_matrix(dl = NULL, n_solutions = 1, weights_fill = "ones")
dl |
A nested list of input data from |
n_solutions |
Number of rows to generate the template weights matrix for. |
weights_fill |
String indicating what to populate generated rows with. Can be "ones" (default; fill matrix with 1), "uniform" (fill matrix with uniformly distributed random values), or "exponential" (fill matrix with exponentially distributed random values). |
wm A properly formatted matrix containing one column for each feature that requires a weight and one row for each of the n_solutions solutions.
input_dl <- data_list(
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(input_dl, n_solutions = 5)

wm <- weights_matrix(input_dl, n_solutions = 5, weights_fill = "uniform")

# updating an SNF config in parts
sc$"weights_matrix" <- wm
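The shape of a weights matrix and the three weights_fill options can be mimicked in base R. The helper build_weights below is hypothetical and works from a plain character vector of feature names rather than a data list; the default parameters of runif() and rexp() are assumptions about how the random fills are drawn.

```r
# Hypothetical helper: one row per solution, one column per feature.
build_weights <- function(features, n_solutions = 1,
                          weights_fill = c("ones", "uniform", "exponential")) {
    weights_fill <- match.arg(weights_fill)
    n_values <- n_solutions * length(features)
    values <- switch(
        weights_fill,
        ones = rep(1, n_values),
        uniform = stats::runif(n_values),      # Assumed range: 0 to 1.
        exponential = stats::rexp(n_values)    # Assumed rate: 1.
    )
    matrix(values, nrow = n_solutions, dimnames = list(NULL, features))
}

set.seed(3)
wm_sketch <- build_weights(
    c("income", "pubertal_status", "volume"),
    n_solutions = 5,
    weights_fill = "uniform"
)
```

With the default "ones" fill, every feature contributes equally to each distance calculation, which is why that option leaves the resulting cluster solutions unweighted.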