API Reference#
Datasets#
This class encapsulates an AFQ dataset and has static methods to read data from csv files conforming to the AFQ data standard.
- class afqinsight.AFQDataset(X, y=None, groups=None, feature_names=None, group_names=None, target_cols=None, subjects=None, sessions=None, classes=None)[source]#
Represent AFQ features and targets.
The AFQDataset class represents tractometry features and, optionally, phenotypic targets.
The simplest way to create a new AFQDataset is to pass in the tractometric features and phenotypic targets explicitly.
>>> import numpy as np
>>> from afqinsight import AFQDataset
>>> AFQDataset(X=np.random.rand(50, 1000), y=np.random.rand(50))
AFQDataset(n_samples=50, n_features=1000, n_targets=1)
You can keep track of the names of the target variables with the target_cols parameter.
>>> AFQDataset(X=np.random.rand(50, 1000),
...            y=np.random.rand(50), target_cols=["age"])
AFQDataset(n_samples=50, n_features=1000, n_targets=1, targets=['age'])
Source Datasets:
The most common way to create an AFQDataset is to load data from a set of csv files that conform to the AFQ data format. For example,
>>> import os.path as op
>>> from afqinsight.datasets import download_sarica
>>> sarica_dir = download_sarica(verbose=False)
>>> dataset = AFQDataset.from_files(
...     fn_nodes=op.join(sarica_dir, "nodes.csv"),
...     fn_subjects=op.join(sarica_dir, "subjects.csv"),
...     dwi_metrics=["md", "fa"],
...     target_cols=["class"],
...     label_encode_cols=["class"],
... )
>>> dataset
AFQDataset(n_samples=48, n_features=4000, n_targets=1, targets=['class'])
AFQDatasets are indexable and can be sliced.
>>> dataset[0:10]
AFQDataset(n_samples=10, n_features=4000, n_targets=1, targets=['class'])
You can query the length of the dataset as well as the feature and target shapes.
>>> len(dataset)
48
>>> dataset.shape
((48, 4000), (48,))
Datasets can be used as expected in scikit-learn’s model selection functions. For example
>>> from sklearn.model_selection import train_test_split
>>> train_data, test_data = train_test_split(dataset, test_size=0.3,
...                                           stratify=dataset.y)
>>> train_data
AFQDataset(n_samples=33, n_features=4000, n_targets=1, targets=['class'])
>>> test_data
AFQDataset(n_samples=15, n_features=4000, n_targets=1, targets=['class'])
You can drop samples from the dataset that have null target values using the drop_target_na method.

>>> dataset.y = dataset.y.astype(float)
>>> dataset.y[:5] = np.nan
>>> dataset.drop_target_na()
>>> dataset
AFQDataset(n_samples=43, n_features=4000, n_targets=1, targets=['class'])
- Parameters:
  - X : array-like of shape (n_samples, n_features)
    The feature samples.
  - y : array-like of shape (n_samples,) or (n_samples, n_targets), optional
    Target values. This will be None if unsupervised is True.
  - groups : list of numpy.ndarray, optional
    The feature indices for each feature group. These are typically used to group collections of "nodes" into white matter bundles.
  - feature_names : list of tuples, optional
    The multi-indexed columns of X, i.e. the names of the features.
  - group_names : list of tuples, optional
    The multi-indexed groups of X, i.e. the names of the feature groups.
  - target_cols : list of strings, optional
    List of column names for the target variables in y.
  - subjects : list, optional
    Subject IDs.
  - sessions : list, optional
    Session IDs.
  - classes : dict, optional
    Class labels for each label-encoded column specified in y.
- Attributes:
  - shape
    Return the shape of the features and targets.
Methods
- as_tensorflow_dataset([bundles_as_channels, ...]) : Return features and labels packaged as a tensorflow dataset.
- as_torch_dataset([bundles_as_channels, ...]) : Return features and labels packaged as a pytorch dataset.
- bundle_means() : Return diffusion metrics averaged along the length of each bundle.
- copy() : Return a deep copy of this dataset.
- drop_target_na() : Drop subjects who have nan values as targets.
- from_files([fn_nodes, fn_subjects, ...]) : Create an AFQDataset from csv files.
- from_study(study[, verbose]) : Fetch an AFQ dataset from a predefined study.
- model_fit(model, **fit_params) : Fit the dataset with a provided model object (see the sketch below).
- model_fit_transform(model, **fit_params) : Fit and transform the dataset with a provided model object.
- model_predict(model, **predict_params) : Predict the targets with a provided model object.
- model_score(model, **score_params) : Score a model on this dataset.
- model_transform(model, **transform_params) : Transform the dataset with a provided model object.
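The model_* convenience methods fit, transform, predict, and score any sklearn-compatible estimator directly on the dataset's X and y. A minimal sketch, assuming the sarica dataset loaded above and wrapping the estimator in an imputing pipeline because tract profiles can contain missing values (model_fit is assumed to return the fitted estimator):

>>> from sklearn.impute import SimpleImputer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> model = make_pipeline(SimpleImputer(strategy="median"),
...                       LogisticRegression(max_iter=1000))
>>> fitted = dataset.model_fit(model)            # fits on dataset.X and dataset.y
>>> predictions = dataset.model_predict(fitted)  # predicts targets for dataset.X
>>> accuracy = dataset.model_score(fitted)       # scores with the estimator's default scorer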
Pipelines#
These are the AFQ-Insight recommended estimator pipelines.
- afqinsight.make_afq_regressor_pipeline(imputer='simple', scaler='standard', feature_transformer=False, ensemble_meta_estimator=None, imputer_kwargs=None, scaler_kwargs=None, feature_transformer_kwargs=None, ensemble_meta_estimator_kwargs=None, use_cv_estimator=True, memory=None, pipeline_verbosity=False, target_transformer=None, target_transform_func=None, target_transform_inverse_func=None, target_transform_check_inverse=True, **estimator_kwargs)[source]#
Return the recommended AFQ-specific regression pipeline.
This function returns a Pipeline instance with the following steps:
[imputer, scaler, feature_transformer, estimator]
where

- imputer imputes missing data due to individual subjects missing metrics along an entire bundle;
- scaler is optional and scales the features of the feature matrix;
- feature_transformer is optional and applies a transform featurewise to make data more Gaussian-like; and
- estimator is an instance of groupyr.SGLCV if use_cv_estimator=True or groupyr.SGL if use_cv_estimator=False.

The estimator may optionally be wrapped in an ensemble meta-estimator specified by ensemble_meta_estimator and given the keyword arguments in ensemble_meta_estimator_kwargs. Additionally, the estimator may optionally be wrapped in sklearn.compose.TransformedTargetRegressor, such that the computation during fit is:

estimator.fit(X, target_transform_func(y))

or:

estimator.fit(X, target_transformer.transform(y))

The computation during predict is:

target_transform_inverse_func(estimator.predict(X))

or:

target_transformer.inverse_transform(estimator.predict(X))
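For example, a hedged sketch of a log-transformed regression target (np.log and np.exp are illustrative choices, not defaults; any pair of mutually inverse functions can be used):

>>> import numpy as np
>>> from afqinsight import make_afq_regressor_pipeline
>>> pipe = make_afq_regressor_pipeline(
...     target_transform_func=np.log,
...     target_transform_inverse_func=np.exp,
... )

During fit the estimator sees np.log(y); at predict time the raw predictions are passed through np.exp before being returned.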
- Parameters:
  - imputer : "simple", "knn", or sklearn-compatible transformer, default="simple"
    The imputer for missing data. String arguments result in the use of specific imputers/transformers: "simple" yields sklearn.impute.SimpleImputer; "knn" yields sklearn.impute.KNNImputer. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.
  - scaler : "standard", "minmax", "maxabs", "robust", or sklearn-compatible transformer, default="standard"
    The scaler to use for the feature matrix. String arguments result in the use of specific transformers: "standard" yields sklearn.preprocessing.StandardScaler; "minmax" yields sklearn.preprocessing.MinMaxScaler; "maxabs" yields sklearn.preprocessing.MaxAbsScaler; "robust" yields sklearn.preprocessing.RobustScaler. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.
  - feature_transformer : bool or sklearn-compatible transformer, default=False
    An optional transformer for use on the feature matrix. If True, use sklearn.preprocessing.PowerTransformer. If False, skip this step. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.
  - ensemble_meta_estimator : "bagging", "adaboost", or None
    An optional ensemble meta-estimator to combine the predictions of several base estimators. "adaboost" will result in the use of sklearn.ensemble.AdaBoostClassifier for classifier base estimators or sklearn.ensemble.AdaBoostRegressor for regressor base estimators. "bagging" will result in the use of sklearn.ensemble.BaggingClassifier for classifier base estimators or sklearn.ensemble.BaggingRegressor for regressor base estimators.
  - imputer_kwargs : dict, default=None
    Keyword arguments for the imputer.
  - scaler_kwargs : dict, default=None
    Keyword arguments for the scaler.
  - feature_transformer_kwargs : dict, default=None
    Keyword arguments for the feature_transformer.
  - ensemble_meta_estimator_kwargs : dict, default=None
    Keyword arguments for the ensemble meta-estimator.
  - use_cv_estimator : bool, default=True
    If True, use groupyr.SGLCV as the final estimator. Otherwise, use groupyr.SGL.
  - memory : str or object with the joblib.Memory interface, default=None
    Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
  - pipeline_verbosity : bool, default=False
    If True, the time elapsed while fitting each step will be printed as it is completed.
  - target_transformer : object, default=None
    Estimator object, such as one derived from sklearn.base.TransformerMixin. Cannot be set at the same time as target_transform_func and target_transform_inverse_func. If all three are None, the transformer will be an identity transformer. Note that the transformer will be cloned during fitting. Also, the transformer restricts y to be a numpy array.
  - target_transform_func : function, default=None
    Function to apply to y before passing to fit. Cannot be set at the same time as target_transformer. The function needs to return a 2-dimensional array. If None, the identity function is used.
  - target_transform_inverse_func : function, default=None
    Function to apply to the prediction of the regressor. Cannot be set at the same time as target_transformer. The function needs to return a 2-dimensional array. The inverse function is used to return predictions to the same space as the original training labels.
  - target_transform_check_inverse : bool, default=True
    Whether to check that transform followed by inverse_transform or target_transform_func followed by target_transform_inverse_func leads to the original targets.
  - **estimator_kwargs : kwargs
    Keyword arguments passed to groupyr.SGLCV if use_cv_estimator=True or groupyr.SGL if use_cv_estimator=False.
- Returns:
  - pipeline : Pipeline instance
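A hedged usage sketch follows. The string options are those documented above; `dataset` is assumed to be an AFQDataset with a continuous target (e.g. age), and keyword arguments not consumed by the pipeline steps are forwarded to the groupyr estimator:

>>> from afqinsight import make_afq_regressor_pipeline
>>> pipe = make_afq_regressor_pipeline(
...     imputer="knn",             # use sklearn.impute.KNNImputer
...     feature_transformer=True,  # apply sklearn.preprocessing.PowerTransformer
...     use_cv_estimator=True,     # final estimator is groupyr.SGLCV
...     groups=dataset.groups,     # forwarded to the estimator (assumed groupyr parameter)
... )
>>> fitted = pipe.fit(dataset.X, dataset.y)
>>> predictions = fitted.predict(dataset.X)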
- afqinsight.make_afq_classifier_pipeline(imputer='simple', scaler='standard', feature_transformer=False, ensemble_meta_estimator=None, imputer_kwargs=None, scaler_kwargs=None, feature_transformer_kwargs=None, ensemble_meta_estimator_kwargs=None, use_cv_estimator=True, memory=None, pipeline_verbosity=False, target_transformer=None, target_transform_func=None, target_transform_inverse_func=None, target_transform_check_inverse=True, **estimator_kwargs)[source]#
Return the recommended AFQ-specific classification pipeline.
This function returns a Pipeline instance with the following steps:
[imputer, scaler, feature_transformer, estimator]
where

- imputer imputes missing data due to individual subjects missing metrics along an entire bundle;
- scaler is optional and scales the features of the feature matrix;
- feature_transformer is optional and applies a transform featurewise to make data more Gaussian-like; and
- estimator is an instance of groupyr.LogisticSGLCV if use_cv_estimator=True or groupyr.LogisticSGL if use_cv_estimator=False.

The estimator may optionally be wrapped in an ensemble meta-estimator specified by ensemble_meta_estimator and given the keyword arguments in ensemble_meta_estimator_kwargs. Additionally, the estimator may optionally be wrapped in sklearn.compose.TransformedTargetRegressor, such that the computation during fit is:

estimator.fit(X, target_transform_func(y))

or:

estimator.fit(X, target_transformer.transform(y))

The computation during predict is:

target_transform_inverse_func(estimator.predict(X))

or:

target_transformer.inverse_transform(estimator.predict(X))
- Parameters:
  - imputer : "simple", "knn", or sklearn-compatible transformer, default="simple"
    The imputer for missing data. String arguments result in the use of specific imputers/transformers: "simple" yields sklearn.impute.SimpleImputer; "knn" yields sklearn.impute.KNNImputer. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.
  - scaler : "standard", "minmax", "maxabs", "robust", or sklearn-compatible transformer, default="standard"
    The scaler to use for the feature matrix. String arguments result in the use of specific transformers: "standard" yields sklearn.preprocessing.StandardScaler; "minmax" yields sklearn.preprocessing.MinMaxScaler; "maxabs" yields sklearn.preprocessing.MaxAbsScaler; "robust" yields sklearn.preprocessing.RobustScaler. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.
  - feature_transformer : bool or sklearn-compatible transformer, default=False
    An optional transformer for use on the feature matrix. If True, use sklearn.preprocessing.PowerTransformer. If False, skip this step. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.
  - ensemble_meta_estimator : "bagging", "adaboost", or None
    An optional ensemble meta-estimator to combine the predictions of several base estimators. "adaboost" will result in the use of sklearn.ensemble.AdaBoostClassifier and "bagging" will result in the use of sklearn.ensemble.BaggingClassifier.
  - imputer_kwargs : dict, default=None
    Keyword arguments for the imputer.
  - scaler_kwargs : dict, default=None
    Keyword arguments for the scaler.
  - feature_transformer_kwargs : dict, default=None
    Keyword arguments for the feature_transformer.
  - ensemble_meta_estimator_kwargs : dict, default=None
    Keyword arguments for the ensemble meta-estimator.
  - use_cv_estimator : bool, default=True
    If True, use groupyr.LogisticSGLCV as the final estimator. Otherwise, use groupyr.LogisticSGL.
  - memory : str or object with the joblib.Memory interface, default=None
    Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
  - pipeline_verbosity : bool, default=False
    If True, the time elapsed while fitting each step will be printed as it is completed.
  - target_transformer : object, default=None
    Estimator object, such as one derived from sklearn.base.TransformerMixin. Cannot be set at the same time as target_transform_func and target_transform_inverse_func. If all three are None, the transformer will be an identity transformer. Note that the transformer will be cloned during fitting. Also, the transformer restricts y to be a numpy array.
  - target_transform_func : function, default=None
    Function to apply to y before passing to fit. Cannot be set at the same time as target_transformer. The function needs to return a 2-dimensional array. If None, the identity function is used.
  - target_transform_inverse_func : function, default=None
    Function to apply to the prediction of the regressor. Cannot be set at the same time as target_transformer. The function needs to return a 2-dimensional array. The inverse function is used to return predictions to the same space as the original training labels.
  - target_transform_check_inverse : bool, default=True
    Whether to check that transform followed by inverse_transform or target_transform_func followed by target_transform_inverse_func leads to the original targets.
  - **estimator_kwargs : kwargs
    Keyword arguments passed to groupyr.LogisticSGLCV if use_cv_estimator=True or groupyr.LogisticSGL if use_cv_estimator=False.
- Returns:
  - pipeline : Pipeline instance
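A hedged end-to-end sketch combining an AFQDataset with the classifier pipeline (`sarica_dir` comes from the download_sarica example above; forwarding `groups` to groupyr.LogisticSGLCV is an assumption about that estimator's signature):

>>> from afqinsight import AFQDataset, make_afq_classifier_pipeline
>>> from sklearn.model_selection import train_test_split
>>> dataset = AFQDataset.from_files(
...     fn_nodes=op.join(sarica_dir, "nodes.csv"),
...     fn_subjects=op.join(sarica_dir, "subjects.csv"),
...     dwi_metrics=["md", "fa"],
...     target_cols=["class"],
...     label_encode_cols=["class"],
... )
>>> train, test = train_test_split(dataset, test_size=0.3, stratify=dataset.y)
>>> pipe = make_afq_classifier_pipeline(use_cv_estimator=True,
...                                     groups=train.groups)
>>> fitted = train.model_fit(pipe)        # equivalent to pipe.fit(train.X, train.y)
>>> test_score = test.model_score(fitted)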
Transformers#
These transformers transform tractometry information from the AFQ standard data format to feature matrices that are ready for ingestion into sklearn-compatible pipelines.
- class afqinsight.AFQDataFrameMapper(pd_interpolate_kwargs=None, bundle_agg_func=None, concat_subject_session=False, **dataframe_mapper_kwargs)[source]#
Map pandas dataframe to sklearn feature matrix.
This object first converts an AFQ nodes.csv dataframe into a feature matrix with rows corresponding to subjects and columns corresponding to tract profile values. It interpolates along tracts to fill missing values and then maps the dataframe onto a 2D feature matrix for ingestion into sklearn-compatible estimators. It also maintains attributes for the subject index, feature names, and groups of features.
- Parameters:
  - df_mapper_params : kwargs, default=dict(features=[], default=None)
    Keyword arguments passed to sklearn_pandas.DataFrameMapper. You will probably not need to change these defaults.
  - pd_interpolate_params : kwargs, default=dict(method="linear", limit_direction="both", limit_area="inside")
    Keyword arguments passed to pandas.DataFrame.interpolate. Missing values are interpolated within the tract profile so that no data is used from other subjects, tracts, or metrics, minimizing the chance of train/test leakage. You will probably not need to change these defaults.
  - bundle_agg_func : function, str, list or dict, optional
    If provided, a function to use for aggregating the nodes in each tract. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
    Accepted combinations are:
    - function
    - string function name
    - list of functions and/or function names, e.g. [np.sum, 'mean']
    By default, this mapper will not aggregate but will return values at each node.
- Attributes:
  - subjects_ : list
    List of subject IDs retrieved from pandas dataframe index.
  - groups_ : list of numpy.ndarray
    List of arrays of non-overlapping indices for each group. For example, if nine features are grouped into equal contiguous groups of three, then groups would be [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])].
  - feature_names_ : list of tuples
    Return the feature names.
Methods
- fit(X[, y]) : Fit a transform from the given dataframe.
- fit_transform(X[, y]) : Fit a transform from the given dataframe and apply directly to given data.
- get_names(columns, transformer, x[, alias, ...]) : Return verbose names for the transformed columns.
- get_params([deep]) : Get parameters for this estimator.
- set_output(*[, transform]) : Set output container.
- set_params(**params) : Set the parameters of this estimator.
- transform(X) : Transform the input data.
- get_dtype
- get_dtypes
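A minimal sketch of the mapper applied to an AFQ nodes.csv file. The column layout (subjectID, tractID, nodeID plus one column per diffusion metric) follows the AFQ data standard; the file path is a placeholder:

>>> import pandas as pd
>>> from afqinsight import AFQDataFrameMapper
>>> nodes = pd.read_csv("nodes.csv")  # AFQ nodes file: subjectID, tractID, nodeID, metric columns
>>> mapper = AFQDataFrameMapper()
>>> X = mapper.fit_transform(nodes)   # subjects x (metric, bundle, node) feature matrix
>>> subjects = mapper.subjects_       # subject IDs from the dataframe index
>>> groups = mapper.groups_           # feature index groups, one per bundle/metric combination
>>> names = mapper.feature_names_     # multi-indexed feature names

Passing bundle_agg_func="mean" instead returns one aggregated value per bundle/metric rather than one per node.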
Cross Validation#
This function validates model performance using cross-validation, while checkpointing the estimators and scores.
- afqinsight.cross_validate_checkpoint(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=False, return_estimator=False, error_score=nan, workdir=None, checkpoint=True, force_refresh=False, serialize_cv=False)[source]#
Evaluate metric(s) by cross-validation and also record fit/score times.
This is a copy of sklearn.model_selection.cross_validate() that uses _fit_and_score_ckpt() to checkpoint scores and estimators for each CV split. Read more in the sklearn user guide.
- Parameters:
  - estimator : estimator object implementing 'fit'
    The object to use to fit the data.
  - X : array-like of shape (n_samples, n_features)
    The data to fit. Can be, for example, a list or an array.
  - y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    The target variable to try to predict in the case of supervised learning.
  - groups : array-like of shape (n_samples,), default=None
    Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" cv instance (e.g., sklearn.model_selection.GroupKFold).
  - scoring : str, callable, list/tuple, or dict, default=None
    A single str (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.
    For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
    NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
    See Specifying multiple metrics for evaluation for an example.
    If None, the estimator's score method is used.
  - cv : int, cross-validation generator, or an iterable, default=None
    Determines the cross-validation splitting strategy. Possible inputs for cv are:
    - None, to use the default 5-fold cross validation,
    - int, to specify the number of folds in a (Stratified)KFold,
    - an sklearn CV splitter,
    - an iterable yielding (train, test) splits as arrays of indices.
    For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, sklearn.model_selection.StratifiedKFold is used. In all other cases, sklearn.model_selection.KFold is used. Refer to the sklearn user guide for the various cross-validation strategies that can be used here.
  - n_jobs : int, default=None
    The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the sklearn Glossary for more details.
  - verbose : int, default=0
    The verbosity level.
  - fit_params : dict, default=None
    Parameters to pass to the fit method of the estimator.
  - pre_dispatch : int or str, default='2*n_jobs'
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
    - An int, giving the exact number of total jobs that are spawned.
    - A str, giving an expression as a function of n_jobs, as in '2*n_jobs'.
  - return_train_score : bool, default=False
    Whether to include train scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However, computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
  - return_estimator : bool, default=False
    Whether to return the estimators fitted on each split.
  - error_score : 'raise' or numeric, default=np.nan
    Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
  - workdir : path-like object, default=None
    A string or path-like object indicating the directory in which to store checkpoint files.
  - checkpoint : bool, default=True
    If True, checkpoint the parameters, estimators, and scores.
  - force_refresh : bool, default=False
    If True, recompute scores even if the checkpoint file already exists. Otherwise, load scores from checkpoint files and return.
  - serialize_cv : bool, default=False
    If True, do not use joblib.Parallel to evaluate each CV split.
- Returns:
  - scores : dict of float arrays of shape (n_splits,)
    Array of scores of the estimator for each run of the cross validation.
    A dict of arrays containing the score/time arrays for each scorer is returned. The possible keys for this dict are:
    - test_score : The score array for test scores on each cv split. The suffix _score in test_score changes to a specific metric like test_r2 or test_auc if there are multiple scoring metrics in the scoring parameter.
    - train_score : The score array for train scores on each cv split. The suffix _score in train_score changes to a specific metric like train_r2 or train_auc if there are multiple scoring metrics in the scoring parameter. This is available only if the return_train_score parameter is True.
    - fit_time : The time for fitting the estimator on the train set for each cv split.
    - score_time : The time for scoring the estimator on the test set for each cv split. (Note that time for scoring on the train set is not included even if return_train_score is set to True.)
    - estimator : The estimator objects for each cv split. This is available only if the return_estimator parameter is set to True.
See also
- sklearn.model_selection.cross_val_score : Run cross-validation for single metric evaluation.
- sklearn.model_selection.cross_val_predict : Get predictions from each split of cross-validation for diagnostic purposes.
- sklearn.metrics.make_scorer : Make a scorer from a performance metric or loss function.
Examples
>>> import numpy as np
>>> import shutil
>>> import tempfile
>>> from sklearn import datasets, linear_model
>>> from afqinsight import cross_validate_checkpoint
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
Single metric evaluation using cross_validate_checkpoint:

>>> cv_results = cross_validate_checkpoint(lasso, X, y, cv=3, checkpoint=False)
>>> sorted(cv_results.keys())
['fit_time', 'score_time', 'test_score']
>>> cv_results['test_score']
array([0.33150..., 0.08022..., 0.03531...])
Multiple metric evaluation using cross_validate_checkpoint, an estimator pipeline, and checkpointing (please refer to the scoring parameter doc for more information):

>>> tempdir = tempfile.mkdtemp()
>>> scaler = StandardScaler()
>>> pipeline = make_pipeline(scaler, lasso)
>>> scores = cross_validate_checkpoint(pipeline, X, y, cv=3,
...                                    scoring=('r2', 'neg_mean_squared_error'),
...                                    return_train_score=True, checkpoint=True,
...                                    workdir=tempdir, return_estimator=True)
>>> shutil.rmtree(tempdir)
>>> print(scores['test_neg_mean_squared_error'])
[-2479.2... -3281.2... -3466.7...]
>>> print(scores['train_r2'])
[0.507... 0.602... 0.478...]