API Reference#

Datasets#

This class encapsulates an AFQ dataset and has static methods to read data from csv files conforming to the AFQ data standard.

class afqinsight.AFQDataset(X, y=None, groups=None, feature_names=None, group_names=None, target_cols=None, subjects=None, sessions=None, classes=None)[source]#

Represent AFQ features and targets.

The AFQDataset class represents tractometry features and, optionally, phenotypic targets.

The simplest way to create a new AFQDataset is to pass in the tractometric features and phenotypic targets explicitly.

>>> import numpy as np
>>> from afqinsight import AFQDataset
>>> AFQDataset(X=np.random.rand(50, 1000), y=np.random.rand(50))
AFQDataset(n_samples=50, n_features=1000, n_targets=1)

You can keep track of the names of the target variables with the target_cols parameter.

>>> AFQDataset(X=np.random.rand(50, 1000),
...            y=np.random.rand(50), target_cols=["age"])
AFQDataset(n_samples=50, n_features=1000, n_targets=1, targets=['age'])

Source Datasets:

The most common way to create an AFQDataset is to load data from a set of csv files that conform to the AFQ data format. For example,

>>> import os.path as op
>>> from afqinsight.datasets import download_sarica
>>> sarica_dir = download_sarica(verbose=False)
>>> dataset = AFQDataset.from_files(
...     fn_nodes=op.join(sarica_dir, "nodes.csv"),
...     fn_subjects=op.join(sarica_dir, "subjects.csv"),
...     dwi_metrics=["md", "fa"],
...     target_cols=["class"],
...     label_encode_cols=["class"],
... )
>>> dataset
AFQDataset(n_samples=48, n_features=4000, n_targets=1, targets=['class'])

AFQDatasets are indexable and can be sliced.

>>> dataset[0:10]
AFQDataset(n_samples=10, n_features=4000, n_targets=1, targets=['class'])

You can query the length of the dataset as well as the feature and target shapes.

>>> len(dataset)
48
>>> dataset.shape
((48, 4000), (48,))

Datasets can be used as expected in scikit-learn’s model selection functions. For example:

>>> from sklearn.model_selection import train_test_split
>>> train_data, test_data = train_test_split(dataset, test_size=0.3,
...                                          stratify=dataset.y)
>>> train_data
AFQDataset(n_samples=33, n_features=4000, n_targets=1, targets=['class'])
>>> test_data
AFQDataset(n_samples=15, n_features=4000, n_targets=1, targets=['class'])

You can drop samples from the dataset that have null target values using the drop_target_na method.

>>> dataset.y = dataset.y.astype(float)
>>> dataset.y[:5] = np.nan
>>> dataset.drop_target_na()
>>> dataset
AFQDataset(n_samples=43, n_features=4000, n_targets=1, targets=['class'])

Parameters:
X : array-like of shape (n_samples, n_features)

The feature samples.

y : array-like of shape (n_samples,) or (n_samples, n_targets), optional

Target values. This will be None if unsupervised is True.

groups : list of numpy.ndarray, optional

The feature indices for each feature group. These are typically used to group collections of “nodes” into white matter bundles.

feature_names : list of tuples, optional

The multi-indexed columns of X, i.e. the names of the features.

group_names : list of tuples, optional

The multi-indexed groups of X, i.e. the names of the feature groups.

target_cols : list of strings, optional

List of column names for the target variables in y.

subjects : list, optional

Subject IDs.

sessions : list, optional

Session IDs.

classes : dict, optional

Class labels for each label-encoded column specified in y.

Attributes:
shape

Return the shape of the features and targets.

Methods

as_tensorflow_dataset([bundles_as_channels, ...])

Return features and labels packaged as a tensorflow dataset.

as_torch_dataset([bundles_as_channels, ...])

Return features and labels packaged as a pytorch dataset.

bundle_means()

Return diffusion metrics averaged along the length of each bundle.

copy()

Return a deep copy of this dataset.

drop_target_na()

Drop subjects who have nan values as targets.

from_files([fn_nodes, fn_subjects, ...])

Create an AFQDataset from csv files.

from_study(study[, verbose])

Fetch an AFQ dataset from a predefined study.

model_fit(model, **fit_params)

Fit the dataset with a provided model object.

model_fit_transform(model, **fit_params)

Fit and transform the dataset with a provided model object.

model_predict(model, **predict_params)

Predict the targets with a provided model object.

model_score(model, **score_params)

Score a model on this dataset.

model_transform(model, **transform_params)

Transform the dataset with a provided model object.
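
The model_* convenience methods apply a scikit-learn compatible estimator directly to the dataset's features and targets. A minimal sketch, assuming the dataset loaded above and a default AFQ-Insight classifier pipeline (fitting may be slow on real data):

>>> from afqinsight import make_afq_classifier_pipeline
>>> means = dataset.bundle_means()               # per-bundle averages of each diffusion metric
>>> pipe = make_afq_classifier_pipeline()
>>> fitted = dataset.model_fit(pipe)             # roughly pipe.fit(dataset.X, dataset.y)
>>> predictions = dataset.model_predict(fitted)  # roughly fitted.predict(dataset.X)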

Pipelines#

These are AFQ-Insight's recommended estimator pipelines.

afqinsight.make_afq_regressor_pipeline(imputer='simple', scaler='standard', feature_transformer=False, ensemble_meta_estimator=None, imputer_kwargs=None, scaler_kwargs=None, feature_transformer_kwargs=None, ensemble_meta_estimator_kwargs=None, use_cv_estimator=True, memory=None, pipeline_verbosity=False, target_transformer=None, target_transform_func=None, target_transform_inverse_func=None, target_transform_check_inverse=True, **estimator_kwargs)[source]#

Return the recommended AFQ-specific regression pipeline.

This function returns a Pipeline instance with the following steps:

[imputer, scaler, feature_transformer, estimator]

where imputer imputes missing data due to individual subjects missing metrics along an entire bundle; scaler is optional and scales the features of the feature matrix; feature_transformer is optional and applies a transform featurewise to make data more Gaussian-like; and estimator is an instance of groupyr.SGLCV if use_cv_estimator=True or groupyr.SGL if use_cv_estimator=False. The estimator may optionally be wrapped in an ensemble meta-estimator specified by ensemble_meta_estimator and given the keyword arguments in ensemble_meta_estimator_kwargs. Additionally, the estimator may optionally be wrapped in sklearn.compose.TransformedTargetRegressor, such that the computation during fit is:

estimator.fit(X, target_transform_func(y))

or:

estimator.fit(X, target_transformer.transform(y))

The computation during predict is:

target_transform_inverse_func(estimator.predict(X))

or:

target_transformer.inverse_transform(estimator.predict(X))
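
For example, a pipeline with the default imputer and scaler can be built and then used like any other scikit-learn estimator (a minimal sketch; dataset stands in for an AFQDataset with continuous targets):

>>> from afqinsight import make_afq_regressor_pipeline
>>> pipe = make_afq_regressor_pipeline(imputer="simple", scaler="standard")
>>> # pipe is a standard sklearn Pipeline; fit and score it as usual, e.g.
>>> # pipe.fit(dataset.X, dataset.y) and pipe.score(dataset.X, dataset.y)
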
Parameters:
imputer : “simple”, “knn”, or sklearn-compatible transformer, default=”simple”

The imputer for missing data. String arguments result in the use of specific imputers/transformers: “simple” yields sklearn.impute.SimpleImputer; “knn” yields sklearn.impute.KNNImputer. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.

scaler : “standard”, “minmax”, “maxabs”, “robust”, or sklearn-compatible transformer, default=”standard”

The scaler to use for the feature matrix. String arguments result in the use of specific transformers: “standard” yields the sklearn.preprocessing.StandardScaler; “minmax” yields the sklearn.preprocessing.MinMaxScaler; “maxabs” yields the sklearn.preprocessing.MaxAbsScaler; “robust” yields the sklearn.preprocessing.RobustScaler. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.

feature_transformer : bool or sklearn-compatible transformer, default=False

An optional transformer for use on the feature matrix. If True, use sklearn.preprocessing.PowerTransformer. If False, skip this step. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.

ensemble_meta_estimator : “bagging”, “adaboost”, or None

An optional ensemble meta-estimator to combine the predictions of several base estimators. “adaboost” will result in the use of sklearn.ensemble.AdaBoostClassifier for classifier base estimators or sklearn.ensemble.AdaBoostRegressor for regressor base estimators. “bagging” will result in the use of sklearn.ensemble.BaggingClassifier for classifier base estimators or sklearn.ensemble.BaggingRegressor for regressor base estimators.

imputer_kwargs : dict, default=None

Keyword arguments for the imputer.

scaler_kwargs : dict, default=None

Keyword arguments for the scaler.

feature_transformer_kwargs : dict, default=None

Keyword arguments for the feature_transformer.

ensemble_meta_estimator_kwargs : dict, default=None

Keyword arguments for the ensemble meta-estimator.

use_cv_estimator : bool, default=True

If True, use groupyr.SGLCV as the final estimator. Otherwise, use groupyr.SGL.

memory : str or object with the joblib.Memory interface, default=None

Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.

pipeline_verbosity : bool, default=False

If True, the time elapsed while fitting each step will be printed as it is completed.

target_transformer : object, default=None

Estimator object such as derived from sklearn.base.TransformerMixin. Cannot be set at the same time as func and inverse_func. If transformer is None as well as func and inverse_func, the transformer will be an identity transformer. Note that the transformer will be cloned during fitting. Also, the transformer restricts y to be a numpy array.

target_transform_func : function, default=None

Function to apply to y before passing to fit. Cannot be set at the same time as transformer. The function needs to return a 2-dimensional array. If func is None, the function used will be the identity function.

target_transform_inverse_func : function, default=None

Function to apply to the prediction of the regressor. Cannot be set at the same time as transformer. The function needs to return a 2-dimensional array. The inverse function is used to return predictions to the same space as the original training labels.

target_transform_check_inverse : bool, default=True

Whether to check that transform followed by inverse_transform or func followed by inverse_func leads to the original targets.

**estimator_kwargs : kwargs

Keyword arguments passed to groupyr.SGLCV if use_cv_estimator=True or groupyr.SGL if use_cv_estimator=False.

Returns:
pipeline : Pipeline instance

afqinsight.make_afq_classifier_pipeline(imputer='simple', scaler='standard', feature_transformer=False, ensemble_meta_estimator=None, imputer_kwargs=None, scaler_kwargs=None, feature_transformer_kwargs=None, ensemble_meta_estimator_kwargs=None, use_cv_estimator=True, memory=None, pipeline_verbosity=False, target_transformer=None, target_transform_func=None, target_transform_inverse_func=None, target_transform_check_inverse=True, **estimator_kwargs)[source]#

Return the recommended AFQ-specific classification pipeline.

This function returns a Pipeline instance with the following steps:

[imputer, scaler, feature_transformer, estimator]

where imputer imputes missing data due to individual subjects missing metrics along an entire bundle; scaler is optional and scales the features of the feature matrix; feature_transformer is optional and applies a transform featurewise to make data more Gaussian-like; and estimator is an instance of groupyr.LogisticSGLCV if use_cv_estimator=True or groupyr.LogisticSGL if use_cv_estimator=False. The estimator may optionally be wrapped in an ensemble meta-estimator specified by ensemble_meta_estimator and given the keyword arguments in ensemble_meta_estimator_kwargs. Additionally, the estimator may optionally be wrapped in sklearn.compose.TransformedTargetRegressor, such that the computation during fit is:

estimator.fit(X, target_transform_func(y))

or:

estimator.fit(X, target_transformer.transform(y))

The computation during predict is:

target_transform_inverse_func(estimator.predict(X))

or:

target_transformer.inverse_transform(estimator.predict(X))
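
For example, a classification pipeline wrapped in a bagging meta-estimator might be constructed as follows (a minimal sketch; the argument values are illustrative and dataset stands in for an AFQDataset with class targets):

>>> from afqinsight import make_afq_classifier_pipeline
>>> pipe = make_afq_classifier_pipeline(use_cv_estimator=True,
...                                     ensemble_meta_estimator="bagging")
>>> # Fit and score like any sklearn estimator, e.g.
>>> # pipe.fit(dataset.X, dataset.y) and pipe.score(dataset.X, dataset.y)
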
Parameters:
imputer : “simple”, “knn”, or sklearn-compatible transformer, default=”simple”

The imputer for missing data. String arguments result in the use of specific imputers/transformers: “simple” yields sklearn.impute.SimpleImputer; “knn” yields sklearn.impute.KNNImputer. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.

scaler : “standard”, “minmax”, “maxabs”, “robust”, or sklearn-compatible transformer, default=”standard”

The scaler to use for the feature matrix. String arguments result in the use of specific transformers: “standard” yields the sklearn.preprocessing.StandardScaler; “minmax” yields the sklearn.preprocessing.MinMaxScaler; “maxabs” yields the sklearn.preprocessing.MaxAbsScaler; “robust” yields the sklearn.preprocessing.RobustScaler. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.

feature_transformer : bool or sklearn-compatible transformer, default=False

An optional transformer for use on the feature matrix. If True, use sklearn.preprocessing.PowerTransformer. If False, skip this step. Custom transformers are allowed as long as they inherit from sklearn.base.TransformerMixin.

ensemble_meta_estimator : “bagging”, “adaboost”, or None

An optional ensemble meta-estimator to combine the predictions of several base estimators. “adaboost” will result in the use of sklearn.ensemble.AdaBoostClassifier and “bagging” will result in the use of sklearn.ensemble.BaggingClassifier.

imputer_kwargs : dict, default=None

Keyword arguments for the imputer.

scaler_kwargs : dict, default=None

Keyword arguments for the scaler.

feature_transformer_kwargs : dict, default=None

Keyword arguments for the feature_transformer.

ensemble_meta_estimator_kwargs : dict, default=None

Keyword arguments for the ensemble meta-estimator.

use_cv_estimator : bool, default=True

If True, use groupyr.LogisticSGLCV as the final estimator. Otherwise, use groupyr.LogisticSGL.

memory : str or object with the joblib.Memory interface, default=None

Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.

pipeline_verbosity : bool, default=False

If True, the time elapsed while fitting each step will be printed as it is completed.

target_transformer : object, default=None

Estimator object such as derived from sklearn.base.TransformerMixin. Cannot be set at the same time as func and inverse_func. If transformer is None as well as func and inverse_func, the transformer will be an identity transformer. Note that the transformer will be cloned during fitting. Also, the transformer restricts y to be a numpy array.

target_transform_func : function, default=None

Function to apply to y before passing to fit. Cannot be set at the same time as transformer. The function needs to return a 2-dimensional array. If func is None, the function used will be the identity function.

target_transform_inverse_func : function, default=None

Function to apply to the prediction of the regressor. Cannot be set at the same time as transformer. The function needs to return a 2-dimensional array. The inverse function is used to return predictions to the same space as the original training labels.

target_transform_check_inverse : bool, default=True

Whether to check that transform followed by inverse_transform or func followed by inverse_func leads to the original targets.

**estimator_kwargs : kwargs

Keyword arguments passed to groupyr.LogisticSGLCV if use_cv_estimator=True or groupyr.LogisticSGL if use_cv_estimator=False.

Returns:
pipeline : Pipeline instance

Transformers#

These transformers convert tractometry information from the AFQ standard data format into feature matrices that are ready for ingestion into sklearn-compatible pipelines.

class afqinsight.AFQDataFrameMapper(pd_interpolate_kwargs=None, bundle_agg_func=None, concat_subject_session=False, **dataframe_mapper_kwargs)[source]#

Map pandas dataframe to sklearn feature matrix.

This object first converts an AFQ nodes.csv dataframe into a feature matrix with rows corresponding to subjects and columns corresponding to tract profile values. It interpolates along tracts to fill missing values and then maps the dataframe onto a 2D feature matrix for ingestion into sklearn-compatible estimators. It also maintains attributes for the subject index, feature names, and groups of features.
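
Typical usage reads an AFQ nodes file into a pandas DataFrame and transforms it into a feature matrix (a minimal sketch; the file path is illustrative):

>>> import pandas as pd
>>> from afqinsight import AFQDataFrameMapper
>>> nodes_df = pd.read_csv("nodes.csv")   # an AFQ nodes file (illustrative path)
>>> mapper = AFQDataFrameMapper()
>>> X = mapper.fit_transform(nodes_df)    # 2D feature matrix, one row per subject
>>> groups = mapper.groups_               # feature indices for each bundle/metric group
>>> feature_names = mapper.feature_names_
>>> subjects = mapper.subjects_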

Parameters:
**dataframe_mapper_kwargs : kwargs, default=dict(features=[], default=None)

Keyword arguments passed to sklearn_pandas.DataFrameMapper. You will probably not need to change these defaults.

pd_interpolate_kwargs : kwargs, default=dict(method=”linear”, limit_direction=”both”, limit_area=”inside”)

Keyword arguments passed to pandas.DataFrame.interpolate. Missing values are interpolated within the tract profile so that no data is used from other subjects, tracts, or metrics, minimizing the chance of train/test leakage. You will probably not need to change these defaults.

bundle_agg_func : function, str, list or dict, optional

If provided, a function to use for aggregating the nodes in each tract. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

  • function

  • string function name

  • list of functions and/or function names, e.g. [np.sum, ‘mean’]

By default, this mapper will not aggregate but will return values at each node.
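
For example, to reduce each tract profile to a single mean value per subject, bundle, and metric, one might pass a string aggregation function (an illustrative sketch that reuses the nodes_df DataFrame from the example above):

>>> mean_mapper = AFQDataFrameMapper(bundle_agg_func="mean")
>>> X_means = mean_mapper.fit_transform(nodes_df)   # one value per subject, bundle, and metric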

Attributes:
subjects_ : list

List of subject IDs retrieved from pandas dataframe index.

groups_ : list of numpy.ndarray

List of arrays of non-overlapping indices for each group. For example, if nine features are grouped into equal contiguous groups of three, then groups would be [array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])].

feature_names_ : list of tuples

Return the feature names.

Methods

fit(X[, y])

Fit a transform from the given dataframe.

fit_transform(X[, y])

Fit a transform from the given dataframe and apply directly to given data.

get_names(columns, transformer, x[, alias, ...])

Return verbose names for the transformed columns.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform the input data.

get_dtype

get_dtypes

Cross Validation#

This function validates model performance using cross-validation, while checkpointing the estimators and scores.

afqinsight.cross_validate_checkpoint(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=False, return_estimator=False, error_score=nan, workdir=None, checkpoint=True, force_refresh=False, serialize_cv=False)[source]#

Evaluate metric(s) by cross-validation and also record fit/score times.

This is a copy of sklearn.model_selection.cross_validate() that uses _fit_and_score_ckpt() to checkpoint scores and estimators for each CV split. Read more in the sklearn user guide.

Parameters:
estimator : estimator object implementing ‘fit’

The object to use to fit the data.

X : array-like of shape (n_samples, n_features)

The data to fit. Can be for example a list, or an array.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None

Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., sklearn.model_selection.GroupKFold).

scoring : str, callable, list/tuple, or dict, default=None

A single str (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See Specifying multiple metrics for evaluation for an example.

If None, the estimator’s score method is used.

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,

  • int, to specify the number of folds in a (Stratified)KFold,

  • an sklearn CV splitter,

  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, sklearn.model_selection.StratifiedKFold is used. In all other cases, sklearn.model_selection.KFold is used. Refer to the sklearn user guide for the various cross-validation strategies that can be used here.

n_jobs : int, default=None

The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See sklearn Glossary for more details.

verbose : int, default=0

The verbosity level.

fit_params : dict, default=None

Parameters to pass to the fit method of the estimator.

pre_dispatch : int or str, default=’2*n_jobs’

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

  • An int, giving the exact number of total jobs that are spawned

  • A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

return_train_score : bool, default=False

Whether to include train scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.

return_estimator : bool, default=False

Whether to return the estimators fitted on each split.

error_score : ‘raise’ or numeric

Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

workdir : path-like object, default=None

A string or path-like object indicating the directory in which to store checkpoint files.

checkpoint : bool, default=True

If True, checkpoint the parameters, estimators, and scores.

force_refresh : bool, default=False

If True, recompute scores even if the checkpoint file already exists. Otherwise, load scores from checkpoint files and return.

serialize_cv : bool, default=False

If True, do not use joblib.Parallel to evaluate each CV split.

Returns:
scores : dict of float arrays of shape (n_splits,)

Array of scores of the estimator for each run of the cross validation.

A dict of arrays containing the score/time arrays for each scorer is returned. The possible keys for this dict are:

test_score

The score array for test scores on each cv split. Suffix _score in test_score changes to a specific metric like test_r2 or test_auc if there are multiple scoring metrics in the scoring parameter.

train_score

The score array for train scores on each cv split. Suffix _score in train_score changes to a specific metric like train_r2 or train_auc if there are multiple scoring metrics in the scoring parameter. This is available only if return_train_score parameter is True.

fit_time

The time for fitting the estimator on the train set for each cv split.

score_time

The time for scoring the estimator on the test set for each cv split. (Note that time for scoring on the train set is not included even if return_train_score is set to True.)

estimator

The estimator objects for each cv split. This is available only if return_estimator parameter is set to True.

See also

sklearn.model_selection.cross_val_score

Run cross-validation for single metric evaluation.

sklearn.model_selection.cross_val_predict

Get predictions from each split of cross-validation for diagnostic purposes.

sklearn.metrics.make_scorer

Make a scorer from a performance metric or loss function.

Examples

>>> import numpy as np
>>> import shutil
>>> import tempfile
>>> from sklearn import datasets, linear_model
>>> from afqinsight import cross_validate_checkpoint
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()

Single metric evaluation using cross_validate_checkpoint:

>>> cv_results = cross_validate_checkpoint(lasso, X, y, cv=3, checkpoint=False)
>>> sorted(cv_results.keys())
['fit_time', 'score_time', 'test_score']
>>> cv_results['test_score']  
array([0.33150..., 0.08022..., 0.03531...])

Multiple metric evaluation using cross_validate_checkpoint, an estimator pipeline, and checkpointing (please refer to the scoring parameter documentation for more information):

>>> tempdir = tempfile.mkdtemp()
>>> scaler = StandardScaler()
>>> pipeline = make_pipeline(scaler, lasso)
>>> scores = cross_validate_checkpoint(pipeline, X, y, cv=3,
...                         scoring=('r2', 'neg_mean_squared_error'),
...                         return_train_score=True, checkpoint=True,
...                         workdir=tempdir, return_estimator=True)
>>> shutil.rmtree(tempdir)
>>> print(scores['test_neg_mean_squared_error'])
[-2479.2... -3281.2... -3466.7...]
>>> print(scores['train_r2'])
[0.507... 0.602... 0.478...]