Preprocessing module

Overview: the “chain” concept

The preprocessing module includes preprocessing steps to apply before running a sorter. Example processing steps include filtering, removing noise, removing bad channels, etc. Preprocessors are lazy, meaning that no computation is performed until it is required (usually at the spike sorting step). This enables one to build preprocessing chains to be applied in sequence to a ~spikeinterface.core.BaseRecording object. This is possible because each preprocessing step returns a new ~spikeinterface.core.BaseRecording that can be input to the next step in the chain.

In this code example, we build a preprocessing chain with two steps:

bandpass filter
common median reference (CMR)

from spikeinterface.preprocessing import bandpass_filter, common_reference

# recording is a RecordingExtractor object
recording_f = bandpass_filter(recording=recording, freq_min=300, freq_max=6000)
recording_cmr = common_reference(recording=recording_f, operator="median")

These two preprocessors will not compute anything at instantiation, but the computation will be “on-demand” (“on-the-fly”) when getting traces.

traces = recording_cmr.get_traces(start_frame=100_000, end_frame=200_000)

Most sorters (see Internal sorters) can work directly on these preprocessed objects so there is no need to save the object:

# here the spykingcircus2 sorter directly uses the lazy "recording_cmr" object
sorting = run_sorter(sorter='spykingcircus2', recording=recording_cmr, sorter_name='spykingcircus2')

Some sorters, however, will need a binary file as input, so we can optionally save the processed recording with the efficient SpikeInterface save() function:

recording_saved = recording_cmr.save(folder="/path/to/preprocessed", n_jobs=8, chunk_duration='1s')

In this case, the save() function will process in parallel our original recording with the bandpass filter and CMR, and save it to a binary file in the “/path/to/preprocessed” folder. The recording_saved is yet another ~spikeinterface.core.BaseRecording which maps directly to the newly created binary file, for very quick access.

NOTE: some sorters will automatically perform the saving operation internally.

The Preprocessing Pipeline

The module also contains the PreprocessingPipeline object which aims to allow users to easily share pipelines across labs. The input to create the pipeline is a dictionary of preprocessing steps whose keys are the names of the steps and values are dictionaries of parameters. For example, to construct a pipeline consisting of highpass filtering with a minimum frequency of 250 Hz followed by whitening with default parameters, and finally a detect and remove bad channels step. We first make the appropriate dictionary

from spikeinterface.preprocessing import apply_preprocessing_pipeline, PreprocessingPipeline

preprocessing_dict = {
    'highpass_filter': {'freq_min': 250},
    'whiten': {},
    'detect_and_remove_bad_channels': {},
}

We can then pass this dictionary to the apply_preprocessing_pipeline function to make a preprocessed recording

preprocessed_recording = apply_preprocessing_pipeline(recording, preprocessing_dict)

Alternatively, we can construct a PreprocessingPipeline, allowing us to investigate the pipeline before using it.

preprocessing_pipeline = PreprocessingPipeline(preprocessing_dict)
# to view the pipeline:
preprocessing_pipeline

PreprocessingPipeline

Initial Recording

↓

highpass_filter

freq_min: 250
margin_ms: 5.0
dtype: None
**filter_kwargs: None

whiten

dtype: None
apply_mean: False
regularize: False
regularize_kwargs: None
mode: 'global'
radius_um: 100.0
int_scale: None
eps: None
W: None
M: None
**random_chunk_kwargs: None

detect_and_remove_bad_channels

parent_recording: None
bad_channel_ids: None
channel_labels: None
**detect_bad_channels_kwargs: None

↓

Preprocessed Recording

Once we have the pipeline, we can apply it to a recording in the same way as applying the dictionary

preprocessed_recording_again = apply_preprocessing_pipeline(recording, preprocessing_pipeline)

To share the pipeline you have made with another lab, you can simply share the dictionary. The dictionary can also be obtained from the pipeline object directly:

dict_used_to_make_pipeline = preprocessing_pipeline.preprocessor_dict

Some preprocessing steps, such as detect_and_remove_artifacts, allow you to specify an input recording and optionally another recording to perform some computation (e.g., detect artifacts on the output of a previous preprocessor, but remove them on the the parent preprocessor). In this case, the string “pipeline[preprocessor_name]” can be used in the dictionary to specify that the recording argument for this step should be the output of a previous preprocessor in the same pipeline. For example, if we want to use the output of the “bandpass_filter” step as the recording to detect artifacts, we can specify it as follows:

preprocessing_dict = {
    'bandpass_filter': {'freq_min': 250},
    'common_reference': {'operator': 'median', 'reference': 'global'},
    'detect_and_remove_artifacts': {'recording_to_detect': 'pipeline[bandpass_filter]'},
}

This will detect artifacts on the output of the “bandpass_filter” step, but the artifacts will be removed on the output of the “common_reference” step (since the parent recording for “detect_and_remove_artifacts” is by default the output of the previous step in the pipeline, which is “common_reference” in this case). To specify the “raw” recording, i.e., the input to the pipeline, we can use “pipeline[raw]”. For example, if we want to detect artifacts on the raw recording, we can specify it as follows:

preprocessing_dict = {
    'bandpass_filter': {'freq_min': 250},
    'common_reference': {'operator': 'median', 'reference': 'global'},
    'detect_and_remove_artifacts': {'recording_to_detect': 'pipeline[raw]'},
}

Impact on recording dtype

By default the dtype of a preprocessed recording does not change the recording’s dtype, even if, internally, the computation is performed using a different dtype. For instance if we have a int16` recording, the application of a bandpass filter will preserve the original dtype (unless specified otherwise):

import spikeinterface.extractors as se
# spikeGLX is int16
rec_int16 = se.read_spikeglx(folder_path"my_folder")
# by default the int16 is kept
rec_f = bandpass_filter(recording=rec_int16, freq_min=300, freq_max=6000)
# we can force a float32 casting
rec_f2 = bandpass_filter(recording=rec_int16, freq_min=300, freq_max=6000, dtype='float32')

Some scaling pre-processors, such as whiten() or zscore(), will force the output to float32.

When converting from a float to an int, the value will first be rounded to the nearest integer.

Available preprocessing

We have many preprocessing functions that can be flexibly added to a pipeline.

The full list of preprocessing functions can be found here: spikeinterface.preprocessing

Here is a full list of possible preprocessing steps, grouped by type of processing:

For all examples rec is a RecordingExtractor.

filter() / bandpass_filter() / notch_filter() / highpass_filter()

There are several variants of filtering (e.g., bandpass, highpass, notch).

Filtering steps are implemented using scipy.signal.

Important aspects of filtering functions:

they use a margin internally to deal with border effects
they perform forward-backward filtering (filtfilt)
they can use ‘ba’ or ‘sos’ mode

rec_f = bandpass_filter(recording=rec, freq_min=300, freq_max=6000)

common_reference()

A very common operation to remove noise is to re-reference traces. This is implemented with the common_reference() function.

There are various options when combining operator and reference arguments:

using “median” or “average” (average is faster, but median is less sensitive to outliers)
using “global” / “local” / “single” references

rec_cmr = common_reference(recording=rec, operator="median", reference="global")

common_reference()

phase_shift()

Recording system often do not sample all channels simultaneously. In fact, there is a small delay (less that a sampling period) in between channels. For instance this is the case for Neuropixels devices.

Applying common_reference() on this data does not correctly remove artifacts, since we first need to compensate for these small delays. This is exactly what phase_shift() does.

This function relies on an internal property of the recording called inter_sample_shift. For Neuropixels recordings (read with the read_spikeglx() or the read_openephys() functions), the inter_sample_shift is automatically loaded from the metadata and set.

Calling phase_shift() alone has almost no effect, but combined with common_reference() it can make a real difference on artifact removal.

rec_shift = phase_shift(recording=rec)
rec_cmr = common_reference(recording=rec_shift, operator="median", reference="global")

CatGT and IBL destriping are both based on this idea (see How to implement “IBL destriping” or “SpikeGLX CatGT” in SpikeInterface).

phase_shift()

normalize_by_quantile() /scale() / center() / zscore()

We have several “scalers” to apply some gains and offsets on traces.

scale() is the base function to apply user-defined gains and offsets to every channels.

zscore() estimates median/mad (or mean/std) of each channel and then applies the scale function to obtain centered with unitary variance on each channel.

rec_normed = zscore(recording=rec)

whiten()

Many sorters use this pre-processing step internally, but if you want to combine this operation with other preprocessing steps, you can use the whiten() implemented in SpikeInterface. The whitenning matrix W is constructed by estimating the covariance across channels and then inverting it.

The whitened traces are then the dot product between the traces and the W matrix.

rec_w = whiten(recording=rec)

whiten()

clip() / blank_saturation()

We can limit traces between a user-defined minimum and maximum using clip() function. The blank_saturation() function is similar, but it automatically estimates the limits by using quantiles.

rec_w = clip(recording=rec, a_min=-250., a_max=260)

highpass_spatial_filter()

highpass_spatial_filter() is a preprocessing step introduced by the International Brain Laboratory [IBL_spikesorting]. It applies a filter in the spatial axis of the traces after ordering the channels by depth. It is similar to common reference, but it can deal with “stripes” that are uneven across depth. This preprocessing step can be super useful for long probes like Neuropixels.

This is part of the “destriping” from IBL (see How to implement “IBL destriping” or “SpikeGLX CatGT” in SpikeInterface).

highpass_spatial_filter()

detect_bad_channels() / interpolate_bad_channels()

The detect_bad_channels() can be used to detect bad channels with several methods, including an std- or mad-based approach to detect bad channels with abnormally high power and the coherence+psd method (introduced by [IBL_spikesorting]), which detects bad channels looking at both coherence with other channels and PSD power in the high-frequency range.

Note: The coherence+psd method must be run on individual probes/shanks separately since it uses the coherence of the signal across the depth of the probe. See Processing a Recording by Channel Group for more information.

The function returns both the bad_channel_ids and channel_labels, which can be good, noise, dead, or out (outside of the brain). Note that the dead and out are only available with the coherence+psd method.

Bad channels can then either be removed from the recording using recording.remove_channels(bad_channel_ids) or be interpolated with the interpolate_bad_channels() function (channels labeled as out should always be removed):

# detect
bad_channel_ids, channel_labels = detect_bad_channels(recording=rec)
# Case 1 : remove then
rec_clean = recording.remove_channels(remove_channel_ids=bad_channel_ids)
# Case 2 : interpolate then
rec_clean = interpolate_bad_channels(recording=rec, bad_channel_ids=bad_channel_ids)

Once you have tested these functions and decided on your workflow, you can use the detect_and_* functions to do everything at once. These return a Preprocessor class, so are consistent with the “chain” concept for this module. For example:

# detect and remove bad channels
rec_only_good_channels = detect_and_remove_bad_channels(recording=rec)

# detect and interpolate the bad channels
rec_interpolated_channels = detect_and_interpolate_bad_channels(recording=rec)

rectify()

This step returns traces in absolute values. It could be used to compute a proxy signal of multi-unit activity (MUA).

rectify()

remove_artifacts()

Given an external list of trigger times, remove_artifacts() function can remove artifacts with several strategies:

replace with zeros (blank) 'zeros'
make a linear ('linear') or cubic ('cubic') interpolation
remove the median ('median') or average ('avereage') template (with optional time jitter and amplitude scaling correction)

rec_clean = remove_artifacts(recording=rec, list_triggers=[100, 200, 300], mode='zeros')

remove_artifacts()

astype() / unsigned_to_signed()

Similarly to numpy.astype(), the astype() casts the traces to the desired dtype:

rec_int16 = astype(recording=rec_float, dtype="int16")

For recordings whose traces are unsigned (e.g. Maxwell Biosystems), the unsigned_to_signed() function makes them signed by removing the unsigned “offset”. For example, uint16 traces will be first upcast to uint32, 2**15 is subtracted, and the traces are finally cast to int16:

rec_int16 = unsigned_to_signed(recording=rec_uint16)

zero_channel_pad()

Pads a recording with extra channels that contain only zeros. This step can be useful when a certain shape is required.

rec_with_more_channels = zero_channel_pad(parent_recording=rec, num_channels=128)

zero_channel_pad()

gaussian_filter()

Implementation of a gaussian filter for high/low/bandpass filters. Note that the the gaussian filter response is not very steep.

# highpass
rec_hp = gaussian_filter(recording=rec, freq_min=300, freq_max=None)
# lowpass
rec_lp = gaussian_filter(recording=rec, freq_min=None, freq_max=500)
# bandpass
rec_bp = gaussian_filter(recording=rec, freq_min=300, freq_max=2000)

gaussian_filter()

Motion/drift correction

Motion/drift correction is one of the most sophisticated preprocessing. See the Motion/drift correction page for a full explanation.

deepinterpolation() (experimental)

This step (experimental) applies the inference step of a DeepInterpolation denoiser model [DeepInterpolation].

deepinterpolation()

How to implement “IBL destriping” or “SpikeGLX CatGT” in SpikeInterface

SpikeGLX has a built-in function called CatGT to apply some preprocessing on the traces to remove noise and artifacts.

IBL also has a standardized pipeline for preprocessed traces a bit similar to CatGT which is called “destriping” [IBL_spikesorting]. In both these cases, the entire traces are read, processed and written back to a file.

SpikeInterface can reproduce similar results without the need to write back to a file by building a lazy preprocessing chain. Optionally, the result can still be written to a binary (or a zarr) file.

Here is a recipe to mimic the IBL destriping:

rec = read_spikeglx(folder_path='my_spikeglx_folder')
rec = highpass_filter(recording=rec, n_channel_pad=60)
rec = phase_shift(recording=rec)
bad_channel_ids = detect_bad_channels(recording=rec)
rec = interpolate_bad_channels(recording=rec, bad_channel_ids=bad_channel_ids)
rec = highpass_spatial_filter(recording=rec)
# optional
rec.save(folder='clean_traces', n_jobs=10, chunk_duration='1s', progres_bar=True)

Here is a recipe to mimic the SpikeGLX CatGT:

rec = read_spikeglx(folder_path='my_spikeglx_folder')
rec = phase_shift(recording=rec)
rec = common_reference(recording=rec, operator="median", reference="global")
# optional
rec.save(folder='clean_traces', n_jobs=10, chunk_duration='1s', progres_bar=True)

Of course, these pipelines can be enhanced and customized using other available steps in the spikeinterface.preprocessing module!

Preprocessing on Snippets

Some preprocessing steps are available also for BaseSnippets objects:

align_snippets()

This function aligns waveform snippets.

align_snippets()

References

[IBL_spikesorting] (1,2,3)

International Brain Laboratory. “Spike sorting pipeline for the International Brain Laboratory”. 4 May 2022. 9 Jun 2022.

[DeepInterpolation]

Lecoq, Jérôme, et al. “Removing independent noise in systems neuroscience data using DeepInterpolation.” Nature methods 18.11 (2021): 1401-1408.