Datasets

There are two main types of datasets: torchsig.datasets.datasets.TorchSigIterableDataset and torchsig.datasets.datasets.StaticTorchSigDataset.

TorchSigIterableDataset is for generating synthetic data in memory (infinitely).

To then save a dataset to disk, use a torchsig.utils.writer.DatasetCreator which accepts a TorchSigIterableDataset object.

StaticTorchSigDataset (torchsig.datasets.StaticTorchSigDataset) is for loading a saved dataset from disk. Samples can be accessed in any order, and previously generated samples remain accessible.

Note: If a TorchSigIterableDataset is written to disk with no transforms and no target transforms, it is considered raw. Otherwise, it is considered processed. When a raw dataset is loaded back in with a StaticTorchSigDataset object, users can define transforms and target transforms to apply. When a processed dataset is loaded back in, users cannot define any transforms or target transforms.
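
The raw/processed rule in the note can be sketched as a small predicate. This is only an illustration: classify_saved_dataset is a hypothetical helper, not a TorchSig function, and it assumes that "no transforms" means an empty list or None.

```python
def classify_saved_dataset(transforms, target_transforms):
    """Return "raw" or "processed" following the rule in the note.

    Hypothetical helper, not part of TorchSig. Assumption: an empty
    list and None both count as "no transforms".
    """
    has_transforms = bool(transforms)
    has_target_transforms = bool(target_transforms)
    if has_transforms or has_target_transforms:
        return "processed"
    return "raw"


# Nothing was applied at write time, so transforms may be chosen at load time.
print(classify_saved_dataset([], []))
# A transform was applied at write time, so the saved data is fixed.
print(classify_saved_dataset([abs], []))
```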

Base Classes

TorchSig Datasets

Dataset Base Classes for creation and static loading.

class torchsig.datasets.datasets.TorchSigDatasetConfig(dataset_id: str, dataset_length: int, seed: int, impairment_level: int, output_representation: Literal['iq', 'spectrogram'], output_spectrogram_fft: int | None, signal_sampling_mode: Literal['per_signal', 'per_family'], dataset_metadata: dict[str, Any])

Bases: object

Configuration dataclass for TorchSig datasets.

dataset_id

A unique identifier for the dataset.

Type:

str

dataset_length

The total number of samples in the dataset.

Type:

int

seed

A random seed for reproducibility.

Type:

int

impairment_level

The level of impairment to apply to the signals.

Type:

int

output_representation

The representation of the output data (e.g., “iq” or “spectrogram”).

Type:

Literal[‘iq’, ‘spectrogram’]

output_spectrogram_fft

The FFT size to use when generating spectrograms (if output_representation is “spectrogram”).

Type:

int | None

signal_sampling_mode

The mode for sampling signals, either “per_signal” or “per_family”.

Type:

Literal[‘per_signal’, ‘per_family’]

dataset_metadata

A dictionary containing additional metadata about the dataset.

Type:

dict[str, Any]

dataset_id: str
dataset_length: int
seed: int
impairment_level: int
output_representation: Literal['iq', 'spectrogram']
output_spectrogram_fft: int | None
signal_sampling_mode: Literal['per_signal', 'per_family']
dataset_metadata: dict[str, Any]
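
To make the field list concrete, here is a stand-in dataclass with the same field names and types. ToyDatasetConfig is not the TorchSig class, all values are made up for illustration, and the check in __post_init__ is an assumption that the source does not state.

```python
from dataclasses import asdict, dataclass
from typing import Any, Literal, Optional


@dataclass
class ToyDatasetConfig:
    """Stand-in mirroring the documented fields; not the TorchSig class."""

    dataset_id: str
    dataset_length: int
    seed: int
    impairment_level: int
    output_representation: Literal["iq", "spectrogram"]
    output_spectrogram_fft: Optional[int]
    signal_sampling_mode: Literal["per_signal", "per_family"]
    dataset_metadata: dict[str, Any]

    def __post_init__(self):
        # Assumption (not from the source): spectrogram output needs an FFT size.
        if (
            self.output_representation == "spectrogram"
            and self.output_spectrogram_fft is None
        ):
            raise ValueError("output_spectrogram_fft is required for spectrograms")


# Illustrative values only.
config = ToyDatasetConfig(
    dataset_id="demo",
    dataset_length=1000,
    seed=1234,
    impairment_level=0,
    output_representation="iq",
    output_spectrogram_fft=None,
    signal_sampling_mode="per_signal",
    dataset_metadata={},
)
print(asdict(config))
```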

torchsig.datasets.datasets.apply_label_to_signal(sample: Signal, target_label: str) -> list

Recursively applies the specified label to a signal sample and its components.

Parameters:
  • sample – The signal sample to apply the label to.

  • target_label – The label that should be used to identify relevant values in the signal sample.

Returns:

A list of values corresponding to the label specified in the sample and its component signals.
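
The recursion described here can be sketched with a toy signal type. ToySignal, its metadata and component_signals attributes, the class_name label, and the ordering of the returned values are all assumptions made for illustration. The real Signal class may be organized differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToySignal:
    """Hypothetical stand-in for a Signal: metadata plus nested components."""

    metadata: dict
    component_signals: list = field(default_factory=list)


def collect_label(sample, target_label):
    """Gather target_label from a sample and, recursively, its components."""
    values = []
    if target_label in sample.metadata:
        values.append(sample.metadata[target_label])
    for component in sample.component_signals:
        values.extend(collect_label(component, target_label))
    return values


# A mixture with no label of its own, holding two labeled components.
mixture = ToySignal(
    metadata={},
    component_signals=[
        ToySignal(metadata={"class_name": "bpsk"}),
        ToySignal(metadata={"class_name": "qpsk"}),
    ],
)
labels = collect_label(mixture, "class_name")
print(labels)
```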

torchsig.datasets.datasets.apply_transforms_and_labels_to_signal(sample: Signal, transforms: list[Transform | callable], target_labels: list) -> Signal | np.ndarray | tuple

Applies a series of transformations to a signal sample and retrieves specified label values.

Parameters:
  • sample – The signal sample to process.

  • transforms – A list of function objects, each taking a Signal object and returning a transformed Signal object.

  • target_labels – Labels to be retrieved from the signal sample after transformations. If None, the transformed signal is returned. If an empty list, the signal data is returned.

Returns:

  • If target_labels is None, a Signal object with all applied transforms.

  • If target_labels is an empty list, the numpy.ndarray data of the sample.

  • If target_labels contains one label, a tuple of (sample_data, target_value).

  • If target_labels contains multiple labels, a tuple of (sample_data, [target_values]).
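
The four return shapes listed above can be sketched as follows. This is a simplified stand-in, not the library's implementation: ToySample is hypothetical, it stores data as a plain list rather than a numpy.ndarray, and the label names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToySample:
    """Hypothetical stand-in for a Signal with data and metadata."""

    data: list
    metadata: dict = field(default_factory=dict)


def apply_transforms_and_labels(sample, transforms, target_labels):
    """Apply transforms in order, then shape the result by target_labels."""
    for transform in transforms:
        sample = transform(sample)
    if target_labels is None:
        return sample  # the transformed sample object itself
    if len(target_labels) == 0:
        return sample.data  # data only
    values = [sample.metadata[label] for label in target_labels]
    if len(values) == 1:
        return sample.data, values[0]  # (data, single target value)
    return sample.data, values  # (data, list of target values)


def double(sample):
    """Toy transform: scale every data value by two."""
    return ToySample([2 * x for x in sample.data], sample.metadata)


sample = ToySample([1, 2, 3], {"class_name": "bpsk", "snr_db": 10})
print(apply_transforms_and_labels(sample, [double], None))
print(apply_transforms_and_labels(sample, [double], []))
print(apply_transforms_and_labels(sample, [double], ["class_name"]))
print(apply_transforms_and_labels(sample, [double], ["class_name", "snr_db"]))
```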

class torchsig.datasets.datasets.TorchSigIterableDataset(signal_generators: str | ConcatSignalGenerator | list = 'all', transforms: list[Transform | callable] = [], component_transforms: list[Transform | callable] = [], target_labels: list | None = None, validate_init: bool = True, **kwargs)

Bases: HierarchicalMetadataObject, IterableDataset

Base class for generating signals.

The dataset will continue to generate samples infinitely.
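
Because the iterator never ends, a consumer has to bound it explicitly. The sketch below uses a plain Python class as a stand-in (ToyInfiniteDataset is hypothetical and does not subclass the PyTorch IterableDataset) and itertools.islice to take a fixed number of samples.

```python
import itertools
import random


class ToyInfiniteDataset:
    """Hypothetical stand-in: yields random numbers forever."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def __iter__(self):
        while True:  # never raises StopIteration
            yield self._rng.random()


# Take exactly five samples; a bare for-loop would run forever.
samples = list(itertools.islice(ToyInfiniteDataset(seed=42), 5))
print(len(samples))
```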

signal_generators

The signal generators to use. Can be a string, ConcatSignalGenerator, or list.

transforms

List of transforms to apply to the entire signal.

component_transforms

List of transforms to apply to individual signal components.

target_labels

Labels to extract from the signal.

validate_init

Whether to validate metadata during initialization.

init_signal_generator(signal_generator: str | callable) -> None

Initializes the signal generator.

Parameters:

signal_generator – The signal generator to be initialized. If a string, it is first looked up to retrieve the corresponding signal generator function.

Raises:

TypeError – If the signal_generator is neither a string nor a callable.

add_signal_generator(signal_generator: callable, class_name: str | None = None, class_index: int | None = None, likelihood: int = 1) -> None

Adds a signal generator to this dataset.

Parameters:
  • signal_generator – A callable object which takes no arguments and returns a Signal.

  • class_name – (optional) A name for this signal class in the dataset. If None, the signal will be generated and added to the data, but no labels will be made for the signal.

  • likelihood – (optional) The relative likelihood of this signal type in the dataset. Doubling the likelihood will make this signal twice as likely to be placed in the data.
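
One way to read "relative likelihood" is as a sampling weight. The sketch below illustrates that reading with random.choices from the standard library. ToyGeneratorPool is hypothetical, and the source does not say that TorchSig selects generators this way.

```python
import random


class ToyGeneratorPool:
    """Hypothetical pool that picks generators by relative likelihood."""

    def __init__(self, seed=0):
        self._generators = []
        self._likelihoods = []
        self._rng = random.Random(seed)

    def add(self, generator, likelihood=1):
        """Register a zero-argument callable with a relative weight."""
        self._generators.append(generator)
        self._likelihoods.append(likelihood)

    def probabilities(self):
        """Normalize the relative likelihoods into probabilities."""
        total = sum(self._likelihoods)
        return [weight / total for weight in self._likelihoods]

    def sample(self):
        """Pick one generator by weight and call it."""
        generator = self._rng.choices(
            self._generators, weights=self._likelihoods, k=1
        )[0]
        return generator()


pool = ToyGeneratorPool(seed=7)
pool.add(lambda: "tone", likelihood=1)
pool.add(lambda: "chirp", likelihood=2)  # twice as likely as "tone"
print(pool.probabilities())
```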

validate_metadata_fields() -> bool

Validates signal metadata for each signal generator.

Returns:

Whether Signal metadata is valid.

class torchsig.datasets.datasets.StaticTorchSigDataset(root: str, file_handler_class=<class 'torchsig.utils.file_handlers.hdf5.HDF5Reader'>, transforms: list = [], target_labels: list | None = None, **kwargs)

Bases: Dataset, Seedable

Static Dataset class, which loads pre-generated data from a directory.

Parameters:
  • root – The root directory where the dataset is stored.

  • transforms – Transforms to apply to the data (default: []).

  • file_handler_class – Class used for reading the dataset (default: HDF5Reader).

Datamodules

PyTorch Lightning DataModules. Learn more: https://lightning.ai/docs/pytorch/stable/data/datamodule.html

If the dataset does not exist at root, a new dataset is created and written to disk. If the dataset already exists, it is simply loaded back in.
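
The create-or-load behavior can be sketched with a JSON file standing in for the on-disk dataset. prepare_toy_dataset and the toy_dataset.json file name are hypothetical. The real DataModule writes through its file handler classes, and treating overwrite as "always recreate" is an assumption based on the overwrite parameter.

```python
import json
import tempfile
from pathlib import Path


def prepare_toy_dataset(root, dataset_size, overwrite=False):
    """Create the dataset under root if missing, otherwise load it."""
    path = Path(root) / "toy_dataset.json"
    if path.exists() and not overwrite:
        return "loaded", json.loads(path.read_text())
    path.parent.mkdir(parents=True, exist_ok=True)
    samples = list(range(dataset_size))  # placeholder for generated signals
    path.write_text(json.dumps(samples))
    return "created", samples


with tempfile.TemporaryDirectory() as root:
    first = prepare_toy_dataset(root, 4)  # nothing on disk yet
    second = prepare_toy_dataset(root, 4)  # found on disk
    third = prepare_toy_dataset(root, 4, overwrite=True)  # forced rewrite

print(first[0], second[0], third[0])
```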

class torchsig.datasets.datamodules.TorchSigDataModule(root: str, metadata, dataset_size: int, dataset_splits: list[float] | list[int] = [0.7, 0.2, 0.1], batch_size: int = 1, num_workers: int = 1, collate_fn: callable | None = None, create_batch_size: int = 8, create_num_workers: int = 4, file_writer: BaseFileHandler = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, file_reader: BaseFileHandler = <class 'torchsig.utils.file_handlers.hdf5.HDF5Reader'>, overwrite: bool = False, impairment_level: int = 0, transforms=[], target_labels: list[str] | None = None, seed: int | None = None)

Bases: LightningDataModule

PyTorch Lightning DataModule for creating and loading TorchSig datasets.

This DataModule handles:
  • Dataset creation or loading from disk via a file handler.

  • Splitting into train/val/test subsets.

  • Batching, collation, and worker seeding for training.

root

Directory where datasets are stored or created.

dataset_size

Total number of samples in the dataset.

dataset_splits

Fractions or counts for train/val/test splits.
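
As an illustration of "fractions or counts", the sketch below turns either form into per-split sample counts. split_sizes is a hypothetical helper. The rounding rule, with the last split absorbing the remainder, and the check that counts sum to dataset_size are assumptions, not documented TorchSig behavior.

```python
def split_sizes(dataset_size, dataset_splits):
    """Convert fractions or counts into train/val/test sample counts."""
    if all(isinstance(split, int) for split in dataset_splits):
        # Counts: assumed to add up to the dataset size exactly.
        if sum(dataset_splits) != dataset_size:
            raise ValueError("split counts must sum to dataset_size")
        return list(dataset_splits)
    # Fractions: round all but the last, which takes whatever is left.
    sizes = [round(fraction * dataset_size) for fraction in dataset_splits[:-1]]
    sizes.append(dataset_size - sum(sizes))
    return sizes


print(split_sizes(100, [0.7, 0.2, 0.1]))
print(split_sizes(10, [7, 2, 1]))
```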

dataset_metadata

Metadata describing the dataset.

impairment_level

Optional interference level for synthetic impairments.

transforms

Transforms applied to the input data.

target_labels

Names of target metadata fields to include.

batch_size

Batch size for the training/validation/testing DataLoaders.

num_workers

Number of worker processes for data loading.

collate_fn

Custom collate function for batching.

create_batch_size

Batch size used during on-disk dataset creation.

create_num_workers

Number of workers used during dataset creation.

file_writer

File handler class used to write the dataset to disk.

file_reader

File handler class used to read the dataset from disk.

overwrite

If True, existing on-disk data will be overwritten.

seed

Optional random seed for reproducibility.

train

Initialized training dataset (set in setup()).

Type:

StaticTorchSigDataset

val

Initialized validation dataset (set in setup()).

Type:

StaticTorchSigDataset

test

Initialized test dataset (set in setup()).

Type:

StaticTorchSigDataset

train: StaticTorchSigDataset
val: StaticTorchSigDataset
test: StaticTorchSigDataset

prepare_data() -> None

Prepares the dataset by creating new datasets if they do not exist on disk.

The datasets are created using the DatasetCreator class. If the dataset already exists on disk, it is loaded back into memory.

setup(stage: str = 'train') -> None

Sets up the train and validation datasets for the given stage.

Parameters:

stage – The stage of the DataModule, typically ‘train’ or ‘test’. Defaults to ‘train’.

train_dataloader() -> DataLoader

Returns the DataLoader for the training dataset.

Returns:

A PyTorch DataLoader for the training dataset.

Raises:

RuntimeError – If the training dataset is not initialized.

val_dataloader() -> DataLoader

Returns the DataLoader for the validation dataset.

Returns:

A PyTorch DataLoader for the validation dataset.

Raises:

RuntimeError – If the validation dataset is not initialized.

test_dataloader() -> DataLoader

Returns the DataLoader for the test dataset.

Returns:

A PyTorch DataLoader for the test dataset.

Raises:

RuntimeError – If the test dataset is not initialized.

trainer: pl.Trainer | None
prepare_data_per_node: bool
allow_zero_length_dataloader_with_multiple_devices: bool