Datasets¶
There are two main types of datasets: torchsig.datasets.datasets.TorchSigIterableDataset and torchsig.datasets.datasets.StaticTorchSigDataset.
TorchSigIterableDataset is for generating synthetic data in memory (infinitely).
To save a dataset to disk, use a torchsig.utils.writer.DatasetCreator, which accepts a TorchSigIterableDataset object.
StaticTorchSigDataset (torchsig.datasets.StaticTorchSigDataset) is for loading a saved dataset from disk.
Samples can be accessed in any order, and previously generated samples are accessible.
Note: If a TorchSigIterableDataset is written to disk with no transforms and no target transforms, the saved dataset is considered raw; otherwise, it is considered processed. When a raw dataset is loaded back in using a StaticTorchSigDataset object, users can define transforms and target transforms to be applied. When a processed dataset is loaded back in, users cannot define any transforms or target transforms.
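The raw/processed rule in the note can be summarized in a few lines. This is a minimal sketch only: classify_saved_dataset and check_load_request are hypothetical helpers, not TorchSig functions, and raising a ValueError for a processed dataset is an assumption (the note only says that transforms cannot be defined in that case).

```python
def classify_saved_dataset(transforms, target_transforms):
    """Label a dataset at write time: "raw" only if nothing was applied."""
    if not transforms and not target_transforms:
        return "raw"
    return "processed"


def check_load_request(saved_kind, transforms=(), target_transforms=()):
    """Hypothetical guard: only raw datasets accept transforms at load time."""
    if saved_kind == "processed" and (transforms or target_transforms):
        raise ValueError("a processed dataset cannot take new transforms")


# A dataset written with no transforms and no target transforms is raw,
# so transforms may still be supplied when it is loaded back in.
kind = classify_saved_dataset(transforms=[], target_transforms=[])
check_load_request(kind, transforms=[abs])
```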
Base Classes¶
TorchSig Datasets¶
Dataset Base Classes for creation and static loading.
- torchsig.datasets.datasets.apply_label_to_signal(sample: Signal, target_label: str) → list¶
Recursively applies the specified label to a signal sample and its components.
- Parameters:
sample – The signal sample to apply the label to.
target_label – The label that should be used to identify relevant values in the signal sample.
- Returns:
A list of values corresponding to the label specified in the sample and its component signals.
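One plausible reading of the recursion described above is a depth-first walk that collects the label from the sample and then from each component. The sketch below uses a hypothetical LabeledNode stand-in rather than TorchSig's Signal class, and the label names "class_name" and "snr_db" are illustrative assumptions. The real function may differ in ordering and in how it handles missing labels.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledNode:
    """Hypothetical stand-in for a signal: metadata plus nested components."""
    metadata: dict
    components: list = field(default_factory=list)


def collect_label(sample, target_label):
    """Gather target_label from the sample, then recursively from its components."""
    values = []
    if target_label in sample.metadata:
        values.append(sample.metadata[target_label])
    for component in sample.components:
        values.extend(collect_label(component, target_label))
    return values


tree = LabeledNode(
    {"class_name": "a"},
    [
        LabeledNode({"class_name": "b"}),
        LabeledNode({"snr_db": 3}, [LabeledNode({"class_name": "c"})]),
    ],
)
values = collect_label(tree, "class_name")
```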
- torchsig.datasets.datasets.apply_transforms_and_labels_to_signal(sample: Signal, transforms: list[Transform | callable], target_labels: list) → Signal | np.ndarray | tuple¶
Applies a series of transformations to a signal sample and retrieves specified label values.
- Parameters:
sample – The signal sample to process.
transforms – A list of function objects, each taking a Signal object and returning a transformed Signal object.
target_labels – Labels to be retrieved from the signal sample after transformations. If None, the transformed signal is returned. If an empty list, the signal data is returned.
- Returns:
If target_labels is None, a Signal object with all applied transforms.
If target_labels is an empty list, the numpy.ndarray data of the sample.
If target_labels contains one label, a tuple of (sample_data, target_value).
If target_labels contains multiple labels, a tuple of (sample_data, [target_values]).
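The four return shapes listed above can be illustrated with a small stand-in. This is a sketch under stated assumptions: ToySignal, apply_and_select, and the label names "class_name" and "snr_db" are hypothetical, and plain lists replace numpy arrays so the example has no dependencies. It mirrors the documented branching, not TorchSig's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ToySignal:
    """Hypothetical stand-in for a Signal: sample data plus metadata."""
    data: list
    metadata: dict


def apply_and_select(sample, transforms, target_labels):
    """Apply each transform in order, then pick the return shape."""
    for transform in transforms:
        sample = transform(sample)
    if target_labels is None:
        return sample                      # the transformed signal object
    if len(target_labels) == 0:
        return sample.data                 # data only
    values = [sample.metadata[label] for label in target_labels]
    if len(values) == 1:
        return sample.data, values[0]      # (data, single target value)
    return sample.data, values             # (data, [target values])


def double(signal):
    """Example transform: scale every data point by two."""
    return ToySignal([2 * x for x in signal.data], signal.metadata)


signal = ToySignal([1, 2], {"class_name": "bpsk", "snr_db": 10})
as_signal = apply_and_select(signal, [double], None)
as_data = apply_and_select(signal, [double], [])
one_label = apply_and_select(signal, [double], ["class_name"])
two_labels = apply_and_select(signal, [double], ["class_name", "snr_db"])
```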
- class torchsig.datasets.datasets.TorchSigIterableDataset(signal_generators: str | ConcatSignalGenerator | list = 'all', transforms: list[Transform | callable] = [], component_transforms: list[Transform | callable] = [], target_labels: list | None = None, validate_init: bool = True, **kwargs)[source]¶
Bases:
HierarchicalMetadataObject, IterableDataset
Base class for generating signals.
The dataset will continue to generate samples infinitely.
- signal_generators¶
The signal generators to use. Can be a string, ConcatSignalGenerator, or list.
- transforms¶
List of transforms to apply to the entire signal.
- component_transforms¶
List of transforms to apply to individual signal components.
- target_labels¶
Labels to extract from the signal.
- validate_init¶
Whether to validate metadata during initialization.
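Because the dataset generates samples indefinitely, a plain for loop over it never ends on its own; the consumer has to bound the iteration. The sketch below shows the pattern with a hypothetical endless_samples generator standing in for the dataset. It does not use TorchSig, and the sample contents are arbitrary.

```python
import itertools
import random


def endless_samples(seed=0, length=4):
    """Hypothetical infinite sample source: yields lists of random values forever."""
    rng = random.Random(seed)
    while True:
        yield [rng.gauss(0.0, 1.0) for _ in range(length)]


# Take a fixed number of samples instead of exhausting the stream.
first_three = list(itertools.islice(endless_samples(), 3))
```

The same idea applies when saving to disk: the number of samples to keep has to be chosen by the caller, since the stream itself has no end.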
- init_signal_generator(signal_generator: str | callable) → None¶
Initializes the signal generator.
- Parameters:
signal_generator – The signal generator to be initialized. If a string, it is first looked up to retrieve the corresponding signal generator function.
- Raises:
TypeError – If the signal_generator is neither a string nor a callable.
- add_signal_generator(signal_generator: callable, class_name: str | None = None, class_index: int | None = None, likelihood: int = 1) → None¶
Adds a signal generator to this dataset.
- Parameters:
signal_generator – A callable object which takes no arguments and returns a Signal.
class_name – (optional) A name for this signal class in the dataset. If None, the signal will be generated and added to the data, but no labels will be made for the signal.
likelihood – (optional) The relative likelihood of this signal type in the dataset. Doubling the likelihood will make this signal twice as likely to be placed in the data.
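The effect of relative likelihoods can be checked with a weighted draw. The sketch below uses Python's random.choices as a stand-in for however TorchSig actually selects generators (an assumption), and the generator names "tone" and "chirp" are hypothetical.

```python
import random
from collections import Counter


def pick_generator_names(names, likelihoods, n, seed=0):
    """Draw n generator names, each weighted by its relative likelihood."""
    rng = random.Random(seed)
    return rng.choices(names, weights=likelihoods, k=n)


# "chirp" has twice the likelihood of "tone", so it should be drawn
# roughly twice as often over many samples.
counts = Counter(pick_generator_names(["tone", "chirp"], [1, 2], 30000))
ratio = counts["chirp"] / counts["tone"]
```

Only the ratio between likelihoods matters in this sketch: weights of [1, 2] and [5, 10] describe the same distribution.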
- class torchsig.datasets.datasets.StaticTorchSigDataset(root: str, file_handler_class=<class 'torchsig.utils.file_handlers.hdf5.HDF5Reader'>, transforms: list = [], target_labels: list | None = None, **kwargs)[source]¶
Bases:
Dataset, Seedable
Static Dataset class, which loads pre-generated data from a directory.
- Parameters:
root – The root directory where the dataset is stored.
transforms – Transforms to apply to the data (default: []).
file_handler_class – Class used for reading the dataset (default: HDF5Reader).
Datamodules¶
PyTorch Lightning DataModules. Learn more: https://lightning.ai/docs/pytorch/stable/data/datamodule.html
If the dataset does not exist at root, a new dataset is created and written to disk. If the dataset does exist, it is simply loaded back in.
- class torchsig.datasets.datamodules.TorchSigDataModule(root: str, metadata, dataset_size: int, dataset_splits: list[float] | list[int] = [0.7, 0.2, 0.1], batch_size: int = 1, num_workers: int = 1, collate_fn: callable | None = None, create_batch_size: int = 8, create_num_workers: int = 4, file_writer: BaseFileHandler = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, file_reader: BaseFileHandler = <class 'torchsig.utils.file_handlers.hdf5.HDF5Reader'>, overwrite: bool = False, impairment_level: int | None = None, transforms=[], target_labels: list[str] | None = None, seed: int | None = None)[source]¶
Bases:
LightningDataModule
PyTorch Lightning DataModule for creating and loading TorchSig datasets.
- This DataModule handles:
Dataset creation or loading from disk via a file handler.
Splitting into train/val/test subsets.
Batching, collation, and worker seeding for training.
- root¶
Directory where datasets are stored or created.
- dataset_size¶
Total number of samples in the dataset.
- dataset_splits¶
Fractions or counts for train/val/test splits.
- dataset_metadata¶
Metadata describing the dataset.
- impairment_level¶
Optional interference level for synthetic impairments.
- transforms¶
Transforms applied to the input data.
- target_labels¶
Names of target metadata fields to include.
- batch_size¶
Batch size for the training/validation/testing DataLoaders.
- num_workers¶
Number of worker processes for data loading.
- collate_fn¶
Custom collate function for batching.
- create_batch_size¶
Batch size used during on-disk dataset creation.
- create_num_workers¶
Number of workers used during dataset creation.
- file_writer¶
File handler class used to write the dataset to disk.
- file_reader¶
File handler class used to read the dataset from disk.
- overwrite¶
If True, existing on-disk data will be overwritten.
- seed¶
Optional random seed for reproducibility.
- train¶
Initialized training dataset (set in setup()).
- Type:
StaticTorchSigDataset
- val¶
Initialized validation dataset (set in setup()).
- Type:
- test¶
Initialized test dataset (set in setup()).
- Type:
StaticTorchSigDataset
- train: StaticTorchSigDataset¶
- test: StaticTorchSigDataset¶
- prepare_data() → None¶
Prepares the dataset by creating new datasets if they do not exist on disk.
The datasets are created using the DatasetCreator class. If the dataset already exists on disk, it is loaded back into memory.
- Raises:
FileNotFoundError – If the root directory cannot be created.
RuntimeError – If dataset creation fails.
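The create-or-load behavior described for prepare_data() follows a common pattern: check whether the data is already on disk, and only generate it when it is missing or when overwriting is requested. The sketch below illustrates that pattern with the standard library only. prepare_dataset, write_placeholder, and the file name "dataset.bin" are hypothetical; how TorchSig actually detects an existing dataset is not stated here.

```python
import tempfile
from pathlib import Path


def prepare_dataset(root, create_fn, overwrite=False):
    """Create the dataset under root if needed; report what happened."""
    root = Path(root)
    data_file = root / "dataset.bin"  # hypothetical file name
    if data_file.exists() and not overwrite:
        return "loaded"
    root.mkdir(parents=True, exist_ok=True)
    create_fn(data_file)
    return "created"


def write_placeholder(path):
    """Stand-in for dataset generation: writes a few bytes to disk."""
    path.write_bytes(b"\x00" * 16)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_dataset"
    outcomes = [
        prepare_dataset(root, write_placeholder),                  # nothing on disk yet
        prepare_dataset(root, write_placeholder),                  # found on disk
        prepare_dataset(root, write_placeholder, overwrite=True),  # forced rewrite
    ]
```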
- setup(stage: str = 'train') → None¶
Sets up the train and validation datasets for the given stage.
- Parameters:
stage – The stage of the DataModule, typically ‘train’ or ‘test’. Defaults to ‘train’.
- Raises:
FileNotFoundError – If the dataset files are not found at the specified root.
ValueError – If dataset splits are invalid.
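Since dataset_splits may hold either fractions or counts, some step has to turn it into per-split sample counts and reject inconsistent values. The sketch below shows one way to do that. resolve_splits is a hypothetical helper, and the specific validation and rounding rules (fractions must sum to 1, rounding remainder goes to the first split) are assumptions rather than TorchSig's documented behavior.

```python
def resolve_splits(dataset_size, dataset_splits):
    """Convert fractions or counts into per-split sample counts."""
    all_ints = all(
        isinstance(s, int) and not isinstance(s, bool) for s in dataset_splits
    )
    if all_ints:
        # Integer entries are treated as explicit sample counts.
        counts = list(dataset_splits)
    else:
        # Otherwise the entries are treated as fractions of dataset_size.
        if abs(sum(dataset_splits) - 1.0) > 1e-9:
            raise ValueError("fractional splits must sum to 1")
        counts = [round(dataset_size * s) for s in dataset_splits]
        # Assumption: any rounding remainder is assigned to the first split.
        counts[0] += dataset_size - sum(counts)
    if any(c < 0 for c in counts) or sum(counts) != dataset_size:
        raise ValueError("splits must be non-negative and cover the dataset exactly")
    return counts


from_fractions = resolve_splits(100, [0.7, 0.2, 0.1])
from_counts = resolve_splits(10, [7, 2, 1])
```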
- train_dataloader() → DataLoader¶
Returns the DataLoader for the training dataset.
- Returns:
A PyTorch DataLoader for the training dataset.
- Raises:
RuntimeError – If the training dataset is not initialized.
- val_dataloader() → DataLoader¶
Returns the DataLoader for the validation dataset.
- Returns:
A PyTorch DataLoader for the validation dataset.
- Raises:
RuntimeError – If the validation dataset is not initialized.
- test_dataloader() → DataLoader¶
Returns the DataLoader for the test dataset.
- Returns:
A PyTorch DataLoader for the test dataset.
- Raises:
RuntimeError – If the test dataset is not initialized.