torchsig.utils.writer.DatasetCreator

class torchsig.utils.writer.DatasetCreator(dataset: ~torchsig.datasets.datasets.NewTorchSigDataset, root: str, overwrite: bool = False, batch_size: int = 1, num_workers: int = 1, collate_fn: ~typing.Callable = <function collate_fn>, tqdm_desc: str | None = None, file_handler: ~torchsig.utils.file_handlers.base_handler.TorchSigFileHandler = <class 'torchsig.utils.file_handlers.zarr.ZarrFileHandler'>, train: bool | None = None, multithreading: bool = True, **kwargs)[source]

Bases: object

Class for creating a dataset and saving it to disk in batches.

This class generates a dataset if it does not already exist on disk, processing the data in batches and saving each batch with the specified file handler. Options control whether an existing dataset is overwritten, the batch size, and the number of workers used for data loading.

root

The root directory where the dataset will be saved.

Type:

Path

overwrite

Flag indicating whether to overwrite an existing dataset.

Type:

bool

batch_size

The number of samples in each batch.

Type:

int

num_workers

The number of worker threads to use for data loading.

Type:

int

save_type

The type of dataset being saved (“raw” or “processed”).

Type:

str

tqdm_desc

A description for the progress bar.

Type:

str

writer

The file handler used for saving the dataset.

Type:

TorchSigFileHandler

dataloader

The DataLoader used to load data in batches.

Type:

DataLoader

Methods

check_yamls

Checks for differences between the dataset metadata on disk and the dataset metadata in memory.

create

Creates the dataset on disk by writing batches to the file handler.

get_writing_info_dict

Returns a dictionary with information about the dataset being written.

__init__(dataset: ~torchsig.datasets.datasets.NewTorchSigDataset, root: str, overwrite: bool = False, batch_size: int = 1, num_workers: int = 1, collate_fn: ~typing.Callable = <function collate_fn>, tqdm_desc: str | None = None, file_handler: ~torchsig.utils.file_handlers.base_handler.TorchSigFileHandler = <class 'torchsig.utils.file_handlers.zarr.ZarrFileHandler'>, train: bool | None = None, multithreading: bool = True, **kwargs)[source]

Initializes the DatasetCreator.

Parameters:
  • dataset (NewTorchSigDataset) – The dataset to be written to disk.

  • root (str) – The root directory where the dataset will be saved.

  • overwrite (bool) – Whether to overwrite an existing dataset (default: False).

  • batch_size (int) – The number of samples per batch (default: 1).

  • num_workers (int) – The number of workers for loading data (default: 1).

  • collate_fn (Callable) – Function to merge a list of samples into a batch (default: collate_fn).

  • tqdm_desc (str | None) – Description for the tqdm progress bar (default: None).

  • file_handler (TorchSigFileHandler) – File handler for saving the dataset (default: ZarrFileHandler).

  • train (bool | None) – Whether the dataset is for training (default: None).

Raises:

ValueError – If the dataset does not specify num_samples.
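
The ValueError above implies that a dataset must have a finite length before it can be written to disk. The following is a minimal sketch of such a guard, not the library's code: `FakeDataset` and `require_finite_length` are hypothetical names, and representing an unspecified `num_samples` as `None` is an assumption.

```python
from typing import Optional


class FakeDataset:
    """Hypothetical stand-in for a dataset object; not a torchsig class."""

    def __init__(self, num_samples: Optional[int] = None) -> None:
        # Assumption: an unspecified length is represented as None.
        self.num_samples = num_samples


def require_finite_length(dataset: FakeDataset) -> int:
    """Return num_samples, or raise ValueError when it is unset."""
    if dataset.num_samples is None:
        raise ValueError(
            "dataset must specify num_samples before it can be written to disk"
        )
    return dataset.num_samples
```

Checking the length up front means the failure surfaces when the creator is constructed rather than partway through writing.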

get_writing_info_dict() → Dict[str, Any][source]

Returns a dictionary with information about the dataset being written.

This method gathers information regarding the root, overwrite status, batch size, number of workers, file handler class, and the save type of the dataset.

Returns:

Dictionary containing the dataset writing configuration.

Return type:

Dict[str, Any]
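
As an illustration of the shape such a dictionary might take, here is a small sketch that collects the six fields named in the description. The function name `writing_info` and every key name are assumptions made for this example; the keys the library actually emits may differ.

```python
from pathlib import Path
from typing import Any, Dict


def writing_info(
    root: Path,
    overwrite: bool,
    batch_size: int,
    num_workers: int,
    file_handler: type,
    save_type: str,
) -> Dict[str, Any]:
    """Collect the writing configuration into a plain, serializable dict."""
    return {
        "root": str(root),                       # paths stored as strings
        "overwrite": overwrite,
        "batch_size": batch_size,
        "num_workers": num_workers,
        "file_handler": file_handler.__name__,   # class name, not the class
        "save_type": save_type,                  # "raw" or "processed"
    }
```

Storing the path and handler as strings keeps the dictionary easy to dump to a text format such as YAML.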

check_yamls() → List[Tuple[str, Any, Any]][source]

Checks for differences between the dataset metadata on disk and the dataset metadata in memory.

Compares the dataset metadata that would be written to disk against the existing metadata on disk. Returns a list of differences.

Returns:

List of differences between metadata on disk and in memory.

Return type:

List[Tuple[str, Any, Any]]
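
To make the return type concrete, the sketch below compares two flat metadata dictionaries and reports each mismatch as a tuple. It is an illustration only: `diff_metadata` is a hypothetical helper, the tuple order (key, value on disk, value in memory) is an assumption, and the real method reads its inputs from YAML files rather than taking dictionaries.

```python
from typing import Any, Dict, List, Tuple

# Placeholder reported when a key is present on only one side.
_MISSING = "<missing>"


def diff_metadata(
    on_disk: Dict[str, Any], in_memory: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (key, disk_value, memory_value) for every differing key."""
    differences: List[Tuple[str, Any, Any]] = []
    # Visit the union of keys in sorted order so the result is deterministic.
    for key in sorted(set(on_disk) | set(in_memory)):
        disk_value = on_disk.get(key, _MISSING)
        memory_value = in_memory.get(key, _MISSING)
        if disk_value != memory_value:
            differences.append((key, disk_value, memory_value))
    return differences
```

An empty list means the metadata on disk matches the metadata in memory.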

create() → None[source]

Creates the dataset on disk by writing batches to the file handler.

This method generates the dataset in batches and saves it to disk. If the dataset already exists and overwrite is set to False, it will skip regeneration.

The method also writes the dataset metadata and writing information to YAML files.

Raises:

ValueError – If the dataset is already generated and overwrite is set to False.
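
The batch-and-guard behavior described for create() can be sketched with the standard library alone. This is not the library's implementation: `write_in_batches` is a hypothetical function, JSON files stand in for the file handler and the YAML metadata, and the sketch takes the raising branch when a dataset already exists and overwrite is False, which is one of the two behaviors the description mentions.

```python
import json
from pathlib import Path
from typing import List, Sequence


def write_in_batches(
    samples: Sequence[float],
    root: Path,
    batch_size: int = 1,
    overwrite: bool = False,
) -> List[Path]:
    """Write samples to root in fixed-size batches, one JSON file per batch."""
    marker = root / "writing_info.json"  # stand-in for the YAML info file
    if marker.exists() and not overwrite:
        raise ValueError(
            f"dataset already exists at {root}; pass overwrite=True to regenerate"
        )
    root.mkdir(parents=True, exist_ok=True)
    # When overwriting, clear stale batch files from the previous run.
    for old in root.glob("batch_*.json"):
        old.unlink()

    written: List[Path] = []
    for start in range(0, len(samples), batch_size):
        batch = list(samples[start:start + batch_size])
        path = root / f"batch_{start // batch_size:05d}.json"
        path.write_text(json.dumps(batch))
        written.append(path)

    # Record the writing configuration last, so its presence marks completion.
    marker.write_text(
        json.dumps({"batch_size": batch_size, "num_samples": len(samples)})
    )
    return written
```

Writing the info file only after all batches are on disk lets a later run treat its presence as a sign that generation finished.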