torchsig.utils.writer.DatasetCreator

class torchsig.utils.writer.DatasetCreator(dataloader: DataLoader = None, dataset_length: int | None = None, root: str = '.', overwrite: bool = True, tqdm_desc: str | None = None, file_handler: FileWriter = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, multithreading: bool = True, **kwargs)

Bases: object

Class for creating a dataset and saving it to disk in batches.

This class generates a dataset if it doesn’t already exist on disk. It processes the data in batches and saves it using the specified file handler. The overwrite flag controls whether an existing dataset is replaced; the batch size and number of workers are properties of the DataLoader that is passed in rather than separate constructor arguments.

dataloader

The DataLoader used to load data in batches.

Type:

DataLoader

root

The root directory where the dataset will be saved.

Type:

Path

overwrite

Flag indicating whether to overwrite an existing dataset.

Type:

bool

tqdm_desc

A description for the progress bar.

Type:

str

file_handler

The file handler used for saving the dataset.

Type:

FileWriter

Methods

  • create() – Creates the dataset on disk by writing batches to the file handler.

  • get_writing_info_dict() – Returns a dictionary with information about the dataset being written.

__init__(dataloader: DataLoader = None, dataset_length: int | None = None, root: str = '.', overwrite: bool = True, tqdm_desc: str | None = None, file_handler: FileWriter = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, multithreading: bool = True, **kwargs)

Initializes the DatasetCreator.

Parameters:
  • dataloader (DataLoader) – The DataLoader used to load data in batches.

  • dataset_length (int | None) – The number of samples to draw from the dataset. Defaults to None.

  • root (str) – The root directory where the dataset will be saved. Defaults to '.'.

  • overwrite (bool) – Flag indicating whether to overwrite an existing dataset. Defaults to True.

  • tqdm_desc (str | None) – A description for the progress bar. Defaults to None.

  • file_handler (FileWriter) – The file handler class used for saving the dataset. Defaults to HDF5Writer.

  • multithreading (bool) – Whether to use multithreading. Defaults to True.
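
As a usage illustration, the constructor arguments above can be wired together roughly as follows. This is a minimal sketch, not a verified recipe: the helper name write_dataset is hypothetical, and it assumes the caller already has a DataLoader wrapping a dataset that DatasetCreator accepts. The import is deferred into the function so that the module can be loaded without TorchSig installed.

```python
def write_dataset(dataloader, root=".", dataset_length=None):
    """Hypothetical helper: write the batches from `dataloader` under `root`.

    Only keyword arguments listed in the documented signature are used.
    """
    # Import path taken from the documented class name.
    from torchsig.utils.writer import DatasetCreator

    creator = DatasetCreator(
        dataloader=dataloader,
        dataset_length=dataset_length,
        root=root,
        overwrite=False,               # keep an existing dataset
        tqdm_desc="Writing dataset",   # label shown on the progress bar
    )
    creator.create()
    # Return the writing configuration for inspection or logging.
    return creator.get_writing_info_dict()
```

The file_handler and multithreading arguments are left at their documented defaults here.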

get_writing_info_dict() → dict[str, Any]

Returns a dictionary with information about the dataset being written.

This method gathers information regarding the root, overwrite status, batch size, number of workers, file handler class, and the save type of the dataset.

Returns:

Dictionary containing the dataset writing configuration.

Return type:

Dict[str, Any]
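
To make the shape of such a dictionary concrete, here is a small stand-in sketch that collects most of the fields listed above. It does not use TorchSig: FakeWriter, FakeLoader, writing_info and all key names are hypothetical, and the real method may use different keys. The save type is omitted because this page does not define it.

```python
from dataclasses import dataclass
from typing import Any


class FakeWriter:
    """Hypothetical stand-in for a FileWriter subclass."""


@dataclass
class FakeLoader:
    """Hypothetical stand-in exposing the two DataLoader fields used below."""
    batch_size: int
    num_workers: int


def writing_info(root: str, overwrite: bool, loader: FakeLoader, handler: type) -> dict[str, Any]:
    """Collect the writing configuration into a plain dictionary."""
    return {
        "root": root,
        "overwrite": overwrite,
        "batch_size": loader.batch_size,    # read from the loader
        "num_workers": loader.num_workers,  # read from the loader
        "file_handler": handler.__name__,   # class name, not an instance
    }


info = writing_info("./dataset", True, FakeLoader(batch_size=8, num_workers=2), FakeWriter)
```

Storing the handler's class name rather than the class object keeps the dictionary easy to serialize to a text format.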

create() → None

Creates the dataset on disk by writing batches to the file handler.

This method generates the dataset in batches and saves it to disk. If the dataset already exists and overwrite is set to False, it will skip regeneration.

The method also writes the dataset metadata and writing information to YAML files.

Raises:

ValueError – If the dataset is already generated and overwrite is set to False.
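
The batch-writing flow described for create() can be illustrated with a standard-library stand-in. This is only a sketch under stated assumptions: create_dataset, batches and the file names are hypothetical, JSON is used in place of YAML and HDF5 to avoid extra dependencies, and the sketch follows the skip-regeneration behaviour described in the text rather than raising an error.

```python
import json
import tempfile
from pathlib import Path


def batches(n_samples, batch_size):
    """Yield lists of sample indices, standing in for a DataLoader."""
    for start in range(0, n_samples, batch_size):
        yield list(range(start, min(start + batch_size, n_samples)))


def create_dataset(root, n_samples, batch_size, overwrite):
    """Write batches under root unless a finished dataset is already there.

    Returns True if data was written, False if regeneration was skipped.
    """
    root = Path(root)
    info_file = root / "writing_info.json"  # hypothetical file name
    if info_file.exists() and not overwrite:
        return False  # existing dataset kept; nothing regenerated
    root.mkdir(parents=True, exist_ok=True)
    for batch_idx, batch in enumerate(batches(n_samples, batch_size)):
        (root / f"batch_{batch_idx}.json").write_text(json.dumps(batch))
    # Record the writing configuration last, so its presence marks completion.
    info = {"root": str(root), "overwrite": overwrite, "batch_size": batch_size}
    info_file.write_text(json.dumps(info))
    return True


with tempfile.TemporaryDirectory() as tmp:
    first = create_dataset(tmp, n_samples=10, batch_size=4, overwrite=False)
    second = create_dataset(tmp, n_samples=10, batch_size=4, overwrite=False)
    third = create_dataset(tmp, n_samples=10, batch_size=4, overwrite=True)
    n_batch_files = len(list(Path(tmp).glob("batch_*.json")))
```

In this sketch the first call writes the data, the second call leaves it untouched because overwrite is False, and the third call rewrites it.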