torchsig.utils.writer.DatasetCreator

class torchsig.utils.writer.DatasetCreator(dataloader: DataLoader, dataset_length: int | None = None, root: str = '.', overwrite: bool = True, tqdm_desc: str | None = None, file_handler: FileWriter = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, multithreading: bool = True, max_inflight_futures: int = 32, **kwargs)

Bases: object

Class for creating a dataset and saving it to disk in batches.

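This class generates a dataset if it does not already exist on disk. It processes the data in batches and saves each batch using the specified file handler. Constructor options control whether an existing dataset is overwritten, whether batches are written with multiple threads, and how many writes may be in flight at once. Batch size and worker count are not constructor arguments here; they are taken from the DataLoader that is passed in.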

dataloader

The DataLoader used to load data in batches.

Type: DataLoader

root

The root directory where the dataset will be saved.

Type: Path

overwrite

Flag indicating whether to overwrite an existing dataset.

Type: bool

tqdm_desc

A description for the progress bar.

Type: str

file_handler

The file handler used for saving the dataset.

Type: FileWriter

Methods

`check_yamls`	Returns (complete, differences) without mutating dataset or entering writer context.
`create`	Creates the dataset on disk by writing batches to the file handler.
`get_dataset_info_dict`	Get metadata content for the dataset_info.yaml file.
`get_writer_info_dict`	Returns a dictionary with information about the dataset writing configuration.

__init__(dataloader: DataLoader, dataset_length: int | None = None, root: str = '.', overwrite: bool = True, tqdm_desc: str | None = None, file_handler: FileWriter = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, multithreading: bool = True, max_inflight_futures: int = 32, **kwargs)

Initializes the DatasetCreator.

Parameters:

dataloader (DataLoader) – DataLoader used to load data in batches. Required.
dataset_length (int | None) – Number of dataset items to create. If not provided, the creator attempts to infer the length. Defaults to None.
root (str) – Root directory where the dataset files will be saved. Defaults to the current directory ('.').
overwrite (bool) – Flag indicating whether to overwrite an existing dataset. Defaults to True.
tqdm_desc (str | None) – Description for the progress bar. Defaults to None.
file_handler (FileWriter) – File handler used to write the dataset. Defaults to HDF5Writer.
multithreading (bool) – Whether to use multithreading for writing batches. Defaults to True.
max_inflight_futures (int) – Maximum number of concurrent futures when using multithreading. Defaults to 32.
**kwargs – Additional arguments for the file handler.
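The interaction between multithreading and max_inflight_futures can be illustrated with a minimal standard-library sketch. This is not TorchSig's implementation: write_batches, fake_write, and max_workers are hypothetical names introduced here, and the sketch only assumes that the cap limits how many batch writes may be pending at once.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def write_batches(batches, write_fn, max_inflight_futures=32, max_workers=4):
    """Submit one write per batch while capping the number of pending futures.

    Hypothetical helper for illustration only; not part of TorchSig.
    """
    written = 0
    inflight = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch_idx, batch in enumerate(batches):
            inflight.add(pool.submit(write_fn, batch_idx, batch))
            # Once the cap is reached, block until at least one write finishes,
            # so pending batches cannot accumulate in memory without bound.
            if len(inflight) >= max_inflight_futures:
                done, inflight = wait(inflight, return_when=FIRST_COMPLETED)
                for future in done:
                    future.result()  # re-raises any exception from the worker
                    written += 1
        # Drain the writes still pending after the last batch was submitted.
        for future in inflight:
            future.result()
            written += 1
    return written


# Stand-in for a file handler: record each batch in a dict keyed by its index.
store = {}


def fake_write(batch_idx, batch):
    store[batch_idx] = list(batch)


batches = [[i, i + 1] for i in range(10)]
count = write_batches(batches, fake_write, max_inflight_futures=3)
```

A small cap bounds memory use, because the producer stalls whenever the writers fall behind. A larger cap lets the producer run further ahead of the writers.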

get_dataset_info_dict(*, dataset_length: int, original_target_labels: Any) → dict[str, Any]

Get metadata content for the dataset_info.yaml file.

Returns: Dictionary containing the dataset metadata information.
Return type: Dict[str, Any]

get_writer_info_dict(*, complete: bool) → dict[str, Any]

Returns a dictionary with information about the dataset writing configuration. Used primarily for creating content for the writer_info.yaml summary file.

Returns: Dictionary containing the dataset writing configuration.
Return type: Dict[str, Any]

check_yamls(*, expected_dataset_info: dict[str, Any]) → tuple[bool, list[tuple[str, Any, Any]]]

Returns (complete, differences) without mutating the dataset or entering the writer context.
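The shape of the differences list, list[tuple[str, Any, Any]], suggests one tuple per mismatched field. The sketch below shows one way such a comparison could be written. It makes several assumptions: the tuple order (key, expected value, found value) is a guess, diff_dataset_info is a hypothetical helper rather than a TorchSig function, and the field names are made up. It compares two in-memory dictionaries and does not model the complete flag or any YAML parsing.

```python
from typing import Any


def diff_dataset_info(
    expected: dict[str, Any], found: dict[str, Any]
) -> list[tuple[str, Any, Any]]:
    """Return (key, expected_value, found_value) for every key that disagrees.

    Hypothetical helper for illustration only; a key missing on either side
    is reported with None in that position.
    """
    differences = []
    for key in sorted(expected.keys() | found.keys()):
        expected_value = expected.get(key)
        found_value = found.get(key)
        if expected_value != found_value:
            differences.append((key, expected_value, found_value))
    return differences


# Made-up metadata fields standing in for the contents of dataset_info.yaml.
expected = {"num_samples": 100, "sample_rate": 1.0e6}
on_disk = {"num_samples": 100, "sample_rate": 2.0e6}
differences = diff_dataset_info(expected, on_disk)
```

An empty list means the stored metadata matches the expected configuration. The function only reads its inputs, in keeping with the "without mutating" wording above.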

create() → None

Creates the dataset on disk by writing batches to the file handler.

This method generates the dataset in batches and saves it to disk. If the dataset already exists and overwrite is set to False, it will skip regeneration.

The method also writes the dataset metadata and writing information to YAML files.

Raises: ValueError – If the dataset is already generated and overwrite is set to False.