torchsig.utils.writer.DatasetCreator
- class torchsig.utils.writer.DatasetCreator(dataset: torchsig.datasets.datasets.NewTorchSigDataset, root: str, overwrite: bool = False, batch_size: int = 1, num_workers: int = 1, collate_fn: typing.Callable = <function collate_fn>, tqdm_desc: str | None = None, file_handler: torchsig.utils.file_handlers.base_handler.TorchSigFileHandler = <class 'torchsig.utils.file_handlers.zarr.ZarrFileHandler'>, train: bool | None = None, multithreading: bool = True, **kwargs)
Bases: object
Class for creating a dataset and saving it to disk in batches.
This class generates a dataset if it doesn’t already exist on disk. It processes the data in batches and saves each batch through the specified file handler. Options include whether to overwrite an existing dataset, the batch size, and the number of workers used to load data.
- root
The root directory where the dataset will be saved.
- Type:
Path
- writer
The file handler used for saving the dataset.
- Type:
TorchSigFileHandler
- dataloader
The DataLoader used to load data in batches.
- Type:
DataLoader
Methods
check_yamls() – Checks for differences between the dataset metadata on disk and the dataset metadata in memory.
create() – Creates the dataset on disk by writing batches to the file handler.
get_writing_info_dict() – Returns a dictionary with information about the dataset being written.
- __init__(dataset: torchsig.datasets.datasets.NewTorchSigDataset, root: str, overwrite: bool = False, batch_size: int = 1, num_workers: int = 1, collate_fn: typing.Callable = <function collate_fn>, tqdm_desc: str | None = None, file_handler: torchsig.utils.file_handlers.base_handler.TorchSigFileHandler = <class 'torchsig.utils.file_handlers.zarr.ZarrFileHandler'>, train: bool | None = None, multithreading: bool = True, **kwargs)
Initializes the DatasetCreator.
- Parameters:
dataset (NewTorchSigDataset) – The dataset to be written to disk.
root (str) – The root directory where the dataset will be saved.
overwrite (bool) – Whether to overwrite an existing dataset (default: False).
batch_size (int) – The number of samples per batch (default: 1).
num_workers (int) – The number of workers for loading data (default: 1).
collate_fn (Callable) – Function to merge a list of samples into a batch (default: collate_fn, as shown in the signature).
tqdm_desc (str | None) – Description for the tqdm progress bar (optional, default: None).
file_handler (TorchSigFileHandler) – File handler for saving the dataset (default: ZarrFileHandler).
train (bool | None) – Whether the dataset is for training (optional, default: None).
- Raises:
ValueError – If the dataset does not specify num_samples.
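The `collate_fn` parameter takes a callable that merges a list of samples into one batch. A minimal sketch of that contract in plain Python follows, assuming each sample is a `(data, target)` pair. The name `pair_collate` is hypothetical, and this is not TorchSig's own `collate_fn`.

```python
from typing import Any, List, Sequence, Tuple


def pair_collate(samples: Sequence[Tuple[Any, Any]]) -> Tuple[List[Any], List[Any]]:
    """Merge (data, target) samples into one (data_batch, target_batch) pair.

    Hypothetical stand-in for illustration; not TorchSig's collate_fn.
    """
    if not samples:
        return [], []
    # zip(*samples) transposes [(d0, t0), (d1, t1)] into (d0, d1), (t0, t1).
    data, targets = zip(*samples)
    return list(data), list(targets)


# Two toy samples: a short list of values plus a label string.
batch = pair_collate([([0.1, 0.2], "bpsk"), ([0.3, 0.4], "qpsk")])
```

A callable with this shape could be passed wherever a `Callable` that merges a list of samples is expected; whether it suits a particular dataset depends on what that dataset's samples actually contain.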
- get_writing_info_dict() → Dict[str, Any]
Returns a dictionary with information about the dataset being written.
This method gathers information regarding the root, overwrite status, batch size, number of workers, file handler class, and the save type of the dataset.
- Returns:
Dictionary containing the dataset writing configuration.
- Return type:
Dict[str, Any]
- check_yamls() → List[Tuple[str, Any, Any]]
Checks for differences between the dataset metadata on disk and the dataset metadata in memory.
Compares the dataset metadata that would be written to disk against the existing metadata on disk. Returns a list of differences.
- Returns:
List of differences between metadata on disk and in memory.
- Return type:
List[Tuple[str, Any, Any]]
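To illustrate the `List[Tuple[str, Any, Any]]` return shape, here is a minimal sketch of a metadata comparison over two flat dictionaries. The function name `diff_metadata`, the `(key, on_disk, in_memory)` tuple order, and the `"<missing>"` marker are all assumptions made for illustration; the source does not specify how `check_yamls` orders or computes its tuples.

```python
from typing import Any, Dict, List, Tuple

_MISSING = "<missing>"  # hypothetical marker for a key absent on one side


def diff_metadata(
    on_disk: Dict[str, Any], in_memory: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (key, on_disk_value, in_memory_value) for every differing key.

    Illustrative only; nested dictionaries are compared as whole values.
    """
    differences: List[Tuple[str, Any, Any]] = []
    for key in sorted(set(on_disk) | set(in_memory)):
        disk_value = on_disk.get(key, _MISSING)
        memory_value = in_memory.get(key, _MISSING)
        if disk_value != memory_value:
            differences.append((key, disk_value, memory_value))
    return differences


diffs = diff_metadata(
    {"num_samples": 100, "sample_rate": 1.0e6},
    {"num_samples": 200, "sample_rate": 1.0e6, "seed": 7},
)
# diffs == [("num_samples", 100, 200), ("seed", "<missing>", 7)]
```

An empty list means the two metadata dictionaries agree on every key.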
- create() → None
Creates the dataset on disk by writing batches to the file handler.
This method generates the dataset in batches and saves it to disk. If the dataset already exists and overwrite is set to False, it will skip regeneration.
The method also writes the dataset metadata and writing information to YAML files.
- Raises:
ValueError – If the dataset is already generated and overwrite is set to False.
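The batch-and-skip behavior described for `create()` can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `write_dataset`, the JSON batch files, and the `writing_info.json` marker are illustrative names, whereas the real class writes through a file handler and YAML files. This sketch skips silently when the dataset exists and `overwrite` is false, following the description above; the documented `ValueError` case is not modelled.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


def batched(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def write_dataset(
    root: Path, samples: Iterable[int], batch_size: int = 1, overwrite: bool = False
) -> bool:
    """Write samples under root in batches; return False if an existing dataset was kept."""
    marker = root / "writing_info.json"  # hypothetical "dataset exists" marker
    if marker.exists() and not overwrite:
        return False
    root.mkdir(parents=True, exist_ok=True)
    for stale in root.glob("batch_*.json"):  # clear batches from an earlier run
        stale.unlink()
    num_batches = 0
    for index, batch in enumerate(batched(samples, batch_size)):
        (root / f"batch_{index:05d}.json").write_text(json.dumps(batch))
        num_batches += 1
    # Record the writing configuration last, so the marker implies a finished write.
    marker.write_text(json.dumps({"batch_size": batch_size, "num_batches": num_batches}))
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy_dataset"
    first_run = write_dataset(root, range(5), batch_size=2)   # writes 3 batch files
    second_run = write_dataset(root, range(5), batch_size=2)  # dataset exists, skipped
    written_files = sorted(p.name for p in root.glob("batch_*.json"))
```

Writing the marker file only after all batches are on disk is a design choice of this sketch: an interrupted run leaves no marker, so the next call regenerates the dataset instead of treating a partial write as complete.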