torchsig.utils.writer.DatasetCreator
- class torchsig.utils.writer.DatasetCreator(dataloader: DataLoader = None, dataset_length: int | None = None, root: str = '.', overwrite: bool = True, tqdm_desc: str | None = None, file_handler: FileWriter = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, multithreading: bool = True, **kwargs)
Bases:
object
Class for creating a dataset and saving it to disk in batches.
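This class generates a dataset if it doesn’t already exist on disk. It processes the data in batches and saves each batch using the specified file handler. Configurable options include whether to overwrite an existing dataset, the batch size, and the number of workers (the last two do not appear in the constructor signature, so they are presumably configured on the DataLoader that is passed in).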
- dataloader
The DataLoader used to load data in batches.
- Type:
DataLoader
- root
The root directory where the dataset will be saved.
- Type:
Path
- file_handler
The file handler used for saving the dataset.
- Type:
FileWriter
Methods
create() – Creates the dataset on disk by writing batches to the file handler.
get_writing_info_dict() – Returns a dictionary with information about the dataset being written.
- __init__(dataloader: DataLoader = None, dataset_length: int | None = None, root: str = '.', overwrite: bool = True, tqdm_desc: str | None = None, file_handler: FileWriter = <class 'torchsig.utils.file_handlers.hdf5.HDF5Writer'>, multithreading: bool = True, **kwargs)
Initializes the DatasetCreator.
- Parameters:
dataloader (DataLoader) – The DataLoader used to load data in batches. Defaults to None.
dataset_length (int | None) – The number of samples to draw from the dataset. Defaults to None.
root (str) – The root directory where the dataset will be saved. Defaults to '.'.
overwrite (bool) – Whether to overwrite an existing dataset. Defaults to True.
tqdm_desc (str | None) – A description for the progress bar. Defaults to None.
file_handler (FileWriter) – The file handler used for saving the dataset. Defaults to HDF5Writer.
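As a minimal sketch of how the documented constructor arguments fit together, the snippet below collects them in a plain dictionary and defers the torchsig import until a creator is actually built. The helpers `creator_kwargs` and `build_creator` are hypothetical names introduced for illustration and are not part of torchsig; only the `DatasetCreator` keyword arguments come from the signature above.

```python
from typing import Any, Dict, Optional


def creator_kwargs(
    dataloader: Any,
    dataset_length: Optional[int] = None,
    root: str = ".",
    overwrite: bool = True,
    tqdm_desc: Optional[str] = None,
    multithreading: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: gather the documented constructor arguments."""
    # file_handler is deliberately omitted so the documented default applies.
    return {
        "dataloader": dataloader,
        "dataset_length": dataset_length,
        "root": root,
        "overwrite": overwrite,
        "tqdm_desc": tqdm_desc,
        "multithreading": multithreading,
    }


def build_creator(dataloader: Any, **options: Any):
    """Hypothetical helper: construct a DatasetCreator (requires torchsig)."""
    # Imported lazily so this module loads even where torchsig is not installed.
    from torchsig.utils.writer import DatasetCreator

    return DatasetCreator(**creator_kwargs(dataloader, **options))


# Inspect the arguments without touching torchsig; None stands in for a DataLoader.
kwargs = creator_kwargs(None, root="./my_dataset", overwrite=False)
```

With torchsig installed and a real DataLoader (here called `loader`, a placeholder), `build_creator(loader, root="./my_dataset").create()` would be the corresponding call sequence.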
- get_writing_info_dict() → dict[str, Any]
Returns a dictionary with information about the dataset being written.
This method gathers information regarding the root, overwrite status, batch size, number of workers, file handler class, and the save type of the dataset.
- Returns:
Dictionary containing the dataset writing configuration.
- Return type:
dict[str, Any]
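To make the shape of such a dictionary concrete, here is a small stand-in that gathers the six fields named in the description. The function `writing_info`, the key names, the `FakeWriter` class, and the example values are all assumptions made for illustration; the actual keys returned by `get_writing_info_dict()` may differ.

```python
from typing import Any, Dict


def writing_info(
    root: str,
    overwrite: bool,
    batch_size: int,
    num_workers: int,
    file_handler: type,
    save_type: str,
) -> Dict[str, Any]:
    """Collect the fields the description lists; key names are hypothetical."""
    return {
        "root": root,
        "overwrite": overwrite,
        "batch_size": batch_size,
        "num_workers": num_workers,
        # Store the class name so the dictionary stays easy to serialize.
        "file_handler": file_handler.__name__,
        "save_type": save_type,
    }


class FakeWriter:
    """Placeholder standing in for a FileWriter subclass."""


# All values below are placeholders, not torchsig defaults.
info = writing_info("./demo", True, 32, 4, FakeWriter, "placeholder")
```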
- create() → None
Creates the dataset on disk by writing batches to the file handler.
This method generates the dataset in batches and saves it to disk. If the dataset already exists and overwrite is set to False, it will skip regeneration.
The method also writes the dataset metadata and writing information to YAML files.
- Raises:
ValueError – If the dataset is already generated and overwrite is set to False.
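The control flow described for `create()` can be sketched with the standard library alone. This is a stand-in under stated assumptions, not torchsig's implementation: `create_dataset` is a hypothetical function, batches are plain lists, each batch is written as a JSON file instead of going through a file handler, and the info file is JSON rather than YAML. The description above mentions both skipping regeneration and a `ValueError` when the dataset exists and overwrite is False; the sketch takes the skip path, which is an assumption.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def create_dataset(batches: Iterable[List[int]], root: Path, overwrite: bool = True) -> bool:
    """Write batches under `root`; return False if an existing dataset was kept."""
    marker = root / "writing_info.json"  # hypothetical "dataset exists" marker
    if marker.exists() and not overwrite:
        return False  # assumption: skip regeneration instead of raising

    # Start from a clean directory so stale batch files never survive an overwrite.
    if root.exists():
        shutil.rmtree(root)
    root.mkdir(parents=True)

    count = 0
    for index, batch in enumerate(batches):
        # One file per batch; the real class delegates this to its file handler.
        (root / f"batch_{index:05d}.json").write_text(json.dumps(batch))
        count += len(batch)

    # Written last, so its presence signals a completed dataset.
    marker.write_text(
        json.dumps({"root": str(root), "overwrite": overwrite, "num_samples": count})
    )
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo_dataset"
    first = create_dataset([[1, 2], [3, 4], [5]], target, overwrite=False)
    second = create_dataset([[9, 9]], target, overwrite=False)
    third = create_dataset([[9, 9]], target, overwrite=True)
    info = json.loads((target / "writing_info.json").read_text())
    remaining = sorted(p.name for p in target.glob("batch_*.json"))
```

Writing the marker file after all batches is a design choice of this sketch: an interrupted run leaves no marker, so the next call regenerates the data instead of treating a partial dataset as complete.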