Data Basics
Read through link
- Sampler: Randomly chooses an index per iteration. It yields batches of indices when `batch_size` is not `None`.
  - For `IterableDataset`, it keeps yielding `None` per iteration using `_InfiniteConstantSampler`.
- Fetcher: Takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes `collate_fn` over each batch of data and drops the remaining unfilled batch if `drop_last` is set.
  - For `IterableDataset`, it simply takes the next batch-size elements as a batch.
- Single Process (a minimal code sketch of this flow follows the diagrams below):

  ```
  Sampler
    |
  index/indices
    |
    V
  Fetcher
    |
  index/indices
    |
    V
  dataset
    |
    V
  collate_fn
    |
    V
  output
  ```
- Multiple processes:

  ```
  Sampler (Main process)
    |
  index/indices
    |
    V
  Index Multiprocessing Queue (dispatched to one healthy worker)
    |
  index/indices
    |
    V
  Fetcher (Worker process)
    |
  index/indices
    |
    V
  dataset
    |
  Batch of data
    |
    V
  collate_fn
    |
    V
  Result Multiprocessing Queue
    |
  Data
    |
    V
  pin_memory_thread (Main process)
    |
    V
  output
  ```
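The single-process flow above can be approximated with public `torch.utils.data` components. The following is a minimal sketch, not the DataLoader implementation itself; the toy dataset, the batch size of 4, and the use of `default_collate` are illustrative assumptions:

```python
import torch
from torch.utils.data import Dataset, RandomSampler, BatchSampler
# default_collate is DataLoader's default collate_fn (also exposed as
# torch.utils.data.default_collate in recent releases).
from torch.utils.data.dataloader import default_collate

class SquaresDataset(Dataset):
    """Toy map-style dataset: index -> (x, x**2)."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        x = torch.tensor(float(idx))
        return x, x ** 2

dataset = SquaresDataset()

# Sampler: because batch_size is not None, indices are grouped into batches.
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=4, drop_last=True)

# Fetcher (approximated): look up every index in the dataset, then collate the batch.
for indices in batch_sampler:
    samples = [dataset[i] for i in indices]
    xs, ys = default_collate(samples)
    print(indices, xs, ys)
```

`DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)` performs the same steps internally.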
This is just the general data and control flow in DataLoader. There are further detailed functionalities such as prefetching, `worker_status` tracking, etc.
Most common questions about DataLoader involve multiple workers, i.e. when multiprocessing is enabled.
- Default multiprocessing start methods differ across platforms; see https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
  - Control randomness per worker using `worker_init_fn` (see the seeding sketch after this list). Otherwise, DataLoader either becomes non-deterministic when using spawn, or shares the same random state across workers when using fork.
  - COW in fork (effectively copy-on-access in Python, since reference counting touches every object). The simplest solution when implementing a Dataset is to use a Tensor or NumPy array instead of arbitrary Python objects like list and dict.
- Difference between Map-style Dataset and Iterable-style Dataset
  - A Map-style Dataset can utilize the indices sampled from the main process to get automatic sharding.
  - An Iterable-style Dataset requires users to manually implement sharding inside the `__iter__` method using `torch.utils.data.get_worker_info()`. Please check the example (a sketch of this pattern follows this list).
  - Shuffle is not enabled for Iterable-style Dataset. If needed, users need to implement shuffle utilities inside the `IterableDataset` class. (This is solved by the TorchData project.)
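To make per-worker randomness concrete, here is a hedged sketch of one common seeding pattern via `worker_init_fn`; the dataset and the seed-derivation scheme are illustrative choices, not prescribed by this page:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # torch.initial_seed() inside a worker already differs per worker;
    # reuse it so NumPy and random also get distinct, reproducible states,
    # even under fork where the Python-level state would otherwise be shared.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
loader = DataLoader(dataset, batch_size=10, num_workers=2, worker_init_fn=seed_worker)
```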
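The manual sharding requirement for Iterable-style Datasets can be illustrated with a small sketch based on `torch.utils.data.get_worker_info()`; the range dataset and the even-split policy are assumptions made for illustration:

```python
import math
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeIterableDataset(IterableDataset):
    """Yields integers in [start, end); each worker handles a contiguous shard."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this replica sees the whole range.
            lo, hi = self.start, self.end
        else:
            # Multi-process loading: split the range evenly across workers.
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        for i in range(lo, hi):
            yield torch.tensor(i)

if __name__ == "__main__":  # guard needed when the start method is spawn
    loader = DataLoader(RangeIterableDataset(0, 16), batch_size=4, num_workers=2)
    print([batch.tolist() for batch in loader])
```

Without the `get_worker_info()` branch, every worker would yield the full range and the loader would return duplicated data.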
Read through link and link.
Expected features:
- Automatic/Dynamic sharding
- Determinism Control
- Snapshotting
- DataFrame integration
- etc.
Go to N1222094 for the Data Lab.