An iterator that can stream over stdin #1193

Open
erip opened this issue Jul 4, 2023 · 0 comments

erip commented Jul 4, 2023

🚀 The feature

An IterDataPipe that can consume lines from stdin and automatically re-cycle them each epoch.

Motivation, pitch

I'd like to push data augmentation and preprocessing upstream so that model training/inference can operate directly on tokens streamed over stdin. This allows for tremendous flexibility without users needing to hard-code a preprocessing pipeline in userland code. For an NLP use case, I imagine something like...

paste <(cut -f1 train.tsv | spm_encode --model ...) \
      <(cut -f2 train.tsv) | \
      python train_with_stdin_iter.py --epochs 5

with some code similar to

import sys

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper


def create_tensor(line):
    # Each stdin line is "<tokens>\t<label>"; vocab_lookup is user-defined.
    X, y = line.strip().split("\t")
    return vocab_lookup(X), int(y)


iter_dp = IterableWrapper(sys.stdin).map(create_tensor)
loader = DataLoader(iter_dp)

Alternatives

The preprocessed text could be written to a file which native torchdata constructs could operate on directly. This is fine, but requires a copy of the data to be written to disk.
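For comparison, a rough sketch of that alternative, assuming the shell pipeline above is first redirected to a file (the filename train.preprocessed.tsv is made up for illustration) and using torchdata's FileOpener/LineReader datapipes:

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileOpener, IterableWrapper

# Assumes the shell pipeline was written to disk beforehand, e.g.
#   paste <(...) <(...) > train.preprocessed.tsv
files = IterableWrapper(["train.preprocessed.tsv"])
lines = FileOpener(files, mode="r").readlines(return_path=False)
loader = DataLoader(lines.map(create_tensor))

Because the file is reopened every time the datapipe is iterated, this version re-cycles naturally across epochs, which is exactly what stdin cannot do.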

Additional context

The current code doesn't work as-is: once sys.stdin hits EOF it is exhausted and cannot be rewound, so the DataLoader only sees a single epoch's worth of data.
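Until such a datapipe exists, one possible workaround is to buffer stdin in memory on the first pass and replay the buffer on later epochs. A minimal sketch, assuming the whole stream fits in memory and a single-process DataLoader (the class name is hypothetical, not part of torchdata):

import sys

from torchdata.datapipes.iter import IterDataPipe


class StdinCacheIterDataPipe(IterDataPipe):
    """Reads stdin once, caches the lines, and replays the cache on later epochs."""

    def __init__(self):
        self._cache = []
        self._exhausted = False

    def __iter__(self):
        if self._exhausted:
            yield from self._cache
            return
        for line in sys.stdin:
            self._cache.append(line)
            yield line
        self._exhausted = True

With this, iter_dp = StdinCacheIterDataPipe().map(create_tensor) survives multiple epochs, at the cost of holding the entire stream in memory; a built-in datapipe could hide that trade-off (or spill to disk) behind a consistent API.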
