An iterator that can stream over stdin #1193

Open
erip opened this issue Jul 4, 2023 · 0 comments

erip commented Jul 4, 2023

🚀 The feature

An IterDataPipe that can consume lines from stdin and automatically re-cycle them each epoch.

Motivation, pitch

I'd like to push data augmentation and preprocessing upstream so that model training/inference can operate directly on tokens streamed over stdin. This allows for tremendous flexibility without users needing to hard-code a preprocessing pipeline in userland code. For an NLP use case, I imagine something like...

paste <(cut -f1 train.tsv | spm_encode --model ...) \
      <(cut -f2 train.tsv) | \
      python train_with_stdin_iter.py --epochs 5

with some code similar to

import sys

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper


def create_tensor(line):
    # Each stdin line is "<tokens>\t<label>"; vocab_lookup is user-defined.
    X, y = line.strip().split("\t")
    return vocab_lookup(X), int(y)


iter_dp = IterableWrapper(sys.stdin).map(create_tensor)
loader = DataLoader(iter_dp)

Alternatives

The preprocessed text could be written to a file which native torchdata constructs could operate on directly. This is fine, but requires a copy of the data to be written to disk.
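For comparison, a rough sketch of that alternative, assuming the shell pipeline above is first redirected to a file (the filename train.preprocessed.tsv is made up for illustration) and using torchdata's FileOpener/LineReader datapipes:

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileOpener, IterableWrapper

# Assumes the shell pipeline was written to disk beforehand, e.g.
#   paste <(...) <(...) > train.preprocessed.tsv
files = IterableWrapper(["train.preprocessed.tsv"])
lines = FileOpener(files, mode="r").readlines(return_path=False)
loader = DataLoader(lines.map(create_tensor))

Because the file is reopened every time the datapipe is iterated, this version re-cycles naturally across epochs, which is exactly what stdin cannot do.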

Additional context

The current code doesn't work as-is: once sys.stdin hits EOF it is exhausted and cannot be rewound, so the DataLoader only sees a single epoch's worth of data.
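Until such a datapipe exists, one possible workaround is to buffer stdin in memory on the first pass and replay the buffer on later epochs. A minimal sketch, assuming the whole stream fits in memory and a single-process DataLoader (the class name is hypothetical, not part of torchdata):

import sys

from torchdata.datapipes.iter import IterDataPipe


class StdinCacheIterDataPipe(IterDataPipe):
    """Reads stdin once, caches the lines, and replays the cache on later epochs."""

    def __init__(self):
        self._cache = []
        self._exhausted = False

    def __iter__(self):
        if self._exhausted:
            yield from self._cache
            return
        for line in sys.stdin:
            self._cache.append(line)
            yield line
        self._exhausted = True

With this, iter_dp = StdinCacheIterDataPipe().map(create_tensor) survives multiple epochs, at the cost of holding the entire stream in memory; a built-in datapipe could hide that trade-off (or spill to disk) behind a consistent API.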
