Cell Sampling
Sampling cells let you generate partial samples of a dataset. Vizier currently supports three forms of sampling:
- Basic
- Manually Stratified
- Automatically Stratified
The Basic sampling cell generates a new dataset consisting of a randomly selected subset of the records in the input. Every record is sampled at the same uniform rate.
- Input Dataset: The dataset to sample
- Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records)
- Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with `_sample` appended)
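As a rough illustration of what a uniform sampling rate means, the sketch below keeps each record independently with probability equal to the rate. It is plain Python, not Vizier's implementation; the function name `basic_sample` and the `seed` parameter are hypothetical, and treating the rate as a per-record probability is an assumption.

```python
import random

def basic_sample(records, rate, seed=None):
    """Keep each record independently with probability `rate`.

    Hypothetical helper for illustration only; not part of Vizier.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so rate=1 keeps everything and rate=0 keeps nothing.
    return [r for r in records if rng.random() < rate]

rows = list(range(1000))
sample = basic_sample(rows, 0.1, seed=42)
print(len(sample))  # close to 100, but not exactly 100 in general
```

Under this reading the sample size is only approximately `rate * len(records)`, since every record is decided independently.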
The Manually Stratified sampling cell generates a new dataset consisting of a subset of the records in the input. Records are sampled at a manually provided rate that varies with the value of a categorical attribute.
- Input Dataset: The dataset to sample
- Column: The categorical attribute used to select a sampling rate
- Strata: Sampling rates for each value of Column
  - Column Value: Of the records where Column has this value...
  - Sampling Rate: ...include this fraction.
- Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with `_sample` appended)
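The per-value rates can be pictured as a lookup table from Column Value to Sampling Rate. The sketch below is plain Python, not Vizier's implementation; `manual_stratified_sample` and `seed` are hypothetical names, and dropping records whose value has no listed rate is an assumption, since this page does not say how unlisted values are handled.

```python
import random

def manual_stratified_sample(records, column, strata, seed=None):
    """Keep each record with the probability listed for its `column` value.

    `strata` maps a column value to a sampling rate in [0, 1].
    Values missing from `strata` are treated as rate 0 (an assumption).
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < strata.get(r[column], 0.0)]

rows = [{"id": i, "color": "red" if i % 2 else "blue"} for i in range(100)]
# Keep every red record and roughly a quarter of the blue ones.
sample = manual_stratified_sample(rows, "color", {"red": 1.0, "blue": 0.25}, seed=3)
print(sum(r["color"] == "red" for r in sample))  # 50
```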
The Automatically Stratified sampling cell generates a new dataset consisting of a subset of the records in the input. Samples are chosen to ensure equal representation from each value of a specified categorical attribute. If too few records exist for one or more values of the categorical attribute, the cell generates an error.
- Input Dataset: The dataset to sample.
- Column: The categorical attribute used to select a sampling rate. Every distinct value of this column will have (roughly) even representation in the final sample.
- Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records). If this value is too high (i.e., some categories would be under-represented in the result), an error message will indicate the maximum value of this field.
- Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with `_sample` appended)
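To see why a high sampling rate can fail, consider one plausible reading of "equal representation": the sample holds the same number of records for every distinct value, so the rarest value caps the sample size. The sketch below is plain Python under that assumption, not Vizier's implementation; `balanced_sample`, `seed`, and the maximum-rate formula are illustrative guesses, and Vizier's result is only described as roughly even.

```python
import random
from collections import defaultdict

def balanced_sample(records, column, rate, seed=None):
    """Draw the same number of records for each distinct value of `column`.

    Hypothetical helper; assumes the target sample size is rate * len(records),
    split evenly across the distinct values.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[column]].append(r)
    if not groups:
        return []
    smallest = min(len(g) for g in groups.values())
    # The rarest value limits how many records each value can contribute.
    max_rate = smallest * len(groups) / len(records)
    if rate > max_rate:
        raise ValueError(f"sampling rate too high; maximum is {max_rate:.4f}")
    # Small epsilon guards against floating-point results just below an integer.
    per_group = int(rate * len(records) / len(groups) + 1e-9)
    rng = random.Random(seed)
    sample = []
    for g in groups.values():
        sample.extend(rng.sample(g, per_group))
    return sample

rows = [{"kind": k} for k in "aaaaaabbbbcc"]  # 6 a, 4 b, 2 c
print(len(balanced_sample(rows, "kind", 0.5, seed=1)))  # 6 (2 per value)
```

With this data, any rate above 0.5 raises an error, because the value with only two records cannot supply more than two of the sample's rows.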