Cell Sampling

Oliver Kennedy edited this page Dec 31, 2023 · 2 revisions

Sampling cells let you generate a partial sample of a dataset. Vizier currently supports three forms of sampling:

  1. Basic
  2. Manually Stratified
  3. Automatically Stratified

Basic Sample

This cell generates a new dataset consisting of a randomly selected subset of the records in the input. The same uniform sampling rate applies to every record.

  • Input Dataset: The dataset to sample
  • Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records)
  • Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with _sample appended)
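
As a rough illustration of what a uniform sampling rate means, here is a minimal Python sketch. It is not Vizier's implementation: the function name `basic_sample`, the `seed` parameter, and the choice of keeping each record independently with probability equal to the rate are all assumptions made for this example.

```python
import random


def basic_sample(records, rate, seed=None):
    """Keep each record independently with probability `rate`.

    Hypothetical helper for illustration only; Vizier may select
    records differently (for example, by drawing a fixed-size subset).
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() returns a value in [0, 1), so rate = 1 keeps every
    # record and rate = 0 keeps none.
    return [record for record in records if rng.random() < rate]
```

Under this scheme the size of the sample is only approximately `rate` times the size of the input, since each record is decided independently.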

Manually Stratified Sample

This cell generates a new dataset consisting of a subset of the records in the input. Records are chosen using manually provided sampling rates, one for each value of a categorical attribute.

  • Input Dataset: The dataset to sample
  • Column: The categorical attribute used to select a sampling rate
  • Strata: Sampling rates for each value of Column
    • Column Value: Of the records where Column has this value...
    • Sampling Rate: ...include this fraction.
  • Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with _sample appended)
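
The per-value rates can be pictured with the following Python sketch. It is an illustration under stated assumptions, not Vizier's code: the name `manually_stratified_sample`, the representation of records as dictionaries, and the decision to drop records whose Column value has no listed rate are all hypothetical.

```python
import random


def manually_stratified_sample(records, column, strata, seed=None):
    """Sample records using a per-value rate looked up from `strata`.

    `strata` maps a value of `column` to the fraction of matching
    records to keep. Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    sample = []
    for record in records:
        # Assumption: values without a configured rate are excluded.
        rate = strata.get(record[column], 0.0)
        if rng.random() < rate:
            sample.append(record)
    return sample
```

For example, `manually_stratified_sample(rows, "state", {"NY": 0.5, "CA": 0.1})` would keep roughly half of the records where `state` is `NY` and roughly a tenth of those where it is `CA`; `rows` and `state` are made-up names.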

Automatically Stratified Sample

This cell generates a new dataset consisting of a subset of the records in the input. Records are chosen to ensure equal representation of each value of a specified categorical attribute. Note that if too few records exist for one or more values of the categorical attribute, the cell will generate an error.

  • Input Dataset: The dataset to sample.
  • Column: The categorical attribute used to select a sampling rate. Every distinct value of this column will have (roughly) even representation in the final sample.
  • Sampling Rate: The fraction of records to include in the sample (1 = all records, 0 = no records). If this value is too high (i.e., some categories would be under-represented in the result), an error message will indicate the maximum value of this field.
  • Output Dataset (optional): The name to assign the sampled dataset (defaults to the name of the input dataset with _sample appended)
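
One way to read the interaction between the sampling rate and the error condition is sketched below in Python. This is an assumption-laden illustration rather than Vizier's algorithm: the name `auto_stratified_sample`, the dictionary record format, the exact per-value quota, and the formula used for the maximum rate are all hypothetical.

```python
import random
from collections import defaultdict


def auto_stratified_sample(records, column, rate, seed=None):
    """Draw the same number of records for every value of `column`.

    `records` must be a list. Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record)
    if not groups:
        return []

    total = len(records)
    smallest = min(len(members) for members in groups.values())
    # The rarest value caps the per-value quota, and therefore the rate.
    max_rate = smallest * len(groups) / total
    if rate > max_rate:
        raise ValueError(
            f"sampling rate too high; maximum is {max_rate:.3f}"
        )

    # Equal quota per value, rounded down to a whole number of records.
    per_group = int(rate * total / len(groups))
    rng = random.Random(seed)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, per_group))
    return sample
```

In this sketch, a dataset with 8 records of one category and 2 of another admits a maximum rate of 0.4: any higher rate would require more than 2 records from the smaller category.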