Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider computing bin edge calculation #468

Open
jonmmease opened this issue Mar 5, 2024 · 0 comments
Open

Consider computing bin edge calculation #468

jonmmease opened this issue Mar 5, 2024 · 0 comments

Comments

@jonmmease
Copy link
Collaborator

I started playing with implementing the Vega/D3 automatic bin selection algorithm in SQL. I got pretty close with this (in duckdb)

-- [duckdb]
WITH
  -- Input table
  "tbl_a" as (
    SELECT * FROM (VALUES (-2.9), (0), (23.5)) as tbl(col_value)
  ),
  -- Compute raw min and max
  "tbl_b" as (
    SELECT MIN(col_value) as "min_", MAX(col_value) as "max_"
    FROM "tbl_a"
  ),
  -- Config and scalar calculations
  "tbl_0" AS (
    SELECT 
      -- config
      10 as base
      , 20 as maxbins
      , 0 as minstep
      -- pass through
      , min_
      , max_
      -- calculation
      , ln(base) as logb
      , max_ - min_ as span
      , ceil(ln(maxbins) / logb) as level
      , greatest(minstep, pow(base, round(ln(span) / logb) - level)) as step
      FROM "tbl_b"
  ),
  -- Increase step size if too many bins
  "tbl_1" as (
    SELECT logb, maxbins, minstep, base, min_, max_, span
      , ceil(span / (step * pow(base, base_factor))) as num_bins
      , step * pow(base, base_factor) as step
    FROM "tbl_0"
    CROSS JOIN (SELECT * FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as tbl(base_factor))
    WHERE num_bins <= maxbins
    ORDER BY num_bins DESC
    LIMIT 1
  ),
  -- Decrease step size (based on divide of [5, 2]) if allowed
  "tbl_2" as (
    SELECT base, num_bins, min_, max_, logb, step / div as step
    FROM "tbl_1"
    CROSS JOIN (SELECT * FROM (VALUES (5), (2), (1)) as tbl(div))
    WHERE step >= minstep AND span / step <= maxbins
    ORDER BY step
    LIMIT 1
  ),
  -- Update precision of min_ and max_
  "tbl_3" as (
    SELECT base, step, num_bins, min_, max_
      , ln(step) as v, CASE WHEN v >= 0.0 THEN 0.0 ELSE floor(-v / logb) + 1.0 END as precision
      , pow(base, -precision - 1.0) as eps
      , floor(min_ / step + eps) * step as v1
      , CASE WHEN min_ < v1 THEN v1 - step ELSE v1 END as final_min
      , ceil(max_ / step) * step as final_max2
      , final_min as start
      , CASE WHEN final_max2 <> final_min THEN final_max2 ELSE final_min + step END as stop

    FROM "tbl_2"
  ),
  -- Add new binned columns to input table
  "tbl_4" as (
    SELECT start, stop, step, col_value
      , start + floor((col_value - start) / (step)) * step as bin_start
      , bin_start + step as bin_end
    from "tbl_a"
    CROSS JOIN "tbl_3"
  )
SELECT * FROM "tbl_4"
|    |   start |   stop |   step |   col_value |   bin_start |   bin_end |
|---:|--------:|-------:|-------:|------------:|------------:|----------:|
|  0 |      -4 |     24 |      2 |        -2.9 |          -4 |        -2 |
|  1 |      -4 |     24 |      2 |         0   |           0 |         2 |
|  2 |      -4 |     24 |      2 |        23.5 |          22 |        24 |

In place of the while loop in the "Increase step size if too many bins" section of the algorithm, I added a fixed number of iterations and selected the first one that satisfies the criteria. I'm not sure how best to select this threshold, but I don't think in needs to be very high (and it shouldn't be very expensive).

In the case where the bin transform is required to output a signal, we would need to perform this calculation in Rust anyway, but for the case where it doesn't need to output a signal, it might be benefitial to avoid performing a separate query to compute the bin edges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant