additional higher-level `rle` api? #897

ericphanson · 2023-10-02T10:51:19Z

I have been using the following abstraction over rle in some private code:

"""
    Run{T}

An object which represents a "run" of repeated values in a vector, produced by [`get_runs`](@ref).

See also: [`run_length`](@ref), [`run_value`](@ref), and [`run_indices`](@ref).
"""
struct Run{T}
    val::T
    indices::UnitRange{Int}
end
"""
    run_length(r::Run)

Returns the length of a given run.
"""
run_length(r::Run) = length(r.indices)
"""
    run_value(r::Run)

Returns the value of a given run.
"""
run_value(r::Run) = r.val

"""
    run_indices(r::Run)

Returns the indices (into the original vector) associated to the given run.
"""
run_indices(r::Run) = r.indices

"""
    get_runs(v) -> Vector{Run}

Performs a run-length encoding, returning a sorted vector of [`Run`](@ref) objects, each which support [`run_length`](@ref), [`run_value`](@ref), and [`run_indices`](@ref).
"""
function get_runs(v)
    vals, lens = rle(v)
    cs = cumsum(lens)
    run_starts = cs .- lens .+ 1
    run_stops = cs
    return [Run(vals[i], run_starts[i]:run_stops[i])
            for i in eachindex(vals, run_starts, run_stops)]
end

Would this be appropriate to upstream to StatsBase.jl (w/ tests)?

I have found this useful for manipulating boolean masks in-place, some examples below. I found the code written in terms of this api much simpler/easier to follow, since the lengths <-> index conversion is hidden away in get_runs, where it can be written/tested once and then used a bunch of times in downstream code.

Example usages

"""
    ranges_where_true(v) -> Vector{UnitRange{Int}}

Returns the sorted collection of indices where `v` has elements `true`, assuming `v` is a boolean mask.
"""
function ranges_where_true(v)
    runs = get_runs(v)
    filter!(run_value, runs) # only runs with value `true`
    return run_indices.(runs)
end

"""
    merge_mask_gaps!(v, N::Integer) -> v

Given a boolean mask `v`, merges "gaps" (i.e. runs of falses surrounded by trues) of length up to `N`, modifying `v` in-place.

## Example
```jldoctest
julia> show(merge_mask_gaps!([true, false, true, false], 1))
Bool[1, 1, 1, 0]
julia> show(merge_mask_gaps!([true, false, false, true, false], 1))
Bool[1, 0, 0, 1, 0]
julia> show(merge_mask_gaps!([true, false, false, true, false], 2))
Bool[1, 1, 1, 1, 0]

```
"""
function merge_mask_gaps!(v, N::Integer)
    @argcheck N >= 0
    N == 0 && return v
    runs = get_runs(v)

    # We want to find short 0 runs to flip. We only want to flip ones after the first non-zero run, and before the last non-zero run.
    f = findfirst(run_value, runs)
    l = findlast(run_value, runs)
    if !isnothing(f) && !isnothing(l)
        for i in f:l
            run = runs[i]
            if !run_value(run) && run_length(run) <= N
                v[run_indices(run)] .= true
            end
        end
    end
    return v
end

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

additional higher-level `rle` api? #897

additional higher-level `rle` api? #897

ericphanson commented Oct 2, 2023 •

edited

Loading

additional higher-level rle api? #897

additional higher-level rle api? #897

Comments

ericphanson commented Oct 2, 2023 • edited Loading

additional higher-level `rle` api? #897

additional higher-level `rle` api? #897

ericphanson commented Oct 2, 2023 •

edited

Loading