-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arithmetic coding demo #631
base: main
Are you sure you want to change the base?
Changes from 15 commits
e6e3585
edcf280
84b5fa8
72b1202
7f65307
8003af5
3c934e7
d616f7f
77341c9
31d0b37
2b48f4c
4d876f1
d2dd11f
4c43d9b
7ebc207
6a9595b
6b48407
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
'## [Arithmetic coding](https://en.wikipedia.org/wiki/Arithmetic_coding) | ||
This demonstrates a lossless method for compression on a string of letters. | ||
Rather than assigning a code to each letter, the entire string is encoded | ||
into a single floating-point number. | ||
|
||
Alphabet = Fin 26 | ||
Interval = (Float&Float) | ||
top:Interval = (0.,1.) | ||
|
||
def charToIdx (c: Word8) : Int = W8ToI c - W8ToI 'a' | ||
def idxToChar (i: Int) : Word8 = IToW8 (i + (W8ToI 'a')) | ||
|
||
'### Statistical modelling | ||
First, model the probability of each letter given by the string to be encoded. | ||
|
||
def cumProb (ps: n=>Float) : n=>Float = | ||
withState 0.0 \total. | ||
for i. if ps.i > 0. | ||
then | ||
currTotal = get total | ||
newTotal = currTotal + ps.i | ||
total := newTotal | ||
currTotal | ||
else 0. | ||
|
||
def getFrequency (str: (Fin l)=>Word8) : Alphabet=>Int = | ||
a: Alphabet => Int = zero | ||
yieldState a \ref. for i. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This one can also be computed with a parallel |
||
i' = (charToIdx str.i)@_ | ||
ref!i' := (get ref).i' + 1 | ||
|
||
def getProbability (l: Int) (freq: Alphabet=>Int) : Alphabet=>(Float&Float) = | ||
probs = for i. IToF freq.i / IToF l | ||
cums = cumProb probs | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this going to repeat work every time it's called? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The probabilities are cached on line 92: |
||
for i. (probs.i, cums.i) | ||
|
||
'### Scaling functions | ||
|
||
def getUpdateRule (p: Alphabet=>(Float&Float)) : Alphabet=>(Interval->Interval) = | ||
for i. | ||
case p.i == (0.,0.) of | ||
True -> id | ||
False -> | ||
\(x, w). | ||
x' = x + w*(snd p.i) | ||
w' = w*(fst p.i) | ||
(x', w') | ||
|
||
def subdivide (str: (Fin l)=>Word8) | ||
(rule: Alphabet=>(Interval->Interval)) | ||
(i: (Fin l)) (in: Interval) : Interval = | ||
updateInterval = rule.((charToIdx str.i)@_) | ||
updateInterval in | ||
|
||
def findInterval (l: Int) | ||
(code: Float) | ||
(rule: Alphabet=>(Interval->Interval)) | ||
(i: (Fin l)) | ||
((str,in): (List Word8 & Interval)) : (List Word8 & Interval) = | ||
(letter, in') = boundedIter (size Alphabet) (' ', top) \j. | ||
case rule.(j@_) in == in of | ||
True -> Continue | ||
False -> | ||
(x, w) = rule.(j@_) in | ||
case code >= x && code < (x+w) of | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this a binary search? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's actually just linear search on intervals, following after most code implementations I've seen. But maybe it'd scale better to larger alphabets if binary search were implemented instead. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You might be able to use |
||
True -> Done (idxToChar j, (x,w)) | ||
False -> Continue | ||
(str <> AsList 1 [letter], in') | ||
|
||
'### Coding interface | ||
Start from an initial interval, [0, 1). | ||
For each letter encoded from the string, the current interval is divided based on the | ||
cumulative probability of all letters, then updated to the partition that matches | ||
the encoded letter. | ||
The decoding process retraces the steps of the encoding process to recover the correct letters. | ||
|
||
def encode (str: (Fin l)=>Word8) (rule: Alphabet=>(Interval->Interval)) : Float = | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It gets us this error:
But yeah, it does look more succinct with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I see. Well it's not a big deal either way. |
||
update = subdivide str rule | ||
finalInterval = fold top update | ||
fst finalInterval + (snd finalInterval)/2. | ||
|
||
def decode (l: Int) (code: Float) (rule: Alphabet=>(Interval->Interval)) : List Word8 = | ||
update = findInterval l code rule | ||
initStr: List Word8 = AsList _ [] | ||
fst $ fold (initStr, top) update | ||
|
||
'### Demo: Lossless compression on a test string | ||
|
||
str' = "abbadcabccdd" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you make a longer test? I'm not convinced that there aren't lurking floating-point issues in this implementation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A longer test here would fail. I could change all instances of Float to Float64 for more precision and pass the longer test, but the code would look bloated. i.e. this line, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmmm. Would an even longer test then fail even if you used F64? I get that this is just a demo, but if it only works for short sequences we should put some warnings in. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree. I'll probably make a more serious attempt at integer arithmetic before I commit to this implementation and add warnings. |
||
(AsList l str) = str' | ||
|
||
p = getProbability l $ getFrequency str | ||
r = getUpdateRule p | ||
|
||
code = encode str r | ||
code | ||
> 0.081569 | ||
|
||
decoded = decode l code r | ||
decoded == str' | ||
> True | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be computed with
Accum
instead of state.