Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweaking internal representation #69

Open
andrewthad opened this issue Aug 26, 2019 · 2 comments
Open

Tweaking internal representation #69

andrewthad opened this issue Aug 26, 2019 · 2 comments

Comments

@andrewthad
Copy link

The internal representation of Scientific is currently:

data Scientific = Scientific
  { coefficient :: !Integer
  , base10Exponent :: {-# UNPACK #-} !Int
  }

It's been pointed out on another issue that this is incorrect, but I suspect that, in the process of fixing it, there is an additional optimization that may be worth making. The type currently makes chase a pointer to get to the coefficent, even when the coefficient is something small and fits inside of S#. But what if this was done instead?

-- Invariant: The S# data constructor is never used.
data UnpackableInteger = UnpackableInteger (# Int# | Integer #)

data Scientific = Scientific
  { coefficient :: {-# UNPACK #-} !UnpackableInteger
  , base10Exponent :: {-# UNPACK #-} !UnpackableInteger
  }

This would only work on GHC 8.6 and higher because using UnboxedSums in this way would corrupt codegen before then. However, in the extremely common case in which the coefficient and the exponent are both small (less than 2^32 or 2^64 depending on platform), this can keep all the data on a single cache line. It could be argued that it makes the data constructor take up more space. This representation has a 6-machine-word payload (2 for Int#s, 2 for Integers, and 2 for the sums toggle words), while the more simple data Scientific = Scientific !Integer !Integer takes only 2 machine words. However, the S# data constructors that Integer requires for small integers means that we actually end up needing 4 additional machine words. The difference is that things are just more spread across memory. In the case where either the coefficient or the exponent were large, then the representation I suggest here would consume more memory than the naive one and perform slightly worse.

Ignoring the abysmal backwards-compatibility issue for the moment, the important question is whether it's better to optimize for small integers or large ones. My preference is small integers since I see Scientific as a safe intermediary for lexers and parsers that are, in practice, applied to small numbers most of the time (the prevalent example being aeson).

If the project author finds this idea agreeable, I'm happy to implement this and verify that it improves the performance of the library. I'd do it in a way that shims out a non-unboxed-sum implementation on older GHCs.

@treeowl
Copy link

treeowl commented Jan 14, 2022

Other possible flavors:

data Scientific = Scientific
  (# (# Int#, Int# #) | (# Int#, Integer #) | (# Integer, Int# #) | (# Integer, Integer #)
data Scientific
  = SmallSmall !Int !Int
  | BigSmall !Integer !Int
  | SmallBig !Int !Integer
  | BigBig !Integer !Integer

It would also be possible to select just a couple of these options.

@andrewthad
Copy link
Author

In the scientific-notation libary, I use this:

data Scientific = Scientific
  {-# UNPACK #-} !Int -- coefficient
  {-# UNPACK #-} !Int -- base-10 exponent, minBound means use unlimited-precision field
  LargeScientific

data LargeScientific = LargeScientific
  !Integer -- coefficent
  !Integer -- exponent

Which is like having only SmallSmall and BigBig.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants