explicit types for everything #31
Comments
(not urgent)
@woodsaj did you see dgryski/go-tsz#7? it looks like if we can avoid decimals, storage efficiency goes up quite a bit. though it could use some further experimentation.
It is completely impractical to not be able to handle floats.
right, we would still be using and supporting floats.
our storage already handles floats without decimals much more efficiently, and later we can add optimisations for ints and bools. but i think it's important to start sending data properly sooner rather than later.
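As a rough illustration of the "avoid decimals" idea from dgryski/go-tsz#7, here is a minimal sketch (the helper name is made up for this example, not existing code): readings would be quantized to an integral unit such as whole milliseconds before they enter the float64 series, since integral values compress much better in a go-tsz style encoder.

```python
# Hypothetical helper, not existing code: quantize a latency reading taken in
# seconds to whole milliseconds before it enters the float64 series. Integral
# values delta/XOR encode far more compactly than values like 0.04273.
def quantize_latency_ms(latency_seconds):
    return float(round(latency_seconds * 1000))

print(quantize_latency_ms(0.04273))  # 43.0 instead of 0.04273
```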
I think the sooner we can anticipate the changes in data format that will be required to optimize storage, and do a 1-time data format change (breaking to current alpha users), the better. we're only boarding more and more people and supposedly our platform is gearing up to become production ready, so if we're gonna make changes like storing latency at ms or 0.1 ms granularity instead of 1 s, which will break people's graphs, i'd rather get it over with soon. @woodsaj and @nopzor1200 I would like to discuss and decide what's next, either through a hangout soon, or possibly irl in NYC. basically it's all about figuring out the best way to store the litmus metrics (and more than just litmus, but litmus is where to start), here's the current list:
I think I want MT to become a metrics2.0/unit-aware store and it can do its own optimizations.
if MT becomes unit aware, it means
@Dieterbe I absolutely agree that metrics2.0 support in MT is the correct way forward on this, presumably combined with an ingestion service that can handle old-school unitless readings and decide type for them based on a configured set of rules. As far as data storage, I don't really have a frame of reference yet for how much optimization is too much. The split-range approach for ping seems interesting, if we can assume that the maximum supported value will be 60s. If we need more range we could take that approach further and define 4 sub-ranges, which would give us a gradual falloff in precision and support for values up to just over 30 minutes. That would also cover a broader range of use-cases that might include values over 60s as we expand the things we monitor.
Here's some quick and dirty python to pack and unpack a float representing time in ms according to this scheme:

    def pack_uint16(ms):
        if ms < 0:
            return None
        if ms < 1000:
            return int(ms * 10)
        if ms < 30000:
            return int(ms) + 9000
        if ms < 120000:
            return int(ms / 10) + 36000
        if ms <= 1873500:
            return int(ms / 100) + 46800
        return None

    def unpack_uint16(packed):
        if packed < 10000:
            return packed / 10.0
        if packed < 39000:
            return packed - 9000
        if packed < 48000:
            return (packed - 36000) * 10
        return (packed - 46800) * 100

As far as ping loss, yes 0-100 is probably fine from a technical perspective. I'm not sure if users expect to see a decimal place or 2 there when you have a scenario that doesn't come out to a round percentage, like 3 of 7 pings failed.
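To make the precision falloff concrete, here is a quick round-trip of one value per sub-range through the pack/unpack sketch above (the step sizes follow from the scheme itself, not from any agreed requirement):

```python
# Round-trip one value per sub-range through pack_uint16/unpack_uint16 above.
for ms in (12.34, 2500.0, 45000.0, 600000.0):
    packed = pack_uint16(ms)
    print(ms, "->", packed, "->", unpack_uint16(packed))
# 12.34    -> 123   -> 12.3     (0.1 ms steps below 1 s)
# 2500.0   -> 11500 -> 2500     (1 ms steps up to 30 s)
# 45000.0  -> 40500 -> 45000    (10 ms steps up to 2 min)
# 600000.0 -> 52800 -> 600000   (100 ms steps up to ~31 min)
```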
I think it is a very bad idea to store different dataTypes inside MetricTank. It will significantly increase the complexity of the code base, and i don't think the gain is worth it. I would only consider allowing int64 in addition to float64. InfluxDB had an issue about dropping support for int64s and after long discussion decided to keep them. Prometheus on the other hand only supports float64.

The need for int64s arises when the values exceed 2^53, as you lose precision when storing as float64. This typically only happens when you are using a 64bit counter, such as on a network switch that does a lot of traffic. At full line rate, a byte counter on a 10GB switchPort could reach 2^53 bytes (9 petabytes) in ~90 days. Though if the counter is incrementing quickly enough to fill 64 bits in a year or less, the precision loss would be much less than 1%.

Because we use deltaEncoding of the data, and all of the bool and small integer values like statusCode, error_state, packetLoss, etc... change infrequently, they already achieve excellent compression. I am very doubtful that the effort required to support the range of datatypes would be worth the gain received.
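A quick way to see the 2^53 cutoff mentioned above (plain Python, nothing MetricTank-specific):

```python
# Above 2**53 a float64 can no longer distinguish adjacent integers, which is
# the precision-loss concern for large 64-bit counters.
big = 2 ** 53
print(float(big) == float(big + 1))  # True: the two counter values collide
print(float(big - 1) == float(big))  # False: below 2**53 integers stay exact
```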
This is just not possible. The ping check is a check that reports an up/down state. It can only do that based on the percentage of ping failures, so each check run needs to send many pings. I also fundamentally disagree with the premise that sending 1 ping every 2 seconds gives better insight into network issues than sending 5 pings once every 10 seconds. If you send 1 ping every 2 seconds and it is lost, you only know that the link is experiencing some packet loss. If the packet is not lost, you only know that there is less than 100% packet loss. However, if you send 5 pings at once and 1 out of the five is lost, you know the link is experiencing 20% packet loss.
Very interesting comment. (I think) I misread @Dieterbe's comment to imply we were using uint16 to store these values today. Based on what @woodsaj just said, it does sound like there is very little reason to change the internal storage format (and go through the kind of acrobatics I described above). That said, I think the storage format is a separate discussion from the metrics2.0 concept of tagging series with their type to support intelligence in dealing with the stored data, which is definitely valuable. Having a unified internal storage format would also sidestep the issues ES has seen with the 2.0 upgrade in getting more strict about the data type for index fields.
100% agree that metrics2.0 is the way to go. And i think this whole issue needs to be moved to https://github.com/raintank/raintank-metric

There are two separate decisions to be made: our transport protocol (metrics2.0 format) and our storage protocol (which currently uses float64 for everything, then delta encodes 10 or 30 minutes of values before storing in Cassandra).

For the transport protocol, i feel that we need to support JSON for ease of use, and additionally support msgPack or protoBuf for performance. Our own collectors/agents would use msgPack/protoBuf, while tools written by the community would likely prefer JSON for simplicity. For the transport protocol, we are not going to see any noticeable performance gain from supporting many different dataTypes, and will likely see reduced performance, as we would need to treat all values as just []byte until we can read the dataType from the metricDefinition and then correctly decode the data. For protoBuf this means that we lose the benefit of some of the encoding techniques used for various datatypes.
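For illustration only, a metrics2.0-flavoured transport message might carry the unit and value type explicitly alongside the reading. The field names below are hypothetical, not the actual raintank-metric schema:

```python
import json

# Hypothetical metrics2.0-style message: unit and value type travel with the
# metric definition, so the store can decide how to encode the series.
metric = {
    "name": "litmus.ping.latency",
    "unit": "ms",
    "mtype": "gauge",
    "value_type": "float64",
    "interval": 10,
    "time": 1450000000,
    "value": 42.0,
}
payload = json.dumps(metric)  # msgPack/protoBuf would carry the same fields
```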
this is why we have alerting rules on top of the datastream to make informed decisions about whether a ping check is in a problematic state or not. The alerting rules are where we have access to all data (including from other probes), and are the best place to provide a reliable indicator to the user.

so we got derailed a bit into talking about specific optimisations for datatypes, but keep in mind the original reason i opened this ticket: we should think about, and document, what acceptable data precisions are for our data. even if we stick to just float64 for everything, it makes a lot of sense to avoid decimals (see dgryski/go-tsz#9). so this brings up questions such as: do we want timings at ms-level precision or 0.1 ms? that may force a change in data, and hence a breaking change in graphs, unless we add awareness and conversion to the store.
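One lightweight way to "document acceptable data precisions" would be a simple table kept next to the schema. The entries below are only examples pulled from the metrics mentioned in this thread, not agreed values:

```python
# Hypothetical precision/type table for the litmus metrics discussed above;
# the concrete choices are exactly what this issue is meant to decide.
METRIC_PRECISION = {
    "ping.latency":      {"type": "float64", "unit": "ms",    "precision": "0.1 ms"},
    "ping.loss":         {"type": "int",     "unit": "%",     "precision": "1 %"},
    "http.statusCode":   {"type": "int",     "unit": "code",  "precision": "exact"},
    "check.error_state": {"type": "bool",    "unit": "state", "precision": "exact"},
}
```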
we should choose the best-fitting types for every metric and make them explicit.
some are best served as a float64, others as an int32 or whatever, some are merely a bool.
benefits: