Skip to content
This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Add possibility to convert String to Option[Int], Option[Long], Option[Boolean], etc #431

Closed
wants to merge 3 commits into from

Conversation

Krever
Copy link

@Krever Krever commented Feb 7, 2018

I thought it may be worth considering to add several methods that verify if the given string is a valid int/long/etc. On multiple occasions, I wrote something like this: Try(s.toInt).isSuccess which is more than inconvenient and suboptimal.

  • I don't know if its really worth to add it
  • I can add docs and tests once general idea is accepted
  • I didn't know if this should go to collections or collections-contrib

In general, I will be happy to apply any changes needed, just let me know.

@sjrd
Copy link
Member

sjrd commented Feb 7, 2018

TBH I don't think this deserves a place in the standard library. There are lots of things that "would be nice" to have, but we should place the bar much higher than "nice to have", or we're going to end up again with zillions of stuff nobody uses like scala.text.

@Krever
Copy link
Author

Krever commented Feb 8, 2018

On the one hand, I agree, its always nice to have the minimal amount of code, but on the other hand...
I can see 3 main pain points of adding something to std lib:

  • maintenance cost - someone has to do bugfixes
  • complexity cost - "product is perfect if there is nothing more to remove"
  • evolution slowdown - std lib release cycle is not so fast as ecosystem libraries

With this PR there is no maintenance cost because the amount of code is extremely small and I cannot imagine how it could change. Even if the parsing logic changes these helpers will stay the same. The same applies to evolution slowdown - this code will not require any evolution.

Regarding complexity cost, I think its compensated by the frequency of usage. These SO questions: link1, link2, link3, link4 have over 70 thousands views together. And that's only from people who didn't know the answer by heart - I googled it once but used dozens of times after that.

So I have to argue that the chance of these being classified as "stuff nobody uses" is rather small.

@NthPortal
Copy link
Contributor

Given the fact that the implementation of these is to actually convert them, and see if it worked, it's not clear to me what the major use case is. Presumably you actually want the value out of the string in most cases, in which case using these methods just ends up converting the string twice.

@Krever
Copy link
Author

Krever commented Feb 13, 2018

Its hard for me to come with particular use case as its been a while since I needed it, but something like this seems probable:

str match {
  case a if a.isInt => 
  case a if a.isLong => 
  case a if a.isDouble => 
  case a if a.isBoolean => 
  case a =>
}

And even if you parse it n times readability increases.

@hrhino
Copy link
Contributor

hrhino commented Feb 13, 2018

You can do something like:

object IntishString {
  def unapply(str: String): Option[Int] = Try(str.toInt).toOption
}
/* ... */
str match {
  case IntishString(intval) => /* ... */
  /* ... */ 
}

for that use case, which I think is somewhat more idiomatic and avoids double-parsing. I'd agree with Sébastien that these are a bit niche for the stdlib; having .toInt return Option[Int] would be a different (and more justifiable IMHO) change, but I recall there being a discussion some time back with good arguments against that, as well.

@Ichoran
Copy link
Contributor

Ichoran commented Feb 13, 2018

You really don't want to use try/catch to detect numbers in strings. It's incredibly slow (like 500x slower than inspecting the characters).

@NthPortal
Copy link
Contributor

having .toInt return Option[Int] would be a different (and more justifiable IMHO) change, but I recall there being a discussion some time back with good arguments against that, as well.

I would love to see that discussion, if anyone knows where it is. I also think being able to get Options would be very helpful.

@Ichoran
Copy link
Contributor

Ichoran commented Feb 13, 2018

I don't remember where the discussion was, but having toInt return Option would break all code everywhere that uses toInt, and, unless an extra unsafeToInt method was created, impose an unavoidable major performance decrease.

Adding methods that return an option, like toIntOption, are much more feasible.

@NthPortal
Copy link
Contributor

Oh, I absolutely understand why changing its behaviour would break everything. Having other methods (or using an implicit to say whether it returns an Option or not?) was what I was thinking of.

@julienrf
Copy link
Contributor

I support adding toIntOption and others!

@Krever
Copy link
Author

Krever commented Feb 14, 2018

I see we've got discussion here:)

Reg, custom extractors, sure user can do that as well as many other things, its mostly matter of convenience.

I think modifying the behaviour of toInt is not an option, not only because it breaks everything but also it will increase learning curve for people coming from exception-based languages...

I was considering toIntOption at some point but abandoned the idea for some reason. Anyway, it looks like a really good alternative to isInt, a little more boilerplate (toIntOption.isDefined) but much more general. Moreover, there are precedences for similar implementations like headOption so it will be quite uniform.

If there will be no cons of toIntOption I can implement that.

@NthPortal
Copy link
Contributor

The concern is, as @Ichoran mentioned, that using Try to convert toInt to an Option is very expensive for failure cases.

@Krever
Copy link
Author

Krever commented Feb 14, 2018

I was thinking about less trivial implementation. Try based was ok for this PR but for something as generic as Option I can try to make it more performant.

@Ichoran
Copy link
Contributor

Ichoran commented Feb 14, 2018

Check out mdedetrich/scalajson#37 and related code.

@SethTisue
Copy link
Member

SethTisue commented Feb 17, 2018

this is scala/bug#16 (yup, a double-digit bug number, the only one still open)

@NthPortal
Copy link
Contributor

NthPortal commented Feb 17, 2018

Semantically, Try(_).toOption is simple, straightforward, and accomplishes the goal well, and can also be easily wrapped up in a simple method like opt, as mentioned in scala/bug#16. And that actually works fine for cases where it's used infrequently. However, if you're parsing thousands of JSON requests per second, and a lot of them have invalid numbers, the exception construction will become a major cost.

scala/bug#16 doesn't really address the cost of exceptions, and I think it is legitimate to want an implementation which returns Options with reasonable cost.

There was a thread on contributors which brought up the problem of exceptions for toInt (as well as some other performance costs); I don't recall it reaching any sort of conclusion though.

@martijnhoekstra
Copy link

IMO a method checking whether a string is an int/byte/whatever is a useless feature.

If you want to be sure it's an Int, you're doing the same work as parsing, and could just as well return the success in an Option. In addition, the edge cases of numeric characters that are not 0-9 are a bit tricky, and having toInt and isInt disagree is really bad. Risking those getting out of sync for no real gain isn't a good idea.

To only option for a safe and fast isInt with accepts the same strings as toInt that makes sense is IMO to hand-roll a toIntOption and use that instead.

@Krever
Copy link
Author

Krever commented Feb 17, 2018

Let me know if following is correct:

Currently almost all the conversions are based on Java implementations like java.lang.Float.parseFloat. This leaves us with 3 options:

  1. Try(f).toOption - least efficient because of exception catching and intermediate object creation.
  2. `try{ Some(f) } catch { case e => None } - still not perfect because of exceptions
  3. Reimplement parsing(probably copy from java) so we can parametrize the way in which result is returned

Point 3 made me ask myself a question "how is this handled in scala-js and scala-native" and as far as I can see both of these projects implement java.lang from scratch. In the best case, we can see how it's done but it doesn't help much.

Anyway, I implemented very basic benchmark here. Its probably highly inaccurate, but may give some intuition anyway. Here are the results:

Benchmark                                  Mode  Cnt          Score          Error  Units
OptionBenchmark.checkWithReturn            thrpt  200  287195403.054 ± 12354118.434  ops/s
OptionBenchmark.checkWithTry               thrpt  200  118274801.113 ± 27783534.402  ops/s
OptionBenchmark.checkWithTryCatch          thrpt  200  118771374.313 ± 27883958.369  ops/s

So according to this benchmark reimplementation gives us 2.5x speedup. Should we try to do that?

@martijnhoekstra
Copy link

It's difficult to benchmark exceptions, because creation and unwinding depends on the stack depth. Benchmarking the "random" case isn't ideal either IMO. Separately benching the success and failure cases is IMO much better.

The reimplementation and benchmarks I did can be found at https://github.com/martijnhoekstra/numparsers/blob/master/src/main/scala/StringConversions.scala and an abbreviated benchmark result overview at https://docs.google.com/spreadsheets/d/1TohVfP_KDkEvVV2fvRMXYvSYTYNkntpkGdP-NSVcikg/edit#gid=0

The short of it is that parsing to Option from reimplementation takes roughly the same amount of time as .toInt and friends. What the benchmarks don't include is the garbage creation overhead.

I still believe these reimplementations show reasonable performance, while handling the edge cases the same as toInt. I haven't taken a close look at the reimplementations scalajs or scalanative provide, so they could be better replacements. I also haven't reimplemented floating point, because I don't think I can get those right.

@NthPortal
Copy link
Contributor

For reference, this is a good blog post about the costs of exceptions: https://shipilev.net/blog/2014/exceptional-performance/

@NthPortal
Copy link
Contributor

Borrowing scala-native or scala-js's implementations (and possibly tweaking them to return Options) sounds like a decent, low-cost solution.

@sjrd
Copy link
Member

sjrd commented Mar 5, 2018

Borrowing from Scala.js (resp. Scala Native) is not possible, because the implementations of those functions rely on primitive JS (resp. C) functions.

@szeiger szeiger added the not scala/scala Not relevant for moving to scala/scala label Mar 12, 2018
@SethTisue
Copy link
Member

production system brought to its knees by the expensiveness of Try(x.toInt).toOption: https://medium.com/@muuki88/follow-the-stacktraces-jvm-performance-profiling-3c371d323e5f

@SethTisue
Copy link
Member

I'd forgotten about this until that blog post jogged my memory, but in 2011 I sped up a major component of a production system 10x simply by replacing exception-catching number-parsing code with hand-rolled code that handled the commonest cases directly. (To my own astonishment at the time at how big the speedup was.) It really is a performance trap.

@Krever
Copy link
Author

Krever commented Mar 17, 2018

So, as far I understand there is a consensus on having toXOption. I have pushed an update to this branch. Trusting @sjrd comment I decided to go the simplest path and "copy" the code from java implementation of those methods. This is almost 1:1 copy as I don't have much time to work on it. We can use it for further discussion on how to proceed with it.

  • Should these methods be implemented in StringOps or in Int/Byte/etc.
  • I couldn't find any comprehensive test suite in the current codebase
  • Does current deduplication implementation have some performance implications? Should we go with full duplication of methods?
  • I have not copied Float/Double as they seem to be much more complicated. It should be a rather straightforward copy as well but with much more code.
  • Can we just copy the code from legal PoV?
  • Can we improve something in java impl?
  • Should we do some benchmarks to spot any potential performance regression?
  • I had some import troubles that I didn't have time to investigate.

If someone wants to pick this up and we want to keep the discussion record I can give full access to my fork to anyone. I'm aware this is probably not the highest standard of contribution but probably better than nothing :)

@sjrd
Copy link
Member

sjrd commented Mar 17, 2018

No, you cannot legally copy the implem from Java. The Java implem is GPL and Scala's codebase is BSD.

I said there was nothing to be copied in Scala.js' or Scala Native's implem. I most definitely did not say you should copy it from Java's implementation instead.

@SethTisue
Copy link
Member

SethTisue commented Mar 17, 2018

(@sjrd is replying to a comment I posted and then immediately deleted when I realized I had just repeated Krever's "Can we just copy the code from legal PoV" question)

@Krever
Copy link
Author

Krever commented Mar 17, 2018

@sjrd I never claimed you said that :) I just assumed it is the easiest way to go. I'm not a lawyer but even now this code is not literally copied. It was translated from java to scala, I introduced parameterized result handling and other changes so its far from a direct copy. Also if it helps I can admit under oath that I manually retyped every character and only took inspiration from java impl. Also, I don't think its in the state in which it could be considered ready for merge, so I think we can postpone this aspect of discussion. I assume we would like to make it more "scalaish".

@sjrd
Copy link
Member

sjrd commented Mar 17, 2018

Unfortunately, that's not how the GPL works. Even "taking inspiration" and "retyping every character" is considered a derivative work, and must therefore be released under the GPL as well.

In general, the only safe assumption is not to even look at any GPL code when contributing to a BSD- or MIT-licensed codebase, such as Scala. Any reimplementation must be completely clean-room.

@Ichoran
Copy link
Contributor

Ichoran commented Mar 17, 2018

@Krever - That isn't good enough either. You have to do a clean reimplementation. Typing the code over again, in a different language, with a few changes to the API is practically the definition of a "derivative work", and the GPL covers all derivative works (that is practically the point of the GPL).

You can find BSD-or-compatible-licensed implementations of integer parsing in ScalaJSON https://github.com/mdedetrich/scalajson, and in Jsonal in my "KSE" library, among other places.

@Krever
Copy link
Author

Krever commented Mar 17, 2018

My mistake, I did some research and this, in fact, seems to be derivative work. I still think there is no way to prove it once the refactoring is completed, but probably the way in which it was created still make it derivative work. Maybe I will find some time to rewrite this or use BSD-compatible implementation, but cant say for sure.

Still, if anyone has any comments regarding how this should be implemented its probably a good moment for it.

@Krever
Copy link
Author

Krever commented Mar 17, 2018

I removed the copyrighted code to take away the risk of being inspired by it. Signatures and implementation of derived parsers are my work completely.

@Ichoran
Copy link
Contributor

Ichoran commented Mar 17, 2018

@Krever - Three other comments. (1) Unless you have benchmarked your code to be faster than the built-in exception-throwing flavor, you should not change from the original versions of toInt etc.; (2) You are going to pay a significant runtime penalty for the generic => T forms. For code that can be incredibly performance-critical like this, code duplication is fine if it speeds things up. (3) Parsing of floating-point numbers cannot fail so long as the number has the right form. So you can just check that it's well-formed and defer to the exception-throwing version if it is.

@SethTisue
Copy link
Member

@Krever I bet you didn't expect this might turn out to be so complicated :-)

@NthPortal
Copy link
Contributor

I've looked at the implementations in scala-native, and the implementations of Long.parseLong, Integer.parseInt, Short.parseShort and Byte.parseByte seem pretty portable (though those for Double.parseDouble and Float.parseFloat do not).

@martijnhoekstra
Copy link

martijnhoekstra commented Mar 19, 2018 via email

@SethTisue SethTisue changed the title Add possibility to check if string is a valid int/long/boolean/etc Add possibility to convert String to Option[Int], Option[Long], Option[Boolean], etc Apr 17, 2018
@SethTisue
Copy link
Member

I've changed the issue title to reflect the move away from returning Boolean as originally suggested.

@martijnhoekstra
Copy link

scala/scala#6538 follows up on this

@SethTisue
Copy link
Member

@Krever thank you very much for getting the ball rolling and pushing it as far as you did 👏

@Krever
Copy link
Author

Krever commented Apr 19, 2018

I wish I had more time for this, but I'm extremely happy that someone picked this up. Thanks @martijnhoekstra !

@SethTisue I think we can close it now.

@SethTisue SethTisue closed this Apr 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
not scala/scala Not relevant for moving to scala/scala
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants