Add possibility to convert String to Option[Int], Option[Long], Option[Boolean], etc #431

Krever · 2018-02-07T19:04:21Z

I thought it may be worth considering to add several methods that verify if the given string is a valid int/long/etc. On multiple occasions, I wrote something like this: Try(s.toInt).isSuccess which is more than inconvenient and suboptimal.

I don't know if its really worth to add it
I can add docs and tests once general idea is accepted
I didn't know if this should go to collections or collections-contrib

In general, I will be happy to apply any changes needed, just let me know.

sjrd · 2018-02-07T22:52:16Z

TBH I don't think this deserves a place in the standard library. There are lots of things that "would be nice" to have, but we should place the bar much higher than "nice to have", or we're going to end up again with zillions of stuff nobody uses like scala.text.

Krever · 2018-02-08T07:05:49Z

On the one hand, I agree, its always nice to have the minimal amount of code, but on the other hand...
I can see 3 main pain points of adding something to std lib:

maintenance cost - someone has to do bugfixes
complexity cost - "product is perfect if there is nothing more to remove"
evolution slowdown - std lib release cycle is not so fast as ecosystem libraries

With this PR there is no maintenance cost because the amount of code is extremely small and I cannot imagine how it could change. Even if the parsing logic changes these helpers will stay the same. The same applies to evolution slowdown - this code will not require any evolution.

Regarding complexity cost, I think its compensated by the frequency of usage. These SO questions: link1, link2, link3, link4 have over 70 thousands views together. And that's only from people who didn't know the answer by heart - I googled it once but used dozens of times after that.

So I have to argue that the chance of these being classified as "stuff nobody uses" is rather small.

NthPortal · 2018-02-12T12:25:00Z

Given the fact that the implementation of these is to actually convert them, and see if it worked, it's not clear to me what the major use case is. Presumably you actually want the value out of the string in most cases, in which case using these methods just ends up converting the string twice.

Krever · 2018-02-13T13:39:47Z

Its hard for me to come with particular use case as its been a while since I needed it, but something like this seems probable:

str match {
  case a if a.isInt => 
  case a if a.isLong => 
  case a if a.isDouble => 
  case a if a.isBoolean => 
  case a =>
}

And even if you parse it n times readability increases.

hrhino · 2018-02-13T16:15:11Z

You can do something like:

object IntishString {
  def unapply(str: String): Option[Int] = Try(str.toInt).toOption
}
/* ... */
str match {
  case IntishString(intval) => /* ... */
  /* ... */ 
}

for that use case, which I think is somewhat more idiomatic and avoids double-parsing. I'd agree with Sébastien that these are a bit niche for the stdlib; having .toInt return Option[Int] would be a different (and more justifiable IMHO) change, but I recall there being a discussion some time back with good arguments against that, as well.

Ichoran · 2018-02-13T17:57:14Z

You really don't want to use try/catch to detect numbers in strings. It's incredibly slow (like 500x slower than inspecting the characters).

NthPortal · 2018-02-13T20:00:45Z

having .toInt return Option[Int] would be a different (and more justifiable IMHO) change, but I recall there being a discussion some time back with good arguments against that, as well.

I would love to see that discussion, if anyone knows where it is. I also think being able to get Options would be very helpful.

Ichoran · 2018-02-13T20:13:15Z

I don't remember where the discussion was, but having toInt return Option would break all code everywhere that uses toInt, and, unless an extra unsafeToInt method was created, impose an unavoidable major performance decrease.

Adding methods that return an option, like toIntOption, are much more feasible.

NthPortal · 2018-02-13T20:50:21Z

Oh, I absolutely understand why changing its behaviour would break everything. Having other methods (or using an implicit to say whether it returns an Option or not?) was what I was thinking of.

julienrf · 2018-02-14T08:56:09Z

I support adding toIntOption and others!

Krever · 2018-02-14T09:37:09Z

I see we've got discussion here:)

Reg, custom extractors, sure user can do that as well as many other things, its mostly matter of convenience.

I think modifying the behaviour of toInt is not an option, not only because it breaks everything but also it will increase learning curve for people coming from exception-based languages...

I was considering toIntOption at some point but abandoned the idea for some reason. Anyway, it looks like a really good alternative to isInt, a little more boilerplate (toIntOption.isDefined) but much more general. Moreover, there are precedences for similar implementations like headOption so it will be quite uniform.

If there will be no cons of toIntOption I can implement that.

NthPortal · 2018-02-14T20:35:30Z

The concern is, as @Ichoran mentioned, that using Try to convert toInt to an Option is very expensive for failure cases.

Krever · 2018-02-14T20:42:58Z

I was thinking about less trivial implementation. Try based was ok for this PR but for something as generic as Option I can try to make it more performant.

Ichoran · 2018-02-14T20:46:57Z

Check out mdedetrich/scalajson#37 and related code.

SethTisue · 2018-02-17T02:41:26Z

this is scala/bug#16 (yup, a double-digit bug number, the only one still open)

NthPortal · 2018-02-17T03:47:33Z

Semantically, Try(_).toOption is simple, straightforward, and accomplishes the goal well, and can also be easily wrapped up in a simple method like opt, as mentioned in scala/bug#16. And that actually works fine for cases where it's used infrequently. However, if you're parsing thousands of JSON requests per second, and a lot of them have invalid numbers, the exception construction will become a major cost.

scala/bug#16 doesn't really address the cost of exceptions, and I think it is legitimate to want an implementation which returns Options with reasonable cost.

There was a thread on contributors which brought up the problem of exceptions for toInt (as well as some other performance costs); I don't recall it reaching any sort of conclusion though.

martijnhoekstra · 2018-02-17T09:40:35Z

IMO a method checking whether a string is an int/byte/whatever is a useless feature.

If you want to be sure it's an Int, you're doing the same work as parsing, and could just as well return the success in an Option. In addition, the edge cases of numeric characters that are not 0-9 are a bit tricky, and having toInt and isInt disagree is really bad. Risking those getting out of sync for no real gain isn't a good idea.

To only option for a safe and fast isInt with accepts the same strings as toInt that makes sense is IMO to hand-roll a toIntOption and use that instead.

Krever · 2018-02-17T17:41:55Z

Let me know if following is correct:

Currently almost all the conversions are based on Java implementations like java.lang.Float.parseFloat. This leaves us with 3 options:

Try(f).toOption - least efficient because of exception catching and intermediate object creation.
`try{ Some(f) } catch { case e => None } - still not perfect because of exceptions
Reimplement parsing(probably copy from java) so we can parametrize the way in which result is returned

Point 3 made me ask myself a question "how is this handled in scala-js and scala-native" and as far as I can see both of these projects implement java.lang from scratch. In the best case, we can see how it's done but it doesn't help much.

Anyway, I implemented very basic benchmark here. Its probably highly inaccurate, but may give some intuition anyway. Here are the results:

Benchmark                                  Mode  Cnt          Score          Error  Units
OptionBenchmark.checkWithReturn            thrpt  200  287195403.054 ± 12354118.434  ops/s
OptionBenchmark.checkWithTry               thrpt  200  118274801.113 ± 27783534.402  ops/s
OptionBenchmark.checkWithTryCatch          thrpt  200  118771374.313 ± 27883958.369  ops/s

So according to this benchmark reimplementation gives us 2.5x speedup. Should we try to do that?

martijnhoekstra · 2018-02-17T18:40:41Z

It's difficult to benchmark exceptions, because creation and unwinding depends on the stack depth. Benchmarking the "random" case isn't ideal either IMO. Separately benching the success and failure cases is IMO much better.

The reimplementation and benchmarks I did can be found at https://github.com/martijnhoekstra/numparsers/blob/master/src/main/scala/StringConversions.scala and an abbreviated benchmark result overview at https://docs.google.com/spreadsheets/d/1TohVfP_KDkEvVV2fvRMXYvSYTYNkntpkGdP-NSVcikg/edit#gid=0

The short of it is that parsing to Option from reimplementation takes roughly the same amount of time as .toInt and friends. What the benchmarks don't include is the garbage creation overhead.

I still believe these reimplementations show reasonable performance, while handling the edge cases the same as toInt. I haven't taken a close look at the reimplementations scalajs or scalanative provide, so they could be better replacements. I also haven't reimplemented floating point, because I don't think I can get those right.

NthPortal · 2018-02-17T19:16:28Z

For reference, this is a good blog post about the costs of exceptions: https://shipilev.net/blog/2014/exceptional-performance/

NthPortal · 2018-02-17T19:20:07Z

Borrowing scala-native or scala-js's implementations (and possibly tweaking them to return Options) sounds like a decent, low-cost solution.

sjrd · 2018-03-05T14:38:41Z

Borrowing from Scala.js (resp. Scala Native) is not possible, because the implementations of those functions rely on primitive JS (resp. C) functions.

SethTisue · 2018-03-17T14:52:02Z

production system brought to its knees by the expensiveness of Try(x.toInt).toOption: https://medium.com/@muuki88/follow-the-stacktraces-jvm-performance-profiling-3c371d323e5f

SethTisue · 2018-03-17T15:15:03Z

I'd forgotten about this until that blog post jogged my memory, but in 2011 I sped up a major component of a production system 10x simply by replacing exception-catching number-parsing code with hand-rolled code that handled the commonest cases directly. (To my own astonishment at the time at how big the speedup was.) It really is a performance trap.

…into string-ops-conversion-checks

Krever · 2018-03-17T17:51:03Z

So, as far I understand there is a consensus on having toXOption. I have pushed an update to this branch. Trusting @sjrd comment I decided to go the simplest path and "copy" the code from java implementation of those methods. This is almost 1:1 copy as I don't have much time to work on it. We can use it for further discussion on how to proceed with it.

Should these methods be implemented in StringOps or in Int/Byte/etc.
I couldn't find any comprehensive test suite in the current codebase
Does current deduplication implementation have some performance implications? Should we go with full duplication of methods?
I have not copied Float/Double as they seem to be much more complicated. It should be a rather straightforward copy as well but with much more code.
Can we just copy the code from legal PoV?
Can we improve something in java impl?
Should we do some benchmarks to spot any potential performance regression?
I had some import troubles that I didn't have time to investigate.

If someone wants to pick this up and we want to keep the discussion record I can give full access to my fork to anyone. I'm aware this is probably not the highest standard of contribution but probably better than nothing :)

sjrd · 2018-03-17T17:54:29Z

No, you cannot legally copy the implem from Java. The Java implem is GPL and Scala's codebase is BSD.

I said there was nothing to be copied in Scala.js' or Scala Native's implem. I most definitely did not say you should copy it from Java's implementation instead.

SethTisue · 2018-03-17T17:55:23Z

(@sjrd is replying to a comment I posted and then immediately deleted when I realized I had just repeated Krever's "Can we just copy the code from legal PoV" question)

Krever · 2018-03-17T19:44:31Z

@sjrd I never claimed you said that :) I just assumed it is the easiest way to go. I'm not a lawyer but even now this code is not literally copied. It was translated from java to scala, I introduced parameterized result handling and other changes so its far from a direct copy. Also if it helps I can admit under oath that I manually retyped every character and only took inspiration from java impl. Also, I don't think its in the state in which it could be considered ready for merge, so I think we can postpone this aspect of discussion. I assume we would like to make it more "scalaish".

sjrd · 2018-03-17T19:51:18Z

Unfortunately, that's not how the GPL works. Even "taking inspiration" and "retyping every character" is considered a derivative work, and must therefore be released under the GPL as well.

In general, the only safe assumption is not to even look at any GPL code when contributing to a BSD- or MIT-licensed codebase, such as Scala. Any reimplementation must be completely clean-room.

Ichoran · 2018-03-17T19:51:28Z

@Krever - That isn't good enough either. You have to do a clean reimplementation. Typing the code over again, in a different language, with a few changes to the API is practically the definition of a "derivative work", and the GPL covers all derivative works (that is practically the point of the GPL).

You can find BSD-or-compatible-licensed implementations of integer parsing in ScalaJSON https://github.com/mdedetrich/scalajson, and in Jsonal in my "KSE" library, among other places.

Krever · 2018-03-17T19:57:43Z

My mistake, I did some research and this, in fact, seems to be derivative work. I still think there is no way to prove it once the refactoring is completed, but probably the way in which it was created still make it derivative work. Maybe I will find some time to rewrite this or use BSD-compatible implementation, but cant say for sure.

Still, if anyone has any comments regarding how this should be implemented its probably a good moment for it.

Krever · 2018-03-17T20:13:51Z

I removed the copyrighted code to take away the risk of being inspired by it. Signatures and implementation of derived parsers are my work completely.

Ichoran · 2018-03-17T20:36:47Z

@Krever - Three other comments. (1) Unless you have benchmarked your code to be faster than the built-in exception-throwing flavor, you should not change from the original versions of toInt etc.; (2) You are going to pay a significant runtime penalty for the generic => T forms. For code that can be incredibly performance-critical like this, code duplication is fine if it speeds things up. (3) Parsing of floating-point numbers cannot fail so long as the number has the right form. So you can just check that it's well-formed and defer to the exception-throwing version if it is.

SethTisue · 2018-03-17T20:45:35Z

@Krever I bet you didn't expect this might turn out to be so complicated :-)

NthPortal · 2018-03-18T06:13:11Z

I've looked at the implementations in scala-native, and the implementations of Long.parseLong, Integer.parseInt, Short.parseShort and Byte.parseByte seem pretty portable (though those for Double.parseDouble and Float.parseFloat do not).

martijnhoekstra · 2018-03-19T23:57:32Z

Note that those implementations currently suffer from a bug in Character.digit, but that wouldn't bother the JVM implementation that has the correct behavior

…

On Sun, Mar 18, 2018, 07:13 Nth ***@***.***> wrote: I've looked at the implementations in scala-native, and the implementations of Long.parseLong <https://github.com/scala-native/scala-native/blob/v0.3.6/javalib/src/main/scala/java/lang/Long.scala#L303-L348>, Integer.parseInt <https://github.com/scala-native/scala-native/blob/v0.3.6/javalib/src/main/scala/java/lang/Integer.scala#L289-L336>, Short.parseShort <https://github.com/scala-native/scala-native/blob/v0.3.6/javalib/src/main/scala/java/lang/Short.scala#L211-L217> and Byte.parseByte <https://github.com/scala-native/scala-native/blob/v0.3.6/javalib/src/main/scala/java/lang/Byte.scala#L211-L217> seem pretty portable (though those for Double.parseDouble and Float.parseFloat do not). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#431 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA13EWid8D2c6vzGyxWE9o2a6sDNQsE3ks5tffr7gaJpZM4R9Oig> .

SethTisue · 2018-04-17T18:06:50Z

I've changed the issue title to reflect the move away from returning Boolean as originally suggested.

martijnhoekstra · 2018-04-19T08:40:24Z

scala/scala#6538 follows up on this

SethTisue · 2018-04-19T08:42:24Z

@Krever thank you very much for getting the ball rolling and pushing it as far as you did 👏

Krever · 2018-04-19T08:50:28Z

I wish I had more time for this, but I'm extremely happy that someone picked this up. Thanks @martijnhoekstra !

@SethTisue I think we can close it now.

Add possibility to check if string is a valid int/long/boolean/etc

89d4502

SethTisue mentioned this pull request Feb 17, 2018

String.toInt, .toDouble should allow a default parameter scala/bug#16

Closed

szeiger added the not scala/scala Not relevant for moving to scala/scala label Mar 12, 2018

Krever added 2 commits March 17, 2018 18:44

string toXOption

f210079

Merge branch 'master' of https://github.com/scala/collection-strawman …

cc4915b

…into string-ops-conversion-checks

SethTisue changed the title ~~Add possibility to check if string is a valid int/long/boolean/etc~~ Add possibility to convert String to Option[Int], Option[Long], Option[Boolean], etc Apr 17, 2018

martijnhoekstra mentioned this pull request Apr 19, 2018

Add Option-returning methods for parsing literals scala/scala#6538

Merged

5 tasks

SethTisue closed this Apr 19, 2018

Add possibility to convert String to Option[Int], Option[Long], Option[Boolean], etc #431

Add possibility to convert String to Option[Int], Option[Long], Option[Boolean], etc #431

Conversation

Krever commented Feb 7, 2018

sjrd commented Feb 7, 2018

Krever commented Feb 8, 2018 • edited Loading

NthPortal commented Feb 12, 2018

Krever commented Feb 13, 2018

hrhino commented Feb 13, 2018

Ichoran commented Feb 13, 2018

NthPortal commented Feb 13, 2018

Ichoran commented Feb 13, 2018

NthPortal commented Feb 13, 2018

julienrf commented Feb 14, 2018

Krever commented Feb 14, 2018

NthPortal commented Feb 14, 2018

Krever commented Feb 14, 2018

Ichoran commented Feb 14, 2018

SethTisue commented Feb 17, 2018 • edited Loading

NthPortal commented Feb 17, 2018 • edited Loading

martijnhoekstra commented Feb 17, 2018

Krever commented Feb 17, 2018

martijnhoekstra commented Feb 17, 2018

NthPortal commented Feb 17, 2018

NthPortal commented Feb 17, 2018

sjrd commented Mar 5, 2018

SethTisue commented Mar 17, 2018

SethTisue commented Mar 17, 2018

Krever commented Mar 17, 2018 • edited Loading

sjrd commented Mar 17, 2018 • edited Loading

SethTisue commented Mar 17, 2018 • edited Loading

Krever commented Mar 17, 2018 • edited Loading

sjrd commented Mar 17, 2018

Ichoran commented Mar 17, 2018

Krever commented Mar 17, 2018

Krever commented Mar 17, 2018 • edited Loading

Ichoran commented Mar 17, 2018

SethTisue commented Mar 17, 2018

NthPortal commented Mar 18, 2018

martijnhoekstra commented Mar 19, 2018 via email

SethTisue commented Apr 17, 2018

martijnhoekstra commented Apr 19, 2018

SethTisue commented Apr 19, 2018

Krever commented Apr 19, 2018 • edited Loading

Krever commented Feb 8, 2018 •

edited

Loading

SethTisue commented Feb 17, 2018 •

edited

Loading

NthPortal commented Feb 17, 2018 •

edited

Loading

Krever commented Mar 17, 2018 •

edited

Loading

sjrd commented Mar 17, 2018 •

edited

Loading

SethTisue commented Mar 17, 2018 •

edited

Loading

Krever commented Mar 17, 2018 •

edited

Loading

Krever commented Mar 17, 2018 •

edited

Loading

Krever commented Apr 19, 2018 •

edited

Loading