split() with "branch reset (?|...)" works like "not capturing (?:...)" #22912

Open
ranvis opened this issue Jan 14, 2025 · 5 comments

@ranvis

ranvis commented Jan 14, 2025

Description
The regex "branch reset" pattern (?|...) behaves like the non-capturing group (?:...) when it is used in split(): each branch's groups are returned as separate fields instead of sharing the same group numbers:

use v5.41;

sub dd { say "(" . join(", ", map {defined ? qq{"$_"} : "undef"} @_) . ")" }

my $input = "aa=AA,123 bb=123;dd";
my $pattern = qr/\b(\w+)=(?|(\w+),(\d+)|(\d+);(\w+))\b/;
dd(@{^CAPTURE}) while ($input =~ /$pattern/g);
dd(split($pattern, $input));

Output:

("aa", "AA", "123")
("bb", "123", "dd")
("", "aa", "AA", "123", undef, undef, " ", "bb", undef, undef, "123", "dd")
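For comparison, here is a minimal sketch of the same split with the branch reset swapped for a plain non-capturing group, so the two branches get separately numbered groups (1 to 5 overall). The helper name fmt and the variable names are illustrative only, not part of the report. The list it prints is identical to the third line of output above, which is what the title of this issue refers to.

```perl
use strict;
use warnings;

# Illustrative helper (same idea as dd() in the report): quote defined
# values and spell out undef.
sub fmt { "(" . join(", ", map { defined($_) ? qq{"$_"} : "undef" } @_) . ")" }

my $input = "aa=AA,123 bb=123;dd";

# Non-capturing alternation: the groups are numbered 1..5 across both
# branches, so every separator yields five capture fields.
my $non_capturing = qr/\b(\w+)=(?:(\w+),(\d+)|(\d+);(\w+))\b/;

my @fields = split($non_capturing, $input);
my $line   = fmt(@fields);
print "$line\n";
# ("", "aa", "AA", "123", undef, undef, " ", "bb", undef, undef, "123", "dd")
```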

Steps to Reproduce
perl -E 'sub dd { say "(" . join(", ", map {defined ? qq{"$_"} : "undef"} @_) . ")" } dd(split(/\b(\w+)=(?|(\w+),(\d+)|(\d+);(\w+))\b/, "aa=AA,123 bb=123;dd"));'

Expected behavior
Output not containing undef:

("", "aa", "AA", "123", " ", "bb", "123", "dd")
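Until the behaviour is settled, one possible way to get the expected list is to emulate split() with a global match loop and take the captures from @{^CAPTURE}, which the report shows returning three elements per match. This is only a sketch: split_with_captures is a hypothetical helper, it assumes perl 5.26 or later and a pattern that never matches the empty string, and it does not implement LIMIT.

```perl
use strict;
use warnings;
use 5.026;    # @{^CAPTURE} is available from perl 5.26

# Hypothetical helper: split-like behaviour built on a match loop.
# Assumes $pattern never produces a zero-length match.
sub split_with_captures {
    my ($pattern, $input) = @_;
    my @fields;
    my $last = 0;    # offset just past the previous separator
    while ($input =~ /$pattern/g) {
        push @fields, substr($input, $last, $-[0] - $last);    # text before the separator
        push @fields, @{^CAPTURE};                             # captures of this match
        $last = $+[0];
    }
    push @fields, substr($input, $last);                       # text after the last separator
    # Like split() without LIMIT, drop trailing empty fields.
    pop @fields while @fields && !length($fields[-1] // '');
    return @fields;
}

my $pattern = qr/\b(\w+)=(?|(\w+),(\d+)|(\d+);(\w+))\b/;
my @got     = split_with_captures($pattern, "aa=AA,123 bb=123;dd");
my $shown   = join(", ", map { defined($_) ? qq{"$_"} : "undef" } @got);
print "($shown)\n";
# ("", "aa", "AA", "123", " ", "bb", "123", "dd")
```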

Perl configuration

5.41.8 @61978476912ee303cc78e7bf09602a4b38f3d75e
Summary of my perl5 (revision 5 version 41 subversion 8) configuration:
  Snapshot of: 61978476912ee303cc78e7bf09602a4b38f3d75e
  Platform:
    osname=linux
    osvers=5.14.0-503.16.1.el9_5.x86_64
    archname=x86_64-linux
    uname='linux localhost 5.14.0-503.16.1.el9_5.x86_64 #1 smp preempt_dynamic fri dec 13 01:47:05 est 2024 x86_64 x86_64 x86_64 gnulinux '
    config_args='-de -Dprefix=/opt/perlbrew/perls/perl-blead -Dccflags=-fPIC -Dusedevel -Aeval:scriptdir=/opt/perlbrew/perls/perl-blead/bin'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-fPIC -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-O2'
    cppflags='-fPIC -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='11.5.0 20240719 (Red Hat 11.5.0-2)'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /usr/lib64 /usr/local/lib64
    libs=-lpthread -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=/lib/../lib64/libc.so.6
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.34'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'


Characteristics of this binary (from libperl):
  Compile-time options:
    HAS_LONG_DOUBLE
    HAS_STRTOLD
    HAS_TIMES
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_DEVEL
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_ALL
    USE_64_BIT_INT
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
  Built under linux
  Compiled at Jan 14 2025 20:33:25
  %ENV:
    PERLBREW_HOME="/home/test/.perlbrew"
    PERLBREW_MANPATH="/opt/perlbrew/perls/perl-blead/man"
    PERLBREW_PATH="/opt/perlbrew/bin:/opt/perlbrew/perls/perl-blead/bin"
    PERLBREW_PERL="perl-blead"
    PERLBREW_ROOT="/opt/perlbrew"
    PERLBREW_SHELLRC_VERSION="0.98"
    PERLBREW_VERSION="0.98"
  @INC:
    /opt/perlbrew/perls/perl-blead/lib/site_perl/5.41.8/x86_64-linux
    /opt/perlbrew/perls/perl-blead/lib/site_perl/5.41.8
    /opt/perlbrew/perls/perl-blead/lib/5.41.8/x86_64-linux
    /opt/perlbrew/perls/perl-blead/lib/5.41.8
@richardleach
Contributor

Also occurs in 5.36.0, have not had time to look further.

Output from use re 'debug'; on bleadperl attached.
22912_re_debug.txt

@richardleach
Contributor

This might be behaving as documented but perhaps with a couple of behaviours in play at the same time?

From split:

Thus, when assigning to a list, if LIMIT is omitted (or zero), then LIMIT is treated as though it were one larger than the number of variables in the list

and

If the PATTERN contains capturing groups, then for each separator, an additional field is produced for each substring captured by a group (in the order in which the groups are specified, as per backreferences); if any group does not match, then it captures the undef value instead of a substring.

@richardleach
Contributor

Or not, I see now that you wouldn't expect to see trailing undefs in either case.

@richardleach
Contributor

This might be RX_NPARENS behaviour in pp_split. https://github.com/Perl/perl5/blob/blead/pp.c#L6988 or thereabouts.

I don't know if branch reset should be taken into account when counting capture groups for RX_NPARENS (i.e. there's an actual bug) or if the behaviour is as per the documentation (i.e. perhaps the documentation could better describe the behaviour here).
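The mismatch in group counts can be observed from Perl space without reading pp.c. The following is a rough sketch, assuming perl 5.26 or later for @{^CAPTURE}; the variable names are made up for illustration. It compares the number of captures an ordinary match exposes with the number of capture fields split() emits for a single separator. Going by the output in the report, the two numbers would be 3 and 5 on an affected build and 3 and 3 otherwise.

```perl
use strict;
use warnings;
use 5.026;    # @{^CAPTURE} is available from perl 5.26

my $pattern = qr/\b(\w+)=(?|(\w+),(\d+)|(\d+);(\w+))\b/;
my $input   = "aa=AA,123";    # exactly one separator, nothing around it

# Number of capture groups visible after an ordinary match.
$input =~ $pattern or die "pattern did not match";
my $logical = scalar @{^CAPTURE};

# With a negative LIMIT, trailing empty fields are kept, so the list is:
# leading "", the capture fields for the one separator, trailing "".
my @fields  = split($pattern, $input, -1);
my $emitted = @fields - 2;

print "groups seen by the match: $logical\n";
print "fields emitted by split:  $emitted\n";
```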

@jkeenan
Contributor

jkeenan commented Jan 15, 2025

Since I myself have never used @{^CAPTURE} or the "branch reset" pattern, I couldn't say off the top of my head what the correct behavior should be in the example provided by the OP. I therefore began by locating relevant parts of the documentation.

From perldoc perlvar:

    @{^CAPTURE}
            An array which exposes the contents of the capture buffers, if
            any, of the last successful pattern match, not counting patterns
            matched in nested blocks that have been exited already.
...
            This variable was added in 5.25.7
...

From perldoc perlre:

    "(?|*pattern*)"
        This is the "branch reset" pattern, which has the special property
        that the capture groups are numbered from the same starting point in
        each alternation branch. It is available starting from perl 5.10.0.

        Capture groups are numbered from left to right, but inside this
        construct the numbering is restarted for each branch.

        The numbering within each branch will be as normal, and any groups
        following this construct will be numbered as though the construct
        contained only one branch, that being the one with the most capture
        groups in it.

        This construct is useful when you want to capture one of a number of
        alternative matches.

        Consider the following pattern. The numbers underneath show in which
        group the captured content will be stored.

            # before  ---------------branch-reset----------- after
            / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
            # 1            2         2  3        2     3     4
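
To see the numbering from this excerpt in action, here is a small sketch that runs the documented pattern against one string per branch. The three test strings are made up for illustration; only the pattern comes from perlre.

```perl
use strict;
use warnings;

# The pattern from the perlre excerpt, unchanged.
my $re = qr/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x;

my @lines;
for my $string (qw(axyzz apqrz atuvz)) {
    $string =~ $re or next;
    # Groups 2 and 3 are shared by all branches; group 4 follows the construct.
    my @groups = map { defined($_) ? $_ : "undef" } ($1, $2, $3, $4);
    push @lines, "$string: @groups";
}
print "$_\n" for @lines;
# axyzz: a y undef z
# apqrz: a pqr q z
# atuvz: a t v z
```

The first branch has only one group, so group 3 is left undefined there, while the group after the construct is number 4 in every case.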

From perldoc -f split:

            If the PATTERN contains capturing groups, then for each
            separator, an additional field is produced for each substring
            captured by a group (in the order in which the groups are
            specified, as per backreferences); if any group does not match,
            then it captures the "undef" value instead of a substring. Also,
            note that any such additional field is produced whenever there
            is a separator (that is, whenever a split occurs), and such an
            additional field does not count towards the LIMIT. Consider the
            following expressions evaluated in list context (each returned
            list is provided in the associated comment):
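
The excerpt stops just before the expressions it mentions. As far as I recall, the examples in perlfunc are the four below; the fmt helper and the loop around them are my own scaffolding, not part of the documentation.

```perl
use strict;
use warnings;

# Illustrative helper: quote defined values and spell out undef.
sub fmt { "(" . join(", ", map { defined($_) ? qq{"$_"} : "undef" } @_) . ")" }

# Same string and LIMIT each time; only the capturing groups differ.
my @patterns = (qr/-|,/, qr/(-)|,/, qr/-|(,)/, qr/(-)|(,)/);

my @results;
for my $pattern (@patterns) {
    push @results, fmt(split($pattern, "1-10,20", 3));
}
print "$_\n" for @results;
# ("1", "10", "20")
# ("1", "-", "10", undef, "20")
# ("1", undef, "10", ",", "20")
# ("1", "-", undef, "10", undef, ",", "20")
```

Each capturing group adds one field per separator, and a group that did not take part in the match contributes undef. The capture fields do not count towards the LIMIT of 3.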

I tested the OP's program with various perls built by perlbrew and observed a change in behavior between perl-5.36 and perl-5.38. I therefore bisected with this program (on linux unthreaded):

$ /tmp/gh-22912-test.t

use strict;use warnings;
use Test::More tests => 1;

my $input = "aa=AA,123 bb=123;dd";
my $pattern = qr/\b(\w+)=(?|(\w+),(\d+)|(\d+);(\w+))\b/;

my $expected = [ "", "aa", "AA", "123", " ", "bb", "123", "dd" ];
my $got      = [ split($pattern, $input) ];
is_deeply($got, $expected, "GH-22912: no 'undef's in output of 'split'");

perl Porting/bisect.pl \
--start=v5.36.0 \
--end=v5.38.0 \
-- ./perl -Ilib /tmp/gh-22912-test.t

The result pointed to fe5492d (v5.37.7-120-gfe5492d916) as the point where behavior changed:

fe5492d916201ce31a107839a36bcb1435fe7bf0 is the first bad commit
commit fe5492d916201ce31a107839a36bcb1435fe7bf0
Author: Yves Orton <[email protected]>
Date:   Thu Dec 29 12:07:22 2022 +0100
Commit:     Yves Orton <[email protected]>
CommitDate: Thu Jan 12 03:11:51 2023

    regcomp.c etc - rework branch reset so it works properly

@demerphq, can you take a look? Thanks.
