Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for 32-bit platforms broken? #221

Closed
mmuetzel opened this issue Dec 14, 2022 · 72 comments
Closed

Support for 32-bit platforms broken? #221

mmuetzel opened this issue Dec 14, 2022 · 72 comments
Assignees
Labels
bug Something isn't working

Comments

@mmuetzel
Copy link
Contributor

In versions prior to 6.0, there was a data type SuiteSparse_long. That data type was long on all platforms but WIN64 where it was __int64.
That means that the data type had a size of 32 bit on 32-bit platforms and a size of 64 bit on 64-bit platforms (Linux and Windows alike).

For version 6.0, that was replaced with int64_t unconditionally. That means the type has now a size of 64 bit on 32-bit and 64-bit platforms.

IIUC, that type is used when indexing into arrays. But those can't be larger than 32 bit on 32-bit platforms.
Would it make sense to replace that type with intptr_t instead of int64_t?

I'm seeing crashes when using SPQR/CHOLMOD with 32-bit Octave on Windows.
Could that be related to that change? Are 32-bit platforms no longer supported by SuiteSparse?

The top of a backtrace from a segmentation fault caught with gdb when using 32-bit Octave with the libraries from SuiteSparse 6.0.2 (on WoW64):

Thread 7 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 7636.0x4b00]
0x5baf2a86 in cholmod_l_postorder (Parent=Parent@entry=0x1f875938, n=<optimized out>, n@entry=0, Weight=Weight@entry=0x0, Post=Post@entry=0x1f875938, Common=Common@entry=0x1f1e5510) at D:/mingw-w64-suitesparse/src/build-MINGW32/CHOLMOD/Cholesky/cholmod_postorder.c:202
warning: Source file is more recent than executable.
202                 p = Parent [j] ;
(gdb) bt
#0  0x5baf2a86 in cholmod_l_postorder (Parent=Parent@entry=0x1f875938, n=<optimized out>, n@entry=0,
    Weight=Weight@entry=0x0, Post=Post@entry=0x1f875938, Common=Common@entry=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/CHOLMOD/Cholesky/cholmod_postorder.c:202
#1  0x5baeff7f in cholmod_l_analyze_ordering (A=A@entry=0x1f8759a8, ordering=ordering@entry=0,
    Perm=Perm@entry=0x1a62ddd8, fset=fset@entry=0x0, fsize=fsize@entry=0, Parent=Parent@entry=0x1f875938,
    Post=Post@entry=0x1f875938, ColCount=ColCount@entry=0x1a62df38, First=First@entry=0x1f875938,
    Level=Level@entry=0x1f875938, Common=Common@entry=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/CHOLMOD/Cholesky/cholmod_analyze.c:344
#2  0x5baf09d4 in cholmod_l_analyze_p2 (for_whom=0, A=0x1f8759a8, UserPerm=0x0, fset=0x0, fsize=0, Common=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/CHOLMOD/Cholesky/cholmod_analyze.c:729
#3  0x5cef238e in spqr_analyze (A=0x1f875868, ordering=ordering@entry=0, Quser=Quser@entry=0x0, do_rank_detection=1,
    keepH=keepH@entry=1, cc=cc@entry=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/SPQR/Source/spqr_analyze.cpp:286
#4  0x5cf073d8 in spqr_1factor<double> (ordering=ordering@entry=0, tol=6.8389738316909643e-14, tol@entry=-2,
    bncols=0, keepH=keepH@entry=1, A=<optimized out>, A@entry=0x1e0aa558, ldb=0, Bp=Bp@entry=0x0, Bi=Bi@entry=0x0,
    Bx=<optimized out>, Bx@entry=0x0, cc=cc@entry=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/SPQR/Source/spqr_1factor.cpp:642
#5  0x5cf0cfca in SuiteSparseQR<double> (ordering=ordering@entry=0, tol=tol@entry=-2, econ=5, getCTX=getCTX@entry=-1,
    A=A@entry=0x1e0aa558, Bsparse=Bsparse@entry=0x0, Bdense=<optimized out>, Bdense@entry=0x0,
    p_Zsparse=p_Zsparse@entry=0x0, p_Zdense=p_Zdense@entry=0x0, p_R=p_R@entry=0x1f1e5d68, p_E=p_E@entry=0x1f1e5d6c,
    p_H=p_H@entry=0x1f1e5d70, p_HPinv=p_HPinv@entry=0x1f1e5d78, p_HTau=p_HTau@entry=0x1f1e5d74,
    cc=cc@entry=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/SPQR/Source/SuiteSparseQR.cpp:265
#6  0x5cf0c74f in SuiteSparseQR<double> (ordering=0, tol=-2, econ=5, A=0x1e0aa558, R=0x1f1e5d68, E=0x1f1e5d6c,
    H=0x1f1e5d70, HPinv=0x1f1e5d78, HTau=0x1f1e5d74, cc=0x1f1e5510)
    at D:/mingw-w64-suitesparse/src/build-MINGW32/SPQR/Source/SuiteSparseQR.cpp:1404--Type <RET> for more, q to quit, c to continue without paging--

The remainder of the backtrace is in functions in Octave.

@DrTimothyAldenDavis
Copy link
Owner

CHOLMOD can use either int32 (as in cholmod_whatever) or int64 (as in cholmod_l_whatever). SPQR only has the int64_t version for its integers.

I didn't realize that changing SuiteSparse_long from long to int64_t would break the 32-bit builds. Let me think over what I can do.

@DrTimothyAldenDavis
Copy link
Owner

DrTimothyAldenDavis commented Dec 15, 2022

I found this related issues for the 32-bit Octave: https://savannah.gnu.org/bugs/?59833 and I see you know about that :-).
It looks like it took a lot of work to connect Octave to SPQR, mainly because SPQR only provides a single integer type.

CHOLMOD, UMFPACK, AMD, CXSparse, COLAMD, KLU, LDL, etc. all provide both int32 and int64 versions.
SPQR, Mongoose, SPEX, and GraphBLAS only have 64-bit integer variants (for SPQR this was formerly SuiteSparse_long).
Perhaps the best solution is to provide a 32-bit integer variant of SPQR, like CHOLMOD has.

The problem with SuiteSparse_long is that it's a confusing type ... it's size was uncertain (just like "long") and that often causes difficulties. It's much simpler to have integers of known sizes: int32_t and int64_t. I don't want to go back to using SuiteSparse_long.

The simplest way to do this would be to allow either 64-bit indices, or 32-bit, but not both, like I had when using SuiteSparse_long in SPQR. I would need to change all calls to cholmod_l_(whatever) to cholmod_(whatever), and I would need to change all "int64_t" with some internal typedef that would depend on something like -DSPQR32. This would be very easy to do.

A better option would be to allow both 32-bit and 64-bit methods to exist side-by-side, via templates, but then the cholmod_sparse object would be difficult to handle, since it's not templatized and is a plain C object. That matrix type has an internal flag with the size of the integer, and I'd need to have the SPQR user call both cholmod_l_start (cc64) and cholmod_start (cc32), say, and then pass in the corresponding Common (cc64 or cc32) to SPQR. Matrices with different integer sizes can't be mixed inside CHOLMOD, but they can both exist at the same time in the user space. This option would take more work since it's a more significant change to make. It might be overkill since I don't think user applications would typically need both 32-bit and 64-bit matrices to exist at the same time in the same application.

I'd also like to keep backward compatibility with the existing 64-bit interface to SPQR, since that is the most commonly used form.

In the meantime, the 32-bit Octave would need to stick with SuiteSparse v5.13.

What do you think? What would make most sense for Octave?

@mmuetzel
Copy link
Contributor Author

I'm not certain which bug report you are referring to. Something seems to have gone wrong with the link.

I'm looking at this with two slightly different hats on:
From the point of view of a consumer of the libraries (Octave developer), it would be ideal if the API didn't change. It is possible to add configure checks to see how we'd need to interface with the library. And then implement code for the different interfaces dependent on the detected version. But it would be much easier if the APIs remained compatible between different versions.
Also in this particular case, I'm not certain if it is possible at all to use a working interface for 32-bit Octave. I haven't looked into it yet though. Could you point out the changes we'd need to make to get this working again on that platform?

The other hat is the one of a contributor to MSYS2 which distributes binary packages for Windows in different "build environments". Usually, the versions that are distributed for the different build environments are the same. So either, SuiteSparse would need to be downgraded to version 5.13 for all environments or they could stop shipping SuiteSparse for the 32-bit environments. (That is if it is really that broken on those platforms.)
While it is possible to build Octave (and possibly other dependers) without linking to SuiteSparse, that is not ideal (and some functionality would be missing).

I don't have a complete overview yet. And it might be that I'm misunderstanding the issue. But from a first glance, I'd guess that this is one of the cases why there is intptr_t. That type exists in C (since C99) and C++ (since C++11). It would be 64-bit on most (or all?) current Linux distributions (which are 64-bit by now) and would have the same size that SuiteSparse_long had before on all platforms. So, it wouldn't break the API.
At the same time, you could still deprecate SuiteSparse_long.

I also don't know if the width of Fortran integers comes into play here. Afaict, that is usually 32-bit. But it can be 64-bit on 64-bit platforms. Afaict, most distributions don't ship binaries with 64-bit Fortran integers though (at least not by default).

@DrTimothyAldenDavis
Copy link
Owner

Here's the corrected link (I also editted it in my post above): https://savannah.gnu.org/bugs/?59833

intptr_t would be an option, but I'd like to stick with int32_t and int64_t for the packages that support both integer sizes (AMD, COLAMD, KLU, UMFPACK, CHOLMOD, etc). That way, even on a 32-bit platform, the 64-bit integer index would be available.

I can keep the SPQR interface nearly unchanged. I would essentially revert to SuiteSparse_long as the single supported integer size. I would call that integer something else, since the integer would be for SPQR only. The integer would be int64_t by default, but would become int32_t if -DSPQR32 is used. SPQR relies on CHOLMOD, and it would call cholmod_l_* methods by default, such as cholmod_l_analyze (the int64_t version). If -DSPQR32 is enabled, it would call cholmod_analyze instead (the int32_t version).

This change would be very easy for me to make, and it would not extend or break the current API.

This change would require a minor change to the Octave interface.
I could add this SPQR32 flag as a #defined flag in the SPQR include file, as well, via the cmake config process. The 32-bit Octave would then need to call cholmod_start instead of cholmod_l_start, to initialize the Common struct for SPQR.

How does this sound?

The fortran integer is a separate issue. I detect that already, and I do a safe typecast into the SUITESPARSE_BLAS_INT in the macros defined in SuiteSparse_config.h. I detect the blas integer size in cmake, and configure SuiteSparse_config.h accordingly. That's totally transparent to the user of UMFPACK, CHOLMOD, and SPQR, and has no effect on the choice of integers for the sparse matrix data structures.

@DrTimothyAldenDavis
Copy link
Owner

Does the 32-bit Octave use 32-bit integers for the octave_idx_type, and 64 bit integers for the 64-bit build?

@mmuetzel
Copy link
Contributor Author

The 32-bit Octave would then need to call cholmod_start instead of cholmod_l_start, to initialize the Common struct for SPQR.

I believe it is doing that already anyway:
https://hg.savannah.gnu.org/hgweb/octave/file/a1e7e0b62765/liboctave/util/oct-sparse.h#l125

Does the 32-bit Octave use 32-bit integers for the octave_idx_type, and 64 bit integers for the 64-bit build?

By default, yes. However, it is possible to limit the size to 32 bit manually when configuring with --disable-64.

@DrTimothyAldenDavis
Copy link
Owner

Perfect. The changes to Octave would then hopefully be minor. This section:

#  if defined (OCTAVE_ENABLE_64)
typedef SuiteSparse_long suitesparse_integer;
#  else
typedef int suitesparse_integer;
#  endif

can become

#  if defined (OCTAVE_ENABLE_64)
typedef int64_t suitesparse_integer;
#  else
typedef int32_t suitesparse_integer;
#  endif

or maybe just the single line

typedef octave_idx_type suitesparse_integer ;

Then if SuiteSparseQR is compiled by default, its integers would be int64_t. That would be the case when octave_idx_type is 64 bit. If compiled with (say) -DSPQR32, then the integer used by SPQR would be int32_t, which would also match the octave_idx_type if Octave is compiled on a 32-bit platform or if --disable-64 is used.

@DrTimothyAldenDavis
Copy link
Owner

I can also create a typedef that is visible to the user application in the SPQR include files, say SPQR_int, which would be the integer index type I would be using in all of SPQR. That definition would be configured by the cmake config process to be int64_t (by default) or int32_t (if -DSPQR32 is used in the cmake build).

@mmuetzel
Copy link
Contributor Author

We'd probably still need to check during configure which version of the SuiteSparse libraries we are dealing with. The newer typedefs would probably only work in versions > 6.0.2. The existing typedefs would still be needed for versions < 6. And there would be a couple of versions that just can't be used for 32-bit platforms.

@DrTimothyAldenDavis
Copy link
Owner

Good point.

I would make this SuiteSparse 6.1.0, most likely, and I would bump the major version of SPQR by one as well (from 3.0 to 4.0). Then SuiteSparse 5.13.0 and earlier (with SPQR 2.x and earlier), or 6.1.x or later (with SPQR 4.0 and later), would work with Octave, but not SuiteSparse 6.0.x (with SPQR 3.0.x, which breaks the 32-bit octave).

@DrTimothyAldenDavis
Copy link
Owner

The alternative would be to add new user-visible functions with a new name space (say SuiteSparseQR32_*) that used int32_t exclusively, and keep the current ones using int64_t all the time. Then the compiled library libspqr.so, or libspqr.dll or whatever, would always have both versions available.

I could use templating instead of the namespace, but that would change the user-visible API for the current methods, which only templatize on the Entry type (real or complex).

This would take some more work but might be a more robust solution in the long run. It would make the SPQR library consistent with AMD, COLAMD, UMFPACK, CHOLMOD, KLU, etc.

@mmuetzel
Copy link
Contributor Author

I might be missing something. But didn't we find out that the version with 64-bit integers doesn't work correctly on 32-bit platforms? Would that version start working for some reason with the changes you are proposing?
If it wouldn't, won't it make sense to drop building that version?
If no version with 64 bit integers would be built for 32 bit platforms, wouldn't it make sense to switch from the int64_t interface to a intptr_t interface?

@DrTimothyAldenDavis
Copy link
Owner

The 64-bit versions is working fine for other packages. The problem here with SPQR is that the 32-bit Octave is expecting it to take int32_t inputs (the old SuiteSparse_long), but SPQR now assumes int64_t.

I don't want to use intptr_t because it's hard to reason about (like the old SuiteSparse_long). I want functions that use a fixed integer size, like amd_order using int32_t and amd_l_order using int64_t. It's hard to work with integers that vary in size based on which platform happens to be compiling the code. I want to have control over the integer size, so I'd rather pick int32_t or int64_t explicitly.

@mmuetzel
Copy link
Contributor Author

Sorry, I'm still not sure I understand.
Didn't the type of SuiteSparse_long change to int64_t in version 6? Wouldn't that mean that Octave (still using SuiteSparse_long) would use that type now?
And why are the demos giving incorrect results in the CI? Are they just using the wrong type, too?

@DrTimothyAldenDavis
Copy link
Owner

Yes, SuiteSparse_long is now int64_t. It used to be different. I haven't gotten any feedback on anyone building on a 32-bit platform for many years, and I don't have access to one myself. So I hadn't tried it, and that's why my recent change inadvertently broke the 32-bit Octave.

Using intptr_t would make SPQR very different from all my other codes. It would be hard to know which version of CHOLMOD that SPQR should call (cholmod_analyze or cholmod_l_analyze?). I would rather that the current SPQR functions keep their int64_t integers, and then add new ones via templates or a different name. Then the current SPQR functions would always connect with cholmod_l_whatever, and the new 32-bit SPQR functions would always connect to cholmod_whatever.

I'm not sure why some demos are giving incorrect output. Some of the incorrect results are because of how the values are printed with printf. The printfs are using the wrong format. I can fix that.

@DrTimothyAldenDavis
Copy link
Owner

So I think my plan is to make SPQR just like CHOLMOD: with both int32_t and int64_t options. Then Octave would pick the methods for SPQR the same way it does for CHOLMOD. This will take more work than making SPQR use a single integer (say intptr_t) but it's more robust in the long term.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Dec 17, 2022

I'm not sure if the issue that we found with int64_t on 32-bit platforms is limited to SPQR. It might be that it is just the first library for which we'd noticed because it doesn't have a dedicated 32-bit interface.
IIUC, there are two different projects that might both be worth to pursue:

  1. Add a separate interface for 32-bit integers to the SPQR library (like already existing for the other libraries in SuiteSparse).
  2. Use a type in the _l_ interface functions that works for all target platforms. That type could be the widest integer type that is supported on the respective target platform. That could be intptr_t on all platforms.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Dec 23, 2022

MSYS2 now packages the last SuiteSparse 5 version for 32-bit:
msys2/MINGW-packages#14663

Other distributions could probably do the same.
That means this can probably have a lower priority.

It might still be nice if SuiteSparse could fix the _l_ interface to work on 32-bit platforms again though.

@DrTimothyAldenDavis
Copy link
Owner

Yes, that's my plan. I plan on adding a 32-bit variant of SPQR (option 1 in your comment above), where all the indices are 32-bit in size.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Dec 27, 2022

I agree that this would be a good solution to make the APIs of the different libraries more similar.
However, it would require some more work for downstream projects. And would leave the wide indices API broken on 32-bit platforms. The second task in the comment above would still help with that even if the first task is completed.

I'd still argue that indices must be smaller than sizeof(void*) on any platform. That is exactly why the standard introduced intptr_t together with int64_t and the other fixed width integer identifiers.

@DrTimothyAldenDavis
Copy link
Owner

I've added this issue to a Project for the future SuiteSparse v7.1.0: https://github.com/users/DrTimothyAldenDavis/projects/1

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 3, 2023

Is that project set to private? I'm just getting a 404 error page for that link.

@DrTimothyAldenDavis
Copy link
Owner

I just set it to public.

@ViralBShah
Copy link
Contributor

ViralBShah commented Apr 14, 2023

@DrTimothyAldenDavis - @Wimmerer and I are running into some of these challenges getting SuiteSparse 7 into Julia. Would you recommend waiting until 7.1 is out and the dust on 32-bit issues is settled - especially if APIs will change?

Or would you recommend moving forward with 7.0 for now, and that 7.1 will only introduce new interfaces for SPQR 32-bit?

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Aug 4, 2023

I also tried cherry-picking this change to SuiteSparse 7.1.0. With it, Octave no longer segfaults during the test suite. But I still see a failing test that used to pass with SuiteSparse 5:

>>>>> processing C:\msys64\mingw32\share\octave\8.2.0\etc\tests\libinterp\corefcn\qr.cc-tst
***** testif ; (__have_feature__ ("SPQR") && __have_feature__ ("CHOLMOD")) || __have_feature__ ("CXSPARSE") <*63069>
 assert (qr (sparse (0, 0)), sparse (0, 0))
 assert (qr (sparse (1, 0)), sparse (1, 0))
 assert (qr (sparse (0, 1)), sparse (0, 1))
!!!!! regression: https://octave.org/testfailure/?63069
integer dimension or index out of range for Octave's indexing type

The last message could indicate that some indices or sizes might still be returned with bogus values.

But the main blocker -- i.e., the segfault -- is indeed fixed afaict. 🎉

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Aug 5, 2023

Fwiw, the MINGW32 build artifacts that I used for that test are here:
https://github.com/mmuetzel/MINGW-packages/actions/runs/5759876546

@mmuetzel
Copy link
Contributor Author

@svillemot: You wrote that Octave's test suite no longer segfaults for you after you cherry-picked that commit. Do you still see the regression though? Or is that Windows-only?

@DrTimothyAldenDavis
Copy link
Owner

Except for that particular Octave test failure, I think this issue is resolved. That would be nice to fix, of course, but I'm not sure if it's now an issue with SPQR, and hopefully now an issue of a minor update to Octave to use the specific 32-bit or 64-bit version it needs.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Sep 5, 2023

now an issue of a minor update to Octave to use the specific 32-bit or 64-bit version it needs.

Again: Does that mean your answer to this issue is that the int64_t interface is broken and will remain broken on 32-bit platforms? And you are pushing the responsibility to select a working part of the API to downstream projects?
That is an odd decision to say the least...

@DrTimothyAldenDavis
Copy link
Owner

No, the spqr interface isn't broken; it now supports both 32-bit and 64-bit integer indices, on both 32-bit and 64-bit platforms. It's just different from what it used to be, with this change of API. So that could be causing the Octave failure, but hopefully a minor update to Octave would be easy to do, on the 32-bit platform.

@DrTimothyAldenDavis
Copy link
Owner

Ideally, if I could replicate the Octave issue myself, I might be able to submit a PR to update Octave's interface to SPQR on 32-bit platforms.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Sep 6, 2023

Ideally, if I could replicate the Octave issue myself, I might be able to submit a PR to update Octave's interface to SPQR on 32-bit platforms.

I would appreciate that.

@DrTimothyAldenDavis
Copy link
Owner

Would I start with building octave from here: https://github.com/gnu-octave/octave ?

I would need to link it against SuiteSparse v7.2.0.beta1, and presumably test it on a 32- bit platform. I have a 32-bit debian 11 installed in virtualbox. I would then need to find the test that is currently failing.

Looking through the octave source code, I think the best solution would be for SuiteSparseQR to track the same integer as octave_idx_type. That type is either int32_t or int64_t. The SuiteSparse_long type would then be removed and replaced with octave_idx_type. The two uses of CHOLMOD_LONG in sparse-qr.cc would become a ternary:

A.itype = sizeof (octave_idx_type) == sizeof (int64_t) ? CHOLMOD_LONG : CHOLMOD_INT ;

and the copy of Ap and Ai on lines 250-260 and 294-304 would go away.

All calls to SuiteSparseQR would use <double,octave_idx_type> or <Complex,octave_idx_type> as their template parameters.

This would assume that SuiteSparseQR 4.2.0 or later is available, of course, so perhaps an #ifdef would be required to keep the current sparse-qr.cc the same if the version of SuiteSparseQR is < 4.2.0.

The function from_suitesparse_long would no longer be needed.

Before I try al that, let me try to replicate the test failure so I understand what is currently failing.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Sep 6, 2023

Would I start with building octave from here: https://github.com/gnu-octave/octave ?

Yes, that is the main repository. If you are more familiar with git than with hg, you could also use the mirror on GitHub: https://github.com/gnu-octave/octave
Edit: Oops. Misread. I was thinking of the main repository here: https://hg.savannah.gnu.org/hgweb/octave
The mirror on GitHub is usually only a couple of minutes behind.

I think the best solution would be for SuiteSparseQR to track the same integer as octave_idx_type.

I don't know if that is possible without a lot of preprocessor conditionals. We need to keep compatibility with the versions of SuiteSparse that are packaged by the various distributions.
IIUC, the first version for which that would work without issues would be SuiteSparse 7.2.0. That version is not even released yet. And it will probably take a couple of years before we can assume that most distros updated to that or a newer version.

For the time being, it might be best to keep using the version that uses SuiteSparse_long (i.e., int64_t now and long previously).

All calls to SuiteSparseQR would use <double,octave_idx_type> or <Complex,octave_idx_type> as their template parameters.

For the reasons mentioned before, that would probably increase the maintenance burden because different code paths for older and newer versions of SuiteSparse would need to be maintained.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Sep 6, 2023

It's a Heisenbug. I was able to reproduce when configuring with --disable-64. But only if I run the test suite with make check.
I doesn't occur if I run those failing tests directly or when I run only the tests from libinterp/corefcn/qr.cc-tst.

Something's not getting initialized correctly? So, the values depend on the memory layout? Or a cholmod_common thing?

@DrTimothyAldenDavis
Copy link
Owner

Unsure. I would need to replicate the bug myself. I suppose I need to build octave myself and link in SuiteSparse v7.2.0.beta. What OS should I use to do that? Does this bug only occur for 32-bit OS's? Or does --disable-64 trigger the bug in a 64-bit OS?

I'm trying the latest version on the github repo, with "configure --disable-64 ; make " and then I suppose "make install ; make check" ? I'm trying this on a 64-bit OS (Ubuntu 18.04 with gcc/g++ 12.2.0 from spack). That might not be the best place to try it but at least it will show me how to compile my own copy of octave. Then how to I tell it to use my own copy of SuiteSparse v7.2.0beta?

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Sep 7, 2023

The "Heisenbug behavior" was with a 64-bit build that was configured with --disable-64.

To have the configure script pick up the libraries from a non-default location, you could configure with --with-cholmod-includedir="/path/to/cholmod/headers" --with-cholmod-libdir="/path/to/cholmod/library" --with-spqr-includedir="/path/to/spqr/headers" --with-spqr-libdir="/path/to/spqr/library" and similarly for the other libraries (i.e., amd, camd, colamd, ccolamd, cxsparse, umfpack, and klu). (That might become easier once the configure script is adapted to pick up the new .pc files.)
Or probably easier: Set CPPFLAGS="-I/path/to/suitesparse/headers" LDFLAGS="-L/path/to/suitesparse/libraries".
It might be necessary to build without sundials (because that is likely still linked to an older, potentially incompatible version of SuiteSparse). For that, you could add --without-sundials_nvecserial to the configure flags.

After a fresh checkout, you'd need to run the bootstrap script once. I.e., the following commands should get you started:

# in the source tree of Octave
./bootstrap
mkdir build
cd ./build
CPPFLAGS="-I/path/to/suitesparse/headers" LDFLAGS="-L/path/to/suitesparse/libraries" \
  ../configure # add your configure flags here
make all -j20  # or however many parallel jobs you'd like to use
make check  # the error occurs during this for me

@DrTimothyAldenDavis
Copy link
Owner

Thanks! I'll give it a try and let you know how it goes.

@DrTimothyAldenDavis
Copy link
Owner

OK ... making progress. I got octave to build with my beta SuiteSparse 7.2.0, and I see the test failure. I can also replicate it in ./run-octave:

hypersparse $ ./run-octave 
octave: no graphical display found
octave: disabling GUI features
GNU Octave, version 9.0.0
Copyright (C) 1993-2023 The Octave Project Developers.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  For details, type 'warranty'.

Octave was configured for "x86_64-pc-linux-gnu".

Home page:            https://octave.org
Support resources:    https://octave.org/support
Improve Octave:       https://octave.org/get-involved

For changes from previous versions, type 'news'.

octave:1> 
octave:1> 
octave:1> 
octave:1> sparse(0,1)
ans = Compressed Column Sparse (rows = 0, cols = 1, nnz = 0)

octave:2> a=ans
a = Compressed Column Sparse (rows = 0, cols = 1, nnz = 0)

octave:3> qr(a)
error: integer dimension or index out of range for Octave's indexing type
octave:4> qr(a')
error: integer dimension or index out of range for Octave's indexing type
octave:5> qr(sparse(rand(4)))
ans =

Compressed Column Sparse (rows = 4, cols = 4, nnz = 10 [62%])

  (1, 1) -> -1.5889
  (1, 2) -> -0.7768
  (2, 2) -> -0.9528
  (1, 3) -> -1.2863
  (2, 3) -> -0.6779
  (3, 3) -> 0.7488
  (1, 4) -> -1.1343
  (2, 4) -> -0.7571
  (3, 4) -> 0.3542
  (4, 4) -> -0.077042

octave:6> qr(sparse(0,0))
error: integer dimension or index out of range for Octave's indexing type
octave:7> qr(sparse(1,0))
error: integer dimension or index out of range for Octave's indexing type
octave:8> qr(sparse(1,1))
error: integer dimension or index out of range for Octave's indexing type
octave:9> A = sparse (3,3)
A =

Compressed Column Sparse (rows = 3, cols = 3, nnz = 0 [0%])


octave:10> A(1,1) = 1
A =

Compressed Column Sparse (rows = 3, cols = 3, nnz = 1 [11%])

  (1, 1) -> 1

octave:11> qr(A)
ans =

Compressed Column Sparse (rows = 3, cols = 3, nnz = 1 [11%])

  (1, 1) -> 1

octave:12> A =sparse(0,4)
A = Compressed Column Sparse (rows = 0, cols = 4, nnz = 0)

octave:13> qr(A)
error: integer dimension or index out of range for Octave's indexing type
octave:14> qr(A')
error: integer dimension or index out of range for Octave's indexing type
octave:15> 
octave:17> qr(sparse(4,4))
error: integer dimension or index out of range for Octave's indexing type

qr(A) seems to handle rank deficient matrices OK, just not matrices that are entirely empty. I have some of those matrices in my standalone tests, outside of octave, and they work fine. Either octave is triggering a different path inside my code and hitting a bug in SPQR, or perhaps there's a glitch in the octave interface to the revised SPQR, or something. Whichever it is, it looks like I should be able to track this down.

@DrTimothyAldenDavis
Copy link
Owner

Oh ... I think I see a bug in sparse-qr.cc.

Octave is using A->nzmax to determine the # of entries in the matrix. That is not the right value. That's the space allocated for entries in the matrix. For an empty matrix, I think A->nzmax is 1 but A [ncol] is zero.

Checking this now but that looks suspicious.

@DrTimothyAldenDavis
Copy link
Owner

The correct method to ask a matrix for the # of entries is to use cholmod_nnz or cholmod_l_nnz.

@DrTimothyAldenDavis
Copy link
Owner

That's the bug, but I'm having a hard time fixing it because I can't find the CHOLMOD Common object to pass in.

qr(sparse(0,1)) returns a matrix with no entries, but R->nzmax is 1. The octave interface tries to copy the indices from R->i, and it thinks there is exactly 1. But there is none. The space is uninitialized so you get a garbage int64 value and it is outside the range of an int32.

The usage nz = R->nzmax is incorrect. It should be nz = cholmod_l_nnz (R, cc). But I can't seem to add the CHOLMOD Common object to the functions where R->nzmax is used.

A similar bug appears when using CXSparse instead. A->nzmax is the size of the A->i and A->x arrays, which is an upper bound on nnz(A). You don't want to use A->nzmax for CXSParse, but nz=A->p [ncol] instead.

@DrTimothyAldenDavis
Copy link
Owner

Here's a corrected sparse-qr.cc file. It generates some warnings about typecasting but it fixes the bug.

fixed.zip

@DrTimothyAldenDavis
Copy link
Owner

I haven't tested the changes I made to the CXSparse interface, since I'm running with SPQR installed.

@DrTimothyAldenDavis
Copy link
Owner

Summary:

PASS 18292
FAIL 1
XFAIL (reported bug) 36
SKIP (missing feature) 440
SKIP (run-time condition) 30

fntests.log

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Sep 7, 2023

Thank you very much!
I wonder how that worked correctly in so many cases. Maybe, we've just have been lucky so far.

I pushed your changes with some modifications here:
https://hg.savannah.gnu.org/hgweb/octave/rev/dde186dc7b4c

That also includes changes of the same pattern in other files.

So, it's been an issue in Octave after all that only showed up with newer versions of SuiteSparse (for whatever reason).
I'm glad there's one bug pattern less. 👍

siko1056 pushed a commit to gnu-octave/octave that referenced this issue Sep 7, 2023
Based on changes by @DrTimothyAldenDavis on GitHub:
DrTimothyAldenDavis/SuiteSparse#221 (comment)

* liboctave/numeric/sparse-qr.cc (rcs2ros, ccs2cos): Add cholmod_common as
input argument. Use "cholmod_l_nnz" to get number of non-zero matrix elements.
(sparse_qr<T>::sparse_qr_rep::V): Pass cholmod_common to conversion function.
Get correct number of non-zero matrix elements.
(sparse_qr<T>::sparse_qr_rep::R): Use "cholmod_l_nnz" to get number of non-zero
matrix elements. Get correct number of non-zero matrix elements.
(sparse_qr<T>::sparse_qr_rep::tall_solve): Use "cholmod_l_nnz" to get number of
non-zero matrix elements.
(sparse_qr<T>::min2norm_solve): Pass cholmod_common to conversion function.

* liboctave/numeric/sparse-qr.cc (parse_chol<chol_type>::L): Use
"cholmod_[l_]nnz" to get number of non-zero matrix elements.

* liboctave/array/CSparse.cc, dSparse.cc (fsolve): Use "cholmod_[l_]nnz" to get
number of non-zero matrix elements.
@DrTimothyAldenDavis
Copy link
Owner

Glad to help! Yes, my modifications did need some polishing; my typecast was generating a warning, and I only fixed the one file.

The bug has been there for a long time, I'm sure. Typically, my nzmax is equal to max(1,nz) for a matrix with nz entries, and sometimes larger. For example:

A->nzmax = nzmax = CS_MAX (nzmax, 1) ;

nzmax = MAX (1, nzmax) ;

I do this because p=malloc(0) can behave strangely. I've seen it return NULL because no space is needed. So I never malloc a space of zero size. I could leave the array A->i as NULL, if there are no entries in the matrix, and skip the malloc entirely. But that looks wierd and my tests for a valid matrix (checking if A->i is non-NULL) would then be more complicated.

As a result, if nnz(A) is zero, A->i still exists and A->nzmax == 1. And that one integer was uninitialized space.

Looks like I can close this issue now, right?

mtmiller pushed a commit to mtmiller/octave that referenced this issue Oct 16, 2023
Based on changes by @DrTimothyAldenDavis on GitHub:
DrTimothyAldenDavis/SuiteSparse#221 (comment)

* liboctave/numeric/sparse-qr.cc (rcs2ros, ccs2cos): Add cholmod_common as
input argument. Use "cholmod_l_nnz" to get number of non-zero matrix elements.
(sparse_qr<T>::sparse_qr_rep::V): Pass cholmod_common to conversion function.
Get correct number of non-zero matrix elements.
(sparse_qr<T>::sparse_qr_rep::R): Use "cholmod_l_nnz" to get number of non-zero
matrix elements. Get correct number of non-zero matrix elements.
(sparse_qr<T>::sparse_qr_rep::tall_solve): Use "cholmod_l_nnz" to get number of
non-zero matrix elements.
(sparse_qr<T>::min2norm_solve): Pass cholmod_common to conversion function.

* liboctave/numeric/sparse-qr.cc (parse_chol<chol_type>::L): Use
"cholmod_[l_]nnz" to get number of non-zero matrix elements.

* liboctave/array/CSparse.cc, dSparse.cc (fsolve): Use "cholmod_[l_]nnz" to get
number of non-zero matrix elements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Status: Done
Development

No branches or pull requests

5 participants