pll_id conflicts when submitting many jobs simultaneously (LSF arrays) #113

Open
4 tasks done
rpguiteras opened this issue Oct 25, 2024 · 2 comments

@rpguiteras

Preliminaries

Before submitting an issue, please check (with x in brackets) that you:

  • Are using the newest release (see here for latest release version number).
  • Have checked that the examples in the help work.
  • Have read the help (HTML version) and the gallery of examples.
  • Have checked that there is not already an existing issue for what you are reporting.

Expected behavior and actual behavior

Context

I am using LSF as a job scheduler to submit an array of jobs to our cluster. Each job is assigned 4 cores, and I am using parallel sim to divide the simulation among the 4 cores.
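
For readers unfamiliar with LSF arrays, here is a minimal sketch of how each array element might derive its own seed and file suffix inside Stata. It assumes the job runs under an LSF array, where the scheduler sets the environment variable LSB_JOBINDEX. The base seed value and the way `myseed` and `fileSuffix` are built are hypothetical and not taken from the original setup.

```stata
* Sketch only: derive per-job values from the LSF array index.
* LSB_JOBINDEX is set by LSF for array jobs; everything else is illustrative.
local jobindex : environment LSB_JOBINDEX
if "`jobindex'" == "" {
    local jobindex 1                  // not running under LSF: fall back to 1
}

local base_seed 20241025              // hypothetical base seed
local myseed = `base_seed' + `jobindex'
local fileSuffix "job`jobindex'"      // hypothetical naming scheme

set seed `myseed'
display "array element `jobindex': seed `myseed', suffix `fileSuffix'"
```

Note that distinct seeds per job do not by themselves guarantee distinct pll_id values; that depends on how parallel generates the ID.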

Desired behavior

I would like parallel to assign a unique pll_id to each job.

Actual behavior

LSF may start many jobs simultaneously, i.e., within the same second. When that happens, parallel assigns the same pll_id to each of those jobs, causing a conflict and errors for all but one of them.
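
The mechanism can be illustrated with a minimal sketch. It assumes, as this report suggests but without confirmation from parallel's source, that the ID is drawn from a random-number generator seeded from the clock at one-second resolution. The seed formula and the ID range below are hypothetical and are not parallel's actual code.

```stata
* Hypothetical illustration only: NOT the parallel package's implementation.
* Two jobs that start in the same second compute the same clock-based seed.
local stamp "25 Oct 2024 09:09:16"
local seed = mod(clock("`stamp'", "DMYhms") / 1000, 2147483647)

set seed `seed'                           // "job A"
local id_a = runiformint(1, 999999999)

set seed `seed'                           // "job B": same second, same seed
local id_b = runiformint(1, 999999999)

display "job A id: `id_a'   job B id: `id_b'"
assert `id_a' == `id_b'                   // identical seeds give identical IDs
```

Under this assumption, any number of jobs starting within one second would draw the same ID, which matches the conflict described here.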

Failed solution attempted

Because each job has a unique seed, I tried the randtype("current") option, but it did not resolve the conflicts.

Solution (workaround)

What solved the problem was wrapping cap parallel sim in a while loop: if it returns an error, the job waits a random number of seconds (1-16, chosen arbitrarily) and tries again. This was successful, although, somewhat to my surprise, some jobs needed to go through the loop 10 or more times. I have appended a sketch of the code.

Steps to reproduce the problem

This would be pretty tough because I think it depends on the specifics of our cluster, our job scheduler, etc.

System information

Some relevant information

  • Stata version and flavor: Stata 18.5 MP4
  • OS type and version (e.g. Windows 10): RHEL9
  • Parallel version: version 1.20.1 07jun2021

Output from creturn list:

System values
-------------

    ------------------------------------------------------
        c(current_date) = "25 Oct 2024"
        c(current_time) = "09:09:16"
           c(rmsg_time) = 0                          (seconds, from set rmsg)
    ------------------------------------------------------
       c(stata_version) = 18.5
             c(version) = 17                         (version)
         c(userversion) = 17                         (version)
      c(dyndoc_version) = 2                          (dyndoc)
    ------------------------------------------------------
           c(born_date) = "16 Jul 2024"
             c(edition) = "BE"
        c(edition_real) = "MP"
                 c(bit) = 64
                  c(SE) = 1
                  c(MP) = 1
          c(processors) = 4                          (Stata/MP, set processors)
      c(processors_lic) = 4
     c(processors_mach) = 32
      c(processors_max) = 4
       c(kmp_blocktime) = 200                        (set kmp_blocktime)
                c(mode) = "batch"
             c(console) = "console"
    ------------------------------------------------------
                  c(os) = "Unix"
               c(osdtl) = ""
            c(hostname) = "c009n01"
        c(machine_type) = "PC (64-bit x86-64)"
           c(byteorder) = "lohi"
            c(username) = "rpguiter"
    ------------------------------------------------------

Directories and paths
---------------------

    ------------------------------------------------------
        c(sysdir_stata) = "/usr/local/apps/s.."      (sysdir)
         c(sysdir_base) = "/usr/local/apps/s.."      (sysdir)
         c(sysdir_site) = "/usr/local/apps/s.."      (sysdir)
         c(sysdir_plus) = "code/ado/plus/"           (sysdir)
     c(sysdir_personal) = "code/ado/personal/"       (sysdir)
     c(sysdir_oldplace) = "code/ado/"                (sysdir)
              c(tmpdir) = "/share/rpguiter/r.."
    ------------------------------------------------------
             c(adopath) = "BASE;SITE;.;PERSO.."      (adopath)
                 c(pwd) = "/rs1/researchers/.."      (cd)
              c(dirsep) = "/"
    ------------------------------------------------------

System limits
-------------

    ------------------------------------------------------
        c(max_N_theory) = 1099511627775
        c(max_k_theory) = 5000                       (set maxvar)
    c(max_width_theory) = 1048576                    (set maxvar)
    ------------------------------------------------------
          c(max_matdim) = 65534
    ------------------------------------------------------
        c(max_it_cvars) = 64
        c(max_it_fvars) = 8
    ------------------------------------------------------
        c(max_macrolen) = 15480200
            c(macrolen) = 645200                     (set maxvar)
             c(charlen) = 67783
          c(max_cmdlen) = 15480216
              c(cmdlen) = 645216                     (set maxvar)
         c(namelenbyte) = 128
         c(namelenchar) = 32
               c(eqlen) = 1337
    ------------------------------------------------------

Numerical and string limits
---------------------------

    ------------------------------------------------------
           c(mindouble) = -8.9884656743e+307
           c(maxdouble) = 8.9884656743e+307
           c(epsdouble) = 2.22044604925e-16
      c(smallestdouble) = 2.2250738585e-308
    ------------------------------------------------------
            c(minfloat) = -1.70141173319e+38
            c(maxfloat) = 1.70141173319e+38
            c(epsfloat) = 1.19209289551e-07
    ------------------------------------------------------
             c(minlong) = -2147483647
             c(maxlong) = 2147483620
    ------------------------------------------------------
              c(minint) = -32767
              c(maxint) = 32740
    ------------------------------------------------------
             c(minbyte) = -127
             c(maxbyte) = 100
    ------------------------------------------------------
        c(maxstrvarlen) = 2045
       c(maxstrlvarlen) = 2000000000
        c(maxvlabellen) = 32000
    ------------------------------------------------------

Current dataset
---------------

    ------------------------------------------------------
               c(frame) = "default"
                   c(N) = 0
                   c(k) = 0
               c(width) = 0
             c(changed) = 0
            c(filename) = ""
            c(filedate) = ""
    ------------------------------------------------------

Memory settings
---------------

    ------------------------------------------------------
              c(memory) = 33554432
              c(maxvar) = 5000                       (set maxvar)
            c(niceness) = 5                          (set niceness)
          c(min_memory) = 0                          (set min_memory)
          c(max_memory) = .                          (set max_memory)
         c(segmentsize) = 33554432                   (set segmentsize)
             c(adosize) = 1000                       (set adosize)
     c(max_preservemem) = 1073741824                 (set max_preservemem)
    ------------------------------------------------------

Output settings
---------------

    ------------------------------------------------------
                c(more) = "off"                      (set more)
                c(rmsg) = "off"                      (set rmsg)
                  c(dp) = "period"                   (set dp)
            c(linesize) = 110                        (set linesize)
            c(pagesize) = 23                         (set pagesize)
             c(logtype) = "smcl"                     (set logtype)
              c(logmsg) = "on"                       (set logmsg)
             c(noisily) = 1
    ------------------------------------------------------
             c(iterlog) = "on"                       (set iterlog)
    ------------------------------------------------------
               c(level) = 95                         (set level)
              c(clevel) = 95                         (set clevel)
    ------------------------------------------------------
      c(showbaselevels) = ""                         (set showbaselevels)
      c(showemptycells) = ""                         (set showemptycells)
         c(showomitted) = ""                         (set showomitted)
             c(fvlabel) = "on"                       (set fvlabel)
              c(fvwrap) = 1                          (set fvwrap)
            c(fvwrapon) = "word"                     (set fvwrapon)
            c(lstretch) = ""                         (set lstretch)
    ------------------------------------------------------
             c(cformat) = ""                         (set cformat)
             c(sformat) = ""                         (set sformat)
             c(pformat) = ""                         (set pformat)
    ------------------------------------------------------
      c(coeftabresults) = "on"                       (set coeftabresults)
                c(dots) = "on"                       (set dots)
    ------------------------------------------------------
       c(collect_label) = "default"                  (set collect_label)
       c(collect_style) = "default"                  (set collect_style)
         c(table_style) = "table"                    (set table_style)
        c(etable_style) = "etable"                   (set etable_style)
        c(dtable_style) = "dtable"                   (set dtable_style)
        c(collect_warn) = "on"                       (set collect_warn)
    ------------------------------------------------------

Interface settings
------------------

    ------------------------------------------------------
             c(linegap) = .                          (set linegap)
       c(scrollbufsize) = .                          (set scrollbufsize)
               c(maxdb) = 50                         (set maxdb)
    ------------------------------------------------------

Graphics settings
-----------------

    ------------------------------------------------------
            c(graphics) = "off"                      (set graphics)
              c(scheme) = "s1color"                  (set scheme)
          c(printcolor) = "asis"                     (set printcolor)
       c(min_graphsize) = 1                          (region_options)
       c(max_graphsize) = 100                        (region_options)
    ------------------------------------------------------

Network settings
----------------

    ------------------------------------------------------
           c(httpproxy) = "off"                      (set httpproxy)
       c(httpproxyhost) = ""                         (set httpproxyhost)
       c(httpproxyport) = 80                         (set httpproxyport)
    ------------------------------------------------------
       c(httpproxyauth) = "off"                      (set httpproxyauth)
       c(httpproxyuser) = ""                         (set httpproxyuser)
         c(httpproxypw) = ""                         (set httpproxypw)
    ------------------------------------------------------

Trace (program debugging) settings
----------------------------------

    ------------------------------------------------------
               c(trace) = "off"                      (set trace)
          c(tracedepth) = 1                          (set tracedepth)
            c(tracesep) = "on"                       (set tracesep)
         c(traceindent) = "on"                       (set traceindent)
         c(traceexpand) = "on"                       (set traceexpand)
         c(tracenumber) = "off"                      (set tracenumber)
         c(tracehilite) = ""                         (set tracehilite)
    ------------------------------------------------------

Mata settings
-------------

    ------------------------------------------------------
          c(matastrict) = "off"                      (set matastrict)
            c(matalnum) = "off"                      (set matalnum)
        c(mataoptimize) = "on"                       (set mataoptimize)
           c(matafavor) = "space"                    (set matafavor)
           c(matacache) = 2000                       (set matacache)
            c(matalibs) = ""                         (set matalibs)
         c(matamofirst) = "off"                      (set matamofirst)
        c(matasolvetol) = .                          (set matasolvetol)
    ------------------------------------------------------

Java settings
-------------

    ------------------------------------------------------
        c(java_heapmax) = "4096m"                    (set java_heapmax)
           c(java_home) = "/usr/local/apps/s.."      (set java_home)
    ------------------------------------------------------

LAPACK settings
---------------

    ------------------------------------------------------
          c(lapack_mkl) = "on"                       (set lapack_mkl)
      c(lapack_mkl_cnr) = "default"                  (set lapack_mkl_cnr)
    ------------------------------------------------------

putdocx settings
----------------

    ------------------------------------------------------
      c(docx_hardbreak) = "off"                      (set docx_hardbreak)
       c(docx_paramode) = "off"                      (set docx_paramode)
       c(docx_maxtable) = 500                        (set docx_maxtable)
    ------------------------------------------------------

putpdf settings
---------------

    ------------------------------------------------------
        c(pdf_maxtable) = 500                        (set pdf_maxtable)
    ------------------------------------------------------

Python settings
---------------

    ------------------------------------------------------
         c(python_exec) = ""                         (set python_exec)
     c(python_userpath) = ""                         (set python_userpath)
    ------------------------------------------------------

RNG settings
------------

    ------------------------------------------------------
                 c(rng) = "default"                  (set rng)
         c(rng_current) = "mt64"
            c(rngstate) = "XAA00000000000000.."      (set rngstate)
       c(rngseed_mt64s) = 123456789
           c(rngstream) = 1                          (set rngstream)
    ------------------------------------------------------

sort settings
-------------

    ------------------------------------------------------
          c(sortmethod) = "default"                  (set sortmethod)
        c(sort_current) = "fsort"
        c(sortrngstate) = "1001XZA112210f4b1.."      (set sortrngstate)
    ------------------------------------------------------

Unicode settings
----------------

    ------------------------------------------------------
           c(locale_ui) = ""                         (set locale_ui)
    c(locale_functions) = "en_US"                    (set locale_functions)
      c(locale_icudflt) = "en_US"                    (unicode locale)
    ------------------------------------------------------

Other settings
--------------

    ------------------------------------------------------
                c(type) = "float"                    (set type)
             c(maxiter) = 300                        (set maxiter)
       c(searchdefault) = "all"                      (set searchdefault)
           c(varabbrev) = "off"                      (set varabbrev)
          c(emptycells) = "keep"                     (set emptycells)
             c(fvtrack) = "term"                     (set fvtrack)
              c(fvbase) = "on"                       (set fvbase)
             c(odbcmgr) = "iodbc"                    (set odbcmgr)
          c(odbcdriver) = "unicode"                  (set odbcdriver)
             c(fredkey) = ""                         (set fredkey)
      c(collect_double) = "on"                       (set collect_double)
       c(dtascomplevel) = 1                          (set dtascomplevel)
       c(reshape_favor) = "memory"                   (set reshape_favor)
    ------------------------------------------------------
@rpguiteras
Author

Code that works:


parallel initialize 4

loc success = 0 
loc attempt = 1

while `success'==0 {

set seed `myseed'
local rngstate_initial = c(rngstate)

cap parallel sim, ///
  expr( ///
    beta1_hat          = r(beta1_hat)  ///
    beta1_se_hat       = r(beta1_se_hat) ///
    pi1_hat            = r(pi1_hat) ///
    pi1_hat_se         = r(pi1_hat_se) ///
    fs_Fstat           = r(fs_Fstat) ///
    gamma1_hat         = r(gamma1_hat) ///
    gamma1_hat_se_unc  = r(gamma1_hat_se_unc) ///
    gamma1_hat_se_jnt2 = r(gamma1_hat_se_jnt2) ///
    gamma1_hat_se_jnt  = r(gamma1_hat_se_jnt) ///
    gamma1_hat_se_jnt1 = r(gamma1_hat_se_jnt1) ///
    N_G_npp            = r(N_G_npp) ///
  ) /// end expr()
  reps(`num_reps') noisily randtype("current")  ///
  saving("temp/simulation_postfile-`fileSuffix'.dta", replace) ///
  : ///
    sim_menzel_onerep, ///
      num_groups(`num_groups') n_g(`n_g') ///
      beta1(`beta1') gamma1(`gamma1') ///
      rho(`rho') pi1(`pi1') sigma(`sigma') ///
      `testing'

if _rc==0 {
  noi di `"** succeeded on attempt `attempt' **"'
  noi ret li 
  loc my_pll_id `r(pll_id)'
  use "temp/simulation_postfile-`fileSuffix'.dta", clear 

  loc LocsToChars num_reps N_G n_g rho  ///
    logit_scale beta1 gamma1 pi1 sigma

  foreach LTC of local LocsToChars {
    char _dta[`LTC'] ``LTC''
  }
  char _dta[rngstate_initial] `rngstate_initial'

  save "output/data/sim_menzel/simulation_postfile-`fileSuffix'.dta", replace 
  rm "temp/simulation_postfile-`fileSuffix'.dta"

  desc, f 
  char li _dta[]

  summarize

  // remove auxiliary files
  parallel clean, event(`my_pll_id')
  loc success = 1
}
else {
  di `"** attempt `attempt' failed"'
  loc success = 0
  loc ++attempt
  
  loc tempseed = `myseed'+`attempt'
  set seed `tempseed'

  loc tosleep = runiformint(1,16)
  di `"** wait `tosleep' seconds to try again"'
  // sleep takes milliseconds, so convert seconds to ms
  loc tosleep_ms = 1000*`tosleep'
  sleep `tosleep_ms'
}

} // end while


@bquistorff
Collaborator

It seems that two separate processes each see that an ID is available, then both try to claim it, and only one succeeds. We could have a combined "find available pllid AND hold it" command in parallel_sandbox (these are separate operations currently) that retries with a new ID if acquisition fails.
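
A minimal sketch of the combined "find and hold" idea, written as a standalone Stata program rather than as a change to parallel_sandbox. It assumes that creating a directory is atomic on the file system in use (this may not hold on every network file system), so a successful mkdir both checks and claims the ID in one step. The program name `claim_unique_id`, the `__lock_` prefix, and the ID range are hypothetical.

```stata
* Sketch only: claim an ID by creating a lock directory.
* mkdir fails if the directory already exists, so checking and claiming
* happen in a single step; on failure we draw a new ID and retry.
capture program drop claim_unique_id
program define claim_unique_id, rclass
    syntax , Dir(string) [MAXtries(integer 50)]
    forvalues i = 1/`maxtries' {
        local id = runiformint(1, 999999999)
        capture mkdir "`dir'/__lock_`id'"
        if _rc == 0 {
            // the directory is ours: report the ID and stop
            return local id "`id'"
            return scalar tries = `i'
            exit
        }
    }
    display as error "could not claim an ID after `maxtries' tries"
    exit 498
end

* Usage: claim an ID in the current directory, then release it when done.
claim_unique_id, dir(".")
display "claimed id `r(id)' after `r(tries)' tries"
rmdir "./__lock_`r(id)'"
```

One consequence of this design: even if several processes draw the same sequence of candidate IDs, each failed mkdir moves the loser on to the next candidate, so k colliding processes should all succeed within about k tries.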
