Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time SDM Random Forests runs out of memory #6

Open
scottsfarley93 opened this issue Sep 8, 2016 · 4 comments
Open

Time SDM Random Forests runs out of memory #6

scottsfarley93 opened this issue Sep 8, 2016 · 4 comments

Comments

@scottsfarley93
Copy link
Member

Log Dump:

19:03:53.000
Sep 8 00:03:53 instance-5 startup-script: INFO startup-script: [1] "This rep is # 1 for cores= 1 and numTrees= 6000"
19:04:55.000
Sep 8 00:04:55 instance-5 collectd[968]: match_throttle_metadata_keys: 23 history entries, 221 distinct keys, 24952 bytes server memory.
19:09:55.000
Sep 8 00:09:55 instance-5 collectd[968]: match_throttle_metadata_keys: 24 history entries, 221 distinct keys, 24952 bytes server memory.
19:12:51.000
Sep 8 00:12:51 instance-5 startup-script: INFO startup-script: [1] "Parallel finished"
19:15:54.000
Sep 8 00:15:54 instance-5 collectd[968]: match_throttle_metadata_keys: 25 history entries, 221 distinct keys, 24952 bytes server memory.
19:17:01.000
Sep 8 00:17:01 instance-5 CRON[5289]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.133976] R invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.141176] R cpuset=/ mems_allowed=0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.145173] CPU: 7 PID: 1315 Comm: R Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2+deb8u3
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.153972] Hardware name: Google Google/Google, BIOS Google 01/01/2011
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.160684] 0000000000000000 ffffffff8150e08f ffff8800b3c43470 0000000000000000
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.168637] ffffffff8150bc8b 0000000000000000 ffffffff810d51ff 0000000000000000
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.176619] ffffffff815129ae 0000000000000200 ffffffff81067933 ffffffff810c2914
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.184590] Call Trace:
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.187142] [] ? dump_stack+0x5d/0x78
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.192554] [] ? dump_header+0x76/0x1e8
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.198152] [] ? smp_call_function_single+0x5f/0xa0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.204788] [] ? mutex_lock+0xe/0x2a
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.210126] [] ? put_online_cpus+0x23/0x80
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.215974] [] ? rcu_oom_notify+0xc4/0xe0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.221735] [] ? do_try_to_free_pages+0x4ac/0x520
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.228198] [] ? oom_kill_process+0x21d/0x370
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.234307] [] ? find_lock_task_mm+0x3d/0x90
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.240336] [] ? out_of_memory+0x473/0x4b0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.246189] [] ? __alloc_pages_nodemask+0x9cd/0xb30
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.252821] [] ? alloc_pages_vma+0x98/0x160
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.258756] [] ? handle_mm_fault+0xe3d/0x11c0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.264863] [] ? __do_page_fault+0x177/0x4f0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.270885] [] ? do_mmap_pgoff+0x2e9/0x3b0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.276743] [] ? put_prev_entity+0x57/0x350
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.282686] [] ? vm_mmap_pgoff+0x9b/0xc0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.288360] [] ? page_fault+0x28/0x30
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.293795] Mem-Info:
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.296170] Node 0 DMA per-cpu:
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.299535] CPU 0: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.304438] CPU 1: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.309335] CPU 2: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.314233] CPU 3: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.319131] CPU 4: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.324028] CPU 5: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.328925] CPU 6: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.333827] CPU 7: hi: 0, btch: 1 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.338720] Node 0 DMA32 per-cpu:
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.342260] CPU 0: hi: 186, btch: 31 usd: 9
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.347155] CPU 1: hi: 186, btch: 31 usd: 19
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.352051] CPU 2: hi: 186, btch: 31 usd: 12
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.356957] CPU 3: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.361856] CPU 4: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.366754] CPU 5: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.371653] CPU 6: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.376551] CPU 7: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.381448] Node 0 Normal per-cpu:
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.385071] CPU 0: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.389978] CPU 1: hi: 186, btch: 31 usd: 30
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.394880] CPU 2: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.399779] CPU 3: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.404674] CPU 4: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.409574] CPU 5: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.414471] CPU 6: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.419367] CPU 7: hi: 186, btch: 31 usd: 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.424267] active_anon:1787463 inactive_anon:2075 isolated_anon:0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.424267] active_file:0 inactive_file:0 isolated_file:0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.424267] unevictable:0 dirty:0 writeback:0 unstable:0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.424267] free:24829 slab_reclaimable:2226 slab_unreclaimable:3608
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.424267] mapped:719 shmem:2108 pagetables:4376 bounce:0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.424267] free_cma:0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.455819] Node 0 DMA free:15904kB min:144kB low:180kB high:216kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.495314] lowmem_reserve[]: 0 2981 7178 7178
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.500443] Node 0 DMA32 free:44096kB min:28004kB low:35004kB high:42004kB active_anon:2975676kB inactive_anon:512kB active_file:0kB inactive_file:28kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129332kB managed:3054872kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB shmem:588kB slab_reclaimable:1660kB slab_unreclaimable:2928kB kernel_stack:1344kB pagetables:7400kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1093 all_unreclaimable? yes
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.543405] lowmem_reserve[]: 0 0 4197 4197
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.548277] Node 0 Normal free:39316kB min:39432kB low:49288kB high:59148kB active_anon:4174176kB inactive_anon:7788kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4404224kB managed:4298120kB mlocked:0kB dirty:0kB writeback:0kB mapped:2676kB shmem:7844kB slab_reclaimable:7240kB slab_unreclaimable:11504kB kernel_stack:2080kB pagetables:10104kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.591413] lowmem_reserve[]: 0 0 0 0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.595771] Node 0 DMA: 2*4kB (UE) 1*8kB (E) 1*16kB (E) 2*32kB (UE) 3*64kB (UE) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ER) 2*4096kB (M) = 15904kB
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.613687] Node 0 DMA32: 323*4kB (UEM) 195*8kB (UEM) 114*16kB (UEM) 65*32kB (UEM) 22*64kB (UEM) 15*128kB (UE) 22*256kB (UE) 11*512kB (UEM) 5*1024kB (UE) 3*2048kB (EM) 3*4096kB (MR) = 44900kB
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.633615] Node 0 Normal: 322*4kB (UEM) 240*8kB (UEM) 143*16kB (UEM) 83*32kB (UEM) 31*64kB (UEM) 19*128kB (UE) 14*256kB (UEM) 8*512kB (UEM) 3*1024kB (EM) 4*2048kB (UEM) 2*4096kB (ER) = 39704kB
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.653797] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.662345] 2869 total pagecache pages
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.666207] 0 pages in swap cache
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.669621] Swap cache stats: add 0, delete 0, find 0/0
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.674952] Free swap = 0kB
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.677942] Total swap = 0kB
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.680925] 1887387 pages RAM
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.684008] 0 pages HighMem/MovableOnly
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.687950] 26526 pages reserved
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.691285] 0 pages hwpoisoned
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.694453] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.702402] [ 217] 0 217 10383 879 22 0 0 systemd-journal
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.711311] [ 220] 0 220 10268 190 21 0 -1000 systemd-udevd
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.720036] [ 384] 0 384 6350 1739 14 0 0 dhclient
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.728323] [ 431] 0 431 6869 60 17 0 0 cron
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.736278] [ 435] 0 435 4964 67 15 0 0 systemd-logind
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.745090] [ 448] 108 448 10531 98 25 0 -900 dbus-daemon
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.753642] [ 496] 104 496 7292 145 19 0 0 ntpd
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.761586] [ 505] 0 505 64668 251 28 0 0 rsyslogd
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.769872] [ 510] 0 510 1064 38 8 0 0 acpid
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.777983] [ 514] 0 514 3604 41 10 0 0 agetty
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.786098] [ 515] 0 515 3559 41 12 0 0 agetty
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.794213] [ 517] 0 517 48215 205 29 0 0 rserver
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.802411] [ 606] 0 606 14103 2113 32 0 0 google_metadata
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.811307] [ 607] 0 607 14200 2217 32 0 0 google_ip_forwa
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.820205] [ 608] 0 608 14216 2247 31 0 0 google_clock_sk
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.829101] [ 610] 0 610 14256 2278 31 0 0 google_accounts
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.838004] [ 611] 0 611 13796 169 30 0 -1000 sshd
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.845949] [ 615] 0 615 14331 2294 31 0 0 supervisord
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.854500] [ 622] 0 622 5414 69 15 0 0 tmpD3seIG
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.862885] [ 651] 0 651 33693 1719 35 0 0 python
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.871003] [ 968] 0 968 251175 805 85 0 0 stackdriver-col
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.879905] [ 1210] 0 1210 29967 5004 53 0 0 ruby
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.887849] [ 1213] 0 1213 131295 20474 242 0 0 ruby
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.895787] [ 1315] 0 1315 1801208 1744925 3520 0 0 R
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.903472] Out of memory: Kill process 1315 (R) score 920 or sacrifice child
19:20:50.000
Sep 8 00:20:50 instance-5 kernel: [ 8359.910717] Killed process 1315 (R) total-vm:7204832kB, anon-rss:6979680kB, file-rss:20kB
19:20:50.000
Sep 8 00:20:50 instance-5 startup-script: INFO startup-script: /tmp/startup-uVO465/tmpD3seIG: line 36: 1315 Killed Rscript /home/rstudio/thesis-scripts/R/random_forest_timeSDm.R
19:20:51.000
Sep 8 00:20:51 instance-5 logger: Startup done. Ready to proceed.
19:20:51.000
Sep 8 00:20:51 instance-5 startup-script: INFO startup-script: Return code 0.
19:20:51.000
Sep 8 00:20:51 instance-5 startup-script: INFO Finished running startup scripts.
@scottsfarley93
Copy link
Member Author

The problem is that memory is shared between worker cores. Each core gets a copy of the full forest and all training data, which gets real big real fast. Going to try setting ulimit to = 90% of total.

@scottsfarley93
Copy link
Member Author

ulimit:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 28716
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 28716
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

RInfo:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

System Memory Usage:

memoryUsage

@scottsfarley93
Copy link
Member Author

scottsfarley93 commented Sep 9, 2016

Changes:

  1. Change nodesize to 15, instead of the default 5. This will grow smaller trees, so need to check if the AUC is affected.
  2. I was growing 4x the required number of trees than I wanted.
    Old code:
   model <- foreach(ntree=rep(rfTrees, ncores), .combine=combine, .multicombine=TRUE,
                    .packages='randomForest') %do% {
                      randomForest(x, y, ntree=ntree)}
   

New code:

    treesPerCore <- round(rfTrees / ncores)
    model <- foreach(ntree=rep(treesPerCore, ncores), .combine=combine, .multicombine=TRUE,
                     .packages='randomForest') %do% {
                       randomForest(x, y, ntree=ntree, nodesize=15)}

Before I was growing rfTrees on each processor. Now, I am growing rfTrees in total, with 1/nCores on each processor.

@scottsfarley93
Copy link
Member Author

It looks like ^ fixed this. Need to test with larger trees. AUC looks OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant