This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

BUG() triggered during vfork and clone #2672

Open
llly opened this issue Sep 1, 2021 · 15 comments

llly (Contributor) commented Sep 1, 2021

I encountered a BUG() when migrating a Java program from an old Graphene version to the latest 1.2-rc. It reproduces every time.
https://github.com/oscarlab/graphene/blob/f7aa86ed3ae4bbb99480d16bab98c87625b778c9/LibOS/shim/src/ipc/shim_ipc_pid.c#L150-L153
The Java program calls fork() several times to run other system commands. The other forks succeed. I cannot root-cause this one.
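
For context, running an external command from Java boils down to the classic fork-plus-exec pattern (the trace below shows the JVM going through vfork() for this). A minimal standalone C sketch of that pattern, not the actual JVM code:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Classic fork+exec: roughly what Runtime.exec()/ProcessBuilder do internally. */
int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: replace the process image with the command. */
        execlp("echo", "echo", "hello from child", (char*)NULL);
        _exit(127); /* reached only if exec fails */
    }
    /* Parent: wait for the child to finish. */
    int status;
    waitpid(pid, &status, 0);
    printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}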

Here is the log:

[P4346:T2:java] trace: ---- shim_pipe([261, 262]) = 0x0
[P4346:T2:java] debug: Creating pipe: pipe.srv:cdeeee9b1b1809d141b4eb2a2f44a479e3205028ae9c22b74ebc17f5a913e5e6
debug: sock_getopt (fd = 290, sockopt addr = 0x7ff040a55f70) is not implemented and always returns 0
debug: sock_getopt (fd = 297, sockopt addr = 0x7ff040a55f70) is not implemented and always returns 0
debug: sock_getopt (fd = 298, sockopt addr = 0x7ff040a55f70) is not implemented and always returns 0
[P4346:T2:java] trace: ---- shim_pipe([263, 264]) = 0x0
[P4346:T2:java] debug: Creating pipe: pipe.srv:dad74dd71d9053c6de1d92150d48349c3ce1e0b55d36ddb0e56c5e50654b5b9b
debug: sock_getopt (fd = 290, sockopt addr = 0x7ff040a55f70) is not implemented and always returns 0
debug: sock_getopt (fd = 299, sockopt addr = 0x7ff040a55f70) is not implemented and always returns 0
debug: sock_getopt (fd = 358, sockopt addr = 0x7ff040a55f70) is not implemented and always returns 0
[P4346:T2:java] trace: ---- shim_pipe([265, 266]) = 0x0
[P4346:T2:java] trace: ---- shim_vfork() ...
[P4346:T2:java] warning: vfork was called by the application, implemented as an alias to fork in Graphene
......
[P4346:T2:java] debug: complete checkpointing data
[P4346:T2:java] debug: checkpoint of 1531864 bytes created
......
[P5015::] debug: restored memory from checkpoint
[P5015::] debug: receiving 62 PAL handles
[P5015::] debug: restoring checkpoint at 0xba2000000 rebased from 0xba2000000
[P5015:T524:java] debug: successfully restored checkpoint at 0xba2000000 - 0xba2175fd8
[P5015:T524:java] debug: Creating pipe: pipe.srv:5015
debug: sock_getopt (fd = 368, sockopt addr = 0x7ffdbc7d6ca0) is not implemented and always returns 0
debug: sock_getopt (fd = 369, sockopt addr = 0x7ffdbc7d6ca0) is not implemented and always returns 0
[P5015:shim] debug: IPC worker started
debug: sock_getopt (fd = 288, sockopt addr = 0x7ff042096f70) is not implemented and always returns 0
......
[P5015:T524:java] debug: ipc_change_id_owner: sending a request (524..5015)
[P5015:T524:java] debug: Sending ipc message to 4346
[P5015:T524:java] debug: Waiting for a response to 1
[P4346:shim] debug: IPC worker: received IPC message from 5015: code=4 size=42 seq=1
[P4346:shim] debug: ID 524 unknown!
[P4346:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
error: Unknown or illegal instruction executed
[P4346:shim] error: Illegal instruction during Graphene internal execution at 0xfe4f70b3d (IP = +0x29b3d, VMID = 4346, TID = 0)
debug: DkProcessExit: Returning exit code 1
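
To illustrate the failure mode: the IPC leader keeps a table of the IDs it has handed out, and an ownership-change request for an ID it does not track is treated as a fatal internal error. A self-contained toy model of that check (invented names, not the actual code in shim_ipc_pid.c):

#include <stdio.h>
#include <stdlib.h>

/* Toy table of IDs known to the IPC leader. */
static unsigned int known_ids[] = { 2 }; /* e.g., the leader knows its own T2 */
static const size_t known_count = sizeof(known_ids) / sizeof(known_ids[0]);

static int id_is_known(unsigned int id) {
    for (size_t i = 0; i < known_count; i++)
        if (known_ids[i] == id)
            return 1;
    return 0;
}

/* Models the leader's handling of ipc_change_id_owner: an unknown ID
 * is "impossible" by design, hence the BUG()-style abort. */
static void change_id_owner(unsigned int id, unsigned int new_owner) {
    if (!id_is_known(id)) {
        fprintf(stderr, "debug: ID %u unknown!\nerror: BUG()\n", id);
        abort();
    }
    printf("ID %u is now owned by %u\n", id, new_owner);
}

int main(void) {
    change_id_owner(2, 4346);   /* known ID: succeeds */
    change_id_owner(524, 5015); /* unknown ID 524: aborts, as in the log above */
    return 0;
}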
dimakuv commented Sep 1, 2021

@llly Could you try it on the latest Graphene? After the v1.2-rc release, we had a bunch of bug fixes and small changes to the IPC (Inter-Process Communication) subsystem in Graphene.

In particular, when you move to the latest Graphene, you will notice more meaningful process IDs (instead of your current P4346 and P5015, there will be P1 for the main process, P2 for the first child process, etc.).

Hopefully this bug will disappear after you update. If not, please post a similar debug log for the updated run.

@boryspoplawski Please keep an eye on this too. Also, it looks like this message:

debug: ipc_change_id_owner: sending a request (524..5015)

should actually be:

debug: ipc_change_id_owner: sending a request (524 -> 5015)

(Meaning that TID 524 should get a new owner with PID 5015.)
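
(Assuming the message comes from a log_debug() call in ipc_change_id_owner(), the fix would only change the separator in the format string, along the lines of:

log_debug("ipc_change_id_owner: sending a request (%u -> %u)", id, new_owner);

where id and new_owner are placeholder variable names, not necessarily the ones used in the source.)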

llly (Contributor, Author) commented Sep 17, 2021

@dimakuv I can reproduce this failure on the latest Graphene trunk.
This time it is caused by cloning a process at the very beginning of a Python program.

[P2:T21:python] trace: ---- shim_clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0x0, 0x0, 0xcf740ca10, 0x0) ...
[P2:T21:python] debug: ipc_get_new_vmid: sending a request
[P2:T21:python] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: sending a request: [33..36]
[P2:T21:python] debug: Waiting for a response to 3
[P2:shim] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: ipc_send_message: 0
[P1:shim] debug: IPC worker: received IPC message from 2: code=1 size=34 seq=3
[P1:shim] debug: ipc_get_new_vmid_callback: 3
[P1:shim] debug: Sending ipc message to 2
[P1:shim] debug: IPC worker: received IPC message from 2: code=4 size=42 seq=0
[P2:shim] debug: IPC worker: received IPC message from 1: code=0 size=38 seq=3
[P1:shim] debug: ipc_release_id_range_callback: release_id_range(33..36)
[P2:T21:python] debug: Waiting finished: 0
[P2:shim] debug: Got an IPC response from 1, seq: 3
[P2:T21:python] debug: ipc_get_new_vmid: got a response: 3
......
[P2:T21:python] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P2:T21:python] debug: complete checkpointing data
[P2:T21:python] debug: checkpoint of 178440 bytes created
......
[P3:T37:python] debug: successfully restored checkpoint at 0xfb89fe000 - 0xfb8a29908
[P3:T37:python] debug: Creating pipe: pipe.srv:3
debug: sock_getopt (fd = 86, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 87, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
[P3:shim] debug: IPC worker started
debug: sock_getopt (fd = 330, sockopt addr = 0x7f5a84f25f70) is not implemented and always returns 0
[P3:T37:python] debug: ipc_change_id_owner: sending a request (37..3)
debug: sock_getopt (fd = 88, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 263, sockopt addr = 0x7f91a51eaf70) is not implemented and always returns 0
[P3:T37:python] debug: Sending ipc message to 1
[P3:T37:python] debug: Waiting for a response to 1
[P1:shim] debug: IPC worker: received IPC message from 3: code=5 size=42 seq=1
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
error: Unknown or illegal instruction executed
[P1:shim] error: Illegal instruction during Graphene internal execution at 0xfe4f7e91d (IP = +0x2891d, VMID = 1, TID = 0)
debug: DkProcessExit: Returning exit code 1
[P2:shim] debug: IPC leader disconnected
[P2:shim] debug: Unknown process (vmid: 0x1) disconnected
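
(Note: the CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD combination in the first trace line is exactly the flag set glibc's fork() passes to clone(), so this is an ordinary new-process fork, not a new thread. A minimal standalone C sketch issuing the same raw call, assuming the x86-64 clone() argument order:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t ctid; /* CLONE_CHILD_SETTID stores the child's TID here, in the child's copy */

int main(void) {
    /* Same flags as in the shim_clone() trace: an ordinary fork. */
    long pid = syscall(SYS_clone, CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID | SIGCHLD,
                       /*stack=*/NULL, /*parent_tid=*/NULL, /*child_tid=*/&ctid, /*tls=*/0UL);
    if (pid < 0) {
        perror("clone");
        return 1;
    }
    if (pid == 0) {
        printf("child: kernel stored tid %d\n", ctid);
        _exit(0);
    }
    waitpid((pid_t)pid, NULL, 0);
    return 0;
}
)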

llly changed the title from "BUG() triggered during vfork" to "BUG() triggered during vfork and clone" on Sep 17, 2021
dimakuv commented Sep 17, 2021

@llly How can we quickly reproduce it? What is the Python script you tried and is there anything special you did in python.manifest.template?

boryspoplawski (Contributor) commented:

Also what is the exact commit hash you've run this on?

glorysdj commented Sep 23, 2021

> Also what is the exact commit hash you've run this on?

The commit hash is fb71e43.
The same issue also happens on 1.2-rc1.
On the earlier commit 1b8848b, which predates 1.2-rc1, it works well.

glorysdj commented:

> @llly How can we quickly reproduce it? What is the Python script you tried and is there anything special you did in python.manifest.template?

Actually, we use bash to start the Java/Python programs. Please refer to this manifest:
https://github.com/intel-analytics/analytics-zoo/blob/master/ppml/trusted-big-data-ml/python/docker-graphene/bash.manifest.template

Any comments and help would be appreciated!

boryspoplawski (Contributor) commented:

That commit is 5 months old...
Please try the newest master from the new repository (at the time of writing: gramineproject/gramine@ff5a2da)

glorysdj commented:

> That commit is 5 months old...
> Please try the newest master from the new repository (at the time of writing: gramineproject/gramine@ff5a2da)

Sorry, I pasted the wrong commit; we are actually running on fb71e43 and encountering this issue.

dimakuv commented Sep 23, 2021

@glorysdj Unfortunately, the manifest you provided requires a very specific environment on the machine (Python 3.6, Java, Spark, something called zinc, etc.), so we won't be able to reproduce this.

Anyway, @boryspoplawski, do you have an idea what the debug log above could indicate? In particular, why this could trigger:

[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153

In other words, why wouldn't the main process P1 know all of its descendants' IDs?

boryspoplawski (Contributor) commented:

@dimakuv No idea; it sounds like a serious bug, but we've never seen it before...

@glorysdj @llly Could you please post a trace log of a full execution? Maybe then we could tell more (the provided snippet skips too many details).

glorysdj commented Sep 24, 2021

The trace log is attached: test-orca-tf-text-sgx.log

You can also run it with the Docker image below:

export ENCLAVE_KEY=xxx/enclave-key.pem
export LOCAL_IP=x.x.x.x

docker run -itd \
--privileged \
--net=host \
--cpuset-cpus="26-30" \
--oom-kill-disable \
--device=/dev/gsgx \
--device=/dev/sgx/enclave \
--device=/dev/sgx/provision \
-v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
--name=spark-local \
-e LOCAL_IP=$LOCAL_IP \
-e SGX_MEM_SIZE=32G \
intelanalytics/analytics-zoo-ppml-trusted-big-data-ml-python-graphene:0.12-k8s bash

Attach to the container:

docker exec -it spark-local bash
cd /ppml/trusted-big-data-ml/

Initialize the SGX manifest:

./init.sh

Run the workload:

graphene-sgx ./bash -c "/opt/jdk8/bin/java -cp \
  '/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar:${SPARK_HOME}/conf/:${SPARK_HOME}/jars/*' \
  -Xmx3g \
  org.apache.spark.deploy.SparkSubmit \
  --master 'local[4]' \
  --conf spark.driver.memory=3g \
  --conf spark.executor.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
  --conf spark.driver.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
  --properties-file /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/conf/spark-analytics-zoo.conf \
  --jars /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
  --py-files /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-python-api.zip \
  --executor-memory 3g \
  --executor-cores 2 \
  --driver-cores 2 \
  /ppml/trusted-big-data-ml/work/examples/pyzoo/orca/learn/tf/basic_text_classification/basic_text_classification.py \
  --cluster_mode local" | tee test-orca-tf-text-sgx.log

glorysdj commented:

@dimakuv @boryspoplawski Any insights?

boryspoplawski (Contributor) commented:

@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?

glorysdj commented Oct 8, 2021

> @glorysdj Could you check if gramineproject/gramine#109 fixes this issue?

Yes, we will try this.

glorysdj commented Oct 9, 2021

> @glorysdj Could you check if gramineproject/gramine#109 fixes this issue?

We have tried with the latest Gramine but encountered another issue when running a very simple Java program. We will try to summarize that issue.
