
Feature extraction mode running slow #293

Open
Marcteen opened this issue Dec 18, 2017 · 7 comments

Comments

@Marcteen

Hi there. I'm using CaffeOnSpark to extract deep features (dimension is 4096) from pictures. The model I use is vgg_face, and the content of solver.prototxt is

net: "VGG_FACE_deploy.prototxt" type: "Adam" test_iter: 30 test_interval: 5000 base_lr: 0.000001 momentum: 0.9 momentum2: 0.999 lr_policy: "fixed" gamma:0.8 stepsize:100000 display: 2500 max_iter: 1500000 snapshot: 5000 snapshot_prefix: "faceId-snap" solver_mode: CPU

and the spark submit command is

spark-submit --master yarn --deploy-mode cluster \
    --driver-memory 3g \
    --driver-cores 2 \
    --num-executors 100 \
    --executor-cores 1 \
    --executor-memory 2g \
    --files /.../adam_solver.prototxt,/.../VGG_FACE_deploy.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -features fc7 \
    -clusterSize 100 \
    -label label \
    -conf adam_solver.prototxt \
    -connection ethernet \
    -model hdfs:///.../VGG_FACE.caffemodel \
    -output hdfs:///.../vggFaces

When I use a sequence file consisting of 3k images with 20 executors (clusterSize set to 20, of course), feature extraction finishes in 4 minutes. But when I process the sequence file with 450k images using 100 executors, it just keeps running, already past 13 hours (no idea how long it would really take). For comparison: 3k images on 20 executors is 150 images per executor done in 4 minutes, while 450k images on 100 executors is 4,500 per executor, which would be roughly 2 hours if per-executor throughput stayed the same. The deep conv net is heavy on CPU, so I expect it to take time, but the cost here does not seem reasonable. Maybe I've made a mistake somewhere. Any help will be appreciated!!!

@Marcteen changed the title from "Feature extraction model running slow" to "Feature extraction mode running slow" on Dec 18, 2017
@junshi15
Collaborator

If you have access to the executors, go there and check the CPU usage etc. I suspect the job is stuck.

For feature extraction, make sure you set batch size to 1 in your prototxt file.
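
For reference, a minimal sketch of what a test-phase data layer with batch size 1 could look like (this is a generic MemoryData-style layer with placeholder dimensions, not the actual contents of VGG_FACE_deploy.prototxt; match whatever input layer your deploy prototxt really defines):

# sketch only: generic test-phase input layer with batch size 1
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include { phase: TEST }
  memory_data_param {
    batch_size: 1     # one image per forward pass during feature extraction
    channels: 3       # placeholder dimensions for a 224x224 RGB input
    height: 224
    width: 224
  }
}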

@Marcteen
Author

@junshi15 Thanks for the response. Yes, I did set the batch size to 1 for the data layer in the test phase. I checked a host with a hung executor task; the "%CPU" column of the "top" output shows a Java process using 1034% CPU (each worker has 12 cores). I have 13 Spark slaves, and I use 120 executors, each with 1 core. Can I set more cores per executor in feature extraction mode?
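
Concretely, I mean changing only the resource flags on the submit command, something like the sketch below (the executor and core counts are just illustrative, not a tested configuration for CaffeOnSpark):

# illustrative only: fewer executors with more cores each (untested with CaffeOnSpark)
spark-submit --master yarn --deploy-mode cluster \
    --num-executors 30 \
    --executor-cores 4 \
    --executor-memory 2g \
    ...   (remaining flags as in the original command)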

Then I restarted the application with only 20 executors and checked the stage detail of the Spark job. I found that "Input Size / Records" rises fast at the beginning of the whole job, and when each task's record count reaches about 1k (similar to the case using 120 executors), its speed drops immediately. I notice that there is a persist(DISK_ONLY) and a count() of the feature-values RDD in CaffeOnSpark.scala. Maybe there is something wrong with the persisting?
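
For context, the pattern I mean looks roughly like the sketch below (the names and the dummy feature computation are illustrative, not the exact CaffeOnSpark.scala source). With DISK_ONLY persistence, every extracted feature row is serialized and written to local disk before count() materializes the RDD, so those disk writes could be what slows tasks down once their output grows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Rough sketch of the persist + count pattern referred to above;
// the dummy feature computation stands in for running the Caffe net.
val spark = SparkSession.builder.appName("feature-extraction-sketch").getOrCreate()
val sc = spark.sparkContext

val inputRDD = sc.parallelize(1 to 450000)      // stand-in for the 450k-record image sequence file
val featureRDD = inputRDD.map { _ =>
  Array.fill(4096)(0.0f)                        // stand-in for the 4096-dim fc7 feature of one image
}
featureRDD.persist(StorageLevel.DISK_ONLY)      // each feature row is serialized and written to local disk
val n = featureRDD.count()                      // materializes the RDD, forcing extraction plus disk writes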

Here is sorted info about the tasks:

[screenshot: task list sorted by Input Size / Records]

I checked a host (like hybrid14) with a faster-running task (more records); its %CPU in "top" is 1057, while another (like hybrid10) shows 669. I found that if there are two executors on the same host, both of them are slower than a single executor running alone on a host. Any idea?

[screenshot: executor CPU usage on hybrid14 vs hybrid10]

Then I reloaded the stage detail page continuously; I can see the record count rise by about 20+ each time, and the input size also rises slowly, by about 0.3 MB. Is that normal? The whole input file should be 4.1 GB.
[screenshot: stage detail page showing input size / records per task]

@junshi15
Collaborator

I don't know where the problem is. I only use Yarn mode, which sets spark.executor.cores to 1. So one core per executor. I am not sure what will happen if you have more than one core per executor.

@Marcteen
Author

@junshi15, I also use Yarn mode. I notice that when I use 20 executors, the image sequenceFile is split into 30 tasks, so some executors have more than one task to run. And I am sure that each time an executor takes a new task (RDD partition), the first 1k+ records always finish in a few minutes, then it turns into a slow pattern. Have you ever tried the feature mode? I think that if I split the data into pieces and submit several Spark applications, it would be faster than processing them all in one.

@junshi15
Collaborator

It has been a while since I used CaffeOnSpark. I don't remember any problem with the "features" mode. Synchronization between executors is not required in this mode, so it is OK for some executors to have more partitions.

@umeshnet88

umeshnet88 commented Dec 19, 2017 via email

@umeshnet88

umeshnet88 commented Dec 19, 2017 via email
