This guide introduces how to run TensorFlow jobs on OpenPAI. The following contents show some basic TensorFlow examples; other customized TensorFlow code can be run similarly:
- TensorFlow CIFAR-10 image classification
- TensorFlow ImageNet image classification
- Distributed TensorFlow CIFAR-10 image classification
- TensorFlow TensorBoard
To run these TensorFlow examples on OpenPAI, you need to prepare a job configuration file and submit it through the web portal.
OpenPAI packages the Docker environment required by the job. Refer to DOCKER.md to customize this example Docker environment. If you have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.tensorflow` with your own.
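For example, after writing a Dockerfile that extends the example image, building and pushing the customized image could look roughly like this (a minimal sketch; `<your-account>` is a placeholder for your Docker Hub account):

```sh
# Build a customized image from the Dockerfile in the current directory,
# tag it under your own Docker Hub account, and push it.
docker build -t <your-account>/pai.example.tensorflow .
docker push <your-account>/pai.example.tensorflow
```

Then set the `image` field in the job configuration to `<your-account>/pai.example.tensorflow`.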
Here are some configuration file examples:
TensorFlow CIFAR-10 image classification:

```js
{
  "jobName": "tensorflow-cifar10",
  "image": "openpai/pai.example.tensorflow",
  "dataDir": "/tmp/data",
  "outputDir": "/tmp/output",
  "taskRoles": [
    {
      "name": "cifar_train",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "git clone https://github.com/tensorflow/models && cd models/research/slim && python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir=$PAI_DATA_DIR && python train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
    }
  ]
}
```
TensorFlow ImageNet image classification:

```js
{
  "jobName": "tensorflow-imagenet",
  "image": "openpai/pai.example.tensorflow",
  // prepare the ImageNet dataset in TFRecord format following https://git.io/vFxjh and upload it to HDFS
  "dataDir": "$PAI_DEFAULT_FS_URI/path/data",
  // make a new directory for output on HDFS
  "outputDir": "$PAI_DEFAULT_FS_URI/path/output",
  // download the code from TensorFlow slim https://git.io/vFpef and upload it to HDFS
  "codeDir": "$PAI_DEFAULT_FS_URI/path/code",
  "taskRoles": [
    {
      "name": "imagenet_train",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "python code/train_image_classifier.py --batch_size=64 --model_name=inception_v3 --dataset_name=imagenet --dataset_split_name=train --dataset_dir=$PAI_DATA_DIR --train_dir=$PAI_OUTPUT_DIR"
    }
  ]
}
```
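The `dataDir`/`codeDir` comments above assume the dataset and code have been uploaded to HDFS beforehand. That step might look roughly like this (a sketch with placeholder paths; `<namenode>` stands for your cluster's default filesystem address):

```sh
# Upload the TFRecord dataset and the slim training code to HDFS so the job
# containers can read them through dataDir and codeDir (paths are placeholders).
hdfs dfs -mkdir -p hdfs://<namenode>/path/data hdfs://<namenode>/path/code hdfs://<namenode>/path/output
hdfs dfs -put ./imagenet-tfrecord/* hdfs://<namenode>/path/data
hdfs dfs -put ./slim/* hdfs://<namenode>/path/code
```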
Distributed TensorFlow CIFAR-10 image classification:

```js
{
  "jobName": "tensorflow-distributed-cifar10",
  "image": "openpai/pai.example.tensorflow",
  // download the CIFAR-10 dataset from http://www.cs.toronto.edu/~kriz/cifar.html and upload it to HDFS
  "dataDir": "$PAI_DEFAULT_FS_URI/path/data",
  // make a new directory for output on HDFS
  "outputDir": "$PAI_DEFAULT_FS_URI/path/output",
  // download the code from the TensorFlow benchmark https://git.io/vF4wT and upload it to HDFS
  "codeDir": "$PAI_DEFAULT_FS_URI/path/code",
  "taskRoles": [
    {
      "name": "ps_server",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=ps --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 16384,
      "gpuNumber": 4,
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=worker --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX",
      "minSucceededTaskCount": 2
    }
  ],
  "retryCount": 0
}
```
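In the distributed job, OpenPAI injects the parameter server and worker host lists, plus the current task's index, through the environment variables referenced in the commands above. A sketch of what they might expand to at runtime (hosts and ports are assigned by the cluster; the values below are illustrative only):

```sh
# Illustrative values only; OpenPAI fills these in per container.
echo "$PAI_TASK_ROLE_ps_server_HOST_LIST"         # e.g. 10.0.0.1:4001,10.0.0.2:4001
echo "$PAI_TASK_ROLE_worker_HOST_LIST"            # e.g. 10.0.0.3:4002,10.0.0.4:4002
echo "$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX"  # e.g. 0 for the first task of this role
```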
TensorFlow TensorBoard:

```js
{
  "jobName": "tensorflow-tensorboard",
  "image": "openpai/pai.example.tensorflow",
  // prepare the checkpoints and logs to be visualized and upload them to HDFS
  "dataDir": "$PAI_DEFAULT_FS_URI/path/data",
  // prepare the visualization script tensorflow-tensorboard.sh and upload it to HDFS
  "codeDir": "$PAI_DEFAULT_FS_URI/path/code",
  "taskRoles": [
    {
      "name": "tensorboard",
      "taskNumber": 1,
      "cpuNumber": 2,
      "memoryMB": 4096,
      "gpuNumber": 0,
      "command": "/bin/bash code/tensorflow-tensorboard.sh"
    }
  ]
}
```
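The job above only runs a script named `tensorflow-tensorboard.sh` from `codeDir`. A minimal sketch of what such a script could contain (the actual script may differ, and in practice the port must be one the OpenPAI job exposes):

```sh
#!/bin/bash
# Fetch the checkpoints/logs to be visualized from HDFS into the container,
# then serve them with TensorBoard (6006 is TensorBoard's default port).
hdfs dfs -get $PAI_DATA_DIR ./tensorboard-log
tensorboard --logdir=./tensorboard-log --port=6006
```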
For more details on how to write a job configuration file, please refer to the job tutorial.