Auto device: API changes, bug fixes, README.md
- Change :cuda(device) overload to :cudaOn(device)
- Add :cloneOn(device)
- Fix bug in +,-,*,/ metamethods: checkGPU wasn't being called on these
  metamethods.
- Add description of auto-device mode to README.md
adamlerer committed Apr 30, 2015
1 parent 911f1fa commit d88ac24
Showing 6 changed files with 139 additions and 49 deletions.
37 changes: 22 additions & 15 deletions README.md
@@ -8,13 +8,15 @@ Cutorch provides the following:
- a new tensor type: `torch.CudaTensor` that acts like `torch.FloatTensor`, but all its operations run on the GPU. Most tensor operations are supported by cutorch; the few that are still missing are being implemented and tracked here: https://github.com/torch/cutorch/issues/70
- `cutorch.*` - Functions to set/get GPU, get device properties, memory usage, set/get low-level streams, set/get random number generator's seed, synchronization etc. They are described in more detail below.

<a name="cutorch.cudatensor"/>
### torch.CudaTensor
This new tensor type behaves exactly like a `torch.FloatTensor`, but has a couple of extra functions of note:
- `t:getDevice()` - Given a CudaTensor `t`, you can call `:getDevice()` on it to find out the GPU ID on which the tensor's memory is allocated.
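
For example (a minimal sketch; it assumes at least two GPUs, and the device number is illustrative):

```lua
cutorch.setDevice(2)
local t = torch.CudaTensor(100)
print(t:getDevice())  -- 2: the tensor was allocated on GPU 2
```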

<a name="cutorch.api"/>
### `cutorch.*` API
- `cutorch.synchronize()` : All of the CUDA API is asynchronous (barring a few functions), which means that you can queue up operations. To wait for queued operations to finish, call `cutorch.synchronize()`; it blocks until all GPU operations on the current GPU have completed.
- `cutorch.setDevice(i)` : If one has multiple-GPUs, you can switch the default GPU (to allocate CUDA tensors and do operations). The GPU IDs are 1-indexed, so having 4 GPUs means, you can setDevice(1), setDevice(2), setDevice(3), setDevice(4).
- `cutorch.setDevice(i)` : If you have multiple GPUs, you can switch the default GPU (the one on which CUDA tensors are allocated and operations are run). GPU IDs are 1-indexed, so with 4 GPUs you can call setDevice(1), setDevice(2), setDevice(3), or setDevice(4). Alternatively, you can use [auto-device mode](#cutorch.api.autodevice).
- `idx = cutorch.getDevice()` : Returns the currently set GPU device index.
- `count = cutorch.getDeviceCount()` : Gets the number of available GPUs.
- `totalMemory, freeMemory = cutorch.getMemoryUsage(devID)` : Gets the total and free memory in bytes for the given device ID.
@@ -26,9 +28,23 @@ This new tensor type behaves exactly like a `torch.FloatTensor`, but has a coupl
- `cutorch.getRNGState([device])` - returns the current RNG state in the form of a byte tensor, for the current or specified device.
- `cutorch.setRNGState(state [, device])` - Sets the RNG state from a previously saved state, on the current or specified device.
- `cutorch.getState()` - Returns the global state of the cutorch package. This state is not for users, it stores the raw RNG states, cublas handles and other thread and device-specific stuff.
- `cutorch.withDevice(devID, f)` - A convenience for multi-GPU code that takes a device ID and a function `f`: it switches cutorch to the given device, executes `f`, and then switches cutorch back to the original device (see the example below). Alternatively, you can use [auto-device mode](#cutorch.api.autodevice).

- `cutorch.withDevice(devID, f)` - This is a convenience for multi-GPU code, that takes in a device ID as well as a function f. It switches cutorch to the new device, executes the function f, and switches back cutorch to the original device.
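
A minimal sketch of `cutorch.withDevice` (assumes at least two GPUs; device IDs are illustrative):

```lua
cutorch.setDevice(1)
local dst
cutorch.withDevice(2, function() dst = torch.CudaTensor(100):fill(1) end)
print(dst:getDevice())      -- 2: allocated while GPU 2 was the current device
print(cutorch.getDevice())  -- 1: the original device has been restored
```
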
<a name="cutorch.api.autodevice"/>
#### Auto-device mode

Computations on CUDA tensors must be run on the CUDA device where the tensor resides. Running a computation on a tensor from the wrong device will lead to a cutorch error.

If the device is set to 0 (via `cutorch.setDevice(0)`), cutorch automatically determines which device to run each computation on. In this mode, tensors must be created with the `torch.CudaTensorOn(device, ...)`, `:cudaOn(device)`, and `:cloneOn(device)` convenience methods.

```lua
cutorch.setDevice(0)
local t1 = torch.CudaTensorOn(2, 1000) -- on device 2
local t2 = torch.Tensor(1000):cudaOn(3) -- on device 3
local t3 = t1 + 1 -- on device 2
```
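
Auto-device mode does not relax the rule above: mixing tensors from different devices in a single operation is still an error, which can be caught with `pcall` (a sketch mirroring the tests added in this commit; assumes at least two GPUs):

```lua
cutorch.setDevice(0)                    -- auto-device mode
local a = torch.CudaTensorOn(1, 1000)   -- on device 1
local b = torch.CudaTensorOn(2, 1000)   -- on device 2
local ok = pcall(function() return a + b end)
print(ok)  -- false: the operands live on different GPUs
```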

<a name="cutorch.api.streams"/>
#### Low-level stream functions (don't use these as a user; it's easy to shoot yourself in the foot):
- `cutorch.reserveStreams(n)`: creates n user streams for use on every device.
- `n = cutorch.getNumStreams()`: returns the number of user streams available on every device. By default, this is `0`, meaning only the default stream (stream 0) is available.
@@ -41,7 +57,7 @@ This new tensor type behaves exactly like a `torch.FloatTensor`, but has a coupl
- `cutorch.streamBarrierMultiDevice({[device]={streamsToWaitOn...}...})`: As with streamBarrier but allows barriers between streams on arbitrary devices. Creates a cross-device N-to-N-way barrier between all (device, stream) values listed.
- `cutorch.streamSynchronize(stream)`: equivalent to `cudaStreamSynchronize(stream)` for the current device. Blocks the CPU until stream completes its queued kernels/events.
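
A hedged sketch of how these functions fit together (`cutorch.setStream` belongs to the part of the stream API elided from this hunk, so treat the exact calls as assumptions; stream 0 is the default stream):

```lua
cutorch.reserveStreams(1)                  -- create user stream 1 on every device
cutorch.setStream(1)                       -- assumed API: queue subsequent work on stream 1
local t = torch.CudaTensor(1000):fill(1)   -- queued asynchronously on stream 1
local u = t * 2
cutorch.streamSynchronize(1)               -- block the CPU until stream 1 finishes its queued kernels
cutorch.setStream(0)                       -- back to the default stream
```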

##### Common Examples
#### Common Examples
Transferring a FloatTensor `src` to the GPU:
```lua
dest = src:cuda() -- dest is on the current GPU
```

@@ -50,21 +66,12 @@
Allocating a tensor on a given GPU:
Allocate `src` on GPU 3
```lua
cutorch.setDevice(3)
src = torch.CudaTensor(100)
src = torch.CudaTensorOn(3, 100)
```

Copying a CUDA tensor from one GPU to another:
Given a tensor called `src` on GPU 1, if you want to create it's clone on GPU 2, then:
Given a tensor called `src` on GPU 1, if you want to create its clone on GPU 2, then:

```lua
cutorch.setDevice(2)
local dest = src:clone()
```

OR

```
local dest
cutorch.withDevice(2, function() dest = src:clone() end)
local dest = src:cloneOn(2)
```
62 changes: 36 additions & 26 deletions Tensor.lua
@@ -30,20 +30,21 @@ end
local function Tensor__typeAs(self,tensor)
return self:type(tensor:type())
end
local function Tensor__cuda(self,device)
if device ~= nil then
local curDev = cutorch.getDevice()
cutorch.setDevice(device)
local res = self:type('torch.CudaTensor')
if res:nElement() == 0 then
res:setDevice(device)
end
cutorch.setDevice(curDev)
return res
else
return self:type('torch.CudaTensor')
local function Tensor__cuda(self)
return self:type('torch.CudaTensor')
end

local function Tensor__cudaOn(self, device)
local curDev = cutorch.getDevice()
cutorch.setDevice(device)
local res = self:type('torch.CudaTensor')
if res:nElement() == 0 then
res:setDevice(device)
end
cutorch.setDevice(curDev)
return res
end

local function Tensor__double(self)
return self:type('torch.DoubleTensor')
end
@@ -72,23 +73,32 @@ local function Tensor__long(self)
end

rawset(torch.getmetatable('torch.DoubleTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.FloatTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ByteTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CharTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.IntTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ShortTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.LongTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CudaTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.FloatTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ByteTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CharTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.IntTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ShortTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.LongTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CudaTensor'), 'cuda', Tensor__cuda)

rawset(torch.getmetatable('torch.DoubleTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.FloatTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.ByteTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.CharTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.IntTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.ShortTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.LongTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.CudaTensor'), 'cudaOn', Tensor__cudaOn)

rawset(torch.getmetatable('torch.CudaTensor'), 'type', Tensor__type)
rawset(torch.getmetatable('torch.CudaTensor'), 'type', Tensor__type)
rawset(torch.getmetatable('torch.CudaTensor'), 'typeAs', Tensor__typeAs)
rawset(torch.getmetatable('torch.CudaTensor'), 'double', Tensor__double)
rawset(torch.getmetatable('torch.CudaTensor'), 'float', Tensor__float)
rawset(torch.getmetatable('torch.CudaTensor'), 'byte', Tensor__byte)
rawset(torch.getmetatable('torch.CudaTensor'), 'char', Tensor__char)
rawset(torch.getmetatable('torch.CudaTensor'), 'int', Tensor__int)
rawset(torch.getmetatable('torch.CudaTensor'), 'short', Tensor__short)
rawset(torch.getmetatable('torch.CudaTensor'), 'long', Tensor__long)
rawset(torch.getmetatable('torch.CudaTensor'), 'float', Tensor__float)
rawset(torch.getmetatable('torch.CudaTensor'), 'byte', Tensor__byte)
rawset(torch.getmetatable('torch.CudaTensor'), 'char', Tensor__char)
rawset(torch.getmetatable('torch.CudaTensor'), 'int', Tensor__int)
rawset(torch.getmetatable('torch.CudaTensor'), 'short', Tensor__short)
rawset(torch.getmetatable('torch.CudaTensor'), 'long', Tensor__long)

do
local metatable = torch.getmetatable('torch.CudaTensor')
5 changes: 5 additions & 0 deletions TensorOperator.c
@@ -8,6 +8,7 @@ static int cutorch_CudaTensorOperator___add__(lua_State *L)
THCudaTensor *tensor2 = luaT_toudata(L, 2, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 2, tensor1, tensor2));

if(!tensor1 && !tensor2)
luaL_error(L, "expecting two Tensors or one Tensor and one number");
@@ -44,6 +45,7 @@ static int cutorch_CudaTensorOperator___sub__(lua_State *L)
THCudaTensor *tensor2 = luaT_toudata(L, 2, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 2, tensor1, tensor2));

if(!tensor1 && !tensor2)
luaL_error(L, "expecting two Tensors or one Tensor and one number");
@@ -79,6 +81,7 @@ static int cutorch_CudaTensorOperator___unm__(lua_State *L)
THCudaTensor *tensor = luaT_checkudata(L, 1, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 1, tensor));

r = THCudaTensor_new(state);
luaT_pushudata(L, r, "torch.CudaTensor");
@@ -95,6 +98,7 @@ static int cutorch_CudaTensorOperator___mul__(lua_State *L)
THCudaTensor *tensor2 = luaT_toudata(L, 2, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 2, tensor1, tensor2));

if(!tensor1 && !tensor2)
luaL_error(L, "expecting two Tensors or one Tensor and one number");
@@ -146,6 +150,7 @@ static int cutorch_CudaTensorOperator___div__(lua_State *L)
THCudaTensor *tensor = luaT_checkudata(L, 1, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 1, tensor));

luaL_argcheck(L, lua_isnumber(L,2), 2, "number expected");

3 changes: 3 additions & 0 deletions lib/THC/THCTensor.c
@@ -770,6 +770,9 @@ int THCudaTensor_checkGPU(THCState *state, unsigned int nTensors, ...)
va_start(args, nTensors);
for (unsigned int i = 0; i < nTensors; i++) {
THCudaTensor* tensor = va_arg(args, THCudaTensor*);
if(tensor == NULL) {
continue;
}
int tensorDev = THCudaTensor_getDevice(state, tensor);
if (tensorDev != THC_DEVICE_NONE) {
if (kernelDev != tensorDev && kernelDev != THC_DEVICE_NONE) {
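
The NULL check matters because the arithmetic metamethods above now pass both operands straight to `checkGPU`, and `luaT_toudata` returns NULL when an operand is a plain Lua number; skipping NULLs keeps tensor-number expressions working (a minimal Lua sketch):

```lua
local t = torch.CudaTensor(10):fill(1)
local r = t * 2  -- inside __mul__, tensor2 is NULL; checkGPU skips it
```
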
64 changes: 56 additions & 8 deletions test/test.lua
@@ -1475,7 +1475,7 @@ function test.get_device()
-- 1. assign empty tensor to device, resize in auto mode
local tensors = { }
for i = 1,device_count do
table.insert(tensors, torch.Tensor():cuda(i))
table.insert(tensors, torch.Tensor():cudaOn(i))
end
cutorch.setDevice(0) -- auto
tester:assert(cutorch.getDevice() == 0)
@@ -1488,7 +1488,7 @@

-- 3. create tensor on device; resize on different device. Should be an error
if device_count >= 2 then
local t = torch.Tensor():cuda(1)
local t = torch.Tensor():cudaOn(1)
tester:assert(t:getDevice() == 1)
cutorch.setDevice(2)
local ok, err = pcall(function() t:resize(1,2,3) end)
@@ -1525,29 +1525,39 @@ function test.tensor_device()
cutorch.setDevice(curDev)
for tensorDev = 1,2 do

-- cudaTensorOn, unallocated
local t1 = torch.CudaTensorOn(tensorDev)
tester:assert(t1:getDevice() == tensorDev)
tester:assert(t1:nDimension() == 0)
t1:resize(0)
tester:assert(t1:getDevice() == tensorDev)

-- cudaTensorOn, allocated
local t2 = torch.CudaTensorOn(tensorDev,2,3)
tester:assert(t2:getDevice() == tensorDev)
tester:assert(t2:nDimension() == 2 and t2:size(1) == 2 and t2:size(2) == 3)
ok = pcall(function() t2 = t2:zero() end)
tester:assert(ok == (curDev == 0 or curDev == tensorDev))

local t3 = torch.FloatTensor(10,10):zero():cuda(tensorDev)
-- cudaOn, unallocated
local t3 = torch.FloatTensor():cudaOn(tensorDev)
tester:assert(t3:getDevice() == tensorDev)
tester:assert(t3:nDimension() == 2 and t3:size(1) == 10 and t3:size(2) == 10)
tester:assert(t3:nDimension() == 0)
tester:assert(cutorch.getDevice() == curDev)

-- cudaOn, allocated
local t3 = torch.FloatTensor(10,10):zero():cudaOn(tensorDev)
tester:assert(t3:getDevice() == tensorDev)
tester:assert(t3:nDimension() == 2 and t3:size(1) == 10 and t3:size(2) == 10)
tester:assert(cutorch.getDevice() == curDev)

-- setDevice
t3:setDevice(tensorDev)
ok, err = pcall(function() t3:setDevice(3-tensorDev) end)
tester:assert(not ok, "setDevice should not work on non-empty tensor")
tester:assert(err:find("Use copy"))

-- clone, allocated
local t1 = torch.CudaTensorOn(tensorDev, 10)
tester:assert(t1:getDevice() == tensorDev)
torch.CudaTensorOn(3-tensorDev,10) -- decoy
@@ -1558,9 +1568,23 @@ function test.tensor_device()
tester:assert(t2:getDevice() == curDev)
end

-- clone, unallocated
local t1 = torch.CudaTensorOn(tensorDev)
local t2 = t1:clone()
tester:assert(t2:getDevice() == 0)

-- cloneOn, allocated
local t1 = torch.CudaTensorOn(tensorDev, 10)
tester:assert(t1:getDevice() == tensorDev)
local t2 = t1:cloneOn(3-tensorDev)
tester:assert(t2:getDevice() == 3-tensorDev)

-- cloneOn, unallocated
local t1 = torch.CudaTensor()
tester:assert(t1:getDevice() == 0)
torch.CudaTensorOn(tensorDev,10) -- decoy
local t2 = t1:cloneOn(3-tensorDev)
tester:assert(t2:getDevice() == 3-tensorDev)
end
end
end
@@ -1581,6 +1605,22 @@ function test.mm_multi_device()
end
end

function test.metamethod_multi_device()
local device_count = cutorch.getDeviceCount()
if device_count >= 2 then
cutorch.setDevice(0)
local a = torch.CudaTensorOn(1, 10)
torch.CudaTensorOn(2) -- decoy
local b = a / 2
tester:assert( b:getDevice() == 1 )

local c = torch.CudaTensorOn(2, 10)
local ok, err = pcall(function() return a + c end)
tester:assert(not ok)
tester:assert(err:find("checkGPU"))
end
end

function test.storage_device()
-- test storage:setDevice()
cutorch.setDevice(0)
Expand Down Expand Up @@ -2224,10 +2264,18 @@ for k,v in pairs(test) do
end
end

function cutorch.test(tests)
math.randomseed(os.time())
torch.manualSeed(os.time())
cutorch.manualSeedAll(os.time())
function initSeed(seed)
seed = seed or os.time()
-- ensure that you can reproduce a failing test
print('seed: ', seed)
math.randomseed(seed)
torch.manualSeed(seed)
cutorch.manualSeedAll(seed)
end


function cutorch.test(tests, seed)
initSeed(seed)
tester = torch.Tester()
tester:add(test)
tester:run(tests)
17 changes: 17 additions & 0 deletions torch/generic/Tensor.c
@@ -240,6 +240,22 @@ static int torch_Tensor_(clone)(lua_State *L)
return 1;
}

static int torch_Tensor_(cloneOn)(lua_State *L)
{
THCState *state = cutorch_getstate(L);
THTensor *self = luaT_checkudata(L, 1, torch_Tensor);
int device = luaL_checkint(L, 2)-1;
int oldDev = -1;
THCudaCheck(cudaGetDevice(&oldDev));
THCudaCheck(cudaSetDevice(device));
self = THTensor_(newClone)(cutorch_getstate(L), self);
THCudaCheck(cudaSetDevice(oldDev));
// in case it's size 0, make sure the device is set
THTensor_(setDevice)(state, self, device);
luaT_pushudata(L, self, torch_Tensor);
return 1;
}

static int torch_Tensor_(contiguous)(lua_State *L)
{
THTensor *self = luaT_checkudata(L, 1, torch_Tensor);
@@ -1206,6 +1222,7 @@ static const struct luaL_Reg torch_Tensor_(_) [] = {
{"storage", torch_Tensor_(storage)},
{"storageOffset", torch_Tensor_(storageOffset)},
{"clone", torch_Tensor_(clone)},
{"cloneOn", torch_Tensor_(cloneOn)},
{"contiguous", torch_Tensor_(contiguous)},
{"resizeAs", torch_Tensor_(resizeAs)},
{"resize", torch_Tensor_(resize)},
