Auto device: API changes, bug fixes, README.md
- Change :cuda(device) overload to :cudaOn(device)
- Add :cloneOn(device)
- Fix bug in +,-,*,/ metamethods: checkGPU wasn't being called on these
  metamethods.
- Add description of auto-device mode to README.md
adamlerer committed Apr 30, 2015
1 parent 911f1fa commit d88ac24
Showing 6 changed files with 139 additions and 49 deletions.
37 changes: 22 additions & 15 deletions README.md
@@ -8,13 +8,15 @@ Cutorch provides the following:
- a new tensor type: `torch.CudaTensor` that acts like `torch.FloatTensor`, but all its operations run on the GPU. Most tensor operations are supported by cutorch; the few that are still missing are being implemented and tracked here: https://github.com/torch/cutorch/issues/70
- `cutorch.*` - Functions to set/get GPU, get device properties, memory usage, set/get low-level streams, set/get random number generator's seed, synchronization etc. They are described in more detail below.

<a name="cutorch.cudatensor"/>
### torch.CudaTensor
This new tensor type behaves exactly like a `torch.FloatTensor`, but has a couple of extra functions of note:
- `t:getDevice()` - Given a CudaTensor `t`, you can call `:getDevice()` on it to find out the GPU ID on which the tensor's memory is allocated.
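
For example (a minimal sketch; it assumes at least two GPUs, and the device number is illustrative):

```lua
cutorch.setDevice(2)
local t = torch.CudaTensor(100)
print(t:getDevice())  -- 2: the tensor was allocated on GPU 2
```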

<a name="cutorch.api"/>
### `cutorch.*` API
- `cutorch.synchronize()` : All of the CUDA API is asynchronous (barring a few functions), which means that you can queue up operations. To wait for queued operations to finish, call `cutorch.synchronize()`; it blocks until all GPU operations on the current GPU have completed.
- `cutorch.setDevice(i)` : If one has multiple-GPUs, you can switch the default GPU (to allocate CUDA tensors and do operations). The GPU IDs are 1-indexed, so having 4 GPUs means, you can setDevice(1), setDevice(2), setDevice(3), setDevice(4).
- `cutorch.setDevice(i)` : If you have multiple GPUs, you can switch the default GPU (the one on which CUDA tensors are allocated and operations are run). GPU IDs are 1-indexed, so with 4 GPUs you can call setDevice(1), setDevice(2), setDevice(3), or setDevice(4). Alternatively, you can use [auto-device mode](#cutorch.api.autodevice).
- `idx = cutorch.getDevice()` : Returns the currently set GPU device index.
- `count = cutorch.getDeviceCount()` : Gets the number of available GPUs.
- `totalMemory, freeMemory = cutorch.getMemoryUsage(devID)` : Gets the total and free memory in bytes for the given device ID.
@@ -26,9 +28,23 @@ This new tensor type behaves exactly like a `torch.FloatTensor`, but has a coupl
- `cutorch.getRNGState([device])` - returns the current RNG state in the form of a byte tensor, for the current or specified device.
- `cutorch.setRNGState(state [, device])` - Sets the RNG state from a previously saved state, on the current or specified device.
- `cutorch.getState()` - Returns the global state of the cutorch package. This state is not for users, it stores the raw RNG states, cublas handles and other thread and device-specific stuff.
- `cutorch.withDevice(devID, f)` - A convenience for multi-GPU code that takes a device ID and a function `f`: it switches cutorch to the given device, executes `f`, and then switches cutorch back to the original device (see the example below). Alternatively, you can use [auto-device mode](#cutorch.api.autodevice).

- `cutorch.withDevice(devID, f)` - This is a convenience for multi-GPU code, that takes in a device ID as well as a function f. It switches cutorch to the new device, executes the function f, and switches back cutorch to the original device.
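
A minimal sketch of `cutorch.withDevice` (assumes at least two GPUs; device IDs are illustrative):

```lua
cutorch.setDevice(1)
local dst
cutorch.withDevice(2, function() dst = torch.CudaTensor(100):fill(1) end)
print(dst:getDevice())      -- 2: allocated while GPU 2 was the current device
print(cutorch.getDevice())  -- 1: the original device has been restored
```
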
<a name="cutorch.api.autodevice"/>
#### Auto-device mode

Computations on CUDA tensors must be run on the CUDA device where the tensor resides. Running a computation on a tensor from the wrong device will lead to a cutorch error.

If the device is set to 0 (via `cutorch.setDevice(0)`), cutorch automatically determines which device to run each computation on. In this mode, tensors must be created with the `torch.CudaTensorOn(device, ...)`, `:cudaOn(device)`, and `:cloneOn(device)` convenience methods.

```lua
cutorch.setDevice(0)
local t1 = torch.CudaTensorOn(2, 1000) -- on device 2
local t2 = torch.Tensor(1000):cudaOn(3) -- on device 3
local t3 = t1 + 1 -- on device 2
```
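
Auto-device mode does not relax the rule above: mixing tensors from different devices in a single operation is still an error, which can be caught with `pcall` (a sketch mirroring the tests added in this commit; assumes at least two GPUs):

```lua
cutorch.setDevice(0)                    -- auto-device mode
local a = torch.CudaTensorOn(1, 1000)   -- on device 1
local b = torch.CudaTensorOn(2, 1000)   -- on device 2
local ok = pcall(function() return a + b end)
print(ok)  -- false: the operands live on different GPUs
```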

<a name="cutorch.api.streams"/>
#### Low-level stream functions (don't use these as a user; it's easy to shoot yourself in the foot):
- `cutorch.reserveStreams(n)`: creates n user streams for use on every device.
- `n = cutorch.getNumStreams()`: returns the number of user streams available on every device. By default, this is `0`, meaning only the default stream (stream 0) is available.
@@ -41,7 +57,7 @@ This new tensor type behaves exactly like a `torch.FloatTensor`, but has a coupl
- `cutorch.streamBarrierMultiDevice({[device]={streamsToWaitOn...}...})`: As with streamBarrier but allows barriers between streams on arbitrary devices. Creates a cross-device N-to-N-way barrier between all (device, stream) values listed.
- `cutorch.streamSynchronize(stream)`: equivalent to `cudaStreamSynchronize(stream)` for the current device. Blocks the CPU until stream completes its queued kernels/events.
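
A hedged sketch of how these functions fit together (`cutorch.setStream` belongs to the part of the stream API elided from this hunk, so treat the exact calls as assumptions; stream 0 is the default stream):

```lua
cutorch.reserveStreams(1)                  -- create user stream 1 on every device
cutorch.setStream(1)                       -- assumed API: queue subsequent work on stream 1
local t = torch.CudaTensor(1000):fill(1)   -- queued asynchronously on stream 1
local u = t * 2
cutorch.streamSynchronize(1)               -- block the CPU until stream 1 finishes its queued kernels
cutorch.setStream(0)                       -- back to the default stream
```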

##### Common Examples
#### Common Examples
Transferring a FloatTensor `src` to the GPU:
```lua
dest = src:cuda() -- dest is on the current GPU
```

@@ -50,21 +66,12 @@
Allocating a tensor on a given GPU:
Allocate `src` on GPU 3
```lua
cutorch.setDevice(3)
src = torch.CudaTensor(100)
src = torch.CudaTensorOn(3, 100)
```

Copying a CUDA tensor from one GPU to another:
Given a tensor called `src` on GPU 1, if you want to create it's clone on GPU 2, then:
Given a tensor called `src` on GPU 1, if you want to create its clone on GPU 2, then:

```lua
cutorch.setDevice(2)
local dest = src:clone()
```

OR

```
local dest
cutorch.withDevice(2, function() dest = src:clone() end)
local dest = src:cloneOn(2)
```
62 changes: 36 additions & 26 deletions Tensor.lua
@@ -30,20 +30,21 @@ end
local function Tensor__typeAs(self,tensor)
return self:type(tensor:type())
end
local function Tensor__cuda(self,device)
if device ~= nil then
local curDev = cutorch.getDevice()
cutorch.setDevice(device)
local res = self:type('torch.CudaTensor')
if res:nElement() == 0 then
res:setDevice(device)
end
cutorch.setDevice(curDev)
return res
else
return self:type('torch.CudaTensor')
local function Tensor__cuda(self)
return self:type('torch.CudaTensor')
end

local function Tensor__cudaOn(self, device)
local curDev = cutorch.getDevice()
cutorch.setDevice(device)
local res = self:type('torch.CudaTensor')
if res:nElement() == 0 then
res:setDevice(device)
end
cutorch.setDevice(curDev)
return res
end

local function Tensor__double(self)
return self:type('torch.DoubleTensor')
end
@@ -72,23 +73,32 @@ local function Tensor__long(self)
end

rawset(torch.getmetatable('torch.DoubleTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.FloatTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ByteTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CharTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.IntTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ShortTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.LongTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CudaTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.FloatTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ByteTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CharTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.IntTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.ShortTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.LongTensor'), 'cuda', Tensor__cuda)
rawset(torch.getmetatable('torch.CudaTensor'), 'cuda', Tensor__cuda)

rawset(torch.getmetatable('torch.DoubleTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.FloatTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.ByteTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.CharTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.IntTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.ShortTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.LongTensor'), 'cudaOn', Tensor__cudaOn)
rawset(torch.getmetatable('torch.CudaTensor'), 'cudaOn', Tensor__cudaOn)

rawset(torch.getmetatable('torch.CudaTensor'), 'type', Tensor__type)
rawset(torch.getmetatable('torch.CudaTensor'), 'type', Tensor__type)
rawset(torch.getmetatable('torch.CudaTensor'), 'typeAs', Tensor__typeAs)
rawset(torch.getmetatable('torch.CudaTensor'), 'double', Tensor__double)
rawset(torch.getmetatable('torch.CudaTensor'), 'float', Tensor__float)
rawset(torch.getmetatable('torch.CudaTensor'), 'byte', Tensor__byte)
rawset(torch.getmetatable('torch.CudaTensor'), 'char', Tensor__char)
rawset(torch.getmetatable('torch.CudaTensor'), 'int', Tensor__int)
rawset(torch.getmetatable('torch.CudaTensor'), 'short', Tensor__short)
rawset(torch.getmetatable('torch.CudaTensor'), 'long', Tensor__long)
rawset(torch.getmetatable('torch.CudaTensor'), 'float', Tensor__float)
rawset(torch.getmetatable('torch.CudaTensor'), 'byte', Tensor__byte)
rawset(torch.getmetatable('torch.CudaTensor'), 'char', Tensor__char)
rawset(torch.getmetatable('torch.CudaTensor'), 'int', Tensor__int)
rawset(torch.getmetatable('torch.CudaTensor'), 'short', Tensor__short)
rawset(torch.getmetatable('torch.CudaTensor'), 'long', Tensor__long)

do
local metatable = torch.getmetatable('torch.CudaTensor')
5 changes: 5 additions & 0 deletions TensorOperator.c
@@ -8,6 +8,7 @@ static int cutorch_CudaTensorOperator___add__(lua_State *L)
THCudaTensor *tensor2 = luaT_toudata(L, 2, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 2, tensor1, tensor2));

if(!tensor1 && !tensor2)
luaL_error(L, "expecting two Tensors or one Tensor and one number");
@@ -44,6 +45,7 @@ static int cutorch_CudaTensorOperator___sub__(lua_State *L)
THCudaTensor *tensor2 = luaT_toudata(L, 2, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 2, tensor1, tensor2));

if(!tensor1 && !tensor2)
luaL_error(L, "expecting two Tensors or one Tensor and one number");
@@ -79,6 +81,7 @@ static int cutorch_CudaTensorOperator___unm__(lua_State *L)
THCudaTensor *tensor = luaT_checkudata(L, 1, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 1, tensor));

r = THCudaTensor_new(state);
luaT_pushudata(L, r, "torch.CudaTensor");
@@ -95,6 +98,7 @@ static int cutorch_CudaTensorOperator___mul__(lua_State *L)
THCudaTensor *tensor2 = luaT_toudata(L, 2, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 2, tensor1, tensor2));

if(!tensor1 && !tensor2)
luaL_error(L, "expecting two Tensors or one Tensor and one number");
@@ -146,6 +150,7 @@ static int cutorch_CudaTensorOperator___div__(lua_State *L)
THCudaTensor *tensor = luaT_checkudata(L, 1, "torch.CudaTensor");
THCudaTensor *r;
THCState *state = cutorch_getstate(L);
THAssert(THCudaTensor_checkGPU(state, 1, tensor));

luaL_argcheck(L, lua_isnumber(L,2), 2, "number expected");

3 changes: 3 additions & 0 deletions lib/THC/THCTensor.c
@@ -770,6 +770,9 @@ int THCudaTensor_checkGPU(THCState *state, unsigned int nTensors, ...)
va_start(args, nTensors);
for (unsigned int i = 0; i < nTensors; i++) {
THCudaTensor* tensor = va_arg(args, THCudaTensor*);
if(tensor == NULL) {
continue;
}
int tensorDev = THCudaTensor_getDevice(state, tensor);
if (tensorDev != THC_DEVICE_NONE) {
if (kernelDev != tensorDev && kernelDev != THC_DEVICE_NONE) {
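
The NULL check matters because the arithmetic metamethods above now pass both operands straight to `checkGPU`, and `luaT_toudata` returns NULL when an operand is a plain Lua number; skipping NULLs keeps tensor-number expressions working (a minimal Lua sketch):

```lua
local t = torch.CudaTensor(10):fill(1)
local r = t * 2  -- inside __mul__, tensor2 is NULL; checkGPU skips it
```
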
64 changes: 56 additions & 8 deletions test/test.lua
@@ -1475,7 +1475,7 @@ function test.get_device()
-- 1. assign empty tensor to device, resize in auto mode
local tensors = { }
for i = 1,device_count do
table.insert(tensors, torch.Tensor():cuda(i))
table.insert(tensors, torch.Tensor():cudaOn(i))
end
cutorch.setDevice(0) -- auto
tester:assert(cutorch.getDevice() == 0)
@@ -1488,7 +1488,7 @@

-- 3. create tensor on device; resize on different device. Should be an error
if device_count >= 2 then
local t = torch.Tensor():cuda(1)
local t = torch.Tensor():cudaOn(1)
tester:assert(t:getDevice() == 1)
cutorch.setDevice(2)
local ok, err = pcall(function() t:resize(1,2,3) end)
@@ -1525,29 +1525,39 @@ function test.tensor_device()
cutorch.setDevice(curDev)
for tensorDev = 1,2 do

-- cudaTensorOn, unallocated
local t1 = torch.CudaTensorOn(tensorDev)
tester:assert(t1:getDevice() == tensorDev)
tester:assert(t1:nDimension() == 0)
t1:resize(0)
tester:assert(t1:getDevice() == tensorDev)

-- cudaTensorOn, allocated
local t2 = torch.CudaTensorOn(tensorDev,2,3)
tester:assert(t2:getDevice() == tensorDev)
tester:assert(t2:nDimension() == 2 and t2:size(1) == 2 and t2:size(2) == 3)
ok = pcall(function() t2 = t2:zero() end)
tester:assert(ok == (curDev == 0 or curDev == tensorDev))

local t3 = torch.FloatTensor(10,10):zero():cuda(tensorDev)
-- cudaOn, unallocated
local t3 = torch.FloatTensor():cudaOn(tensorDev)
tester:assert(t3:getDevice() == tensorDev)
tester:assert(t3:nDimension() == 2 and t3:size(1) == 10 and t3:size(2) == 10)
tester:assert(t3:nDimension() == 0)
tester:assert(cutorch.getDevice() == curDev)

-- cudaOn, allocated
local t3 = torch.FloatTensor(10,10):zero():cudaOn(tensorDev)
tester:assert(t3:getDevice() == tensorDev)
tester:assert(t3:nDimension() == 2 and t3:size(1) == 10 and t3:size(2) == 10)
tester:assert(cutorch.getDevice() == curDev)

-- setDevice
t3:setDevice(tensorDev)
ok, err = pcall(function() t3:setDevice(3-tensorDev) end)
tester:assert(not ok, "setDevice should not work on non-empty tensor")
tester:assert(err:find("Use copy"))

-- clone, allocated
local t1 = torch.CudaTensorOn(tensorDev, 10)
tester:assert(t1:getDevice() == tensorDev)
torch.CudaTensorOn(3-tensorDev,10) -- decoy
@@ -1558,9 +1568,23 @@ function test.tensor_device()
tester:assert(t2:getDevice() == curDev)
end

-- clone, unallocated
local t1 = torch.CudaTensorOn(tensorDev)
local t2 = t1:clone()
tester:assert(t2:getDevice() == 0)

-- cloneOn, allocated
local t1 = torch.CudaTensorOn(tensorDev, 10)
tester:assert(t1:getDevice() == tensorDev)
local t2 = t1:cloneOn(3-tensorDev)
tester:assert(t2:getDevice() == 3-tensorDev)

-- cloneOn, unallocated
local t1 = torch.CudaTensor()
tester:assert(t1:getDevice() == 0)
torch.CudaTensorOn(tensorDev,10) -- decoy
local t2 = t1:cloneOn(3-tensorDev)
tester:assert(t2:getDevice() == 3-tensorDev)
end
end
end
@@ -1581,6 +1605,22 @@ function test.mm_multi_device()
end
end

function test.metamethod_multi_device()
local device_count = cutorch.getDeviceCount()
if device_count >= 2 then
cutorch.setDevice(0)
local a = torch.CudaTensorOn(1, 10)
torch.CudaTensorOn(2) -- decoy
local b = a / 2
tester:assert( b:getDevice() == 1 )

local c = torch.CudaTensorOn(2, 10)
local ok, err = pcall(function() return a + c end)
tester:assert(not ok)
tester:assert(err:find("checkGPU"))
end
end

function test.storage_device()
-- test storage:setDevice()
cutorch.setDevice(0)
Expand Down Expand Up @@ -2224,10 +2264,18 @@ for k,v in pairs(test) do
end
end

function cutorch.test(tests)
math.randomseed(os.time())
torch.manualSeed(os.time())
cutorch.manualSeedAll(os.time())
function initSeed(seed)
seed = seed or os.time()
-- ensure that you can reproduce a failing test
print('seed: ', seed)
math.randomseed(seed)
torch.manualSeed(seed)
cutorch.manualSeedAll(seed)
end


function cutorch.test(tests, seed)
initSeed(seed)
tester = torch.Tester()
tester:add(test)
tester:run(tests)
17 changes: 17 additions & 0 deletions torch/generic/Tensor.c
@@ -240,6 +240,22 @@ static int torch_Tensor_(clone)(lua_State *L)
return 1;
}

static int torch_Tensor_(cloneOn)(lua_State *L)
{
THCState *state = cutorch_getstate(L);
THTensor *self = luaT_checkudata(L, 1, torch_Tensor);
int device = luaL_checkint(L, 2)-1;
int oldDev = -1;
THCudaCheck(cudaGetDevice(&oldDev));
THCudaCheck(cudaSetDevice(device));
self = THTensor_(newClone)(cutorch_getstate(L), self);
THCudaCheck(cudaSetDevice(oldDev));
// in case it's size 0, make sure the device is set
THTensor_(setDevice)(state, self, device);
luaT_pushudata(L, self, torch_Tensor);
return 1;
}

static int torch_Tensor_(contiguous)(lua_State *L)
{
THTensor *self = luaT_checkudata(L, 1, torch_Tensor);
@@ -1206,6 +1222,7 @@ static const struct luaL_Reg torch_Tensor_(_) [] = {
{"storage", torch_Tensor_(storage)},
{"storageOffset", torch_Tensor_(storageOffset)},
{"clone", torch_Tensor_(clone)},
{"cloneOn", torch_Tensor_(cloneOn)},
{"contiguous", torch_Tensor_(contiguous)},
{"resizeAs", torch_Tensor_(resizeAs)},
{"resize", torch_Tensor_(resize)},
