MiniTorch Module 3

You will need to modify tensor_functions.py slightly in this assignment.

  • Tests:
python run_tests.py
  • Note:

Several of the tests for this assignment will only run on a GPU machine and will not run on GitHub's test infrastructure. Please follow the instructions to set up a Colab machine to run these tests.

This assignment requires the following files from the previous assignments. You can get these by running

python sync_previous_module.py previous-module-dir current-module-dir

The files that will be synced are:

    minitorch/tensor_data.py
    minitorch/tensor_functions.py
    minitorch/tensor_ops.py
    minitorch/operators.py
    minitorch/module.py
    minitorch/autodiff.py
    project/run_manual.py
    project/run_scalar.py
    project/run_tensor.py

DIAGNOSTICS OUTPUT

(venv) justin@Justins-MBP mle-module-3-justinhalford % python project/parallel_check.py
MAP
 
================================================================================
 Parallel Accelerator Optimizing:  Function tensor_map.<locals>._map, 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (153)  
================================================================================


Parallel loop listing for  Function tensor_map.<locals>._map, /Users/justin/Desktop/mle_workspace/mle-module-3-justinhalford/minitorch/fast_ops.py (153) 
------------------------------------------------------------------------------------------------------|loop #ID
    def _map(                                                                                         | 
        out: Storage,                                                                                 | 
        out_shape: Shape,                                                                             | 
        out_strides: Strides,                                                                         | 
        in_storage: Storage,                                                                          | 
        in_shape: Shape,                                                                              | 
        in_strides: Strides,                                                                          | 
    ) -> None:                                                                                        | 
        # TODO: Implement for Task 3.1.                                                               | 
        strideLenComp = len(in_strides) == len(out_strides)                                           | 
        if strideLenComp:                                                                             | 
            shapeComp = (in_shape == out_shape).all()-------------------------------------------------| #0
            strideComp = (in_strides == out_strides).all() and len(in_strides) == len(out_strides)----| #1
            # When `out` and `in` are stride-aligned, avoid indexing                                  | 
            if shapeComp and strideComp:                                                              | 
                # Main loop in parallel                                                               | 
                for i in prange(len(out)):------------------------------------------------------------| #2
                    out[i] = fn(in_storage[i])                                                        | 
                return                                                                                | 
        # When `out` and `in` are not stride-aligned                                                  | 
        # Main loop in parallel                                                                       | 
        for i in prange(len(out)):--------------------------------------------------------------------| #3
            # All indices use numpy buffers                                                           | 
            in_index, out_index = np.empty(MAX_DIMS, np.int32), np.empty(MAX_DIMS, np.int32)          | 
            to_index(i, out_shape, out_index)                                                         | 
            broadcast_index(out_index, out_shape, in_shape, in_index)                                 | 
            in_position = index_to_position(in_index, in_strides)                                     | 
            out_position = index_to_position(out_index, out_strides)                                  | 
            result = fn(in_storage[in_position])                                                      | 
            out[out_position] = result                                                                | 
        # raise NotImplementedError("Need to implement for Task 3.1")                                 | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Following the attempted fusion of parallel for-loops there are 4 parallel for-
loop(s) (originating from loops labelled: #0, #1, #2, #3).
--------------------------------------------------------------------------------
----------------------------- Before Optimisation ------------------------------
--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel structure is already optimal.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
The memory allocation derived from the instruction at 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (176) is hoisted out of the 
parallel loop labelled #3 (it will be performed before the loop is executed and 
reused inside the loop):
   Allocation:: in_index, out_index = np.empty(MAX_DIMS, np.int32), 
np.empty(MAX_DIMS, np.int32)
    - numpy.empty() is used for the allocation.
The memory allocation derived from the instruction at 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (176) is hoisted out of the 
parallel loop labelled #3 (it will be performed before the loop is executed and 
reused inside the loop):
   Allocation:: in_index, out_index = np.empty(MAX_DIMS, np.int32), 
np.empty(MAX_DIMS, np.int32)
    - numpy.empty() is used for the allocation.
None
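
The non-aligned path of `_map` above calls three helpers, `to_index`, `broadcast_index`, and `index_to_position`, whose bodies do not appear in this output. The sketch below shows what they are assumed to do, in plain Python with lists instead of NumPy buffers and a serial loop instead of `prange`. The helper bodies and the name `map_general` are illustrative assumptions, not the repository's implementations.

```python
def to_index(ordinal, shape, out_index):
    """Unpack a flat ordinal into a row-major multidimensional index."""
    for dim in range(len(shape) - 1, -1, -1):
        out_index[dim] = ordinal % shape[dim]
        ordinal //= shape[dim]


def broadcast_index(big_index, big_shape, shape, out_index):
    """Project an index in the broadcast shape onto a smaller shape."""
    offset = len(big_shape) - len(shape)
    for dim in range(len(shape)):
        # Size-1 dimensions are broadcast, so they always read element 0.
        out_index[dim] = big_index[dim + offset] if shape[dim] > 1 else 0


def index_to_position(index, strides):
    """Dot the index with the strides to get a storage offset."""
    return sum(i * s for i, s in zip(index, strides))


def map_general(fn, out, out_shape, out_strides, in_storage, in_shape, in_strides):
    """Serial stand-in for the non-aligned path of `_map` (hypothetical name)."""
    for i in range(len(out)):
        out_index = [0] * len(out_shape)
        in_index = [0] * len(in_shape)
        to_index(i, out_shape, out_index)
        broadcast_index(out_index, out_shape, in_shape, in_index)
        in_position = index_to_position(in_index, in_strides)
        out_position = index_to_position(out_index, out_strides)
        out[out_position] = fn(in_storage[in_position])


# Broadcast a length-3 vector across a 2x3 output while negating it.
out = [0.0] * 6
map_general(lambda x: -x, out, (2, 3), (3, 1), [1.0, 2.0, 3.0], (3,), (1,))
```

When shapes and strides match exactly, none of this index arithmetic is needed, which is why the listing above short-circuits to `out[i] = fn(in_storage[i])`.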
ZIP
 
================================================================================
 Parallel Accelerator Optimizing:  Function tensor_zip.<locals>._zip, 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (210)  
================================================================================


Parallel loop listing for  Function tensor_zip.<locals>._zip, /Users/justin/Desktop/mle_workspace/mle-module-3-justinhalford/minitorch/fast_ops.py (210) 
--------------------------------------------------------------------------------------------------------|loop #ID
    def _zip(                                                                                           | 
        out: Storage,                                                                                   | 
        out_shape: Shape,                                                                               | 
        out_strides: Strides,                                                                           | 
        a_storage: Storage,                                                                             | 
        a_shape: Shape,                                                                                 | 
        a_strides: Strides,                                                                             | 
        b_storage: Storage,                                                                             | 
        b_shape: Shape,                                                                                 | 
        b_strides: Strides,                                                                             | 
    ) -> None:                                                                                          | 
        # TODO: Implement for Task 3.1.                                                                 | 
        strideLenComp = len(a_strides) == len(out_strides) and len(b_strides) == len(out_strides)       | 
        if strideLenComp:                                                                               | 
            strideComp = (a_strides == out_strides).all() and (b_strides == out_strides).all()----------| #4, 5
            shapeComp = (a_shape == out_shape).all() and (b_shape == out_shape).all()-------------------| #6, 7
            # When `out`, `a`, `b` are stride-aligned, avoid indexing                                   | 
            if strideComp and shapeComp:                                                                | 
                # Main loop in parallel                                                                 | 
                for i in prange(len(out)):--------------------------------------------------------------| #8
                    out[i] = fn(a_storage[i], b_storage[i])                                             | 
                return                                                                                  | 
        # When `out`, `a`, `b` are not stride-aligned                                                   | 
        # Main loop in parallel                                                                         | 
        for i in prange(len(out)):----------------------------------------------------------------------| #9
            # All indices use numpy buffers                                                             | 
            a_index, b_index, out_index = (                                                             | 
                np.empty(MAX_DIMS, np.int32),                                                           | 
                np.empty(MAX_DIMS, np.int32),                                                           | 
                np.empty(MAX_DIMS, np.int32),                                                           | 
            )                                                                                           | 
            to_index(i, out_shape, out_index)                                                           | 
            broadcast_index(out_index, out_shape, a_shape, a_index)                                     | 
            broadcast_index(out_index, out_shape, b_shape, b_index)                                     | 
            a_position = index_to_position(a_index, a_strides)                                          | 
            b_position = index_to_position(b_index, b_strides)                                          | 
            out_position = index_to_position(out_index, out_strides)                                    | 
            out[out_position] = fn(a_storage[a_position], b_storage[b_position])                        | 
        # raise NotImplementedError("Need to implement for Task 3.1")                                   | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Following the attempted fusion of parallel for-loops there are 6 parallel for-
loop(s) (originating from loops labelled: #4, #5, #6, #7, #8, #9).
--------------------------------------------------------------------------------
----------------------------- Before Optimisation ------------------------------
--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel structure is already optimal.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
The memory allocation derived from the instruction at 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (237) is hoisted out of the 
parallel loop labelled #9 (it will be performed before the loop is executed and 
reused inside the loop):
   Allocation:: np.empty(MAX_DIMS, np.int32),
    - numpy.empty() is used for the allocation.
The memory allocation derived from the instruction at 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (238) is hoisted out of the 
parallel loop labelled #9 (it will be performed before the loop is executed and 
reused inside the loop):
   Allocation:: np.empty(MAX_DIMS, np.int32),
    - numpy.empty() is used for the allocation.
The memory allocation derived from the instruction at 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (239) is hoisted out of the 
parallel loop labelled #9 (it will be performed before the loop is executed and 
reused inside the loop):
   Allocation:: np.empty(MAX_DIMS, np.int32),
    - numpy.empty() is used for the allocation.
None
REDUCE
 
================================================================================
 Parallel Accelerator Optimizing:  Function tensor_reduce.<locals>._reduce, 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (272)  
================================================================================


Parallel loop listing for  Function tensor_reduce.<locals>._reduce, /Users/justin/Desktop/mle_workspace/mle-module-3-justinhalford/minitorch/fast_ops.py (272) 
-------------------------------------------------------------------------|loop #ID
    def _reduce(                                                         | 
        out: Storage,                                                    | 
        out_shape: Shape,                                                | 
        out_strides: Strides,                                            | 
        a_storage: Storage,                                              | 
        a_shape: Shape,                                                  | 
        a_strides: Strides,                                              | 
        reduce_dim: int,                                                 | 
    ) -> None:                                                           | 
        # TODO: Implement for Task 3.1.                                  | 
        # Main loop in parallel                                          | 
        for i in prange(len(out)):---------------------------------------| #10
            # All indices use numpy buffers                              | 
            out_index = np.empty(MAX_DIMS, np.int32)                     | 
            to_index(i, out_shape, out_index)                            | 
            a_index = out_index                                          | 
            out_position = index_to_position(out_index, out_strides)     | 
            for j in range(a_shape[reduce_dim]):                         | 
                a_index[reduce_dim] = j                                  | 
                a_position = index_to_position(a_index, a_strides)       | 
                out[out_position] = (                                    | 
                    fn(out[out_position], a_storage[a_position])         | 
                    if j != 0                                            | 
                    else a_storage[a_position]                           | 
                )                                                        | 
        # raise NotImplementedError("Need to implement for Task 3.1")    | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Following the attempted fusion of parallel for-loops there are 1 parallel for-
loop(s) (originating from loops labelled: #10).
--------------------------------------------------------------------------------
----------------------------- Before Optimisation ------------------------------
--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel structure is already optimal.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
The memory allocation derived from the instruction at 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (285) is hoisted out of the 
parallel loop labelled #10 (it will be performed before the loop is executed and
 reused inside the loop):
   Allocation:: out_index = np.empty(MAX_DIMS, np.int32)
    - numpy.empty() is used for the allocation.
None
MATRIX MULTIPLY
 
================================================================================
 Parallel Accelerator Optimizing:  Function _tensor_matrix_multiply, 
/Users/justin/Desktop/mle_workspace/mle-
module-3-justinhalford/minitorch/fast_ops.py (302)  
================================================================================


Parallel loop listing for  Function _tensor_matrix_multiply, /Users/justin/Desktop/mle_workspace/mle-module-3-justinhalford/minitorch/fast_ops.py (302) 
-----------------------------------------------------------------------------|loop #ID
def _tensor_matrix_multiply(                                                 | 
    out: Storage,                                                            | 
    out_shape: Shape,                                                        | 
    out_strides: Strides,                                                    | 
    a_storage: Storage,                                                      | 
    a_shape: Shape,                                                          | 
    a_strides: Strides,                                                      | 
    b_storage: Storage,                                                      | 
    b_shape: Shape,                                                          | 
    b_strides: Strides,                                                      | 
) -> None:                                                                   | 
    """                                                                      | 
    NUMBA tensor matrix multiply function.                                   | 
                                                                             | 
    Should work for any tensor shapes that broadcast as long as              | 
                                                                             | 
    ```                                                                      | 
    assert a_shape[-1] == b_shape[-2]                                        | 
    ```                                                                      | 
                                                                             | 
    Optimizations:                                                           | 
                                                                             | 
    * Outer loop in parallel                                                 | 
    * No index buffers or function calls                                     | 
    * Inner loop should have no global writes, 1 multiply.                   | 
                                                                             | 
                                                                             | 
    Args:                                                                    | 
        out (Storage): storage for `out` tensor                              | 
        out_shape (Shape): shape for `out` tensor                            | 
        out_strides (Strides): strides for `out` tensor                      | 
        a_storage (Storage): storage for `a` tensor                          | 
        a_shape (Shape): shape for `a` tensor                                | 
        a_strides (Strides): strides for `a` tensor                          | 
        b_storage (Storage): storage for `b` tensor                          | 
        b_shape (Shape): shape for `b` tensor                                | 
        b_strides (Strides): strides for `b` tensor                          | 
                                                                             | 
    Returns:                                                                 | 
        None : Fills in `out`                                                | 
    """                                                                      | 
    # a_batch_stride = a_strides[0] if a_shape[0] > 1 else 0                 | 
    # b_batch_stride = b_strides[0] if b_shape[0] > 1 else 0                 | 
                                                                             | 
    # TODO: Implement for Task 3.2.                                          | 
    x, y = -1, -2                                                            | 
    assert a_shape[x] == b_shape[y]                                          | 
    # Outer loop in parallel                                                 | 
    for i in prange(len(out)):-----------------------------------------------| #12
        out_index = out_shape.copy()                                         | 
        to_index(i, out_shape, out_index)                                    | 
        out_position = index_to_position(out_index, out_strides)             | 
        for j in prange(a_shape[-1]):----------------------------------------| #11
            a_, b_ = out_index.copy(), out_index.copy()                      | 
            a_[x], b_[y] = j, j                                              | 
            a__, b__ = a_shape.copy(), b_shape.copy()                        | 
            broadcast_index(a_, out_shape, a_shape, a__)                     | 
            a_position = index_to_position(a__, a_strides)                   | 
            broadcast_index(b_, out_shape, b_shape, b__)                     | 
            b_position = index_to_position(b__, b_strides)                   | 
            a_comp, b_comp = a_storage[a_position], b_storage[b_position]    | 
            out[out_position] += a_comp * b_comp                             | 
    # raise NotImplementedError("Need to implement for Task 3.2")            | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Following the attempted fusion of parallel for-loops there are 2 parallel for-
loop(s) (originating from loops labelled: #12, #11).
--------------------------------------------------------------------------------
---------------------------- Optimising loop nests -----------------------------
Attempting loop nest rewrites (optimising for the largest parallel loops)...
 
+--12 is a parallel loop
   +--11 --> rewritten as a serial loop
--------------------------------------------------------------------------------
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--12 (parallel)
   +--11 (parallel)


--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--12 (parallel)
   +--11 (serial)


 
Parallel region 0 (loop #12) had 0 loop(s) fused and 1 loop(s) serialized as 
part of the larger parallel loop (#12).
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
No allocation hoisting found
None
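
The docstring in the listing above asks for no index buffers or function calls, and an inner loop with no global writes and one multiply, whereas the listed implementation copies index arrays and accumulates into `out` on every inner iteration. The sketch below shows one way those goals could be met using stride arithmetic only. It assumes all three tensors are 3-D (batch, row, column), uses plain Python lists and `range` so it runs without Numba, and `matmul_strided` is a hypothetical name. The batch-stride trick is taken from the commented-out lines in the listing.

```python
def matmul_strided(out, out_shape, out_strides,
                   a_storage, a_shape, a_strides,
                   b_storage, b_shape, b_strides):
    """Batched matrix multiply over 3-D strided storage (illustrative sketch)."""
    # A batch dimension of size 1 is broadcast, so its stride is treated as 0.
    a_batch_stride = a_strides[0] if a_shape[0] > 1 else 0
    b_batch_stride = b_strides[0] if b_shape[0] > 1 else 0
    batches, rows, cols = out_shape
    inner = a_shape[2]
    assert inner == b_shape[1]

    for n in range(batches):  # under Numba this loop would be the prange
        for i in range(rows):
            for j in range(cols):
                # Start of row i in `a` and column j in `b` for this batch.
                a_pos = n * a_batch_stride + i * a_strides[1]
                b_pos = n * b_batch_stride + j * b_strides[2]
                acc = 0.0  # local accumulator: no writes to `out` in the inner loop
                for _ in range(inner):
                    acc += a_storage[a_pos] * b_storage[b_pos]
                    a_pos += a_strides[2]
                    b_pos += b_strides[1]
                out_pos = n * out_strides[0] + i * out_strides[1] + j * out_strides[2]
                out[out_pos] = acc


# [[1, 2, 3], [4, 5, 6]] @ [[1, 2], [3, 4], [5, 6]]
result = [0.0] * 4
matmul_strided(result, (1, 2, 2), (4, 2, 1),
               [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (1, 2, 3), (6, 3, 1),
               [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (1, 3, 2), (6, 2, 1))
```

Because each output element is written exactly once from a local accumulator, the outer loop can be parallelized without two threads updating the same position in `out`.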

TRAINING LOGS

GPU
Script: python run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
Epoch:  0 	Loss:  4.6751798 	Correct: 35 	Time per epoch: 0.61  seconds
Epoch:  10 	Loss:  6.0839791 	Correct: 35 	Time per epoch: 2.11  seconds
Epoch:  20 	Loss:  6.289351 	Correct: 43 	Time per epoch: 2.13  seconds
Epoch:  30 	Loss:  3.5453504 	Correct: 36 	Time per epoch: 2.13  seconds
Epoch:  40 	Loss:  3.7308831 	Correct: 46 	Time per epoch: 2.12  seconds
Epoch:  50 	Loss:  3.5597258 	Correct: 41 	Time per epoch: 2.12  seconds
Epoch:  60 	Loss:  2.9030106 	Correct: 47 	Time per epoch: 2.13  seconds
Epoch:  70 	Loss:  3.4812998 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  80 	Loss:  2.615715 	Correct: 50 	Time per epoch: 2.13  seconds
Epoch:  90 	Loss:  2.6139993 	Correct: 48 	Time per epoch: 2.12  seconds
Epoch:  100 	Loss:  1.3980912 	Correct: 50 	Time per epoch: 2.13  seconds
Epoch:  110 	Loss:  1.4104081 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  120 	Loss:  2.1015708 	Correct: 49 	Time per epoch: 2.11  seconds
Epoch:  130 	Loss:  0.8530972 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  140 	Loss:  1.097273 	Correct: 49 	Time per epoch: 2.12  seconds
Epoch:  150 	Loss:  0.4730568 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  160 	Loss:  1.1160981 	Correct: 50 	Time per epoch: 2.13  seconds
Epoch:  170 	Loss:  1.1554417 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  180 	Loss:  1.1210623 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  190 	Loss:  0.6029956 	Correct: 50 	Time per epoch: 2.13  seconds
Epoch:  200 	Loss:  1.2306773 	Correct: 49 	Time per epoch: 2.12  seconds
Epoch:  210 	Loss:  1.7425789 	Correct: 49 	Time per epoch: 2.12  seconds
Epoch:  220 	Loss:  0.5212502 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  230 	Loss:  0.3374005 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  240 	Loss:  0.620631 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  250 	Loss:  0.9572571 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  260 	Loss:  0.7303755 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  270 	Loss:  0.1999192 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  280 	Loss:  1.0131235 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  290 	Loss:  0.4832452 	Correct: 50 	Time per epoch: 2.13  seconds
Epoch:  300 	Loss:  0.5787182 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  310 	Loss:  0.904345 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  320 	Loss:  0.2270062 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  330 	Loss:  0.2611866 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  340 	Loss:  0.853673 	Correct: 50 	Time per epoch: 2.13  seconds
Epoch:  350 	Loss:  0.6982114 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  360 	Loss:  0.2291951 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  370 	Loss:  0.1454667 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  380 	Loss:  0.1602238 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  390 	Loss:  0.4387185 	Correct: 50 	Time per epoch: 2.11  seconds
Epoch:  400 	Loss:  0.4023651 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  410 	Loss:  0.1945227 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  420 	Loss:  0.2930128 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  430 	Loss:  0.0921325 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  440 	Loss:  0.0720437 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  450 	Loss:  0.4449849 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  460 	Loss:  0.2347959 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  470 	Loss:  0.2125934 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  480 	Loss:  0.1527142 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  490 	Loss:  0.4451058 	Correct: 50 	Time per epoch: 2.11  seconds
AVERAGE TIME PER EPOCH:  2.09  seconds

CPU
Script: python run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET split --RATE 0.05
Epoch:  0 	Loss:  7.796053 	Correct: 35 	Time per epoch: 2.3  seconds
Epoch:  10 	Loss:  7.3217198 	Correct: 24 	Time per epoch: 1.38  seconds
Epoch:  20 	Loss:  5.6953389 	Correct: 42 	Time per epoch: 1.38  seconds
Epoch:  30 	Loss:  4.5031078 	Correct: 48 	Time per epoch: 1.52  seconds
Epoch:  40 	Loss:  3.320344 	Correct: 42 	Time per epoch: 1.38  seconds
Epoch:  50 	Loss:  2.2498689 	Correct: 38 	Time per epoch: 1.38  seconds
Epoch:  60 	Loss:  2.1538085 	Correct: 48 	Time per epoch: 1.54  seconds
Epoch:  70 	Loss:  2.7061848 	Correct: 48 	Time per epoch: 1.38  seconds
Epoch:  80 	Loss:  1.8005149 	Correct: 46 	Time per epoch: 1.38  seconds
Epoch:  90 	Loss:  2.5934861 	Correct: 49 	Time per epoch: 1.54  seconds
Epoch:  100 	Loss:  1.5814572 	Correct: 50 	Time per epoch: 1.37  seconds
Epoch:  110 	Loss:  1.3320421 	Correct: 48 	Time per epoch: 1.49  seconds
Epoch:  120 	Loss:  1.5194404 	Correct: 49 	Time per epoch: 1.42  seconds
Epoch:  130 	Loss:  0.6980283 	Correct: 48 	Time per epoch: 1.38  seconds
Epoch:  140 	Loss:  1.3317125 	Correct: 49 	Time per epoch: 1.52  seconds
Epoch:  150 	Loss:  0.6043141 	Correct: 50 	Time per epoch: 1.38  seconds
Epoch:  160 	Loss:  0.5964361 	Correct: 48 	Time per epoch: 1.38  seconds
Epoch:  170 	Loss:  1.8951374 	Correct: 48 	Time per epoch: 1.53  seconds
Epoch:  180 	Loss:  0.6762602 	Correct: 50 	Time per epoch: 1.37  seconds
Epoch:  190 	Loss:  1.072632 	Correct: 48 	Time per epoch: 1.37  seconds
Epoch:  200 	Loss:  1.3106452 	Correct: 50 	Time per epoch: 1.49  seconds
Epoch:  210 	Loss:  1.5044155 	Correct: 50 	Time per epoch: 1.39  seconds
Epoch:  220 	Loss:  0.7319147 	Correct: 50 	Time per epoch: 1.38  seconds
Epoch:  230 	Loss:  1.6651903 	Correct: 50 	Time per epoch: 1.52  seconds
Epoch:  240 	Loss:  1.4409077 	Correct: 49 	Time per epoch: 1.37  seconds
Epoch:  250 	Loss:  0.1398968 	Correct: 50 	Time per epoch: 1.39  seconds
Epoch:  260 	Loss:  0.3919883 	Correct: 48 	Time per epoch: 1.52  seconds
Epoch:  270 	Loss:  1.0982466 	Correct: 50 	Time per epoch: 1.38  seconds
Epoch:  280 	Loss:  0.5339689 	Correct: 49 	Time per epoch: 1.39  seconds
Epoch:  290 	Loss:  0.4004985 	Correct: 49 	Time per epoch: 1.51  seconds
Epoch:  300 	Loss:  0.3597154 	Correct: 50 	Time per epoch: 1.39  seconds
Epoch:  310 	Loss:  1.9996767 	Correct: 47 	Time per epoch: 1.4  seconds
Epoch:  320 	Loss:  2.0679084 	Correct: 47 	Time per epoch: 1.51  seconds
Epoch:  330 	Loss:  0.5210477 	Correct: 49 	Time per epoch: 1.39  seconds
Epoch:  340 	Loss:  0.0801231 	Correct: 50 	Time per epoch: 1.5  seconds
Epoch:  350 	Loss:  0.9598003 	Correct: 50 	Time per epoch: 1.4  seconds
Epoch:  360 	Loss:  0.8947082 	Correct: 49 	Time per epoch: 1.37  seconds
Epoch:  370 	Loss:  0.1790264 	Correct: 50 	Time per epoch: 1.51  seconds
Epoch:  380 	Loss:  0.0313458 	Correct: 49 	Time per epoch: 1.42  seconds
Epoch:  390 	Loss:  1.012966 	Correct: 49 	Time per epoch: 1.39  seconds
Epoch:  400 	Loss:  0.772064 	Correct: 50 	Time per epoch: 1.5  seconds
Epoch:  410 	Loss:  0.0373174 	Correct: 49 	Time per epoch: 1.4  seconds
Epoch:  420 	Loss:  0.2305728 	Correct: 50 	Time per epoch: 1.39  seconds
Epoch:  430 	Loss:  0.3531302 	Correct: 48 	Time per epoch: 1.56  seconds
Epoch:  440 	Loss:  0.1410218 	Correct: 50 	Time per epoch: 1.38  seconds
Epoch:  450 	Loss:  0.4870612 	Correct: 50 	Time per epoch: 1.39  seconds
Epoch:  460 	Loss:  1.0781355 	Correct: 48 	Time per epoch: 1.52  seconds
Epoch:  470 	Loss:  1.3224374 	Correct: 48 	Time per epoch: 1.39  seconds
Epoch:  480 	Loss:  0.2725969 	Correct: 49 	Time per epoch: 1.38  seconds
Epoch:  490 	Loss:  0.1662803 	Correct: 50 	Time per epoch: 1.5  seconds
AVERAGE TIME PER EPOCH:  1.44  seconds

GPU - LARGER MODEL
Script: python run_fast_tensor.py --BACKEND gpu --HIDDEN 250 --DATASET split --RATE 0.05
Epoch:  0 	Loss:  40.4289123 	Correct: 26 	Time per epoch: 0.63  seconds
Epoch:  10 	Loss:  3.7904395 	Correct: 43 	Time per epoch: 2.25  seconds
Epoch:  20 	Loss:  3.1538177 	Correct: 43 	Time per epoch: 2.26  seconds
Epoch:  30 	Loss:  1.8133609 	Correct: 47 	Time per epoch: 2.27  seconds
Epoch:  40 	Loss:  3.031231 	Correct: 47 	Time per epoch: 2.26  seconds
Epoch:  50 	Loss:  1.1275088 	Correct: 48 	Time per epoch: 2.26  seconds
Epoch:  60 	Loss:  0.6576759 	Correct: 49 	Time per epoch: 2.26  seconds
Epoch:  70 	Loss:  3.1186411 	Correct: 48 	Time per epoch: 2.25  seconds
Epoch:  80 	Loss:  3.0524176 	Correct: 48 	Time per epoch: 2.26  seconds
Epoch:  90 	Loss:  1.8591487 	Correct: 48 	Time per epoch: 2.25  seconds
Epoch:  100 	Loss:  2.9593964 	Correct: 47 	Time per epoch: 2.25  seconds
Epoch:  110 	Loss:  1.6874119 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  120 	Loss:  0.3577296 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  130 	Loss:  0.8196006 	Correct: 49 	Time per epoch: 2.26  seconds
Epoch:  140 	Loss:  1.2199134 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  150 	Loss:  0.3603652 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  160 	Loss:  0.4557673 	Correct: 49 	Time per epoch: 2.26  seconds
Epoch:  170 	Loss:  0.2399357 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  180 	Loss:  0.486752 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  190 	Loss:  1.6848536 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  200 	Loss:  0.2724246 	Correct: 49 	Time per epoch: 2.26  seconds
Epoch:  210 	Loss:  1.5931497 	Correct: 48 	Time per epoch: 2.25  seconds
Epoch:  220 	Loss:  0.2149285 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  230 	Loss:  0.3252945 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  240 	Loss:  1.6390974 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  250 	Loss:  0.889793 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  260 	Loss:  0.2399558 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  270 	Loss:  0.6206427 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  280 	Loss:  0.1538105 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  290 	Loss:  0.1448247 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  300 	Loss:  0.4177541 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  310 	Loss:  0.7999527 	Correct: 49 	Time per epoch: 2.25  seconds
Epoch:  320 	Loss:  1.085701 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  330 	Loss:  0.2187044 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  340 	Loss:  0.643253 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  350 	Loss:  0.5215887 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  360 	Loss:  0.4059418 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  370 	Loss:  0.0194751 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  380 	Loss:  0.0226324 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  390 	Loss:  0.2637746 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  400 	Loss:  0.297469 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  410 	Loss:  0.1043882 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  420 	Loss:  0.1247114 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  430 	Loss:  0.0162713 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  440 	Loss:  0.372408 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  450 	Loss:  0.0401251 	Correct: 50 	Time per epoch: 2.27  seconds
Epoch:  460 	Loss:  0.5023019 	Correct: 50 	Time per epoch: 2.24  seconds
Epoch:  470 	Loss:  0.3292899 	Correct: 50 	Time per epoch: 2.26  seconds
Epoch:  480 	Loss:  0.3996228 	Correct: 50 	Time per epoch: 2.25  seconds
Epoch:  490 	Loss:  0.3587235 	Correct: 50 	Time per epoch: 2.25  seconds
AVERAGE TIME PER EPOCH:  2.21  seconds

GPU
Script: python run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05

CPU
Script: python run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05

GPU - LARGER MODEL
Script: python run_fast_tensor.py --BACKEND gpu --HIDDEN 250 --DATASET simple --RATE 0.05

GPU
Script: python run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
Epoch:  0 	Loss:  7.1735299 	Correct: 27 	Time per epoch: 0.44  seconds
Epoch:  10 	Loss:  5.4344595 	Correct: 40 	Time per epoch: 2.38  seconds
Epoch:  20 	Loss:  4.120264 	Correct: 40 	Time per epoch: 1.89  seconds
Epoch:  30 	Loss:  5.0077927 	Correct: 44 	Time per epoch: 1.89  seconds
Epoch:  40 	Loss:  1.7443528 	Correct: 45 	Time per epoch: 2.38  seconds
Epoch:  50 	Loss:  4.3674667 	Correct: 44 	Time per epoch: 1.89  seconds
Epoch:  60 	Loss:  4.1087492 	Correct: 43 	Time per epoch: 1.92  seconds
Epoch:  70 	Loss:  5.7315126 	Correct: 42 	Time per epoch: 2.37  seconds
Epoch:  80 	Loss:  3.8592486 	Correct: 42 	Time per epoch: 1.89  seconds
Epoch:  90 	Loss:  0.8461004 	Correct: 44 	Time per epoch: 2.39  seconds
Epoch:  100 	Loss:  2.0596891 	Correct: 44 	Time per epoch: 2.47  seconds
Epoch:  110 	Loss:  4.5720548 	Correct: 45 	Time per epoch: 2.3  seconds
Epoch:  120 	Loss:  3.591436 	Correct: 46 	Time per epoch: 1.87  seconds
Epoch:  130 	Loss:  1.910777 	Correct: 46 	Time per epoch: 2.39  seconds
Epoch:  140 	Loss:  1.7964367 	Correct: 46 	Time per epoch: 1.89  seconds
Epoch:  150 	Loss:  2.1196323 	Correct: 47 	Time per epoch: 1.9  seconds
Epoch:  160 	Loss:  3.126072 	Correct: 48 	Time per epoch: 2.38  seconds
Epoch:  170 	Loss:  2.6040145 	Correct: 47 	Time per epoch: 1.89  seconds
Epoch:  180 	Loss:  2.1631396 	Correct: 48 	Time per epoch: 1.88  seconds
Epoch:  190 	Loss:  3.1612961 	Correct: 47 	Time per epoch: 2.29  seconds
Epoch:  200 	Loss:  1.5755062 	Correct: 48 	Time per epoch: 1.97  seconds
Epoch:  210 	Loss:  1.9472784 	Correct: 48 	Time per epoch: 1.9  seconds
Epoch:  220 	Loss:  1.7602159 	Correct: 48 	Time per epoch: 2.37  seconds
Epoch:  230 	Loss:  2.115012 	Correct: 47 	Time per epoch: 2.06  seconds
Epoch:  240 	Loss:  0.9159519 	Correct: 48 	Time per epoch: 2.3  seconds
Epoch:  250 	Loss:  0.4279504 	Correct: 50 	Time per epoch: 2.4  seconds
Epoch:  260 	Loss:  1.6609721 	Correct: 48 	Time per epoch: 1.96  seconds
Epoch:  270 	Loss:  1.5841621 	Correct: 49 	Time per epoch: 1.87  seconds
Epoch:  280 	Loss:  1.7667494 	Correct: 49 	Time per epoch: 1.91  seconds
Epoch:  290 	Loss:  1.7301585 	Correct: 49 	Time per epoch: 2.37  seconds
Epoch:  300 	Loss:  0.1986096 	Correct: 49 	Time per epoch: 1.87  seconds
Epoch:  310 	Loss:  0.4204212 	Correct: 49 	Time per epoch: 1.87  seconds
Epoch:  320 	Loss:  1.1768902 	Correct: 49 	Time per epoch: 2.39  seconds
Epoch:  330 	Loss:  1.1114638 	Correct: 49 	Time per epoch: 1.88  seconds
Epoch:  340 	Loss:  0.6772333 	Correct: 50 	Time per epoch: 1.86  seconds
Epoch:  350 	Loss:  1.1982121 	Correct: 50 	Time per epoch: 2.75  seconds
Epoch:  360 	Loss:  0.5463171 	Correct: 50 	Time per epoch: 1.97  seconds
Epoch:  370 	Loss:  0.8551744 	Correct: 49 	Time per epoch: 1.88  seconds
Epoch:  380 	Loss:  1.4217957 	Correct: 50 	Time per epoch: 2.55  seconds
Epoch:  390 	Loss:  0.6803751 	Correct: 50 	Time per epoch: 1.86  seconds
Epoch:  400 	Loss:  1.1534305 	Correct: 49 	Time per epoch: 1.87  seconds
Epoch:  410 	Loss:  0.4845953 	Correct: 50 	Time per epoch: 2.37  seconds
Epoch:  420 	Loss:  1.2140399 	Correct: 48 	Time per epoch: 1.86  seconds
Epoch:  430 	Loss:  1.4143176 	Correct: 50 	Time per epoch: 1.86  seconds
Epoch:  440 	Loss:  0.6827089 	Correct: 49 	Time per epoch: 2.37  seconds
Epoch:  450 	Loss:  0.3514045 	Correct: 49 	Time per epoch: 1.87  seconds
Epoch:  460 	Loss:  1.0313861 	Correct: 50 	Time per epoch: 1.88  seconds
Epoch:  470 	Loss:  0.4922438 	Correct: 50 	Time per epoch: 2.12  seconds
Epoch:  480 	Loss:  0.1051086 	Correct: 49 	Time per epoch: 2.18  seconds
Epoch:  490 	Loss:  0.3793103 	Correct: 50 	Time per epoch: 2.3  seconds
AVERAGE TIME PER EPOCH:  2.16  seconds
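
The averages above are printed by the training script itself. As a rough cross-check, the sketch below parses the logged lines and averages their `Time per epoch` values. The helper names `parse_log` and `mean_time` are hypothetical and are not part of `run_fast_tensor.py`. Because only every tenth epoch is logged, the figure it produces may differ slightly from the one the trainer reports.

```python
import re

# Matches one logged line, e.g.
# "Epoch:  10 	Loss:  5.4344595 	Correct: 40 	Time per epoch: 2.38  seconds"
# (assumes the exact format shown in this README; \s+ also covers the tabs)
LINE_RE = re.compile(
    r"Epoch:\s*(\d+)\s+Loss:\s*([\d.]+)\s+Correct:\s*(\d+)\s+"
    r"Time per epoch:\s*([\d.]+)\s+seconds"
)


def parse_log(text):
    """Return a list of (epoch, loss, correct, seconds) tuples found in text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            epoch, loss, correct, seconds = match.groups()
            rows.append((int(epoch), float(loss), int(correct), float(seconds)))
    return rows


def mean_time(rows):
    """Average the per-epoch time over the parsed rows."""
    if not rows:
        raise ValueError("no epoch lines found")
    return sum(row[3] for row in rows) / len(rows)


# Three lines copied from the xor GPU run above.
sample = (
    "Epoch:  0 \tLoss:  7.1735299 \tCorrect: 27 \tTime per epoch: 0.44  seconds\n"
    "Epoch:  10 \tLoss:  5.4344595 \tCorrect: 40 \tTime per epoch: 2.38  seconds\n"
    "Epoch:  20 \tLoss:  4.120264 \tCorrect: 40 \tTime per epoch: 1.89  seconds\n"
)

rows = parse_log(sample)
print(f"parsed {len(rows)} lines, mean time per epoch: {mean_time(rows):.2f} seconds")
```

Pasting a full run into `sample` gives a quick way to compare CPU and GPU timings once the remaining sections are filled in.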

CPU
Script: python run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET xor --RATE 0.05

GPU - LARGER MODEL
Script: python run_fast_tensor.py --BACKEND gpu --HIDDEN 250 --DATASET xor --RATE 0.05