Nathan Weston <elb...@...>

On 3/31/2012 9:15 AM, Nathan Weston wrote:
On 03/30/2012 05:56 PM, Jeremy Selan wrote:
There are only a limited number of Ops. (MatrixOffset, Exponent, Log,
Lut1D, Lut3D). But the behavior of these ops may depend on data loaded
at runtime. (The LUT ops, for example).

So thinking ahead...
It just may be possible to precompile a generic CUDA kernel for each of
the predefined op types, and then at runtime for OCIO to loop over the
Ops within the processor, load any kernel-specific data, and then apply
the kernels to the data.

In CUDA, is it appropriate to have a 3rd party library (OCIO) directly
call a bunch of individual kernels on some passed-in data?

If so, then this just might work. However, if going this route requires
OCIO to expose publicly the internal Ops, then I would consider this a
non-starter. The internal ops cannot, and should not, become part of
any exposed public interface.
Yes, that should be possible. There may be some performance
implications. Memory access has high latency, so ideally you want to
read your data once, do all the math in registers, then write the final
result. Splitting an operation across kernels will require more
round-trips to memory. But in practice the scheduler is pretty decent at
hiding this latency, so it may not be a problem.

Another possibility is to write a single kernel, which takes the list of
Ops as input and loops over them internally. This would require all the
Ops to be available at compile time (which it sounds like is the case
now, but may not be in the future).

Either way, I don't see any reason that the Ops need to be exposed in
the API.

Thanks for the architecture overview. That should be a big help. I'll
dig into the code next week and see what I can come up with.
I've spent some time going over the OCIO code and doing some CUDA experiments, and I think I have an overall design that will work.

It will require some minor modifications to the Op classes (more on that below), but should end up with CPU and CUDA paths mostly sharing code, and thus producing identical results.

I would add a CMake option to enable CUDA. If that's turned on, OCIO will build with CUDA support and link against the CUDA runtime API -- so using CUDA or not is a compile-time decision. If CUDA is turned off, then OCIO won't have any additional dependencies beyond what it has today.

The Processor class will get a new method, applyCUDA(), which should be passed an ImageDesc that points to GPU memory. (Alternatively, we could put a GPU flag in the ImageDesc and have the existing apply() method dispatch appropriately.)

applyCUDA() will copy the list of Ops to the GPU. Then it will launch a kernel that, at each pixel, loops over each op and calls op->apply() on that pixel. In many cases, the existing code for apply() will work fine in CUDA. Some ops might need a specialized CUDA implementation for good performance. All of this code can go in one .cu file which is compiled by nvcc.

Any functions or methods that we want to run on the GPU ("device" code in CUDA parlance) need to be marked with the "__device__" qualifier so nvcc will pick them up. We can wrap this in a macro to avoid problems with the regular C++ compiler.

CUDA doesn't support separate compilation for device code. So our .cu file will need to #include the .cpp files for all the Ops. When nvcc compiles this, it will produce both host and device code -- but we already have host versions of all this code compiled separately, so we'll get duplicate symbols at link time. Thus we'll need to sprinkle around some #ifdefs to hide host code from nvcc.

Unfortunately, I've also discovered a weird quirk (maybe a bug) in nvcc. If I have code like this:
    class Foo {
        __device__ void myMethod();
    };

    __device__ void Foo::myMethod() { /* ... */ }

it will still compile a host version of myMethod(), which causes link errors. In order to prevent this problem I have to move the method body into the class declaration. So we'd also have to inline some method definitions, which is not great.

All of this is assuming that we use CUDA's Runtime API, which handles all the nasty details of launching kernels automatically. But CUDA also has a lower-level Driver API. When using the Driver API, CUDA doesn't compile any host code -- so this would sidestep the duplicate symbol problem entirely. But it introduces some hassles of its own, and I don't have much experience with it.

To summarize:
With the runtime API
* Some functions/methods are marked with __device__
* Need ifdefs to prevent duplicate symbols
* Need to inline method definitions to avoid duplicate symbols
* CUDA code is easier, but C++ code is uglier

With the driver API
* Some functions/methods are marked with __device__
* No other modifications to existing C++ code
* CUDA code is nastier

The public API will be identical in either case, so it makes no difference to the user of the library.

From my perspective, the runtime API is definitely the easier way to go. But it's not my project so I'm not really in a position to say how much rearranging/uglification of existing code is acceptable. I'll leave that call up to the powers that be. :)

-- Nathan
