OCIO CUDA


Nathan Weston <elb...@...>
 

I'm currently integrating OpenColorIO into an application that uses CUDA for GPU processing. In order to use OCIO's shader path, we'd need to copy our images over to OpenGL textures and back again. If OCIO had a CUDA path, it would be cleaner and faster.

Has anyone looked into implementing such a thing? If I were to implement it myself, is there any interest in including it in OCIO?


Jeremy Selan <jeremy...@...>
 

I don't think anyone has looked into implementing a CUDA pathway, but I'm very open to such ideas. Someone did ask about an OpenCL implementation recently, but I believe it's still in the concept stage.

A few thoughts on the concept...

Our current GPU implementation does not attempt to match the CPU implementation, by design.  The CPU codepath does the full analytical color operations per pixel, while the GPU GLSL/Cg implementation relies on a combination of analytical shader text code generation along with 3D LUT sampling.  Color operations which can be done in simple shader text (such as math ops) all happen in the GLSL shader, but if the user references multiple 3D LUTs, for example, it's all baked into a single 3D LUT.

I was always hoping that, if we ever implemented a CUDA or OpenCL pathway, it would be more akin to the CPU code path and do more analytically.  I'm not sure if this is possible, but I think it's a nice ideal for a 'compute' context.

Another nicety of the current implementation is that even though we support GPU(s), OpenColorIO doesn't actually link to libGL, etc.  The 'GPU API' conceptually only deals with POD types, returning the float* 3D LUT and the const char* shader text.

My hope would be that, if possible, a CUDA / OpenCL pathway wouldn't impose any new linking requirements on the core library, but would instead support new code paths using simple data types.

-- Jeremy




Nathan Weston <elb...@...>
 

Cool. If all goes well, I can hopefully find time to work on this over the next couple of months.

CUDA historically hasn't supported dynamic generation/compilation of kernels. I believe it's possible with newer versions of the compiler, but only with the lower-level driver API. A statically-compiled kernel is probably a better bet, which would tend to point toward a more analytical approach along the lines of your CPU codepath.

I've had pretty good luck in the past sharing code between C++ and CUDA in order to implement parallel code paths that produce the same results. The only snags are
 1) the shared code has to go into header files
 2) Virtual functions require CUDA 4.0 and Fermi hardware

If possible I'd like to support older cards and CUDA toolkits, which means no virtual functions.

Is there any documentation of the OCIO internals to help me get my bearings?




Jeremy Selan <jeremy...@...>
 

Excellent, looking forward to seeing what you do.

You mention CUDA (historically) doesn't support dynamic compilation of kernels?  That would imply an implementation that looks more like a fixed-function processing path, unfortunately.  The downside being that the results would be even less accurate (potentially) than either our current CPU or GPU pathways.

Recall that in OCIO, all of the color transforms are dynamically loaded at runtime, so at library compile-time there's no way to know what processing will be required for a given color transform.  (You roughly know the building blocks, but not how they will be chained together.)

Even our current GLSL codepath, which leverages a single 3D LUT, tries to do as much as possible in the fragment shader at runtime.  (Its pipeline looks like [GLSL CODE + 3DLUT + GLSL CODE], with as much done in code as possible.)

What type of CUDA application are you writing? Are you looking for OCIO in CUDA for performance reasons?  Are you looking for OCIO to match the quality of the GPU?  Perhaps we can come up with an alternate implementation approach, or decide that it's better to just target recent CUDA versions.

At first glance, it appears that OpenCL may support dynamic compilation, and would thus be easier to match the CPU 1:1.  Can anyone with OpenCL experience chime in?

clCreateProgramWithSource(...), etc.

-- Jeremy




Larry Gritz <l...@...>
 

IIRC, although it's not quite as easy as OpenCL, these days CUDA can dynamically compile kernels.  (OpenCL was able to do that all along.)

The other advantage of CUDA is that it's really C++ with a couple of minor additions, which may make porting our existing code a lot easier, and also lets you use all your favorite C++ features such as classes and templates.  OpenCL is its own thing (though very C-like).

The disadvantage is, of course, less HW and vendor independence.





--
Larry Gritz




Nathan Weston <elb...@...>
 

On 03/30/2012 03:02 PM, Jeremy Selan wrote:
Excellent, looking forward to seeing what you do.

You mention CUDA (historically) doesn't support dynamic compilation of
kernels? That would imply an implementation that looks more like a
fixed function processing path, unfortunately. The downside being that
the results would be even less accurate (potentially) then either our
current CPU or GPU pathways.

Recall that in OCIO, all of the color transforms are dynamically loaded
at runtime, so at library compile-time there's no way to know processing
will be required for a given color transform. (You roughly know the
building blocks, but not how they will be chained together).

Hmm, I didn't know that. I don't yet know anything at all about OCIO under the hood. If there's any documentation of this kind of stuff it would be very helpful.

When you say the color transforms are loaded at runtime, are you talking about additional code (e.g. a .so that implements a particular transform)? If it's simply reading LUTs or other datafiles then a CUDA implementation might be able to work off the same data. CUDA supports most of C++ on the device side so it's quite a bit more flexible than GLSL.

Even our current GLSL codepath, which leverages a single 3dlut, tries to
do as much as possible in the fragment shader at runtime. (It's
pipeline looks like [GLSL CODE + 3DLUT + GLSL CODE], with as much done
in code as possible).

What type of CUDA application are you writing? Are you looking for OCIO
in CUDA for performance reasons? Are you looking for OCIO to match the
quality of the GPU? Perhaps we can come up with an alternate
implementation approach, or decide that it's better to just target
recent CUDA versions.

I have to be slightly cagey about this for the moment since I work on a commercial product and we haven't yet announced our plans regarding OCIO. But broadly, we have an image processing application which has both CPU and CUDA codepaths depending on the available hardware. All of this is already implemented, so we're pretty much wedded to CUDA for the time being.

If we've processed a whole frame on the GPU, it would kill our performance to copy it back to the CPU to run a colorspace transform. Copying to an OpenGL texture to use the GLSL path would also hurt performance, though not as badly. So a CUDA path would be very useful to me.

On first glance, it appears that OpenCL may support dynamic compilation,
and thus be easier to match the CPU 1:1. Can anyone with OpenCL
experience chime in?

clCreateProgramWithSource(...), etc.

I haven't done any OpenCL programming, but have paid close attention to it over the years. It does support dynamic compilation. As Larry mentioned, it's somewhat lacking in language features compared to CUDA (e.g. templates and other C++ stuff).

There isn't any way to share data between OpenCL and CUDA, so an OpenCL path wouldn't be of much use to me. Which, as you can imagine, somewhat reduces my motivation to write one. :)




Nathan Weston <elb...@...>
 

That was my recollection as well, but I can't find any documentation or examples of this. Maybe I'm missing something. Do you remember any details?

All I can turn up are some 3rd-party tools which run the nvcc compiler as a subprocess. nvcc is part of the CUDA SDK, but I don't think it's shipped with the driver or runtime libraries that would normally be installed on end-user machines, so this kind of approach could get complicated when it comes to deployment.

On 03/30/2012 03:11 PM, Larry Gritz wrote:
IIRC, although it's not quite as easy as OpenCL, these days Cuda can
dynamically compile kernels. (OpenCL was able to do that all along.)

The other advantage of Cuda is that it's really C++ with a couple minor
additions, which may make porting our existing code a lot easier, and
also use all your favorite C++ features such as classes and templates.
OpenCL is its own thing (though very C like).

The disadvantage is, of course, less HW and vendor independence.





Dithermaster <dither...@...>
 

"There isn't any way to share data between OpenCL and CUDA"

Not entirely true. If you base them both on an OpenGL context, you can create OpenCL images from OpenGL textures, and you can access OpenGL textures from CUDA as well.
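For example, a rough sketch of what the CUDA side of that looks like (CUDA runtime interop API; 'tex' is assumed to be an existing GL_TEXTURE_2D created by the application, with a current GL context):

#include <cuda_gl_interop.h>

// Register the GL texture once, then map it whenever CUDA needs access.
cudaGraphicsResource_t resource = 0;
cudaGraphicsGLRegisterImage(&resource, tex, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsNone);
cudaGraphicsMapResources(1, &resource);

// The mapped texture is exposed to CUDA as a cudaArray.
cudaArray_t array = 0;
cudaGraphicsSubResourceGetMappedArray(&array, resource, 0, 0);
// ... copy into/out of 'array', or bind it to a texture reference for
// reading in a kernel ...

cudaGraphicsUnmapResources(1, &resource);
cudaGraphicsUnregisterResource(resource);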

///d@







Paul Miller <pa...@...>
 

If we've processed a whole frame on the GPU, it would kill our
performance to copy it back to the CPU to run a colorspace transform.
Copying to an OpenGL texture to use the GLSL path would also hurt
performance, though not as badly. So a CUDA path would be very useful to
me.

Can't you just bind your target framebuffer as a texture? No copy required and everything stays on the GPU.


Nathan Weston <elb...@...>
 

On 03/30/2012 05:24 PM, Paul Miller wrote:
If we've processed a whole frame on the GPU, it would kill our
performance to copy it back to the CPU to run a colorspace transform.
Copying to an OpenGL texture to use the GLSL path would also hurt
performance, though not as badly. So a CUDA path would be very useful to
me.
Can't you just bind your target framebuffer as a texture? No copy
required and everything stays on the GPU.

CUDA doesn't really have framebuffers as such. OpenGL textures are mapped to CUDA arrays, which you can copy into but can't write directly from a kernel. So there's at least one copy going in, and probably another going out (depending on how you want to access the data afterward).

In addition to which, we typically work with planar images, so we'd add packing/unpacking overhead.

We're not doing any display so we don't currently use OpenGL at all.

None of this overhead is prohibitive, and I'm sure the shader path would work for us. But if a CUDA path can be implemented without too much effort it will definitely be a nicer solution.


Paul Miller <pa...@...>
 

On 3/30/2012 4:44 PM, Nathan Weston wrote:
On 03/30/2012 05:24 PM, Paul Miller wrote:
If we've processed a whole frame on the GPU, it would kill our
performance to copy it back to the CPU to run a colorspace transform.
Copying to an OpenGL texture to use the GLSL path would also hurt
performance, though not as badly. So a CUDA path would be very useful to
me.
Can't you just bind your target framebuffer as a texture? No copy
required and everything stays on the GPU.
CUDA doesn't really have framebuffers as such. OpenGL textures are
mapped to CUDA arrays, which you can copy into but can't write directly
from a kernel. So there's at least one copy going in, and probably
another going out (depending on how you want to access the data afterward).

Ah, when you said you were processing the whole frame on the GPU, I immediately thought OpenGL, not CUDA. Sorry for the confusion.


Jeremy Selan <jeremy...@...>
 


On Fri, Mar 30, 2012 at 12:58 PM, Nathan Weston <elb...@...> wrote:

Hmm, I didn't know that. I don't yet know anything at all about OCIO under the hood. If there's any documentation of this kind of stuff it would be very helpful.

I've put together a *very* rough document that describes how OCIO works internally. It needs a lot of work, but it's at least a start.  I'll add it to the mainline docs later this week after some cleanup.

https://github.com/jeremyselan/OpenColorIO/blob/docs/docs/InternalArchitecture.rst



When you say the color transforms are loaded at runtime, are you talking about additional code (e.g. a .so that implements a particular transform)? If it's simply reading LUTs or other datafiles then a CUDA implementation might be able to work off the same data. CUDA supports most of C++ on the device side so it's quite a bit more flexible than GLSL.

Currently, the latter (at runtime there's simple reading of LUTs / datafiles).  But the types of transforms, and the ordering, are defined on the fly as well.

In the middle-term we're considering the former, where processing operations themselves can be defined as plugins at runtime.  But this shouldn't be a killer.  Any plugin approach will have to support the existing GPU pathway, so CUDA is no worse off here.


A simple overview of OCIO:

At runtime, the $OCIO configuration defines what OCIO::Transform(s) are used for each color space conversion.

Example:  (ACES config, adx->aces conversion).

Transforms:
        - !<FileTransform> {src: adx_adx10_to_cdd.spimtx}
        - !<FileTransform> {src: adx_cdd_to_cid.spimtx}
        - !<FileTransform> {src: adx_cid_to_rle.spi1d, interpolation: linear}
        - !<LogTransform> {base: 10, direction: inverse}
        - !<FileTransform> {src: adx_exp_to_aces.spimtx}

The config->getProcessor call creates a Processor object, which embodies a 'pixel ready' color transform.  (When you call getProcessor, LUTs may be loaded, etc.)  And on the resulting processor object you can call applyRGB, etc.
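Concretely, the caller-side flow is roughly this ('pixels', 'width', and 'height' are the application's own packed RGBA float buffer and dimensions; error handling omitted):

#include <OpenColorIO/OpenColorIO.h>
namespace OCIO = OCIO_NAMESPACE;

// Load the config pointed to by $OCIO.
OCIO::ConstConfigRcPtr config = OCIO::GetCurrentConfig();

// Build the color transform (this is where LUT files get loaded).
OCIO::ConstProcessorRcPtr processor = config->getProcessor("adx10", "aces");

// Apply it in place to a packed RGBA float image.
OCIO::PackedImageDesc img(pixels, width, height, 4);
processor->apply(img);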

Example:
During config->getProcessor('adx10', 'aces'), this gets translated into a series of internal "Ops".  Ops are an OCIO implementation detail, and are not publicly exposed in the API.

There are only a limited number of Ops.  (MatrixOffset, Exponent, Log, Lut1D, Lut3D).  But the behavior of these ops may depend on data loaded at runtime. (The LUT ops, for example).

So thinking ahead...
It just may be possible to precompile a generic CUDA kernel for each of the predefined op types, and then at runtime for OCIO to loop over the Ops within the processor, load any kernel-specific data, and then apply the kernels to the data.

In CUDA, is it appropriate to have a 3rd-party library (OCIO) directly call a bunch of individual kernels on some passed-in data?

If so, then this just might work.  However, if going this route requires OCIO to expose publicly the internal Ops, then I would consider this a non-starter.  The internal ops cannot, and should not, become part of any exposed public interface.
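To make the question concrete, the internal host-side loop might look something like this (purely a hypothetical sketch -- none of these kernel or accessor names exist -- with the Ops staying private to OCIO):

// Hypothetical: one precompiled kernel per op type, applied in sequence
// to a buffer that is already resident on the device.
__global__ void MatrixOffsetKernel(float * pixels, long numPixels,
                                   const float * m44_offset4);
__global__ void Lut3DKernel(float * pixels, long numPixels,
                            const float * lut3d);

void ApplyOpsCUDA(const OpRcPtrVec & ops, float * d_pixels, long numPixels)
{
    const int blockSize = 256;
    const int gridSize = (int)((numPixels + blockSize - 1) / blockSize);

    for(unsigned int i=0; i<ops.size(); ++i)
    {
        switch(ops[i]->getCudaOpType())
        {
            case CUDA_OP_MATRIX:
                MatrixOffsetKernel<<<gridSize, blockSize>>>(
                    d_pixels, numPixels, ops[i]->getDeviceData());
                break;
            case CUDA_OP_LUT3D:
                Lut3DKernel<<<gridSize, blockSize>>>(
                    d_pixels, numPixels, ops[i]->getDeviceData());
                break;
            // ... Exponent, Log, Lut1D ...
        }
    }
    cudaDeviceSynchronize();
}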




I have to be slightly cagey about this for the moment since I work on a commercial product and we haven't yet announced our plans regarding OCIO. But broadly, we have an image processing application which has both CPU and CUDA codepaths depending on the available hardware. All of this is already implemented, so we're pretty much wedded to CUDA for the time being.

If we've processed a whole frame on the GPU, it would kill our performance to copy it back to the CPU to run a colorspace transform. Copying to an OpenGL texture to use the GLSL path would also hurt performance, though not as badly. So a CUDA path would be very useful to me.

Understood. (wink)

We'll get you a GPU-friendly solution that doesn't require moving the image back and forth.

-- Jeremy


Nathan Weston <elb...@...>
 

On 03/30/2012 05:56 PM, Jeremy Selan wrote:
There are only a limited number of Ops. (MatrixOffset, Exponent, Log,
Lut1D, Lut3D). But the behavior of these ops may depend on data loaded
at runtime. (The LUT ops, for example).

So thinking ahead...
It just may be possible to precompile a generic CUDA kernel for each of
the predefined op types, and then at runtime for OCIO to loop over the
Ops within the processor, load any kernel-specific data, and then apply
the kernels to the data.

In CUDA, Is it appropriate to have a 3rd party library (OCIO) directly
call a bunch of individual kernels on some passed in data?

If so, then this just might work. However, if going this route requires
OCIO to expose publicly the internal Ops, then I would consider this a
non-starter. The internal ops cannot, and should not, become part of
any exposed public interface.

Yes, that should be possible. There may be some performance implications. Memory access has high latency, so ideally you want to read your data once, do all the math in registers, then write the final result. Splitting an operation across kernels will require more round-trips to memory. But in practice the scheduler is pretty decent at hiding this latency, so it may not be a problem.

Another possibility is to write a single kernel, which takes the list of Ops as input and loops over them internally. This would require all the Ops to be available at compile time (which it sounds like is the case now, but may not be in the future).
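A rough sketch of that second approach (all names hypothetical; OpData would be a small POD description of one op, built on the host and copied to the device ahead of time):

enum OpType { OP_MATRIX, OP_EXPONENT, OP_LOG, OP_LUT1D, OP_LUT3D };

struct OpData
{
    OpType type;
    float params[20];      // matrix + offset, exponent values, log base, ...
    const float * lut;     // device pointer to LUT data, if any
    int lutSize;
};

// Single kernel: each thread handles one packed RGBA pixel and walks
// the op list, so the pixel is read and written exactly once.
__global__ void ApplyOpsKernel(float4 * pixels, int numPixels,
                               const OpData * ops, int numOps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i >= numPixels) return;

    float4 p = pixels[i];
    for(int n=0; n<numOps; ++n)
    {
        switch(ops[n].type)            // all math stays in registers
        {
            case OP_EXPONENT:
                p.x = powf(p.x, ops[n].params[0]);
                p.y = powf(p.y, ops[n].params[1]);
                p.z = powf(p.z, ops[n].params[2]);
                break;
            // ... other op types ...
        }
    }
    pixels[i] = p;
}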

Either way, I don't see any reason that the Ops need to be exposed in the API.

Thanks for the architecture overview. That should be a big help. I'll dig into the code next week and see what I can come up with.


Nathan Weston <elb...@...>
 

I've spent some time going over the OCIO code and doing some CUDA experiments, and I think I have an overall design that will work.

It will require some minor modifications to the Op classes (more on that below), but should end up with CPU and CUDA paths mostly sharing code, and thus producing identical results.

I would add a CMake option to enable CUDA. If that's turned on, OCIO will build with CUDA support and link against the CUDA runtime API -- so using CUDA or not is a compile-time decision. If CUDA is turned off, then OCIO won't have any additional dependencies beyond what it has today.

The Processor class will get a new method, applyCUDA(), which should be passed an ImageDesc that points to GPU memory. (Alternatively, we could put a GPU flag in the ImageDesc and have the existing apply() method dispatch appropriately.)

applyCUDA() will copy the list of Ops to the GPU. Then it will launch a kernel that, at each pixel, loops over each op and calls op->apply() on that pixel. In many cases, the existing code for apply() will work fine in CUDA. Some ops might need a specialized CUDA implementation for good performance. All of this code can go in one .cu file which is compiled by nvcc.

Any functions or methods that we want to run on the GPU ("device" code in CUDA parlance) need to be marked with the "__device__" qualifier so nvcc will pick them up. We can wrap this in a macro to avoid problems with the regular C++ compiler.
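For example (OCIO_DEVICE and ApplyExponent are just placeholder names for the pattern I have in mind; __CUDACC__ is defined whenever nvcc is the compiler):

#include <math.h>

#ifdef __CUDACC__
    #define OCIO_DEVICE __device__
#else
    #define OCIO_DEVICE
#endif

// Shared math, compiled for the device by nvcc and for the host by the
// regular C++ compiler:
OCIO_DEVICE inline float ApplyExponent(float v, float e)
{
    return powf(v, e);
}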

CUDA doesn't support separate compilation for device code. So our .cu file will need to #include the .cpp files for all the Ops. When nvcc compiles this, it will produce both host and device code -- but we already have host versions of all this code compiled separately, so we'll get duplicate symbols at link time. Thus we'll need to sprinkle around some #ifdefs to hide host code from nvcc.

Unfortunately, I've also discovered a weird quirk (maybe a bug) in nvcc. If I have code like this:
class Foo {
    __device__ void myMethod();
};

__device__ void Foo::myMethod() { /* ... */ }

it will still compile a host version of myMethod(), which causes link errors. In order to prevent this problem I have to move the method body into the class declaration. So we'd also have to inline some method definitions, which is not great.


All of this is assuming that we use CUDA's Runtime API, which handles all the nasty details of launching kernels automatically. But CUDA also has a lower-level Driver API. When using the Driver API, CUDA doesn't compile any host code -- so this would sidestep the duplicate symbol problem entirely. But it introduces some hassles of its own, and I don't have much experience with it.

To summarize:
With the runtime API
* Some functions/methods are marked with __device__
* Need ifdefs to prevent duplicate symbols
 * Need to inline method definitions to prevent duplicate symbols
* CUDA code is easier, but C++ code is uglier

With the driver API
* Some functions/methods are marked with __device__
* No other modifications to existing C++ code
* CUDA code is nastier

The public API will be identical in either case, so it makes no difference to the user of the library.

From my perspective, the runtime API is definitely the easier way to go. But it's not my project so I'm not really in a position to say how much rearranging/uglification of existing code is acceptable. I'll leave that call up to the powers that be. :)

-- Nathan


Jeremy Selan <jeremy...@...>
 

Thanks for looking into CUDA more. This could be really cool.

I would add a CMake option to enable CUDA. If that's turned on, OCIO will
build with CUDA support and link against the CUDA runtime API -- so using
CUDA or not is a compile-time decision. If CUDA is turned off, then OCIO
won't have any additional dependencies beyond what it has today.

Excellent.

The Processor class will get a new method, applyCUDA(), which should be
passed an ImageDesc that points to GPU memory. (Alternatively, we could put
a GPU flag in the ImageDesc and have the existing apply() method dispatch
appropriately.)

Is there a C++ wrapped handle to GPU memory in CUDA? If so, would it
be possible to define a custom CUDAImageDesc, which derives off of
ImageDesc, and wraps your CUDA memory handle? (Would we need two
flavors for packed or planar images?)

There's conceptually two 'axes' here:
- Whether the image memory is on GPU or CPU
- Whether we want to process on the GPU or CPU.

I could imagine an implementation where we have both applyCPU(...) and
applyGPU(...), and you could pass either a CPUImageDesc or a
GPUImageDesc to either one. So 2x2 = 4 cases. Example: if you called
applyGPU on a CPUImageDesc it would copy it to the GPU, apply it on the
GPU, and copy back. But this API approach, while explicit, seems fraught
with extra copies and wouldn't encourage 'fast' practice.

So I think my inclination is to do something simpler. (Your second
case). Have only a single processor->apply(...) function which
dispatches internally. Even when compiled with CUDA, if you call
apply(...) on a normal ImageDesc it will apply the CPU path. But if
you call it on a CUDAImageDesc, it will apply it on the GPU.

Does this make sense?
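As a strawman, the public-facing piece might look something like this (hypothetical sketch only; 'd_pixels', 'width', and 'height' belong to the host app, and 'd_pixels' is a device pointer from cudaMalloc or an interop mapping):

class CUDAPackedImageDesc : public ImageDesc
{
public:
    // 'data' is a device pointer to packed float pixels.
    CUDAPackedImageDesc(float * data,
                        long width, long height,
                        long numChannels);
    // ...
};

// Usage: the single apply() dispatches to the CUDA path when handed
// a CUDA image desc, and to the CPU path otherwise.
CUDAPackedImageDesc img(d_pixels, width, height, 4);
processor->apply(img);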


applyCUDA() will copy the list of Ops to the GPU. Then it will launch a
kernel that, at each pixel, loops over each op and calls op->apply() on that
pixel. In many cases, the existing code for apply() will work fine in CUDA.
Some ops might need a specialized CUDA implementation for good performance.

I'd like to make sure that even in a CUDA-enabled OCIO, the CPU path
continues to work. Let's consider this goal #1.

All of this code can go in one .cu file which is compiled by nvcc.

Are you saying that all of our implementations need to be in a single
.cu file within the source tree? Or that at build-time a single .cu
file will be generated and compiled?

If possible, I'd really like to maintain implementations of each op in
separate files. (Though if we need both a Lut1DOp.cpp and Lut1DOp.cu
that's not too bad).

All of this is assuming that we use CUDA's Runtime API, which handles all
the nasty details of launching kernels automatically. But CUDA also has a
lower-level Driver API. When using the Driver API, CUDA doesn't compile any
host code -- so this would sidestep the duplicate symbol problem entirely.
But it introduces some hassles of its own, and I don't have much experience
with it.

To summarize:
With the runtime API
 * Some functions/methods are marked with __device__
 * Need ifdefs to prevent duplicate symbols
 * Need to inline method definitions to prevent duplicate symbols
 * CUDA code is easier, but C++ code is uglier

With the driver API
 * Some functions/methods are marked with __device__
 * No other modifications to existing C++ code
 * CUDA code is nastier

The public API will be identical in either case, so it makes no difference
to the user of the library.

From my perspective, the runtime API is definitely the easier way to go. But
it's not my project so I'm not really in a position to say how much
rearranging/uglification of existing code is acceptable. I'll leave that
call up to the powers that be. :)

I'm a CUDA newbie so I'll defer to you here. What's more maintainable
in the long run? If the runtime API is what most people use, I'd be
more comfortable following the pack, unless there's a really
compelling argument to use the raw driver API.

Re-arranging of code I'm cool with.

Uglification of code I'd like to avoid unless absolutely necessary.

Are there other libraries that are in a similar boat to OCIO and also
support CUDA? Surely others have looked into the tradeoff between the
two APIs. Can anyone suggest prior 'best practice' we can copy?


------


So what are the next steps?

I think my preference would be for you to
- mockup the public API
- write CUDA support for only the simplest possible Op, such as 'ExponentOp'
- copy src/apps/ocioconvert -> src/apps/ociocudaconvert, and update
this example to load to a cuda buffer, process using OCIO, copy back
to host memory, and then save to a file.

Once those are done, we can iterate on this trivial case until we get
an API / file layout we all like.

Then we should implement Lut1DOp, in the inverse direction, which I
believe will be the hardest Op to get working. If we can prove that
one works, I think the rest will fall into place.

Bonus points if we add unit tests during Op porting that compare the
CPU to GPU pathways and confirm the results are within expected bounds.

-- Jeremy


Jeremy Selan <jeremy...@...>
 

One implementation detail I just thought of...

Some of the Ops (such as the ones that use LUTs) will require explicit
allocations on the device side. Is it ok for OCIO to make these
allocations internally, or is it bad practice for CUDA-enabled
libraries to do this under the hood?

Assuming it's ok for OCIO to make device allocations, when are these
released? My intuition says that when the OCIO::Processor destructor
is called, we release the underlying gpu memory. And then if someone
wants to keep a particular transform 'resident', they need to hold
onto the Processor in their host app.

-- Jeremy


Dithermaster <dither...@...>
 

You can wrap CUDA memory objects around OpenGL textures, FWIW. Same with OpenCL images.

It might make some sense to make the transfer operations "just in time" so the bits get transferred to the appropriate context (CPU or GPU) on-demand. That way, a sequence of GPU operations won't have to transfer back to CPU in-between.





Nathan Weston <elb...@...>
 

On 4/9/2012 7:21 PM, Jeremy Selan wrote:
One implementation detail I just thought of...

Some of the Ops (such as the ones that use LUTs) will require explicit
allocations on the device side. Is it is ok for OCIO to make these
allocations internally, or is it bad practice for CUDA enabled
libraries to do this under the hood?

Assuming it's ok for OCIO to make device allocations, when are these
released? My intuition says that when the OCIO::Processor destructor
is called, we release the underlying gpu memory. And then if someone
wants to keep a particular transform 'resident', they need to hold
onto the Processor in their host app.

Sure, it's no problem for OCIO to make device allocations. Attaching them to the Processor makes sense -- that way the user of the library can easily control when device memory is freed.
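Roughly what I have in mind is a small RAII holder owned by the Processor (sketch only; the class name is made up):

#include <cuda_runtime.h>

class CudaDeviceBuffer
{
public:
    explicit CudaDeviceBuffer(size_t bytes) : m_data(0)
    {
        cudaMalloc(&m_data, bytes);
    }
    ~CudaDeviceBuffer()
    {
        if(m_data) cudaFree(m_data);
    }
    void * data() const { return m_data; }
private:
    void * m_data;
    // Non-copyable: the device memory lives exactly as long as the
    // owning Processor does.
    CudaDeviceBuffer(const CudaDeviceBuffer &);
    CudaDeviceBuffer & operator=(const CudaDeviceBuffer &);
};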

-- Nathan


Nathan Weston <elb...@...>
 

On 4/9/2012 7:12 PM, Jeremy Selan wrote:
Thanks for looking into CUDA more. This could be really cool.

I would add a CMake option to enable CUDA. If that's turned on, OCIO will
build with CUDA support and link against the CUDA runtime API -- so using
CUDA or not is a compile-time decision. If CUDA is turned off, then OCIO
won't have any additional dependencies beyond what it has today.
Excellent.

The Processor class will get a new method, applyCUDA(), which should be
passed an ImageDesc that points to GPU memory. (Alternatively, we could put
a GPU flag in the ImageDesc and have the existing apply() method dispatch
appropriately.)

Is there a C++ wrapped handle to GPU memory in CUDA? If so, would it
be possible to define a custom CUDAImageDesc, which derives off of
ImageDesc, and wraps your CUDA memory handle? (Would we need two
flavors for packed or planar images?)

There's conceptually two 'axes' here:
- Whether the image memory is on GPU or CPU
- Whether we want to process on the GPU or CPU.

I could imagine an implementation where we have both applyCPU(...) and
applyGPU(...). And then you can pass both CPUImageDesc, and
GPUImageDesc, to either. So 2x2=4 cases. Example: If you called
applyGPU on a CUDAImageDesc it would copy it to the GPU, apply in GPU,
and copy back. But this API approach, while explicit, seems frought
with extra copies and wouldnt encourage 'fast' practice.

So I think my inclination is to do something simpler. (Your second
case). Have only a single processor->apply(...) function which
dispatches internally. Even when compiled with CUDA, if you call
apply(...) on a normal ImageDesc it will apply the CPU path. But if
you call it on a CUDAImageDesc, it will apply it on the GPU.

Does this make sense?

Yes, sounds good to me.


applyCUDA() will copy the list of Ops to the GPU. Then it will launch a
kernel that, at each pixel, loops over each op and calls op->apply() on that
pixel. In many cases, the existing code for apply() will work fine in CUDA.
Some ops might need a specialized CUDA implementation for good performance.
I'd like to make sure that even in a CUDA-enabled OCIO, the CPU path
continues to work. Let's consider this goal #1.

All of this code can go in one .cu file which is compiled by nvcc.
Are you saying that all of our implementations need to be in a single
.cu file within the source tree? Or that at build-time a single .cu
file will be generated and compiled?

If possible, I'd really like to maintain implementations of each op in
separate files. (Though if we need both a Lut1DOp.cpp and Lut1DOp.cu
that's not too bad).

All the device code for a given kernel has to ultimately end up in a single compilation unit somehow. You can still keep the code in separate files, but you have to #include it in your main .cu file.

I'm imagining we'd have something like ProcessCuda.cu where the actual kernel is defined. Since that will depend on the ops, it would have
#include "Lut1DOp.cpp"
#include "MatrixOps.cpp"

etc.

All of this is assuming that we use CUDA's Runtime API, which handles all
the nasty details of launching kernels automatically. But CUDA also has a
lower-level Driver API. When using the Driver API, CUDA doesn't compile any
host code -- so this would sidestep the duplicate symbol problem entirely.
But it introduces some hassles of its own, and I don't have much experience
with it.

To summarize:
With the runtime API
* Some functions/methods are marked with __device__
* Need ifdefs to prevent duplicate symbols
* Need to inline method definitions to prevent duplicate symbols
* CUDA code is easier, but C++ code is uglier

With the driver API
* Some functions/methods are marked with __device__
* No other modifications to existing C++ code
* CUDA code is nastier

The public API will be identical in either case, so it makes no difference
to the user of the library.

From my perspective, the runtime API is definitely the easier way to go. But
it's not my project so I'm not really in a position to say how much
rearranging/uglification of existing code is acceptable. I'll leave that
call up to the powers that be. :)
I'm a CUDA newbie so I'll defer to you here. What's more maintainable
in the long run? If the runtimeAPI is what most people use, I'd be
more comfortable following the pack, unless there's a really
compelling argument to use the raw driver API.

Re-arranging of code I'm cool with.

Uglification of code I'd like to avoid unless absolutely necessary.

Are there other libraries that are in a similar boat to OCIO, and also
support CUDA? Surely other have probably looked into the tradeoff of
the two API. Can anyone suggest prior 'best practice' we can copy?


------


So what are the next steps?

I think my preference would be for you to
- mockup the public API
- write CUDA support for only the simplest possible Op, such as 'ExponentOp'
- copy src/apps/ocioconvert -> src/apps/ociocudaconvert, and update
this example to load to a cuda buffer, process using OCIO, copy back
to host memory, and then save to a file.

One there are done, we can iterate on this trivial case until we get
an API / file layout we all like.

Then we should implement Lut1DOp, in the inverse direction, which I
believe will be the hardest Op to get working. If we can prove that
one works, I think the rest will fall into place.

Bonus point if we add unit tests during Op porting that compare the
CPU to GPU pathways and confirms it's within expected bounds.

Sounds good to me. I'll do this with the runtime API, since that's easier, and we can see how bad the code changes are. If I need to switch to the driver API at that point it shouldn't be too much extra effort.

-- Nathan


Nathan Weston <elb...@...>
 

On 4/9/2012 7:12 PM, Jeremy Selan wrote:
So what are the next steps?

I think my preference would be for you to
- mockup the public API
- write CUDA support for only the simplest possible Op, such as 'ExponentOp'
- copy src/apps/ocioconvert -> src/apps/ociocudaconvert, and update
this example to load to a cuda buffer, process using OCIO, copy back
to host memory, and then save to a file.

Once those are done, we can iterate on this trivial case until we get
an API / file layout we all like.

This is done now. My code is on GitHub: https://github.com/nweston/OpenColorIO/tree/cuda

It worked out a little differently than I had planned. I ended up with a parallel class hierarchy of CudaOps. This doesn't result in too much duplicated code since the Ops typically call a function to do most of the work of apply().

I had to move some code into different files, but on the whole the changes to existing code weren't as bad as I expected.

The public API just consists of two new ImageDesc classes, for packed/planar CUDA images.

There are two limitations at the moment:
1. nvcc doesn't support C++0x yet, so the CUDA path only builds if OCIO_USE_BOOST_PTR is enabled. I don't think we really need smart pointers anywhere in the CUDA code, so we ought to be able to work around this, but I haven't tried it yet.

2. The current implementation requires CUDA 4.0 and a Fermi card, because it makes virtual calls in device code. Eventually I'd like to support older cards, but I can worry about that later.

Let me know what you think so far.

-- Nathan