Nathan Weston <elb...@...>
I'm currently integrating OpenColorIO into an application that uses CUDA for GPU processing. In order to use OCIO's shader path, we'd need to copy our images over to OpenGL textures and back again. If OCIO had a CUDA path, it would be cleaner and faster.
Has anyone looked into implementing such a thing? If I were to implement it myself, is there any interest in including it in OCIO?
Jeremy Selan <jeremy...@...>
I don't think anyone has looked into implementing a CUDA pathway, but I'm very open to such ideas. Someone did ask about an OpenCL implementation recently, but I believe it's still in the concept stage. A few thoughts on the concept...

Our current GPU implementation does not attempt to match the CPU implementation, by design. The CPU codepath performs the full analytical color operations per pixel, while the GPU GLSL/Cg implementation relies on a combination of analytical shader text code generation and 3D LUT sampling. Color operations which can be done in simple shader text (such as math ops) all happen in the GLSL shader, but if the user references multiple 3D LUTs, for example, they all get baked into a single 3D LUT.

I was always hoping that, if we ever implemented a CUDA or OpenCL pathway, it would be more akin to the GPU code path but do more of the work analytically. I'm not sure if this is possible, but I think it's a nice ideal for a 'compute' context.

Another nicety of the current implementation is that even though we support GPUs, OpenColorIO doesn't actually link against libGL, etc. The 'GPU API' conceptually only deals with POD types, returning the float* 3D LUT and the const char* shader text.
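Concretely, the host app's side of that exchange looks roughly like this (sketch; error handling omitted):

    #include <vector>
    #include <OpenColorIO/OpenColorIO.h>
    namespace OCIO = OCIO_NAMESPACE;

    void setupGpuPath(OCIO::ConstProcessorRcPtr processor)
    {
        OCIO::GpuShaderDesc shaderDesc;
        shaderDesc.setLanguage(OCIO::GPU_LANGUAGE_GLSL_1_3);
        shaderDesc.setFunctionName("OCIODisplay");
        shaderDesc.setLut3DEdgeLen(32);

        // const char* shader text: the app compiles this into its own GLSL program.
        const char* shaderText = processor->getGpuShaderText(shaderDesc);

        // float* 3D LUT: the app uploads this as its own GL texture.
        int edge = shaderDesc.getLut3DEdgeLen();
        std::vector<float> lut3d(3 * edge * edge * edge);
        processor->getGpuLut3D(&lut3d[0], shaderDesc);

        (void)shaderText; // OCIO itself never touches GL; the app owns all GL objects
    }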
My hope would be that, if possible, a CUDA / OpenCL pathway wouldn't impose any new linking requirements on the core library, but would instead support new code paths using simple data types.

-- Jeremy
Nathan Weston <elb...@...>
Cool. If all goes well, I can hopefully find time to work on this over the next couple of months.
CUDA historically hasn't supported dynamic generation/compilation of kernels. I believe it's possible with newer versions of the compiler, but only with the lower-level driver API. A statically-compiled kernel is probably a better bet, which would tend to point toward a more analytical approach along the lines of your CPU codepath.
I've had pretty good luck in the past sharing code between C++ and CUDA in order to implement parallel code paths that produce the same results. The only snags are that 1) the shared code has to go into header files, and 2) virtual functions require CUDA 4.0 and Fermi hardware.

If possible I'd like to support older cards and CUDA toolkits, which means no virtual functions.
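The pattern I've used looks roughly like this (illustrative sketch; the macro and function names are made up):

    // shared_ops.h -- sketch of the header-sharing pattern
    #include <math.h>

    // nvcc defines __CUDACC__; for a plain C++ compiler the qualifier vanishes.
    #ifdef __CUDACC__
    #define OCIO_HOSTDEVICE __host__ __device__
    #else
    #define OCIO_HOSTDEVICE
    #endif

    // One implementation, compiled for both the CPU path and the CUDA kernels.
    OCIO_HOSTDEVICE inline float applyGamma(float v, float gamma)
    {
        return (v > 0.0f) ? powf(v, gamma) : v;
    }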
Is there any documentation of the OCIO internals to help me get my bearings?
Jeremy Selan <jeremy...@...>
Excellent, looking forward to seeing what you do.
You mention CUDA (historically) doesn't support dynamic compilation of kernels? That would imply an implementation that looks more like a fixed-function processing path, unfortunately. The downside is that the results would potentially be even less accurate than either our current CPU or GPU pathways.

Recall that in OCIO, all of the color transforms are dynamically loaded at runtime, so at library compile time there's no way to know what processing will be required for a given color transform. (You roughly know the building blocks, but not how they will be chained together.)

Even our current GLSL codepath, which leverages a single 3D LUT, tries to do as much as possible in the fragment shader at runtime. (Its pipeline looks like [GLSL CODE + 3DLUT + GLSL CODE], with as much done in code as possible.)
What type of CUDA application are you writing? Are you looking for OCIO in CUDA for performance reasons? Are you looking for OCIO to match the quality of the GPU? Perhaps we can come up with an alternate implementation approach, or decide that it's better to just target recent CUDA versions.
On first glance, it appears that OpenCL may support dynamic compilation, and thus be easier to match the CPU 1:1. Can anyone with OpenCL experience chime in?
clCreateProgramWithSource(...), etc.
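i.e. roughly this sort of thing, judging from the docs (untested sketch):

    #include <CL/cl.h>

    // Build a kernel from generated source at runtime (error checks omitted).
    cl_kernel buildKernel(cl_context ctx, cl_device_id dev, const char* source)
    {
        cl_int err = 0;
        cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
        clBuildProgram(program, 1, &dev, "", NULL, NULL);

        // The entry-point name would be generated along with the source.
        return clCreateKernel(program, "ocio_transform", &err);
    }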
-- Jeremy
Larry Gritz <l...@...>
IIRC, although it's not quite as easy as OpenCL, these days CUDA can dynamically compile kernels. (OpenCL was able to do that all along.)
The other advantage of CUDA is that it's really C++ with a couple of minor additions, which may make porting our existing code a lot easier, and also lets you use all your favorite C++ features such as classes and templates. OpenCL is its own thing (though very C-like).
The disadvantage is, of course, less HW and vendor independence.
Nathan Weston <elb...@...>
On 03/30/2012 03:02 PM, Jeremy Selan wrote:
> Recall that in OCIO, all of the color transforms are dynamically loaded at runtime, so at library compile time there's no way to know what processing will be required for a given color transform. (You roughly know the building blocks, but not how they will be chained together.)

Hmm, I didn't know that. I don't yet know anything at all about OCIO under the hood. If there's any documentation of this kind of stuff, it would be very helpful.

When you say the color transforms are loaded at runtime, are you talking about additional code (e.g. a .so that implements a particular transform)? If it's simply reading LUTs or other data files, then a CUDA implementation might be able to work off the same data. CUDA supports most of C++ on the device side, so it's quite a bit more flexible than GLSL.

> What type of CUDA application are you writing? Are you looking for OCIO in CUDA for performance reasons? Are you looking for OCIO to match the quality of the GPU? Perhaps we can come up with an alternate implementation approach, or decide that it's better to just target recent CUDA versions.

I have to be slightly cagey about this for the moment, since I work on a commercial product and we haven't yet announced our plans regarding OCIO. But broadly, we have an image processing application which has both CPU and CUDA codepaths depending on the available hardware. All of this is already implemented, so we're pretty much wedded to CUDA for the time being.

If we've processed a whole frame on the GPU, it would kill our performance to copy it back to the CPU to run a colorspace transform. Copying to an OpenGL texture to use the GLSL path would also hurt performance, though not as badly. So a CUDA path would be very useful to me.

> On first glance, it appears that OpenCL may support dynamic compilation, and thus be easier to match the CPU 1:1. Can anyone with OpenCL experience chime in?

I haven't done any OpenCL programming, but I have paid close attention to it over the years. It does support dynamic compilation. As Larry mentioned, it's somewhat lacking in language features compared to CUDA (e.g. templates and other C++ stuff).

There isn't any way to share data between OpenCL and CUDA, so an OpenCL path wouldn't be of much use to me. Which, as you can imagine, somewhat reduces my motivation to write one. :)
Nathan Weston <elb...@...>
That was my recollection as well, but I can't find any documentation or examples of this. Maybe I'm missing something. Do you remember any details?
All I can turn up are some 3rd-party tools which run the nvcc compiler as a subprocess. nvcc is part of the CUDA SDK, but I don't think it's shipped with the driver or runtime libraries that would normally be installed on end-user machines, so this kind of approach could get complicated when it comes to deployment.
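For reference, the pattern those tools use looks roughly like this, with a CUDA 4.0-era driver API (untested sketch; the paths and kernel name are made up):

    #include <cstdlib>
    #include <cuda.h>

    // Compile generated source with nvcc at runtime, then load it via the driver API.
    void runGeneratedKernel(void* kernelArgs[])
    {
        // 1) Write the generated .cu to disk somewhere, then shell out to nvcc.
        std::system("nvcc -ptx /tmp/ocio_kernel.cu -o /tmp/ocio_kernel.ptx");

        // 2) Load and launch the result through the driver API.
        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

        CUmodule mod;   cuModuleLoad(&mod, "/tmp/ocio_kernel.ptx");
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "ocio_transform");
        cuLaunchKernel(fn, 256, 1, 1,   // grid
                           256, 1, 1,   // block
                           0, 0,        // shared mem, stream
                           kernelArgs, 0);
    }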
Dithermaster <dither...@...>
> There isn't any way to share data between OpenCL and CUDA

Not entirely true. If you base them both on an OpenGL context, you can create OpenCL images from OpenGL textures, and you can access OpenGL textures from CUDA as well.
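On the CUDA side, the interop goes roughly like this (sketch from memory, untested):

    #include <GL/gl.h>
    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>

    // Make an OpenGL texture visible to CUDA as a cudaArray (error checks omitted).
    cudaArray* mapGLTexture(GLuint tex, cudaGraphicsResource** resOut)
    {
        // Register once per texture...
        cudaGraphicsGLRegisterImage(resOut, tex, GL_TEXTURE_2D,
                                    cudaGraphicsRegisterFlagsReadOnly);

        // ...then map per frame and fetch the underlying array.
        cudaGraphicsMapResources(1, resOut, 0);
        cudaArray* array = 0;
        cudaGraphicsSubResourceGetMappedArray(&array, *resOut, 0, 0);
        return array;   // copy from it / bind it, then unmap when done
    }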
///d@
Paul Miller

> If we've processed a whole frame on the GPU, it would kill our performance to copy it back to the CPU to run a colorspace transform. Copying to an OpenGL texture to use the GLSL path would also hurt performance, though not as badly. So a CUDA path would be very useful to me.

Can't you just bind your target framebuffer as a texture? No copy required and everything stays on the GPU.
Nathan Weston <elb...@...>
On 03/30/2012 05:24 PM, Paul Miller wrote:
> Can't you just bind your target framebuffer as a texture? No copy required and everything stays on the GPU.

CUDA doesn't really have framebuffers as such. OpenGL textures are mapped to CUDA arrays, which you can copy into but can't write directly from a kernel. So there's at least one copy going in, and probably another going out (depending on how you want to access the data afterward). In addition, we typically work with planar images, so we'd add packing/unpacking overhead. And we're not doing any display, so we don't currently use OpenGL at all.

None of this overhead is prohibitive, and I'm sure the shader path would work for us. But if a CUDA path can be implemented without too much effort, it will definitely be a nicer solution.
Paul Miller

On 3/30/2012 4:44 PM, Nathan Weston wrote:
> CUDA doesn't really have framebuffers as such. OpenGL textures are mapped to CUDA arrays, which you can copy into but can't write directly from a kernel. So there's at least one copy going in, and probably another going out (depending on how you want to access the data afterward).

Ah, when you said you were processing the whole frame on the GPU, I immediately thought OpenGL, not CUDA. Sorry for the confusion.
Jeremy Selan <jeremy...@...>
On Fri, Mar 30, 2012 at 12:58 PM, Nathan Weston <elb...@...> wrote:
> Hmm, I didn't know that. I don't yet know anything at all about OCIO under the hood. If there's any documentation of this kind of stuff, it would be very helpful.
>
> When you say the color transforms are loaded at runtime, are you talking about additional code (e.g. a .so that implements a particular transform)? If it's simply reading LUTs or other data files, then a CUDA implementation might be able to work off the same data.

Currently, the latter: at runtime there's simple reading of LUTs / data files, but the types of transforms, and their ordering, are defined on the fly as well. In the middle term we're considering the former, where processing operations themselves can be defined as plugins at runtime. But this shouldn't be a killer -- any plugin approach will have to support the existing GPU pathway, so CUDA is no worse off here.
A simple overview of OCIO: at runtime, the $OCIO configuration defines which OCIO::Transform(s) are used for each color space conversion. Example (ACES config, adx -> aces conversion):

    Transforms:
      - !<FileTransform> {src: adx_adx10_to_cdd.spimtx}
      - !<FileTransform> {src: adx_cdd_to_cid.spimtx}
      - !<FileTransform> {src: adx_cid_to_rle.spi1d, interpolation: linear}
      - !<LogTransform> {base: 10, direction: inverse}
      - !<FileTransform> {src: adx_exp_to_aces.spimtx}

The config->getProcessor call creates a Processor object, which embodies a 'pixel ready' color transform. (When you call getProcessor, LUTs may be loaded, etc.) On the resulting Processor object you can call applyRGB, etc.

For example, during config->getProcessor('adx10', 'aces'), the transforms above get translated into a series of internal "Ops". Ops are an OCIO implementation detail and are not publicly exposed in the API. There are only a limited number of Ops (MatrixOffset, Exponent, Log, Lut1D, Lut3D), but the behavior of these Ops may depend on data loaded at runtime (the LUT Ops, for example).

So thinking ahead... it just may be possible to precompile a generic CUDA kernel for each of the predefined Op types, and then at runtime have OCIO loop over the Ops within the Processor, load any kernel-specific data, and apply the kernels to the data.

In CUDA, is it appropriate to have a 3rd-party library (OCIO) directly call a bunch of individual kernels on some passed-in data?

If so, then this just might work. However, if going this route requires OCIO to publicly expose the internal Ops, then I would consider this a non-starter. The internal Ops cannot, and should not, become part of any exposed public interface.
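To make the overview above concrete, the client side of the existing CPU pathway is roughly this (sketch; error handling omitted):

    #include <OpenColorIO/OpenColorIO.h>
    namespace OCIO = OCIO_NAMESPACE;

    void convertInPlace(float* rgbaPixels, long width, long height)
    {
        // $OCIO (or an explicit path) selects the config.
        OCIO::ConstConfigRcPtr config = OCIO::GetCurrentConfig();

        // Building the Processor is where LUTs get loaded, Ops get created, etc.
        OCIO::ConstProcessorRcPtr processor = config->getProcessor("adx10", "aces");

        // Apply in place on a packed RGBA buffer in host memory.
        OCIO::PackedImageDesc img(rgbaPixels, width, height, 4);
        processor->apply(img);
    }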
> I have to be slightly cagey about this for the moment, since I work on a commercial product and we haven't yet announced our plans regarding OCIO. But broadly, we have an image processing application which has both CPU and CUDA codepaths depending on the available hardware. All of this is already implemented, so we're pretty much wedded to CUDA for the time being.
>
> If we've processed a whole frame on the GPU, it would kill our performance to copy it back to the CPU to run a colorspace transform. Copying to an OpenGL texture to use the GLSL path would also hurt performance, though not as badly. So a CUDA path would be very useful to me.

Understood. (wink)

We'll get you a GPU-friendly solution that doesn't require moving the image back and forth.
-- Jeremy
Nathan Weston <elb...@...>
On 03/30/2012 05:56 PM, Jeremy Selan wrote:
> There are only a limited number of Ops (MatrixOffset, Exponent, Log, Lut1D, Lut3D), but the behavior of these Ops may depend on data loaded at runtime (the LUT Ops, for example).
>
> So thinking ahead... it just may be possible to precompile a generic CUDA kernel for each of the predefined Op types, and then at runtime have OCIO loop over the Ops within the Processor, load any kernel-specific data, and apply the kernels to the data.
>
> In CUDA, is it appropriate to have a 3rd-party library (OCIO) directly call a bunch of individual kernels on some passed-in data?
>
> If so, then this just might work. However, if going this route requires OCIO to publicly expose the internal Ops, then I would consider this a non-starter. The internal Ops cannot, and should not, become part of any exposed public interface.

Yes, that should be possible. There may be some performance implications: memory access has high latency, so ideally you want to read your data once, do all the math in registers, then write the final result. Splitting an operation across kernels will require more round trips to memory. But in practice the scheduler is pretty decent at hiding this latency, so it may not be a problem.

Another possibility is to write a single kernel which takes the list of Ops as input and loops over them internally. This would require all the Ops to be available at compile time (which it sounds like is the case now, but may not be in the future).

Either way, I don't see any reason that the Ops need to be exposed in the API.

Thanks for the architecture overview. That should be a big help. I'll dig into the code next week and see what I can come up with.
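To sketch the single-kernel idea (completely untested; the op representation here is invented for illustration):

    // The op representation here is made up purely for illustration.
    enum OpType { OP_MATRIX_OFFSET, OP_EXPONENT /*, ... */ };

    struct OpData
    {
        OpType type;
        float  params[20];   // matrix + offset, exponents, etc.
    };

    __global__ void applyOpsKernel(float4* pixels, int numPixels,
                                   const OpData* ops, int numOps)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numPixels) return;

        float4 p = pixels[i];                // one read from global memory...
        for (int o = 0; o < numOps; ++o)     // ...all the math stays in registers...
        {
            switch (ops[o].type)
            {
            case OP_EXPONENT:
                p.x = powf(p.x, ops[o].params[0]);
                p.y = powf(p.y, ops[o].params[1]);
                p.z = powf(p.z, ops[o].params[2]);
                break;
            case OP_MATRIX_OFFSET:
                /* ... */
                break;
            }
        }
        pixels[i] = p;                       // ...one write at the end.
    }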
Nathan Weston <elb...@...>
I've spent some time going over the OCIO code and doing some CUDA experiments, and I think I have an overall design that will work. It will require some minor modifications to the Op classes (more on that below), but it should end up with the CPU and CUDA paths mostly sharing code, and thus producing identical results.

I would add a CMake option to enable CUDA. If that's turned on, OCIO will build with CUDA support and link against the CUDA runtime API -- so using CUDA or not is a compile-time decision. If CUDA is turned off, then OCIO won't have any additional dependencies beyond what it has today.

The Processor class will get a new method, applyCUDA(), which should be passed an ImageDesc that points to GPU memory. (Alternatively, we could put a GPU flag in the ImageDesc and have the existing apply() method dispatch appropriately.)

applyCUDA() will copy the list of Ops to the GPU. Then it will launch a kernel that, at each pixel, loops over each op and calls op->apply() on that pixel. In many cases, the existing code for apply() will work fine in CUDA. Some ops might need a specialized CUDA implementation for good performance.

All of this code can go in one .cu file which is compiled by nvcc. Any functions or methods that we want to run on the GPU ("device" code in CUDA parlance) need to be marked with the "__device__" qualifier so nvcc will pick them up. We can wrap this in a macro to avoid problems with the regular C++ compiler.

CUDA doesn't support separate compilation for device code, so our .cu file will need to #include the .cpp files for all the Ops. When nvcc compiles this, it will produce both host and device code -- but we already have host versions of all this code compiled separately, so we'll get duplicate symbols at link time. Thus we'll need to sprinkle around some #ifdefs to hide host code from nvcc.

Unfortunately, I've also discovered a weird quirk (maybe a bug) in nvcc. If I have code like this:

    class Foo
    {
        __device__ void myMethod();
    };

    __device__ void Foo::myMethod() { /* ... */ }

it will still compile a host version of myMethod(), which causes link errors. In order to prevent this problem I have to move the method body into the class declaration. So we'd also have to inline some method definitions, which is not great.

All of this assumes we use CUDA's runtime API, which handles all the nasty details of launching kernels automatically. But CUDA also has a lower-level driver API. When using the driver API, CUDA doesn't compile any host code -- so this would sidestep the duplicate symbol problem entirely. But it introduces some hassles of its own, and I don't have much experience with it.

To summarize, with the runtime API:
* Some functions/methods are marked with __device__
* Need ifdefs to prevent duplicate symbols
* Need to inline method definitions to avoid duplicate symbols
* CUDA code is easier, but C++ code is uglier

With the driver API:
* Some functions/methods are marked with __device__
* No other modifications to existing C++ code
* CUDA code is nastier

The public API will be identical in either case, so it makes no difference to the user of the library. From my perspective, the runtime API is definitely the easier way to go. But it's not my project, so I'm not really in a position to say how much rearranging/uglification of existing code is acceptable. I'll leave that call up to the powers that be. :)

-- Nathan
Jeremy Selan <jeremy...@...>
Thanks for looking into CUDA more. This could be really cool.

> I would add a CMake option to enable CUDA. If that's turned on, OCIO will build with CUDA support and link against the CUDA runtime API -- so using CUDA or not is a compile-time decision. If CUDA is turned off, then OCIO won't have any additional dependencies beyond what it has today.

Excellent.

> The Processor class will get a new method, applyCUDA(), which should be passed an ImageDesc that points to GPU memory. (Alternatively, we could put a GPU flag in the ImageDesc and have the existing apply() method dispatch appropriately.)

Is there a C++-wrapped handle to GPU memory in CUDA? If so, would it be possible to define a custom CUDAImageDesc, which derives from ImageDesc and wraps your CUDA memory handle? (Would we need two flavors, for packed and planar images?)

There are conceptually two 'axes' here:
- whether the image memory is on the GPU or the CPU
- whether we want to process on the GPU or the CPU

I could imagine an implementation where we have both applyCPU(...) and applyGPU(...), and you can pass either a CPUImageDesc or a GPUImageDesc to either, so 2x2 = 4 cases. For example, if you called applyGPU on a CPUImageDesc, it would copy the image to the GPU, apply on the GPU, and copy back. But this API approach, while explicit, seems fraught with extra copies and wouldn't encourage 'fast' practice.

So I think my inclination is to do something simpler (your second case): have only a single processor->apply(...) function which dispatches internally. Even when compiled with CUDA, calling apply(...) on a normal ImageDesc will use the CPU path, but calling it on a CUDAImageDesc will apply the transform on the GPU.

Does this make sense?
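In rough code, something like this (a sketch only; the class names are placeholders, not a committed design):

    struct ImageDesc { virtual ~ImageDesc() {} };

    // Wraps a packed image that already lives in CUDA device memory.
    struct CUDAImageDesc : public ImageDesc
    {
        float* deviceData;
        long width, height, numChannels;
        CUDAImageDesc(float* d, long w, long h, long c)
            : deviceData(d), width(w), height(h), numChannels(c) {}
    };

    struct Processor
    {
        void apply(ImageDesc& img) const
        {
            // Single entry point: dispatch on the concrete ImageDesc type.
            if (CUDAImageDesc* cudaImg = dynamic_cast<CUDAImageDesc*>(&img))
                applyCUDA(*cudaImg);    // process in place on the GPU
            else
                applyCPU(img);          // existing CPU pathway
        }
        void applyCPU(ImageDesc&) const      { /* existing CPU code */ }
        void applyCUDA(CUDAImageDesc&) const { /* launch CUDA kernels */ }
    };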
> applyCUDA() will copy the list of Ops to the GPU. Then it will launch a kernel that, at each pixel, loops over each op and calls op->apply() on that pixel. In many cases, the existing code for apply() will work fine in CUDA. Some ops might need a specialized CUDA implementation for good performance.

I'd like to make sure that even in a CUDA-enabled OCIO, the CPU path continues to work. Let's consider that goal #1.

> All of this code can go in one .cu file which is compiled by nvcc.

Are you saying that all of our implementations need to be in a single .cu file within the source tree, or that at build time a single .cu file will be generated and compiled? If possible, I'd really like to keep the implementation of each Op in a separate file. (Though if we need both a Lut1DOp.cpp and a Lut1DOp.cu, that's not too bad.)

> From my perspective, the runtime API is definitely the easier way to go. But it's not my project, so I'm not really in a position to say how much rearranging/uglification of existing code is acceptable. I'll leave that call up to the powers that be. :)

I'm a CUDA newbie, so I'll defer to you here. What's more maintainable in the long run? If the runtime API is what most people use, I'd be more comfortable following the pack, unless there's a really compelling argument for the raw driver API.

Rearranging of code I'm cool with. Uglification of code I'd like to avoid unless absolutely necessary.

Are there other libraries in a similar boat to OCIO that also support CUDA? Surely others have looked into the tradeoff between the two APIs. Can anyone suggest prior 'best practice' we can copy?

------

So what are the next steps? I think my preference would be for you to:
- mock up the public API
- write CUDA support for only the simplest possible Op, such as ExponentOp
- copy src/apps/ocioconvert -> src/apps/ociocudaconvert, and update this example to load into a CUDA buffer, process using OCIO, copy back to host memory, and then save to a file

Once these are done, we can iterate on this trivial case until we get an API / file layout we all like.

Then we should implement Lut1DOp, in the inverse direction, which I believe will be the hardest Op to get working. If we can prove that one works, I think the rest will fall into place.

Bonus points if we add unit tests during Op porting that compare the CPU and GPU pathways and confirm the results are within expected bounds.

-- Jeremy
Jeremy Selan <jeremy...@...>
One implementation detail I just thought of...
Some of the Ops (such as the ones that use LUTs) will require explicit allocations on the device side. Is it ok for OCIO to make these allocations internally, or is it bad practice for CUDA-enabled libraries to do this under the hood?

Assuming it's ok for OCIO to make device allocations, when are these released? My intuition says that when the OCIO::Processor destructor is called, we release the underlying GPU memory. And then if someone wants to keep a particular transform 'resident', they need to hold onto the Processor in their host app.
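e.g. something like this is what I have in mind (sketch, assuming the runtime API):

    #include <cstddef>
    #include <cuda_runtime.h>

    // Device allocation owned by the Processor; freed when the Processor goes away.
    class DeviceBuffer
    {
    public:
        explicit DeviceBuffer(size_t bytes) : ptr_(0) { cudaMalloc(&ptr_, bytes); }
        ~DeviceBuffer() { if (ptr_) cudaFree(ptr_); }
        void* get() const { return ptr_; }
    private:
        void* ptr_;
        DeviceBuffer(const DeviceBuffer&);            // non-copyable
        DeviceBuffer& operator=(const DeviceBuffer&);
    };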
-- Jeremy
Dithermaster <dither...@...>
You can wrap CUDA memory objects around OpenGL textures, FWIW. Same with OpenCL images.
It might make some sense to make the transfer operations "just in time" so the bits get transferred to the appropriate context (CPU or GPU) on-demand. That way, a sequence of GPU operations won't have to transfer back to CPU in-between.
Nathan Weston <elb...@...>
On 4/9/2012 7:21 PM, Jeremy Selan wrote:
> Some of the Ops (such as the ones that use LUTs) will require explicit allocations on the device side. Is it ok for OCIO to make these allocations internally, or is it bad practice for CUDA-enabled libraries to do this under the hood?
>
> Assuming it's ok for OCIO to make device allocations, when are these released? My intuition says that when the OCIO::Processor destructor is called, we release the underlying GPU memory. And then if someone wants to keep a particular transform 'resident', they need to hold onto the Processor in their host app.

Sure, it's no problem for OCIO to make device allocations. Attaching them to the Processor makes sense -- that way the user of the library can easily control when device memory is freed.

-- Nathan
Nathan Weston <elb...@...>
On 4/9/2012 7:12 PM, Jeremy Selan wrote:
> So I think my inclination is to do something simpler (your second case): have only a single processor->apply(...) function which dispatches internally. Even when compiled with CUDA, calling apply(...) on a normal ImageDesc will use the CPU path, but calling it on a CUDAImageDesc will apply the transform on the GPU. Does this make sense?

Yes, sounds good to me.

> Are you saying that all of our implementations need to be in a single .cu file within the source tree, or that at build time a single .cu file will be generated and compiled? If possible, I'd really like to keep the implementation of each Op in a separate file.

All the device code for a given kernel has to ultimately end up in a single compilation unit somehow. You can still keep the code in separate files, but you have to #include them in your main .cu file. I'm imagining we'd have something like a ProcessCuda.cu where the actual kernel is defined. Since that will depend on the ops, it would have:

    #include "Lut1DOp.cpp"
    #include "MatrixOps.cpp"

and so on.

> So what are the next steps? I think my preference would be for you to: mock up the public API; write CUDA support for only the simplest possible Op, such as ExponentOp; copy src/apps/ocioconvert -> src/apps/ociocudaconvert, and update this example to load into a CUDA buffer, process using OCIO, copy back to host memory, and then save to a file.

Sounds good to me. I'll do this with the runtime API, since that's easier, and we can see how bad the code changes are. If I need to switch to the driver API at that point, it shouldn't be too much extra effort.

-- Nathan
Nathan Weston <elb...@...>
On 4/9/2012 7:12 PM, Jeremy Selan wrote:
> So what are the next steps? I think my preference would be for you to: mock up the public API; write CUDA support for only the simplest possible Op, such as ExponentOp; copy src/apps/ocioconvert -> src/apps/ociocudaconvert, and update this example to load into a CUDA buffer, process using OCIO, copy back to host memory, and then save to a file. Once these are done, we can iterate on this trivial case until we get an API / file layout we all like.

This is done now. My code is on GitHub: https://github.com/nweston/OpenColorIO/tree/cuda

It worked out a little differently than I had planned. I ended up with a parallel class hierarchy of CudaOps. This doesn't result in too much duplicated code, since the Ops typically call a function to do most of the work of apply(). I had to move some code into different files, but on the whole the changes to existing code weren't as bad as I expected. The public API just consists of two new ImageDesc classes, for packed/planar CUDA images.

There are two limitations at the moment:

1. nvcc doesn't support C++0x yet, so the CUDA path only builds if OCIO_USE_BOOST_PTR is enabled. I don't think we really need smart pointers anywhere in the CUDA code, so we ought to be able to work around this, but I haven't tried it yet.

2. The current implementation requires CUDA 4.0 and a Fermi card, because it makes virtual calls in device code. Eventually I'd like to support older cards, but I can worry about that later.

Let me know what you think so far.

-- Nathan
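P.S. Usage ends up looking roughly like this (sketch; the CUDA ImageDesc class name below is illustrative, check the branch for the real one):

    #include <cuda_runtime.h>
    #include <OpenColorIO/OpenColorIO.h>
    namespace OCIO = OCIO_NAMESPACE;

    void convertOnGpu(const float* hostRGBA, long width, long height)
    {
        // Upload the packed RGBA frame to device memory.
        size_t bytes = (size_t)width * height * 4 * sizeof(float);
        float* deviceRGBA = 0;
        cudaMalloc((void**)&deviceRGBA, bytes);
        cudaMemcpy(deviceRGBA, hostRGBA, bytes, cudaMemcpyHostToDevice);

        OCIO::ConstConfigRcPtr config = OCIO::GetCurrentConfig();
        OCIO::ConstProcessorRcPtr processor =
            config->getProcessor(OCIO::ROLE_COMPOSITING_LOG, OCIO::ROLE_SCENE_LINEAR);

        // Hypothetical class name: wraps a device pointer so apply() runs in CUDA.
        OCIO::PackedCudaImageDesc img(deviceRGBA, width, height, 4);
        processor->apply(img);

        cudaFree(deviceRGBA);
    }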