Date
1 - 8 of 8
Varying Closure Keyword Parameters Not Being Correctly Evaluated During Batched Execution?
dresserd@...
I've gotten around to trying out the exciting new OSL batched execution mode for Gaffer.
For context, Gaffer's internal use of OSL is a bit atypical - we're not a renderer, but we offer OSL for processing images and geometry - it works quite nicely for basic operations, and we're looking at extending it to be more powerful. The new batched stuff should be an easy win for this application - we have large contiguous buffers to operate on. A quick preliminary test showed a slightly better than factor of 2 speedup using AVX2 with a naive mandelbrot evaluator written in OSL - not the most realistic test, but fairly promising. We currently use debug closures with a custom keyword parameter to define the output from the OSL shaders we run on images or geometry - we may not actually need to use keyword parameters, but it's been convenient, and I assume it's intended to still work. The behaviour I'm seeing is that when I retrieve the value of the parameters from `shaderGlobals.varying.Ci[ i ]` for each `i` in WidthT, the parameter that was set up as a keyword parameter has the same value for every index. Making things more peculiar, if it's a vector or matrix type, subsequent indices into the vector or matrix parameter return the first component but for a different shading point. For example, if I shade 4 points with `u` ranging from 0.25 to 1, with a shader that sets a keyword closure parameter to `color( u, u * 10, 0 )`, I would expect the parameter values of the closures stored in `shaderGlobals.varying.Ci[ i ]` to be <0.25, 2.5, 0>, <0.5, 5.0, 0>, <0.75, 7.5, 0> and <1.0, 10.0, 0>. But instead I get <0.25, 0.5, 0.75> repeated for all Ci pointers within the batch. I'm definitely not ruling out that I could be doing something wrong - I'm early in prototyping this, and I certainly don't fully understand all the details. But there isn't really any difference in how I access keyword and non-keyword closure parameters, and it seems like they're set up properly, because they both work in non-batched execution ... I haven't been able to find anything I'm doing wrong, and I know this SIMD support is pretty new, so it seems plausible there could be a bug. My next step would usually be to try and reproduce this without using any Gaffer code, to try and rule out any mistakes I may have made, but I'm not sure what the easiest way to do that is - I don't see any test code in the OpenShadingLanguage that would exercise this? I think the easiest test might be to try and hack in support for closures in testshade - doing this properly would probably require a fair bit of work, but maybe it wouldn't take long to hack something up that would be just enough to temporarily demonstrate this issue. Before I dig into this though, I figured it wouldn't hurt to just ask about this here - maybe someone is already aware of this, or maybe someone can just look at the batched llvm gen, and see if they can identify a bug in there ( I haven't really worked much on LLVM stuff, and I don't really understand that code, but I can see that llvm_gen_keyword_fill for batched execution is a separate code path, so if it's not being exercised by the tests, it's plausible there could be a bug ). I would appreciate any help in either confirming that this is a bug, or in confirming that varying closure keyword parameters can be correctly evaluated in batched execution, and that the error must be somewhere in my code. -Daniel Dresser |
|
Stephen Friedman
The intent was that if you had an existing method to parse the closure and decide what to do with them, you could just call that in a loop -- that's what our renderer does, and we've used something very close to what's committed for both keyword and normal parameters. Our codebase had diverged a bit from the mainline, so it is always possible something got botched in the translation. Thanks for being on the bleeding edge, I'll try to go through a high level of what's expected to work, and you may be able to poke around and figure out where things have gone wrong. So if you have void process_closure(ClosureColor *CI, <other args>); where you used to do process_closure(shader_globals.Ci, <other args>); you would instead do for (int i = 0; i < batchWidth; ++i) { process_closure(batched_shader_globals.varying.Ci[i], <other args for i>); } As components of the closure are created, they are allocated in blocks of WidthT, so if you know the right derived type, you can look in the memory at sizeof(derivedType) in a debugger and you should see all batchWidth worth of values saved in that block of memory. (technically, the stride calculation is size_t stride = (int)((sizeof(ClosureComponent) + prim_size + alignment-1)/alignment)*alignment;) The closures are expected to be stored in AOS, not SOA, so each one of those should have parameters for each "lane" of the batch all grouped together. For something like an add, the left-addend closure pointer in lane 0 of the block will end up pointing to lane 0 of a block of WidthT left-addend closures, and lane 5 of the add block should point to lane 5 of the left-addend block, etc. For accessing your parameters, I'd expect something like this for the ClosureColor* closure ClosureComponent* comp = closure->as_comp(); if (comp->id == MY_ID){ MyParamsStruct params = *comp->as<MyParamsStruct> () } to "just work" but that won't be an array of MyParamsStruct, it'll only be a single one, and the outer for-loop would be needed to easily chase pointers to get to the params of the next lane of closures. The testing is testsuite/{closure-array,closure} in batched mode, which just prints the results of the closure, it's not currently wired into testshade or testrender. --Stephen On Thu, Nov 10, 2022 at 5:31 PM <dresserd@...> wrote: I've gotten around to trying out the exciting new OSL batched execution mode for Gaffer. |
|
dresserd@...
Oops, I just wrote up a nice reply to Stephen in the web UI, and then accidentally hit the big "Reply to Author" button instead of the little "Group Reply" button. For the benefit of anyone else watching this thread:
Now that Stephen has pointed out the right tests to look at, I see that I can test closure evalution in testshade using printf, even though testshade doesn't use Ci. Trying this out: badBatch.osl ``` shader badBatch() { Ci = diffuse( N, "label", format( "%f", u ) ); printf( " Ci = %s", Ci ); } ``` > oslc badBatch.osl && testshade -g 2 1 badBatch Ci = (1, 1, 1) * diffuse ((0, 0, 1), "label", "0.000000") Ci = (1, 1, 1) * diffuse ((0, 0, 1), "label", "1.000000") > oslc badBatch.osl && testshade -batched -g 2 1 badBatch Ci = (1, 1, 1) * diffuse ((0, 0, 1), "label", "0.000000") Ci = (1, 1, 1) * diffuse ((0, 0, 1), "label", "0.000000") This seems to confirm an issue with batched execution causing the keyword parameter to not get a distinct value for each shading point ( at least in my environment ). -Daniel |
|
dresserd@...
Since I've now got a clean repro independent of my code, I went ahead and made an issue: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/issues/1619
-Daniel |
|
Stephen Friedman
Yeah, that looks like I just "missed a spot" -- it turns out while we use keyword args, in our renderer, they've only ever been uniform across the batch so I didn't catch it. I don't have the cycles to vet and submit a fix at the moment, but I'd guess it's really straightforward. If you look in liboslexec/batched_llvm_gen.cpp and go to llvm_gen_closure, you'll see a loop in there going through the formal args and generating a copy of them into the closure struct. There's a bit that does this: if (rop.isSymbolUniform(sym)) { src = rop.llvm_void_ptr (sym); } else { src = rop.ll.GEP(rop.llvm_load_value(sym), i); } and if you diff that with the standard liboslexec/llvm_gen.cpp#llvm_gen_closure, you'll see the non-batched code just always does the uniform case of llvm::Value* src = rop.llvm_void_ptr (sym); Well, it turns out the llvm_gen_keyword_fill code has a similar line llvm::Value* src = rop.llvm_void_ptr (Value); that I didn't update ... so the batched version of llvm_gen_keyword_fill likely just needs the 'i' variable passed in and equivalent check for a uniform symbol, and if it's not uniform, using get-element-pointer(GEP) to get the right one. Since you've got the setup to test this out ready, would you mind giving it a try and see if it addresses your issue? --Stephen On Mon, Nov 14, 2022 at 3:12 PM <dresserd@...> wrote: Since I've now got a clean repro independent of my code, I went ahead and made an issue: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/issues/1619 |
|
dresserd@...
Hmm, this is a bit more complex than what you're describing, because the batched_llvm_gen version of llvm_gen_closure doesn't just switch the target of the memcpy, it replaces the memcpy completely with a pair of for loops over num_elements and num_components, with manual loads and stores. Based on your description, it's unclear whether I need to copy the logic from llvm_gen_closure to llvm_gen_keyword, or whether I should just switch the target of the memcpy.
I flipped a coin and guessed I should do the former, and it does in fact seem to be working in the simplest case. I'll try out a more complex example tomorrow. My current diff for batched_llvm_gen is: 7067c7067 < ustring clname, llvm::Value* mem_void_ptr, int argsoffset) --- > ustring clname, llvm::Value* mem_void_ptr, int argsoffset, int lane_index) 7091,7094c7091,7127 < llvm::Value* dst = rop.ll.offset_ptr(mem_void_ < llvm::Value* src = rop.llvm_void_ptr(Value); < rop.ll.op_memcpy(dst, src, (int)p.type.size(), < 4 /* use 4 byte alignment for now */); --- > > bool arg_is_uniform = Value.is_uniform(); > > TypeDesc simpletype(Value.typespec(). > int num_elements = simpletype.numelements(); > int num_components = simpletype.aggregate; > > llvm::Value* dest_base = rop.ll.offset_ptr(mem_void_ > dest_base = rop.llvm_ptr_cast(dest_base, p.type); > > for (int a = 0; a < num_elements; ++a) { > llvm::Value* arrind = simpletype.arraylen ? rop.ll.constant(a) > : NULL; > for (int c = 0; c < num_components; c++) { > llvm::Value* dest_elem; > if (num_components > 1) > dest_elem = rop.ll.GEP(dest_base, a, c); > else > dest_elem = dest_base; > > // NOTE: We don't want any uniform arguments to be > // widened, so our typical op_is_uniform doesn't do what we > // want for this when loading. So just pass arg_is_uniform > // which will avoid widening any uniform arguments. > llvm::Value* loaded > = rop.llvm_load_value(Value, 0, arrind, c, > TypeDesc::UNKNOWN, > /*op_is_uniform*/ arg_is_uniform, > /*index_is_uniform*/ true); > > if (!arg_is_uniform) { > loaded = rop.ll.op_extract(loaded, lane_index); > } > rop.ll.op_unmasked_store( > } > } > 7278c7311 < 2 + weighted + clentry->nformal); --- > 2 + weighted + clentry->nformal, lane_index); |
|
Stephen Friedman
Ahh, yes, you're right -- that's more complicated than the copy we have locally. That extra complication is to deal with parameters that are arrays, and translating from SOA back to the AOS layout of the closure results. You seem to have it all figured out though! On Mon, Nov 14, 2022 at 6:29 PM <dresserd@...> wrote: Hmm, this is a bit more complex than what you're describing, because the batched_llvm_gen version of llvm_gen_closure doesn't just switch the target of the memcpy, it replaces the memcpy completely with a pair of for loops over num_elements and num_components, with manual loads and stores. Based on your description, it's unclear whether I need to copy the logic from llvm_gen_closure to llvm_gen_keyword, or whether I should just switch the target of the memcpy. |
|
dresserd@...
( Apologies again to Stephen for accidentally double-posting as reply-to-sender ... )
I've put up a PR with this fix here: https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/pull/1620 Since you were able to guess a fix to that last bug repro pretty quick, here's another one for you:
a.osl
```
shader a ( output closure color out = 0 )
{
out = debug( "test" );
}
```
b.osl
```
shader b ( closure color in0 = 0, closure color in1 = 0 )
{
closure color out = in0 + in1;
printf( " out = %s\n", out );
}
```
shader b ( closure color in0 = 0, closure color in1 = 0 )
{
closure color out = in0 + in1;
printf( " out = %s\n", out );
}
Result:
> oslc a.osl && oslc b.osl && testshade -layer a a -layer b b -connect a out b in0 -g 2 1
Compiled b.osl -> b.oso
Connect a.out to b.in0
out = (1, 1, 1) * debug ("test")
out = (1, 1, 1) * debug ("test")
But adding the -batched flag causes a crash:
> oslc a.osl && oslc b.osl && testshade -layer a a -layer b b -connect a out b in0 -g 2 1 -batched
Compiled a.osl -> a.oso
Compiled b.osl -> b.oso
Connect a.out to b.in0
0# OpenImageIO_v2_3::Sysutil::stacktrace() in /software/apps/OpenImageIO/2.3.11.0/cent7.x86_64/gcc/9.3.1/lib/libOpenImageIO_Util.so.2.3
1# 0x00007F9BB43BAF9B in /software/apps/OpenImageIO/2.3.11.0/cent7.x86_64/gcc/9.3.1/lib/libOpenImageIO_Util.so.2.3
2# 0x00007F9BB29BA400 in /lib64/libc.so.6
3# raise in /lib64/libpthread.so.0
4# 0x00007F9BB43BAFCC in /software/apps/OpenImageIO/2.3.11.0/cent7.x86_64/gcc/9.3.1/lib/libOpenImageIO_Util.so.2.3
5# 0x00007F9BB29BA400 in /lib64/libc.so.6
6# 0x00007F9BB5FBDB6F in /disk1/danield/OpenShadingLanguage-1.12.7.0_install/lib/liboslexec.so.1.12
7# OSL_v1_12::pvt::print_closure(std::ostream&, OSL_v1_12::ClosureColor const*, OSL_v1_12::pvt::ShadingSystemImpl*) in /disk1/danield/OpenShadingLanguage-1.12.7.0_install/lib/liboslexec.so.1.12
8# osl_b8_AVX2_closure_to_string in /software/apps/OpenShadingLanguage/1.12.7.0/cent7.x86_64/gcc/9.3.1/bin/../lib64/lib_b8_AVX2_oslexec.so
9# 0x00007F9BB84FB2C4
Abort
-Daniel |
|