Re: run flags vs active intervals


Larry Gritz <l...@...>
 

It's not really referring to hardware SIMD, but just that we are shading many points at once, "in lock-step."  In other words, if you had many points to shade, you could do it in this order:

    point 0:
        add
        texture
        assign
    point 1:
        add
        texture
        assign
    ...
    point n:
        add
        texture
        assign

or you could do it in this order:

    add all points on [0,n]
    texture all points on [0,n]
    assign all points on [0,n]

We call the latter "SIMD" (single instruction, multiple data), because each instruction operates on big arrays of data (one value for each point being shaded).

SIMD helps by:

   * increasing interpreter speed, by moving much of the interpreter overhead to "per batch" rather than "per point".
   * improving memory coherence, cache behavior, and allowing hardware SIMD, since it's naturally using "structure-of-array" layout, i.e. it's faster to add contiguous arrays than separate floats in more sparse memory locations.
   * improves texture coherence because you're doing lots of lookups from the same texture on all points at the same time (which are in turn likely to be on the same tiles).




On Jan 21, 2010, at 9:30 AM, Wormszer wrote:

Is there a good resource on this topic? I did some googling and didn't see what i was looking for.

Where is the actual SIMD taking place?
Is the compiler figuring it out from the loop and setting up the correct instructions. Or is it relying on the cpu to recognize consecutive ops with different data, modifying instructions in its pipe by combing instructions into a parallel one.

Or maybe i'm way off.

Thanks,

Jeremy

On Thu, Jan 21, 2010 at 12:16 PM, Larry Gritz <l...@...> wrote:
Awesome, Chris.  Would you believe we were just talking internally about this topic yesterday?  We were considering the amount of waste if there were big gaps of "off" points in the middle.  But I think your solution is quite a bit more elegant than what we were discussing (a list of "on" points).  I like how it devolves into (begin,end] for the common case of all points on.

It's a pretty big overhaul, touches every shadeop and a lot of templates, and the call signatures have to be changed (to what, exactly? passing a std::vector<int>& ? or a combo of int *begend and int segments?).  Ugh, and we may wish to change/add OIIO texture routines that take runflags, too.  But your experiment is quite convincing.

What does everybody else think?  Chris/Cliff/Alex?  (I'm happy to do the coding, but I want consensus because it touches so much.)

       -- lg


On Jan 21, 2010, at 8:21 AM, Chris Foster wrote:

> Hi all,
>
> I've been looking through the OSL source a little, and I'm interested to see
> that you're using runflags for the SIMD state.  I know that's a really
> conventional solution, but there's an alternative representation which I reckon
> is significantly faster, and I wondered if you considered it.
>
> Imagine a SIMD runstate as follows
>
> index: [ 0  1  2  3  4  5  6  7  8  9 ]
>
> flags: [ 0  0  1  1  1  1  0  0  1  0 ]
>
>
> An alternative to the flags is to represent this state as a list of active
> intervals.  As a set of active start/stop pairs, the state looks like
>
> active = [ 2 6  8 9 ]
>
> ie,
>
> state: [ 0  0  1  1  1  1  0  0  1  0 ]
>               ^           v     ^  v
>               2           6     8  9
>
> The advantage of doing this is that the inner SIMD loops become tighter since
> there's one less test.  Instead of
>
> for (int i = 0; i < len; ++i)
> {
>    if (state[i])
>        do_something(i);
> }
>
> we'd have
>
> for (int j = 0; j < nActiveTimes2; j+=2)
> {
>    for (int i = active[j]; i < active[j+1]; ++i)
>        do_something(i);
> }
>
> and the inner loop is now completely coherent.
>
> I can see you have the beginnings of this idea in the current OSL code, since
> you pass beginpoint and endpoint along with the runflags everywhere.  However,
> why not take it to its logical conclusion?
>
> Given that you already have beginpoint and endpoint, the particularly big win
> here is when most of the flags are turned on.  If do_something() is a simple
> arithmetic operation (eg, float addition) the difference between the two
> formulations can be a factor of two in speed.
>
>
> I'm attaching some basic test code which I whipped up.  Timings on my core2 duo
> with gcc -O3 look like:
>
>
> All flags on (completely coherent):
>
> run flags:        1.310s
> active intervals: 0.690s
>
>
> Random flags on (completely incoherent):
>
> run flags:        5.440s
> active intervals: 3.310s
>
>
> Alternate flags on (maximum number of active intervals -> worst case):
>
> run flags:        1.500s
> active intervals: 2.150s
>
>
> Thoughts?
>
> ~Chris.
> <ATT00001..txt><runstate.cpp>

--
Larry Gritz
l...@...





--
You received this message because you are subscribed to the Google Groups "OSL Developers" group.
To post to this group, send email to osl...@....
To unsubscribe from this group, send email to osl...@....
For more options, visit this group at http://groups.google.com/group/osl-dev?hl=en.




<ATT00001..htm>

--
Larry Gritz



Join osl-dev@lists.aswf.io to automatically receive all group messages.