Topics

OpenEXR Library: Optimization potential for readChunkOffsetTables for MultiPartInput files

Harsh Patil
 

Hello

I am looking at an EXR file of the following type: a multi-part scanline EXR file with ZIPS_COMPRESSION, half-float, 4096x2307, with 24 parts. Each part has 1 layer, and each layer has 3-4 channels.

Currently I am measuring that it takes 330 ms (!!) to execute MultiPartInputFile::initialize().

Please see screenshot 1: the bulk of that cost, ~282 ms, comes from readChunkOffsetTables() for all the parts in the EXR file (24 parts in this case; each has 1 layer, and each layer has anywhere from 3-4 channels). Given that I have 300 EXR files in my image sequence, I spend about a minute and a half just on initializing the multi-part input files, which is a big overhead.

Looking at the implementation of these functions, I see the following: towards the very end of the initialize() function, the library does this:

    //
    // Create InputParts and read chunk offset tables.
    //

    for (size_t i = 0; i < _data->_headers.size(); i++)
        _data->parts.push_back(
                new InputPartData(_data, _data->_headers[i], i, _data->numThreads, _data->version));

    _data->readChunkOffsetTables(_data->reconstructChunkOffsetTable);

That function in turn just loops over each part and does the following: for each part, read the chunk offset table size from the part's header, then read in the chunk offsets and populate them in the MultiPartInputFile data structure: vector<InputPartData*> parts;

Since we are doing this on an STL vector container and modifying separate elements of that vector, we could ideally spawn readChunkOffsetTables() for different parts across different threads and finish this stage for the entire multi-part file much faster. With ideal scaling, what takes 282 ms on one thread for all 24 parts should finish in about 11.75 ms if spawned across 24 threads.

Per the STL vector description of operator[]: the container is accessed (neither the const nor the non-const versions modify the container); the reference returned can be used to access or modify elements; concurrently accessing or modifying different elements is safe.

And in readChunkOffsetTables we modify separate parts via the [] operator (see the code below), so it should be thread safe to do this across several threads.

void
MultiPartInputFile::Data::readChunkOffsetTables(bool reconstructChunkOffsetTable)
{
    bool brokenPartsExist = false;

    for (size_t i = 0; i < parts.size(); i++)
    {
        int chunkOffsetTableSize = getChunkOffsetTableSize(parts[i]->header,false);
        parts[i]->chunkOffsets.resize(chunkOffsetTableSize);

        for (int j = 0; j < chunkOffsetTableSize; j++)
            OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::read <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO> (*is, parts[i]->chunkOffsets[j]);

        //
        // Check chunk offsets, reconstruct if broken.
        // At first we assume the table is complete.
        //
        parts[i]->completed = true;
        for (int j = 0; j < chunkOffsetTableSize; j++)
        {
            if (parts[i]->chunkOffsets[j] <= 0)
            {
                brokenPartsExist = true;
                parts[i]->completed = false;
                break;
            }
        }
    }

    if (brokenPartsExist && reconstructChunkOffsetTable)
        chunkOffsetReconstruction(*is, parts);
}

Peter Hillman
 

I think this may be a similar issue to readPixelData(): the gain may not be so significant, because the Xdr::read() calls cannot be interleaved from different threads. However, it might be possible to have a single read call ingest the entire chunk tables for all parts in one block, and then multithread the byte swapping and checking for completeness. I'm a little anxious about adding threading to the constructor, since error handling in constructors can be very messy.

In your tests are you reading all the channels in the file? Or do you only ever read a subset of the channels from one or two parts? I wonder whether the entire chunkOffsetTable reading for any part could be deferred until the corresponding InputPart is created (or a call to partComplete() is made). That way, if you only ever read part 0 in the file, the chunk tables for the other parts aren't read from disk at all.

Peter

On 08/05/2020 13:25, Harsh Patil via lists.aswf.io wrote:


Harsh Patil
 

Hello Peter

In this particular customer project, and in a few others that we have been looking into, the following trend was common:

Read 7-8 parts (and all channels inside each part). For example, in this trace the following parts were read: <Diffuse, Lighting, Specular, SubSurfaceScattering, Global Illumination, Reflect, AO_RE_Base>, and each part had 3 channels. So in total, 24 of the 78 channels in the EXR file were read.

Regarding the following: "I wonder whether the entire chunkOffsetTable reading for any part could be deferred until the corresponding InputPart is created (or a call to partComplete() is made). That way, if you only ever read part 0 in the file, the chunk tables for the other parts aren't read from disk at all."

Would this be a library implementation change?

Best regards

Harsh



On May 7, 2020, at 10:24 PM, Peter Hillman <work@...> wrote:




Peter Hillman
 

Hi Harsh,

Yes, it would be quite a widespread underlying change to do this, but the API would stay the same. There would be a change to the library's behavior, though, which would be worth flagging to developers using the library. Some methods may be slower than before, or require file access where none was previously required. Also, when handling truncated files, the 'Early end of file' exception may be thrown by a different method than it previously was.

I should say it is speculation on my part that such a change would be a benefit in any case; further testing would be required. I'd guess that in the case you describe the total saving might be about 150 ms per file. Depending on how the API was used, the changes might actually make it slower to read channels from parts of the file, as there could be more seeking around.

Peter


On 8/05/20 5:38 pm, Harsh Patil via lists.aswf.io wrote:



Harsh Patil
 

Hi Peter

I was able to optimize readChunkOffsetTables for MultiPartInputFile and make it ~3x faster for one of the production assets that I have.

Please see screenshot 1: this is from the current shipping OpenEXR implementation. For this 24-part file, it takes ~300 ms per multi-part EXR file in this sequence.
Please see screenshot 2: with my modifications, it is now at 105 ms per multi-part EXR file in this sequence.
Please see screenshot 3: reading chunk offsets now happens in parallel across different cores, at *part* granularity.

Summary of my changes:

First, I implemented infrastructure to support pread calls by adding it to ImfIO.h and ImfIO.cpp, with implementations in ImfStdIO.h and ImfStdIO.cpp.
I also added a StreamIO_V1 in ImfXdr.h (it could use a better name) that supports pread in its readChars implementation.

Once the above two were in place, I added logic in MultiPartInputFile.cpp to do a multi-threaded read of the chunk offsets (at *part* granularity) by dispatching them to the IlmThread ThreadPool, so each thread reads the chunk offsets for one part, and different parts can have their chunk offsets read on different cores. For that, I created a subclass of the Task class, added those tasks to the thread pool, and used pread (plumbed through all the IFStream classes) in the execute() function, accounting for each part's chunk offset position in the pread calls.

These are my changes

Would love to hear your thoughts on this, and how we should approach this next.

class readChunkOffsetTableTask : public IlmThread::Task
{
  public:

    readChunkOffsetTableTask (IlmThread::TaskGroup *group, OPENEXR_IMF_INTERNAL_NAMESPACE::IStream* is, std::vector<Int64>* chunkOffsets, int chunkOffsetTableSize, Imf::Int64 position);

    virtual ~readChunkOffsetTableTask ();

    virtual void        execute ();
  private:
    OPENEXR_IMF_INTERNAL_NAMESPACE::IStream*   _is;
    //Int64*          _chunkOffsets;
    Imf::Int64             _position;
    std::vector<Int64>*    _chunkOffsets;
    int                    _chunkOffsetTableSize;
};


readChunkOffsetTableTask::readChunkOffsetTableTask
    (IlmThread::TaskGroup *group,
     OPENEXR_IMF_INTERNAL_NAMESPACE::IStream*  is,
     std::vector<Int64>* chunkOffsets,
     int chunkOffsetTableSize,
     Imf::Int64 position)
:
    IlmThread::Task (group),
    _is(is),
    _position(position),
    _chunkOffsets(chunkOffsets),
    _chunkOffsetTableSize(chunkOffsetTableSize)
{
    // empty
}


readChunkOffsetTableTask::~readChunkOffsetTableTask ()
{
    // empty
}

void
readChunkOffsetTableTask::execute ()
{
    for (int j = 0; j < _chunkOffsetTableSize; j++)
    {
        OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::pread <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO_V1> (*_is, _chunkOffsets->at(j),(_position+(8*j)));
    }
}

IlmThread::Task *
newReadChunkOffsetTableTask (IlmThread::TaskGroup *group, OPENEXR_IMF_INTERNAL_NAMESPACE::IStream* is, std::vector<Int64>* chunkOffsets, int chunkOffsetTableSize,  Imf::Int64 pos)
{
    IlmThread::Task* retTask = 0;
    retTask = new readChunkOffsetTableTask (group, is, chunkOffsets, chunkOffsetTableSize, pos);
    return retTask;
}

void
MultiPartInputFile::Data::readChunkOffsetTables(bool reconstructChunkOffsetTable)
{
    bool brokenPartsExist = false;
    Int64 position = is->tellg();

    for (size_t i = 0; i < parts.size(); i++)
    {
        int chunkOffsetTableSize = getChunkOffsetTableSize(parts[i]->header,false);
        parts[i]->chunkOffsets.resize(chunkOffsetTableSize);

        /*
        for (int j = 0; j < chunkOffsetTableSize; j++)
        {
            OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::read <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO> (*is, parts[i]->chunkOffsets[j]);
        }*/

        {
            IlmThread::TaskGroup taskGroup;

            IlmThread::ThreadPool::addGlobalTask (newReadChunkOffsetTableTask(&taskGroup, is, &(parts[i]->chunkOffsets), chunkOffsetTableSize, position));

            //
            // The TaskGroup destructor waits here for the task to finish.
            //
        }
        position = position + (chunkOffsetTableSize*8);

        //
        // Check chunk offsets, reconstruct if broken.
        // At first we assume the table is complete.
        //
        parts[i]->completed = true;
        for (int j = 0; j < chunkOffsetTableSize; j++)
        {
            if (parts[i]->chunkOffsets[j] <= 0)
            {
                brokenPartsExist = true;
                parts[i]->completed = false;
                break;
            }
        }
    }

    if (brokenPartsExist && reconstructChunkOffsetTable)
        chunkOffsetReconstruction(*is, parts);
}






On May 8, 2020, at 7:15 PM, Peter Hillman <work@...> wrote:





Harsh Patil
 

Hi Peter

I was able to cut the cost down to only 23 ms by further queueing up the tasks for all parts in one task group, which I create at the start of readChunkOffsetTables.

Here is the screenshot for reference, and the new code.

We go from 300 ms -> 23 ms to read the chunk offsets for a multi-part scanline EXR file with ZIPS_COMPRESSION, half-float, 4096x2307, with 24 parts. Each part has 1 layer, and each layer has 3-4 channels.

class readChunkOffsetTableTask : public IlmThread::Task
{
  public:

    readChunkOffsetTableTask (IlmThread::TaskGroup *group, OPENEXR_IMF_INTERNAL_NAMESPACE::IStream* is, std::vector<Int64>* chunkOffsets, int chunkOffsetTableSize, Imf::Int64 position);

    virtual ~readChunkOffsetTableTask ();

    virtual void        execute ();
  private:
    OPENEXR_IMF_INTERNAL_NAMESPACE::IStream*   _is;
    //Int64*          _chunkOffsets;
    Imf::Int64             _position;
    std::vector<Int64>*    _chunkOffsets;
    int                    _chunkOffsetTableSize;
};


readChunkOffsetTableTask::readChunkOffsetTableTask
    (IlmThread::TaskGroup *group,
     OPENEXR_IMF_INTERNAL_NAMESPACE::IStream*  is,
     std::vector<Int64>* chunkOffsets,
     int chunkOffsetTableSize,
     Imf::Int64 position)
:
    IlmThread::Task (group),
    _is(is),
    _position(position),
    _chunkOffsets(chunkOffsets),
    _chunkOffsetTableSize(chunkOffsetTableSize)
{
    // empty
}


readChunkOffsetTableTask::~readChunkOffsetTableTask ()
{
    // empty
}

void
readChunkOffsetTableTask::execute ()
{
    for (int j = 0; j < _chunkOffsetTableSize; j++)
    {
        OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::pread <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO_V1> (*_is, _chunkOffsets->at(j), (_position+(8*j)));
    }
}

IlmThread::Task *
newReadChunkOffsetTableTask (IlmThread::TaskGroup *group, OPENEXR_IMF_INTERNAL_NAMESPACE::IStream* is, std::vector<Int64>* chunkOffsets, int chunkOffsetTableSize,  Imf::Int64 pos)
{
    IlmThread::Task* retTask = 0;
    retTask = new readChunkOffsetTableTask (group, is, chunkOffsets, chunkOffsetTableSize, pos);
    return retTask;
}
    
void
MultiPartInputFile::Data::readChunkOffsetTables(bool reconstructChunkOffsetTable)
{
    bool brokenPartsExist = false;
    Int64 position = is->tellg();

    {
        //
        // Queue one task per part. The TaskGroup destructor at the end
        // of this scope waits for all of them, so every chunk offset
        // table is fully populated before we validate it below.
        //
        IlmThread::TaskGroup taskGroup;

        for (size_t i = 0; i < parts.size(); i++)
        {
            int chunkOffsetTableSize = getChunkOffsetTableSize(parts[i]->header,false);
            parts[i]->chunkOffsets.resize(chunkOffsetTableSize);

            IlmThread::ThreadPool::addGlobalTask (newReadChunkOffsetTableTask(&taskGroup, is, &(parts[i]->chunkOffsets), chunkOffsetTableSize, position));

            position = position + (chunkOffsetTableSize * 8);
        }
    }

    //
    // Check chunk offsets, reconstruct if broken.
    // At first we assume each table is complete.
    //
    for (size_t i = 0; i < parts.size(); i++)
    {
        parts[i]->completed = true;
        for (size_t j = 0; j < parts[i]->chunkOffsets.size(); j++)
        {
            if (parts[i]->chunkOffsets[j] <= 0)
            {
                brokenPartsExist = true;
                parts[i]->completed = false;
                break;
            }
        }
    }

    if (brokenPartsExist && reconstructChunkOffsetTable)
        chunkOffsetReconstruction(*is, parts);
}


On Jun 23, 2020, at 11:02 PM, Harsh Patil via lists.aswf.io <harsh_patil=apple.com@...> wrote:

Hi Peter

I was able to optimize readChunkOffsetTables for MultiPartInputFile and make it ~3x faster for one of the production assets I have.

Please see screenshot 1: This is from the current shipping OpenEXR implementation. For this 24-part file it takes ~300ms per multi-part EXR file in this sequence.
Please see screenshot 2: This is with my modifications; it is now at 105ms per multi-part EXR file in this sequence.
Please see screenshot 3: readChunkOffsets now happens in parallel across different cores at *Part* granularity.

Summary of my changes:

First, I implemented infrastructure to support pread calls by adding it to ImfIO.h and ImfIO.cpp, with implementations in ImfStdIO.h and ImfStdIO.cpp.
I also added StreamIO_V1 to ImfXdr.h (it could use a better name) to support pread in its readChars implementation.

Once the above two were in place, I added logic in MultiPartInputFile.cpp to do a multi-threaded read of the chunk offsets (at *Part* granularity) by dispatching tasks to the IlmThread ThreadPool, so that each thread reads the chunk offsets for one part and different parts can be read on different cores. For that I created a subclass of the Task class, added those tasks to the thread pool, and used pread (plumbed through all the IStream classes) in the execute() function, accounting for each part's chunk offset position in the pread call.

These are my changes

Would love to hear your thoughts on this and how we should approach it next.

class readChunkOffsetTableTask : public IlmThread::Task
{
  public:

    readChunkOffsetTableTask (IlmThread::TaskGroup *group, OPENEXR_IMF_INTERNAL_NAMESPACE::IStream* is, std::vector<Int64>* chunkOffsets, int chunkOffsetTableSize, Imf::Int64 position);

    virtual ~readChunkOffsetTableTask ();

    virtual void        execute ();
  private:
    OPENEXR_IMF_INTERNAL_NAMESPACE::IStream*   _is;
    //Int64*          _chunkOffsets;
    Imf::Int64             _position;
    std::vector<Int64>*    _chunkOffsets;
    int                    _chunkOffsetTableSize;
};


readChunkOffsetTableTask::readChunkOffsetTableTask
    (IlmThread::TaskGroup *group,
     OPENEXR_IMF_INTERNAL_NAMESPACE::IStream*  is,
     std::vector<Int64>* chunkOffsets,
     int chunkOffsetTableSize,
     Imf::Int64 position)
:
    IlmThread::Task (group),
    _is(is),
    _position(position),
    _chunkOffsets(chunkOffsets),
    _chunkOffsetTableSize(chunkOffsetTableSize)
{
    // empty
}


readChunkOffsetTableTask::~readChunkOffsetTableTask ()
{
    
}

void
readChunkOffsetTableTask::execute ()
{
    for (int j = 0; j < _chunkOffsetTableSize; j++)
    {
        OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::pread <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO_V1> (*_is, _chunkOffsets->at(j),(_position+(8*j)));
    }
}

IlmThread::Task *
newReadChunkOffsetTableTask (IlmThread::TaskGroup *group, OPENEXR_IMF_INTERNAL_NAMESPACE::IStream* is, std::vector<Int64>* chunkOffsets, int chunkOffsetTableSize,  Imf::Int64 pos)
{
    IlmThread::Task* retTask = 0;
    retTask = new readChunkOffsetTableTask (group, is, chunkOffsets, chunkOffsetTableSize, pos);
    return retTask;
}
    
void
MultiPartInputFile::Data::readChunkOffsetTables(bool reconstructChunkOffsetTable)
{
    bool brokenPartsExist = false;
    Int64 position = is->tellg();

    for (size_t i = 0; i < parts.size(); i++)
    {
        int chunkOffsetTableSize = getChunkOffsetTableSize(parts[i]->header,false);
        parts[i]->chunkOffsets.resize(chunkOffsetTableSize);
        
        /*
        for (int j = 0; j < chunkOffsetTableSize; j++)
        {
            OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::read <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO> (*is, parts[i]->chunkOffsets[j]);
        }*/
        
        
        {
                IlmThread::TaskGroup taskGroup;
     
                IlmThread::ThreadPool::addGlobalTask (newReadChunkOffsetTableTask(&taskGroup, is, &(parts[i]->chunkOffsets), chunkOffsetTableSize, position));
            //
                // finish all tasks
            //
        }
        position = position + (chunkOffsetTableSize*8);
        
        // Check chunk offsets, reconstruct if broken.
        // At first we assume the table is complete.
        //
        parts[i]->completed = true;
        for (int j = 0; j < chunkOffsetTableSize; j++)
        {
            if (parts[i]->chunkOffsets[j] <= 0)
            {
                brokenPartsExist = true;
                parts[i]->completed = false;
                break;
            }
        }
    }

    if (brokenPartsExist && reconstructChunkOffsetTable)
        chunkOffsetReconstruction(*is, parts);
}


<Screen Shot 2020-06-23 at 10.31.51 PM.png>


<Screen Shot 2020-06-23 at 10.32.03 PM.png>

<Screen Shot 2020-06-23 at 10.45.44 PM.png>

On May 8, 2020, at 7:15 PM, Peter Hillman <work@...> wrote:

Hi Harsh,

Yes it would be quite a widespread underlying change to do this, but the API would stay the same. There would be a change to the library's behavior, though, which would be worth flagging up to developers using the library. Some methods may be slower than before or require file access where none was previously required. Also, when handling truncated files, the 'Early end of file' exception error may be thrown by a different method to where it previously was.

I should say it was speculation on my part that such a change will be a benefit in any case - further testing would be required. I'd guess in the case you describe the total saving might be about 150ms per file. Depending on how the API was used, the changes may actually make it slower to read channels from parts of the file, as there could be more seeking around.

Peter


On 8/05/20 5:38 pm, Harsh Patil via lists.aswf.io wrote:
Hello Peter

In this particular customer project, and a few others that we have been looking into, the following trend was common:

Read 7-8 parts (and all channels inside each part). For example, in this trace the following 8 parts were read: <Diffuse, Lighting, Specular, SubSurfaceScattering, Global Illumination, Reflect, AO_RE_Base>, and each part had 3 channels. So in total 24 of the 78 channels in the EXR file were read.

Regarding the following: I wonder whether the entire chunkOffsetTable reading for any part could be deferred until the corresponding InputPart is created (or a call to partComplete() is made). That way, if you only ever read part 0 in the file, the chunk tables for the other parts aren't read from disk at all.

Would this be a library implementation change ?

Best regards

Harsh



On May 7, 2020, at 10:24 PM, Peter Hillman <work@...> wrote:

I think this may be a similar issue to the readPixelData() - the gain may not be so significant because the Xdr::read() calls cannot be interleaved from different threads. However, it might be possible to have a single read call to ingest the entire chunk tables for all parts in one block, and then multithread the byte swapping and checking for completeness. I'm a little anxious about adding threading to the constructor, since error handling in constructors can be very messy.

In your tests are you reading all the channels in the file? Or do you only ever read a subset of the channels from one or two parts? I wonder whether the entire chunkOffsetTable reading for any part could be deferred until the corresponding InputPart is created (or a call to partComplete() is made). That way, if you only ever read part 0 in the file, the chunk tables for the other parts aren't read from disk at all.

Peter

On 08/05/2020 13:25, Harsh Patil via lists.aswf.io wrote:
Hello

I am looking at EXR file of following type: MultiPart Scanline EXR file with ZIPS_COMPRESSION Half-float 4096*2307 with 24 parts. Each Part has 1 layers and each layer has 3-4 channels. 

Currently I am measuring that it takes 330ms (!!) to execute MultiPartInput:: void MultiPartInputFile::initialize() 

Please see screenshot 1: Bulk of that cost ~282ms comes from the operation readChunkOffsetTables() for all the parts in the EXR file (24 parts in this case : each has 1 layer and each layer has anywhere from 3-4 channels). Given I have 300 EXR files in my image sequence, I spend about a minute and half just on initializing the MultiPart input file which is a big overhead

Looking at the implementation of these functions I see following:

Towards the very end of the Initialize function library does following:

//
    // Create InputParts and read chunk offset tables.
    //
        
    for (size_t i = 0; i < _data->_headers.size(); i++)
        _data->parts.push_back(
                new InputPartData(_data, _data->_headers[i], i, _data->numThreads, _data->version));

    _data->readChunkOffsetTables(_data->reconstructChunkOffsetTable);

That function in turn just loops over each part and does the following: for each part, read the chunkOffsetTableSize from the part's header, then read in the chunkOffsets and populate them in the MultiPartInputFile data structure: vector<InputPartData*>      parts;
Since we are doing this on an STL vector and we are modifying separate elements of that vector, we could spawn readChunkOffsetTables() for different parts across different threads and finish this stage for the entire multi-part file much faster. With ideal scaling, what takes 282ms on one thread for all 24 parts should finish in about 11.75ms across 24 threads.

Per STL vector description: The container is accessed (neither the const nor the non-const versions modify the container).
The reference returned can be used to access or modify elements. Concurrently accessing or modifying different elements is safe.

And in readChunkOffsetTables we modify separate parts via the [] operator (e.g. the Xdr::read line in the function below), so it should be thread-safe to do this across several threads.

void
MultiPartInputFile::Data::readChunkOffsetTables(bool reconstructChunkOffsetTable)
{
    bool brokenPartsExist = false;

    for (size_t i = 0; i < parts.size(); i++)
    {
        int chunkOffsetTableSize = getChunkOffsetTableSize(parts[i]->header,false);
        parts[i]->chunkOffsets.resize(chunkOffsetTableSize);

        for (int j = 0; j < chunkOffsetTableSize; j++)
            OPENEXR_IMF_INTERNAL_NAMESPACE::Xdr::read <OPENEXR_IMF_INTERNAL_NAMESPACE::StreamIO> (*is, parts[i]->chunkOffsets[j]);

        //
        // Check chunk offsets, reconstruct if broken.
        // At first we assume the table is complete.
        //
        parts[i]->completed = true;
        for (int j = 0; j < chunkOffsetTableSize; j++)
        {
            if (parts[i]->chunkOffsets[j] <= 0)
            {
                brokenPartsExist = true;
                parts[i]->completed = false;
                break;
            }
        }
    }

    if (brokenPartsExist && reconstructChunkOffsetTable)
        chunkOffsetReconstruction(*is, parts);
}

<Screen Shot 2020-05-07 at 6.09.44 PM.png>





Peter Hillman
 

Hi Harsh,
Thanks for the update. We discussed this briefly at the latest TSC meeting. You've certainly shown there's scope for some significant optimization here, and I think there is general interest in speeding up read times. I'll try to find time to run some tests of my own to see if I can reproduce this and do some low level analysis.

We wondered how much the speedup you see depends on that specific file system, and how much caching might be playing a part. We were not sure whether different threads reading from different parts of a file "simultaneously" may cause a slowdown in some cases. It's possible that the Xdr code is actually the bottleneck, rather than the file reads, and maybe a less invasive optimization is possible. Your approach means that the constructor becomes multithreaded, which would be a significant change in behavior. Currently only the 'readPixels'/'readTiles' calls are multithreaded. We'd also have to take care that damaged files are handled correctly.

Unfortunately, the nature of the changes required to optimize reads are likely to mean they have to be delayed until the next major release.

Peter

Harsh Patil
 

Hi Peter

Apologies for the delay in reply. That sounds great. I would love to hear your thoughts after your analysis

I completely agree. I did two tests, running on a 28c part and an 8c part, and I can confirm that with more HW threads the preads get slower, but the overall read is still much faster than the single-threaded chunk offset table read. Roughly a 150ms saving, especially on multi-part files with large part and channel counts.

I completely agree that, given the nature of the change, it's better to vet it fully across a few DCCs, and the next major release seems like a good idea. I will also look into the code for handling damaged files today.

Our test files were 24-part, 78-channel, 4096*2304, half-float, increasing-Y, ZIPS-compressed files. We tested the cost across the usual VFX DCCs and preview apps.

I made changes for scanline reads as well and will follow up in a separate email shortly.

Best Regards

Harsh

On Jul 3, 2020, at 11:37 PM, Peter Hillman <work@...> wrote:

Hi Harsh,
Thanks for the update. We discussed this briefly at the latest TSC meeting. You've certainly shown there's scope for some significant optimization here, and I think there is general interest in speeding up read times. I'll try to find time to run some tests of my own to see if I can reproduce this and do some low level analysis.

We wondered how much the speedup you see depends on that specific file system, and how much caching might be playing a part. We were not sure whether different threads reading from different parts of a file "simultaneously" may cause a slowdown in some cases. It's possible that the Xdr code is actually the bottleneck, rather than the file reads, and maybe a less invasive optimization is possible. Your approach means that the constructor becomes multithreaded, which would be a significant change in behavior. Currently only the 'readPixels'/'readTiles' calls are multithreaded. We'd also have to take care that damaged files are handled correctly.

Unfortunately, the nature of the changes required to optimize reads are likely to mean they have to be delayed until the next major release.

Peter