-
Notifications
You must be signed in to change notification settings - Fork 315
Rewrite job pool to support nested parallelism for improved scalability #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
zeux
wants to merge
7
commits into
BinomialLLC:master
Choose a base branch
from
zeux:jpool
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Before this change, given N images to encode, the caller had two options: - Use basis_compress for each image; internally this API would use multiple threads (when possible) to process large images on multiple threads - Use basis_parallel_compress for the group of images; internally this API would encode each image on one thread Neither of these would be optimal. For example, given 4 4K images and a 16-core system, basis_parallel_compress is suboptimal because it would only use 4 out of 16 cores for encoding; basis_compress would be better, but it still would be suboptimal because encoding of a large image doesn't scale perfectly and there would be gaps of inactivity where only one core would do the work (notably, decoding the image from PNG inputs or rescaling to compute mips). Also, scaling would decrease in efficiency for smaller mips even for one image. To fully saturate the system, you could use basis_compress and call it from multiple threads. This solves the problem of having enough concurrent work, but results in overhead due to context switches/thread preemption. It's difficult to size the two pools required in this case (outer and inner) wrt how many threads each has - especially if the goal is to reach a given level of total parallelism that is less than the system provides (e.g. -j16 on a 24-thread system) to avoid system wide performance issues due to oversubscription. For this to work well we ideally need to have basis_parallel_compress use a single job pool - both for "outer" processing, handling the flow of image decoding, and for "inner" processing (threading the work required for one image). This would get us the best of both worlds - a predictable total system fanout limited by the job pool capacity, and a full saturation of the said pool. For this to work well, we need to change the mechanics of job dispatch. Notably, when we wait for the jobs to finish, it's important that this only interacts with the jobs launched for the current image - both in terms of which jobs we wait for, and which jobs we help process. Without this there's going to be cross-talk between individual "outer" jobs which can result in poor scalability. To solve this, the job_pool API gets augmented with optional tokens - these are just outstanding job counters, and the pointer to the counter is saved in the job struct, which allows us to selectively steal only jobs relevant for a given token. The token is kept optional just in case simpler uses of this API are interesting elsewhere in this project, although all image encoding really should use tokens to work with nested parallelism. Note that all operations on tokens are protected by the mutex so we don't need to use atomics. In fact, the use of atomics in the old code contained a bug that lead to a rare deadlock on shutdown, see https://cohost.org/zeux/post/520125-condition-variables. This code has been extensively tested as part of gltfpack. Performance numbers on an example glTF model with 12 2K textures using ETC1S encoding on AMD Ryzen 5900X (12-core/24-thread system): - basis_compress one image at a time, no threading: 44s - basis_compress one image at a time: 12.5s - basis_parallel_compress before this change: 9.0s - basis_parallel_compress after this change: 4.2s (this change was tested on a variety of models and should never degrade performance and almost always result in an improvement, the above is just an example - notably, even though all textures there are the same size, they don't take the same amount of time to compress, so even with 12 serially compressed textures which is what basis_parallel_compress did by default, we have a very significant improvement in overall time with the new code)
This does not matter for gltfpack as we don't support HDR but it's valuable for consistency.
Contributor
Author
|
|
|
Commenting so this remains on the radar. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Before this change, given N images to encode, the caller had two options:
basis_compressfor each image; internally this API would use multiple threads (when possible) to process large images on multiple threadsbasis_parallel_compressfor the group of images; internally this API would encode each image on one threadNeither of these would be optimal. For example, given 4 4K images and a 16-core system,
basis_parallel_compressis suboptimal because it would only use 4 out of 16 cores for encoding;basis_compresswould be better, but it still would be suboptimal because encoding of a large image doesn't scale perfectly and there would be gaps of inactivity where only one core would do the work (notably, decoding the image from PNG inputs or rescaling to compute mips). Also, scaling would decrease in efficiency for smaller mips even for one image.To fully saturate the system, you could use
basis_compressand call it from multiple threads. This solves the problem of having enough concurrent work, but results in overhead due to context switches/thread preemption. It's difficult to size the two pools required in this case (outer and inner) wrt how many threads each has - especially if the goal is to reach a given level of total parallelism that is less than the system provides (e.g.-j16on a 24-thread system) to avoid system wide performance issues due to oversubscription.For this to work well we ideally need to have basis_parallel_compress use a single job pool - both for "outer" processing, handling the flow of image decoding, and for "inner" processing (threading the work required for one image). This would get us the best of both worlds - a predictable total system fanout limited by the job pool capacity, and a full saturation of the said pool.
For this to work well, we need to change the mechanics of job dispatch. Notably, when we wait for the jobs to finish, it's important that this only interacts with the jobs launched for the current image - both in terms of which jobs we wait for, and which jobs we help process. Without this there's going to be cross-talk between individual "outer" jobs which can result in poor scalability.
To solve this, the job_pool API gets augmented with optional tokens - these are just outstanding job counters, and the pointer to the counter is saved in the job struct, which allows us to selectively steal only jobs relevant for a given token. The token is kept optional just in case simpler uses of this API are interesting elsewhere in this project, although all image encoding really should use tokens to work with nested parallelism - forgetting it in the nested calls may lead to deadlocks as two running jobs would wait for each other with no forward progress.
Note that all operations on tokens are protected by the mutex so we don't need to use atomics. In fact, the use of atomics in the old code contained a bug that lead to a rare deadlock on shutdown, see https://cohost.org/zeux/post/520125-condition-variables.
This code has been extensively tested as part of gltfpack. Performance numbers on an example glTF model with 12 2K textures using ETC1S encoding on AMD Ryzen 5900X (12-core/24-thread system):
basis_compressone image at a time, no threading: 44sbasis_compressone image at a time: 12.5sbasis_parallel_compressbefore this change: 9.0sbasis_parallel_compressafter this change: 4.2s(this change was tested on a variety of models and should never degrade performance and almost always result in an improvement, the above is just an example - notably, even though all textures there are the same size, they don't take the same amount of time to compress, so even with 12 serially compressed textures which is what
basis_parallel_compressdid by default, we have a very significant improvement in overall time with the new code)