This video made me think about how PH has probably implemented their multi-core rendering. Given that all CPUs and threads start working as soon as anything is going on, I guess they have reversed the multi-core rendering compared to how it is usually implemented.
Let me explain: In the normal case you have jobs to be done. Each job has its context, meaning there are things that need to be done before and after it, while some jobs can run independently. A simple example is (1) "audio->fx->fx" as a serial chain, and (2) "audio1", (3) "audio2" as independent parallel jobs. Jobs (1) to (3) can be sent to different cores/threads to be rendered, as long as they are not related, which results in three cores working at the same time.
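A minimal sketch of that job layout, assuming nothing about any real DAW's API (the function names are purely illustrative): the serial chain is one job on one worker, while the two independent tracks run in parallel on other workers.

```python
from concurrent.futures import ThreadPoolExecutor

def render_serial_chain():
    # Job (1): the serial part "audio->fx->fx" must run in order,
    # so the whole chain stays one job on one core.
    signal = "audio"
    for fx in ("fx1", "fx2"):
        signal = f"{signal}->{fx}"
    return signal

def render_audio(name):
    # Jobs (2) and (3): independent tracks with no shared state,
    # so they can be rendered on any free core.
    return name

with ThreadPoolExecutor(max_workers=3) as pool:
    f1 = pool.submit(render_serial_chain)
    f2 = pool.submit(render_audio, "audio1")
    f3 = pool.submit(render_audio, "audio2")
    results = [f.result() for f in (f1, f2, f3)]

print(results)  # ['audio->fx1->fx2', 'audio1', 'audio2']
```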
But since all cores and threads are running at the same time, the jobs are probably not sent to the cores; instead, the cores and threads are activated and poll a centralized job queue. Maybe there are several queues for individual parallelism, and the cores and threads check each of them.
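This reversed model can be sketched like so (again purely a speculation sketch, not PH's actual code): a pool of always-running workers pulls from one shared queue, rather than a scheduler pushing each job to a specific core.

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()
STOP = object()  # sentinel telling the workers to shut down

def worker():
    # Each worker pulls from the shared queue; whichever worker is
    # free takes the next job. Nobody assigns jobs to specific cores.
    while True:
        job = jobs.get()
        if job is STOP:
            jobs.put(STOP)  # pass the sentinel on so other workers stop too
            break
        with lock:
            results.append(job * 2)  # stand-in for "render this job"

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for j in range(8):
    jobs.put(j)
jobs.put(STOP)
for t in threads:
    t.join()

print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Note that `queue.Queue.get()` here blocks (sleeps) when the queue is empty; a true busy-polling worker would spin instead, which matters for the points below.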
So what is the difference between sending jobs to cores and letting the cores poll? Here are a few things that came to mind:
* power consumption
* cache invalidation
* branch-prediction invalidation
* blocking of threads/cores while accessing the queues
* cores/threads pick up a job as soon as they are free and do not sit idle, as they would under bad scheduling in a job-sending system
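The power-consumption and blocking points come down to how a worker waits for work. A small sketch of the two styles (hypothetical helpers, not any real engine's code):

```python
import queue

q = queue.Queue()

def busy_poll(q):
    # Busy polling: the thread never sleeps; it burns CPU cycles
    # checking the queue again and again. Low latency, high power cost.
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            pass  # spin and try again immediately

def blocking_wait(q):
    # Blocking wait: the thread sleeps and the OS wakes it when an
    # item arrives. No wasted cycles while the queue is empty.
    return q.get()

q.put("job1")
assert busy_poll(q) == "job1"
q.put("job2")
assert blocking_wait(q) == "job2"
```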
What does this type of implementation indicate?
* it was probably the easiest and quickest way to implement it
* probably better core self-balancing instead of the overhead of a complicated extra balancer
* bad job identification and scheduling
* The following points may also apply to job-sending systems:
** architecture dependencies
** tight coupling, poor cohesion
** a lot of synchronization may be required
** global buffers
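The self-balancing point above can be illustrated with a tiny simulation (invented job costs, not measured data): with static push scheduling, a pre-assigned heavy job drags one worker down, while pull scheduling balances itself because each free worker just takes the next job.

```python
# Simulated job costs in milliseconds; one job is much heavier than the rest.
costs = [50, 5, 5, 5, 5, 5, 5, 5]

def run_static(costs, n_workers=2):
    # Push/static scheduling: jobs are pre-assigned round-robin,
    # with no regard to how long each job actually takes.
    buckets = [costs[i::n_workers] for i in range(n_workers)]
    return max(sum(b) for b in buckets)  # makespan = slowest worker

def run_pull(costs, n_workers=2):
    # Pull scheduling, simulated greedily: each job goes to whichever
    # worker is currently least loaded, i.e. the first one to become free.
    loads = [0] * n_workers
    for c in costs:
        loads[loads.index(min(loads))] += c
    return max(loads)

print(run_static(costs))  # 65: one worker gets the 50 ms job plus three 5 ms jobs
print(run_pull(costs))    # 50: the heavy job occupies one worker; the rest go elsewhere
```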
Conclusion:
I am not sure that the things Hydlide is seeing are a real issue, or that they can be used for comparison or as a performance indication. If it is implemented the way I speculate, it is the throughput that counts in the end. But polling is never a good way to implement this, because it keeps the system busy where it does not need to be. Smarter balancing (and some cores can be faster than others), more isolated jobs, less coupling, less locking, and a better architecture will come almost automatically if you implement more isolated jobs. A long way to go, maybe...
All this is my opinion. Nothing official.