A multithread performance experiment with Rack v1dev & the rcomian fork v0.6.2c-experiments2

I’m sure many of us are interested in how Rack v1 will improve performance. As everyone is by now aware, there are two main factors in this: CPU & GPU.

Andrew (@vortico) has recently made some changes that affect CPU performance and I was interested to test these. I will hereafter refer to the v1 development commit I have used as v1dev and it was locally compiled on commit 510f7b2179dcffcd8f2fafaae66f7c0070ee6215. Locally compiled v1dev Fundamental modules were used for patching.

As you are also aware Jim Tupper (@JimT) has been very actively researching this area, has released a series of his experiments as a fork (https://github.com/Rcomian/Rack) and has published a fascinating paper on this: https://github.com/Rcomian/Rack/wiki/Multi-threading.
I will hereafter refer to this as rcomian and it was natively compiled with v0.6.2c-experiments2. Plugin Manager Fundamental modules were used for patching.

Test system: 27” 5K iMac 2015. CPU: 6700K (4 GHz); GPU: AMD 395X (4 GB); 32 GB 2400.

As much as possible was stopped in the background: WiFi, Time Machine, Spotlight Indexing, iCloud. The only processes running that exceeded 1% CPU were Rack, iTerm, iStat Menus (used for the stats), windowserver and coreaudiod. Other processes had little spikes, and these are what create the glitches at the highest numbers of VCO-1s. Testing was carried out fullscreen with 4 rows displaying on the iMac’s 5K monitor at 2880*1620 resolution. No frame rate limiting was used with either v1dev or rcomian.

Test Patch: Groups of 3 VCO-1s feeding their 12 (4 * 3) oscillator outputs into a Unity (2 * 6). These Unitys feed into further Unitys and then to a Mixer and thence to an Audio. A scope monitors the Mixer’s output in order to ensure the levels do not clip (considerable phasing due to delays in the signals through the circuits produces rapidly changing levels). Testing was carried out at 44.1k (engine & audio) & 256 block size. Patches are here:

threadtest-rack1dev-36.vcv (48.9 KB)
threadtest-rcomian-36.vcv (52.5 KB)

Testing began with a 36 VCO-1 patch. Each thread count from 1 to 4 was tested for around a minute, any glitches that occurred were noted, and a (necessarily) subjective assessment of their severity was made. CPU percentages were recorded by observing iStat’s display. At higher numbers of VCO-1s these percentages are not stable, so a subjective average was taken (they displayed greater ranges with rcomian). VCO-1s were then deleted one by one, beginning at the bottom right and moving left until a row was cleared, and then from the right on the next row up. At no point were any Unitys or their wiring deleted.

Here is a screenshot of the rcomian patch with 36 VCO-1s:

Here is a screenshot of the v1dev patch with 12 VCO-1s:

Results

The data in the spreadsheet covers a range from 12 VCO-1s in my patch to 36 VCO-1s on 4 threads (I have 4 physical cores & hyperthreading does not help). Why not go lower or higher? Because data outside this range is not relevant for this patch. The important thing to remember about multithreading in Rack is that you will get no benefit from adding threads until you need them (i.e. your audio is glitching because your patch has too many modules). Adding threads before this point is counterproductive (you will have greater CPU usage/power consumption/heat/fan noise). Why is this? That is above my pay grade; I have half an understanding, but to avoid looking like a complete idiot I refer you to others who can give you the correct explanation. 12 VCO-1s was the point at which I needed to add extra threads in this patch; beyond 36 VCO-1s lay only complete audio degradation (and it was nice and symmetrical!).

The results are best viewed here: https://docs.google.com/spreadsheets/d/1zxWdVyPo_PpgWNQPT3zK69ij8UPZeARmvSqlaTLS24U/edit?usp=sharing

But here is a screenshot:

Spreadsheet key:

Dark Green: No glitches
Pale Green: Very occasional glitch (typically a daemon CPU spike)
Yellow: Occasional glitches & audio dropouts
Amber: Glitches & audio dropouts
Red: Severe glitches & audio dropouts
Black: Complete audio degradation

Please be very clear, the numbers here mean very little. They are a test on my system of a particular patch. It is the pattern that is important.

What can we see/conclude? Well, here are some headline figures for the patch (using Dark Green/No glitches):

Max VCO-1s (1 thread): 12 (v1dev); 13 (rcomian)
Max VCO-1s (2 threads): 21 (v1dev); 22 (rcomian)
Max VCO-1s (3 threads): 24 (v1dev); 19, 22-24, 27-28 (rcomian)
Max VCO-1s (4 threads): 22 (v1dev); 19, 21-24, 27 (rcomian)

Conclusions: v1dev has consistent figures, rcomian has higher but inconsistent figures; 3 threads are optimal, not 4.

CPU Range (1 thread): up to 118% (v1dev); up to 123% (rcomian)
CPU Range (2 threads): 185-219% (v1dev); 142-208% (rcomian)
CPU Range (3 threads): 306-307% (v1dev); 252-250% (rcomian - inconsistencies)
CPU Range (4 threads): 411% (v1dev - no benefit over 3 threads); 297% (rcomian - no benefit over 3 threads)

NB These are calculated from the point that the next thread needs to be turned on.

Conclusions: rcomian has better CPU usage than v1dev, especially at 3 threads, but there are inconsistencies that may create a few glitches (probably) depending on the patch.

Note that (debatably) usable results are available for slightly higher numbers of VCO-1s with rcomian (Pale Green). This is likely to be very dependent on the patch used.

Note also that whilst the direction of travel is the same, v1dev’s results are reasonably linear, whereas rcomian’s are (very much) not; v1dev’s implementation is done with little code and is a parsimonious solution (AFAIK); rcomian’s implementation is complex and uses multiple vectors.

It is clear that the law of diminishing returns is very much in play here, as Andrew has previously noted.
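To put a number on the diminishing returns, here is a minimal sketch that takes the v1dev glitch-free maxima from the headline figures above and works out how many extra VCO-1s each additional thread bought. The function name marginal_gains is just an illustrative helper, not anything from Rack.

```python
# Max glitch-free VCO-1 counts for v1dev, copied from the headline figures above.
max_vcos = {1: 12, 2: 21, 3: 24, 4: 22}


def marginal_gains(counts):
    """Return {thread_count: extra VCO-1s gained over the previous thread count}."""
    threads = sorted(counts)
    return {t: counts[t] - counts[prev] for prev, t in zip(threads, threads[1:])}


gains = marginal_gains(max_vcos)
for thread_count, gain in gains.items():
    print(f"thread {thread_count}: {gain:+d} VCO-1s")
```

On these figures the second thread adds 9 VCO-1s, the third adds 3, and the fourth loses 2. As with everything else here, this applies to this patch on this machine only.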

I hope that this has been interesting and useful for some of you.

Cheers.

NB: There is an update below testing limiting the frame rate in combination with additional threads.

Interesting read Nik, thanks!

I had a patch on one machine where adding multiple threads made performance worse. I think there might be an issue with the audio interface on that one though :thinking:

Some further experimentation this morning because Andrew (@vortico) has released a v1dev version with framerate limiting: Rack v1 development blog

I built commit: 927c77eca6c337497633763e7c960ec1b7225086 and following the suggestion in the above post edited settings.json for 25 (frames per second), a reduction from the native vsync of iMac’s graphic card & monitor of 60fps.

I was testing the same rcomian fork this time with the menu setting of 30fps. This is not the same as 25fps above but this was a quick and dirty test and not an exhaustive one.

Taking v1dev first, there were some good improvements.

Max VCO-1s on 1 thread went from 12 to 13 (at CPU 98 %)
Max VCO-1s on 2 threads went from 21 to 24 (at CPU 202 %)
Max VCO-1s total went from 22 to 28 (288% for 3 threads, 387% for 4 threads)

With rcomian:

Max VCO-1s on 1 thread remained at 13 (CPU 106%)
Max VCO-1s on 2 threads remained at 22 (CPU 185%)
Max VCO-1s were 22 for 3 threads (200%) and 19 for 4 threads.

This time with rcomian there were no reliable upper outliers. The same numbers came up as before (23, 24, 27 & 28 VCO-1s) but this time they had a glitch or two a minute caused by CPU spikes from other processes on the machine. This makes me wonder a little about the first set of results on these upper numbers - maybe I just got lucky there. However, as noted in the write-up, with higher numbers of VCO-1s the first glitches you hear are caused by these CPU spikes, showing how little headroom there is at this point to accommodate whatever else your computer might be up to.

A further note/speculation - gains in the number of modules due to GPU limiting are going to be far more variable across different computers and their GPUs. The gains here are not just in terms of getting to use some extra modules, but are, as Andrew stated, about power usage. If you have a thermally compromised machine (a recent MacBook Pro being a prime example) then you are unlikely to be very happy with increasing threads, because the CPU overhead of doing so is high and heat and fan noise will go up at a worrying rate. Limiting the frame rate of the GPU, however, brings considerable thermal benefits (and a few extra modules maybe). YMMV.

These results were a fairly quick test but one thing is writ clear in both: there are diminishing returns here. Maybe it will vary with different patches but with this patch there is not a compelling reason to accept the extra power usage of going beyond 2 threads (especially if thermals are an issue).

NB The results above have been edited (apologies, the 1 and 2 thread results got conflated)

Interesting tests Nik. Thanks.

Another update. Andrew has introduced a thread priority update as discussed here in commit: a97ae1c7a73ddcdbda0e2ece0fb7c787ab79426f.

This appears to have bought a little more for 2 and 3 threads (again with max fps set at 25):

Max VCO-1s (1 thread): 14 (102%)
Max VCO-1s (2 threads): 26 (196%)
Max VCO-1s (3 threads): 29 (292%)
Max VCO-1s (4 threads): 28 (389%)
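One way to read these four lines is as VCO-1s per 100% of CPU, i.e. how much you get for each core's worth of load. The sketch below only does that arithmetic on the figures above; the helper name vcos_per_100_cpu is made up for the illustration.

```python
# (max glitch-free VCO-1s, CPU %) per thread count, copied from the figures above.
results = {1: (14, 102), 2: (26, 196), 3: (29, 292), 4: (28, 389)}


def vcos_per_100_cpu(vcos, cpu_percent):
    """VCO-1s obtained per 100% of CPU (roughly, per fully loaded core)."""
    return vcos / (cpu_percent / 100)


efficiency = {
    threads: round(vcos_per_100_cpu(vcos, cpu), 1)
    for threads, (vcos, cpu) in results.items()
}
for threads, value in efficiency.items():
    print(f"{threads} thread(s): {value} VCO-1s per 100% CPU")
```

The ratio holds up reasonably well from 1 to 2 threads and then drops away at 3 and 4, which is the same diminishing-returns pattern seen in the earlier rounds.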

I have a silly question related to the figures and, in general, the interpretation of the cpu load percentage.
Suppose I have a simple patch which takes 40% load with 1 thread (what do you want? my system is bad).
Turning the thread count to 2, I get 130% Rack load.
My explanation is that the added 90% load is spent waiting on spinning locks. Is this assumption correct?

If so I expect the CPU to be warming a little less than an application that is really taking 130% CPU time for heavy processing.

If I add VCOs I get higher CPU load, so definitely I see a large fixed computational cost introduced by multithreading. What is the benefit, then? The fact that I will not top the first CPU core but I’ll be allowed to reach 200% (with 2 threads) or 300% (with 3 threads).

Any insight is appreciated, thanks!

BTW: I have 2 cores with 2 threads each, and I’m running on Linux, with 25 fps video rate.

Yes, although I might make them spin locks that check a global flag and turn into mutexes if any modules are blocking (e.g. Core Audio).

Maybe. mov, test, and jne are probably more power efficient than SSE instructions, but I’d guess less than a factor of 2. But if I do the above, power probably won’t be an issue.

You can add more modules, no? If you’re not experiencing this, it’s probably because having 2 cores doesn’t give enough time to other non-engine threads.

When you turn on two threads, VCV will be able to handle (up to) twice as many plugins. VCV runs all the audio threads all the time, so VCV CPU consumption will double, but your plugins will have twice as much CPU available to them.

Most DAWs are multi-threaded like this, but often they don’t use CPU when they aren’t doing anything. So, while VCV will 100% use your CPU cores, even with a tiny patch, most DAWs will not “waste” the excess CPU and turn it into heat like VCV will.

But, in any case, all those CPUs you “give” to VCV are available for running large patches.
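For anyone wondering why a waiting thread shows up as CPU load at all, here is a toy sketch of the difference between a busy-wait (spin) and a blocking wait. It is not Rack's engine code, which is C++; it is only a Python illustration of the general idea, and cpu_cost_of_waiting is a name invented for it.

```python
import threading
import time


def cpu_cost_of_waiting(spin, wait_seconds=0.2):
    """Return CPU seconds the process used while one worker waited for a flag."""
    ready = threading.Event()

    def worker():
        if spin:
            # Busy-wait: poll the flag in a tight loop, using CPU while idle.
            while not ready.is_set():
                pass
        else:
            # Blocking wait: the OS parks the thread until the flag is set.
            ready.wait()

    thread = threading.Thread(target=worker)
    start = time.process_time()  # CPU time of the whole process, not wall time
    thread.start()
    time.sleep(wait_seconds)  # the "work" the worker is waiting on
    ready.set()
    thread.join()
    return time.process_time() - start


spin_cpu = cpu_cost_of_waiting(spin=True)
sleep_cpu = cpu_cost_of_waiting(spin=False)
print(f"spinning: {spin_cpu:.3f} s CPU, sleeping: {sleep_cpu:.3f} s CPU")
```

The spinning worker should report CPU time close to the whole wait, while the sleeping one reports almost none. The trade-off is that a spinning thread can react as soon as the flag changes, whereas a sleeping thread has to be woken by the OS first, which matters when the engine is working at audio rates.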

Well, yes I can. The only thing is that I was expecting Rack would “release” some of the CPU resources it was “locking” to the execution of the process functions; instead these require extra resources.

Then a last question: why is this not done in VCV?

Thanks

Now that HybridBarrier has been introduced, which sleeps the engine similarly to the experiments branch, I wonder what the original stability comparison between the two platforms is like now.

I believe it has now been done.