1.0's Multithreading on MacOS X

fractalgee · June 19, 2019, 11:38am

I have been involved with a massively multithreaded data warehouse engine on a technical level for nearly 30 years, and I think I understand the issues, challenges and possible ways to implement MT, especially where performance is concerned.

By massively multithreaded I mean an engine that can easily use up all the threads an OS will give to a single process (something around 4096 threads is still the absolute max for a single process on many mainstream OSs), as well as every core the machine has. We used to joke that it was a good test of how well your hardware/OS combo really performed.

So I have been looking forward to getting the other 7 cores on my machine (real cores, no HT) working on behalf of Rack 1.0. I expected a second thread to help when one thread is barely chugging along without glitching. Instead, color me surprised, the patch started to skip and glitch like crazy. So OK, maybe Rack does not farm out existing work to the new thread, let me restart with 2 threads. Nope, same issue.

So testing further I see that the engine thread (worker 0) takes a serious hit when one additional thread is added (more than I would consider reasonable), and every further thread increases that by a much more expected amount (overhead of synchronization/single engine thread). But no number of threads allows me to add more work to Rack without it skipping.

So I am left with a Rack 1.0 that I can still only use on a single thread. Also, when looking at the CPU meters I see that every module starts using 10ish percent more CPU, which also looks weird to me.

@Vortico : I would love to start a discussion of how this could be improved so it actually has a benefit (sadly Cherry Audio’s Voltage Modular seems to do that better), as the current implementation seems to still need quite some work.

I had the same issue with rcomain’s build, but he said he could not help as he had no access to a Mac.

Vortico · June 19, 2019, 9:01pm

Works pretty well for me, but of course results will vary depending on your hardware, OS, and other factors I’m not even aware of.
On my ancient quad-core laptop running Linux, I can use 12 Audible Elements without audio hiccups with 1 thread and 35 with 4 threads. So about 2.9x as many modules (roughly 190% more) with 4 cores, not bad.
If you start mixing other software, VST plugins hosted in VCV Host, web browsers, Windows update in the background… you can get worse results.
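
To put those module counts in perspective, here is a small sketch of the arithmetic. The function name `scaling` is made up for illustration, and it assumes the 12-module figure is the single-thread baseline. It reports the speedup, the percentage gain, and the parallel efficiency (speedup divided by thread count).

```python
def scaling(single_thread_count: int, multi_thread_count: int, threads: int) -> dict:
    """Speedup, percent gain and parallel efficiency for a module-count benchmark."""
    speedup = multi_thread_count / single_thread_count
    return {
        "speedup": speedup,                        # how many times as many modules
        "percent_more": (speedup - 1.0) * 100.0,   # gain over the baseline
        "efficiency": speedup / threads,           # 1.0 would be perfect scaling
    }


if __name__ == "__main__":
    result = scaling(12, 35, 4)
    print(f"{result['speedup']:.2f}x, {result['percent_more']:.0f}% more, "
          f"{result['efficiency']:.0%} efficiency")
    # prints "2.92x, 192% more, 73% efficiency"
```

So 35 modules against 12 is about 2.9 times as many, which works out to roughly 73% efficiency on 4 threads.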

fractalgee · June 20, 2019, 7:19am

Nah, that was with only Rack and OS X running. And it is 100% reproducible on this machine, whereas oodles of stuff runs MT (virtually all apps, actually) without any issues, audio and otherwise. The Achilles heel here seems to be the audio thread (the one that has to do the actual rendering of audio), which should likely be its own thread anyhow, with the rest of the engine being the engine and all higher-level stuff done by separate threads. That is how a normal, well-behaved MT app is usually structured. So, no matter what, as soon as I add more threads that audio thread starts to sweat. I am not seeing any increase in the number of modules I can add; rather, it now starts to drop audio earlier with more than 1 core than with say 4 (on 8 real non-HT cores).

Furthermore, Nysthi’s seven seas wavetable VCO uses multiple cores to do the heavy lifting, and that never drops a single sample; it is smooth as butter even with 4 of them going.

Anyhow, this was to start a discussion of how to make this a real MT app, as this is really still a single-threaded app that adds a second thread but still has that monolithic single thread, rather than an engine with workers, multiple threads to execute stuff on, an event system to notify stuff, as little mutex locking/unlocking as possible, etc. This is hard stuff, as you so rightly note, and needs some retooling to become fully effective. Anyhow, just so that no one with an old PC is surprised if they find the same as I did, and to hopefully get this to a super state that blows everything else out of the water, period (it is already pretty much there). So mileage will vary, it seems, given how it currently works.
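
One possible reading of that structure (an engine that farms work out to workers, plus a separate audio thread that only consumes finished blocks) can be sketched as below. This is an illustration only, not how Rack is implemented. All names (`render_module`, `engine_thread`, `audio_thread`, `run_demo`) and the constants are hypothetical. Python's `queue.Queue` locks internally, so a real-time implementation would more likely use a lock-free ring buffer in C++.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4    # samples per block (tiny, for illustration only)
NUM_BLOCKS = 8    # blocks rendered in this demo
NUM_MODULES = 3   # hypothetical modules, each rendered by a worker


def render_module(module_id: int, block_index: int) -> list:
    """Stand-in DSP: every sample in the block is module_id + block_index."""
    return [float(module_id + block_index)] * BLOCK_SIZE


def engine_thread(out: queue.Queue, pool: ThreadPoolExecutor) -> None:
    """Farm each block's modules out to workers, mix, then queue the result."""
    for block_index in range(NUM_BLOCKS):
        parts = pool.map(render_module, range(NUM_MODULES),
                         [block_index] * NUM_MODULES)
        mixed = [sum(samples) for samples in zip(*parts)]
        out.put(mixed)   # blocks here if the audio thread falls behind
    out.put(None)        # sentinel: no more audio


def audio_thread(inp: queue.Queue, played: list) -> None:
    """Does nothing but pop finished blocks and hand them to the 'device'."""
    while True:
        block = inp.get()
        if block is None:
            break
        played.extend(block)


def run_demo() -> list:
    blocks = queue.Queue(maxsize=2)   # small buffer between engine and audio
    played = []
    with ThreadPoolExecutor(max_workers=NUM_MODULES) as pool:
        engine = threading.Thread(target=engine_thread, args=(blocks, pool))
        audio = threading.Thread(target=audio_thread, args=(blocks, played))
        engine.start()
        audio.start()
        engine.join()
        audio.join()
    return played
```

The point of the split is that the audio thread never waits on module work directly. It only stalls if the queue runs empty, which is the sketch's equivalent of a dropout.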

Nik · June 20, 2019, 7:42am

I wonder if there is something amiss with your machine. I expect you saw my experiments here a few months ago:

I have just done a quick and dirty test of that (this is on the same 2015 iMac) with Rack 1.0.

I can only run 14 VCO-1s in that configuration with 1 thread. With 4 threads I can run 30 VCO-1s.

I have the odd tiny glitch, but last time I did the test I shut down every daemon I could on my machine (as discussed in that post - and to my cost - because I forgot to turn Time Machine back on for several weeks!). I am sure I would be glitch free at 14/30 if I did turn everything off.

fractalgee · June 20, 2019, 4:48pm

Not much running here at all: MacOS X 10.11.6, 2x 4-core non-HT Xeons at 2.33 GHz. Everything else seems to work smooth as butter on this machine, and I can run around 30+ VCOs on one thread (Nysthi Poly Seven Seas in 4-channel poly mode with a swarm size of 4, which according to Antonio is 30 VCOs), and still run a couple more, a reverb, a mixer, a clock and a few other things. But then I turn on the CPU meters. With 1 thread, the audio module shows it has 15% left. Enable another thread and it starts stuttering like mad, and no matter how many I enable, the stutters remain. Disable the additional threads and all is back to smooth.

So I did experiment with just a basic setup: clock, bernoulli gates, caudal, flux, scala quant. There is the squinky mixer8 and the RCM G-Verb as well. The audio module shows 87% CPU left. Turn on another thread: it nosedives to somewhere around 15% and stays there with further threads. So something is not quite right. Realtime priority or not makes no difference, which is very odd.

Nik · June 20, 2019, 5:43pm

Just a speculation - is the difference that you have two processors?

fractalgee · June 20, 2019, 6:44pm

Yup, 2x4 cores indeed but should make very little diff

6u1ll3 · June 20, 2019, 7:02pm

Hi fractalgee

The spike you see when adding threads is because of the overhead. With more than one thread, Rack has to keep them in sync. It may be possible to improve the implementation, like you said, with a more modular approach, but that is a monumental task.
And to be fair, Ableton has the same problem. There is some load balancing, but put too many VSTs on a track and the audio chugs. I think this is true in most DAWs.

The other thing is that your computer is kind of old. An i5 CPU will likely be faster than your two Xeons in everything but maybe some rendering/encoding tasks. A $120 Ryzen CPU is orders of magnitude faster.
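
To make the overhead argument concrete, here is a deliberately crude model, not a description of Rack's engine. It assumes the DSP work divides evenly across threads and that any thread count above 1 adds one fixed synchronization cost, both expressed as fractions of the audio time budget. The names `headroom` and `implied_sync` are made up. It plugs in the headroom figures reported earlier in the thread (87% left with one thread, about 15% with two), and treats the audio module's CPU meter as headroom, which is also an assumption.

```python
def headroom(work_fraction: float, threads: int, sync_fraction: float) -> float:
    """Fraction of the audio time budget left over.

    work_fraction: single-thread DSP time divided by the budget
    sync_fraction: assumed fixed sync cost divided by the budget,
                   paid only when more than one thread is used
    """
    used = work_fraction / threads
    if threads > 1:
        used += sync_fraction
    return 1.0 - used


def implied_sync(work_fraction: float, threads: int, observed_headroom: float) -> float:
    """Sync cost that would explain an observed headroom under this model."""
    return (1.0 - observed_headroom) - work_fraction / threads


if __name__ == "__main__":
    work = 1.0 - 0.87                      # 87% left on one thread
    sync = implied_sync(work, 2, 0.15)     # about 15% left on two threads
    for n in (1, 2, 4, 8):
        print(n, round(headroom(work, n, sync), 3))
```

Under these assumptions the sync cost would have to eat roughly 78% of the budget, and headroom only creeps up to about 18% at 4 threads and 20% at 8. A light patch cannot win much when a fixed cost that large dominates, which would match the "stays there with further threads" observation. Whether the real cost is actually fixed per thread count is not something this model can show.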

Vortico · June 21, 2019, 3:46am

Ah yeah, NUMA was a horrible idea for general purpose computing. It was a popular niche a decade ago, but it’s now almost entirely phased out except for supercomputing clusters with software specifically written for multiple processors with shared memory.
I’m 90% sure the reason you’re having poor performance is because threads running on your two CPUs are trying to access each other’s memory, which takes twice as long as if you used 1 thread.