std::async driving up CPU. I could use some advice!

Hello everyone! I’m experimenting with std::async, and although it may be working, it’s driving CPU way, way up, to the point where Rack is nearly frozen.

Here’s my code:

    std::future<bool> t1 = std::async(std::launch::async, &GrooveBox::processTrack, this, 0, &mix_left_output, &mix_right_output);
    std::future<bool> t2 = std::async(std::launch::async, &GrooveBox::processTrack, this, 1, &mix_left_output, &mix_right_output);
    std::future<bool> t3 = std::async(std::launch::async, &GrooveBox::processTrack, this, 2, &mix_left_output, &mix_right_output);
    std::future<bool> t4 = std::async(std::launch::async, &GrooveBox::processTrack, this, 3, &mix_left_output, &mix_right_output);
    std::future<bool> t5 = std::async(std::launch::async, &GrooveBox::processTrack, this, 4, &mix_left_output, &mix_right_output);
    std::future<bool> t6 = std::async(std::launch::async, &GrooveBox::processTrack, this, 5, &mix_left_output, &mix_right_output);
    std::future<bool> t7 = std::async(std::launch::async, &GrooveBox::processTrack, this, 6, &mix_left_output, &mix_right_output);
    std::future<bool> t8 = std::async(std::launch::async, &GrooveBox::processTrack, this, 7, &mix_left_output, &mix_right_output);

    // Wait for all functions to complete
    t1.get();
    t2.get();
    t3.get();
    t4.get();
    t5.get();
    t6.get();
    t7.get();
    t8.get();

and…

  bool processTrack(unsigned int track_index, float *mix_left_output, float *mix_right_output)
  {
    // even if this function is empty, CPU goes nuts
    return true;
  }


I also tried adding .wait() calls, like so:

    std::future<bool> t1 = std::async(std::launch::async, &GrooveBox::processTrack, this, 0, &mix_left_output, &mix_right_output);
    std::future<bool> t2 = std::async(std::launch::async, &GrooveBox::processTrack, this, 1, &mix_left_output, &mix_right_output);
    std::future<bool> t3 = std::async(std::launch::async, &GrooveBox::processTrack, this, 2, &mix_left_output, &mix_right_output);
    std::future<bool> t4 = std::async(std::launch::async, &GrooveBox::processTrack, this, 3, &mix_left_output, &mix_right_output);
    std::future<bool> t5 = std::async(std::launch::async, &GrooveBox::processTrack, this, 4, &mix_left_output, &mix_right_output);
    std::future<bool> t6 = std::async(std::launch::async, &GrooveBox::processTrack, this, 5, &mix_left_output, &mix_right_output);
    std::future<bool> t7 = std::async(std::launch::async, &GrooveBox::processTrack, this, 6, &mix_left_output, &mix_right_output);
    std::future<bool> t8 = std::async(std::launch::async, &GrooveBox::processTrack, this, 7, &mix_left_output, &mix_right_output);

    t1.wait();
    t2.wait();
    t3.wait();
    t4.wait();
    t5.wait();
    t6.wait();
    t7.wait();
    t8.wait();

    // Wait for all functions to complete
    t1.get();
    t2.get();
    t3.get();
    t4.get();
    t5.get();
    t6.get();
    t7.get();
    t8.get();

This is my first time playing with std::async. Any suggestions? Thanks! :pray:

Are you doing that in your process method? Or are you starting a thread pool in your constructor?

I guess: what are you hoping to accomplish? Can you give us a bit more context?

Threading and thread pooling in real-time audio is hard hard hard. Especially doing it in a way that is cooperative with the Rack engine threads, or with your host DAW in VST land.

std::async is not the right approach to multithreading audio; there are basically no guarantees about how it’s actually implemented in your particular C++ standard library and/or operating system. But to begin with, multithreading audio processing is not necessarily something you’d want to attempt anyway.

1 Like

Right - basically a bunch of us are going to chime in with a computer science version of the doctor joke!

There are very few situations where I think a plugin needs to do threading in Rack, basically. And performance is not one of them. “Can you SIMD-vectorize instead?” is always the first question to ask.

Hi everyone! It sounds like I’m going down the wrong path with std::async. Yes, it’s in my main process method, so it’s being called every frame, and I’m assuming from everyone’s questions that I definitely should not do that. Ha ha ha.

My thought was, “I call this function 8 times, and it’s responsible for most of the CPU usage. Maybe if I can get the 8 function calls to run concurrently, my module will be 8x as fast!” I suppose it’s not that easy. :grinning: :man_facepalming:

Maybe. I’ll take another look!

Oh gosh, no. Think of thread creation as being very slow and expensive and you will have a better mental model.

And synchronization is also expensive

It’s more subtle than this, but you were creating 8 threads every sample, which will crush every OS in the world. (At a typical 48 kHz sample rate, that’s 384,000 thread creations per second.)

Single threaded simd is almost always the answer.
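To make that concrete: instead of farming each track out to its own thread, you can process all tracks in one tight loop over contiguous data. A hedged sketch (hypothetical `TrackState` layout, not the poster’s actual code) of what a single-threaded, auto-vectorizable mix might look like — with `-O3` most compilers will emit SIMD for a loop like this:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: sum 8 tracks into a stereo mix in one loop.
// Contiguous data plus a branch-free loop body lets the compiler
// auto-vectorize, which is usually far cheaper than spawning threads.
constexpr std::size_t NUM_TRACKS = 8;

struct TrackState {
    float left_sample = 0.f;
    float right_sample = 0.f;
};

void mixTracks(const TrackState (&tracks)[NUM_TRACKS],
               float &mix_left_output, float &mix_right_output)
{
    float left = 0.f;
    float right = 0.f;
    for (std::size_t i = 0; i < NUM_TRACKS; ++i) {
        left += tracks[i].left_sample;
        right += tracks[i].right_sample;
    }
    mix_left_output = left;
    mix_right_output = right;
}
```

(Rack also ships its own SIMD helpers, e.g. `rack::simd`, which let you process four voices per instruction explicitly rather than relying on the auto-vectorizer.)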

(As an editorial: the CLAP thread pool extension lets VSTs schedule work in cooperation with the host, and that has made plugins like Diva 30% faster in some cases, but that’s loads of work at a super low level. “Don’t multithread in Rack plugins” is the 99.99% advice (“don’t find the place in Surge where I multithread and post it here sarcastically” is another 0.008% - chuckle))

In MVerb it works very well with a processing thread, which takes a lot of load off the audio thread (6-19% CPU without the thread and 1-2% CPU with it).

Yup there’s some ways it can work.

Curious: do you set appropriate affinity and priority and stuff for your thread? How do you avoid competing with the engine spin locks? Or has that not been a problem?

But I’m also presuming you start your thread once and message to it with some lock free structure yeah?
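For readers following along, that “create the thread once, signal it with lock-free state” pattern can be sketched roughly like this (hypothetical names, not MVerb’s actual code; a real implementation would park the worker on a semaphore or condition variable rather than spinning, and would do the heavy processing where the comment sits):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of a persistent worker: constructed once, kicked from the
// audio thread with an atomic flag. The audio-thread-facing calls
// (kick/finished) never block and never allocate.
class Worker {
    std::atomic<bool> work_ready{false};
    std::atomic<bool> done{false};
    std::atomic<bool> running{true};
    std::thread thread;

public:
    Worker()
        : thread([this] {
              while (running.load(std::memory_order_acquire)) {
                  if (work_ready.exchange(false, std::memory_order_acq_rel)) {
                      // ... heavy processing would go here ...
                      done.store(true, std::memory_order_release);
                  }
                  // NOTE: this sketch busy-spins; a real worker would
                  // sleep on a condition variable or semaphore here.
              }
          }) {}

    ~Worker() {
        running.store(false, std::memory_order_release);
        thread.join();
    }

    // Audio thread: request work; never blocks.
    void kick() { work_ready.store(true, std::memory_order_release); }

    // Audio thread: poll for completion; never waits.
    bool finished() { return done.exchange(false, std::memory_order_acq_rel); }
};
```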

Oh, and if I have 7 MVerbs, do they share a thread or do you just start getting contention? I guess I could go read the code to see what you did, though!

Here’s the code in MVerb, just for reference:

Yeah, I just read it: you start the thread at construction time and use a ring buffer to communicate onto and off of it. And of course this isn’t a spot for code reviews!

But my main question about that approach is: if you are on a 4-core machine, have Rack running 3 engine threads, and make 2 instances of the module, don’t you get lots of contention with the engine scheduler?

The thread communication is done with the lock-free ring buffers from the VCV Rack API. Per MVerb there is one additional thread, so with 4 MVerbs you will have 4 more threads, and it looks like this:

  11028 docb      20   0 1231416 509760 233492 R  30.6   3.1   1:43.02 Rack                 
  11055 docb     -16   0 1231416 509760 233492 R  24.7   3.1   0:56.35 RtAudio              
  11058 docb      20   0 1231416 509760 233492 R  16.1   3.1   0:17.12 Rack                 
  11071 docb      20   0 1231416 509760 233492 R  15.1   3.1   0:10.03 Rack                 
  11057 docb      20   0 1231416 509760 233492 R  14.5   3.1   0:17.25 Rack                 
  11067 docb      20   0 1231416 509760 233492 R  13.8   3.1   0:09.75 Rack     

You have to experiment; with a greater buffer size (in the menu) you may get a pop-free state.
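For anyone curious what a lock-free ring buffer actually is: here is a minimal single-producer/single-consumer sketch. This is not Rack’s `dsp::RingBuffer`, just an illustration of the idea — one thread pushes, another pops, and only atomics (no locks) coordinate them:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal SPSC lock-free ring buffer. N must be a power of two so that
// the unsigned wraparound of head/tail stays consistent with `% N`.
template <typename T, std::size_t N>
class SpscRing {
    T buffer[N];
    std::atomic<std::size_t> head{0}; // written only by the producer
    std::atomic<std::size_t> tail{0}; // written only by the consumer

public:
    // Producer side: returns false when the buffer is full.
    bool push(const T &value) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) >= N)
            return false;
        buffer[h % N] = value;
        head.store(h + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false when the buffer is empty.
    bool pop(T &out) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return false;
        out = buffer[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
```

The key property for audio: both `push` and `pop` complete in a bounded number of instructions, so neither thread can ever be blocked waiting on the other.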

Right, so yeah, you will contend with the engine. And on an M1 Mac you can get routed to efficiency cores and stuff too. That’s all the stuff I was referring to when I suggested it was tricky! Interesting though - thanks for sharing.

1 Like

Yes, maybe this is not optimal for all platforms, but you can still use the non-threaded mode. On my 16-core AMD it works very well :wink:

Sure! Anyway, thanks for sharing. I’ll DM you a thought or two I had from reading the code later today also.

totally agree with what @baconpaul is saying and implying here.

  1. You must never, ever create a thread, lock a semaphore (or otherwise wait) from an audio process call. Never, ever, ever.

  2. Yes, it is super hard. And using threads for audio processing is fraught with peril. You will for sure be fighting with the existing audio threads. Who will win? Depends. Who do you want to win? Good question with no good answer.
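Rule 1 in code form: a hedged sketch (hypothetical names) of the one acceptable way to touch a lock from an audio `process()` call — *try* to take it, and have a fallback when you don’t get it, so the audio thread can never block:

```cpp
#include <cassert>
#include <mutex>

// Shared state written by some other (non-audio) thread.
std::mutex shared_state_mutex;
float shared_value = 0.f;

// Audio-thread-local cache used as the fallback.
float last_value = 0.f;

// Called from the audio thread: never blocks. If the lock is free we
// refresh the cache; if it's contended we simply reuse the last value.
float readSharedNonBlocking() {
    std::unique_lock<std::mutex> lock(shared_state_mutex, std::try_to_lock);
    if (lock.owns_lock())
        last_value = shared_value;
    return last_value;
}
```

Even this is debatable in hard real-time code (the lock holder could be preempted at the wrong moment, starving your updates), which is why fully lock-free structures are usually preferred.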

Myself, I’ve done two threaded modules in VCV: Colors and SFZ player. Both are “easy” and avoid having to win the fight.

In colors I do large FFT on my thread to generate colored noise. But I create the thread at default priority, assuming that the audio engine is designed correctly and runs the audio threads at elevated priority. And, since I designed it to lose the fight, nothing bad happens if my thread can’t run - the audio thread will just re-use the last buffer full of noise.
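That “designed to lose the fight” idea can be sketched as a double buffer with an atomic publish (hypothetical names, not the actual Colors code): the worker fills the inactive buffer and flips an index; the audio thread reads whichever buffer is current and never waits.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Double-buffered source: the worker refills the spare buffer and
// publishes it atomically; the audio thread always gets *some* valid
// buffer and never blocks, even if the worker falls behind.
struct NoiseSource {
    std::vector<float> buffers[2];
    std::atomic<int> current{0};

    NoiseSource() {
        buffers[0].assign(256, 0.f);
        buffers[1].assign(256, 0.f);
    }

    // Worker thread: fill the inactive buffer, then publish it.
    void refill(float value) {
        int next = 1 - current.load(std::memory_order_acquire);
        for (float &s : buffers[next])
            s = value; // stand-in for the expensive FFT work
        current.store(next, std::memory_order_release);
    }

    // Audio thread: read whatever is current; if the worker hasn't
    // finished, this just re-uses the previously published buffer.
    const std::vector<float> &read() const {
        return buffers[current.load(std::memory_order_acquire)];
    }
};
```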

In SFZ player, I use the worker thread to load patches. This can involve reading multiple gigabytes from hundreds of files. You could not do it at all on the audio thread, and it would be delicate to do it on the UI thread (and you aren’t supposed to use the UI thread for that). Here if I lose the fight it’s fine - it will just take longer to load the patch.

The truly evil case I have not dealt with. What if you are processing audio in large blocks and can’t afford to be late? (there are several plugins, including VCV noise that do this). If you do it on the audio thread, you will have huge CPU spikes every buffer. At low latency settings that could be catastrophic. On the other hand, if you do it on a worker, then you care who wins the fight, and there is no good answer.

2 Likes

Yeah there’s other considerations too. If every module chose the MVerb strategy you would crush every machine - in some sense the “i spin up a thread for me” is effective for a module but “selfish” for a system. That may be fine in the MVerb case but i think people would be upset if every surge VCO spun a thread. (This, by the way, is what we solved with the CLAP thread pool extension; the HOST controls the available thread count and the PLUGIN can schedule onto it asynchronously so there’s no contention between plugins).

But also there’s a super long way you can go with just optimization. Avoid virtual functions, dont call special functions on every sample, etc…

and most importantly before you do anything else run a profiler!

So I guess my “99.99%” admonition was aimed, as @Squinky gleaned, at the idea that threads often seem like good local ideas but are very very hard to get right in all situations in all systems without some serious engineering; and Rack doesn’t provide any access to its scheduling other than a relatively clever spin loop engine thread mechanism to call your process.

3 Likes

X 100 what @baconpaul says.

1 Like

Make extra threads be extender modules?