std::async driving up CPU. I could use some advice!

Hello everyone! I’m experimenting with std::async, and although it may be working, it’s driving CPU way, way up, to the point where Rack is nearly frozen.

Here’s my code:

    std::future<bool> t1 = std::async(std::launch::async, &GrooveBox::processTrack, this, 0, &mix_left_output, &mix_right_output);
    std::future<bool> t2 = std::async(std::launch::async, &GrooveBox::processTrack, this, 1, &mix_left_output, &mix_right_output);
    std::future<bool> t3 = std::async(std::launch::async, &GrooveBox::processTrack, this, 2, &mix_left_output, &mix_right_output);
    std::future<bool> t4 = std::async(std::launch::async, &GrooveBox::processTrack, this, 3, &mix_left_output, &mix_right_output);
    std::future<bool> t5 = std::async(std::launch::async, &GrooveBox::processTrack, this, 4, &mix_left_output, &mix_right_output);
    std::future<bool> t6 = std::async(std::launch::async, &GrooveBox::processTrack, this, 5, &mix_left_output, &mix_right_output);
    std::future<bool> t7 = std::async(std::launch::async, &GrooveBox::processTrack, this, 6, &mix_left_output, &mix_right_output);
    std::future<bool> t8 = std::async(std::launch::async, &GrooveBox::processTrack, this, 7, &mix_left_output, &mix_right_output);

    // Wait for all functions to complete
    t1.get();
    t2.get();
    t3.get();
    t4.get();
    t5.get();
    t6.get();
    t7.get();
    t8.get();

and…

  bool processTrack(unsigned int track_index, float *mix_left_output, float *mix_right_output)
  {
    // even if this function is empty, CPU goes nuts
    return true;
  }


I also tried adding .wait() calls, like so:

    std::future<bool> t1 = std::async(std::launch::async, &GrooveBox::processTrack, this, 0, &mix_left_output, &mix_right_output);
    std::future<bool> t2 = std::async(std::launch::async, &GrooveBox::processTrack, this, 1, &mix_left_output, &mix_right_output);
    std::future<bool> t3 = std::async(std::launch::async, &GrooveBox::processTrack, this, 2, &mix_left_output, &mix_right_output);
    std::future<bool> t4 = std::async(std::launch::async, &GrooveBox::processTrack, this, 3, &mix_left_output, &mix_right_output);
    std::future<bool> t5 = std::async(std::launch::async, &GrooveBox::processTrack, this, 4, &mix_left_output, &mix_right_output);
    std::future<bool> t6 = std::async(std::launch::async, &GrooveBox::processTrack, this, 5, &mix_left_output, &mix_right_output);
    std::future<bool> t7 = std::async(std::launch::async, &GrooveBox::processTrack, this, 6, &mix_left_output, &mix_right_output);
    std::future<bool> t8 = std::async(std::launch::async, &GrooveBox::processTrack, this, 7, &mix_left_output, &mix_right_output);

    t1.wait();
    t2.wait();
    t3.wait();
    t4.wait();
    t5.wait();
    t6.wait();
    t7.wait();
    t8.wait();

    // Wait for all functions to complete
    t1.get();
    t2.get();
    t3.get();
    t4.get();
    t5.get();
    t6.get();
    t7.get();
    t8.get();

This is my first time playing with std::async. Any suggestions? Thanks! :pray:

Are you doing that in your process method? Or are you starting a thread pool in your constructor?

I guess: what are you hoping to accomplish? Can you give us a bit more context?

Threading and thread pooling in real-time audio is hard hard hard. Especially doing it in a way that is cooperative with the Rack engine threads, or with your host DAW in VST land.

std::async is not the right approach to multithreading audio; there are basically no guarantees about how it’s actually implemented in your particular C++ standard library and/or operating system. But to begin with, multithreading audio processing is not necessarily something you’d want to attempt anyway.

1 Like

Right - basically a bunch of us are going to chime in with a computer science version of the doctor joke!

There are very few situations where I think a plugin needs to do threading in Rack, basically. And performance is not one of them. “Can you SIMD-vectorize instead?” is always the first question to ask.

Hi everyone! It sounds like I’m going down the wrong path with std::async. Yes, it’s in my main process method, so it’s being called every frame, and I’m assuming from everyone’s questions that I definitely should not do that. Ha ha ha.

My thought was, “I call this function 8 times, and it’s responsible for most of the CPU usage. Maybe if I can get the 8 function calls to run concurrently, my module will be 8x as fast!” I suppose it’s not that easy. :grinning: :man_facepalming:

Maybe. I’ll take another look!

Oh gosh, no. Think of thread creation as being very slow and expensive and you will have a better mental model.

And synchronization is also expensive

It’s more subtle than this, but you were creating 8 threads every sample, which will crush every OS in the world. (At a typical 48 kHz sample rate, that’s 384,000 thread creations per second.)

Single threaded simd is almost always the answer.
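To make that concrete: instead of farming each track out to its own thread, you can process all tracks in one tight loop over contiguous data. A hedged sketch (hypothetical `TrackState` layout, not the poster’s actual code) of what a single-threaded, auto-vectorizable mix might look like — with `-O3` most compilers will emit SIMD for a loop like this:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: sum 8 tracks into a stereo mix in one loop.
// Contiguous data plus a branch-free loop body lets the compiler
// auto-vectorize, which is usually far cheaper than spawning threads.
constexpr std::size_t NUM_TRACKS = 8;

struct TrackState {
    float left_sample = 0.f;
    float right_sample = 0.f;
};

void mixTracks(const TrackState (&tracks)[NUM_TRACKS],
               float &mix_left_output, float &mix_right_output)
{
    float left = 0.f;
    float right = 0.f;
    for (std::size_t i = 0; i < NUM_TRACKS; ++i) {
        left += tracks[i].left_sample;
        right += tracks[i].right_sample;
    }
    mix_left_output = left;
    mix_right_output = right;
}
```

(Rack also ships its own SIMD helpers, e.g. `rack::simd`, which let you process four voices per instruction explicitly rather than relying on the auto-vectorizer.)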

(As an editorial: the CLAP thread pool extension lets VSTs schedule work in cooperation with the host, and that has made plugins like Diva 30% faster in some cases, but that’s loads of work at a super low level. “Don’t multithread in Rack plugins” is the 99.99% advice (“don’t find the place in Surge where I multithread and post it here sarcastically” is another 0.008% - chuckle))

In MVerb it works very well with a processing thread, which takes a lot of load off the audio thread (6-19% CPU without the thread and 1-2% CPU with it).

Yup there’s some ways it can work.

Curious: do you set appropriate affinity and priority and stuff for your thread? How do you avoid competing with the engine spin locks? Or has that not been a problem?

But I’m also presuming you start your thread once and message to it with some lock free structure yeah?
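For readers following along, that “create the thread once, signal it with lock-free state” pattern can be sketched roughly like this (hypothetical names, not MVerb’s actual code; a real implementation would park the worker on a semaphore or condition variable rather than spinning, and would do the heavy processing where the comment sits):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of a persistent worker: constructed once, kicked from the
// audio thread with an atomic flag. The audio-thread-facing calls
// (kick/finished) never block and never allocate.
class Worker {
    std::atomic<bool> work_ready{false};
    std::atomic<bool> done{false};
    std::atomic<bool> running{true};
    std::thread thread;

public:
    Worker()
        : thread([this] {
              while (running.load(std::memory_order_acquire)) {
                  if (work_ready.exchange(false, std::memory_order_acq_rel)) {
                      // ... heavy processing would go here ...
                      done.store(true, std::memory_order_release);
                  }
                  // NOTE: this sketch busy-spins; a real worker would
                  // sleep on a condition variable or semaphore here.
              }
          }) {}

    ~Worker() {
        running.store(false, std::memory_order_release);
        thread.join();
    }

    // Audio thread: request work; never blocks.
    void kick() { work_ready.store(true, std::memory_order_release); }

    // Audio thread: poll for completion; never waits.
    bool finished() { return done.exchange(false, std::memory_order_acq_rel); }
};
```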

Oh, and if I have 7 MVerbs, do they share a thread or do you just start getting contention? I guess I could go read the code to see what you did, though!

Here’s the code in MVerb, just for reference:

Yeah, I just read it: you start the thread at construction time and use a ring buffer to communicate onto and off of it. And of course this isn’t a spot for code reviews!

But my main question about that approach is: if you are on a 4-core machine, have Rack running 3 engine threads, and make 2 instances of the module, don’t you get lots of contention with the engine scheduler?

The thread communication is done with the lock-free ring buffers from the VCV Rack API. Per MVerb there is one additional thread, so with 4 MVerbs you will have 4 more threads, and it looks like this:

  11028 docb      20   0 1231416 509760 233492 R  30.6   3.1   1:43.02 Rack                 
  11055 docb     -16   0 1231416 509760 233492 R  24.7   3.1   0:56.35 RtAudio              
  11058 docb      20   0 1231416 509760 233492 R  16.1   3.1   0:17.12 Rack                 
  11071 docb      20   0 1231416 509760 233492 R  15.1   3.1   0:10.03 Rack                 
  11057 docb      20   0 1231416 509760 233492 R  14.5   3.1   0:17.25 Rack                 
  11067 docb      20   0 1231416 509760 233492 R  13.8   3.1   0:09.75 Rack     

You have to experiment; with a greater buffer size (in the menu) you may get a pop-free state.
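For anyone curious what a lock-free ring buffer actually is: here is a minimal single-producer/single-consumer sketch. This is not Rack’s `dsp::RingBuffer`, just an illustration of the idea — one thread pushes, another pops, and only atomics (no locks) coordinate them:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal SPSC lock-free ring buffer. N must be a power of two so that
// the unsigned wraparound of head/tail stays consistent with `% N`.
template <typename T, std::size_t N>
class SpscRing {
    T buffer[N];
    std::atomic<std::size_t> head{0}; // written only by the producer
    std::atomic<std::size_t> tail{0}; // written only by the consumer

public:
    // Producer side: returns false when the buffer is full.
    bool push(const T &value) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) >= N)
            return false;
        buffer[h % N] = value;
        head.store(h + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false when the buffer is empty.
    bool pop(T &out) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return false;
        out = buffer[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
```

The key property for audio: both `push` and `pop` complete in a bounded number of instructions, so neither thread can ever be blocked waiting on the other.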

Right, so yeah, you will contend with the engine. And on an M1 Mac you can get routed to efficiency cores and stuff too. That’s all the stuff I was referring to when I suggested it was tricky! Interesting though - thanks for sharing.

1 Like

Yes, maybe this is not optimal for all platforms, but you can still use the non-threaded mode. On my 16-core AMD it works very well :wink:

Sure! Anyway, thanks for sharing. I’ll DM you a thought or two I had from reading the code later today also.

totally agree with what @baconpaul is saying and implying here.

  1. You must never, ever create a thread, lock a semaphore (or otherwise wait) from an audio process call. Never, ever, ever.

  2. Yes, it is super hard. And using threads for audio processing is fraught with peril. You will for sure be fighting with the existing audio threads. Who will win? Depends. Who do you want to win? Good question with no good answer.
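Rule 1 in code form: a hedged sketch (hypothetical names) of the one acceptable way to touch a lock from an audio `process()` call — *try* to take it, and have a fallback when you don’t get it, so the audio thread can never block:

```cpp
#include <cassert>
#include <mutex>

// Shared state written by some other (non-audio) thread.
std::mutex shared_state_mutex;
float shared_value = 0.f;

// Audio-thread-local cache used as the fallback.
float last_value = 0.f;

// Called from the audio thread: never blocks. If the lock is free we
// refresh the cache; if it's contended we simply reuse the last value.
float readSharedNonBlocking() {
    std::unique_lock<std::mutex> lock(shared_state_mutex, std::try_to_lock);
    if (lock.owns_lock())
        last_value = shared_value;
    return last_value;
}
```

Even this is debatable in hard real-time code (the lock holder could be preempted at the wrong moment, starving your updates), which is why fully lock-free structures are usually preferred.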

Myself, I’ve done two threaded modules in VCV: Colors and SFZ player. Both are “easy” and avoid having to win the fight.

In colors I do large FFT on my thread to generate colored noise. But I create the thread at default priority, assuming that the audio engine is designed correctly and runs the audio threads at elevated priority. And, since I designed it to lose the fight, nothing bad happens if my thread can’t run - the audio thread will just re-use the last buffer full of noise.
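That “designed to lose the fight” idea can be sketched as a double buffer with an atomic publish (hypothetical names, not the actual Colors code): the worker fills the inactive buffer and flips an index; the audio thread reads whichever buffer is current and never waits.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Double-buffered source: the worker refills the spare buffer and
// publishes it atomically; the audio thread always gets *some* valid
// buffer and never blocks, even if the worker falls behind.
struct NoiseSource {
    std::vector<float> buffers[2];
    std::atomic<int> current{0};

    NoiseSource() {
        buffers[0].assign(256, 0.f);
        buffers[1].assign(256, 0.f);
    }

    // Worker thread: fill the inactive buffer, then publish it.
    void refill(float value) {
        int next = 1 - current.load(std::memory_order_acquire);
        for (float &s : buffers[next])
            s = value; // stand-in for the expensive FFT work
        current.store(next, std::memory_order_release);
    }

    // Audio thread: read whatever is current; if the worker hasn't
    // finished, this just re-uses the previously published buffer.
    const std::vector<float> &read() const {
        return buffers[current.load(std::memory_order_acquire)];
    }
};
```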

In SFZ player, I use the worker thread to load patches. This can involve reading multiple gigabytes from hundreds of files. You could not do it at all on the audio thread, and it would be delicate to do it on the UI thread (and you aren’t supposed to use the UI thread for that). Here if I lose the fight it’s fine - it will just take longer to load the patch.

The truly evil case I have not dealt with. What if you are processing audio in large blocks and can’t afford to be late? (there are several plugins, including VCV noise that do this). If you do it on the audio thread, you will have huge CPU spikes every buffer. At low latency settings that could be catastrophic. On the other hand, if you do it on a worker, then you care who wins the fight, and there is no good answer.

2 Likes

Yeah there’s other considerations too. If every module chose the MVerb strategy you would crush every machine - in some sense the “i spin up a thread for me” is effective for a module but “selfish” for a system. That may be fine in the MVerb case but i think people would be upset if every surge VCO spun a thread. (This, by the way, is what we solved with the CLAP thread pool extension; the HOST controls the available thread count and the PLUGIN can schedule onto it asynchronously so there’s no contention between plugins).

But also there’s a super long way you can go with just optimization. Avoid virtual functions, dont call special functions on every sample, etc…

and most importantly before you do anything else run a profiler!

So I guess my “99.99%” admonition was aimed, as @Squinky gleaned, at the idea that threads often seem like good local ideas but are very very hard to get right in all situations in all systems without some serious engineering; and Rack doesn’t provide any access to its scheduling other than a relatively clever spin loop engine thread mechanism to call your process.

3 Likes

X 100 what @baconpaul says.

1 Like

Make extra threads be extender modules?