For those of you curious about the RTMIDI problem:
<rack>/src/rtmidi.cpp:
void runThread() {
    system::setThreadName("RtMidi output");
    std::unique_lock<decltype(mutex)> lock(mutex);
    while (!stopped) {
        if (messageQueue.empty()) {
            // No messages. Wait on the CV to be notified.
            cv.wait(lock);
        }
        else {
            // Get earliest message
            const MessageSchedule& ms = messageQueue.top();
            double duration = ms.timestamp - system::getTime();
            // If we need to wait, release the lock and wait for the timeout, or if the CV is notified.
            // This correctly handles MIDI messages with no timestamp, because duration will be NAN.
            if (duration > 0) {
                if (cv.wait_for(lock, std::chrono::duration<double>(duration)) != std::cv_status::timeout)
                    continue;
            }
            // Send and remove from queue
            sendMessageNow(ms.message);
            messageQueue.pop();
        }
    }
}
The problem is that this code takes a reference to the top item in the message queue, then waits for a timeout (which unlocks the lock). While waiting, other threads can post a message to the queue, which can cause the data in the queue to be reallocated, which makes the reference invalid. It now points to freed memory or memory used for something else.
When the wait ends and this thread resumes, it is possible for the wait to report a timeout, which skips the continue. In that case the thread passes the (now invalid) message to sendMessageNow, which ends up a couple of functions deeper in code that dereferences the invalid pointer and crashes with a segfault. If the queue data has not been reallocated, the reference may instead point to a different message, with the less disastrous consequence that messages can be sent out of order.
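To make the reallocation part concrete, here is a small sketch. It assumes the message queue is a std::priority_queue over the default std::vector (I have not shown the declaration above, so treat that as an assumption); Msg, Later and pushesUntilTopMoves are made-up names for illustration, not Rack code. It only compares addresses and never reads through the stale one.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for the real message type.
struct Msg {
    double timestamp;
    int id;
};

// Orders the queue so the earliest timestamp is on top.
struct Later {
    bool operator()(const Msg& a, const Msg& b) const {
        return a.timestamp > b.timestamp;
    }
};

// Pushes later-timestamped messages until the storage behind top() moves.
// Returns the number of pushes that took, or -1 if it never moved.
int pushesUntilTopMoves(int maxPushes) {
    std::priority_queue<Msg, std::vector<Msg>, Later> q;
    q.push({1.0, 0});
    // Record the address as an integer so we never touch a stale pointer.
    auto before = reinterpret_cast<std::uintptr_t>(&q.top());
    for (int i = 1; i <= maxPushes; i++) {
        // Later timestamps, so logically the top message stays the same one.
        q.push({1.0 + i, i});
        auto after = reinterpret_cast<std::uintptr_t>(&q.top());
        if (after != before)
            return i; // the vector grew, so a reference taken earlier would dangle
    }
    return -1;
}
```

The top message never changes here, yet its address does as soon as the underlying vector grows. A reference held across that push is exactly the kind of dangling reference described above.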
A fix is to always continue after the timed wait, so that the top queued item is re-acquired and used while the lock is held, which guarantees that the data is valid:
if (duration > 0) {
    cv.wait_for(lock, std::chrono::duration<double>(duration));
    continue;
}
If nothing has changed the queue in the meantime and top is still the same message, then once the timeout has expired the recalculated duration will be <= 0, and the message is sent immediately without any additional waiting.
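For anyone who wants to play with the pattern outside Rack, here is a self-contained sketch of a sender thread that always re-reads top() under the lock. Everything in it (Scheduled, Sender, now, post, sentIds) is a hypothetical stand-in I made up for illustration, and "sending" just records the message id; it is not the actual Rack implementation.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical message type: timestamp is in seconds on the clock used by now().
struct Scheduled {
    double timestamp;
    int id;
    bool operator>(const Scheduled& o) const { return timestamp > o.timestamp; }
};

// Current time in seconds on the steady clock.
static double now() {
    using namespace std::chrono;
    return duration<double>(steady_clock::now().time_since_epoch()).count();
}

class Sender {
    std::mutex mutex;
    std::condition_variable cv;
    std::priority_queue<Scheduled, std::vector<Scheduled>, std::greater<Scheduled>> queue;
    std::vector<int> sent; // ids in the order they were "sent"
    bool stopped = false;
    std::thread thread;    // declared last so every other member exists before it starts

    void run() {
        std::unique_lock<std::mutex> lock(mutex);
        while (!stopped) {
            if (queue.empty()) {
                cv.wait(lock);
                continue;
            }
            double duration = queue.top().timestamp - now();
            if (duration > 0) {
                // The lock is released while waiting, so the queue may change.
                cv.wait_for(lock, std::chrono::duration<double>(duration));
                continue; // always loop and look at top() again
            }
            // top() is read and used without releasing the lock in between.
            sent.push_back(queue.top().id);
            queue.pop();
        }
    }

public:
    Sender() : thread([this] { run(); }) {}
    ~Sender() { stop(); }

    void post(Scheduled s) {
        {
            std::lock_guard<std::mutex> guard(mutex);
            queue.push(s);
        }
        cv.notify_one();
    }

    std::vector<int> sentIds() {
        std::lock_guard<std::mutex> guard(mutex);
        return sent;
    }

    void stop() {
        {
            std::lock_guard<std::mutex> guard(mutex);
            stopped = true;
        }
        cv.notify_one();
        if (thread.joinable())
            thread.join();
    }
};
```

Posting a few messages out of timestamp order while the thread is already waiting should still produce them in timestamp order from sentIds(), because no reference to a queue element survives a wait.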
Now, this is extremely timing-sensitive, so it occurs relatively rarely. I was lucky enough to catch this in the debugger at a moment when my brain was un-fogged enough to see the problem. I had been working on and off for weeks trying to catch this bug in the act. I had hit the same crash under the debugger several other times, but had been focused on trying to find a problem in my own code.