best practices - store huge data

hi there,

i’m about to store huge arrays of data (e.g. 16x8x36000 float values), and storing them in JSON is crazy (around 90 MB of data).

what’s the best way to do it in Rack? async/thread storing wave files? the raw data is about 18,432,000 bytes (16 x 8 x 36000 floats at 4 bytes each). is there any out-of-the-box method? how do you deal with this kind of process?

thanks for your best practices

That depends on the language (and supporting libraries). For python I would use a multidimensional numpy array and store that with pytables in a binary file. Might even make it smaller in size because of automatic data compression.

But for C++ you can just open a file, say ‘humongous.dat’, for binary writing and write your data as one big binary blob, as shown, for instance, in this post on Stack Overflow:

file - Writing binary data to fstream in c++ - Stack Overflow.

Very non-portable, only readable when you know the format and are on the same machine architecture, but really quick. The trick is the reinterpret_cast but that assumes your data is in a contiguous area of memory.
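A minimal sketch of that binary-blob approach, assuming the data lives in a std::vector<float> (which guarantees contiguous storage). The names writeBlob and readBlob are hypothetical helpers made up for this illustration, not part of Rack or any library. The file has no header, so the reader has to know the element count in advance.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write all floats as one raw binary blob. Returns false on I/O failure.
// No header is written: byte order and float format are those of the
// machine that wrote the file.
bool writeBlob(const std::string& path, const std::vector<float>& data) {
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file) return false;
    file.write(reinterpret_cast<const char*>(data.data()),
               static_cast<std::streamsize>(data.size() * sizeof(float)));
    return file.good();
}

// Read back exactly `count` floats. Returns an empty vector on failure
// (missing file, or a file that is too short).
std::vector<float> readBlob(const std::string& path, std::size_t count) {
    std::vector<float> data(count);
    std::ifstream file(path, std::ios::binary);
    if (!file.read(reinterpret_cast<char*>(data.data()),
                   static_cast<std::streamsize>(count * sizeof(float)))) {
        data.clear();
    }
    return data;
}
```

For the sizes discussed here, count would be 16 * 8 * 36000. Writing a small header (element count, a version number) in front of the blob would make the file easier to validate on load.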

Alternatively, you can use some of the C++ wrapper libraries around the HDF5 library, which matches the numpy / pytables solution.

I’d like to ask some questions to clarify the situation:

What kind of data do you want to store? Audio, CV, something else?
Should these data become part of a module preset?
How often will this data be read from disk?
How often will this data be written to disk?
Can the data be packed, zipped?
Do the data contain a lot of zero values or identical values?

Yes, those are good questions. Aren’t the last two the same thing? :wink:

Of course not, one is a subset of the other :wink:

true, but in both cases a good solution would be “zip that data”.

What kind of data: CV modulation (more or less). i’m not sure how this would differ from audio (i’ve listed the maximum number of “samples” that will be recorded). either way, audio or CV, the values will be floats.

Part of a module preset: yes, each preset needs its own arrays of data.

How often read from disk: once, when i load the patch into memory. after that i’ll manage the data directly from/to memory.

How often written to disk: when i close the patch is enough, i believe (or when autosave is triggered, so i won’t lose data if a crash happens). once stored on the filesystem, i’ll load it only on patch load (into memory), and then work with it from RAM (as said above).

Can the data be packed, zipped: it could be, but i’d like to examine the “worst” case, where every sample is different.

Lots of zero or identical values: as above…

Have you read this issue:

I think storing your data in a separate file and putting the path to that file into the JSON file might be a way to deal with such a large amount of data in Rack v1. It seems that Rack v2 will bring a better solution here. Maybe check out some sampler modules that have open source code to see how it can be done.

i’ve already done save/load of .wav files for samplers. here the situation is a bit different, since the save will be triggered by “autosave”, so it needs an async task/thread every time the store starts, otherwise the GUI will freeze.

is there any out-of-the-box code that already does this? otherwise, if there isn’t a native solution, i’ll need to implement my own from scratch :wink: do you know?

So your problem becomes much clearer now to me. Unfortunately I don’t have experience in using threads.

Maybe name your topic “best practice for saving/loading large data in an async thread” to make it easier for other devs to give an answer.

I think a lot of NYSTHI modules do this, but they aren’t open source. One sleazy way to do it is to do it on the UI thread if you can. My sampler SFZ Player does all wav loading on a true thread. My Colors noise source uses the same threading support classes.

Btw, how did you do save load for a sampler without using a thread?

I haven’t done any load/save for a sampler so far.

sorry, that was intended for @q8fuel who has, apparently.

VCV Rack seems to require C++11, so I guess you can use std::thread to do your saving and loading in a portable way. Mutexes, locks, condition variables and futures are also available in the C++11 thread support library. The VCV Rack API itself does not have many thread functions except for making real-time threads (don’t, unless you know what you’re doing), giving names to threads and so on.

https://en.cppreference.com/w/cpp/thread/thread

The saving itself of course depends on your data structure; see the earlier questions asked by @Ahornberg. And of course it is not a good idea to modify this data structure while your ‘save’ thread is still saving, so you will need some kind of synchronization and quite possibly a thread join, which is where you will end up blocking the GUI again. Unless you are just saving a copy of the data, in which case you can perfectly well organize a fire-and-forget save with a std::thread that is simply detached.

#include <fstream>
#include <thread>

class WhatEver
{
private:
  // 16 * 8 * 36000 floats (about 18 MB): the object must live on the heap, not on the stack.
  float controlData[16 * 8 * 36000];
public:
  void autoSaveTriggered();
private:
  void saveMyData();
};

void WhatEver::autoSaveTriggered()
{
   std::thread saveThread(&WhatEver::saveMyData, this);
   saveThread.join(); // this will block the UI thread until the save is complete which is not what you want to do.
}

void WhatEver::saveMyData()
{
   // Open for binary writing and truncate, so each save replaces the previous file
   // (std::ios::app would append another 18 MB copy on every autosave).
   std::ofstream file("test.bin", std::ios::out | std::ios::binary | std::ios::trunc);
   file.write(reinterpret_cast<const char *>(controlData), sizeof(controlData));
   file.close();
}

My C++ is a little rusty and I am not a plugin developer :smiley: so I may have made a mistake (or two) in the above code.

this line is the main problem i believe.

i can keep the thread asynchronous (so the UI won’t freeze), but what happens when i close Rack? the thread starts and will only finish after dataToJson() has returned, and probably after the process has ended…

shouldn’t this be a problem?

Congratulations! You have understood that shutting down threads is very complex, and is a huge source of crashes. I have worked on a lot of commercial software that would “crash on exit” due to improper shutting down of threads.

It’s really difficult, and really important. Whatever you do, test it a lot!

I think with my threaded plugins I have numerous unit tests just for shutting down.

:slight_smile:

in this case, doing saveThread.join(); in the plugin’s destructor (or even better, in the WhatEver class destructor) will probably solve the problem on “exit”…

Well, the other solution is to use:

// saveThread.join(); // this would have blocked the UI thread until the save is completed
saveThread.detach();  // saveThread is now fire and forget, you can not join it.

but be advised that any memory accessed in saveThread must remain in memory until it finishes and you don’t know when that will be without waiting on some object like a condition variable or a semaphore, which is what the saveThread.join() did to begin with.

Multithreaded programming is … er … tricky.

Yes, a saveThread.join() in the destructor is a good idea. But you also need to make sure no other thread is modifying controlData and the data array it references while the saveThread is running, otherwise what are you saving?

Furthermore, saveThread in the example code happily existed on the stack but if you make it a member variable so you can join in the destructor, you must also protect yourself against the case where the class instance is destroyed without the thread being created at all or where autoSaveTriggered is called when a thread has already been created in a previous invocation.
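The two cases above (destruction without a thread ever being created, and a second autosave while a thread object already exists) can both be guarded with joinable(). The following is only a sketch under stated assumptions: ControlDataStore and writeSnapshot are hypothetical names, the data is held in a std::vector<float>, and autoSaveTriggered is assumed to be called from a single thread. It saves a copy of the data so the worker never touches the live buffer. Taking that copy must itself not race with whichever thread writes controlData, which this sketch does not address.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class ControlDataStore {
public:
    explicit ControlDataStore(std::size_t count) : controlData(count, 0.0f) {}

    ~ControlDataStore() {
        // Safe even if no save was ever started: joinable() is then false.
        if (saveThread.joinable()) saveThread.join();
    }

    // Assumed to be called from one thread only (saveThread has no lock).
    void autoSaveTriggered(const std::string& path) {
        // Wait for a previous save first: assigning to a std::thread that
        // is still joinable would call std::terminate.
        if (saveThread.joinable()) saveThread.join();
        // Snapshot the data so the worker only ever sees its own copy.
        std::vector<float> snapshot = controlData;
        saveThread = std::thread(&ControlDataStore::writeSnapshot,
                                 path, std::move(snapshot));
    }

    std::vector<float> controlData;

private:
    // Static, so the worker needs no pointer to this object.
    static void writeSnapshot(std::string path, std::vector<float> snapshot) {
        std::ofstream file(path, std::ios::binary | std::ios::trunc);
        file.write(reinterpret_cast<const char*>(snapshot.data()),
                   static_cast<std::streamsize>(snapshot.size() * sizeof(float)));
    }

    std::thread saveThread;
};
```

The join at the start of autoSaveTriggered can still block briefly if the previous save has not finished, and the destructor still waits for the last save, so shutdown time is bounded by one write rather than eliminated.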

Saving 18 megabytes of data in a single write does not take all that long, mind you: a few tens of milliseconds. It’s not like you are writing it to old-fashioned 9-track reel tapes.
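How long a single write really takes depends on the disk and the operating system's caching, so it is worth measuring on the target machine. A small sketch, where timeSingleWrite is a hypothetical helper written for this illustration:

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes `count` floats with a single write() call and returns the elapsed
// wall-clock time in milliseconds, or -1.0 on I/O failure.
double timeSingleWrite(const std::string& path, std::size_t count) {
    const std::vector<float> data(count, 0.5f);
    const auto start = std::chrono::steady_clock::now();
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    file.write(reinterpret_cast<const char*>(data.data()),
               static_cast<std::streamsize>(data.size() * sizeof(float)));
    // close() hands the data to the OS, which may still cache it before
    // it physically reaches the disk.
    file.close();
    const auto stop = std::chrono::steady_clock::now();
    if (file.fail()) return -1.0;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling timeSingleWrite("timing.bin", 16 * 8 * 36000) measures the 18,432,000-byte case from this thread. Running it several times gives a better picture than one sample.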

std::async in C++11 is a nice way of launching a worker thread to do a task (such as saving a buffer to a file) and then exiting. Be sure to wait() on them before destroying the resources they modify, such as in your Module destructor.
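A minimal sketch of the std::async approach, written to stay within C++11. AsyncSaver and writeFile are hypothetical names, and the class is assumed to be used from one thread only. Note that std::launch::async is passed explicitly: with the default launch policy the implementation is allowed to defer the work until wait() or get() is called.

```cpp
#include <fstream>
#include <future>
#include <string>
#include <utility>
#include <vector>

class AsyncSaver {
public:
    ~AsyncSaver() {
        // valid() is false if no save was launched or the result was taken.
        if (pending.valid()) pending.wait();
    }

    // Launches a save of `data`, which is taken by value so the worker
    // owns its own copy of the buffer.
    void save(const std::string& path, std::vector<float> data) {
        if (pending.valid()) pending.wait();  // finish the previous save first
        pending = std::async(std::launch::async, &AsyncSaver::writeFile,
                             path, std::move(data));
    }

    // Blocks until the last save has finished and returns its result.
    // Returns true if nothing is pending.
    bool finish() {
        return pending.valid() ? pending.get() : true;
    }

private:
    static bool writeFile(std::string path, std::vector<float> data) {
        std::ofstream file(path, std::ios::binary | std::ios::trunc);
        file.write(reinterpret_cast<const char*>(data.data()),
                   static_cast<std::streamsize>(data.size() * sizeof(float)));
        return file.good();
    }

    std::future<bool> pending;
};
```

Compared with a raw std::thread, the future also carries a result back, so the caller can find out whether the write succeeded. The destructor still blocks until the last save completes.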

Is it OK for your module destructor to take an unbounded amount of time? What thread is the destructor running on?