best practices - store huge data

q8fuel · May 21, 2021, 8:59am

hi there,

i’m about to store huge arrays of data (i.e. 16x8x36000 float values), and storing them on json is crazy (like 90mb of data).

what’s the best way to do it in Rack? async/threaded storing of wave files? it’s about 18,432,000 bytes. is there any out-of-the-box method? how do you deal with this kind of process?

thanks for your best practices

GretchenV · May 21, 2021, 8:15pm

That depends on the language (and supporting libraries). For python I would use a multidimensional numpy array and store that with pytables in a binary file. Might even make it smaller in size because of automatic data compression.

But for C++ you can just open a file, say ‘humongous.dat’, for binary writing and write your data as one big binary blob. For instance, see this post on Stack Overflow:

file - Writing binary data to fstream in c++ - Stack Overflow.

Very non-portable, only readable when you know the format and are on the same machine architecture, but really quick. The trick is the reinterpret_cast but that assumes your data is in a contiguous area of memory.
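
A minimal sketch of the binary blob approach described above, assuming the data is held in a std::vector<float> (which guarantees contiguous storage). The function names saveBlob and loadBlob and the file name are made up for illustration. The file is written in the machine's native float layout, so it carries the same portability caveat.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write the whole buffer as one raw binary blob.
// Returns false if the file cannot be opened or the write fails.
bool saveBlob(const std::string& path, const std::vector<float>& data)
{
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file)
        return false;
    file.write(reinterpret_cast<const char*>(data.data()),
               static_cast<std::streamsize>(data.size() * sizeof(float)));
    file.flush();
    return file.good();
}

// Read exactly `count` floats back into `data`.
// Returns false (and leaves `data` untouched) if the file is missing or too short.
bool loadBlob(const std::string& path, std::vector<float>& data, std::size_t count)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
        return false;
    std::vector<float> tmp(count);
    const std::streamsize wanted =
        static_cast<std::streamsize>(count * sizeof(float));
    file.read(reinterpret_cast<char*>(tmp.data()), wanted);
    if (file.gcount() != wanted)
        return false;
    data.swap(tmp);
    return true;
}
```

The reader has to know the element count in advance; a small header with the dimensions would be one way to make the file self-describing.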

Alternatively, you can use some of the C++ wrapper libraries around the HDF5 library, which matches the numpy / pytables solution.

Ahornberg · May 22, 2021, 5:56am

I’d like to ask some questions to clarify the situation:

What kind of data do you want to store? Audio, CV, something else?
Should these data become part of a module preset?
How often will these data be read from disk?
How often will these data be written to disk?
Can the data be packed, zipped?
Do the data contain a lot of zero values or identical values?

Squinky · May 22, 2021, 5:59am

Yes, those are good questions. Aren’t these last two the same thing?

stoermelder · May 22, 2021, 6:04am

Of course not, one is a subset of the other

Squinky · May 22, 2021, 6:20am

true, but in both cases a good solution would be “zip that data”.

q8fuel · May 24, 2021, 3:46pm

cv modulation (more or less). not sure how this would differ from audio (i’ve listed the max number of “samples” which will be recorded). either way, both audio and cv will be float type.

yes. each preset needs its own arrays of data.

once, when i load the patch (into memory). then i’ll manage data directly from/to memory.

when I close the patch is enough, I believe (or when autosave is triggered, so I won’t lose data if a crash happens). once stored on the filesystem, I’ll load it only on patch load (into memory), and then work with it from RAM (as said above).

it could be, but I’d like to examine the “worst” case, where every sample will be different.

as above…

Ahornberg · May 24, 2021, 4:23pm

Have you read this issue:

github.com/VCVRack/Rack

Changing patch format to be a ZIP file to support patch assets

opened 09:51AM - 30 Jun 20 UTC

closed 07:20AM - 19 Sep 20 UTC

AndrewBelt

### Problem

Some Rack modules need to have "internal storage" for readable/writable assets such as samples, audio clips, automation data, wavetables, machine-learned data, video game save state, etc.

If a module's internal storage is less than ~100KB, it can simply be serialized to JSON, possibly with `string::toBase64()` if the data is binary and/or `string::compress()` if the data has low entropy.

For assets >100KB, there is no VCV-standardized method for serializing state assets. If a user has 10 modules, each serializing 1MB of assets with base64, autosaving every 15 seconds might visually lag the UI for 10's of milliseconds. A 10MB JSON file isn't elegant but works, but a 100MB-10GB JSON file is not acceptable.

Modules have worked around this issue by creating assets in non-standardized locations in the Rack user folder or in the folder of the patch file. But this is not portable, meaning that users cannot easily share patches with friends or transfer them to another computer, and since files aren't one-to-one associated with a Rack patch, they can be modified (or even deleted) by another patch, breaking the original patch.

### Proposed solution

Instead of a Rack patch being a JSON file, it will be a ZIP archive containing

- `/patch.json` containing the serialized modules, cables, and patch settings.
- `/modules/<moduleId>/` containing assets for each module. This folder only exists if the module has requested to create a patch asset.

When a user loads a .vcv file, Rack will extract the ZIP archive to `<Rack user folder>/autosave/`. If a previous autosave folder exists, it will be deleted.

Modules can call `Module::asset(std::string filename)` to get the absolute path of a patch asset. They can then call `fopen()` or `std::fstream()` to create that file. Modules must delete their own assets when they are finished using them. For example, if a user requests to delete a recorded clip from a timeline module, the module should delete the file.

When a user deletes a module, the module asset folder will *not* be deleted. This is so the user can undo the deletion and continue using the module's assets.

When a user saves a .vcv file, Rack will serialize its state to `patch.json` and ZIP the autosave folder. The archiver will skip `/modules/<moduleId>/` folders for modules that do not exist in the patch, which effectively garbage-collects deleted modules' assets.

### Potential issues

- Loading a patch with 10GB of recorded samples will extract all samples before the patch appears, which would take a long time. Some hard drives are as slow as 100MB/s, making saving/loading patches take up to 100 seconds. However, patches of this size are rare with Rack. If ZIP is the bottleneck (I'd guess it isn't), a different archive format could be used.
- There is no way to undo operations that delete files, since modules must delete assets when the user requests. However, if plugin developers want to get fancy, they could move files to the OS's recycle bin and push a Rack history action that restores the file from the recycle bin.

I think storing your data in a separate file and putting the path to that file into the JSON file might be a way to deal with such a large amount of data in Rack v1. It seems that Rack v2 will bring a better solution here. Maybe check out some sampler modules that have open source code to see how it can be done.

q8fuel · May 26, 2021, 9:40am

i’ve already done save/load of .wav files for samplers. here the situation is a bit different, since the saving will be triggered by “autosave”, so i need an async/thread every time the storing starts, otherwise the gui will freeze.

is there any out-of-the-box code that does this already? otherwise, if there isn’t a native solution, i’ll need to implement my own from scratch. do you know?

Ahornberg · May 26, 2021, 10:07am

So your problem becomes much clearer now to me. Unfortunately I don’t have experience in using threads.

Maybe name your topic “best practice for saving/loading large data in an async thread” to make it easier for other devs to give an answer.

Squinky · May 26, 2021, 1:10pm

I think a lot of NYSTHI modules do this, but they aren’t open source. One sleazy way to do it is do it on the ui thread if you can. My sampler SFZ Player does all wav loading on a true thread. My Colors noise source uses the same threading support classes.

Btw, how did you do save load for a sampler without using a thread?

Ahornberg · May 26, 2021, 1:29pm

I haven’t done any load/save for a sampler so far.

Squinky · May 26, 2021, 1:47pm

sorry, that was intended for @q8fuel who has, apparently.

GretchenV · May 26, 2021, 7:42pm

VCV Rack seems to require C++11, so I guess you can use std::thread to do your saving and loading in a portable way. Mutexes, locks, condition variables and futures are also available in the C++11 standard threading library. The VCV Rack API itself does not have many thread functions except for making real-time threads (don’t, unless you know what you’re doing), giving names to threads and so on.

https://en.cppreference.com/w/cpp/thread/thread

The saving itself of course depends on your data structure, see the earlier questions asked by @Ahornberg. And of course it is not a good idea to modify this data structure while your ‘save’ thread is still saving, so you will need some kind of synchronization and quite possibly a thread join, which is where you will end up blocking the GUI again. Unless you are just saving a copy of the data, in which case you can perfectly well organize a fire-and-forget save with a std::thread that is simply detached.

#include <thread>
#include <fstream>

class WhatEver
{
private:
  float controlData[16*8*36000];
public:
  void autoSaveTriggered();
private:
  void saveMyData();
};

void WhatEver::autoSaveTriggered()
{
   // Run saveMyData() on a worker thread.
   std::thread saveThread(&WhatEver::saveMyData, this);
   saveThread.join(); // this will block the UI thread until the save is complete which is not what you want to do.
}

void WhatEver::saveMyData()
{
   // Truncate rather than append, so each save replaces the previous file.
   std::ofstream file("test.bin", std::ios::binary | std::ios::trunc);
   file.write(reinterpret_cast<const char *>(controlData), sizeof(controlData));
   file.close();
}

My C++ is a little rusty and I am not a plugin developer so I may have made a mistake (or two) in the above code.

q8fuel · June 1, 2021, 3:01pm

this line (the saveThread.join() call) is the main problem i believe.

i can keep the thread process async (so the UI won’t freeze), but what happens when i close Rack? the thread starts and it will end after dataToJson() finishes, and probably after the process ends…

shouldn’t this be a problem?

Squinky · June 1, 2021, 3:35pm

Congratulations! You have understood that shutting down threads is very complex, and is a huge source of crashes. I have worked on a lot of commercial software that would “crash on exit” due to improper shutting down of threads.

It’s really difficult, and really important. Whatever you do, test it a lot!

I think with my threaded plugins I have numerous unit tests just for shutting down.

q8fuel · June 1, 2021, 3:52pm

in this case, calling saveThread.join() within the destructor of the plugin (or even better within the WhatEver class destructor) will probably solve the problem on “exit”…

GretchenV · June 1, 2021, 7:27pm

Well, the other solution is to use:

// saveThread.join(); // this would have blocked the UI thread until the save is completed
saveThread.detach();  // saveThread is now fire and forget, you can not join it.

but be advised that any memory accessed in saveThread must remain in memory until it finishes and you don’t know when that will be without waiting on some object like a condition variable or a semaphore, which is what the saveThread.join() did to begin with.

Multithreaded programming is … er … tricky.

Yes, a saveThread.join() in the destructor is a good idea. But also, you need to make sure no other thread is modifying controlData and the data array it references while the saveThread is running, otherwise what are you saving?

Furthermore, saveThread in the example code happily existed on the stack but if you make it a member variable so you can join in the destructor, you must also protect yourself against the case where the class instance is destroyed without the thread being created at all or where autoSaveTriggered is called when a thread has already been created in a previous invocation.
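
The points above (join in the destructor, guard against a thread that was never started, guard against a second trigger, and do not save data that is being modified) can be sketched roughly as follows. This is only an illustration under assumptions: ControlDataSaver, writeSnapshot and the file path are hypothetical names, nothing here is Rack API, and autoSaveTriggered() and the destructor are assumed to run on the same thread. Taking a mutex in set() would likely not be acceptable on a real-time audio thread.

```cpp
#include <cstddef>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Worker function: it owns its arguments, so it never touches the saver object.
void writeSnapshot(std::string path, std::vector<float> snapshot)
{
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    file.write(reinterpret_cast<const char*>(snapshot.data()),
               static_cast<std::streamsize>(snapshot.size() * sizeof(float)));
}

// Hypothetical holder for the control data.
class ControlDataSaver
{
public:
    ControlDataSaver(std::string path, std::size_t count)
        : path_(std::move(path)), data_(count, 0.f) {}

    ~ControlDataSaver()
    {
        // The thread may never have been started, so check before joining.
        if (saveThread_.joinable())
            saveThread_.join();
    }

    // Called by whoever edits the data.
    void set(std::size_t index, float value)
    {
        std::lock_guard<std::mutex> lock(dataMutex_);
        data_[index] = value;
    }

    // Called on autosave. Blocks only if a previous save is still running.
    void autoSaveTriggered()
    {
        if (saveThread_.joinable())
            saveThread_.join();

        // Copy under the lock so the worker writes a consistent snapshot.
        std::vector<float> snapshot;
        {
            std::lock_guard<std::mutex> lock(dataMutex_);
            snapshot = data_;
        }
        saveThread_ = std::thread(writeSnapshot, path_, std::move(snapshot));
    }

private:
    std::string path_;
    std::vector<float> data_;
    std::mutex dataMutex_;
    std::thread saveThread_;
};
```

The price of this design is one extra copy of the data in memory per save, in exchange for never having to lock the data while the disk write is in progress.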

Saving 18 megabytes of data in a single write does not take all that long, mind you: a few tens of milliseconds. It’s not like you are writing it to old-fashioned 9-track reel tapes.

Vortico · June 1, 2021, 7:35pm

std::async in C++11 is a nice way of launching a worker thread to do a task (such as saving a buffer to a file) and then exiting. Be sure to wait() on them before destroying the resources they modify, such as in your Module destructor.
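
A rough sketch of that std::async pattern, under assumptions: AsyncSaver is a hypothetical stand-in and not the Rack Module class, and writeBuffer and the file path are made-up names. std::async copies the buffer before the worker starts, so the worker writes its own snapshot.

```cpp
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Worker task: writes its own copy of the data and reports success.
bool writeBuffer(std::string path, std::vector<float> snapshot)
{
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file)
        return false;
    file.write(reinterpret_cast<const char*>(snapshot.data()),
               static_cast<std::streamsize>(snapshot.size() * sizeof(float)));
    file.flush();
    return file.good();
}

// Hypothetical stand-in for a module that owns a large buffer.
struct AsyncSaver
{
    std::vector<float> buffer;
    std::future<bool> pending;

    explicit AsyncSaver(std::size_t count) : buffer(count, 0.f) {}

    void save(const std::string& path)
    {
        // Allow at most one save in flight.
        if (pending.valid())
            pending.wait();
        // std::launch::async requests a real worker thread instead of
        // deferred execution. `buffer` is copied here, on the calling thread.
        pending = std::async(std::launch::async, writeBuffer, path, buffer);
    }

    ~AsyncSaver()
    {
        // Wait for an outstanding save before the object goes away.
        if (pending.valid())
            pending.wait();
    }
};
```

Calling pending.get() instead of wait() also returns the worker's result, which is one way to surface a failed write to the user.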

Squinky · June 2, 2021, 3:42am

Is it OK for your module destructor to take an unbounded amount of time? What thread is the destructor running on?