Yeah, I maintain a circular buffer of 1024 samples, but 512 should be fine as well. I also maintain running totals (five of them actually: Σx, Σx², Σy, Σy² and Σxy). For each new sample I subtract the oldest sample's contribution from the accumulators, store the new sample in its place, and add the new sample's contribution. This avoids adding up all 1024 values over and over again.
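To make the bookkeeping concrete, here is a rough sketch of that update step in Python. It is only an illustration, not my actual device code: the class name `SlidingSums`, the `push` method and the field names are all made up for this example, and I'm assuming the buffer starts out filled with zeros.

```python
class SlidingSums:
    """Five running totals over the last `size` (x, y) sample pairs.

    Hypothetical illustration: names and layout are assumptions.
    """

    def __init__(self, size=1024):
        self.size = size
        self.xs = [0.0] * size   # circular buffer of x samples
        self.ys = [0.0] * size   # circular buffer of y samples
        self.pos = 0             # index of the oldest sample
        self.sx = self.sxx = self.sy = self.syy = self.sxy = 0.0

    def push(self, x, y):
        # Remove the oldest pair's contribution from every accumulator.
        ox, oy = self.xs[self.pos], self.ys[self.pos]
        self.sx -= ox
        self.sxx -= ox * ox
        self.sy -= oy
        self.syy -= oy * oy
        self.sxy -= ox * oy

        # Overwrite the oldest pair with the new one.
        self.xs[self.pos] = x
        self.ys[self.pos] = y

        # Add the new pair's contribution.
        self.sx += x
        self.sxx += x * x
        self.sy += y
        self.syy += y * y
        self.sxy += x * y

        # Advance to what is now the oldest slot.
        self.pos = (self.pos + 1) % self.size
```

Each `push` costs a fixed handful of operations regardless of the buffer length, which is the whole point compared with re-summing 1024 values per sample.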
There is a small but non-zero possibility of a cumulative error building up in the accumulators due to rounding errors, especially if the signal wavelengths are some multiple of the buffer length, so I also maintain a second set of accumulators without the subtract stage. After every 1024 samples that second set holds a freshly computed sum of exactly what is in the buffer, so I swap those values in and reset the second set to 0. But I think you'd have to run the device for a long time before such compensation is really necessary.
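A minimal sketch of that compensation, shown for a single accumulator (Σx) to keep it short. Again this is an illustration under my own assumptions, not the real code: `RefreshedSum`, `total` and `fresh` are hypothetical names, and I'm assuming floating-point accumulators (with integer accumulators the subtract-and-add update is exact and the refresh isn't needed, as long as nothing overflows).

```python
class RefreshedSum:
    """Sliding sum with a periodic refresh to discard accumulated rounding error.

    Hypothetical illustration: names and layout are assumptions.
    """

    def __init__(self, size=1024):
        self.buf = [0.0] * size  # circular sample buffer
        self.pos = 0             # index of the oldest sample
        self.total = 0.0         # sliding total: subtract oldest, add newest
        self.fresh = 0.0         # add-only total, restarted once per buffer length

    def push(self, x):
        # Normal sliding update.
        self.total -= self.buf[self.pos]
        self.total += x
        self.buf[self.pos] = x

        # The second accumulator only ever adds.
        self.fresh += x

        self.pos += 1
        if self.pos == len(self.buf):
            # A full buffer's worth of samples has gone into `fresh` since the
            # last reset, so it equals the sum of the current buffer contents.
            self.pos = 0
            self.total = self.fresh
            self.fresh = 0.0
        return self.total
```

Any error that has crept into `total` lives for at most one buffer length before it gets replaced by a sum built from additions only.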
I suspect I’m doing the same statistical correlation as you, and it can be formulated just using the running totals: