sure. convolution in the time domain is just the multiplication of two (complex) spectra in the frequency domain. in realtime audio convolution engines, we don't actually transform/multiply with the entire IR at once, because the latency would be too high - you'd have to buffer a whole IR's worth of input before producing any output. instead we use overlapped smaller FFT windows ("partitioned convolution"). effectively, we are multiplying the results of two STFTs; one of them is transforming the input, and the other is (pre-)transforming the IR in a loop.
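to make that concrete, here's a minimal offline sketch of uniform partitioned convolution in numpy (names and block size are my own, just for illustration): the IR is chopped into block-sized partitions and pre-transformed once, and each input block's spectrum is pushed into a frequency-domain delay line and multiplied against every partition. latency is one block instead of the whole IR length.

```python
import numpy as np

def partitioned_convolve(x, ir, block=256):
    # split the IR into block-sized partitions; pre-transform each once
    # (this is the "pre-transforming the IR in a loop" side of the STFT)
    nfft = 2 * block  # zero-padded FFT avoids circular-convolution aliasing
    nparts = -(-len(ir) // block)
    h = np.pad(ir, (0, nparts * block - len(ir)))
    H = [np.fft.rfft(h[p*block:(p+1)*block], nfft) for p in range(nparts)]

    nblocks = -(-len(x) // block)
    # pad with extra zero blocks so the reverb tail gets flushed out
    xin = np.pad(x, (0, (nblocks + nparts) * block - len(x)))

    # frequency-domain delay line holding the most recent input spectra
    fdl = [np.zeros(nfft // 2 + 1, complex) for _ in range(nparts)]
    out = np.zeros(len(xin) + block)
    overlap = np.zeros(block)
    for b in range(nblocks + nparts):
        X = np.fft.rfft(xin[b*block:(b+1)*block], nfft)
        fdl = [X] + fdl[:-1]  # push newest spectrum, drop oldest
        acc = sum(Xp * Hp for Xp, Hp in zip(fdl, H))
        y = np.fft.irfft(acc, nfft)
        # overlap-add: first half is output, second half carries over
        out[b*block:(b+1)*block] = y[:block] + overlap
        overlap = y[block:]
    return out[:len(x) + len(ir) - 1]
```

a real engine does the same thing streaming, one block per audio callback; this version just loops over the whole signal so you can check it against `np.convolve`.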
that “in a loop” is the part where “time stretching” the IR would get a little weird to implement in realtime. i guess you would want to capture the output of the granulation to another buffer/delayline whose length you would be modulating. as @_mark points out, to really get the effect of having a longer / shorter IR, you would also need to modulate the amount of overlap in the IR STFT, so that at any given moment you are hearing the effects of convolution with each part of the IR - and this would have CPU impact, so the amount of stretching you can do would be limited.
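the granulation step itself could look something like this naive overlap-add time stretch (function name and grain/hop sizes are hypothetical, just to show the shape of it): read grains from the original IR at a rate scaled by the stretch factor, write them at a fixed hop. you'd then re-partition the stretched buffer for the convolver, which is where the CPU cost of modulating the factor in realtime comes in - and a phase vocoder would do a cleaner job than plain OLA.

```python
import numpy as np

def stretch_ir(ir, factor, grain=1024, hop=256):
    # naive granular OLA time-stretch: read grains at hop/factor,
    # write at hop, with a hann window to crossfade grains.
    out_len = int(len(ir) * factor) + grain
    out = np.zeros(out_len)
    win = np.hanning(grain)
    wpos = 0
    while True:
        rpos = int(wpos / factor)  # read position in the original IR
        if rpos + grain > len(ir) or wpos + grain > out_len:
            break
        out[wpos:wpos+grain] += ir[rpos:rpos+grain] * win
        wpos += hop
    return out[:int(len(ir) * factor)]
```

factor > 1 lengthens the IR (longer "reverb"), factor < 1 shortens it; either way the stretched buffer would feed the IR side of the partitioned convolver above.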
this should be pretty easy to implement in max/pd/supercollider. the granular convolver (in which @glia also expressed an interest, around the time norns launched) was implemented in supercollider, and i believe it also took advantage of the wide array of phase-vocoder processing ugens available there (to e.g. perform pitch shifting before convolution.)
but yeah - expensive, and i’m honestly not sure if the sonic effects of granulating the IR would be all that different from granulating the result of convolution.