MSEdgeExplainers

WebAudio OfflineAudioContext.startRendering() streaming output

Authors:

Participate

Introduction

WebAudio provides a powerful and versatile API for performing audio-processing workflows in the browser. It supports complex node-based audio graphs that can be piped to system output (speakers) or to an in-memory AudioBuffer for further processing, such as writing to a file. WebAudio can be used for many different workloads in the browser. An example relevant to this discussion is web-based video editors, like clipchamp.com, which can use WebAudio to build up complex audio graphs based on multiple input files. These input files are composed, trimmed and processed according to a linear project timeline. The project can be previewed in realtime in the browser or exported faster-than-realtime as an .mp4.

WebAudio works well in a realtime playback context, but it is not suitable for offline-context (faster-than-realtime) processing due to a limitation in the design of WebAudio's OfflineAudioContext API: the API requires allocating memory for the whole audio graph's output up-front, which can reach gigabytes of AudioBuffer data.

This document will propose adding a streaming offline context rendering function so that the audio graph data can be incrementally processed rather than allocating the whole audio buffer up-front.

User-Facing Problem

The OfflineAudioContext API works well for rendering small audio graphs, but it does not scale to larger projects because it allocates the full graph's AudioBuffer up-front. For example, rendering a 2 hour video composition project in clipchamp.com in an offline context would require an extremely large AudioBuffer allocation. OfflineAudioContext.startRendering() allocates an AudioBuffer large enough to hold the entire rendered WebAudio graph before returning. Two hours of audio at 48 kHz with 4 channels results in roughly 5.5 GB of in-memory float32 data in the AudioBuffer. This makes the API unsuitable for very long offline renders or very large channel/length combinations. There is no simple way to chunk the output or consume it as a stream.
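The scale of that allocation follows from simple arithmetic; a quick sketch using the 2 hour, 4-channel, 48 kHz figures from above:

```javascript
// Back-of-the-envelope size of the AudioBuffer that
// OfflineAudioContext.startRendering() must allocate up-front.
const sampleRate = 48000;          // samples per second, per channel
const durationSeconds = 2 * 3600;  // a 2 hour project
const numberOfChannels = 4;
const bytesPerSample = 4;          // float32

const totalBytes =
  sampleRate * durationSeconds * numberOfChannels * bytesPerSample;
console.log(`${(totalBytes / 1e9).toFixed(1)} GB`); // ~5.5 GB
```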

The implication of this API is that a user's computer must have enough available memory to export the project, even though the in-memory audio buffer will eventually be discarded after it is written to a file. On low-powered devices or machines with limited hardware resources, this limitation makes WebAudio unusable as an offline processor. If memory capacity is exceeded on a user's machine, then processing stops and the browser may terminate the tab/window, leading to potential loss of data for the user and a poor user experience.

Another implication is that the audio buffer cannot be easily interleaved with video data streamed out of WebCodecs. To use clipchamp.com again as an example, the video and audio are combined into a .mp4 file during the export process. The video and audio streams need to be interleaved/muxed in the correct order before writing to the file; the audio data cannot simply be appended at the end. Ignoring the memory implications of the current API, it is difficult to interleave video and audio when all the audio data is delivered as a single chunk at the end of processing. If the audio data was streamed out at the same time as video data is streamed out of WebCodecs then it would simplify the interleaving process.
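To make the ordering problem concrete, here is a minimal sketch of timestamp-ordered interleaving. The chunk shape ({ timestamp }) mirrors WebCodecs' EncodedAudioChunk/EncodedVideoChunk, but the helper itself is hypothetical:

```javascript
// Hypothetical sketch: interleave encoded audio and video chunks by
// timestamp before writing them to a container. Each chunk only needs
// a numeric timestamp, as WebCodecs encoded chunks carry.
function interleaveByTimestamp(audioChunks, videoChunks) {
  const out = [];
  let a = 0;
  let v = 0;
  while (a < audioChunks.length || v < videoChunks.length) {
    const nextAudio = audioChunks[a];
    const nextVideo = videoChunks[v];
    if (nextVideo === undefined ||
        (nextAudio !== undefined && nextAudio.timestamp <= nextVideo.timestamp)) {
      out.push(nextAudio);
      a++;
    } else {
      out.push(nextVideo);
      v++;
    }
  }
  return out;
}
```

This merge only works if both streams are available incrementally; with the current API, every audio chunk arrives after the last video chunk, forcing the muxer to buffer.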

A workaround for these limitations is for developers to build custom WASM audio-processing pipelines that stream data out incrementally, so that the full AudioBuffer is never allocated and no memory pressure is applied to the user's machine. While this works around the API constraint, these third-party libraries require complex integration and increase the maintenance burden for developers. Custom WASM libraries duplicate features that already exist in WebAudio, with streaming output support as their only benefit.

Goals

Non-goals

Proposed Approach - Add startRenderingStream() function

The preferred approach is adding a new method startRenderingStream() that yields buffers of interleaved audio samples in a Float32Array, or another format as outlined in Open Questions. In this scenario, the user can read chunks as they arrive and consume them for storage, transcoding via WebCodecs, sending to a server, etc.

Usage example:

const context = new OfflineAudioContext({ numberOfChannels: 2, length: 44100, sampleRate: 44100 });

// Add some nodes to build a graph...

if ("startRenderingStream" in context) {
  // startRenderingStream() resolves to a ReadableStream
  const stream = await context.startRenderingStream({ format: "f32", chunkSize: 128 });
  const reader = stream.getReader();
  while (true) {
    // get the next chunk of data from the stream
    const result = await reader.read();

    // the reader returns done = true when there are no more chunks to consume
    if (result.done) {
      break;
    }

    // result.value contains interleaved Float32Array values
    const buffers = result.value;
  }
} else {
  // fall back to the existing API, which allocates the full AudioBuffer
  const audioBuffer = await context.startRendering();
}
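For the WebCodecs transcoding case mentioned above, each interleaved chunk could be wrapped in an AudioData and passed to an AudioEncoder. AudioData and AudioEncoder are real WebCodecs interfaces; buildAudioDataInit is a hypothetical helper sketched here against the proposed streaming output:

```javascript
// Hypothetical helper: build the init dictionary for a WebCodecs
// AudioData from one interleaved Float32Array chunk yielded by the
// proposed startRenderingStream().
function buildAudioDataInit(chunk, { sampleRate, numberOfChannels, frameOffset }) {
  return {
    format: "f32", // interleaved float32, matching the stream's output
    sampleRate,
    numberOfChannels,
    numberOfFrames: chunk.length / numberOfChannels,
    // WebCodecs timestamps are expressed in microseconds
    timestamp: Math.round((frameOffset / sampleRate) * 1e6),
    data: chunk,
  };
}

// In the read loop (sketch):
//   encoder.encode(new AudioData(buildAudioDataInit(chunk, state)));
```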

Proposed interface:

// From https://developer.mozilla.org/en-US/docs/Web/API/AudioData/format
enum AudioFormat {
  "u8",
  "s16",
  "s32",
  "f32",
  "u8-planar",
  "s16-planar",
  "s32-planar",
  "f32-planar"
};

dictionary OfflineAudioRenderingOptions {
  // Output format
  AudioFormat format = "f32";
  // The number of frames to render on each iteration
  unsigned long chunkSize = 128;
};

partial interface OfflineAudioContext {
    // Immediately stops the rendering, e.g. to implement a "cancel" button.
    // If startRenderingStream was called, this closes the stream;
    // if startRendering was called, this rejects the promise.
    Promise<undefined> close();
    // Returns a stream that yields buffers of interleaved audio samples as
    // Float32Array, or whatever format is specified in the options
    Promise<ReadableStream> startRenderingStream(optional OfflineAudioRenderingOptions options = {});
};

Pros

Cons

Output format

There is an open question of what data format startRenderingStream() should return. The options under consideration are AudioBuffer, planar Float32Array, or interleaved Float32Array.
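The difference between the two Float32Array layouts can be shown directly; planarToInterleaved below is a hypothetical helper for illustration only, not part of the proposal:

```javascript
// Planar:      one array per channel,  [L0, L1, ..., Ln] and [R0, R1, ..., Rn]
// Interleaved: one array, frame-major, [L0, R0, L1, R1, ..., Ln, Rn]
function planarToInterleaved(channels) {
  const numberOfChannels = channels.length;
  const frames = channels[0].length;
  const out = new Float32Array(frames * numberOfChannels);
  for (let c = 0; c < numberOfChannels; c++) {
    for (let i = 0; i < frames; i++) {
      out[i * numberOfChannels + c] = channels[c][i];
    }
  }
  return out;
}
```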

AudioBuffer

Pros

Cons

Planar Float32Array

Pros

Cons

Interleaved Float32Array

Pros

Cons

Alternative 1 - Modify existing startRendering method to allow streaming output

An alternative approach is to add options to the existing startRendering() to configure its operating mode. The mode can be set to stream to achieve streaming output. This is similar to the proposed approach but rather than adding a new function, it re-uses an existing function.

Usage example:

const context = new OfflineAudioContext({ numberOfChannels: 2, length: 44100, sampleRate: 44100 });

// Add some nodes to build a graph...

const reader = (await context.startRendering({ mode: "stream" })).getReader();
while (true) {
    // get the next chunk of data from the stream
    const result = await reader.read();

    // the reader returns done = true when there are no more chunks to consume
    if (result.done) {
        break;
    }

    const buffers = result.value;
}

The existing API remains unchanged for backwards compatibility:

/**
 * Existing API unchanged
 */
const context = new OfflineAudioContext({
  numberOfChannels: 2,
  length: 44100,
  sampleRate: 44100,
});

// Add some nodes to build a graph...

// Full AudioBuffer is allocated
const renderedBuffer = await context.startRendering();

Proposed interface:

enum OfflineAudioRenderingMode {
    "audiobuffer",
    "stream"
};

dictionary OfflineAudioRenderingOptions {
    OfflineAudioRenderingMode mode = "audiobuffer";
};

partial interface OfflineAudioContext {
    Promise<(AudioBuffer or ReadableStream)> startRendering(optional OfflineAudioRenderingOptions options = {});
};

Pros

Cons

Alternative 2 - emit ondataavailable events

Keep the current startRendering() API but do not allocate the full AudioBuffer. After starting, periodically emit events, for example an ondataavailable(chunk: AudioBuffer) callback, on the context or on a new interface.

The user can subscribe and collect chunks for processing.

At the end, the API may optionally still provide a full AudioBuffer.
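A rough sketch of the consumption pattern this alternative implies. createChunkCollector is invented here purely for illustration, and it collects Float32Arrays where the real events would deliver AudioBuffers:

```javascript
// Hypothetical sketch: subscribe to per-chunk callbacks, collect the
// chunks, and optionally concatenate them into one buffer at the end.
function createChunkCollector() {
  const chunks = [];
  return {
    // Wire this up as the ondataavailable handler.
    ondataavailable: (chunk) => chunks.push(chunk),
    // Optional final concatenation, mirroring "may optionally still
    // provide a full AudioBuffer" above.
    finish() {
      const total = chunks.reduce((n, c) => n + c.length, 0);
      const out = new Float32Array(total);
      let offset = 0;
      for (const c of chunks) {
        out.set(c, offset);
        offset += c.length;
      }
      return out;
    },
  };
}
```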

Pros

Cons

Stakeholder Feedback / Opposition

References & acknowledgements

Many thanks for valuable feedback and advice from: