Manage virtual media streams in Meet Media API

Virtual Media Streams, in the context of WebRTC conferencing, are media streams generated by a Selective Forwarding Unit (SFU) to aggregate and distribute media from multiple participants. Unlike direct peer-to-peer media streams, which would create a complex mesh of connections in large conferences, virtual media streams simplify the topology. The SFU receives individual media streams from each participant and selectively forwards the active or relevant streams to other participants, multiplexing them onto a smaller, fixed set of outgoing virtual media streams.

This approach reduces the number of simultaneous incoming streams each participant needs to handle, lowering processing and bandwidth requirements. Each virtual stream can contain media from one participant at a time, dynamically adjusted by the SFU based on factors like speaker activity or video assignment. Participants receive these virtual streams, effectively seeing a composed view of the conference without needing to manage individual streams from every other participant. This abstraction provided by virtual media streams is crucial for scaling WebRTC conferences to a large number of participants.

To receive audio, the client must offer exactly three audio media descriptions, creating three local audio transceivers. To receive video, the client must offer one to three video media descriptions, establishing that number of video transceivers.
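The transceiver counts above can be sketched as a small helper. This is a minimal illustration, not the official client code: `TransceiverHost` and `addMeetTransceivers` are hypothetical names, and the `recvonly` direction is an assumption based on the client only receiving media. In a browser, a standard `RTCPeerConnection` fits the `TransceiverHost` shape, so it can be passed in directly.

```typescript
// Minimal structural type so the sketch also runs outside a browser.
// Assumption: in a browser, an RTCPeerConnection is passed here.
interface TransceiverHost {
  addTransceiver(kind: "audio" | "video", init: { direction: "recvonly" }): unknown;
}

const AUDIO_STREAM_COUNT = 3; // the text requires exactly three audio descriptions
const MAX_VIDEO_STREAMS = 3;  // the text allows one to three video descriptions

// Hypothetical helper: adds receive-only transceivers so that the resulting
// offer carries three audio media descriptions and `videoStreams` video ones.
// Passing 0 for `videoStreams` means the client does not receive video.
function addMeetTransceivers(pc: TransceiverHost, videoStreams: number): void {
  if (!Number.isInteger(videoStreams) || videoStreams < 0 || videoStreams > MAX_VIDEO_STREAMS) {
    throw new RangeError(`videoStreams must be an integer from 0 to ${MAX_VIDEO_STREAMS}`);
  }
  for (let i = 0; i < AUDIO_STREAM_COUNT; i++) {
    pc.addTransceiver("audio", { direction: "recvonly" });
  }
  for (let i = 0; i < videoStreams; i++) {
    pc.addTransceiver("video", { direction: "recvonly" });
  }
}
```

The transceivers need to exist before the offer is created so that each one produces a media description in the offer.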

Receivers

Each client-owned transceiver has a dedicated RtpReceiver and a dedicated "media track" that receives the audio RTP streams from Meet servers.

Each track has a unique ID and receives its own distinct stream of RTP packets from that specific media source. For example, Track A might receive audio from production-1 while Track B receives audio from production-2.

SSRCs

Each RTP packet carries a Synchronization Source (SSRC) value in its header, which ties the packet to a specific track.

Audio sessions through the Meet Media API use three distinct media streams, each having its own static SSRC. Once established, these SSRC values never change for the life of the session.

Virtual streams

Meet Media API uses Virtual Media Streams. These are static throughout the session, but the source of the packets may change to reflect the most relevant feeds. Virtual Media Streams behave the same for audio and video.

The Contributing Source (CSRC) in the RTP packet headers identifies the true source of the RTP packets. Meet assigns each participant in a conference their own unique CSRC when they join. This value remains constant until they leave.
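To make the SSRC and CSRC fields concrete, the following sketch reads both from the fixed RTP header layout defined in RFC 3550. It is illustrative only: `parseRtpSources` and `RtpSources` are hypothetical names, and browser applications normally do not see raw RTP packets (there, `RTCRtpReceiver.getContributingSources()` reports the CSRCs instead).

```typescript
interface RtpSources {
  ssrc: number;    // identifies the virtual stream (static for the session)
  csrcs: number[]; // identifies the participant(s) the media came from
}

const RTP_FIXED_HEADER_BYTES = 12;

// Hypothetical helper: extracts the SSRC and the CSRC list from one RTP packet.
function parseRtpSources(packet: Uint8Array): RtpSources {
  if (packet.length < RTP_FIXED_HEADER_BYTES) {
    throw new RangeError("packet is shorter than the fixed RTP header");
  }
  const view = new DataView(packet.buffer, packet.byteOffset, packet.byteLength);
  const firstByte = view.getUint8(0);
  if (firstByte >> 6 !== 2) {
    throw new RangeError("not an RTP version 2 packet");
  }
  // The low four bits of the first byte hold the CSRC count (CC).
  const csrcCount = firstByte & 0x0f;
  if (packet.length < RTP_FIXED_HEADER_BYTES + 4 * csrcCount) {
    throw new RangeError("packet is too short for its CSRC list");
  }
  // Bytes 8..11 hold the SSRC; DataView reads big-endian by default,
  // which matches network byte order.
  const ssrc = view.getUint32(8);
  // The CSRC list follows the fixed header, four bytes per entry.
  const csrcs: number[] = [];
  for (let i = 0; i < csrcCount; i++) {
    csrcs.push(view.getUint32(RTP_FIXED_HEADER_BYTES + 4 * i));
  }
  return { ssrc, csrcs };
}
```

With virtual streams, the `ssrc` value tells the client which fixed stream a packet belongs to, while the `csrcs` entry tells it which participant is currently being carried on that stream.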

Because the number of SSRCs is constant throughout the Meet Media API session, there are three possible scenarios:

More participants than SSRCs available:

Meet transmits audio from the three loudest participants, one participant per SSRC. Since each RTP stream is on its own dedicated SSRC, there's no intermixing between the streams.

Figure 1. Meet transmits the three loudest people across the three SSRCs.

If a participant currently being transmitted is no longer among the three loudest, Meet switches the RTP packets carried on that participant's SSRC to a participant who now is. The SSRC itself stays the same; only the source of its packets changes.
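A client can notice such a switch by remembering the last CSRC it saw on each SSRC. The class below is a minimal sketch of that idea under the assumption that the client already has an (SSRC, CSRC) pair for each packet or audio frame it handles; `VirtualStreamTracker` is a hypothetical name, not part of the Meet Media API.

```typescript
// Hypothetical tracker: remembers which participant (CSRC) is currently
// carried on each virtual stream (SSRC).
class VirtualStreamTracker {
  private readonly current = new Map<number, number>();

  // Records the latest CSRC seen on `ssrc`. Returns true only when the stream
  // was already carrying a different participant, i.e. a switch happened.
  observe(ssrc: number, csrc: number): boolean {
    const previous = this.current.get(ssrc);
    this.current.set(ssrc, csrc);
    return previous !== undefined && previous !== csrc;
  }

  // Returns the participant currently mapped to `ssrc`, if any was seen.
  sourceOf(ssrc: number): number | undefined {
    return this.current.get(ssrc);
  }
}
```

When `observe` returns true, the client might update whatever per-participant state it keeps for that stream, such as a speaker label.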

Figure 2. Meet switches the RTP packets to the new loudest person.

Number of active participants is less than the three audio SSRCs:

When more SSRCs are available than there are audio streams in the conference, Meet maps each participant's audio packets to its own unique SSRC. Any unused SSRCs remain ready and available, but no RTP packets are transmitted on them.

Figure 3. Meet maps each available audio stream to its own unique SSRC.

Number of active participants equals the three audio SSRCs:

When the number of participants equals the number of available SSRCs, each participant's media is mapped to a dedicated SSRC. These mappings persist for as long as this scenario holds.

Figure 4. Meet maps each participant's media to a dedicated SSRC.
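The three scenarios can be modelled with a small simulation. This is only a sketch of the behavior described above, not Meet's actual selection logic, which runs on the server and is not specified here. `assignSlots`, `Participant`, and the numeric `audioLevel` field are hypothetical, and the rule that a participant who stays among the loudest keeps the same slot is an assumption.

```typescript
interface Participant {
  csrc: number;       // unique per participant for the whole conference
  audioLevel: number; // hypothetical loudness score; higher means louder
}

const SLOT_COUNT = 3; // one slot per static audio SSRC

// Each entry holds the CSRC carried by that SSRC slot, or null when idle.
type Slots = (number | null)[];

// Hypothetical model: computes the next slot assignment from the previous one.
function assignSlots(previous: Slots, participants: Participant[]): Slots {
  // Pick at most three loudest participants.
  const loudest = [...participants]
    .sort((a, b) => b.audioLevel - a.audioLevel)
    .slice(0, SLOT_COUNT)
    .map((p) => p.csrc);

  // Participants that remain among the loudest keep their existing slot.
  const next: Slots = [];
  for (let i = 0; i < SLOT_COUNT; i++) {
    const held = previous[i] ?? null;
    next.push(held !== null && loudest.indexOf(held) !== -1 ? held : null);
  }

  // Free slots are filled with newcomers in loudness order; any slot left
  // over stays null, meaning no RTP packets flow on that SSRC.
  const newcomers = loudest.filter((csrc) => next.indexOf(csrc) === -1);
  for (let i = 0; i < SLOT_COUNT && newcomers.length > 0; i++) {
    if (next[i] === null) {
      next[i] = newcomers.shift() ?? null;
    }
  }
  return next;
}
```

Starting from three idle slots, two participants fill two slots and leave the third idle. A third participant fills the remaining slot. When a fourth, louder participant joins, the quietest of the current three is replaced on its slot while the other two mappings stay put.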