Our Solution
We're using WebRTC for real-time audio and video communication, and WebSocket for sending chat messages and initiating a video/audio call. We also support live streaming using WebRTC, which is broadcast to other users using HLS.
We have added process flow diagrams in the backend architecture section for all the flows: creating/joining a meeting room, initiating a video/audio call, sending/receiving chat messages, and live streaming.
The following diagram describes the high-level process flow for video conferencing.
Users interact with one of our mobile applications or web clients to start or join a video conference. The client first establishes a WebSocket connection with the Gateway server. If the user is already authenticated, the client sends the authentication token in the request header, which the Gateway server validates.
Once a WebSocket connection is established successfully, the client makes a request to create or join a room. The Gateway server calls the Room server over a gRPC connection to create a room and generate a password for the room, or to validate the roomId and password if the user is trying to join an existing room.
Once the user joins the room, the client creates a WebRTC SDP offer to establish a WebRTC connection with the media servers for exchanging video/audio streams. The Gateway server receives the SDP offer and calls the Room server to find the media server to which the user's offer should be sent.
Room servers constantly monitor media servers through a service discovery mechanism and find the best-suited media server for a user. Once the media server is identified, the Gateway server sends the user's offer to the media server. The media server responds with a WebRTC SDP answer, which is sent back to the client.
Once the offer/answer flow is complete, the client establishes a persistent WebRTC connection with the media server and starts sending video/audio streams to the media server.
The Media Server is the main component that receives video/audio streams from all participants in a room and sends those streams to the other participants in the room.
We allow guest users as well as authenticated users in our application. The Gateway server is responsible for authenticating users and validating every incoming request.
To authenticate users, we use both email/password-based login as well as OAuth2-based login using Google and Facebook. All passwords are hashed using bcrypt and stored in our database. Once a user successfully logs in through either email/password or Google/Facebook, we create a JWE token and send it back to the client. The client sends this token in an Authorization header along with every request to the server.
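As a rough illustration, a password check and token issuance on the Gateway server could look like the sketch below. The `bcrypt` and `jose` packages, the key handling, and the token lifetime are assumptions for illustration, not our exact implementation.

```typescript
// Hedged sketch: verifying a password with bcrypt and issuing an encrypted
// JWE token with the `jose` library. All identifiers are illustrative.
import { randomBytes } from "node:crypto";
import bcrypt from "bcrypt";
import { EncryptJWT } from "jose";

// 256-bit symmetric key for direct ("dir") JWE encryption (illustrative).
const tokenKey = randomBytes(32);

export async function login(email: string, password: string, storedHash: string) {
  // Compare the submitted password against the bcrypt hash stored in the database.
  const ok = await bcrypt.compare(password, storedHash);
  if (!ok) throw new Error("invalid credentials");

  // Issue the encrypted token (JWE); the client sends it back in the
  // Authorization header with every subsequent request.
  return new EncryptJWT({ sub: email })
    .setProtectedHeader({ alg: "dir", enc: "A256GCM" })
    .setIssuedAt()
    .setExpirationTime("2h")
    .encrypt(tokenKey);
}
```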
When a user wants to create a new meeting room or join an existing room, the client first connects to the Gateway server using a secure WebSocket connection. The WebSocket connection is maintained throughout the life cycle of the meeting. It is used for sending chat messages and other events like user-join, user-leave, etc.
Once the WebSocket connection is established, the client sends a request to the Gateway server to join or create a room. The Gateway server calls the Room server to create a room, or to validate the roomId and password if the user is trying to join an existing room. When a user successfully joins a room, a 'user-join' event is sent to all the other participants in the room.
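A minimal client-side sketch of this handshake is shown below; the endpoint URL, the token-in-query-string workaround, and the message shapes are illustrative assumptions, not our exact wire format.

```typescript
// Hedged sketch: joining a room over a secure WebSocket.
// URL, token transport, and message shapes are illustrative only.
declare const authToken: string; // JWE token obtained at login

const ws = new WebSocket(`wss://gateway.example.com/ws?token=${authToken}`);

ws.onopen = () => {
  // Ask the Gateway (which calls the Room server over gRPC) to join a room.
  ws.send(
    JSON.stringify({
      type: "join-room",   // assumed message type
      roomId: "abc-123",   // assumed room identifier
      password: "s3cret",  // password generated when the room was created
    }),
  );
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "user-join") {
    console.log(`${msg.userId} joined the room`);
  }
};
```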
After joining the room, the client initiates video/audio communication by creating an SDP (Session Description Protocol) offer and sending it to the Gateway server. The Gateway server calls the Room server to find the media server to which this user should be sent. Room servers constantly monitor media servers through a service discovery mechanism. The Room server tries to assign all users in the same data centre to the same media server (up to a certain limit) to improve efficiency. After getting the response from the Room server, the Gateway server sends the SDP offer to the media server. The media server responds with an SDP answer, which is sent back to the client.
After the offer/answer flow, the client establishes a secure WebRTC connection with the media server and starts sending Audio/Video streams to the server. The media server routes the audio/video streams of every user to every other user in the same room.
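The offer/answer exchange on the client can be sketched with the standard WebRTC APIs as follows; the signaling message names and the STUN server are assumptions.

```typescript
// Hedged sketch of the client-side SDP offer/answer exchange.
// `ws` is the Gateway WebSocket from the previous sketch; message type
// names and the STUN server are illustrative.
declare const ws: WebSocket;

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.example.com:3478" }],
});

async function startMedia(): Promise<void> {
  // Capture local audio/video and add the tracks to the peer connection.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));

  // Create the SDP offer and send it to the Gateway, which forwards it to
  // the media server chosen by the Room server.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  ws.send(JSON.stringify({ type: "sdp-offer", sdp: offer.sdp }));
}

ws.addEventListener("message", async (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "sdp-answer") {
    // Apply the media server's answer to complete the handshake.
    await pc.setRemoteDescription({ type: "answer", sdp: msg.sdp });
  }
});
```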
The Media Server, also called an SFU (Selective Forwarding Unit), is the core component responsible for transmitting audio/video streams between the participants in a room.
The following are the characteristics of the SFU:
Low delay: The SFU only forwards media, so it adds minimal additional delay to the video/audio streams.
Selective forwarding: Our SFU picks and chooses, from the incoming video streams, a subset to forward to each receiver, and that subset can be different for every receiver (see the sketch after this list). Depending on the receiver's available bandwidth and display, it picks a subset of the video streams to forward. If the receiver is watching a participant in a big window, it chooses a high resolution for that video stream. If a participant is hidden in the receiver's display, the SFU won't send that participant's video stream to the receiver.
The SFU sends only the minimally required amount of media to optimize the experience for every individual receiver.
Error correction: Many transmission errors can be corrected at an SFU without impacting anybody on the call. The SFU localizes the error correction only between itself and the endpoint that’s experiencing these errors.
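Conceptually, the per-receiver forwarding decision mentioned above can be sketched as below. This is an illustrative simplification, not our production SFU code; the layer names and thresholds are assumptions.

```typescript
// Conceptual sketch of selective forwarding: for each receiver, decide which
// quality layer (if any) of a sender's video to forward. Illustrative only.
type Layer = "high" | "medium" | "low";

interface ReceiverView {
  visible: boolean;       // is this participant rendered on the receiver's screen?
  largeWindow: boolean;   // is the participant shown in a big tile?
  availableKbps: number;  // receiver's estimated downlink bandwidth
}

function chooseLayer(view: ReceiverView): Layer | null {
  if (!view.visible) return null;  // hidden participant: forward nothing
  if (view.largeWindow && view.availableKbps > 1500) return "high";
  if (view.availableKbps > 500) return "medium";
  return "low";                    // constrained links get the lowest layer
}
```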
Apart from the optimisations at the SFU server, our clients are also intelligent enough to adapt the bit-rate and choose the correct video resolution given the network conditions. Our clients also give users the ability to select the video resolution that they want to use.
To send a message to other participants in a room, a user sends the message to the Gateway server that they are connected to. The Gateway server forwards this message to the Room server. The Room server first stores the message in the database. It then looks up the Redis cache to find all the Gateway servers that have users in that room, and sends the message to those Gateway servers. The message is finally delivered to the individual participants by the Gateway servers over the WebSocket connection.
The Room server acts as a message router, and the Gateway servers manage WebSocket connections with individual users.
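At a high level, the Room server's fan-out could look like the following sketch; the Redis key layout, the `ioredis` client, and the helper functions are assumptions.

```typescript
// Hedged sketch of chat fan-out in the Room server using ioredis.
// The key layout, db.saveMessage, and forwardToGateway are illustrative.
import Redis from "ioredis";

interface ChatMessage { roomId: string; senderId: string; text: string; }

declare const db: { saveMessage(msg: ChatMessage): Promise<void> };
declare function forwardToGateway(gatewayId: string, msg: ChatMessage): Promise<void>;

const redis = new Redis();

async function routeChatMessage(msg: ChatMessage): Promise<void> {
  // 1. Persist the message first.
  await db.saveMessage(msg);

  // 2. Find every Gateway server that has a participant of this room connected.
  const gatewayIds = await redis.smembers(`room:${msg.roomId}:gateways`);

  // 3. Forward the message to those Gateways; each delivers it to its users
  //    over their WebSocket connections.
  await Promise.all(gatewayIds.map((id) => forwardToGateway(id, msg)));
}
```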
We have designed our architecture to scale to virtually any number of participants per room. Our architecture also supports multi-datacenter deployments without any tweaks to the backend services.
There are three main components in our application - Gateway server, Room server, and Media server. All of these servers are horizontally scalable. The media server is responsible for receiving and forwarding audio/video streams. Let’s understand how we scale the media server.
Each media server periodically reports its health and load, and this information is curated and placed into our service discovery system. Whenever a user joins a new room, the Room server watches the service discovery system and assigns the least utilized media server to that user. Any other user joining the same room is also sent to the same media server, as long as the user is connected to the same data centre and the media server is capable of taking more load. If the media server can't take more load, or the user connects to a different data centre, then the user is sent to a media server that is geographically closer to them and is able to take on more load. Once we do this, we have one conference room spanning multiple media servers. In this case, we set up a server-to-server relay between the existing media server and the new media server. This makes sure that any audio/video stream sent to the first media server is relayed to the second media server and vice versa.
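The assignment logic described above can be sketched conceptually as follows; the data shapes, load threshold, and selection order are illustrative assumptions.

```typescript
// Conceptual sketch of how a Room server could pick a media server for a
// joining user. Data shapes and thresholds are illustrative.
interface MediaServerInfo {
  id: string;
  dataCenter: string;
  load: number; // 0..1, from the periodic health reports in service discovery
}

function pickMediaServer(
  servers: MediaServerInfo[],
  roomServerIds: string[],   // media servers already hosting this room
  userDataCenter: string,
  maxLoad = 0.8,
): MediaServerInfo | undefined {
  // Prefer a server that already hosts the room, sits in the user's data
  // centre, and can still take more load.
  const existing = servers.find(
    (s) => roomServerIds.includes(s.id) && s.dataCenter === userDataCenter && s.load < maxLoad,
  );
  if (existing) return existing;

  // Otherwise take the least-loaded server, preferring the user's data centre.
  // A server-to-server relay is then set up between the room's media servers.
  const candidates = servers.filter((s) => s.load < maxLoad);
  const local = candidates.filter((s) => s.dataCenter === userDataCenter);
  const pool = local.length > 0 ? local : candidates;
  return pool.sort((a, b) => a.load - b.load)[0];
}
```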
We have taken extra care to make sure that there is no single point of failure in our system. There are multiple instances of each server running in the backend. Every server instance reports its health to the service discovery system (Consul). If an instance crashes, it is removed from Consul, and the backend adapts accordingly.
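For illustration, registering an instance and its health check with Consul's agent HTTP API could look roughly like this; the service name, ports, and intervals are assumptions.

```typescript
// Hedged sketch: registering a media server instance and an HTTP health check
// with the local Consul agent's HTTP API. Names, ports, and intervals are
// illustrative.
async function registerWithConsul(): Promise<void> {
  await fetch("http://localhost:8500/v1/agent/service/register", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      Name: "media-server",
      ID: "media-server-1",
      Port: 8443,
      Check: {
        // Consul polls this endpoint; if it keeps failing, the instance is
        // deregistered and Room servers stop assigning users to it.
        HTTP: "http://localhost:8443/health",
        Interval: "10s",
        DeregisterCriticalServiceAfter: "1m",
      },
    }),
  });
}
```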
We have also set up alerts using Prometheus and Grafana to get notified whenever any component in our architecture is experiencing issues like high CPU or memory usage, or general application errors like increased 5xx errors.
We support live broadcasting from your device to the outside world. The audio/video stream is captured on the local device and sent to the HLS gateway over a WebRTC connection. We transcode the video at the HLS gateway and push it to CDN servers, from where the video stream is distributed to other users via HLS.
We have taken utmost care to make the application as secure as possible. We can divide security features into:
Transport level security
Application level security
WebRTC provides end-to-end encryption of our video/audio streams. It uses SRTP (Secure Real-Time Transport Protocol or Secure RTP) on top of DTLS (Datagram Transport Layer Security) to provide end-to-end encryption.
HTTPS/WSS - All connections to the Gateway service happen over secure WebSocket and HTTPS connections.
Databases - All the data in our databases is encrypted.
Public and private network - Only the HAProxy load balancer in front of the Gateway servers and the Media servers in our infrastructure are accessible to the outside world. All the other components in our architecture are inside a private network.
We provide many application-level security features to secure any meeting from unwanted access. We have:
Password protected rooms
Role-based access
Host controls - Host can mute a participant, pause their video, remove a participant from the meeting, allow/disallow screen sharing etc.
We support settings like ask_before_join, wherein before any person joins a meeting room, we ask the host for permission. The host can allow or disallow a person into the meeting room, or put them in a waiting room.
We also support a setting wherein, after joining a meeting, a participant can only change their name or profile picture after the host approves.
Unlike other video conferencing solutions, we default to the highest security setting for any meeting/webinar. Hosts can turn off some of the application-level security features if they want to.
We’re exploring various approaches and open source ML frameworks to perform closed captioning. There are four steps to achieve close captioning with multilingual support
Obtain a transcript of the content.
Translate the transcript into the target language.
Create the corresponding subtitles from the transcript or translation and create a subtitles file.
Combine all of those pieces into a finished product.
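A highly simplified sketch of how these four steps could be wired together is shown below; `transcribeAudio`, `translateText`, and `toWebVtt` are hypothetical helpers (for example, wrappers around a DeepSpeech model and a translation model), not real library calls.

```typescript
// Highly simplified sketch of the four captioning steps. transcribeAudio,
// translateText, and toWebVtt are hypothetical helpers, not real library calls.
interface Cue { text: string; start: number; end: number; }

declare function transcribeAudio(audio: Buffer): Promise<Cue[]>;
declare function translateText(text: string, targetLang: string): Promise<string>;
declare function toWebVtt(cues: Cue[]): string;

async function buildSubtitles(audio: Buffer, targetLang: string): Promise<string> {
  // 1. Obtain a transcript of the content (speech-to-text).
  const transcript = await transcribeAudio(audio);

  // 2. Translate each segment into the target language.
  const translated = await Promise.all(
    transcript.map(async (cue) => ({ ...cue, text: await translateText(cue.text, targetLang) })),
  );

  // 3 & 4. Create subtitle cues and combine them into a subtitles file (WebVTT).
  return toWebVtt(translated);
}
```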
Works on Mobile apps, web browsers, and Desktop apps.
Supports all the major audio/video codecs, such as VP8, VP9, H.264, Opus, G.722, PCMU, and PCMA.
Supports adaptive bit-rate and automatically chooses the right video resolution based on network conditions; users can also choose the video resolution manually (see the sketch after this list).
Works in low network conditions.
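As one example of client-side adaptation, the outgoing video bitrate can be capped with the standard `RTCRtpSender` API; the threshold below is an illustrative value.

```typescript
// Hedged sketch: capping the outgoing video bitrate with the standard
// RTCRtpSender API. The peer connection and threshold are illustrative.
declare const pc: RTCPeerConnection; // connection to the media server

async function capVideoBitrate(maxKbps: number): Promise<void> {
  const sender = pc.getSenders().find((s) => s.track?.kind === "video");
  if (!sender) return;

  const params = sender.getParameters();
  if (!params.encodings || params.encodings.length === 0) {
    params.encodings = [{}];
  }
  // Limit the encoder bitrate; the browser then adapts resolution/framerate.
  params.encodings[0].maxBitrate = maxKbps * 1000;
  await sender.setParameters(params);
}

// Example: throttle video to roughly 500 kbps on a poor connection.
capVideoBitrate(500);
```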
We have chosen technical standards that are mainstream and have huge support in web browsers and mobile applications. WebRTC is supported in most modern web browsers and mobile applications. It is used by some of the well-known video conferencing products like Google Duo, Facebook Messenger, and Microsoft Teams.
Similarly, WebSocket is the de facto standard for bidirectional communication between clients and servers.
Other technologies like HLS (HTTP Live Streaming) are chosen to provide support for all kinds of devices. Though MPEG-DASH is a newer live-streaming technology than HLS, we chose HLS because Apple doesn't support MPEG-DASH yet.
Our backend solution works on general purpose commodity hardware. We don’t require any specialised hardware.
We recommend general-purpose Linux-based VMs with at least 4 CPU cores, 16 GB RAM, a 2.5 GHz Intel Scalable processor, and network performance of 5 Gbps.
Our solution works across web browsers, desktop, and mobile applications. For deploying the backend infrastructure, we recommend Linux based VMs.
We’re currently trying out a POC based on DeepSpeech () which is a TensorFlow implementation of the Baidu’s DeepSpeech architecture, Mozilla voice () and other language translation machine learning models.