An Introduction to VoIP

Last Updated on April 20, 2021 by Jaron Davis

Key Terminology

  • VoIP: Voice over Internet Protocol
  • PBX:  Private Branch Exchange
  • PSTN: Public Switched Telephone Network
  • ITSP: Internet Telephony Service Provider
  • SIP: Session Initiation Protocol
  • SCCP: Skinny Client Control Protocol
  • MGCP: Media Gateway Control Protocol
  • URI: Uniform Resource Identifier
  • RTP: Real-Time Transport Protocol
  • SDP: Session Description Protocol
  • CUCM: Cisco Unified Communications Manager
  • SBC: Session Border Controller
  • CUBE: Cisco Unified Border Element
  • Codec: The algorithm used to encode and decode samples of audio. It dictates the sample duration and the packet size. (See the Nyquist sampling theorem.)

How a VoIP system works

Almost all VoIP systems use a client/server configuration. The endpoints, such as phones, voice gateways, analog telephony adapters, and other collaboration endpoints, are the clients, and they must register to a centralized call processing server. This server can be hosted on-premises or in the cloud.

The endpoints register to the server using their signaling protocol, and all signaling information is processed through that server.

The server decides how to route the call based on configurations from the system administrator. This server is also referred to as a Private Branch Exchange (PBX). The signaling methods include Session Initiation Protocol (SIP), Skinny Client Control Protocol (SCCP), which is a proprietary Cisco protocol, and Media Gateway Control Protocol (MGCP).

How an internal phone call works

When placing a phone call, there are two cooperating but separate streams of information from each endpoint involved in the call. There is a signaling stream as well as a media stream that take different routes across the network but work together to complete the call.  

The call signaling must go to the server in the client/server model, so that the call can be processed and routed to its destination. The phone does not keep an index of IP addresses of all phones, and therefore must receive this information from the call processing server. 

The call processing server then notifies the receiving device that it is receiving a call. Once the phone knows the IP address of its destination and has been told it can initiate media because the receiving end picked up, it initiates a Real-Time Transport Protocol (RTP) stream to the device it is calling.
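The lookup role of the call processing server can be sketched in a few lines of Python. This is a toy illustration, not how any real PBX is implemented: the `CallServer` class, the extensions, and the IP addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CallServer:
    """Toy call processing server: maps extensions to registered IP addresses."""

    registrations: Dict[str, str] = field(default_factory=dict)

    def register(self, extension: str, ip_address: str) -> None:
        # An endpoint tells the server where it can be reached.
        self.registrations[extension] = ip_address

    def route(self, dialed: str) -> Optional[str]:
        # Return the destination IP, or None if nothing is registered.
        return self.registrations.get(dialed)


server = CallServer()
server.register("1001", "10.0.0.11")  # hypothetical phone A
server.register("1002", "10.0.0.12")  # hypothetical phone B

# Phone 1001 dials 1002; only the server knows where 1002 lives.
print(server.route("1002"))
print(server.route("1999"))
```

The phones never hold this table themselves, which is why the signaling has to pass through the server before any media can flow.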

What is Signaling?

Signaling is the communication used to establish, maintain, and terminate connections between a client endpoint and its call processing server.

There are several signaling standards including SIP, SCCP, and MGCP. For the purpose of this training session, we will discuss SIP since it is widely recognized as the industry standard for all new products.

SIP runs over UDP, TCP, or TLS (encrypted, carried over TCP). By default it uses port 5060 for unencrypted signaling and port 5061 for TLS.

SIP messaging was developed by the IETF, which also developed the HTTP standards, so many SIP response codes mirror HTTP status codes, such as 200 OK, 404 Not Found, and 500 Server Internal Error.

Above you see an example of a SIP INVITE message. The INVITE message is the initiation of a phone call, where a phone tells the call processing server who it wants to call or the call processing server tells the receiving client who wants to call it. Note the “From” and “To” fields in the message stating a caller ID name, then a SIP URI. A SIP URI is formatted like an email address, where it is the [phone number]@[the destination].
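To make the "From" and "To" fields concrete, here is a minimal Python sketch that pulls the caller ID name and the SIP URI parts out of an abbreviated INVITE. The message text, the names, the numbers, and the `pbx.example.com` host are all made up for illustration, and the regular expression only handles this simple header shape, not the full SIP grammar.

```python
import re

# Abbreviated, illustrative INVITE; every value here is hypothetical.
INVITE = (
    "INVITE sip:1002@pbx.example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 10.0.0.11:5060;branch=z9hG4bK776asdhds\r\n"
    "Max-Forwards: 70\r\n"
    'From: "Alice" <sip:1001@pbx.example.com>;tag=1928301774\r\n'
    'To: "Bob" <sip:1002@pbx.example.com>\r\n'
    "Call-ID: a84b4c76e66710@10.0.0.11\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

# Optional quoted display name, then <sip:user@host ...>
PARTY_RE = re.compile(
    r'^(?:"(?P<name>[^"]*)"\s*)?<sip:(?P<user>[^@>]+)@(?P<host>[^;>]+)'
)


def parse_party(message: str, header: str) -> dict:
    """Extract caller ID name, user part, and host from a From/To header."""
    prefix = header.lower() + ":"
    for line in message.split("\r\n"):
        if line.lower().startswith(prefix):
            value = line.split(":", 1)[1].strip()
            match = PARTY_RE.match(value)
            if match:
                return match.groupdict()
    raise ValueError(f"{header} header not found or not parseable")


print(parse_party(INVITE, "From"))
print(parse_party(INVITE, "To"))
```

The user part plays the role of the phone number and the host part plays the role of the destination, just like the two halves of an email address.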

What is media?

  • The media is the data passed directly between two endpoints. This can be audio, video, content sharing, or even live streaming.
  • It is carried over the Real-Time Transport Protocol (RTP) (RFC 3550).
  • RTP media normally runs over UDP. The port range is configurable; Cisco devices default to ports 16384 through 32767. Encrypted media uses Secure RTP (SRTP) rather than TLS. TCP is generally avoided for real-time audio because retransmitting a lost packet delays everything behind it, and a late audio packet is as useless as a lost one.
  • The media stream is negotiated in the signaling with the Session Description Protocol (SDP) (RFC 4566), either in the INVITE message (known as early offer) or later in the SIP messaging when the called phone answers (known as delayed offer).
  • In phone calls, the SDP portion of the SIP message describes which audio codecs are supported. Devices must agree to use the highest priority codec that both devices support. In the US, we typically use the ITU standard G.711 for audio. Other standards include H.264 for video, BFCP for content sharing, and T.38 for fax, but we will not go into all of those today.
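The codec agreement described above can be sketched as follows. This is a simplified illustration under stated assumptions: the SDP offer is a made-up example, the function names `offered_codecs` and `pick_codec` are hypothetical, and real SDP negotiation handles many more attributes than the `m=` and `a=rtpmap` lines parsed here.

```python
from typing import Dict, List, Optional, Set

# Illustrative SDP offer; addresses and session IDs are made up.
# PCMU is G.711 mu-law, PCMA is G.711 A-law.
OFFER = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 10.0.0.11",
    "s=-",
    "c=IN IP4 10.0.0.11",
    "t=0 0",
    "m=audio 16384 RTP/AVP 0 8 18",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:8 PCMA/8000",
    "a=rtpmap:18 G729/8000",
]) + "\r\n"


def offered_codecs(sdp: str) -> List[str]:
    """Return codec names in the order of preference listed on the m= line."""
    payload_order: List[str] = []
    names: Dict[str, str] = {}
    for line in sdp.splitlines():
        if line.startswith("m=audio"):
            # m=audio <port> <proto> <payload types, most preferred first>
            payload_order = line.split()[3:]
        elif line.startswith("a=rtpmap:"):
            payload, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            names[payload] = encoding.split("/")[0]
    return [names[p] for p in payload_order if p in names]


def pick_codec(offer_sdp: str, supported: Set[str]) -> Optional[str]:
    """Pick the highest-priority offered codec that the answerer supports."""
    for codec in offered_codecs(offer_sdp):
        if codec in supported:
            return codec
    return None


print(offered_codecs(OFFER))
print(pick_codec(OFFER, {"PCMA", "G729"}))
```

If the two devices share no codec at all, `pick_codec` returns `None`, which corresponds to a call that cannot be set up without a transcoder.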

Above you see a packet capture of an RTP stream from a phone call. Wireshark reconstructed the waveforms from the digitized audio to give a visual representation of the call. The screen capture shows the RTP ports being used, the number of packets, the audio sampling rate, and the codec in use (G.711 µ-law). Audio captured with Wireshark can be played back, which amounts to an artificial form of wiretapping. Call recording software that uses port mirroring works the same way: it listens to the signaling streams to identify which calls to record, then captures the RTP media stream of those calls.
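The sampling rate and codec visible in a capture like this determine the packet size and bandwidth of the stream. The arithmetic below is a sketch for G.711, which samples at 8000 Hz with 8 bits per sample. The 20 ms packetization interval is an assumption (it is a common default, but it is configurable), and the overhead counts only IPv4, UDP, and RTP headers with no options or extensions.

```python
SAMPLE_RATE_HZ = 8000        # G.711 sampling rate (captures audio up to 4000 Hz)
BITS_PER_SAMPLE = 8          # each sample is encoded in 8 bits
PACKET_INTERVAL_MS = 20      # assumed packetization interval
HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP headers, no options

# Raw codec bitrate: 8000 samples/s * 8 bits = 64 kbps
codec_bitrate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Audio bytes carried in each packet
samples_per_packet = SAMPLE_RATE_HZ * PACKET_INTERVAL_MS // 1000
payload_bytes = samples_per_packet * BITS_PER_SAMPLE // 8

# Packets sent per second in each direction
packets_per_second = 1000 // PACKET_INTERVAL_MS

# Bandwidth at the IP layer, per direction, including headers
ip_bandwidth_bps = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second

print(f"codec bitrate:  {codec_bitrate_bps} bps")
print(f"payload size:   {payload_bytes} bytes")
print(f"packet rate:    {packets_per_second} packets/s")
print(f"IP bandwidth:   {ip_bandwidth_bps} bps")
```

With these assumptions, a single direction of a G.711 call carries 160 bytes of audio per packet at 50 packets per second, or about 80 kbps once the headers are included (Layer 2 overhead is extra).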

What does a SIP call look like?

In this diagram you can see the full messaging of a SIP call.

The INVITE tells the server that a number has been dialed and the endpoint would like to initiate a call.

The server responds back to the phone saying it has found a match and is TRYING to reach the other end.

The INVITE is relayed to the distant end phone, which also responds with a TRYING and then a RINGING message, indicating that the phone will accept the call and is ringing.

Once the receiving phone picks up, it sends a 200 OK message, which the server relays to the initiating phone so that it stops playing ringback. (A CANCEL, by contrast, is only used to abandon a call that has not yet been answered.)

The initiating phone confirms the OK with an ACK, and the RTP stream begins.

Once the call is over, the phone that hangs up sends a BYE message to the server, indicating the call should be disconnected. The server relays the BYE to the other phone, and then both the phone and the server return OK’s as a final message.
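For reference, the basic message ladder of a call set up through a server, as described in RFC 3261, can be written out as data. This sketch assumes the server stays in the signaling path for the whole call (as a PBX does) and that the caller is the one who hangs up; the `caller`, `server`, and `callee` labels are just illustrative names.

```python
from typing import List, Tuple

# (sender, receiver, message) in the order the messages are sent.
CALL_FLOW: List[Tuple[str, str, str]] = [
    ("caller", "server", "INVITE"),
    ("server", "caller", "100 Trying"),
    ("server", "callee", "INVITE"),
    ("callee", "server", "100 Trying"),
    ("callee", "server", "180 Ringing"),
    ("server", "caller", "180 Ringing"),
    ("callee", "server", "200 OK"),      # callee picks up
    ("server", "caller", "200 OK"),
    ("caller", "server", "ACK"),
    ("server", "callee", "ACK"),
    # RTP media now flows directly between caller and callee.
    ("caller", "server", "BYE"),         # caller hangs up
    ("server", "callee", "BYE"),
    ("callee", "server", "200 OK"),
    ("server", "caller", "200 OK"),
]


def ladder(flow: List[Tuple[str, str, str]]) -> List[str]:
    """Format each step as 'sender -> receiver: message'."""
    return [f"{src:>6} -> {dst:<6}: {msg}" for src, dst, msg in flow]


for step in ladder(CALL_FLOW):
    print(step)
```

Note that none of the RTP media appears in this list: the ladder is signaling only, and the audio travels on its own path between the two phones.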

Call flow to an external phone

The complexity of a phone call increases when you need to leave the internal network. 

An Internet Telephony Service Provider (ITSP) provides your connection to the Public Switched Telephone Network (PSTN), which is kind of like the internet of phones.

The Cisco Unified Border Element (CUBE) is introduced into the network to serve as a voice gateway as well as a session border controller. Since the Cisco Unified Communications Manager is virtualized on a server, we need a voice gateway to provide a physical interface for our voice services. A Session Border Controller (SBC) is a security device like a firewall for voice. It prevents any non-voice traffic from passing through onto the network and controls where traffic can come from and where it must go to. 

The voice gateway can also provide fail-over survivability in the event your PBX fails, so that calls can continue to process.

The grey dotted lines represent the flow of signaling, and the red line represents the flow of media. 

The need for Quality of Service

Quality of Service in VoIP is critical for the consistency of the audio received. Different media streams have different requirements, but for the purpose of this training session we are going to stick with the requirements of a G.711 voice call, since that is the standard for voice communications.

Packet Loss: 

Packet loss is the measurement of traffic that is sent but does not arrive at the destination for any reason.

  • Acceptable Packet Loss: No more than 1%

If a word is missed, humans can generally fill in the blank and continue the conversation. If an entire sentence is lost, however, the person on the other end has no way to know what was said, and the call has essentially failed.


Latency:

Latency (aka delay) is the amount of time it takes for traffic to travel from the source endpoint to the destination endpoint.

  • Acceptable Latency: No more than 150ms

Delay is important for voice audio due to the requirements of human conversation. If participant A says “hello, Ted. It’s a great day, how are you?” but participant B does not receive the audio until a second or more later, by the time participant B responds, participant A will already be asking if participant B is there. For a phone call to seem natural to humans, the delay must be limited.


Jitter:

Jitter is the variation in the time between arriving packets. For example, packet A arrives, then 10ms later packet B arrives, then 15ms later packet C arrives. That is a jitter of 5ms.

  • Acceptable Jitter: Less than 30ms

Keeping jitter low helps a conversation flow naturally. If jitter is too high, the received audio will speed up and slow down so drastically that the receiving participant may no longer be able to understand it.
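The three measurements and their targets can be pulled together in a short Python sketch. The function names are hypothetical, and the jitter calculation is the simple one used in the example above (the average change between consecutive inter-arrival gaps), not the smoothed estimator that RFC 3550 defines for RTP reports.

```python
from typing import List


def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that never arrived."""
    return 100.0 * (sent - received) / sent


def mean_jitter_ms(arrival_times_ms: List[float]) -> float:
    """Average absolute change between consecutive inter-arrival gaps."""
    if len(arrival_times_ms) < 3:
        raise ValueError("need at least three arrival times")
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    deltas = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return sum(deltas) / len(deltas)


def meets_voice_targets(loss_pct: float, latency_ms: float, jitter_ms: float) -> bool:
    """Check against the targets above: loss <= 1%, latency <= 150ms, jitter < 30ms."""
    return loss_pct <= 1.0 and latency_ms <= 150.0 and jitter_ms < 30.0


# Packets A, B, C from the jitter example: gaps of 10ms and 15ms.
arrivals = [0.0, 10.0, 25.0]
jitter = mean_jitter_ms(arrivals)
loss = packet_loss_percent(sent=1000, received=995)

print(f"jitter: {jitter} ms, loss: {loss} %")
print(meets_voice_targets(loss, latency_ms=80.0, jitter_ms=jitter))
```

A call has to meet all three targets at once; a link with excellent latency but 5% packet loss still fails the check.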


References

Handley, M., Jacobson, V., & Perkins, C. (2006, July). SDP: Session Description Protocol. Retrieved from Internet Engineering Task Force:

Hiwasaki, Y., & Ohmuro, H. (2009, October). RTP Payload Format for mU-law EMbedded Codec for Low-delay IP. Retrieved from Internet Engineering Task Force:

International Telecommunication Union. (1988, November). G.711 : Pulse code modulation (PCM) of voice frequencies. Retrieved from International Telecommunication Union:

Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (2003, July). RTP: A Transport Protocol for Real-Time Applications. Retrieved from Internet Engineering Task Force: