Replication and Distributed setup

Hi There,

I'm planning to deploy and test Yeti-Switch in a distributed architecture as below:
2x Database servers (CDR and Routing)
2x Management servers (Management + Redis)
2x Sems servers (Sems)
2x Lb server (Kamailio)
2x WebUI server (Yeti-web + Nginx)

Predominantly we’re testing for redundancy, failover, and high availability.

Before I start, I would like to seek your advice or suggestions on the following:

  1. What would be the best practice for Database (Routing and CDR) replication or clustering (if any with Postgres) or HA?
  2. How would the failover configuration be for management nodes?
  3. Is there any specific setup or config for the Redis-server pool/failover?
  4. Should we use Nginx Proxy for webui - HA? or is there a different way?
  5. In the case of multi-homing with the LoadBalancer, do we need to install and enable the RTP Engine or Proxy module?
  6. Given the above architecture, what would be Yeti’s recommended way of clustering all the above components?

We would really like to know your best practice and method, so we can test and provide you feedback based on your recommended setup.

Thanks in advance

What would be the best practice for Database (Routing and CDR) replication or clustering (if any with Postgres) or HA?

For the routing DB it is a good idea to have a PostgreSQL slave on each SEMS server and use this slave as the primary routing database for that SEMS; the routing procedure is read-only.
For CDRs - Yeti can't use a slave database now, so the slave is only usable for redundancy. In case of a master failure you should perform a switchover by any mechanism, or do it manually. SEMS buffers CDRs, so a temporary CDR master failure will not cause data loss.
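
For reference, a minimal sketch of such a per-SEMS standby using stock PostgreSQL streaming replication (PostgreSQL 12+ directives; the hosts, replication user and data directory are placeholders, not taken from the Yeti documentation):

# on the routing master, postgresql.conf - allow streaming replication
wal_level = replica
max_wal_senders = 10

# on the routing master, pg_hba.conf - let the SEMS nodes replicate (placeholder network)
host replication replicator 10.0.1.0/24 md5

# on each SEMS node, clone the master; -R writes standby.signal and primary_conninfo
#   pg_basebackup -h routing-master.example.net -U replicator -D /var/lib/postgresql/13/main -R

# on each SEMS node, postgresql.conf - serve read-only queries while in recovery
hot_standby = on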

How would the failover configuration be for management nodes?

Management node connection failover on the SEMS side was removed during refactoring (it will be added back soon), so you should monitor management node availability. Fortunately, the management node is used only at SEMS startup and for exporting metrics to Prometheus.

Is there any specific setup or config for the Redis-server pool/failover?

Local slave on each SEMS node + remote master.
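
For example, a minimal redis.conf sketch for that layout (the master address is a placeholder; on Redis 5+ the directive is replicaof, older releases use slaveof / slave-read-only):

# redis.conf on each SEMS node - local read-only slave of the remote master
replicaof 10.0.0.20 6379
replica-read-only yes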

Should we use Nginx Proxy for webui - HA? or is there a different way?

Nginx is a good solution, but you should run only one instance of the CDR billing process.
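
A minimal nginx sketch of that, assuming two yeti-web instances behind one upstream (names and addresses are placeholders); nginx only balances the web UI here - the CDR billing process still runs on exactly one node:

upstream yeti_web {
    server 10.0.2.21:80;         # web node 1 (placeholder)
    server 10.0.2.22:80 backup;  # web node 2, used only when node 1 is down
}

server {
    listen 80;
    server_name yeti.example.com;

    location / {
        proxy_pass http://yeti_web;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}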

In the case of multi-homing with the LoadBalancer, do we need to install and enable the RTP Engine or Proxy module?

There is no reason to process RTP twice - on the RTP engine and on SEMS.

Thanks Dmitry for your prompt response.

Do you have any suggestions, or would you recommend:

  • Load balancer in Active/Active, or Active/Passive using keepalived?
  • Routing DB replication as Master-Master?
  • HA Proxy for Management Servers?

Load balancer in Active/Active, or Active/Passive using keepalived?

It depends on your clients. If your customers support DNS SRV, you can use two active LBs on different addresses.
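
For example, standard DNS SRV records for two active load balancers (zone-file syntax; names, addresses and TTLs are placeholders) - equal priority and weight lets clients use either LB:

_sip._udp.example.com.  300 IN SRV 10 50 5060 lb1.example.com.
_sip._udp.example.com.  300 IN SRV 10 50 5060 lb2.example.com.
lb1.example.com.        300 IN A   203.0.113.10
lb2.example.com.        300 IN A   203.0.113.11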

Routing DB replication as Master-Master?

Master-Slave

HA Proxy for Management Servers?

HAProxy doesn't support SCTP.

Thanks, Dmitry. I appreciate your patience. You have clarified most of it. Just a few more clarifications:

In the case of a Master-Slave setup: what happens when we have a split-brain?
Let's assume we have 2 zones and the link between the two zones is broken. The CDR and routing master DBs are in Zone 1. In this case:
Zone 1: the CDR and routing master DB is still the master
Zone 2: the CDR and routing slave DB will automatically become master and will be allowed to write
Is that the expected outcome? If so, wouldn't this cause data inconsistencies? Or what is expected to happen in the scenario above?

In the same scenario, if we have one web interface in Zone 1 and we lose the connection to Zone 1, does that mean no web interface, since we had only one web interface?

You said, "SEMS buffers CDRs so temporary CDR master failure will not cause data loss."
For how long and up to what point can it buffer CDRs? Does the buffer rely on memory or disk?

The same split-brain scenario with redis-server: how do we handle it? (I believe redis-server supports clustering and Sentinel for HA)

Last one: what is the recommended way to send outbound traffic to carriers/vendors? Do we share the same IN load balancer that receives traffic, or do we set up a separate OUT load balancer instance?

Do the SEMS nodes require a public-facing interface (or the same interface type facing originators and terminators) for RTP traffic, since the load balancer does not proxy or process RTP?

Thanks in advance

There is no reason to switch the slave routing DB to master mode. SEMS will continue call routing using the old data; when the master becomes available again, the replication mechanism will update the information on the slave.

In the same scenario, if we have one web interface in Zone 1 and we lose the connection to Zone 1, does that mean no web interface, since we had only one web interface?

Yes. In case of connectivity issues, system functionality will be limited. Only a few things are guaranteed:

  • call routing will work
  • CDRs will be saved and billed once connectivity is restored

You said, “SEMS buffers CDRs so temporary CDR master failure will not cause data loss.”
For how long and up to what point can it buffer CDRs? Does the buffer rely on memory or disk?

By default, in memory. You can configure it to save CDRs to CSV, but in that case you have to load these CDRs into the master database manually - there is no tool included for that, so this option is not recommended.

The same split-brain scenario with redis-server: how do we handle it? (I believe redis-server supports clustering and Sentinel for HA)

We are using master-slave (not clustering) for Redis. In case of connectivity issues, the system will not limit capacity.

Last one: what is the recommended way to send outbound traffic to carriers/vendors? Do we share the same IN load balancer that receives traffic, or do we set up a separate OUT load balancer instance?

It is up to you. Usually it is better to have no SIP proxy on legB.

Do the SEMS nodes require a public-facing interface (or the same interface type facing originators and terminators) for RTP traffic, since the load balancer does not proxy or process RTP?

If you are planning to receive calls from the Internet, it is better to have public addresses for the load balancers and public addresses for the SEMS nodes.
But for internal communication - DB, Redis, JRPC - we recommend using a dedicated private network.

Thanks, Dmitry for the clarification :slight_smile: Much appreciated

We started with the distributed setup and came across issues related to multiple network interfaces and layers.

Here are some of the issues we'd appreciate your help with:

In sems.conf: how can we define more than one SIP signaling and RTP input/receive IP address (LegA), and how can we configure SIP signaling and RTP output/outbound (LegB) via a specific network interface/IP address?

Same with Kamailio and multiple network interfaces: do we configure them in lb.cfg and enable mhomed?

In sems.conf: is it possible to have more than one Yeti management IP, i.e. SEMS failing over to a different management IP?

In system.cfg it says master_pool and slave_pool. Does that mean we can have a pool of master and slave IP addresses, and if so, how do we set it up?
master_pool {
  host = 127.0.0.1
  port = 5432
  name = yeti
  user = yeti
  pass = some_password
  size = 4
  check_interval = 10
  max_exceptions = 0
  statement_timeout = 3000
}
failover_to_slave = true
slave_pool {
  host = 127.0.0.1
  port = 5432
  name = yeti
  user = yeti
  pass = some_password
  size = 4
  check_interval = 10
  max_exceptions = 0
  statement_timeout = 3000
}
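
For illustration, a hedged sketch of how these pools might map onto the earlier advice of a local PostgreSQL slave on each SEMS node - master_pool pointing at the remote master, slave_pool at the local standby, and failover_to_slave = true so SEMS falls back to the local copy (addresses are placeholders; the exact pool semantics should be confirmed against the Yeti documentation):

master_pool {
  host = 10.0.0.10           # remote routing master (placeholder address)
  port = 5432
  name = yeti
  user = yeti
  pass = some_password
  size = 4
}
failover_to_slave = true      # fall back to slave_pool if the master connection fails
slave_pool {
  host = 127.0.0.1            # local read-only standby running on this SEMS node
  port = 5432
  name = yeti
  user = yeti
  pass = some_password
  size = 4
}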

Routing traffic via the load balancer: just confirming that this is the preferred way to send outbound traffic to carriers via the load balancer (Kamailio):

  1. Term use outbound proxy
  2. Term force outbound proxy
  3. Term proxy transport protocol*
  4. Term outbound proxy: Out lb IP address

In redis-server.conf, do we need to set slave-read-only to no?

Based on the setup below:

  • 1x Master Routing Database server and 3x Slave Routing DBs (currently read-only)
  • 1x Master CDR Database server and 2x Slave CDR DBs (currently read-only)
  • 2x Yeti-Management servers (POP Id = 4 and 5) + 2x Master Redis-server
    routing { .... failover_to_slave = true ...}
    CDR { ...... failover_to_slave = false ... failover_requeue = true}
  • 4x Sems servers (Sems) + 4x Slave Redis-Server (read-only)
    2x SEMS connected to POP 4 and 2x SEMS connected to POP 5
  • 2x Lb server (Kamailio)
    dispatcher.list = 4x Sems Nodes
  • 1x WebUI server

We did failover testing and this is what we found:

  1. Turned off the routing master database: we lose the connection to the web interface
  2. Turned off the CDR master database: CDRs get queued in SEMS, but the yeti-cdr-billing and yeti-delay-job services stop and have to be restarted manually once the CDR master DB is back
  3. Turned off 1x Redis master: 2x SEMS are unable to process SIP REGISTER, but since the load balancer has all 4x SEMS, it returns an error twice before the registration succeeds via the other 2x SEMS nodes
  4. Turned off POP 4: on the 2x SEMS of POP 4, inbound, outbound, routing and CDRs work, except inbound calls to gateways with a dynamic AoR
  5. Turned off POP 4 and 5: on all 4x SEMS, inbound, outbound, routing and CDRs all keep working

It is good to see SEMS working even after stopping both Yeti management nodes, as expected. Although the probability of 1, 2 and 3 happening at the same time is relatively low, is there any failover for the web interface, even in read-only mode, when the master routing DB is unavailable?

We appreciate your support and thanks in advance :slight_smile:

You should have only one master + multiple slaves

is there any failover for the web interface, even in read-only mode, when the master routing DB is unavailable?

No. Yeti is designed to continue call routing as long as possible while some components are unavailable, but functionality like the web interface can be limited during such problems. You can automate master database recovery and reconnect the web interface to the new master, but these are just DBA tasks.
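
One hedged way to automate that reconnection is a floating address: point yeti-web's Rails database configuration at a virtual IP that keepalived, Patroni or a similar tool keeps on the current master node (the VIP, credentials and exact file layout are placeholders and may differ between yeti-web versions):

# config/database.yml (yeti-web) - routing DB reached via a floating VIP
production:
  adapter: postgresql
  host: 10.0.0.100        # virtual IP that always follows the routing master (placeholder)
  port: 5432
  database: yeti
  username: yeti
  password: some_password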

We had it that way initially, but the challenge was a network outage between the two zones: one zone was operational and the other wasn't, and the load balancer wasn't able to detect this failure and kept sending SIP REGISTER to SEMS nodes that could not reach the master Redis server, which therefore returned 500 Internal Server Error. It would be ideal if there were a well-tested best-practice document on the distributed setup. (I'm happy to contribute if required.)

Hello Rabbani & Dmitry
So please, what would you advise as a best practice for setting up a distributed architecture that scales horizontally?

Say the use case is to have a SIGNALING-ONLY node that is able to handle say 500k concurrent calls (for the sake of numbers). Media would be bypassed. What distributed architecture setup might take such a capacity, SCALING HORIZONTALLY? Any ideas?

A 500k CC system is not something that can be explained in documentation. Explain your use case in a private message.