K8s Erlang Clustering
This is #2 in my series on the Erlang VM in k8s; this one is about node clustering.
TL;DR
Want to cluster Erlang VMs in Kubernetes? Here’s what you need to do:
- Create a k8s headless service
- Have your subdomain deployment field match the headless service name
- Add the endpoint_pod_names option to CoreDNS's Kubernetes plugin
- Write some Erlang code to perform DNS discovery and cluster using that
Motivation
The usual guides around Erlang/Elixir clustering involve setting up a StatefulSet, and that is fine. This guide will instead detail how to achieve the same result with a regular deployment, a headless service, an obscure CoreDNS k8s plugin option, and some Erlang DNS discovery.
Erlang VM clustering
When starting up a distributed Erlang node (eg. with the -name <name> option) you're instructing the VM to obtain the fully qualified domain name (ie. FQDN) of the host and assume that as the node name. For clustering purposes this means it will only accept requests from other nodes that are able to reach the host through that name. Here is a simple example that illustrates this:
$ erl -name n1 -setcookie cookie
Erlang/OTP 21 [erts-10.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe]
Eshell V10.3 (abort with ^G)
(n1@imacpro.home)1>
Notice that imacpro.home is the same as the output of hostname -f; this is the FQDN of this host.
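To see it for yourself, run it on the same host (output shown for the machine above):
$ hostname -f
imacpro.home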
Let's try running a new node in another shell and cluster the two. We'll refer to n1 not by its FQDN but by the localhost address. It should still work, right? After all, we know that both nodes are running on the same machine…
$ erl -name n2 -setcookie cookie
Erlang/OTP 21 [erts-10.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe]
Eshell V10.3 (abort with ^G)
(n2@imacpro.home)1> net_adm:ping('n1@127.0.0.1').
pang
It failed! Ok, let's try it out with localhost instead of 127.0.0.1:
(n2@imacpro.home)2> net_adm:ping('n1@localhost').
=ERROR REPORT==== 22-Nov-2020::21:45:38.925875 ===
** System running to use fully qualified hostnames **
** Hostname localhost is illegal **
pang
(n2@imacpro.home)3>
Now it's failing with even more error output. Let's try it with the FQDN then:
(n2@imacpro.home)3> net_adm:ping('n1@imacpro.home').
pong
(n2@imacpro.home)4>
Finally it was able to connect to n1 and cluster with it.
What our little experiment is telling us is that the FQDN that the node assumes and the one that other nodes use to cluster with it must match.
A practical check for this is running hostname -f on the host and ensuring that, using this name, you are able to reach it from whatever other nodes you're looking to cluster with.
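Concretely, from any prospective peer the check boils down to a successful ping against that FQDN (node and host names besides n1 are illustrative here):
(probe@othermachine)1> net_adm:ping('n1@imacpro.home').
pong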
Let's now take this knowledge to Kubernetes and see how it applies. We'll again use the simple web server Erlang app as our testbed. Build the Docker image and deploy it in a pod; while you're at it, note one relevant field in deployment.yaml, the subdomain field. We'll get to why it's relevant shortly. Let's attach to the container and find its FQDN:
$ kubectl exec -it simple-web-service-68b97dc4bf-qxwnl -- /bin/sh
/srv/service # hostname -f
simple-web-service-68b97dc4bf-qxwnl.simple-web-service-headless.default.svc.cluster.local
The name that this host knows itself by is <pod-name>.<subdomain>.<namespace>.svc.<zone>; so far so good. When we start a distributed Erlang node this is the name it will assume, and from the previous lesson we know that this must also be the name that the other nodes use when clustering. You're probably wondering why the subdomain portion of the FQDN is set to simple-web-service-headless; this is relevant, but we'll get to why it is so in a bit.
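For reference, here's a minimal sketch of the relevant part of deployment.yaml; the names come from the example app, and everything apart from the subdomain field is illustrative boilerplate:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-web-service
spec:
  selector:
    matchLabels:
      app: simple-web-service
  template:
    metadata:
      labels:
        app: simple-web-service
    spec:
      # must match the headless service name (see the next section)
      subdomain: simple-web-service-headless
      containers:
        - name: simple-web-service
          image: simple-web-service:latest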
Clustering in k8s
We don't really need a StatefulSet to cluster Erlang VMs together; all they need is a way to find each other. A common discovery pattern is using a headless service coupled with DNS. In a nutshell, what you'll need to do to achieve this:
- Create a headless service that groups all the pods in that deployment
- When the VM starts up in each container, perform a DNS lookup on the service name, find all the other hostnames behind the service
- Cluster with all of the hosts
Headless service
First step is creating the headless service; its name will be simple-web-service-headless and it will select all pods with the app: simple-web-service label:
apiVersion: v1
kind: Service
metadata:
  name: simple-web-service-headless
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: simple-web-service
  ports:
    - name: discovery
      protocol: TCP
      port: 8585
      targetPort: 8585
Let's find its DNS name after deploying it. The docs tell us that SRV records are created for headless services in the following format:
_<port-name>._<port-protocol>.<service-name>.<namespace>.svc.<zone>
Let's check that; first, find out the FQDN of the headless service:
$ kubectl exec dnsutils-68bd8dc878-cr725 -- host _discovery._tcp.simple-web-service-headless
_discovery._tcp.simple-web-service-headless.default.svc.cluster.local has address 172.17.0.10
Now that we know this, let's fetch its SRV records:
$ kubectl exec dnsutils-68bd8dc878-cr725 -- dig -t srv _discovery._tcp.simple-web-service-headless.default.svc.cluster.local
; <<>> DiG 9.11.6-P1 <<>> -t srv _discovery._tcp.simple-web-service-headless.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38682
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 55b1fe2e509b573a (echoed)
;; QUESTION SECTION:
;_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. IN SRV
;; ANSWER SECTION:
_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. 30 IN SRV 0 50 8585 172-17-0-10.simple-web-service-headless.default.svc.cluster.local.
;; ADDITIONAL SECTION:
172-17-0-10.simple-web-service-headless.default.svc.cluster.local. 30 IN A 172.17.0.10
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Nov 24 11:39:21 UTC 2020
;; MSG SIZE rcvd: 532
By scaling up the number of replicas in the deployment we should get back more IP addresses:
$ kubectl scale deployment simple-web-service --replicas 2
$ kubectl exec dnsutils-68bd8dc878-cr725 -- dig -t srv _discovery._tcp.simple-web-service-headless.default.svc.cluster.local
; <<>> DiG 9.11.6-P1 <<>> -t srv _discovery._tcp.simple-web-service-headless.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38682
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 55b1fe2e509b573a (echoed)
;; QUESTION SECTION:
;_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. IN SRV
;; ANSWER SECTION:
_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. 30 IN SRV 0 50 8585 172-17-0-10.simple-web-service-headless.default.svc.cluster.local.
_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. 30 IN SRV 0 50 8585 172-17-0-11.simple-web-service-headless.default.svc.cluster.local.
;; ADDITIONAL SECTION:
172-17-0-11.simple-web-service-headless.default.svc.cluster.local. 30 IN A 172.17.0.11
172-17-0-10.simple-web-service-headless.default.svc.cluster.local. 30 IN A 172.17.0.10
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Nov 24 11:39:21 UTC 2020
;; MSG SIZE rcvd: 532
Hm.. we're getting back two records; that's expected, as we have two pods behind the service. What's not convenient is the DNS record format being returned: 172-17-0-10.simple-web-service-headless.default.svc.cluster.local. As explained previously, we need this to match the name that the node knows itself by, which in this case is simple-web-service-68b97dc4bf-qxwnl.simple-web-service-headless.default.svc.cluster.local.
The Erlang VM will deny the clustering request if we use this dashed IP address hostname format.
It’s time to dig into CoreDNS and more specifically its Kubernetes plugin:
Kubernetes and CoreDNS
From Kubernetes version 1.13, CoreDNS is the default DNS server. It embeds a plugin to be used in the k8s environment, and there's a plugin option that is relevant to our interests:
endpoint_pod_names: uses the pod name of the pod targeted by the endpoint as the endpoint name in A records, e.g., endpoint-name.my-service.namespace.svc.cluster.local. in A 1.2.3.4
By default, the endpoint-name name selection is as follows: Use the hostname of the endpoint, or if hostname is not set, use the dashed form of the endpoint IP address
(e.g., 1-2-3-4.my-service.namespace.svc.cluster.local.) If this directive is included, then name selection for endpoints changes as follows:
Use the hostname of the endpoint, or if hostname is not set, use the pod name of the pod targeted by the endpoint. If there is no pod targeted by the endpoint, use the dashed IP address form.
This looks like what we need: by setting this option we should be getting back DNS records with a pod name prefix instead of an IP address. Let's get to it:
# open up the CoreDNS configuration
$ kubectl edit configmaps coredns --namespace kube-system
# find the kubernetes plugin configuration section and add endpoint_pod_names, eg:
# kubernetes cluster.local in-addr.arpa ip6.arpa {
# pods insecure
# fallthrough in-addr.arpa ip6.arpa
# ttl 30
# endpoint_pod_names
# }
# restart the coredns pods
$ kubectl rollout restart --namespace kube-system deployment/coredns
And retry the SRV lookup:
$ kubectl exec dnsutils-68bd8dc878-cr725 -- dig -t srv _discovery._tcp.simple-web-service-headless.default.svc.cluster.local
; <<>> DiG 9.11.6-P1 <<>> -t srv _discovery._tcp.simple-web-service-headless.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1943
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 890564f2ead30e14 (echoed)
;; QUESTION SECTION:
;_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. IN SRV
;; ANSWER SECTION:
_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. 29 IN SRV 0 50 8585 simple-web-service-68b97dc4bf-qxwnl.simple-web-service-headless.default.svc.cluster.local.
_discovery._tcp.simple-web-service-headless.default.svc.cluster.local. 29 IN SRV 0 50 8585 simple-web-service-68b97dc4bf-7zltb.simple-web-service-headless.default.svc.cluster.local.
;; ADDITIONAL SECTION:
simple-web-service-68b97dc4bf-qxwnl.simple-web-service-headless.default.svc.cluster.local. 29 IN A 172.17.0.10
simple-web-service-68b97dc4bf-7zltb.simple-web-service-headless.default.svc.cluster.local. 29 IN A 172.17.0.11
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Tue Nov 24 11:42:24 UTC 2020
;; MSG SIZE rcvd: 628
Alright, now we're cooking: both names now match and we have everything ready to go ahead with the clustering. It should also be clear now why we decided on that specific subdomain deployment field value: it needs to match the headless service name so the entire hostname matches on both sides.
Summing up, the pod hostname that the node itself sees is:
<pod-name>.<subdomain>.<namespace>.svc.<zone>
The SRV DNS record that resolves externally (with the endpoint_pod_names option applied) is:
<pod-name>.<service-name>.<namespace>.svc.<zone>
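With the subdomain equal to the service name, both templates fill in to the same string for our example pod:
<pod-name>  = simple-web-service-68b97dc4bf-qxwnl
<subdomain> = <service-name> = simple-web-service-headless
both resolve to simple-web-service-68b97dc4bf-qxwnl.simple-web-service-headless.default.svc.cluster.local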
DNS Erlang clustering
From here it's pretty straightforward. The following snippet finds all the hosts backing the headless service that we've created for the purpose of discovery:
% the #hostent record comes from this public header
-include_lib("kernel/include/inet.hrl").

-spec k8s_reverse_lookup(SrvRecord :: string()) -> {ok, [string()]}.
k8s_reverse_lookup(SrvRecord) ->
    % 1. get the FQDN of the headless service
    {ok, #hostent{h_name = SrvRecordFQDN}} = inet:gethostbyname(SrvRecord),
    % 2. ask for the SRV records backing the headless service; this gives us back all
    %    pod FQDNs under the headless service subdomain as {Priority, Weight, Port, Host}
    %    tuples (inet_res:lookup/3 hands us the record data directly, avoiding the
    %    private inet_dns.hrl record definitions)
    SrvRecords = inet_res:lookup(SrvRecordFQDN, in, srv),
    % 3. go through each of these records and extract the host
    Hosts = [Host || {_Priority, _Weight, _Port, Host} <- SrvRecords],
    {ok, Hosts}.
The final thing is simply clustering up with the list of hosts, and you're done!
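Here's a minimal sketch of that last step, assuming the nodes were started with -name simple_web_service so that they assume the pod FQDN; the node name prefix is an assumption, adjust it to match your release:
% cluster with every host found behind the headless service
cluster_with(Hosts) ->
    lists:foreach(fun(Host) ->
                      % simple_web_service@ is the assumed node name prefix
                      Node = list_to_atom("simple_web_service@" ++ Host),
                      % pong on success, pang if the node is unreachable
                      net_adm:ping(Node)
                  end,
                  Hosts).
Putting the two together from the shell:
{ok, Hosts} = k8s_reverse_lookup("simple-web-service-headless"),
cluster_with(Hosts).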