F'ing hot code loading, how does it work?
Hot code loading in Erlang, how does that work?
You probably know by now that one of the killer features of Erlang (besides the concurrency thing) is the ability to change code without downtime. This is something that OTP already gives us for free, and it provides great business value (time is money). This is an attempt to uncover what’s really going on behind the scenes when you perform a hot code load.
So let’s start with the simplest possible thing: a process that offers a KV-like interface with just two functions, get/1 and set/2. No OTP is used, only plain old ! and receive:
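A minimal sketch of such a process, under stated assumptions: the module name kv, the registered name, the message shape {From, Ref, Request}, the map used as state and the 5 second timeout are all illustrative choices rather than the original listing. Only get/1, set/2, loop/1 and handle_msg/2 are named in the text.

```erlang
-module(kv).
-export([start/0, get/1, set/2]).

%% get/1 is also an auto-imported BIF (process dictionary); this keeps the
%% local definition unambiguous.
-compile({no_auto_import, [get/1]}).

%% Spawn the server with an empty map and register it under the module name
%% (assumption: a single, locally registered server).
start() ->
    Pid = spawn(fun() -> loop(#{}) end),
    true = register(?MODULE, Pid),
    {ok, Pid}.

get(Key) -> call({get, Key}).
set(Key, Value) -> call({set, Key, Value}).

%% Send a request and wait for the matching reply.
call(Request) ->
    Ref = make_ref(),
    %% Sending to an unregistered name raises badarg; catching it lets a
    %% dead server show up as a timeout instead.
    catch ?MODULE ! {self(), Ref, Request},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.

%% Plain receive loop. Note the local (non-qualified) recursive call.
loop(State) ->
    receive
        {From, Ref, Request} ->
            {Reply, NewState} = handle_msg(Request, State),
            From ! {Ref, Reply},
            loop(NewState)
    end.

handle_msg({get, Key}, State) ->
    {{value, maps:get(Key, State, undefined)}, State};
handle_msg({set, Key, Value}, State) ->
    {ok, State#{Key => Value}}.
```

With this sketch, kv:start() followed by kv:set(a, 42) and kv:get(a) yields {value, 42}.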
Let’s try it out (in OTP20):
Pretty simple, but it would be nicer if our get/1 operation returned {ok, {value, 42}} instead of just {value, 42}. To have that, we’ll just change the tuple being returned in handle_msg/2:
Since we don’t want to stop our process, we’ll just hot code load that one module and be done with it, right? Just run the c/1 shell command, which compiles and loads our module with the change we made, then call get/1 to make sure that everything is working as expected.
What happened? The term returned is exactly the same, our changes didn’t have any effect! Fortunately we have this other shell command which explicitly loads a module: l/1. Let’s try that and see if it’s any better:
Well that’s just wonderful, now the process died and our get/1 operation is timing out. FML.
Why?
Let’s try and get a high-level view of how the VM supports code loading without downtime. Inside the bowels of the beast, each module has two code pointers: current (i.e. the version that is running right now) and old, which starts out pointing to null. When we ran c/1 earlier the VM shifted these pointers: the code that had been current became old, and the freshly compiled code became current. However (and this is the important bit), our process was left pointing to the old code. We can see this by using erlang:check_process_code/2, which tells us whether a given pid is running old code for a module or not.
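As a small illustration of erlang:check_process_code/2, the helper below lists every local process that is still executing old code for a module. The module name code_probe and the function old_code_pids/1 are hypothetical, introduced only for this sketch.

```erlang
-module(code_probe).
-export([old_code_pids/1]).

%% Return the pids of all local processes that still execute old code
%% of Mod. Comparing with =:= true skips any other result, for example
%% for a process that exited in the meantime.
old_code_pids(Mod) ->
    [Pid || Pid <- erlang:processes(),
            erlang:check_process_code(Pid, Mod) =:= true].
```

Right after recompiling a module whose server sits in a non-qualified receive loop, the server's pid should appear in this list; for a module with no old version loaded the list is empty.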
Something needs to happen inside the process for it to switch to the current version of the module instead of staying on the old one. That thing is a fully qualified function call (a call in the Module:Function(Args) format). The best place for it is in our loop/1 function:
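A trimmed-down sketch of what that can look like, assuming a module named kv, a map as state and a {From, Ref, Request} message shape (all illustrative; only loop/1 and handle_msg/2 are named in the text). The one line that matters is the ?MODULE:loop(NewState) call.

```erlang
-module(kv).
-export([start/0, loop/1]).

%% Start the server through an exported entry point.
start() ->
    spawn(?MODULE, loop, [#{}]).

%% The recursive call is fully qualified, so after each handled message
%% the process continues in whatever version of the module is current.
loop(State) ->
    receive
        {From, Ref, Request} ->
            {Reply, NewState} = handle_msg(Request, State),
            From ! {Ref, Reply},
            ?MODULE:loop(NewState)
    end.

handle_msg({get, Key}, State) ->
    {{ok, {value, maps:get(Key, State, undefined)}}, State};
handle_msg({set, Key, Value}, State) ->
    {ok, State#{Key => Value}}.
```

A process blocked in receive only makes the jump after it handles its next message, because that is when the qualified call is reached.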
Notice how we need to export loop/1 even though we’re making the call from inside the same module.
But still, going back a bit, why did the process crash when we ran l/1? Digging a bit in the OTP source you’ll find its implementation in lib/stdlib/src/c.erl. c/1 does the same thing, except that it compiles the file from disk first:
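What follows is a sketch of the behaviour being described, not a verbatim copy of the OTP source. The module name reload_sketch is hypothetical; code:purge/1 and code:load_file/1 are the real functions from the code module.

```erlang
-module(reload_sketch).
-export([l/1]).

%% Purge first, load second: the same order the text attributes to l/1.
l(Mod) ->
    %% Brutal purge: removes old code for Mod and kills any process that
    %% is still running it.
    _ = code:purge(Mod),
    %% Loads Mod's .beam from the code path; returns {module, Mod} on
    %% success or {error, Reason} otherwise.
    code:load_file(Mod).
```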
So it first does a code purge and only then reloads the beam file from disk. There are two kinds of purge: brutal and soft. Brutal purge (code:purge/1), as the name implies, searches for any process that is still running old code for the module and kills it. Soft purge (code:soft_purge/1) does the same search without the killing: if it finds such a process it leaves the old code in place and returns false, so you know what will happen if you go brutal. A purge frees the structures associated with the old code, which is why it’s important to ensure that no running process is still pointing to it.
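One way to use the soft variant is to refuse a reload while some process still depends on the old code. A minimal sketch, assuming a hypothetical module safe_reload and an error tuple shape made up for this example:

```erlang
-module(safe_reload).
-export([reload/1]).

%% Reload Mod only if its old code can be dropped without killing anyone.
reload(Mod) ->
    case code:soft_purge(Mod) of
        true ->
            %% No process was running old code (or there was none to
            %% purge), so loading the new version is safe.
            code:load_file(Mod);
        false ->
            %% Some process still runs old code; leave everything as is.
            {error, {old_code_in_use, Mod}}
    end.
```

The caller can then decide whether to wait and retry or to fall back to a brutal purge.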
In our example, when we ran c/1 the code purge did nothing (because the old pointer was null) and right after that the new code was loaded, which means the old pointer is now pointing to the previous version of the module. When we then ran l/1, the first thing it did was a brutal purge. This time things are different, because there is one process that is still pointing to the old code, and that’s why it dies. Let’s fool around a bit now that we know which functions are being called:
Conclusion
Our example is a bit contrived. Most of the time you’ll want to be using OTP’s gen_server, supervisor and whatnot. These already take care of the needed qualified function calls and also offer the code_change/3 callback, which acts as a checkpoint when upgrading (or downgrading) from one version of the code to the next.
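For comparison, here is a rough gen_server version of the same KV process. The module name kv_server and the state migration inside code_change/3 (an older version keeping a list of {Key, Value} pairs) are assumptions made up for this sketch. Note that code_change/3 is invoked as part of an OTP upgrade (through sys:change_code, for instance from the release handler), not by a plain c/1 or l/1.

```erlang
-module(kv_server).
-behaviour(gen_server).

-export([start_link/0, get/1, set/2]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3]).

%% get/1 is also an auto-imported BIF; keep the local definition unambiguous.
-compile({no_auto_import, [get/1]}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get(Key) -> gen_server:call(?MODULE, {get, Key}).
set(Key, Value) -> gen_server:call(?MODULE, {set, Key, Value}).

init([]) ->
    {ok, #{}}.

handle_call({get, Key}, _From, State) ->
    {reply, {ok, {value, maps:get(Key, State, undefined)}}, State};
handle_call({set, Key, Value}, _From, State) ->
    {reply, ok, State#{Key => Value}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Checkpoint between versions. Hypothetical migration: if the previous
%% version kept its state as a list of pairs, convert it to a map.
code_change(_OldVsn, State, _Extra) when is_list(State) ->
    {ok, maps:from_list(State)};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

The receive loop and its qualified calls live inside gen_server itself, so the callback module only describes how to handle a message and how to carry state across versions.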
Still, the purge thing needs to be taken into account: if your gen_server is blocked somewhere in its execution path (say in a TCP accept) it might still get killed during a brutal purge, simply because it did not go through the main loop (and the qualified function call).