+++
date = 2023-06-10
title = "How much memory is needed to run 1M Erlang processes?"
description = "How to not write benchmarks"

[taxonomies]
tags = [
  "beam",
  "elixir",
  "erlang",
  "benchmarks",
  "programming"
]
+++

Recently [a benchmark of concurrency implementations in different
languages][benchmark] was published. In that article [Piotr Kołaczkowski][]
used ChatGPT to generate the examples in the different languages and
benchmarked them. This was a poor choice, as I found the article and read the
Elixir example:

[benchmark]: https://pkolaczk.github.io/memory-consumption-of-async/ "How Much Memory Do You Need to Run 1 Million Concurrent Tasks?"
[Piotr Kołaczkowski]: https://github.com/pkolaczk

```elixir
tasks =
    for _ <- 1..num_tasks do
        Task.async(fn ->
            :timer.sleep(10000)
        end)
    end

Task.await_many(tasks, :infinity)
```

And, well, it's a pretty poor example of BEAM's process memory usage, and I am
not even talking about the fact that it uses 4 spaces for indentation.

For 1 million processes this code reported 3.94 GiB of memory used by the
process in Piotr's benchmark, but with a little work I managed to reduce that
about 4 times, to around 0.93 GiB of RAM usage. In this article I will
describe:

- how I did that
- why the original code was consuming so much memory
- why in the real world you probably should not optimise like I did here
- why using ChatGPT to write benchmarking code sucks (TL;DR: because it will
  nerd-snipe people like me)

## What are Erlang processes?

Erlang is ~~well~~ known as a language whose support for concurrency is
superb, and Erlang processes are the main reason for that. But what are they?

In Erlang, *process* is the common name for what other languages call *virtual
threads* or *green threads*, but in Erlang these have a small, neat twist:
each process is isolated from the rest, and processes can communicate only via
message passing. That gives Erlang processes 2 features that are rarely seen
in other implementations:

- Failure isolation - a bug, unhandled case, or other issue in a single
  process will not directly affect any other process in the system. The VM
  may send some messages due to a process shutdown, and other processes may be
  killed because of that, but by itself, shutting down a single process will
  not cause problems in any process unrelated to it.
- Location transparency - a process can be spawned locally or on a different
  machine, but from the viewpoint of the programmer, there is no difference.
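
Message passing itself fits in a few lines. The sketch below is a minimal
illustration (the `:ping`/`:pong` protocol is made up for this example):

```elixir
# Spawn a child process and exchange a single message with it.
parent = self()

child =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

send(child, {:ping, parent})

receive do
  :pong -> IO.puts("got pong")
end
```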

The above features and requirements result in some design choices, but for our
purposes only one truly matters today - each process has a stack and heap that
are separate and (almost) independent from those of any other process.

### Process dictionary

Each process in the Erlang VM has a dedicated *mutable* memory space for its
internal uses. Most people do not use it for anything, because in general it
should not be used unless you know exactly what you are doing (in my case, a
bad carpenter could count the cases when I needed it on a single hand). In
general it's a *here be dragons* area.

How is this relevant to us?

Well, OTP internally uses the process dictionary (`pdict` for short) to store
metadata about a given process that can later be used for debugging purposes.
Some of the data it stores:

- The initial function that was run by the given process
- The PIDs of all ancestors of the given process

Different process abstractions (like `gen_server`/`GenServer`, Elixir's
`Task`, etc.) can store even more metadata there: `logger` stores process
metadata in the process dictionary, `rand` stores the state of its PRNGs
there. It's used quite extensively by some OTP features.
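
We can peek at this metadata ourselves. The sketch below is illustrative
(the exact set of keys may differ between OTP/Elixir versions) and inspects
the process dictionary of a freshly started `Task`:

```elixir
# Start a short-lived Task and read its process dictionary before it exits.
task = Task.async(fn -> :timer.sleep(100) end)
{:dictionary, dict} = Process.info(task.pid, :dictionary)

# Typically contains OTP metadata such as :"$initial_call" and :"$ancestors"
keys = Keyword.keys(dict)
IO.inspect(keys)

Task.await(task)
```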

### "Well behaved" OTP process

In addition to the above metadata, if a process is meant to be a "well
behaved" process in an OTP system, i.e. a process that can be observed and
debugged using OTP facilities, it must respond to some additional messages
defined by the [`sys`][] module. Without that, features like [`observer`][]
would not be able to "see" the content of the process state.

[`sys`]: https://erlang.org/doc/man/sys.html
[`observer`]: https://erlang.org/doc/man/observer.html
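
To make the difference concrete, here is a small sketch (the `Demo.Counter`
module is made up for this example): a `GenServer` speaks the `sys` protocol,
so its state can be read from the outside, while a bare `spawn`-ed process
cannot.

```elixir
defmodule Demo.Counter do
  use GenServer

  @impl true
  def init(initial), do: {:ok, initial}
end

{:ok, pid} = GenServer.start_link(Demo.Counter, 42)

# Works because GenServer answers the system messages defined by `sys`
state = :sys.get_state(pid)
IO.inspect(state)

# A plain process does not answer system messages, so the same call on
# `spawn(fn -> :timer.sleep(1000) end)` would hang until it timed out.
```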

## Process memory usage

As we have seen above, the `Task.async/1` function from Elixir **must** do
much more than simply "start a process and live with it". That was one of the
most important problems with the original code: it was using a system that
allocated quite substantial memory alongside the process itself, just to
operate that process. In general, that is a desirable approach (as you
**really, really want the debugging facilities**), but in a synthetic
benchmark it reduces the usefulness of the results.

If we want to avoid that additional memory overhead in our spawned processes,
we need to go back to more primitive functions in Erlang, namely
`erlang:spawn/1` (`Kernel.spawn/1` in Elixir). But that means we cannot use
`Task.await_many/2` anymore, so we need to work around it with a custom
function:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    # A monitor is a built-in feature of Erlang that will inform you (by
    # sending a message) when the process you monitor dies. The returned value
    # is a "reference", which is simply a unique value created by the VM.
    # If the process is already dead, then the message will be delivered
    # immediately.
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  for _ <- 1..num_tasks do
    # The `Kernel` module is imported by default, so there is no need for the
    # `Kernel.` prefix
    spawn(fn ->
      :timer.sleep(10000)
    end)
  end

Bench.await_many(tasks)
```

We have already removed one problem (well, two in fact, but we will go into
the details in the next section).

## All your lists belongs to us now

Erlang, like most functional programming languages, has 2 built-in sequence
types:

- Tuples - a non-growable product type, so you can access any field quite
  fast, but adding more values is a performance no-no
- (Singly) linked lists - a growable type (in most cases it will hold values
  of a single type, but in Erlang that is not always the case), which is fast
  to prepend or pop data from the beginning, but do not try to do anything
  else if you care about performance.

In this case we will focus on the 2nd one, as tuples aren't important here at
all.

A singly linked list is a simple data structure. It is either the special
value `[]` (an empty list) or something called a "cons cell". Cons cells are
also simple structures: a 2-ary tuple (a tuple with 2 elements) where the
first value is the head - the value stored in that list cell - and the other
is the "tail" of the list (aka the rest of the list). In Elixir the cons cell
is written `[head | tail]`. A super simple structure, as you can see, and
perfect for functional programming, as you can add new values to the list
without modifying existing values, so you can be immutable and fast. However,
if you need to construct a sequence of a lot of values (like our list of all
tasks), then we have a problem, because Elixir promises that the list returned
from `for` will be **in order** of the values passed to it. That means we
either need to process our data like this:

```elixir
def map([], _), do: []

def map([head | tail], func) do
  [func.(head) | map(tail, func)]
end
```

Where we build up the call stack (as we cannot have tail-call optimisation
there, of course sans compiler optimisations). Or we need to build our list in
reverse order, and then reverse it before returning (so we can have TCO):

```elixir
def map(list, func), do: do_map(list, func, [])

def do_map([], _func, agg), do: :lists.reverse(agg)

def do_map([head | tail], func, agg) do
  do_map(tail, func, [func.(head) | agg])
end
```

Which of these approaches is more performant is irrelevant[^erlang-perf]; what
is relevant is that we need to either build up the call stack or construct our
list *twice* to be able to keep Elixir's promises (even though in this case we
do not care about the order of the list returned by `for`).

[^erlang-perf]: Sometimes body recursion will be faster, sometimes TCO will be
faster; it's impossible to tell without benchmarking. For more info check out
[a superb article by Fred Hebert](https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html).

Of course, we could mitigate our problem by using the `Enum.reduce/3` function
(or writing it on our own) and end up with code like:

```elixir
defmodule Bench do
  def await(pid) when is_pid(pid) do
    ref = Process.monitor(pid)

    receive do
      {:DOWN, ^ref, :process, _, _} -> :ok
    end
  end

  def await_many(pids) do
    Enum.each(pids, &await/1)
  end
end

tasks =
  Enum.reduce(1..num_tasks, [], fn _, agg ->
    # The `Kernel` module is imported by default, so there is no need for the
    # `Kernel.` prefix
    pid = spawn(fn -> :timer.sleep(10000) end)

    [pid | agg]
  end)

Bench.await_many(tasks)
```

Even then, we still build a list of all the PIDs.

Here I can also go back to the "second problem" I mentioned above.
`Task.await_many/2` *also constructs a list* - the list of return values from
all the processes. So not only did we construct a list of the tasks' PIDs, we
also constructed a list of return values (which will be `:ok` for all
processes, as that is what `:timer.sleep/1` returns), and then immediately
discarded all of that.

How can we do better? Notice that **all** we care about is that all
`num_tasks` processes have gone down. We do not care about any of the return
values; all we want to know is that every process we started went down. For
that we can just send messages from the spawned processes and count the
received messages:

```elixir
defmodule Bench do
  def worker(parent) do
    :timer.sleep(10000)
    send(parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    this = self()
    spawn(fn -> worker(this) end)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not have any lists involved, and we still do what the original task
was meant to do - spawn `num_tasks` processes and wait till all of them go
down.

## Arguments copying

Another thing we can account for here: lambda context and data passing
between processes.

You see, we need to pass `this` (which is the PID of the parent) to our newly
spawned process. That is suboptimal, as we are looking for ways to reduce the
amount of memory used (while ignoring all other metrics at the same time). As
Erlang processes are meant to be "share nothing" processes, there is a
problem - we need to copy that PID into every process. It's just 1 word (which
means 8 bytes on 64-bit architectures, 4 bytes on 32-bit), but hey, we are
microbenchmarking, so we cut whatever we can (with 1M processes, this adds up
to 8 MiB).
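
The per-process cost can be observed directly with `Process.info/2`. A rough
sketch (the absolute numbers vary with the OTP version and emulator flags, so
treat this as illustrative only):

```elixir
# Spawn one idle process and ask the VM how much memory it uses in total
# (stack, heap, and internal structures).
pid = spawn(fn -> :timer.sleep(10_000) end)
{:memory, bytes} = Process.info(pid, :memory)
IO.puts("one idle process: #{bytes} bytes")
```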

Luckily, we can avoid that by using yet another feature of Erlang, called the
*registry*. This is yet another simple feature, one that allows us to assign
the PID of a process to an atom, which then lets us send messages to that
process using just the name we have given it. While an atom is also 1 word, so
it wouldn't make sense to send it either, we can instead do what any
reasonable microbenchmarker would do - *hardcode stuff*:

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    spawn(fn -> worker() end)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

Now we do not pass any arguments, and instead rely on the registry to dispatch
our messages to the respective process.

## One more thing

As you may have already noticed, we are passing a lambda to `spawn/1`. That is
also quite suboptimal, because of the [difference between remote and local
calls][remote-vs-local]. It means that we pay a slight memory cost for these
processes to keep the old version of the module in memory. Instead we can use
either a fully qualified function capture or the `spawn/3` function, which
accepts an MFA (module, function name, arguments list). We end up with:

[remote-vs-local]: https://www.erlang.org/doc/reference_manual/code_loading.html#code-replacement

```elixir
defmodule Bench do
  def worker do
    :timer.sleep(10000)
    send(:parent, :done)
  end

  def start(0), do: :ok
  def start(n) when n > 0 do
    spawn(&__MODULE__.worker/0)

    start(n - 1)
  end

  def await(0), do: :ok
  def await(n) when n > 0 do
    receive do
      :done -> await(n - 1)
    end
  end
end

Process.register(self(), :parent)

Bench.start(num_tasks)
Bench.await(num_tasks)
```

## Results

With the given Erlang build:

```txt
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1]

Elixir 1.14.5 (compiled with Erlang/OTP 25)
```

> Note: no JIT, as Nix on macOS currently[^currently] disables it and I didn't
> bother to enable it in the derivation (it was disabled because there were
> some issues, but IIRC these are resolved now).

[^currently]: Nixpkgs rev `bc3ec5ea`

The results are as follows (in bytes of peak memory footprint, as returned by
`/usr/bin/time` on macOS):

| Implementation | 1k | 100k | 1M |
| -------------- | -------: | --------: | ---------: |
| Original | 45047808 | 452837376 | 4227715072 |
| Spawn | 43728896 | 318230528 | 2869723136 |
| Reduce | 43552768 | 314798080 | 2849304576 |
| Count | 43732992 | 313507840 | 2780540928 |
| Registry | 44453888 | 311988224 | 2787237888 |
| RemoteCall | 43597824 | 310595584 | 2771525632 |

As we can see, we reduced memory use by about 30% just by changing from
`Task.async/1` to `spawn/1`. Further optimisations reduced memory usage
slightly, but with no such drastic changes.

Can we do better?

Well, with some VM flag tinkering - of course.

You see, by default the Erlang VM will create some data required for handling
the process itself[^word]:

[^word]: Again, a word here means 8 bytes on 64-bit and 4 bytes on 32-bit
architectures.

> | Data Type | Memory Size |
> | - | - |
> | … | … |
> | Erlang process | 338 words when spawned, including a heap of 233 words. |
>
> -- <https://erlang.org/doc/efficiency_guide/advanced.html#Advanced>

As we can see, there are 105 words that are strictly required and 233 words of
preallocated heap. But this is microbenchmarking, so since we do not need that
much memory (because our processes basically do nothing), we can reduce it. We
do not care about time performance anyway. For that we can use the `+hms` flag
and set it to some small value, for example `1`.
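
The default can be inspected at runtime; started with `+hms 1`, the same call
would report the overridden value instead:

```elixir
# Default minimal heap size for newly spawned processes, in words
# (233 words per the Efficiency Guide table quoted above).
{:min_heap_size, words} = :erlang.system_info(:min_heap_size)
IO.puts("min heap size: #{words} words")
```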

In addition to the heap size, Erlang by default loads some additional data
from the BEAM files. That data is used for debugging and error reporting, but
again, we are microbenchmarking, and who needs debugging support anyway
(answer: everyone, so **do not** do this in production). Luckily for us, the
VM has yet another flag for that purpose: `+L`.

Erlang also uses some [ETS][] (Erlang Term Storage) tables by default (for
example, to support the process registry we mentioned above). ETS tables can
be compressed, but by default they are not, as compression can slow down some
kinds of operations on such tables. Fortunately there is, again, a flag for
that - `+ec` - whose description reads:

> Forces option compressed on all ETS tables. Only intended for test and
> evaluation.

[ETS]: https://erlang.org/doc/man/ets.html

Sounds good enough for me.
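
Putting it together, the flags can be handed to the emulator like this (the
`bench.exs` file name is made up for this example; `--erl` passes the quoted
string straight to `erl`):

```shell
# +hms 1 - minimal default heap size of 1 word
# +L     - do not load source filename/line number information from BEAM files
# +ec    - force the `compressed` option on all ETS tables
elixir --erl "+hms 1 +L +ec" bench.exs
```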

With all these flags enabled, we get a peak memory footprint of 996257792
bytes.

Let's compare that in more human-readable units:

| | Peak Memory Footprint for 1M processes |
| ------------------------ | -------------------------------------- |
| Original code | 3.94 GiB |
| Improved code | 2.58 GiB |
| Improved code with flags | 0.93 GiB |

The result: about a 76% reduction in peak memory usage. Not bad.

## Summary

First of all:

> Please, do not use ChatGPT to write code for microbenchmarks.

The thing about *micro*benchmarking is that we write code that does as little
as possible, to show (mostly) meaningless features of a given technology in an
abstract environment. ChatGPT cannot do that, not out of malice or
incompetence, but because it was trained on (mostly) *good* and idiomatic
code, and microbenchmarks are rarely something people would consider to have
these qualities. It also cannot take into account other factors that
[wetware][] can (like our "we do not need lists here" observation).

[wetware]: https://en.wikipedia.org/wiki/Wetware_(brain)