@title Command Line Exit Codes
@group fieldmanual

Explains the use of exit codes in Phorge command line scripts.

Overview
========

When you run a command from the command line, it exits with an //exit code//.
This code is normally not shown on the CLI, but you can examine the exit code
of the last command you ran by looking at `$?` in your shell:

  $ ls
  ...
  $ echo $?
  0

Programs which run commands can operate on exit codes, and shell constructs
like `cmdx && cmdy` operate on exit codes.

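As a minimal sketch of a program operating on exit codes, the Python snippet
below starts child processes and reads their exit codes. The child commands are
stand-ins built from the Python interpreter itself, and `run_both` is a
hypothetical helper; neither is part of Phorge.

```python
import subprocess
import sys

# Stand-in child commands: one exits normally, one exits with code 3.
succeeds = [sys.executable, "-c", "pass"]
fails = [sys.executable, "-c", "import sys; sys.exit(3)"]

ok = subprocess.run(succeeds)
failed = subprocess.run(fails)

# `returncode` holds the same value a shell exposes as `$?`.
print(ok.returncode, failed.returncode)


def run_both(cmdx, cmdy):
    """Rough equivalent of `cmdx && cmdy`: run cmdy only if cmdx exits 0."""
    status = subprocess.run(cmdx).returncode
    if status != 0:
        return status
    return subprocess.run(cmdy).returncode
```

Calling `run_both(fails, succeeds)` never starts the second command and returns
the first command's exit code.
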
The code `0` means success. Other codes signal some sort of error or status
condition, depending on the system and command.

With rare exception, Phorge uses //all other codes// to signal
**catastrophic failure**.

This is an explicit architectural decision and one we are unlikely to deviate
from: generally, we will not accept patches which give a command a nonzero exit
code to indicate an expected state, an application status, or a minor abnormal
condition.

Generally, this decision reflects a philosophical belief that attaching
application semantics to exit codes is a relic of a simpler time, and that
they are not appropriate for communicating application state in a modern
operational environment. This document explains the reasoning behind our use of
exit codes in more detail.

In particular, this approach is informed by a focus on operating Phorge
clusters at scale. This is not a common deployment scenario, but we consider it
the most important one. Our use of exit codes makes it easier to deploy and
operate a Phorge cluster at larger scales. It makes it slightly harder to
deploy and operate a small cluster or single host by gluing together `bash`
scripts. We are willingly trading the small scale away for advantages at larger
scales.


Problems With Exit Codes
========================

We do not use exit codes to communicate application state because doing so
makes it harder to write correct scripts, and the primary benefit is that it
makes it easier to write incorrect ones.

This is somewhat at odds with the philosophy of "worse is better", but a modern
operations environment faces different forces than the interactive shell did
in the 1970s, particularly at scale.

We consider correctness to be very important to modern operations environments.
In particular, we believe that having reliable, repeatable processes for
provisioning, configuration and deployment is critical to maintaining and
scaling our operations. Our use of exit codes makes it easier to implement
processes that are correct and reliable on top of Phorge management scripts.

Exit codes as signals for application state are problematic because they are
ambiguous: you can't use them to distinguish between dissimilar failure states
which should prompt very different operational responses.

Exit codes primarily make writing things like `bash` scripts easier, but we
think you shouldn't be writing `bash` scripts in a modern operational
environment if you care very much about your software working.

Software environments which are powerful enough to handle errors properly are
also powerful enough to parse command output to unambiguously read and react to
complex state. Communicating application state through exit codes almost
exclusively makes it easier to handle errors in a haphazard way which is often
incorrect.


Exit Codes are Ambiguous
========================

In many cases, exit codes carry very little information and many different
conditions can produce the same exit code, including conditions which should
prompt very different responses.

The command line tool `grep` searches for text. For example, you might run
a command like this:

  $ grep zebra corpus.txt

This searches for the text `zebra` in the file `corpus.txt`. If the text is
not found, `grep` exits with a nonzero exit code (specifically, `1`).

Suppose you run `grep zebra corpus.txt` and observe a nonzero exit code. What
does that mean? These are //some// of the possible conditions which are
consistent with your observation:

  - The text `zebra` was not found in `corpus.txt`.
  - `corpus.txt` does not exist.
  - You do not have permission to read `corpus.txt`.
  - `grep` is not installed.
  - You do not have permission to run `grep`.
  - There is a bug in `grep`.
  - Your `grep` binary is corrupt.
  - `grep` was killed by a signal.

If you're running this command interactively on a single machine, it's probably
OK for all of these conditions to be conflated. You aren't going to examine the
exit code anyway (it isn't even visible to you by default), and `grep` likely
printed useful information to `stderr` if you hit one of the less common issues.

If you're running this command from operational software (like deployment,
configuration or monitoring scripts) and you care about the correctness and
repeatability of your process, we believe conflating these conditions is not
OK. The operational response to text not being present in a file should almost
always differ substantially from the response to the file not being present or
`grep` being broken.

In a particularly bad case, a broken `grep` might cause a careless deployment
script to continue down an inappropriate path and cascade into a more serious
failure.

Even in a less severe case, unexpected conditions should be detected and raised
to operations staff. `grep` being broken or a file that is expected to exist
not existing are both detectable, unexpected, and likely severe conditions, but
they cannot be differentiated and handled by examining the exit code of
`grep`. It is much better to detect and raise these problems immediately than
to discover them after a lengthy root cause analysis.

Some of these conditions can be differentiated by examining the specific exit
code of the command instead of acting on all nonzero exit codes. However, many
failure conditions produce the same exit codes (particularly code `1`) and
there is no way to guarantee that a particular code signals a particular
condition, especially across systems.

Realistically, it is also relatively rare for scripts to even make an effort to
distinguish between exit codes, and all nonzero exit codes are often treated
the same way.

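The conflation described in this section is easy to reproduce. The sketch
below, in Python, assumes a `grep` binary is available on the `PATH`. The
helper names `grep_exit_code` and `naive_text_is_absent` are hypothetical, and
the second one deliberately mimics a careless script that treats every nonzero
exit code as "text not found".

```python
import subprocess
import tempfile
from pathlib import Path


def grep_exit_code(pattern: str, path: Path) -> int:
    """Run grep and return only its exit code, discarding all output."""
    completed = subprocess.run(
        ["grep", pattern, str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def naive_text_is_absent(pattern: str, path: Path) -> bool:
    """Mimic a careless script: any nonzero exit code means 'not found'."""
    return grep_exit_code(pattern, path) != 0


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    corpus.write_text("lion\ntiger\n")
    missing = Path(tmp) / "no-such-file.txt"

    # Two very different conditions produce one indistinguishable answer.
    text_absent = naive_text_is_absent("zebra", corpus)
    file_absent = naive_text_is_absent("zebra", missing)
```

Both checks come back `True`, so a script built on this test responds to a
missing file exactly as it would to missing text.
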

Bash Scripts are not Robust
===========================

Exit codes that indicate application status make writing `bash` scripts (or
scripts in other tools which provide a thin layer on top of what is essentially
`bash`) a lot easier and more convenient.

For example, it is pretty tricky to parse JSON in `bash` or with standard
command-line tools, and much easier to react to exit codes. This is sometimes
used as an argument for communicating application status in exit codes.

We reject this because we don't think you should be writing `bash` scripts if
you're doing real operations. Fundamentally, `bash` shell scripts are not a
robust building block for creating correct, reliable operational processes.

Here is one problem with using `bash` scripts to perform operational tasks.
Consider this command:

  $ mysqldump | gzip > backup.sql.gz

Now, consider this command:

  $ mysqldermp | gzip > backup.sql.gz

These commands represent a fairly standard way to accomplish a task (dumping
a compressed database backup to disk) in a `bash` script.

Note that the second command contains a typo (`dermp` instead of `dump`), so
`mysqldermp` itself fails immediately with a nonzero exit code.

However, both pipelines as a whole run to completion and exit with exit code
`0` (indicating success), because by default a pipeline reports the exit code
of its last command, which here is `gzip`. Both will create a `backup.sql.gz`
file. One backs up your data; the other never backs up your data. The second
command will never work and never do what the author intended, but will appear
successful under casual inspection.

These behaviors are the same under `set -e`.

This fragile attitude toward error handling is endemic to `bash` scripts. The
default behavior is to continue on errors, and it isn't easy to change this
default. Options like `set -e` are unreliable and it is difficult to detect and
react to errors in fundamental constructs like pipes. The tools that `bash`
scripts employ (like `grep`) emit ambiguous error codes. Scripts cannot help
but propagate this ambiguity no matter how careful they are with error handling.

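For contrast, a more capable environment can start each stage of a pipeline
itself and check every exit code. The Python sketch below illustrates the idea
under stated assumptions: both stages are stand-ins built from the Python
interpreter (the first fails immediately, much as a mistyped command name
would), and `run_pipeline` is a hypothetical helper, not a Phorge tool.

```python
import subprocess
import sys

# Stand-in for the mistyped dump command: exits at once with a nonzero code.
FAILING_DUMP = [sys.executable, "-c", "import sys; sys.exit(127)"]

# Stand-in for `gzip`: compresses whatever arrives on stdin, even nothing.
COMPRESS = [
    sys.executable,
    "-c",
    "import gzip, sys; "
    "sys.stdout.buffer.write(gzip.compress(sys.stdin.buffer.read()))",
]


def run_pipeline(first, second):
    """Run `first | second`; return both exit codes and the final output."""
    producer = subprocess.Popen(first, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        second, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe so the consumer sees EOF when the producer exits.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return producer.returncode, consumer.returncode, output


dump_status, gzip_status, archive = run_pipeline(FAILING_DUMP, COMPRESS)

# Judging by the last stage alone, as a shell pipeline does by default.
pipeline_looks_ok = gzip_status == 0
# Checking every stage exposes the failure.
pipeline_really_ok = dump_status == 0 and gzip_status == 0
```

Here `archive` is not empty, just as `backup.sql.gz` is still created in the
shell example, yet `pipeline_really_ok` is `False` because the first stage's
exit code was examined too.
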
It is likely //possible// to implement these things safely and correctly in
`bash`, but it is not easy or straightforward. More importantly, it is not the
default: the default behavior of `bash` is to ignore errors and continue.

Gluing commands together in `bash` or something that sits on top of `bash`
makes it easy and convenient to get a process that works fairly well most of
the time at small scales, but we are not satisfied that it represents a robust
foundation for operations at larger scales.


Reacting to State
=================

Instead of communicating application state through exit codes, we generally
communicate application state through machine-parseable output with a success
(`0`) exit code. All nonzero exit codes indicate catastrophic failure which
requires operational intervention.

Callers are expected to request machine-parseable output if necessary (for
example, by passing a `--json` flag or other similar flags), verify the command
exits with a `0` exit code, parse the output, then react to the state it
communicates as appropriate.

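The steps above can be sketched in Python as follows. This is an illustration
under stated assumptions, not a description of any particular Phorge script:
`FAKE_STATUS_COMMAND` is a hypothetical stand-in that prints JSON and exits
`0`, the `daemons` key is invented for the example, and `read_state` and
`CatastrophicFailure` are hypothetical names.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a management script run with a `--json` flag.
# It prints its state as JSON and exits 0; it is not a real Phorge command.
FAKE_STATUS_COMMAND = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"daemons": "stopped"}))',
]


class CatastrophicFailure(Exception):
    """The command could not report state at all; a human should look."""


def read_state(command):
    """Run a command, insist on exit code 0, and return its parsed JSON."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode != 0:
        raise CatastrophicFailure(
            f"exit code {completed.returncode}: {completed.stderr.strip()}"
        )
    try:
        return json.loads(completed.stdout)
    except json.JSONDecodeError as error:
        # Exit code 0 with unparseable output is also unexpected.
        raise CatastrophicFailure(f"unparseable output: {error}") from error


state = read_state(FAKE_STATUS_COMMAND)

# React to application state carried in the output, not in the exit code.
if state["daemons"] == "stopped":
    action = "start daemons"
else:
    action = "nothing to do"
```

Any nonzero exit code surfaces as `CatastrophicFailure` without further
interpretation, while ordinary application state travels in the parsed output.
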
In a sufficiently powerful scripting environment (e.g., one with data
structures and a JSON parser), this is straightforward and makes it easy to
react precisely and correctly. It also allows scripts to communicate
arbitrarily complex state. Provided your environment gives you an appropriate
toolset, it is much more powerful and not significantly more complex than using
exit codes.

Most importantly, it allows the calling environment to treat nonzero exit
statuses as catastrophic failure by default.


Moving Forward
==============

Given these concerns, we are generally unwilling to bring changes which use
exit codes to communicate application state (other than catastrophic failure)
into the upstream. There are some exceptions, but these are rare. In
particular, ease of use in a `bash` environment is not a compelling motivation.

We are broadly willing to make output machine-parseable or provide an explicit
machine output mode (often a `--json` flag) if there is a reasonable use case
for it. However, we operate a large production cluster of Phorge instances
with the tools available in the upstream, so the lack of machine-parseable
output is not sufficient to motivate adding such output on its own: we also
need to understand the problem you're facing, and why it isn't a problem we
face. A simpler or cleaner approach to the problem may already exist.

If you just want to write `bash` scripts on top of Phorge scripts and you
are unswayed by these concerns, you can often just build a composite command to
get roughly the same effect that you'd get out of an exit code.

For example, you can pipe things to `grep` to convert output into exit codes.
This should generally have failure rates that are comparable to the background
failure level of relying on `bash` as a scripting environment.