Skip to content

Return non-zero exit from get on query failure#2313

Merged
berendt merged 1 commit into
mainfrom
get-nonzero-exit-on-query-failure
Jun 2, 2026
Merged

Return non-zero exit from get on query failure#2313
berendt merged 1 commit into
mainfrom
get-nonzero-exit-on-query-failure

Conversation

@ideaship

@ideaship ideaship commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

What

osism get hostvars and osism get hosts logged an error but then returned None when the underlying ansible-inventory subprocess failed. cliff turns a take_action return value of None into exit code 0 (return_code = self.take_action(parsed_args) or 0), so these commands exited successfully even when the inventory query could not be run at all — making them unsafe in set -e scripts and && chains.

Both commands now return 1 on subprocess.CalledProcessError, i.e. when the query itself could not be executed.

Missing host

With the pinned ansible-core (2.18.9 — release repo latest/base.yml), ansible-inventory --host <unknown> does not return an empty result. It refuses the query with ERROR! You must pass a single valid host to --host parameter and exits 5. So osism get hostvars <unknown-host> now exits 1, matching its existing Host not found in inventory. error message.

What stays exit 0 (empty results)

Empty results are deliberately left as success — a query that ran fine but had nothing to show:

  • a host that exists but has no host vars (ansible-inventory prints {}, exits 0)
  • a requested variable/fact absent from an otherwise successfully retrieved data set
  • an inventory that lists no matching hosts

A note on the convention

There is no documented exit-code contract for the osism CLI — the behaviour is emergent and inconsistent across commands. This PR does not attempt to define or enforce one. It only tries to make osism get match what looks like the prevailing convention elsewhere in the CLI: a non-zero exit signals an operational failure (the query could not be run), while an empty result set is treated as a valid, successful answer.

Tests

Adds unit tests for both commands covering the failure path (raised CalledProcessError → exit 1) and the empty-result path (successful query with nothing to show → exit 0).

osism get hostvars and osism get hosts logged an error but then
returned None when the underlying ansible-inventory subprocess
failed. cliff turns a take_action return value of None into exit
code 0 (return_code = self.take_action(parsed_args) or 0), so these
commands exited successfully even though the inventory query could
not be run at all. That makes them unsafe to use in set -e scripts
or && chains, where a failed lookup silently looks like success.

Both commands now return 1 when subprocess.CalledProcessError is
raised, i.e. when the query itself could not be executed.

A missing host is one such failure. With the pinned ansible-core
(2.18.9, release latest/base.yml), ansible-inventory --host on a
host that is not in the inventory does not return an empty result;
it refuses the query with "ERROR! You must pass a single valid host
to --host parameter" and exits 5. osism get hostvars for an unknown
host therefore now exits 1, which matches its existing "Host not
found in inventory." error message.

Empty results are deliberately left as exit 0: a host that exists
but has no host vars (ansible-inventory prints {} and exits 0), a
requested variable or fact that is absent from an otherwise
successfully retrieved data set, or an inventory that lists no
matching hosts. These are valid answers to a query that did run.
This keeps the behaviour consistent with the convention used
elsewhere in the CLI, where a non-zero status signals an operational
failure rather than an empty result set.

Adds unit tests for both commands covering the failure path (a
raised CalledProcessError yields exit code 1) and the empty-result
path (a successful query with nothing to show yields exit code 0).

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship

ideaship commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: #2314 catalogs other osism subcommands with the same exit-0-on-failure pattern this PR fixes in osism get (timeouts, failed inventory queries, and not-found lookups across reconciler, validate, wait, report, baremetal, compute, server, volume, netbox).

@ideaship ideaship requested a review from berendt June 2, 2026 14:59
@ideaship ideaship marked this pull request as ready for review June 2, 2026 15:00
@ideaship ideaship moved this from Ready to In review in Human Board Jun 2, 2026

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've reviewed your changes and they look great!


Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@berendt berendt merged commit 44674c3 into main Jun 2, 2026
4 checks passed
@berendt berendt deleted the get-nonzero-exit-on-query-failure branch June 2, 2026 15:37
@github-project-automation github-project-automation Bot moved this from In review to Done in Human Board Jun 2, 2026
berendt pushed a commit that referenced this pull request Jun 9, 2026
Several commands wait for a Celery task's output, hit TimeoutError,
log the timeout, and then fall through to the end of take_action with
an implicit None return. cliff turns that into exit code 0
(return_code = self.take_action(parsed_args) or 0), so a command that
gave up waiting for its task still reports success. That makes the
affected commands unsafe in set -e scripts and && chains, where a
timed-out sync silently looks like it completed.

This is the same class of bug PR #2313 fixed for osism get, but on the
timeout path rather than the inventory-query path. The affected sites:

- reconciler.Sync
- validate.Run._handle_task (the value take_action returns)
- netbox.Ironic
- netbox.Sync

Each now returns 1 from the TimeoutError handler, so a timeout yields a
non-zero exit. Adds unit tests asserting exit 1 when fetch_task_output
raises TimeoutError, mirroring PR #2313's failure-path tests.

Related-Bug: #2314
AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
berendt pushed a commit that referenced this pull request Jun 9, 2026
osism report memory, lldp, bgp and status each load the Ansible
inventory by running ansible-inventory via subprocess.run. On a
non-zero return code ("Error loading inventory.") and on
subprocess.TimeoutExpired ("Timeout loading inventory.") they logged
the error and then bare-returned None. cliff turns that into exit
code 0 (return_code = self.take_action(parsed_args) or 0), so a report
whose inventory could not be loaded at all still exited successfully.
That makes these commands unsafe in set -e scripts and && chains.

These are the direct twins of the bug PR #2313 fixed for osism get.
All eight inventory-load failure paths now return 1.

The neighbouring "No hosts found in inventory." branch is deliberately
left at exit 0: the query ran fine and simply returned no hosts, which
is a valid empty result rather than an operational failure, matching
the convention established in PR #2313.

Adds unit tests for all four commands covering the failure path (a
non-zero ansible-inventory return code yields exit 1), the timeout
path (TimeoutExpired yields exit 1) and the empty-result path (an empty
inventory yields exit 0).

Part of issue #2314 (Category B).

Related-Bug: #2314
AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
berendt pushed a commit that referenced this pull request Jun 9, 2026
Several commands look up a resource, log an error or warning when it
cannot be found, and then bare-return None. cliff turns that into exit
code 0 (return_code = self.take_action(parsed_args) or 0), so a failed
lookup silently looks like success and the commands are unsafe in
set -e scripts and && chains.

All the affected not-found / unresolved-lookup paths now return 1:

- compute.ComputeMigrationList: no user domain, no user, no project
  domain, no project, multiple servers matched, no server found.
- server.ServerList: no user domain, no user, domain not found,
  project domain not found, project not found.
- volume.VolumeList: domain not found, project domain not found,
  project not found.
- netbox.Console / netbox.Dump: NetBox integration not configured
  (the operation cannot run without it), and Dump's device not found.
- baremetal Deploy, Undeploy, BurnIn, Clean, Provide, MaintenanceSet,
  MaintenanceUnset, PowerOn, PowerOff and Delete: the single named node
  could not be found (conn.baremetal.find_node returned nothing).

Adds unit tests for each command driving the not-found path and
asserting exit 1, mirroring PR #2313.

Part of issue #2314 (Category C and Category D not-found lookups).

Related-Bug: #2314
AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants