Skip to content

Migrate/Stop VMs with local storage when preparing host for maintenance#4212

Merged
DaanHoogland merged 11 commits into
apache:masterfrom
CLDIN:migrate-local-vms-on-maint
Apr 19, 2021
Merged

Migrate/Stop VMs with local storage when preparing host for maintenance#4212
DaanHoogland merged 11 commits into
apache:masterfrom
CLDIN:migrate-local-vms-on-maint

Conversation

@GabrielBrascher

@GabrielBrascher GabrielBrascher commented Jul 16, 2020

Copy link
Copy Markdown
Member

Description

This PR adds a global settings parameter that configures the strategy for handling VMs in local storage when putting a host in maintenance.

Global settings name: host.maintenance.local.storage.strategy
Description: Defines the strategy towards VMs with volumes on local storage when putting a host in maintenance. The default strategy is 'Error', preventing maintenance in such a case. To migrate away VMs running on local storage choose 'Migrating' strategy. To stop VMs, choose 'Stopping' strategy.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

How Has This Been Tested?

  • Strategy Error:
  1. put a host into maintenance
  2. Maintenance fails and the host gets into ErrorInMaintenance state
  • Strategy Stopping:
  1. put a host into maintenance
  2. VMs with volumes on shared storage are migrated, VMs in local storage are stopped, the host gets into Maintenance
  • Strategy Migrating:
  1. put a host into maintenance
  2. VMs with volumes on shared and on local storage are migrated, the host gets into Maintenance

@GabrielBrascher GabrielBrascher added this to the 4.15.0.0 milestone Jul 16, 2020
@GabrielBrascher GabrielBrascher self-assigned this Jul 16, 2020
@GabrielBrascher GabrielBrascher marked this pull request as ready for review July 16, 2020 14:37
Comment thread server/src/main/java/com/cloud/vm/UserVmManagerImpl.java Outdated

@DaanHoogland DaanHoogland left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see not much wrong with this code but am wondering whether and how this would impact all hypervisors. It was probably intended for KVM, knowing your business, so i trust that is fine. any marvin (so we can automate for other platforms)?

Comment thread server/src/main/java/com/cloud/vm/UserVmManagerImpl.java Outdated
@GabrielBrascher

GabrielBrascher commented Aug 5, 2020

Copy link
Copy Markdown
Member Author

@DaanHoogland I did not answer your Marvin question, sorry for the delay.

To be honest, I had no plans on creating Marvin tests for this. I can take a look at it, though.

@GabrielBrascher

Copy link
Copy Markdown
Member Author

@andrijapanicsb

Copy link
Copy Markdown
Contributor

@GabrielBrascher what was the behaviour so far? I believe we were stopping all VMs, right?

I would be happy to see the this new feature to implement the old behaviour as the defaul behaviour - to keep backward compatibility, if that makes sense

@GabrielBrascher

Copy link
Copy Markdown
Member Author

@andrijapanicsb normally VMs are (live) migrated when a host is put into Maintenance; however, if the host has VMs with local storage the host is placed in the "ErrorInMaintenance" state; the ADMIN then needs to manually stop/migrate these VMs and then re-try the Maintenance.

Following the same concern that you raised, this implementation keeps the current behavior by default, therefore does not cause backward compatibility to any deployed Zone.

To change the behavior the Root Admin needs to configure the global settings parameter host.maintenance.local.storage.strategy on its own risk.

This parameter defines the strategy towards VMs with volumes on local storage when putting a host in maintenance.

  1. The default strategy is 'Error', preventing maintenance in such a case.
  2. To migrate away VMs running on local storage it is necessary to set Migrating strategy.
  3. To stop VMs with local storage instead of placing the host into ErrorInMaintenance, it is necessary to change the strategy to Stopping.

@andrijapanicsb

Copy link
Copy Markdown
Contributor

that sounds good @GabrielBrascher - thx for confirming.
Logic/behaviour is LGTM - but I haven't tested it.
How does it behave with other hypervisors? I trust you've tested KVM only?

@GabrielBrascher

Copy link
Copy Markdown
Member Author

@andrijapanicsb I tested with KVM only indeed.

I just re-checked the maintenance behavior with VMs running on shared storage. The zone holds one of the recent commits at the master (not older than one week).

Tested on KVM node with NFS as primary storage:

  1. the host has only VMs running with Root disk placed on shared storage
  2. call prepareHostForMaintenance API command
  3. all VMs are live migrated to another host
  4. host gets into Maintenance state

@wido wido left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on code and discussions with Gabriel

@apache apache deleted a comment from GabrielBrascher Nov 10, 2020
@apache apache deleted a comment from blueorangutan Nov 10, 2020
@apache apache deleted a comment from blueorangutan Nov 10, 2020
@apache apache deleted a comment from yadvr Nov 10, 2020
@apache apache deleted a comment from blueorangutan Nov 10, 2020
@apache apache deleted a comment from blueorangutan Nov 10, 2020
@apache apache deleted a comment from yadvr Nov 10, 2020
@apache apache deleted a comment from blueorangutan Nov 10, 2020
@apache apache deleted a comment from blueorangutan Nov 10, 2020
@GabrielBrascher

Copy link
Copy Markdown
Member Author

Code has been updated adding the option to Stop VM with force stop or a simple stop. Strategies are the following:

  1. The default strategy is 'Error', preventing maintenance in such a case
  2. Choose 'Migration' strategy to migrate away VMs running on local storage
  3. To stop VMs, choose 'Stop' strategy : _haMgr.scheduleStop(vm, hostId, WorkType.Stop);
  4. To force-stop VMs, choose 'ForceStop' strategy: _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);

@GabrielBrascher

Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2852

@GabrielBrascher

Copy link
Copy Markdown
Member Author

Running tests. The strategy added for Stop is failing; StopForce works fine.

@DaanHoogland

Copy link
Copy Markdown
Contributor

@blueorangutan test keepEnv

@blueorangutan

Copy link
Copy Markdown

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Copy link
Copy Markdown

Trillian test result (tid-3639)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33275 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4212-t3639-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Smoke tests completed. 85 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 60.50 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 51.58 test_vm_life_cycle.py

@DaanHoogland

Copy link
Copy Markdown
Contributor

@GabrielBrascher sounds like you are not ready to have this merged, do you?

@GabrielBrascher

Copy link
Copy Markdown
Member Author

Code has been updated reverting the change to allow choosing between Stop or ForceStop. It turns out that the implementation got a bit tricky when HA management calls a VM Stop at the maintenance workflow. For now, keeping only VM ForceStop.

Fitting the "graceful" Stop into the current maintenance workflow might be a good enhancement for another PR, but for now, keeping the strategy feature as it is:

  1. The default strategy is 'Error', preventing maintenance in such a case
  2. Choose 'Migration' strategy to migrate away VMs running on local storage
  3. To force-stop VMs, choose the 'ForceStop' strategy

Manual tests are looking good.

@GabrielBrascher

Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2862

@yadvr

yadvr commented Mar 6, 2021

Copy link
Copy Markdown
Member

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-2875

@GabrielBrascher

Copy link
Copy Markdown
Member Author

@blueorangutan package

@blueorangutan

Copy link
Copy Markdown

@GabrielBrascher a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress. [S]

@blueorangutan

Copy link
Copy Markdown

Packaging result: ✔️ centos7 ✔️ centos8 ✔️ debian. SL-JID 94

@blueorangutan

Copy link
Copy Markdown

[S] Trillian test result (tid-102)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 35571 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4212-t102-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Smoke tests completed. 85 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 77.42 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 53.18 test_vm_life_cycle.py

@GabrielBrascher

GabrielBrascher commented Apr 14, 2021

Copy link
Copy Markdown
Member Author

Considering that this implementation only touches at doMaintaintand prepareForMaintenance methods I think that the test failures at test_01_migrate_VM_and_root_volume and test_02_migrate_VM_with_two_data_disks are not related to any of the added commits in this PR.

Thanks to all the reviewers @DaanHoogland @andrijapanicsb @rhtyd @wido @RodrigoDLopez; this one looks ready for merge. Are there any concerns that are missing my attention?

@DaanHoogland DaanHoogland merged commit de55766 into apache:master Apr 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants