VR: fix rsyslog compresses log files but not release disk space in VR #4869

Merged: yadvr merged 1 commit into apache:4.15 from ustcweizhou:4.15-fix-vr-rsyslog on Apr 1, 2021
Conversation

@ustcweizhou

Contributor

Description

We recently hit a critical issue with VRs: the VRs of shared networks and VPCs stop working after some days.
After investigation, I found that the disk is full:

root@r-10-VM:~# df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/vda5        2086316 2069932         0 100% /

logrotate/rsyslog has compressed the log files, but the space is not released because rsyslogd still holds the rotated files open. See lsof | grep deleted:

root@r-10-VM:~# lsof |grep deleted
rsyslogd    960                      root   12w      REG              254,5 493060096        137 /var/log/daemon.log.1 (deleted)
rsyslogd    960                      root   13w      REG              254,5  17715200        110 /var/log/messages.1 (deleted)
rsyslogd    960                      root   16w      REG              254,5 545968128        342 /var/log/auth.log.1 (deleted)
rsyslogd    960                      root   18w      REG              254,5  38313984        341 /var/log/cron.log.1 (deleted)
rsyslogd    960  962 in:imuxso       root   12w      REG              254,5 493060096        137 /var/log/daemon.log.1 (deleted)
rsyslogd    960  962 in:imuxso       root   13w      REG              254,5  17715200        110 /var/log/messages.1 (deleted)
rsyslogd    960  962 in:imuxso       root   16w      REG              254,5 545968128        342 /var/log/auth.log.1 (deleted)
rsyslogd    960  962 in:imuxso       root   18w      REG              254,5  38313984        341 /var/log/cron.log.1 (deleted)
rsyslogd    960  963 in:imklog       root   12w      REG              254,5 493060096        137 /var/log/daemon.log.1 (deleted)
rsyslogd    960  963 in:imklog       root   13w      REG              254,5  17715200        110 /var/log/messages.1 (deleted)
rsyslogd    960  963 in:imklog       root   16w      REG              254,5 545968128        342 /var/log/auth.log.1 (deleted)
rsyslogd    960  963 in:imklog       root   18w      REG              254,5  38313984        341 /var/log/cron.log.1 (deleted)
rsyslogd    960  964 in:imfile       root   12w      REG              254,5 493060096        137 /var/log/daemon.log.1 (deleted)
rsyslogd    960  964 in:imfile       root   13w      REG              254,5  17715200        110 /var/log/messages.1 (deleted)
rsyslogd    960  964 in:imfile       root   16w      REG              254,5 545968128        342 /var/log/auth.log.1 (deleted)
rsyslogd    960  964 in:imfile       root   18w      REG              254,5  38313984        341 /var/log/cron.log.1 (deleted)
rsyslogd    960  965 in:imudp        root   12w      REG              254,5 493060096        137 /var/log/daemon.log.1 (deleted)
rsyslogd    960  965 in:imudp        root   13w      REG              254,5  17715200        110 /var/log/messages.1 (deleted)
rsyslogd    960  965 in:imudp        root   16w      REG              254,5 545968128        342 /var/log/auth.log.1 (deleted)
rsyslogd    960  965 in:imudp        root   18w      REG              254,5  38313984        341 /var/log/cron.log.1 (deleted)
rsyslogd    960  966 rs:main         root   12w      REG              254,5 493060096        137 /var/log/daemon.log.1 (deleted)
rsyslogd    960  966 rs:main         root   13w      REG              254,5  17715200        110 /var/log/messages.1 (deleted)
rsyslogd    960  966 rs:main         root   16w      REG              254,5 545968128        342 /var/log/auth.log.1 (deleted)
rsyslogd    960  966 rs:main         root   18w      REG              254,5  38313984        341 /var/log/cron.log.1 (deleted)
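
The total space held this way can be estimated from the listing above. The following is a minimal sketch, not part of this PR: `sum_deleted_bytes` is a hypothetical helper name, and it assumes the column layout shown above (size, inode, path, then `(deleted)` as the last four fields) and paths without spaces. Because lsof repeats each open file once per thread, the sketch counts each inode/path pair only once.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): sum the bytes still held by
# deleted-but-open files, reading `lsof` output on stdin.
# Assumes the last four fields are SIZE/OFF, NODE, NAME, "(deleted)".
sum_deleted_bytes() {
    awk '
        NF >= 4 && $NF == "(deleted)" {
            key = $(NF-2) ":" $(NF-1)      # inode:path identifies one file
            if (!(key in seen)) {          # skip the repeated thread rows
                seen[key] = 1
                total += $(NF-3)           # SIZE/OFF column, in bytes
            }
        }
        END { printf "%.0f\n", total }
    '
}

# Example usage on a VR (requires lsof):
#   lsof | sum_deleted_bytes
```

For the four files in the listing above, the sizes add up to 1095057408 bytes, which is about 1 GiB of the roughly 2 GiB root filesystem.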

Workaround: restart rsyslog to release the space.

systemctl restart rsyslog

The root cause is that the following command does not work in the 4.15 template:

root@r-10-VM:~# invoke-rc.d rsyslog rotate
[FAIL] Closing open files: rsyslogd failed!

Fix: use /usr/lib/rsyslog/rsyslog-rotate instead:

root@r-10-VM:~# /usr/lib/rsyslog/rsyslog-rotate
root@r-10-VM:~# cat /usr/lib/rsyslog/rsyslog-rotate

if [ -d /run/systemd/system ]; then
    systemctl kill -s HUP rsyslog.service
else
    invoke-rc.d rsyslog rotate > /dev/null
fi
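
For templates where the helper script may be absent, one option is to check for it before calling it. This is a minimal sketch and not the change made in this PR: `pick_rotate_cmd` is a hypothetical name, and the fallback to `invoke-rc.d rsyslog rotate` is an assumption about what an image without the helper would need.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): print the command a logrotate
# postrotate hook could run to make rsyslogd reopen its log files.
# $1 overrides the helper path; the default is the path named in this PR.
pick_rotate_cmd() {
    helper=${1:-/usr/lib/rsyslog/rsyslog-rotate}
    if [ -x "$helper" ]; then
        printf '%s\n' "$helper"
    else
        # Assumed fallback for images without the helper script.
        printf '%s\n' "invoke-rc.d rsyslog rotate"
    fi
}

# Example usage inside a postrotate hook:
#   $(pick_rotate_cmd) > /dev/null
```

Taking the path as a parameter keeps the check easy to exercise outside a VR, for example against a temporary file.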

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@ustcweizhou

Contributor Author

@rhtyd @DaanHoogland
This is a critical issue. I am not sure whether it also exists in 4.14.

By the way, do we have a plan for 4.15.1.0 or 4.15.0.1?

@yadvr

yadvr commented Mar 29, 2021

Member

Yes @ustcweizhou, we already have a plan for 4.15.1.0 in effect, see https://markmail.org/message/deicvofukqom2w5u

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ centos7 ✔️ centos8 ✔️ debian. SL-JID 279

@yadvr

yadvr commented Mar 29, 2021

Member

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-290)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 49841 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4869-t290-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 81 look OK, 5 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestKubernetesCluster>:teardown | Error | 71.94 | test_kubernetes_clusters.py |
| test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true | Failure | 306.55 | test_routers_network_ops.py |
| test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | Failure | 305.44 | test_routers_network_ops.py |
| test_01_migrate_VM_and_root_volume | Error | 70.24 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 49.06 | test_vm_life_cycle.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Failure | 492.69 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 483.83 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Failure | 392.17 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 418.08 | test_vpc_redundant.py |
| test_hostha_kvm_host_fencing | Error | 651.60 | test_hostha_kvm.py |
| test_hostha_kvm_host_recovering | Error | 664.73 | test_hostha_kvm.py |

yadvr self-assigned this Mar 30, 2021
@yadvr

yadvr commented Mar 30, 2021

Member

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

DaanHoogland left a comment

Contributor

Code looks good, but is this compatible with the updated svm template as well?

@blueorangutan

Trillian test result (tid-303)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 47268 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4869-t303-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 82 look OK, 4 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestKubernetesCluster>:teardown | Error | 72.10 | test_kubernetes_clusters.py |
| test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true | Failure | 303.47 | test_routers_network_ops.py |
| test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | Failure | 303.72 | test_routers_network_ops.py |
| test_01_migrate_VM_and_root_volume | Error | 67.21 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 50.09 | test_vm_life_cycle.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Failure | 492.55 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 428.49 | test_vpc_redundant.py |

yadvr requested a review from DaanHoogland April 1, 2021 06:59