Issue

While upgrading one of our production NSX 4.1.2.1 environments, I ran into an interesting error when trying to upgrade our bare metal Tier 0 edge nodes.

Prepare edge upgrade bundle https://10.10.0.1/repository/4.1.2.3.0.23382408/Edge/nub/VMware-NSX-edge-4.1.2.3.0.23382424.nub failed on edge TransportNode UUID: clientType EDGE , target edge fabric node id UUID, return status Download and verify bundle failed with msg: Checking upgrade bundle /var/vmware/nsx/file-store/VMware-NSX-edge-4.1.2.3.0.23382424.nub contents Verifying bundle VMware-NSX-edge-4.1.2.3.0.23382424.bundle with signature VMware-NSX-edge-4.1.2.3.0.23382424.bundle.sig Failed to verify bundle: ['gpg', '--homedir', '/root/.gnupg', '--verify', '/tmp/tmpc75b6zcx/VMware-NSX-edge-4.1.2.3.0.23382424.bundle.sig', '/tmp/tmpc75b6zcx/VMware-NSX-edge-4.1.2.3.0.23382424.bundle'] returned 2: b"gpg: Signature made Mon 26 Feb 2024 10:40:20 PM MST\ngpg: using RSA key E51BDAAAFDF4DC95\ngpg: Can't check signature: No public key\n" .

However, my T1 edge nodes (which were set to upgrade first) had no problem with the same upgrade package. What a mystery!

Challenges!

Luckily, NSX upgrades are generally non-destructive when they fail. My T1s carried on just fine even though my T0s failed to upgrade. Better still, the T0s themselves kept running just fine. I just couldn’t upgrade them, and a retry produced the same error.

Unfortunately, I couldn’t find any blog posts explaining why this was happening. Additionally, with all the changes surrounding the Broadcom acquisition of VMware, opening a GSS ticket is painful. So down the rabbit hole I went.

Troubleshooting

I did my googling due diligence and started hunting for this message:

Can't check signature: No public key

That led me to a series of articles that were only somewhat helpful. I already knew these T0s couldn’t validate the package because they didn’t have the public key. I just needed to dig a bit deeper to find out where the public keys might be hiding. Luckily, I have multiple NSX environments, with multiple T1s and NSX Managers to dig through and compare. The error message also helped by pointing to where I should look:

Failed to verify bundle: ['gpg', '--homedir', '/root/.gnupg', '--verify', '/tmp/tmpc75b6zcx/VMware-NSX-edge-4.1.2.3.0.23382424.bundle.sig'

Interesting! I can surmise from the arguments sent that the public keys might be found in the /root/.gnupg folder.
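A quick way to test that hunch is to ask gpg directly whether the key named in the error is present in that keyring. The sketch below is illustrative only and was not part of the original troubleshooting; the function names extract_key_id and keyring_has_key are made up for this example, and it assumes gpg is on the PATH of the node you run it on.

```shell
#!/bin/sh
# Hypothetical helpers, not part of NSX or GnuPG.

# Pull the signing key ID out of gpg's "using <algo> key <ID>" line.
# Reads gpg output (or the pasted NSX error text) on stdin.
extract_key_id() {
    sed -n 's/.*using [A-Za-z0-9]* key \([0-9A-Fa-f]\{8,\}\).*/\1/p' | head -n 1
}

# Ask gpg whether a given homedir knows a key.
# $1 = gpg homedir, $2 = key ID. Returns non-zero if the key is absent.
keyring_has_key() {
    gpg --homedir "$1" --list-keys "$2" >/dev/null 2>&1
}
```

For example, saving the error text to a file and running `extract_key_id < error.txt` should print the key ID from the message, and `keyring_has_key /root/.gnupg E51BDAAAFDF4DC95 || echo "key missing"` reports whether that keyring can see it.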

I checked that folder on my T0s. Unfortunately, all I saw was one small pubring.kbx file. So I compared against my T1s and my NSX Managers, which had quite a few more files:

drwx------ 3 root root 4.0K Feb 26 22:26 .
drwx------ 4 root root 4.0K Feb 26 22:28 ..
drwx------ 2 root root 4.0K Feb 26 22:26 private-keys-v1.d
-rw-r--r-- 1 root root 1.4K Feb 26 22:26 pubring.kbx
-rw-r--r-- 1 root root  544 Feb 26 22:26 pubring.kbx~
srwx------ 1 root root    0 Feb 26 22:26 S.gpg-agent
srwx------ 1 root root    0 Feb 26 22:26 S.gpg-agent.browser
srwx------ 1 root root    0 Feb 26 22:26 S.gpg-agent.extra
srwx------ 1 root root    0 Feb 26 22:26 S.gpg-agent.ssh
-rw------- 1 root root 1.2K Feb 26 22:26 trustdb.gpg

On the failing T0s, however, the pubring.kbx file was much, much smaller, and none of the other files listed above were there. Something was strange. Plus… what the heck is a kbx file? So I googled “gpg kbx file”.

Quickly my search returned this page: https://www.gnupg.org/documentation/manuals/gnupg/kbxutil.html

Handy. Now I know how to look inside my kbx file.

kbxutil ~/.gnupg/pubring.kbx

Findings

Just as I suspected above, the contents of the pubring.kbx file on the failing T0s were nowhere near those of any other T1, T0 or NSX Manager I checked across my environments. This file clearly didn’t have the contents it needed.

Comparing the file contents across my other systems, the only difference I could see was that the checksum differed on some (but not all) of them. The rest of the contents matched, so I took a leap and guessed that the checksum differences could be due to any number of reasons, and that the rest of the file content is what matters.
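That comparison can be made less eyeball-driven by saving the kbxutil output from two nodes to text files and diffing them with the checksum lines filtered out. This is a minimal sketch, not what was actually run: the function name is hypothetical, and it assumes the dump labels that field with the word "Checksum", as described above.

```shell
#!/bin/sh
# Hypothetical helper: diff two saved kbxutil dumps, ignoring checksum lines.
# Assumption: checksum lines contain the word "Checksum" (matched case-insensitively).
# $1, $2 = text files holding kbxutil output from two nodes.
# Returns 0 if the remaining lines match, non-zero otherwise.
diff_dumps_ignoring_checksum() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # grep exits 1 when it prints nothing, which is fine here.
    grep -vi 'checksum' "$1" > "$a" || true
    grep -vi 'checksum' "$2" > "$b" || true
    rc=0
    diff "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

To use it, run something like `kbxutil /root/.gnupg/pubring.kbx > /tmp/node-a.txt` on each node, copy the two text files to one place, and call `diff_dumps_ignoring_checksum node-a.txt node-b.txt`. Any lines it prints are real content differences rather than checksum noise.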

Fix

Using SCP, I copied the pubring.kbx file from one of my other T0s in another datacenter into the /root/.gnupg folder of the T0s that failed to upgrade. I backed up the original pubring.kbx file first, then replaced it with the one I took from the healthy T0.
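The backup-and-replace step can be sketched as a small shell function. This is an illustration under assumptions, not the exact commands used: replace_keyring is a made-up name, the replacement file is assumed to already be on the node (for example, copied there with scp), and mode 644 is chosen to match the permissions shown in the directory listing earlier.

```shell
#!/bin/sh
# Hypothetical helper: back up the existing keyring, then install the replacement.
# $1 = replacement pubring.kbx already copied to this node
# $2 = gpg homedir (the error message points at /root/.gnupg)
replace_keyring() {
    new_file=$1
    homedir=$2
    target="$homedir/pubring.kbx"

    # Refuse to install a missing or empty replacement.
    [ -s "$new_file" ] || { echo "replacement missing or empty: $new_file" >&2; return 1; }

    # Keep a timestamped copy of the original so the change can be rolled back.
    if [ -e "$target" ]; then
        cp -p "$target" "$target.bak.$(date +%Y%m%d%H%M%S)" || return 1
    fi

    # Write under a temporary name, then rename, so the file is never half-written.
    cp "$new_file" "$target.new" &&
        chmod 644 "$target.new" &&
        mv "$target.new" "$target"
}
```

On a failing node this would be called as `replace_keyring /tmp/pubring.kbx /root/.gnupg`, where /tmp/pubring.kbx is wherever the copy from the healthy T0 landed.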

I checked the contents once more with the kbxutil ~/.gnupg/pubring.kbx command to make sure the file still read correctly after the replacement.

After doing so, I went back to my NSX UI and retried the upgrade. Success!