With mascots this janky, what would you expect?

So I've been working on deploying private cloud stacks using TripleO on RHEL 7.4. Just to preface, I have nothing against OpenStack or TripleO, but OH MAN is it a pain when things aren't working the way you expect them to.

Today was one of those days... Every time I tried deploying a 10 node stack, things just didn't work. The stack was simple: 3 controllers, 3 Ceph storage nodes, and 4 compute nodes. The only output I got from the failed stacks was "There are not enough hosts available, Code: 500". BLEH! That message is the least descriptive error output for something so complex... I should also mention that I'm using Ironic and Heat to deploy these bare metal nodes, so yeah, it's complicated.
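When that generic 500 comes back from the scheduler, the undercloud's Nova logs are usually the first place worth looking, since the conductor log tends to carry the actual "no valid host" reason. This is just a rough sketch of what I grep for, assuming the default non-containerized log locations on a RHEL 7 undercloud:

$ sudo grep -i "no valid host" /var/log/nova/nova-conductor.log | tail
$ sudo grep -i "returned 0 hosts" /var/log/nova/nova-scheduler.log | tail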

Fortunately, at least I know the deployment process for Heat and Ironic, which (loosely) is as follows:

  • Phase 1 - Orchestration (heat and nova services)
  • Phase 2 - Bare Metal Provisioning (ironic service)
  • Phase 3 - Post-Deployment Configuration (Puppet).

So, I was able to see that it failed during Phase 1 (Orchestration). In most cases, Heat shows the failed overcloud resources after the overcloud fails to deploy (the commands I usually run for that are below). That helped me figure out what happened, which narrows it down quite a bit because there are only a couple of things to actually check: 1) Heat templates and 2) configuration options. I'd already deployed stacks with the templates used for this deployment, so I knew those weren't the problem.
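For reference, this is roughly how I pull the failed resources out of Heat; treat the exact flags as a sketch, since they can vary a bit between client versions:

$ openstack stack failures list overcloud --long
$ openstack stack resource list overcloud --nested-depth 5 | grep -i failed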

So then I started checking the properties on the bare metal nodes,
and wouldn't ya know it, the node capabilities were completely bork'd on three of them. The capabilities metadata for a node after a clean-up should look like this:

$ openstack baremetal node show compute1 | grep prop
| properties             | {u'capabilities': u'boot_option:local,profile:compute'}

However, out of the 10 nodes, the 3 with issues that brought the entire deployment to a screeching halt looked like this:

$ openstack baremetal node show compute2 | grep prop
| properties             | {u'capabilities': u'cpu_vt:true,cpu_hugepages:true,boot_option:local,cpu_txt:true,cpu_aes:true,cpu_hugepages_1g:true,profile:compute'}

The issue here was obviously the metadata associated with the Ironic nodes, but now what to do? Well, OpenStack Ironic nodes do a fairly decent job at self-healing, although not always. Luckily, this was as simple as placing the nodes into maintenance mode, setting them to manageable and back to available, and redeploying the stack. It just took forever to get that done, and that was what was so painful. The hurry up and wait game. Till next time kids!
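For anyone hitting the same thing, the state cycle I'm describing looks roughly like this. Consider it a sketch: compute2 is just the node from the example above, and the explicit capabilities reset is an optional extra in case your nodes don't clean themselves up on their own:

$ openstack baremetal node maintenance set compute2
$ openstack baremetal node set compute2 --property capabilities='boot_option:local,profile:compute'   # optional: reset the capabilities by hand
$ openstack baremetal node maintenance unset compute2
$ openstack baremetal node manage compute2    # available -> manageable
$ openstack baremetal node provide compute2   # manageable -> (cleaning) -> available
$ openstack overcloud profiles list           # sanity check that profiles match flavors again

The provide step kicks off cleaning again, which is where most of the hurry-up-and-wait comes from.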