Interesting issues encountered with the Azure Connected Machine Agent

Strange things happen when you reach enough machines

One of the components I work on is the Azure Connected Machine Agent, an agent that can be installed on many different Windows versions and Linux distributions. The catch with software that customers install on their own machines is that operating systems can be configured in so many different ways, so you start to find quite obscure issues that crop up on only a tiny fraction of the install base. Here are some of their stories.

Ceci n’est pas un pipe

One customer reported that on some of their machines, the agent was unable to start. The logs showed that we bailed out while trying to listen on a named pipe. At startup, the agent service creates a named pipe, \\.\pipe\himds, which azcmagent.exe uses to talk to it: for example, when you run azcmagent connect, we send a message through this pipe to the service to configure it.
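The agent’s own implementation isn’t shown here, so the following is only a minimal Python sketch of the pattern this paragraph describes: a service listens on a local IPC endpoint, and a command-line client connects to send it one request. It uses the standard library’s multiprocessing.connection module, which creates a Windows named pipe for family "AF_PIPE" and, so that the sketch also runs on Linux or macOS, a Unix domain socket for "AF_UNIX". The endpoint name himds-demo, the JSON message shape and the function names are all hypothetical.

```python
import json
import os
import sys
import tempfile
import threading
from multiprocessing.connection import Client, Listener


def ipc_address(name: str) -> tuple[str, str]:
    """Pick a local IPC endpoint: a named pipe on Windows, a Unix socket elsewhere."""
    if sys.platform == "win32":
        # Upper-case PIPE, following the workaround described later in this post.
        return "\\\\.\\PIPE\\" + name, "AF_PIPE"
    # Fresh temp directory so a stale socket file from an earlier run can't block bind().
    return os.path.join(tempfile.mkdtemp(), name + ".sock"), "AF_UNIX"


def serve_one(listener: Listener) -> None:
    """Service side (hypothetical): accept one client, read a JSON request, reply."""
    with listener.accept() as conn:
        request = json.loads(conn.recv_bytes())
        # A real service would apply the requested configuration here.
        reply = {"status": "ok", "handled": request.get("command")}
        conn.send_bytes(json.dumps(reply).encode())


def send_command(address: str, family: str, command: str) -> dict:
    """CLI side (hypothetical): connect to the service endpoint and send one request."""
    with Client(address, family=family) as conn:
        conn.send_bytes(json.dumps({"command": command}).encode())
        return json.loads(conn.recv_bytes())


if __name__ == "__main__":
    address, family = ipc_address("himds-demo")
    # The Listener is bound before the client connects, so there is no startup race.
    with Listener(address, family=family) as listener:
        server = threading.Thread(target=serve_one, args=(listener,))
        server.start()
        print(send_command(address, family, "connect"))
        server.join()
```

Exchanging JSON via send_bytes/recv_bytes rather than send/recv avoids unpickling data from whatever process happens to connect to the endpoint.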

On the affected machines, the system call to create the pipe failed. After a lot of debugging, we determined that these machines had been configured with a non-default value for a specific local security policy: Computer Configuration >> Windows Settings >> Security Settings >> Local Policies >> Security Options >> "System objects: Require case insensitivity for non-Windows subsystems".

The docs for this setting say:

“This security setting determines whether case insensitivity is enforced for all subsystems. The Win32 subsystem is case insensitive. However, the kernel supports case sensitivity for other subsystems, such as POSIX.

If this setting is enabled, case insensitivity is enforced for all directory objects, symbolic links, and IO objects, including file objects. Disabling this setting does not allow the Win32 subsystem to become case sensitive.”

So you would expect this setting to affect only programs that use the POSIX subsystem in Windows, a vanishingly small number nowadays. Our code certainly doesn’t use it. Yet somehow, switching this setting from the default, Enabled, to Disabled changes the behavior of named pipes.

In my humble opinion, this is a Windows bug. Named pipes are not part of the POSIX subsystem, and the documentation doesn’t indicate that ‘pipe’ is case sensitive. However, we found that if we changed the pipe name in our code to \\.\PIPE\himds, everything worked regardless of the value of this setting. Changing one string in our code is a lot simpler than getting a fix backported to many millions of Windows servers, so that’s what we did.

DCOMissioned

Another customer found that our agent was unable to start on one out of 2000 of their machines. This time, the logs showed a failure while trying to listen on a local TCP port: the agent listens on 127.0.0.1:40342 for IMDS (Instance Metadata Service) requests. On this one machine, the listen() call was failing. When we ran the same code as a regular user instead of as a service, it worked as expected.
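To make the failing step concrete, here is a minimal Python sketch (not the agent’s code) of binding and listening on the loopback address from the text. The helper name listen_loopback, the backlog value and the error message format are hypothetical. Reporting the raw error code (errno, plus winerror on Windows) is what lets a log distinguish, for example, a port that is already in use from a permission failure.

```python
import socket

# Loopback address and port taken from the text; the agent serves IMDS requests here.
IMDS_HOST = "127.0.0.1"
IMDS_PORT = 40342


def listen_loopback(port: int = IMDS_PORT, backlog: int = 16) -> socket.socket:
    """Bind to a loopback TCP port and start listening, reporting failures clearly.

    Hypothetical helper for illustration; not the agent's actual implementation.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((IMDS_HOST, port))
        sock.listen(backlog)
    except OSError as exc:
        sock.close()
        # winerror only exists on Windows; errno is present on every platform.
        winerror = getattr(exc, "winerror", None)
        raise RuntimeError(
            f"cannot listen on {IMDS_HOST}:{port}: errno={exc.errno} "
            f"winerror={winerror} ({exc.strerror})"
        ) from exc
    return sock


if __name__ == "__main__":
    # Port 0 asks the OS for any free port, so this demo does not clash with a real agent.
    server = listen_loopback(0)
    print("listening on", server.getsockname())
    server.close()
```

Passing port 0 lets the OS choose a free port, which is handy for exercising the helper on a machine where the real agent already owns 40342.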

Again, after some fairly extensive debugging, we found an error message in the Event Log:

“The application-specific configuration settings do not grant Local Activation permission for the COM server with CLSID {D63B10C5-BB46-4990-A94F-E40B9D520160} and AppID {9CA88EE3-ACB7-47c8-AFC4-AB702511C276}…”

This was strange. We don’t use COM or DCOM at all in the Azure Connected Machine Agent. Why was this happening?

We were able to resolve it by using dcomcnfg.exe to adjust the permissions on the RuntimeBroker component. By default, all applications get Read access to this component, but that entry was missing on this particular server. Sadly, I still haven’t figured out why there’s any connection between this COM component and socket operations, or what changed that setting on this one machine.