Updated Microsoft has vowed to reduce cybersecurity vendors’ reliance on kernel-mode code, which was at the heart of the CrowdStrike super-snafu this month.
Redmond shared a technical incident response write-up on Saturday – titled “Windows Security best practices for integrating and managing security tools” – in which veep for enterprise and OS security David Weston explained how Microsoft measured the impact of the disaster: By accessing crash reports shared by customers.
But of course, as Weston noted, not every Windows customer shares crash reports.
“It’s worth noting the number of devices which generated crash reports is a subset of the number of impacted devices previously shared by Microsoft,” he wrote. Which means the IT giant produced that estimate of 8.5 million Windows computers affected by the CrowdStrike snafu without a crash report from every single one of them. The software giant has not detailed the methodology used to calculate the figure.
Weston’s post justifies how Windows performed, on the grounds that kernel-level drivers – like those employed by CrowdStrike – can improve performance and prevent tampering with security software. He noted, however, that infosec vendors must weigh those benefits against potential negative impacts on resilience.
If kernel-mode code breaks, like what happened with CrowdStrike when its Falcon suite tried to parse a bad configuration file pushed to millions of Windows machines, the resulting crash will take out the whole operating system and its applications.
Thus, the more that can be done outside the kernel, the better; if that processing goes off the rails in user mode, the rest of the system should at least keep ticking along, and the failure can be handled gracefully.
This is because Windows kernel mode is a powerful, trusted environment in which code runs close to the hardware and there isn’t much in the way of guardrails; it’s the software that manages your devices, keeps CPU cores busy with work from applications, and keeps programs and users separate from each other as needed, among other tasks.
It’s a good place for malware detection engines to run, in the form of kernel drivers, as they get good visibility of the whole computer to sniff out intrusions and other threats.
But the downside is that if these engines are compromised or break down, they can knock over the whole box or, worse, open the system up to further attack. Hence the suggestion to move auxiliary functions, such as config file parsing, out of the kernel and into userspace where damage will be limited.
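To make the userspace argument concrete, here is a minimal sketch in Python of a supervisor that hands config parsing to a separate process. It is an illustration of the general idea only, not CrowdStrike's or Microsoft's design: the JSON config format, the `PARSER_SOURCE` script, and the `parse_in_child` helper are all hypothetical names invented for this example.

```python
import subprocess
import sys

# Hypothetical parser that runs in its own process and reads a config from stdin.
# It has no error handling on purpose: bad input makes the child process die.
PARSER_SOURCE = """
import json, sys
config = json.loads(sys.stdin.read())
print(len(config["rules"]))
"""


def parse_in_child(config_text, timeout_s=30.0):
    """Parse config_text in a child process.

    Returns the number of rules on success, or None if the child crashed
    or hung. Either way, the calling process survives.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", PARSER_SOURCE],
            input=config_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # a hung parser is treated the same as a crashed one
    if result.returncode != 0:
        return None  # the child died; the supervisor is still running
    return int(result.stdout.strip())
```

Calling `parse_in_child('{"rules": [1, 2, 3]}')` returns 3, while feeding it junk returns `None` and the caller carries on, perhaps by keeping the last known good config. The same fault inside a kernel driver has no such supervisor to fall back on.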
Moving that work out of the kernel matters especially in the case of CrowdStrike, whose digitally signed driver-level code – ordinarily approved by Microsoft – is extended by data files pushed out in the form of updates; one rogue update will undo whatever trust Windows had in CrowdStrike’s kernel-level code.
The Falcon driver in this instance was a file system filter driver, which normally allows the antivirus product to look out for malicious file operations; the bad file update this month caused that driver to access memory it shouldn’t have done, triggering an out-of-bounds read exception and system crash.
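For readers who want to see what guarding against an out-of-bounds read looks like, here is a generic sketch in Python. It does not reflect Falcon's internals, which this article does not detail; the `read_field` and `evaluate_rule` functions and the list-of-fields layout are assumptions made purely for illustration. Python itself refuses to read past the end of a list, so the explicit check below stands in for what C or C++ kernel code has to do by hand.

```python
def read_field(fields, index):
    """Return fields[index], or None if index falls outside the supplied data.

    The range check also rejects negative indices, which Python would
    otherwise silently wrap around to the end of the list.
    """
    if not 0 <= index < len(fields):
        return None
    return fields[index]


def evaluate_rule(fields, wanted_index, expected):
    """Hypothetical rule check: does the field at wanted_index equal expected?"""
    value = read_field(fields, wanted_index)
    if value is None:
        # Malformed content: skip the rule rather than read past the data.
        return False
    return value == expected
```

With two fields supplied, asking for index 1 works as normal, while asking for index 5 results in the rule being skipped instead of the program touching memory it shouldn't.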
“Since kernel drivers run at the most trusted level of Windows, where containment and recovery capabilities are by nature constrained, security vendors must carefully balance needs like visibility and tamper resistance with the risk of operating within kernel mode,” as Weston put it.
He observed that security vendors can find the right balance.
“For example, security vendors can use minimal sensors that run in kernel mode for data collection and enforcement, limiting exposure to availability issues,” he explained. “The remainder of the key product functionality includes managing updates, parsing content, and other operations can occur isolated within user mode where recoverability is possible.”
That arrangement, he suggested, “demonstrates the best practice of minimizing kernel usage while still maintaining a robust security posture and strong visibility.”
Are you taking notes, CrowdStrike?
This is also a good time to note that CrowdStrike did try to test its bad update before its release, though that validation pipeline failed to detect, flag up, and block the corrupted data from going out to everyone. All the more reason to put this kind of config parsing code into user mode, rather than sensitive kernel mode, and all the more reason for CrowdStrike to improve its testing practices with sandboxing and whatnot.
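As a rough idea of what a pre-release gate might check, consider the Python sketch below. It is not CrowdStrike's validation pipeline: the JSON update format, the `EXPECTED_FIELDS` constant, and the `validate_update` and `release_gate` functions are hypothetical. The point is only that an update failing a structural check never leaves the building.

```python
import json

EXPECTED_FIELDS = 3  # hypothetical: how many fields every rule must supply


def validate_update(raw):
    """Return a list of problems found in raw update bytes.

    An empty list means the update passed every check in this sketch.
    """
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return [f"not valid JSON: {exc}"]

    rules = doc.get("rules") if isinstance(doc, dict) else None
    if not isinstance(rules, list) or not rules:
        return ["missing or empty 'rules' list"]

    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, list) or len(rule) != EXPECTED_FIELDS:
            problems.append(f"rule {i}: expected {EXPECTED_FIELDS} fields")
    return problems


def release_gate(raw):
    """Allow publication only when validation found no problems."""
    return not validate_update(raw)
```

A real pipeline would go further and load the candidate update into the actual product on sandboxed test machines, since a structural check can only catch the mistakes its authors thought of.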
Whether CrowdStrike can easily separate its configuration parsing from its detection code is another story; we hope the vendor is at least mulling it over.
Weston also reminded readers that Redmond runs an industry forum called the Microsoft Virus Initiative (MVI) in which security vendors and the OS giant work together to “define reliable extension points and platform improvements, as well as share information about how to best protect our customers.”
The Microsoft veep listed the many security-related enhancements Microsoft has made over the years, and revealed the software megalith now plans “to work with the anti-malware ecosystem to take advantage of these integrated features to modernize their approach, helping to support and even increase security along with reliability.”
That work will involve four efforts, namely:
- Providing safe rollout guidance, best practices, and technologies to make it safer to perform updates to security products;
- Reducing the need for kernel drivers to access important security data;
- Providing enhanced isolation and anti-tampering capabilities with technologies like recently announced VBS enclaves;
- Enabling zero trust approaches like high integrity attestation which provides a method to determine the security state of the machine based on the health of Windows native security features.
Point two seems aimed at ensuring a CrowdStrike-like event becomes less likely in future.
Weston didn’t explain how that reduced dependence will be delivered – some re-jigging of Windows will likely be needed to make it happen.
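On the first item in that list, safe rollout, Microsoft's post as quoted here gives no specifics, so the Python sketch below shows just one common approach: pushing an update out in rings and halting as soon as a ring looks unhealthy. The `staged_rollout` function, its `deploy` callback, and the one percent failure threshold are all assumptions for illustration.

```python
def staged_rollout(rings, deploy, max_failure_rate=0.01):
    """Deploy ring by ring, stopping at the first ring that fails too often.

    rings: list of lists of host names, smallest (canary) ring first.
    deploy: callable taking a host name, returning True if the host is
            healthy after the update and False otherwise.
    Returns (number_of_rings_completed, halted).
    """
    for completed, ring in enumerate(rings):
        failures = sum(1 for host in ring if not deploy(host))
        if ring and failures / len(ring) > max_failure_rate:
            # Too many hosts fell over: later rings never get the update.
            return completed, True
    return len(rings), False
```

If every canary in the first ring falls over, the function returns `(0, True)` and the larger rings are never touched, which limits the damage to a handful of machines rather than millions.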
Microsoft and Windows have a long and inglorious history of security snafus. If Redmond’s changes go awry, it won’t have CrowdStrike to blame for any new problems. ®
Update at 1930 UTC
This article’s headline and text were revised because, to be blunt, we took Microsoft’s disclosure about the crash report count being a subset to be an admission or suggestion that the IT giant low-balled the number of Windows machines affected by the CrowdStrike snafu. On second thoughts, we don’t know that for sure, and so we’re happy to clarify the article without that assumption accordingly.