After reading Patching VSE - risk level, you may have a mind to test a patch prior to production adoption!
The purpose of this post is to help unlock the minds of those tasked with patching VSE installations and solving reported issues. It provides insight into the testing and adoption practices you could, and probably should, use in your own environment if you are not already doing something similar. It is not intended to provide actual test plans for you to follow.
|If you assume McAfee's environment is adequate to represent your own environment, that's like assuming yellow snow is from lemons...|
|[Reading Tip] This post is lengthy, so feel free to refer to the Links here to break up your reading.|
High Level Topics
Plan for testing in these areas (explained further below). Each of these sections could be explored in much greater detail, but for this blog's purpose it's enough to raise awareness of the type of work that could be undertaken prior to patching VSE, for a better patching experience:
- Take inventory
- Pilot / User acceptance
"We just installed your patch and now Outlook runs really slow"... rolls off the tongue but that statement is LOADED with all kinds of questions needing answers. Did the installation succeed? Did the Outlook mail scanner install? What version of Outlook was it? Is that version supported with this patch? Is it slow to launch, slow to receive mail, send mail, what's slow? Slow for everyone or just this user, or just this computer? And that is just the beginning.
e.g. "We're piloting the patch and it seems to have installed OK, detects eicar and shows up in ePO, but we see Outlook hangs when new mail is received. It affects all Users, and Outlook must be restarted to recover. We're pretty sure it's your fault since it only happened after installing the patch. Yes, we've rebooted since installing."
If you're not sure what information is helpful, that's OK, let McAfee Support ask questions (some testing may also be involved) to work towards identifying the problem.
Is it worth testing at all if you don't know what's in your environment? Okay, it is, but it makes you wonder how much more effective your test efforts would be if you knew what to test, or any high risk areas where you need to focus your testing.
Don't know? That's scary, but here are some clues to the types of software that install kernel drivers: Firewall, Encryption, Backup, Metering, Monitoring, Inventory, Security - including AV, data loss prevention, host intrusion, application control, lots of possibilities within that category.
You'll want a lab representation of that setup if you want to be able to explore what new code does to those sensitive systems. This can be very challenging, so unless management antes up the cash to empower you here, risk may be unavoidably high. If so, prepare a Business Continuity Plan: one that minimizes downtime given a worst-case scenario _and_ takes into consideration that technical folks need captured data in order to tell you why "disaster X" happened - your BCP should ensure systems are configured to generate the needed troubleshooting data while a problem is present (pretty please?).
"Activate the Omega 13!"
The more you know about what is present in the environment, the better equipped you will be to strategically formulate a comprehensive Test Plan.
Knowing that plan, you'll be able to project the manpower needed for 100% test coverage. With that, you can make trade-offs as necessary to meet deadline dates, all the while advising how much risk a date requirement imposes on the project (did you hear that evil laugh just now?), or advising how many more heads are needed to meet a date (job creation!).
|Create an installation matrix that describes your environment. It will help knowing the answers to the questions raised in Inventory above. Here's a basic example of tracking installation tests, using totally made-up data:|
|Operating Systems|8.7i P3|8.7i P4|8.7i P5|8.8 < P2|8.8 P2|8.8 P4|8.8 P4 + 929019|Notes|
|---|---|---|---|---|---|---|---|---|
|**North America - NY (COE image)**|||||||||
|WinXPsp3|0/10|N/A|N/A|N/A|10/10|10/10|10/10|8.7i P3 systems failing to upgrade. Ask McAfee.|
|Win7sp1|N/A|0/10|0/10|0/10|10/10|N/A|N/A|ToDo: Behind schedule. Can't patch < P2 systems.|
|Svr2008 R2|N/A|N/A|8/10|N/A|10/10|N/A|9/10|Did we get data of the failures?|
|**North America - San Diego (COE image)**|||||||||
In this example we are tracking the result of 10 installation tests for each operating system, and for each VSE+Patch combination that exists in the environment. Of course, you don't have to perform an installation test 10 times - this is imaginary data, but you'll want to perform some number of tests that provides confidence to support deployment of the patch. With this layout you can see at a glance in the Notes column any work that still needs to be done for each region. The table can be expanded to incorporate non-standard image instances wherever they may exist too.
With this type of table you could also track the progress of deployments within the entire organization; in particular I like being able to see how many nodes of a certain OS exist, and how many are still running a specific version of VSE. This is data you can extrapolate from ePO... which makes me wonder if one could leverage ePO's reporting capability to present and track this information for you. I like that idea.
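If you pull the raw results out of ePO (or any inventory source), flagging the combinations that still need attention is straightforward. A minimal sketch with invented records - the field names and data are illustrative assumptions, not an ePO schema:

```python
from collections import defaultdict

# (operating_system, vse_version, installs_passed, installs_attempted)
results = [
    ("WinXPsp3",  "8.7i P3",  0, 10),
    ("WinXPsp3",  "8.8 P2",  10, 10),
    ("Win7sp1",   "8.8 P2",  10, 10),
    ("Svr2008R2", "8.7i P5",  8, 10),
]

# Pivot into an OS x version matrix, like the table above.
matrix = defaultdict(dict)
for os_name, version, passed, total in results:
    matrix[os_name][version] = (passed, total)

# Anything short of 100% pass belongs in the Notes/follow-up column.
todo = sorted((os_name, version)
              for os_name, cells in matrix.items()
              for version, (passed, total) in cells.items()
              if passed < total)
print(todo)  # -> [('Svr2008R2', '8.7i P5'), ('WinXPsp3', '8.7i P3')]
```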
"Make it so!"
|How do you know the installation was successful?|
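One quick functional check is the industry-standard EICAR test string: write it to disk and see whether the on-access scanner reacts. A hedged sketch - the file path and wait time are arbitrary choices, and on a protected system the write itself may be blocked, which also counts as a detection:

```python
import os
import time

# The standard 68-byte EICAR anti-malware test string.
EICAR = (r"X5O!P%@AP[4\PZX54(P^)7CC)7}$"
         "EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*")

def drop_eicar(path="eicar_test.com", wait_seconds=5):
    """Write the EICAR file, give the scanner a moment, then report
    whether the file survived (True means no detection occurred)."""
    with open(path, "w") as f:
        f.write(EICAR)
    time.sleep(wait_seconds)
    survived = os.path.exists(path)
    if survived:
        os.remove(path)  # clean up on unprotected systems
    return survived
```

On a system where the patch installed correctly and On-Access Scan is enabled, you'd expect the file to be cleaned or quarantined before the check runs.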
|Create a compatibility matrix that describes software in your environment. It will help knowing the answers to the questions raised in Inventory above. Here's a basic example of compatibility testing, using totally made-up data:|
|App1|VSE relevant functionality|Notes|
|---|---|---|
|App Operation A|Result per OS we use| |
|App Operation B|Result per OS we use| |
|App Operation C|Result per OS we use| |
|Function A|Result per OS we use| |
|Function B|Result per OS we use| |
|Function C|Result per OS we use| |

|VPN Client|VSE relevant functionality|Notes|
|---|---|---|
|Startup, Service initializes|DAT update over VPN| |
|Startup, Tray icon loads|Patch update over VPN| |
|Connect, established|Deploy VSE over VPN|WSC issue? Ask MSFT.|
|Connect, fail w/ old DAT|Remote console via VPN|Console still broken.|
|Connect, fail w/ old Engine|EmailScan test|Slow but expected. Should we disable this?|
|Connect, fail w/ OAS off|File share access| |
|Reconnect, after sleep|File copy to Server|VPN Fails on XP|
|Disco + Reconnect|File copy from Server| |
Test efforts on your part are a contributing factor to Risk Management. If you or your company is willing to accept any degree of Risk then testing is not necessary, but in my opinion that would be a really bad idea: the potential business impact from a software-related issue, be it McAfee-specific or induced through McAfee's interaction with another application, can be crippling. Still, no matter how much testing you do there will still be Risk, so at some point you need to recognize where testing concludes and piloting begins - setting "exit criteria" is helpful in that regard.
"I'm testing VSE with Skyrim"
|Create a Performance test matrix. Include a comparative data point, like a baseline and/or prior patch level behavior. It will help knowing the answers to the questions raised in Inventory above. Here's a basic example of network-related performance testing, using totally made-up data, along with product-specific tests you could run associated with network activity:|
_Network Testing_

|(AVG 5 runs in sec)|Pre Patch|Post Patch|Delta/Notes|
|---|---|---|---|
|Boot-to-ping time|30|32|2s, < 10% delta, tolerable|
|Boot-to-block time|33|33| |
|Block TCP test|OK|OK| |
|Block UDP test|OK|OK| |
|Allow TCP test|OK|OK| |
|Allow UDP test|OK|OK| |
|Data Xfer <10mb|3.2|3.0| |
|Data Xfer >1gb|45|180|Bad test? Bad file? Bug? SMB issue? Timeout?|

_VSE-specific Network Testing_

|(Corp std. configuration)|Baseline|Current Patch|New Patch|Notes|
|---|---|---|---|---|
|ODS task (mapped drive U)|N/A|16:33:00 (hh:mm:ss)|08:25:00|Like Patch 2 again!|
|ODS task (mapped drive Z)|N/A|00:34:50|00:05:22|Finally.|
|Update task (gem, incremental)|N/A|00:02:10|00:01:50| |
|Update task (zip, full DAT)|N/A|00:23:00|00:23:00| |
|Copy A to B - 10mb 1 file|5.4s|5.8s|5.2s| |
|Copy B to A - 10mb 50 files|5.4s|6.0s|5.2s| |
|Copy A to B - 2gb 50 files|33s|48s|33s| |
|Copy B to A - 2gb 1 file|36s|54s|54s|I thought this was fixed?|

"This data is: i² = -1"
When doing performance tests it's a good idea to work with the average score of multiple test runs. And if files are involved, run the test multiple times, because subsequent runs can and will benefit from our scan cache - whose sole purpose in life is to help the product be more efficient, but it can only benefit you if it's being used. Actions you take may reset the cache (or some of its contents) and complicate your test results; actions like DAT updates, booting to Safe Mode, rebooting via improper shutdown, or disabling the scanner temporarily can all cause cached data to be lost, making performance results appear worse than what would happen realistically. Still, you might want to look at the performance behavior of "first time seeing the file" too, so you know what to expect should cached data be lost.
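As a sketch of the bookkeeping, here's one way to average runs and report the delta against a pre-patch baseline. The timings are invented, and discarding the first (cold-cache) run is just one convention for letting the scan cache settle:

```python
def average_runs(runs, discard_first=True):
    """Mean of timing runs in seconds, optionally dropping run #1
    so cold-cache behavior doesn't skew the steady-state average."""
    sample = runs[1:] if discard_first and len(runs) > 1 else runs
    return sum(sample) / len(sample)

pre_patch  = [6.1, 5.3, 5.4, 5.5, 5.4]   # first run is cold cache
post_patch = [6.4, 5.8, 5.9, 5.7, 5.8]

pre  = average_runs(pre_patch)
post = average_runs(post_patch)
delta_pct = (post - pre) / pre * 100
print(f"{pre:.2f}s -> {post:.2f}s ({delta_pct:+.1f}%)")
```

Keeping the first-run numbers separately (`discard_first=False`) tells you what to expect when cached data is lost.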
This is where most time should be spent when patching VSE - I mean, the task will take more time to complete, not that it will take up more of your time... although that could still happen, sorry. The prior testing efforts are to boost your confidence and reduce Risk to acceptable levels so that you can begin installing the patch into the production environment. Create an adoption plan; it should include a Pilot.
Prepare Pilot Participants
"I love this tool!"
The McAfee Agent is a key piece in the whole McAfee solution, and when you face an issue, whatever its nature, McAfee Support will often need you to set LogLevel=8 and dwDebugScript=2 so that we can see more detail of the communication between point product and agent. Don't forget to increase the LogSize too. You ought to consider keeping these debug settings enabled for the environment and making it a global change.
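For reference, here are those agent debug settings gathered in one place. Where they live (registry values vs. agent configuration) depends on your McAfee Agent version, and the LogSize value below is purely an assumed example - confirm the right location and size with McAfee Support:

```python
# Agent debug settings from the text above; the LogSize value is an
# assumption, not an official recommendation.
debug_settings = {
    "LogLevel": 8,        # verbose agent logging
    "dwDebugScript": 2,   # extra detail for product<->agent communication
    "LogSize": 8192,      # enlarged log size so detail isn't rolled over
}

# Render as name=value pairs for a change ticket or runbook.
for name, value in debug_settings.items():
    print(f"{name}={value}")
```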
"We will be deploying a patch update to your system today, between <start time> and <start time + Randomization window> as part of a Pilot program.
The process will be seamless to you; however, should you notice any abnormal behavior from your system during the mentioned times, please do the following -
Investigate Pilot Issues
If you're in the dark about what data is of interest or how to obtain it, perhaps it's a good idea to work with McAfee Support to steer you in the right direction - I know how frustrating it can be to spend hours - even days - trying to get some data and then learn it wasn't useful, or was done incorrectly.
"Once again, with feeling."
If Users have not reported any issues to you, it's still a good idea to follow up with them and nag for some feedback since "no issues" feedback is better than "none."
Workarounds vs. Solutions
Which group do you belong to? I'm a subscriber to the latter, and I also believe that when software fails there's some action that can be taken to avoid or resolve that failure -
The primary goal is to maintain business continuity. What are the options? What - you don't know!? You already established a business continuity plan, right? Of course you did, so execute it. But! Depending on the issue you're experiencing, you may be able to massage that BCP to make life easier for you - wouldn't that be nice? Perhaps it's not necessary to re-image the system, or uninstall and reinstall older software with reboots etc., or other hefty BCP tasks - granted, those are often the quickest resolution so you can't be faulted for taking those routes when under pressure. However, if time allows, or as the investigation of an issue proceeds and reveals root-cause or even clarifies the circumstances of the failure, options may arise wherein you do not have to take extreme measures.
A workaround option may exist. A workaround is not a solution, but it can be a resolution.
Then identify from those options what is in keeping with security standards, policies, audit requirements etc. You may need to consider other factors too, like man-power involved to administer the modified BCP and other contingencies should the updated BCP fail.
Don't forget, we're responding to issues found during the Pilot, so we're not talking thousands of nodes - only tens, perhaps hundreds if you're the adventurous type. Oh, what if that was the issue: you accidentally deployed the change to ALL NODES... yeah, that happens. Don't be that guy/gal.
"It was me...It was me..."
|When an issue has been root-caused and a solution deemed appropriate or necessary to come from McAfee, we will work with you to the best of our ability in resolving the issue - which may include identifying helpful workarounds while awaiting a code fix. Needless to say patience is always appreciated while we work toward solving any reported issue.|
For whichever patch release you are adopting, always, always review the Known Issues KB article for that release. You can find it from our KB51111 article - just look for your patch release within the table, and you'll see the column for the Known Issues article. Some issues may have hotfixes available that solve them, in which case you'd need to factor that hotfix into your planning and testing. Other issues might tell you to avoid deployment to certain types of systems where you know you'll encounter a compatibility or interoperability issue - or maybe you'll see there's a workaround that needs to be grafted into your plans so you can avoid such an issue.
This piece of your overall Patching effort is crucial to success. We're making a concerted effort to keep you informed, but it's only as effective as you make it.
Thanks for reading!