After reading Patching VSE - risk level, you may be inclined to test a patch prior to production adoption!

The purpose of this post is to help unlock the minds of those tasked with patching VSE installations and solving reported issues; to provide insight into the testing and adoption practices you could, and probably should, use in your own environment if you are not already doing something similar.  It is not intended to provide actual test plans for you to abide by.
If you assume McAfee's environment adequately represents your own, that's like assuming yellow snow comes from lemons...
[Reading Tip]  This post is lengthy, so feel free to refer to the Links here to break up your reading.


High Level Topics
Plan for testing in these areas (explained further below). Be mindful that each of these sections could be explored in much greater detail, but for this blog's purpose it's sufficient to raise awareness of the type of work that could be undertaken prior to patching VSE, for a better patching experience:

"We just installed your patch and now Outlook runs really slow"... rolls off the tongue but that statement is LOADED with all kinds of questions needing answers.  Did the installation succeed? Did the Outlook mail scanner install? What version of Outlook was it? Is that version supported with this patch? Is it slow to launch, slow to receive mail, send mail, what's slow? Slow for everyone or just this user, or just this computer? And that is just the beginning.

The more intelligence you gather about a problem, the better you understand it and the better you can convey it to others, like McAfee Support, if need be.

e.g. "We're piloting the patch and it seems installed OK, detects eicar and shows up in ePO but we see Outlook hangs when new mail is received. It affects all Users, and Outlook must be restarted to recover. We're pretty sure it's your fault since it only happened after installing the patch. Yes we've rebooted since installing ."

If you're not sure what information is helpful, that's OK, let McAfee Support ask questions (some testing may also be involved) to work towards identifying the problem.

 

Inventory
Top

Is it worth testing at all if you don't know what's in your environment?  Okay, it is, but consider how much more effective your test efforts would be if you knew what to test, and which high-risk areas need the focus of your testing.

  • What operating systems exist in the environment?  Any Windows XP still (we'll miss you XP)?  What about Vista (yikes!)?
  • What systems are 32-bit, which ones are 64-bit?
  • Do we still have single-processor systems?
  • What hardware? Desktops, servers, laptops, virtual, handheld?
  • Is one device shared by multiple users, or does each user pretty much own their device?
  • How many servers?  What roles do the servers have?
  • What 3rd party applications do we run in-house?
  • Are Users Administrators? (Did he really ask that? Someone's asking for it alright!)
  • How many of these applications have kernel-mode components?
Don't know? That's scary, but here are some clues to the types of software that install kernel drivers: firewall, encryption, backup, metering, monitoring, inventory, security - including AV, data loss prevention, host intrusion, application control; lots of possibilities within that category. (A sketch at the end of this section shows one way to enumerate them.)
  • Which systems are Mission Critical?  What's installed on them?  How are they used?
You'll want a lab representation of that setup if you want to be able to explore what new code does to those sensitive systems. This can be very challenging, so unless management antes up the cash to empower you here, risk may be unavoidably high. If so, prepare a Business Continuity Plan - one that minimizes downtime given a worst-case scenario _and_ takes into consideration that technical folks need captured data in order to tell you why "disaster X" happened. Your BCP should ensure systems are configured to generate needful troubleshooting data when a problem is present (pretty please?).
"Activate the Omega 13!"
The more you know about what is present in the environment, the better equipped you will be to strategically formulate a comprehensive test plan.
Knowing that plan, you'll be able to project the manpower needed for 100% test coverage, and with that, make trade-offs as necessary to meet deadline dates - all the while advising how much risk a date requirement imposes upon the project (did you hear that evil laugh just now?), or how many more heads are needed to meet a date (job creation!)
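If the kernel-driver question above has you stumped, a quick way to start is to enumerate what's actually loaded on a few representative nodes. Below is a minimal sketch - not an official tool - using Windows' built-in driverquery command plus Python's standard library; the keyword list is purely illustrative, so substitute the vendor and product names from your own environment.

    # Inventory sketch: enumerate installed kernel drivers on a Windows node.
    import csv
    import io
    import platform
    import subprocess

    def loaded_drivers():
        """Return driver records from the built-in 'driverquery' command."""
        out = subprocess.run(
            ["driverquery", "/v", "/fo", "csv"],
            capture_output=True, text=True, check=True,
        ).stdout
        return list(csv.DictReader(io.StringIO(out)))

    # Illustrative only - use the driver/vendor names from your environment.
    KEYWORDS = ("firewall", "encrypt", "backup", "monitor", "vpn", "filter")

    print(f"{platform.node()}: {platform.system()} {platform.release()} ({platform.machine()})")
    for drv in loaded_drivers():
        name = (drv.get("Display Name") or "").lower()
        if any(k in name for k in KEYWORDS):
            print(drv.get("Module Name"), "-", drv.get("Display Name"), "-", drv.get("State"))

Run something like this across a sample of desktops and servers and you'll quickly learn which 3rd-party drivers deserve a row in your test plan.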

 

Installation
Top
Create an installation matrix that describes your environment.  It will help knowing the answers to the questions raised in Inventory above.  Here's a basic example of tracking installation tests, using totally made-up data:
Operating System | VSE 8.7i              | VSE 8.8                           | Notes
                 | P3    P4    P5        | < P2   P2     P4     P4 + 929019  |
North America - NY (COE image)
WinXPsp3         | 0/10  N/A   N/A       | N/A    10/10  10/10  10/10        | 8.7i P3 systems failing to upgrade. Ask McAfee.
Win7sp1          | N/A   0/10  0/10      | 0/10   10/10  N/A    N/A          | ToDo: Behind schedule. Can't patch < P2 systems.
Svr2008 R2       | N/A   N/A   8/10      | N/A    10/10  N/A    9/10         | Did we get data of the failures?
North America - San Diego (COE)
Win8.1           | N/A   N/A   N/A       | N/A    N/A    10/10  N/A          | OK
Svr2012          | N/A   N/A   N/A       | N/A    N/A    N/A    10/10        | OK
Svr2012 R2       | N/A   N/A   N/A       | N/A    N/A    N/A    10/10        | OK

In this example we track the result of 10 installation tests for each operating system, and for each VSE + Patch combination that exists in the environment. Of course, you don't have to perform an installation test 10 times - this is imaginary data - but you'll want to perform enough tests to give you confidence to support deployment of the patch. With this layout you can see at a glance, in the Notes column, any work that still needs to be done for each region. The table can be expanded to incorporate non-standard image instances wherever they exist, too.

With this type of table you could also track the progress of deployments within the entire organization; in particular I like being able to see how many nodes of a certain OS exist, and how many are still running a specific version of VSE.  This is data you can extrapolate from ePO... which makes me wonder whether one could leverage ePO's reporting capability to present and track this information for you. I like that idea.
"Make it so!"
How do you know the installation was successful?
  • The node detects the EICAR test file.
  • The node reports the patch level to ePO (if you are ePO managed), or you can check the "About" window from VSE's console.
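The EICAR check is easy to script across a pilot fleet. This sketch writes the industry-standard EICAR test string and then checks whether the on-access scanner blocked, removed, or denied access to it; the path, wait time, and pass/fail logic are assumptions to adapt (e.g. your policy may quarantine rather than delete).

    # Verification sketch: did the on-access scanner react to EICAR?
    import os
    import time

    # The EICAR string is split in two so this script's own file doesn't
    # trip the scanner when saved to disk.
    EICAR = r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS" + r"-TEST-FILE!$H+H*"
    TEST_FILE = r"C:\Temp\eicar_test.com"   # hypothetical location

    try:
        with open(TEST_FILE, "w") as fh:
            fh.write(EICAR)
    except OSError:
        print("PASS: the write itself was blocked")
    else:
        time.sleep(10)                      # give the on-access scanner a moment
        if not os.path.exists(TEST_FILE):
            print("PASS: test file was removed")
        else:
            try:
                with open(TEST_FILE) as fh:
                    fh.read()
                print("FAIL: test file still present and readable")
            except OSError:
                print("PASS: access to the test file is denied")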

 

Compatibility
Top
Create a compatibility matrix that describes software in your environment.  It will help knowing the answers to the questions raised in Inventory above.  Here's a basic example of compatibility testing, using totally made-up data:
Template:

App1              | VSE relevant functionality | Notes
App Operation A   | Function A                 |
App Operation B   | Function B                 |
App Operation C   | Function C                 |
(each cell holds a result per OS we use: XPsp3 / W7 / W8 / W8.1)

Example, in action with imaginary data:

VPN Client                    | VSE relevant functionality | Notes
Startup, Service initializes  | DAT update over VPN        |
Startup, Tray icon loads      | Patch update over VPN      |
Connect, established          | Deploy VSE over VPN        | WSC issue? Ask MSFT.
Connect, fail w/ old DAT      | Remote console via VPN     | Console still broken.
Connect, fail w/ old Engine   | EmailScan test             | Slow but expected. Should we disable this?
Connect, fail w/ OAS off      | File share access          |
Reconnect, after sleep        | File copy to Server        | VPN fails on XP.
Disco + Reconnect             | File copy from Server      |
(each cell holds a result per OS we use: XPsp3 / W7 / W8 / W8.1)

  • On the left is a basic template; on the right is a basic example of it in action using imaginary data.
  • It caters to seeing test results per OS platform you might have, which can show what has not yet been tested - or where the testing effort has deliberately been reduced to focus on the "majority" platforms, hastening a release schedule with calculated risk.
  • Put more focused VSE compatibility testing in areas where 3rd-party kernel drivers exist, and on software that either depends on file accessibility or generates a lot of file I/O.
Your test efforts are a contributing factor to risk management.  If you or your company accepts risk of any degree then testing is not strictly necessary, but skipping it would be a really bad idea in my opinion, since the business impact that could fall out from a software-related issue - be it McAfee-specific or induced through McAfee's interaction with another application - can be crippling. Still, no matter how much testing you do there is still going to be risk, so at some point you need to recognize where testing concludes and piloting begins - setting "exit criteria" is helpful in that regard.
"I'm testing VSE with Skyrim"

 

Performance
Top
Create a performance test matrix. Include a comparative data point, like a baseline and/or prior patch-level behavior.  It will help knowing the answers to the questions raised in Inventory above.  Here's a basic example of network-related performance testing, using totally made-up data, along with product-specific tests associated with network activity that you could run:
Network Testing (avg of 5 runs, in sec) | Pre Patch | Post Patch | Delta / Notes
Boot-to-ping time                       | 30        | 32         | 2s, < 10% delta, tolerable
Boot-to-block time                      | 33        | 33         |
Block TCP test                          | OK        | OK         |
Block UDP test                          | OK        | OK         |
Allow TCP test                          | OK        | OK         |
Allow UDP test                          | OK        | OK         |
Data Xfer < 10mb                        | 3.2       | 3.0        |
Data Xfer > 1gb                         | 45        | 180        | Bad test? Bad file? Bug? SMB issue? Timeout?

VSE-specific Network Testing (corp std. configuration) | Baseline | Current Patch       | New Patch | Notes
ODS task (mapped drive U)                              | N/A      | 16:33:00 (hh:mm:ss) | 08:25:00  | Like Patch 2 again!
ODS task (mapped drive Z)                              | N/A      | 00:34:50            | 00:05:22  | Finally.
Update task (gem, incremental)                         | N/A      | 00:02:10            | 00:01:50  |
Update task (zip, full DAT)                            | N/A      | 00:23:00            | 00:23:00  |
Copy A to B - 10mb, 1 file                             | 5.4s     | 5.8s                | 5.2s      |
Copy B to A - 10mb, 50 files                           | 5.4s     | 6.0s                | 5.2s      |
Copy A to B - 2gb, 50 files                            | 33s      | 48s                 | 33s       |
Copy B to A - 2gb, 1 file                              | 36s      | 54s                 | 54s       | I thought this was fixed?

"This data is: i² = -1"
  • The table shows basic operations users and applications perform in your environment - use cases - which can be tested to gauge the performance of those operations, comparing a defined "normal" or baseline with your current accepted standard and with the new patch.  Spend some time thinking through what your users and/or applications do in your environment so you can identify what ought to be tested. Then: can testing be automated (see the timing sketch after the next paragraph)? Maybe you have 3rd-party performance testing tools you rely on?
  • From the results, you can identify areas of concern that warrant additional investigation.  Challenge your own findings - was the test valid, was the sample data corrupt, has something other than the new patch been newly introduced? Maybe you've found a bug; maybe the new patch is exposing a bug in another application that always existed but only now surfaced? Does this issue block you from continued testing? Does it block you from starting a pilot?
    You may need a resolution before this new code gets pushed out globally - and that's the purpose of putting in this work: to discover issues and anomalies before they become production headaches. So, job well done.
When doing performance tests it's a good idea to work with the average score of multiple test runs. And if files are involved, run the test multiple times, because subsequent runs can and will benefit from our scan cache - whose sole purpose in life is to help the product be more efficient, but it can only benefit you if it's being used. Actions you take may reset the cache (or some of its contents) and complicate your test results; DAT updates, booting to Safe Mode, rebooting via improper shutdown, or disabling the scanner temporarily can all cause cached data to be lost, and performance results to appear worse than they would realistically. Still, you might want to look at the performance of "first time seeing the file" too, so you know what to expect should cached data be lost.
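If you do automate (as suggested in the list above), build the cold-run/warm-run distinction into the harness. A minimal sketch, with hypothetical paths and run counts, that times a file-copy use case and reports the first (uncached) run separately from the warm average:

    # Performance sketch: time a file-copy use case over several runs.
    import shutil
    import statistics
    import time

    SRC = r"\\server\share\perf_sample"   # hypothetical folder of test files
    DST = r"C:\Temp\perf_copy_target"
    RUNS = 5

    timings = []
    for _ in range(RUNS):
        start = time.perf_counter()
        shutil.copytree(SRC, DST)
        timings.append(time.perf_counter() - start)
        shutil.rmtree(DST)                # reset for the next run

    # The first run misses the scan cache; later runs benefit from it.
    print(f"First (cold) run : {timings[0]:.1f}s")
    print(f"Warm average     : {statistics.mean(timings[1:]):.1f}s over {RUNS - 1} runs")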

 

Pilot
Top

This is where the most time should be spent when patching VSE - I mean, the task will take more time to complete, not that it will take up more of your time, ...although that could still happen, sorry.  The prior testing efforts are there to boost your confidence and reduce risk to acceptable levels so that you can begin installing the patch into the production environment.  Create an adoption plan; it should include a pilot.

  • The pilot involves installing the patch to participating systems - a controlled number of systems which you are able to manage and monitor, so that any reported issues can be investigated promptly. That would be ideal. A production pilot is the real crucible for patching; you're more likely to find issues here than during the testing efforts, because most test environments are a poor reflection of production environments.
  • Organizations vary greatly in size, and can vary greatly internally, such that specialized apps are used in one office but not another.  Plan your adoption program accordingly - will a couple of systems from each locale or business unit participate in the pilot, will you focus on one region/office at a time, or some other crafty approach, or tradition? As long as you're able to detect issues and/or get feedback from pilot participants over a period of time (suggested 2 - 4 weeks), the pilot will prove valuable.
  • The pilot should span a period of time that allows high confidence that the system has been well utilized - when you can say to yourself, "If it hasn't failed by now, it's not going to fail"... or something like that.  That confidence should rise with time, which is why I say 2 to 4 weeks; I also suggest that amount of time because some issues may take that long to surface (soak testing). When confidence has risen to an acceptable degree, the next phase of the pilot can begin, e.g. add another region or locale, or increase the number of participating nodes.
Prepare Pilot Participants
Top
  • Configure systems to save a complete (preferred) or kernel memory dump. Here's how. If the systems should BSOD, or hang and require forcing a BSOD, then you'll get a memory dump that has a chance of being investigated to discover root cause.
    NOTE: This preparation step requires a reboot after applying the change, if you haven't configured it already.  To force a BSOD, ensure a mechanism is in place to do so: Crash-on-Ctrl+Scroll, an NMI switch, or other equivalent. (A registry sketch for both settings follows this list.)
  • Have available to you these common troubleshooting tools:
    • Process Monitor, to capture data about an issue that is reproducible; i.e. Start Procmon, reproduce the issue, save the log.
      NOTE: Save ALL EVENTS in the Native PML format.
"I love this tool!"
    • Process Dump, for when it's a process that is crashing (or hanging) and not the whole system.
    • MER tool, our minimum escalation requirement data collection tool
  • Enable advanced logging options for the McAfee Agent
You ought to consider keeping these debug settings enabled across the environment, making it a global change.
The McAfee Agent is a key piece of the whole McAfee solution, and often when you face an issue, whatever its nature, McAfee Support will ask you to set LogLevel=8 and dwDebugScript=2 so that we can see more detail of the communications between the point product and the agent. Don't forget to increase the LogSize too.
    • LogLevel 8 for the Agent itself. See Solution 1. Increase the log file size too.
  • Notify users of participating systems...
"We will be deploying a patch update to your system today, between <start time> and <start time + Randomization window> as part of a Pilot program.
The process will be seamless to you, however, should you notice any abnormal behavior from your system during the mentioned times, please do the following -
  1. Note the time
  2. Describe the symptom(s) as best you can.  If the symptom has a duration, describe how long it lasts.
  3. Report your findings ASAP to <this.person@work.com>, who may follow up with you for clarification of any details. Avoid taking action that could jeopardize investigating the issue.
  4. For any issues experienced after <start time + Randomization window>, follow steps 1 - 3, until notified of the Pilot program ending."
    • What to expect
    • When to expect it
    • What to do when the Unexpected happens
    • What to do when the unexpected happens in future
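To script the memory-dump preparation at the top of this list, the two registry values involved are CrashDumpEnabled and CrashOnCtrlScroll. A minimal sketch using Python's winreg - run it elevated, reboot afterwards, and note that PS/2 keyboards use the i8042prt key instead of kbdhid:

    # Preparation sketch: set crash-dump type and enable Crash-on-Ctrl+Scroll.
    import winreg

    def set_dword(path, name, value):
        with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, path, 0,
                                winreg.KEY_SET_VALUE) as key:
            winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)

    # CrashDumpEnabled: 1 = complete (preferred), 2 = kernel.
    set_dword(r"SYSTEM\CurrentControlSet\Control\CrashControl",
              "CrashDumpEnabled", 1)

    # Allow forcing a BSOD with right Ctrl + Scroll Lock (x2) on USB keyboards.
    set_dword(r"SYSTEM\CurrentControlSet\Services\kbdhid\Parameters",
              "CrashOnCtrlScroll", 1)

    print("Crash dump settings applied - reboot for them to take effect.")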
Investigate Pilot Issues
Top
  1. Collect data
    When an issue is brought to your attention, take immediate action.  Depending on the nature of the issue, time may be a luxury, and data needed for the investigation may be disappearing from the affected system due to log file purging, or users taking matters into their own hands, like rebooting.
  2. What is the issue?
    The simple question with a not-so-simple answer, usually.  Did you discover the issue yourself, or was it reported by a user?  What data point or symptom led you to take action? You must follow that lead.  Since this initial lead could be anything, it's hard for this blog to explain exactly how to proceed, because each issue has its own optimal pathway toward identifying root cause, a workaround, or a solution.  So instead, generally speaking, here is what you should do for common types of issues:
  1. Collect MER data from the node - the one that's having the issue, not a different one you think is "just like it"
    The sooner the better, too.  If the issue leaves traces of bad behavior or errors in any of our product log files, that data can be lost/purged/overwritten if you don't collect it in time.
  2. Collect relevant data for the issue (a capture sketch follows this list) -
    1. BSOD = memory dump
    2. System Hang = forced memory dump
    3. Whole system is sluggish = tough call but assuming you cannot run any tools, forcing a memory dump is wise; perhaps the sluggishness is heading toward a hang
    4. Process Crash = let's hope it occurs again so you can capture a full dump with ProcDump -e -ma <process_name_to_monitor>
    5. Process Crashes on launch = run this command then start the process, ProcDump -e -ma -w <process_name_to_monitor>
    6. Process Hang = ProcDump -ma <process_that_is_hung>
    7. Process Hangs randomly = ProcDump -ma -h <process_that_hangs>
    8. Process using high CPU = this creates 6 dumps over 1 minute so we can see snapshots over time, ProcDump -ma -n 6 <process_name_to_monitor>. It may also be needful to force a BSOD.
    9. Process using high CPU randomly = ProcDump -ma -n 6 -c 90 <process_name_to_monitor>. It may also be needful to force a BSOD.
    10. Error message from McAfee (not any of the above) = Screenshot
    11. Error message from 3rd party (not any of the above) = Screenshot, and you'll want to investigate it with the 3rd party vendor. Don't be thinking McAfee can solve 3rd party errors; it's not our code failing so how can we tell you what the problem is?
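The ProcDump invocations above are easy to fumble under pressure, so you might wrap them in a tiny chooser script for first responders. A sketch - the scenario labels are our own invention, and it assumes Sysinternals ProcDump is on the PATH:

    # Capture sketch: map common issue types to the ProcDump command lines above.
    import subprocess
    import sys

    PROCDUMP_ARGS = {
        "crash":           ["-e", "-ma"],             # dump on unhandled exception
        "crash-launch":    ["-e", "-ma", "-w"],       # wait for the process to start
        "hang":            ["-ma"],                   # immediate full dump
        "hang-random":     ["-ma", "-h"],             # dump when a window hangs
        "high-cpu":        ["-ma", "-n", "6"],        # 6 dumps over ~1 minute
        "high-cpu-random": ["-ma", "-n", "6", "-c", "90"],  # when CPU > 90%
    }

    def capture(scenario, process_name):
        cmd = ["procdump", "-accepteula", *PROCDUMP_ARGS[scenario], process_name]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=False)

    if __name__ == "__main__":
        capture(sys.argv[1], sys.argv[2])   # e.g. capture("crash", "outlook.exe")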
If you're in the dark about what data is of interest or how to obtain it, it's a good idea to work with McAfee Support to steer you in the right direction - I know how frustrating it can be to spend hours, even days, gathering data only to learn it wasn't useful or was captured incorrectly.
"Once again, with feeling."
If users have not reported any issues to you, it's still a good idea to follow up with them and nag for some feedback, since "no issues" feedback is better than none.
Workarounds vs. Solutions
Top
  • Some people believe any software issue is unacceptable. And they're right, aren't they? We all use software in some fashion or other, and we expect it to work; we are often paying for the software so of course we expect it to work, perfectly.
  • Some of us have come to accept that software isn't perfect - that it's arguably impossible for it to be perfect - that not all software works perfectly in all environments, and that sometimes software simply has bugs, defects, plain and simple.
Which group do you belong to? I'm a subscriber to the latter, and I also believe that when software fails there's some action that can be taken to avoid or resolve that failure -

The primary goal is to maintain business continuity. What are the options? What - you don't know!? You already established a business continuity plan, right? Of course you did, so execute it. But! Depending on the issue you're experiencing, you may be able to massage that BCP to make life easier for you - wouldn't that be nice?  Perhaps it's not necessary to re-image the system, or uninstall and reinstall older software with reboots, or other hefty BCP tasks - granted, those are often the quickest resolutions, so you can't be faulted for taking those routes when under pressure. However, if time allows, or as the investigation proceeds and reveals root cause or even clarifies the circumstances of the failure, options may arise that don't require such extreme measures.

A workaround option may exist. A workaround is not a solution, but it can be a resolution.

 

Then identify which of those options are in keeping with security standards, policies, audit requirements, etc. You may need to consider other factors too, like the manpower involved in administering the modified BCP, and contingencies should the updated BCP fail.

Don't forget, we're responding to issues found during the pilot, so we're not talking thousands of nodes - only tens, perhaps hundreds if you're the adventurous type. Oh, and what if that was the issue - that you accidentally deployed the change to ALL NODES...  yeah, that happens.  Don't be that guy/gal.
"It was me...It was me..."
When an issue has been root-caused and a solution deemed appropriate or necessary to come from McAfee, we will work with you to the best of our ability to resolve it - which may include identifying helpful workarounds while awaiting a code fix. Needless to say, patience is always appreciated while we work toward solving any reported issue.
Stay Informed
Top

For whichever patch release you are adopting, always, always review the Known Issues KB article for that release. You can find it from our KB51111 article - just look for your patch release within the table, and you'll see the column for the Known Issues article.  Some issues may have hotfixes available that solve them, in which case you'd need to factor that hotfix into your planning and testing; other issues might tell you to avoid deployment to certain types of systems where you know you'll encounter a compatibility or interoperability issue - or maybe you'll see a workaround that needs to be grafted into your plans so you can avoid such an issue.

This piece of your overall Patching effort is crucial to success. We're making a concerted effort to keep you informed, but it's only as effective as you make it.

 

Thanks for reading!