The purpose of this blog is to help administrators understand the importance of scanner timeout events. Too often I hear of these events going completely ignored in an environment, as if to say these events are not important. If that includes You, please check that notion at the door before continuing, or maybe this will help explain:
I wonder if some people were thinking VirusScan is a "mandatory scan" product, where if a timeout occurred we would just retry scanning the file, or that since the User gained access to the file that there was some implied successful scan that magically occurred. Nope. If a timeout occurs, that file has not been scanned yet the User or process requesting the file gets access to it.
"Enforce a maximum scanning time for all files"
This is a UI control in the OAS properties for limiting how long the scanner can spend on a file. Most files, when scanned, take < 5 milliseconds to complete the scan (a little longer if the file triggered an Artemis request but still...) - nowhere even close to reaching the specified limit (default 45 seconds, or 45000 milliseconds) but there are some files that are difficult to scan, and there is an interoperability scenario that can also hit that limit.
If you notice scan timeouts occurring in your environment you should be paying attention to them, and over time you'll learn something about the scanner and also about your environment. For example, you may be able to answer these questions:
- Is it the same file (or at least with a higher frequency than others) in reported timeout events?
- Is it the same process that is touching the file? Is this process esteemed as "safe"?
- Are the events for that file coming in at a certain time (or within a set window of time)?
- Are Users reporting the system is sporadically sluggish, or is the system unusable (timeouts are occurring consecutively)
- Is it the same type of file, same extension, or appears random, perhaps affecting any/all files?
- Are the events new since a software update, new software deployment, configuration change, or since the last DAT update?
If you cannot answer these questions, keep paying attention to these events and you eventually will, or, if the timeouts are of concern and you're stumped on what to do with the data you should reach out to our Support team for guidance/assistance.
Difficult Files to Scan
Scanning is performed by the Scan Engine (mcscan32.dll / Mscan64.dll) which is loaded by the scanner service, MCSHIELD.EXE. The Scan Engine determines what is scanned based on the configured settings for the point product (VirusScan Enterprise in our case) and the virus definition content files (avv*.dat). Also, when a scan occurs, only portions of the file are read for the purposes of evaluating if the file is clean or not - the scan engine does not read the entirety of the file; it doesn't need to.
It is reasonable to conclude that when a scan timeout occurs, something is a little... off. "It shouldn't be taking that long to scan this file", you might say to yourself.
Some files really are difficult to scan. The scan engine has to request chunks of data from the file and then compare it against the virus signatures, and there may be numerous checkpoints to work through in order to determine the file is clean or not. The process of reading these chunks of data is complicated further when having to do it over a network connection, i.e. via the Scan Network Drives feature in the On Access Scanner. You'll find that certain file types show up more frequently than others, being indicative of this "difficult to scan" point; common ones we hear of are PDF, RAR (archives in general, really, which includes ZIP self-extracting EXEs and other formats), large Office files usually monstrous spreadsheets.
If you do have the setting enabled for scanning archives you'll definitely see more timeouts than those who do not have it enabled.
What to do with Difficult Files
If these are files you trust and that do not get modified, you can exclude them with little/no risk.
If the files do change with some frequency you could still utilize an exclusion but you ought to apply some mitigation to ensure the file does get scanned, which is best done in one of two ways -
- Add the exclusion only for the Process that accesses the file, so if any other process touches the file it will result in a scan (I wrote about this in an earlier blog: On Access Scanner - Improve Performance & Maintain Security)
- Add the exclusion, and arrange for the file to be scanned via an On Demand Scan task which will run at some acceptable frequency to assure you the file is clean.
You could also look at increasing the scan timeout value. Default is 45s; so if you're seeing timeouts because you have lowered this default value... fix it.
If you're still seeing an unacceptable number of timeouts with our default value, try increasing it but don't increase it larger than 120s. What the timeout value means is "This is how long I'm prepared to have my Users wait before they can get access to files that are difficult to scan". And, 45 seconds seems like a good trade-off; 120s might cause some Users to reach for the nearest heavy object which they might use for troubleshooting, or worse.
Reminder: The value controls how much time to allow the scanner when needed, to determine if a file is clean or not. It does not mean the scanner will take this long to scan files.
3rd party file-system filters can interfere with scanning.
- VirusScan's file-system filter sees the original file access of an application/user, and blocks it, creating a separate scan request for MCSHIELD.EXE to scan the file first.
- MCSHIELD.EXE then accesses the file, seeking to read it (well, portions of it as I mentioned) for the purpose of scanning... but a 3rd party file-system filter sees MCSHIELD's request and blocks it. Meanwhile our scan timer is ticking away until the time limit is reached.
This type of interaction can lead to very high frequency of scan timeouts; they can even be continuous, one after the other, which is extremely annoying if you're the end user suffering through the system's sluggishness.
What to do with Interoperability Issues
First, the 3rd party file-system filter driver must be identified so you know what software it belongs to, which provides leads to follow up on that may resolve the conflict, such as:
- Do they have updated drivers
- Do they have a mechanism for allowing MCSHIELD's file I/O to pass through
Identifying the 3rd party may be as simple as answering the question "What non-Microsoft and non-McAfee application am I using that is interested in files?", otherwise you can peruse the loaded drivers in your MSINFO32 report; typically it's a "manual" start driver and the names tend to reveal their purpose or could flag them as a potential driver to test with (by means of eliminating the driver from the system, like renaming it and rebooting, and then trying to reproduce the issue). Ah... don't do that if you're not sure what the driver is/does, or if you don't know how to recover should the system have issues trying to boot. That type of troubleshooting is brutal, but very effective. A safer approach would be to remove that suspect software and reboot, then test.
There is nothing McAfee, or Intel Security, can do to solve this type of interoperability issue because we cannot stop the 3rd party file-system filter from blocking MCSHIELD's file requests. But we'll be happy to participate in discussions with the other party to help them understand how the scanner works and why they need to allow MCSHIELD unfettered access to files.
Recognizing a Timeout
Scanner Timeouts only apply to the On Access Scanner, or real-time scanner. When a timeout occurs or is in progress, there are telling symptoms to be aware of or to discuss with Users when/if they are reporting a related issue.
Timeouts from the User's perspective
Performance is terrible. The system is unresponsive. The typical User will persist in performing actions while a timeout is in progress, and after the timeout completes there will be a flurry of activity as Windows executes all the queued operations from the User's "If one-click didn't work, try one-hundred!" approach to overcoming the lag. Hopefully they do not resort to this: Guy flips out at computer - YouTube, but they might want to.
OAS Log entry
%deflogdir%\OnAccessScannerLog.txt, is where the product will capture instances of graceful timeouts.
This is a timeout where the limit is reached and the scan engine is advised to stop scanning, and the scan engine acknowledges the cancellation in a timely manner.
Event log entry
The Windows' Application Event log is where the product will capture instances of fatal timeouts.
This is a timeout where the limit is reached and the scan engine is advised to stop scanning, and the scan engine fails to acknowledge the cancellation. Failure is determined through the timeout limit being reached a second time, which is why the Event log entry will say "...took longer than 90000 milliseconds...", i.e. 2 x 45s if 45s is the set limit. At that point the scanner is forcibly terminated, which releases the file we were trying to scan.
Event 5051 is recorded, Source=McLogEvent, and the details of the event will describe the exact file we were trying to scan.
Event ID 1059 is created for the McAfee Agent to upload to the ePolicy Orchestrator server, containing details of the timeout occurrence. This lets you know as the admin, what files are still being accessed but were not scanned by the On Access Scanner. "I'm going to go check my ePO server for these events right now!"
What Really Happens
- The scan timeout value is read from the registry1 by McShield.exe, the scanner service, when the service starts.
- For each scanning thread there is an additional thread to watch the scanner thread, and its purpose is to check how long the scanner thread is taking to scan the file object.
NOTE: This functionality exists for the On Access Scanner feature only; no other scanners have timeout capability or watchdog threads.
- When a watchdog thread hits the ScannerThreadTimeout value, a notification is set to inform the scanner thread "Time's up!".
- The watchdog thread starts over, this time to check how long the scanner thread is taking to stop scanning the file object.
This leads to either -
- A graceful timeout. The scanner thread acknowledges the notification to stop scanning, and relinquishes the file object. This results in the OnAccessScannerLog.txt being updated to indicate a scan timeout occurred, describing the file that was not scanned to completion. The scanner thread then moves on to the next scan request.
- A fatal timeout. The scanner thread does not acknowledge the notification to stop scanning, and the watchdog thread again reaches the timeout value2.
When (b) occurs, the watchdog thread terminates the scanner process, McShield.exe. This forces the release of the file being scanned, which is necessary because we have to assume the scan engine is "stuck" in scanning that file for some reason. If we did not terminate the scanner, the potential exists for the software to cause all file activity to come to a halt, and the system will hang. That is also the risk when you disable the option for "Enforce a maximum scanning time for all files".
1 HKLM\Software\McAfee\SystemCore\VSCore\On Access Scanner\McShield\Configuration, ScannerThreadTimeout
2 HKLM\Software\McAfee\SystemCore\VSCore\On Access Scanner\McShield\Configuration, ScannerThreadTimeoutEx
Find them. Find them all. Find all of the Timeouts.
- Within ePO, a query can be built to show you occurrences of Event 1059 from the VirusScan Enterprise product. This assumes you have not configured the McAfee Agent to filter out these events, which means the events could be occurring on your clients but they are not being reported back to ePO. Being ignorant of the dangers that a Timeout presents does not secure you from those dangers.
- If you're just inspecting one system, we already covered where to look to find evidence of timeouts occurring.
Study the data and formulate answers to the initial questions raised in this blog. If you can answer those questions then you can make a connection to the best way forward for solving the Timeout problem.
You may need to perform some tests to obtain more data about how and when the timeouts occur, or perhaps to check if increasing the timeout threshold is all that's required.
File is Difficult to Scan Test
If the On Access Scanner is having a difficult time with a file, and timing out, it stands to reason that other scanners will have a difficult time with it too right? Wrong , because each scanner has access to different content within our DAT signature files. And it might well be that certain content has been enabled for the OAS only.
Still, that doesn't mean we shouldn't test and do some due diligent profiling of the issue. Performing a scan against the same file object with the On Demand Scanner and seeing how long it takes is a great data point:
- You might find it takes 100s to scan the file; and therefore have a ballpark number for avoiding the OAS timeout through increasing the scan timeout limit.
- You might find the ODS locks up, so, convicts the file in question as being problematic for the scan engine itself and worthy of reporting to Intel Security lest the malformed file be used as a way to crash our scanner.
- You might find ODS has no issue at all with the file, and want to contact our Support team to learn why OAS is having a hard time - it may be due to 3rd party filter, or DAT content. This approach will require sending a sample of the file to McAfee Labs (treating it as potentially malicious until proven otherwise, and gives us a safe way to retrieve the file for Support's own testing.)
3rd Party Incompatibility Test
There's really no test for this... you could try scanning the same file in a different environment, perhaps in a controlled vanilla VM and see if the results are the same which reasonably eliminates 3rd party presence. The timeouts from these interactions with 3rd party filters can be sporadic, triggered after a certain event, or constant. You might glean some wisdom through knowledgebase searches but I find the best tool is simply to peruse the MSINFO32 system information and inspect the loaded drivers... 3rd party filter driver suspects tend to stand out - they are often "manual" start, or load from non-standard locations, or have legacy path references beginning with "\\". I could cite some examples but I don't want folks to think I'm *cough* see below *cough* picking on them. We have had issues too of course where the timeout occurs from our own toe-stepping oversights, but we can fix those (e.g. KB71849).
KB52441, Event ID 5051, Process will be terminated (Scanner Thread Timeout) This article describes a Fatal Timeout.
KB52417, Complete list of Event IDs for VirusScan Enterprise (including Event ID 1059 and 5051)
KB58775, Slow performance and timeout errors with event ID 5051 and 7011 with Computer Associates Backup Agent