Malware detectors trained on one dataset often stumble on another

Machine learning models built to catch malware on Windows systems are typically evaluated on data that closely resembles their training set. In practice, the malware arriving on enterprise endpoints looks different, comes from different sources, and in many cases has been deliberately obfuscated to evade detection. A study from researchers at the Polytechnic of Porto tests what happens when that gap is made explicit, and the results have direct implications for organizations relying on static detectors as a first line of defense.

Cross-dataset malware detection

The European Union Agency for Cybersecurity identified public administration as the sector most frequently targeted by malware within the EU across its 2023, 2024, and 2025 threat landscape reports, with ransomware and data-theft intrusions as the primary threats. Many of the tools used in those intrusions rely on obfuscation to get past endpoint detectors. The research tests whether current ML-based static detectors can hold up when the malware they encounter does not match the distribution they were trained on.

What the research tested

The study built detection pipelines using a standardized feature format common across six public Windows PE datasets. Two training configurations were tested: one combining the EMBER and BODMAS datasets, and one that additionally included ERMDS, a dataset constructed specifically to challenge detectors with obfuscation applied at the binary, source code, and packer levels.

Models were evaluated not only on held-out data from their own training distribution, but also on four external datasets: TRITIUM, built from naturally occurring threat samples collected from operational environments; INFERNO, derived from red team and custom command-and-control malware; SOREL-20M, a large-scale benchmark covering several years of real-world PE files; and ERMDS, used as an external test set.

That cross-dataset structure is what separates this study from most published malware detection benchmarks, which evaluate models on splits of the same dataset used for training.

Where detectors held up and where they did not

On data from their own training distribution, the best-performing models reached AUC and F1 scores in the high 90s, with strong true positive rates even at very low false positive thresholds. For enterprise environments where false alarms carry operational cost, those in-distribution numbers look deployable.
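"True positive rate at a low false positive threshold" is the metric that decides deployability here, and it is read off the ROC curve rather than reported as a single accuracy number. The snippet below shows how that readout works on illustrative detector scores; the score distributions are invented for the example, not taken from the paper.

```python
# Reading TPR at a fixed low FPR off a ROC curve.
# Detector scores are simulated: benign ~ N(0,1), malware ~ N(3,1).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
scores_benign = rng.normal(0.0, 1.0, 100_000)
scores_mal = rng.normal(3.0, 1.0, 10_000)

y = np.r_[np.zeros_like(scores_benign), np.ones_like(scores_mal)]
s = np.r_[scores_benign, scores_mal]
fpr, tpr, thresholds = roc_curve(y, s)

def tpr_at_fpr(target):
    """Largest TPR achievable without exceeding the target FPR."""
    return tpr[fpr <= target].max()

for target in (0.01, 0.001):
    print(f"TPR at FPR <= {target}: {tpr_at_fpr(target):.3f}")
```

Tightening the false positive budget by an order of magnitude costs a substantial slice of detections even for a well-separated detector, which is why the study's cross-dataset drops at strict thresholds matter more than headline AUC.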

The cross-dataset results tell a more sobering story. Models transferred well to TRITIUM, which contains naturally occurring samples from the same general time period. Performance on INFERNO, the red team and C2 dataset, was more variable, with detection rates at strict false positive thresholds dropping considerably.

SOREL-20M, the largest and most temporally diverse external dataset, showed the steepest decline across all metrics. Some model configurations fell far enough that their practical utility at low false positive rates would be limited. ERMDS as an external test set produced similarly poor results.

The obfuscation problem cuts both ways

One of the more instructive findings involves the attempt to fix the obfuscation problem directly. Adding ERMDS to the training set improved performance on obfuscated samples within that dataset’s distribution. It also reduced generalization to SOREL-20M relative to training without it.

That pattern suggests a tension that practitioners building or procuring static detectors should be aware of. Training a model to recognize obfuscated malware can shift its feature distribution in ways that reduce its effectiveness on broader, more diverse data. Solving for one threat profile can create blind spots elsewhere.

The researchers attribute this to obfuscation-heavy samples spreading feature vectors within each class, narrowing the separation between benign and malicious files that the classifier relies on to make decisions.
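That mechanism can be made concrete with a back-of-the-envelope model. For two one-dimensional Gaussian classes with a fixed gap between their means, the best achievable AUC is Phi(gap / (sigma * sqrt(2))), so inflating within-class spread alone degrades the ceiling any classifier can reach. The numbers below are a generic illustration of that effect, not figures from the study.

```python
# Holding the benign/malicious mean gap fixed while within-class
# variance grows: the optimal detector's AUC shrinks, illustrating
# the "spread feature vectors" explanation with 1-D Gaussians.
from math import erf, sqrt

def best_case_auc(mean_gap, sigma):
    """AUC of the optimal detector for N(0, sigma^2) vs N(mean_gap, sigma^2)."""
    z = mean_gap / (sigma * sqrt(2))
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF of z

for sigma in (1.0, 2.0, 4.0):
    print(f"sigma={sigma}: best-case AUC = {best_case_auc(3.0, sigma):.3f}")
```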

What this means for endpoint detection

Static detectors are attractive for on-host deployment because they are computationally light and deliver fast verdicts without executing the file. The study confirms that compact, boosting-based models are viable for that use case under the right conditions.

The findings also reinforce a practical limitation that procurement and engineering teams do not always account for: a detector’s benchmark performance is only meaningful if the benchmark data reflects the threat landscape the detector will encounter. Red team tooling, packed malware, and temporally shifted samples can all degrade a model that looks strong on paper.

The researchers plan to extend the evaluation to deep learning architectures, with continued focus on how training data composition affects detection at the low false positive rates that production deployments require.
