Sophos and ReversingLabs present the SoReL-20M database for information security researchers
Information security companies Sophos and ReversingLabs presented the SoReL-20M database, which consists of 20 million Windows Portable Executable files. Of these, 10 million files are samples of malware.
The database, designed to advance the information security industry, provides metadata, labels and features for the files, and also allows interested parties to download the available malware samples for further research.
“A publicly available dataset containing carefully selected samples and relevant metadata is expected to help accelerate research into the use of machine learning for malware detection,” write Sophos and ReversingLabs researchers.
Although machine learning models are built on data, the information security field has no standard large-scale database that everyone can easily access, from independent researchers to information security laboratories and corporations. According to Sophos experts, the lack of such a database has impeded the development of the information security sector.
“Collecting large numbers of carefully selected, labelled samples is costly and complex, and sharing datasets is often complicated by intellectual property issues and the risk of exposing malware to unknown third parties. As a result, most malware detection research uses proprietary internal datasets, so the results cannot be compared,” Sophos experts said.
By the way, Sophos recently notified its customers of a data leak.
The industrial-scale SoReL-20M database, covering 20 million samples, including 10 million malware samples, is designed to solve this problem. For each sample, the database contains features extracted based on the EMBER 2.0 dataset, labels, detection metadata, and, for the malware samples, complete binaries.
It also provides PyTorch and LightGBM machine learning models trained on this data, as well as scripts for loading and iterating over the data, and scripts for training and testing the models.
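To give a rough idea of what "loading and iterating over" a labelled feature dataset involves, here is a minimal Python sketch. It does not use the actual SoReL-20M scripts or data: the helper names (`make_synthetic_samples`, `iter_batches`), the feature count and the random values are all assumptions made up for illustration, and a real loader would read the published feature and metadata files instead.

```python
import random
from typing import Iterator, List, Tuple

# One record: (feature vector, label), where 1 = malware and 0 = benign.
# This layout is an assumption for illustration, not the SoReL-20M format.
Sample = Tuple[List[float], int]


def make_synthetic_samples(n: int, n_features: int = 8, seed: int = 0) -> List[Sample]:
    """Build n fake records with alternating labels (hypothetical helper)."""
    rng = random.Random(seed)
    samples: List[Sample] = []
    for i in range(n):
        label = i % 2  # alternate benign/malware for an even split
        features = [rng.random() for _ in range(n_features)]
        samples.append((features, label))
    return samples


def iter_batches(
    samples: List[Sample], batch_size: int
) -> Iterator[Tuple[List[List[float]], List[int]]]:
    """Yield (features, labels) batches; the last batch may be smaller."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        features = [f for f, _ in chunk]
        labels = [y for _, y in chunk]
        yield features, labels


samples = make_synthetic_samples(10)
batches = list(iter_batches(samples, batch_size=4))

for features, labels in batches:
    print(len(features), "samples,", sum(labels), "labelled as malware")
```

Batches of this kind are what a training script would typically feed to a model, whichever library is used.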
Sophos acknowledges that experienced hackers could use the database to their advantage and create tools for carrying out cyberattacks. However, according to the experts, attackers already have many other sources from which they can obtain information about malware.
Let me remind you that Comodo has also opened the source code of its Endpoint Detection and Response (EDR) product.