How to source training data in ML for information security?
A company entrusts a data scientist with processing and extracting value from data in order to investigate or handle events related to traces of computer attacks. I was wondering how he would get the training data.
I guess he would need to exploit the logs of the different devices of the clients and use statistical, Machine Learning and visualization techniques in order to bring a better understanding of the attacks in progress and to identify the weak signals of attacks... But how would he get labelled data?
He might get the logs of attacks received before, but those might not have the same signature as the attacks that are going to come later. So might it be difficult to create a reliable product?
The domain problems in cybersecurity are too narrow to willy-nilly apply AI/ML or DL to them.
Not saying to throw data science out the window. Saying you need way more domain expertise to make it go.
One excellent application of data science to the cybersecurity field is to understand the indicators, or IOCs, and how they are sighted (last seen, etc.) in memory, on disks, and in network traffic (and where they sit in the network traffic relative to the sources, destinations, and passthroughs -- just like any dataflow). Instead of leveraging ML or DL, I would suggest focusing first on graph algorithms. Understand the relationships between these indicators and their interpretation as time series data. Typical indicators include:
- A file "hash" (e.g., a SHA256 checksum of a file or a section of memory or network traffic of a process). Here is an example of the domain problems associated with cybersecurity: the ways our signatures (e.g., YARA rules) work on disk vs. in memory vs. in network traffic obviously involve different code and data, and different code and data paths. Parameters or arguments to processes also matter, especially for script code.
- An IPv4 or IPv6 address or path, and its associated network attributes such as the BGP-4 ASN if registered with a Regional Internet Registry (RIR). There is often a history associated with these objects, and narrowing in on them may require understanding complex RWhois and SWIP registration processes.
- An FQDN, or Fully-Qualified Domain Name -- sometimes a hostname with Windows Server Forest/Domain bits, perhaps older mainstays such as NetBIOS or MS-RPC Named Pipes. Cloud stacks such as Azure AD are changing this nomenclature as well, moving to tenants, subscriptions, resources, et al. A set of Whois records identifying each unique Internet domain name can come with its own set of relationships, including a rich history of timestamps, owners, name servers, and email addresses.
- A credential, often an email address, e.g., firstname.lastname@example.org; also a cred user/pass pair, e.g., bertrand:MathIsK00lB00ksRul3, if known (often if compromised).
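
To make the graph-first suggestion a bit more concrete, here is a minimal Python sketch (standard library only). It is not taken from any particular tool: the sighting records, the "type:value" indicator naming, and the helper names build_graph and connected_components are all hypothetical, made up for illustration. It links indicators that were sighted together, tracks first-seen/last-seen timestamps per indicator, and groups related indicators into clusters using plain connected components.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sightings: (timestamp, indicator_a, indicator_b) means the two
# indicators were observed together in one event (e.g. a single log record).
# All values are made up; the IPs come from documentation-only ranges.
SIGHTINGS = [
    ("2023-01-05T10:00:00", "hash:aaa111", "ip:203.0.113.7"),
    ("2023-01-05T10:02:00", "ip:203.0.113.7", "fqdn:bad.example.net"),
    ("2023-01-09T08:30:00", "fqdn:bad.example.net", "cred:bertrand"),
    ("2023-02-01T12:00:00", "hash:bbb222", "ip:198.51.100.9"),
]


def build_graph(sightings):
    """Build an undirected co-sighting graph plus first/last-seen times."""
    graph = defaultdict(set)
    first_seen, last_seen = {}, {}
    for ts, a, b in sightings:
        when = datetime.fromisoformat(ts)
        graph[a].add(b)
        graph[b].add(a)
        for ioc in (a, b):
            first_seen[ioc] = min(first_seen.get(ioc, when), when)
            last_seen[ioc] = max(last_seen.get(ioc, when), when)
    return graph, first_seen, last_seen


def connected_components(graph):
    """Group indicators that are linked directly or transitively (iterative DFS)."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


graph, first_seen, last_seen = build_graph(SIGHTINGS)
clusters = connected_components(graph)

# Print each cluster with the time window over which it was active.
for cluster in clusters:
    start = min(first_seen[ioc] for ioc in cluster)
    end = max(last_seen[ioc] for ioc in cluster)
    print(sorted(cluster), start.isoformat(), end.isoformat())
```

With real data, the edges would come from parsed logs rather than a hard-coded list, and the per-indicator timestamps are what you would turn into time series. Connected components is only the simplest possible grouping; it is used here as a stand-in for whatever graph algorithm fits the actual question.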
As a hint of what would be possible, check out the work here -- https://threathunterplaybook.com/introduction.html -- which pivots nicely off of the fields (and parsing languages) from Azure Sentinel and M365/Azure data models