In our previous blog post, "Context and intent for AI enable effective cyber investigations", we discussed the importance of context and intent in AI-driven cybersecurity investigations. Today, we're diving deeper into how we determine which investigative questions are most appropriate at any given point in an investigation. We use a Retrieval Augmented Generation (RAG) approach to help make informed decisions about which leads or investigative paths to follow.
The Power of RAG in Cybersecurity
Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) techniques have opened up a wealth of possibilities in AI-driven cybersecurity investigations. RAG combines LLMs with external knowledge retrieval, allowing for more accurate and contextually relevant responses. This approach is particularly valuable in cybersecurity, where the threat landscape is constantly evolving and investigations require up-to-date, domain-specific knowledge.
We are working on various use cases that take advantage of RAG-like capabilities, from "how do I investigate..?" to leveraging prior investigations as memory. Among them, the automated selection and execution of questions (see "Transforming cyber investigations: The power of asking the right questions") is a core capability throughout our platform, and an area where we're constantly exploring approaches and techniques to improve quality and fidelity.
At its simplest, RAG involves two components:
Retrieval: Relevant documents or information are fetched from an external source, which could be a vector database, file store, or the web. This step ensures the LLM is supplied with current, domain-specific information that is unlikely to have been captured during the model's training.
Generation: A model (different LLMs in our case) uses this information to generate a contextually relevant response. The retrieved content is integrated to improve the quality, accuracy and relevance of the model's output.
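To make the two components concrete, here is a minimal sketch in Python. It is illustrative only and not our production implementation: the bag-of-words cosine similarity stands in for a real embedding model and vector database, and the `llm` callable, the sample questions and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Retrieval: return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved content in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use only the context below.\nContext:\n{joined}\nQuestion: {query}"


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Generation: hand the augmented prompt to whichever model is in use."""
    return llm(prompt)


# Hypothetical knowledge base entries, invented for this example.
documents = [
    "Which IP addresses were used in recent sign in attempts for this user?",
    "What processes did the host spawn after the file was written?",
    "Which mailbox rules were created in the last seven days?",
]


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call; it simply returns the prompt."""
    return prompt


query = "risky user sign in from an unfamiliar IP"
top = retrieve(query, documents, k=1)
answer = generate(build_prompt(query, top), echo_llm)
```

Swapping `echo_llm` for a real model client and `retrieve` for a vector database query keeps the same two-step shape.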
So how do we select an appropriate set of questions to ask when starting an investigation?
Given a lead, hunt output or a security event to validate, we need to ask a series of questions in order to prove or disprove the initial hypothesis or verdict. As discussed in the previous blog post, context and intent are often critical to determining what steps to take or questions to ask. In the case of investigations with minimal or no human input, we have to work with, and improve on, what context we can extract.
To start, we maintain a comprehensive, constantly updated vector store of investigative knowledge, including our questions and facets. We use Natural Language Processing (NLP) techniques to extract context and generate metadata describing intent and relevance. The data is passed through pipelines that create, clean, chunk, embed and insert it into the store. This knowledge base is also made available throughout the platform, so users can reference and review the data we use to inform some of the decisions we make.
As part of this, we'll be adding access to historical investigation reports, analyst actions, annotations and enrichments.
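The cleaning, chunking, embedding and inserting stages mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the hashed bag-of-words `embed` function is a toy stand-in for a real embedding model, `VectorStore` is an in-memory list rather than an actual vector database, and the chunk sizes and metadata fields are hypothetical.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field

DIM = 64  # toy embedding dimension, chosen arbitrarily for the example


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks


def embed(text: str) -> list[float]:
    """Toy embedding: hash each token into a bucket, then L2-normalize."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode()).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class Record:
    text: str
    vector: list[float]
    metadata: dict


@dataclass
class VectorStore:
    records: list[Record] = field(default_factory=list)

    def insert(self, text: str, metadata: dict) -> None:
        self.records.append(Record(text, embed(text), metadata))


def ingest(store: VectorStore, raw: str, metadata: dict) -> int:
    """Run clean -> chunk -> embed -> insert; return the number of chunks stored."""
    pieces = chunk(clean(raw))
    for piece in pieces:
        store.insert(piece, metadata)
    return len(pieces)


store = VectorStore()
stored = ingest(store, "  Which   IP addresses\nwere used?  ", {"facet": "identity"})
```

Keeping the metadata alongside each chunk is what later allows results to be filtered by intent, relevance or data source before ranking.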
Most importantly, the quality of the response (which questions to ask) depends on the context and investigative steps generated from the initial input, for example a Risky User sign-in event in Microsoft Entra. Semantic search excels at understanding human questions but may not capture all relevant keywords or entities. For that reason, the query we generate to pull the most relevant results from our vector store takes into account details of the event, initial verdicts, categories, the leads that can be extracted and the data sources available to the platform to expand the investigation scope.
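One way to picture this is a query built from structured investigation context, with a small keyword boost layered on top of the semantic score so that exact entities are not lost. The sketch below is an assumption-laden illustration: `InvestigationContext`, its fields, the boost weight and the sample values are all hypothetical, and the semantic score is assumed to come from whatever vector search is in use.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationContext:
    """Hypothetical container for what is known at this point."""
    event_summary: str
    verdict: str
    category: str
    leads: list[str] = field(default_factory=list)
    data_sources: list[str] = field(default_factory=list)


def build_query(ctx: InvestigationContext) -> str:
    """Combine event details, verdict, category, leads and data sources."""
    parts = [
        f"Event: {ctx.event_summary}",
        f"Initial verdict: {ctx.verdict}",
        f"Category: {ctx.category}",
    ]
    if ctx.leads:
        parts.append("Leads: " + ", ".join(ctx.leads))
    if ctx.data_sources:
        parts.append("Available data sources: " + ", ".join(ctx.data_sources))
    return "\n".join(parts)


def keyword_boost(candidate: str, ctx: InvestigationContext, weight: float = 0.1) -> float:
    """Reward candidates that mention the category or a lead verbatim."""
    text = candidate.lower()
    terms = [ctx.category, *ctx.leads]
    hits = sum(1 for term in terms if term.lower() in text)
    return weight * hits


def hybrid_score(semantic_score: float, candidate: str, ctx: InvestigationContext) -> float:
    """Semantic similarity plus the exact-match boost."""
    return semantic_score + keyword_boost(candidate, ctx)


ctx = InvestigationContext(
    event_summary="Risky User sign-in in Microsoft Entra",
    verdict="suspicious",
    category="identity",
    leads=["alice@example.com", "203.0.113.7"],
    data_sources=["Entra sign-in logs"],
)
query = build_query(ctx)
score = hybrid_score(0.5, "Which identity signed in from 203.0.113.7?", ctx)
```

Here the candidate mentions both the category and one lead, so it receives two boosts on top of its semantic score.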
From this information, a tactical and actionable query is generated as input to our vector store. The results are filtered, ranked and then applied to the individual leads for execution. This process is repeated, with the new results providing additional details, changes in verdicts, leads and added context for the model to determine which questions to extract from the vector store and ask next.
In this case, the retrieval phase included the initial event data, contextual summary, leads, available data sources and generated actions, all combined and refined to produce the next steps of the investigation.
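The retrieve, filter, rank, execute and repeat cycle described above can be sketched as a simple loop. This is a rough illustration rather than our actual pipeline: the `retrieve` and `execute` callables are stubs, and the `Question` and `State` types, the filter rules, the round limit and the sample catalog are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Question:
    text: str
    data_source: str
    score: float


@dataclass
class State:
    leads: set[str]
    data_sources: set[str]
    asked: set[str] = field(default_factory=set)


def select_questions(candidates: list[Question], state: State, top_k: int = 2) -> list[Question]:
    """Filter out unanswerable or already-asked questions, then rank by score."""
    eligible = [
        q for q in candidates
        if q.data_source in state.data_sources and q.text not in state.asked
    ]
    return sorted(eligible, key=lambda q: q.score, reverse=True)[:top_k]


def investigate(
    state: State,
    retrieve: Callable[[State], list[Question]],
    execute: Callable[[Question, State], set[str]],
    max_rounds: int = 3,
) -> State:
    """Repeat retrieval and execution, feeding new leads back into the state."""
    for _ in range(max_rounds):
        selected = select_questions(retrieve(state), state)
        if not selected:
            break  # nothing new to ask
        for question in selected:
            state.asked.add(question.text)
            state.leads |= execute(question, state)
    return state


# Hypothetical catalog; a real retrieve() would query the vector store
# using a query rebuilt from the current state.
CATALOG = [
    Question("Which IPs did the user sign in from?", "entra", 0.9),
    Question("Which hosts contacted this IP?", "network", 0.8),
    Question("Were MFA methods changed?", "entra", 0.7),
    Question("Which mailbox rules were created?", "m365", 0.6),
]


def stub_retrieve(state: State) -> list[Question]:
    return CATALOG


def stub_execute(question: Question, state: State) -> set[str]:
    # Pretend the sign-in question surfaces one new IP lead.
    return {"203.0.113.7"} if "IPs" in question.text else set()


final = investigate(
    State(leads={"alice@example.com"}, data_sources={"entra", "m365"}),
    stub_retrieve,
    stub_execute,
)
```

In this run the network question is never asked because no network data source is available, and the loop stops once every eligible question has been executed.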
Inspiration comes from everywhere
LLM-Based Classification of Support Requests: In customer support systems, LLMs are used to classify and route incoming requests. We've adapted this concept to our cyber investigation flow. Just as support systems identify the intent behind a customer query, our system attempts to extract the context and intent behind an investigator's initial query or the current state of an investigation. Based on the recognized intent, our system dynamically generates and prioritizes questions that are most likely to produce valuable results for that specific type of investigation.
E-commerce Product Recommendations: E-commerce platforms use LLMs and algorithmic approaches to recommend products based on user behavior and trending items. We've applied similar principles to our question selection process. Similar to how e-commerce systems analyze past purchases, our platform examines prior activity in the investigation. Just as e-commerce platforms consider trending products, our platform will soon take into account current threat intelligence and attack vectors to suggest relevant questions.
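As a rough illustration of both ideas, the sketch below routes an input to an intent and then orders that intent's questions so that ones already asked in the investigation come last. It rests on several assumptions: the keyword rules are a stand-in for an LLM classifier, and the intent names, keyword sets and question lists are hypothetical examples rather than our actual catalog.

```python
# Hypothetical intents and trigger keywords; a real system would use an LLM here.
INTENT_KEYWORDS = {
    "account_compromise": {"sign-in", "login", "password", "mfa"},
    "malware": {"process", "binary", "hash", "executable"},
    "data_exfiltration": {"upload", "transfer", "download", "exfiltration"},
}

# Hypothetical question sets per intent.
QUESTIONS_BY_INTENT = {
    "account_compromise": [
        "Which IP addresses did the user sign in from?",
        "Were MFA methods changed recently?",
        "Were any mailbox rules created?",
    ],
    "malware": [
        "Which processes did the file spawn?",
        "On which other hosts does this hash appear?",
    ],
    "data_exfiltration": [
        "Which external destinations received data?",
    ],
}


def classify_intent(text: str) -> str:
    """Pick the intent whose keywords appear most often in the text."""
    lowered = text.lower()
    scores = {
        intent: sum(keyword in lowered for keyword in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def prioritize(text: str, prior_activity: set[str]) -> list[str]:
    """Order questions for the detected intent, pushing already-asked ones last."""
    questions = QUESTIONS_BY_INTENT.get(classify_intent(text), [])
    # sorted() is stable, so the original order is kept within each group.
    return sorted(questions, key=lambda q: q in prior_activity)


intent = classify_intent("Risky user sign-in with MFA prompt denied")
ordered = prioritize(
    "Risky user sign-in with MFA prompt denied",
    prior_activity={"Which IP addresses did the user sign in from?"},
)
```

A threat intelligence signal could be folded in the same way, as an additional sort key that favors questions tied to currently active attack vectors.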
Conclusion
The integration of RAG-based question selection has enhanced our ability to conduct effective cyber investigations. By leveraging AI to intelligently select and prioritize investigative questions, we're able to start investigations and provide outcomes more quickly and with greater efficacy.
As we continue to refine this approach, we're excited about the possibilities it opens up for the future of AI-driven cyber investigations. The combination of human expertise and AI-powered guidance is proving to be a powerful one.