Winners are announced.
Open-source packages and libraries are permitted, provided that you use a stable version, comply with all licensing terms, and give proper attribution.
We encourage the use of popular, modern programming languages and tools. The final solution should be deployable on a cloud platform such as AWS, GCP, Azure, or OCI. Please include enough documentation for evaluators to build your final submission and run a basic test on the given data.
Your idea submission must include the following:
Upon successful idea submission, you can begin coding.
Final project submission. You may submit your project as many times as you like. Only the final submission will be judged.
Scored on | %
Accuracy vs. trained dataset | 30%
Accuracy vs. unseen dataset | 15%
Effort to maintain the model | 15%
Inference compute time | 15%
Extensibility to other device types | 15%
Code quality | 5%
Future recommendations | 5%
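As a rough illustration of how the weights above combine, the following sketch computes a total out of 100. The criterion keys, the `weighted_total` function, and the assumption that each criterion is rated on a 0 to 1 scale are all hypothetical; the brief does not say how judges rate individual criteria.

```python
from typing import Dict, Mapping

# Weights copied from the scoring table; the key names are made up for this sketch.
WEIGHTS: Dict[str, int] = {
    "accuracy_trained": 30,
    "accuracy_unseen": 15,
    "maintenance_effort": 15,
    "inference_time": 15,
    "extensibility": 15,
    "code_quality": 5,
    "future_recommendations": 5,
}


def weighted_total(scores: Mapping[str, float]) -> float:
    """Combine per-criterion ratings (assumed to lie in [0, 1]) into a 0-100 total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return float(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS))


if __name__ == "__main__":
    # A submission rated 0.5 on every criterion would total 50 under these assumptions.
    print(weighted_total({name: 0.5 for name in WEIGHTS}))
```

Under this reading, the two accuracy criteria together account for 45 of the 100 points.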
We envision the solution to this problem as one that uses AI, specifically natural language processing (NLP). This may include frameworks, libraries, or packages designed to learn and understand the meaning of ambiguous text based on context. Example solutions might use an encoder-decoder strategy, a graph attention network, or other deep learning models, graph-based or otherwise. Ultimately, we are agnostic about the specific technical solution. The scoring criteria focus on the accuracy of the models, their performance, the quality of the code, and each team’s future recommendations.
The high-level architecture of the data flow is represented in the following diagram.
In summary, various endpoint security systems constantly report on activity occurring on the endpoints that they’re monitoring. Some of the events are benign, while others indicate suspicious or even malicious activity. Each endpoint security provider reports events in its own terms and language. It’s up to human security analysts to sift through the events across these systems, quickly assess the potential for a threat, and take the appropriate action. Your solution should create a common event ontology that the security analyst can query to understand potential threats across all endpoint security providers, without needing to understand the syntax of any specific provider.
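To make the idea of a common event ontology concrete, here is a minimal sketch of the target shape only. The provider names, event phrasings, and category terms are invented for illustration, and the lookup table stands in for whatever learned model a team actually builds; it is not a proposed solution.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CommonEvent:
    provider: str  # which security product reported the event
    category: str  # shared ontology term that analysts query on
    host: str      # endpoint the event came from
    raw: str       # original provider wording, kept for audit


# Hypothetical provider-specific phrasings that all mean the same thing.
# In a real submission, a trained model would replace this table.
PHRASE_TO_CATEGORY: Dict[Tuple[str, str], str] = {
    ("provider_a", "proc_launch"): "process_start",
    ("provider_b", "Process Created"): "process_start",
    ("provider_c", "exec"): "process_start",
    ("provider_a", "net_out"): "network_connection",
}


def translate(provider: str, event_name: str, host: str) -> CommonEvent:
    """Map one provider event onto the shared vocabulary ('unknown' if unmapped)."""
    category = PHRASE_TO_CATEGORY.get((provider, event_name), "unknown")
    return CommonEvent(provider, category, host, event_name)


def query(events: List[CommonEvent], category: str) -> List[CommonEvent]:
    """Return every event in a category, regardless of which provider reported it."""
    return [e for e in events if e.category == category]


if __name__ == "__main__":
    events = [
        translate("provider_a", "proc_launch", "host-1"),
        translate("provider_b", "Process Created", "host-2"),
        translate("provider_a", "net_out", "host-1"),
    ]
    for event in query(events, "process_start"):
        print(event.provider, event.host, event.raw)
```

The point of the sketch is the query at the end: the analyst asks for `process_start` once and gets matching events from every provider, without knowing how each provider phrases it.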
When designing your solution, please keep in mind that the training data sets we provide are static, gathered from three different security providers, but the production system will not read from static files. In production, the data will stream constantly. Your model therefore needs to ingest and process high volumes of data quickly.
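One common way to bridge static training files and a streaming deployment is to process events in small batches as they arrive. The sketch below assumes the stream is exposed as a Python iterable and that the model accepts a list of events per call; `micro_batches`, `process_stream`, and `placeholder_model` are hypothetical names, not part of any provided tooling.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a possibly endless stream into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def process_stream(
    stream: Iterable[T],
    model: Callable[[List[T]], List[R]],
    batch_size: int = 256,
) -> Iterator[R]:
    """Run the model batch by batch and yield results in arrival order."""
    for batch in micro_batches(stream, batch_size):
        yield from model(batch)


def placeholder_model(batch: List[str]) -> List[str]:
    """Stand-in for a real model: upper-cases each event string."""
    return [event.upper() for event in batch]


if __name__ == "__main__":
    incoming = (f"event {i}" for i in range(5))  # simulated stream
    for result in process_stream(incoming, placeholder_model, batch_size=2):
        print(result)
```

Because both functions are generators, nothing is held in memory beyond the current batch, so the same code can be exercised against a static file during development.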
The production solution will require both a high level of accuracy and a high level of precision. In other words, your model should produce results that are both close to a known value and repeatable. Sensitivity to false positives and false negatives is often a consideration. In this context, a false positive is an event flagged as potentially malicious that is not, while a false negative is an event flagged as not malicious that in reality is. Although a model that produces a common ontology does not itself assess the risk of the translated events, please note that while the goal is to keep all false results to a minimum, false negatives are unacceptable.
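The false-positive and false-negative definitions above can be counted as follows. This is a sketch that assumes some downstream step produces a malicious or benign flag per event and that ground-truth labels are available for comparison; the function names are hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[bool], flagged: Sequence[bool]) -> Dict[str, int]:
    """Count outcomes, where True means 'malicious' in both sequences."""
    if len(actual) != len(flagged):
        raise ValueError("actual and flagged must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for is_malicious, was_flagged in zip(actual, flagged):
        if was_flagged and is_malicious:
            counts["tp"] += 1  # correctly flagged
        elif was_flagged and not is_malicious:
            counts["fp"] += 1  # false positive: flagged but benign
        elif not was_flagged and is_malicious:
            counts["fn"] += 1  # false negative: missed a malicious event
        else:
            counts["tn"] += 1  # correctly left alone
    return counts


def false_negative_rate(counts: Dict[str, int]) -> float:
    """Share of truly malicious events that were not flagged."""
    malicious = counts["tp"] + counts["fn"]
    return counts["fn"] / malicious if malicious else 0.0


def false_positive_rate(counts: Dict[str, int]) -> float:
    """Share of truly benign events that were flagged anyway."""
    benign = counts["fp"] + counts["tn"]
    return counts["fp"] / benign if benign else 0.0


if __name__ == "__main__":
    actual = [True, True, False, False, True]
    flagged = [True, False, True, False, True]
    counts = confusion_counts(actual, flagged)
    print(counts, false_negative_rate(counts), false_positive_rate(counts))
```

Since false negatives are described as unacceptable, a team might track `false_negative_rate` as a hard constraint and then minimize false positives subject to it.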
For this challenge we have chosen events captured from three major endpoint security providers. Your model should be extensible, allowing additional providers to be included easily in the future without significant rework. In your recommendations, please include the steps required to add a new endpoint security provider, as well as guidance on how, and how often, your model may need updating or retraining.
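One possible way to keep provider-specific logic out of the model is a small adapter registry, sketched below. It assumes each provider's raw event arrives as a dictionary and that the shared model consumes plain text; the provider names, field names, and the `register` and `to_model_text` helpers are all hypothetical.

```python
from typing import Callable, Dict

Adapter = Callable[[dict], str]

# Maps a provider name to the function that renders its raw events as text.
_ADAPTERS: Dict[str, Adapter] = {}


def register(provider: str) -> Callable[[Adapter], Adapter]:
    """Decorator that registers an adapter under a provider name."""
    def decorator(adapter: Adapter) -> Adapter:
        _ADAPTERS[provider] = adapter
        return adapter
    return decorator


@register("provider_a")
def _provider_a(event: dict) -> str:
    # Hypothetical field names for an imaginary provider.
    return f"{event['evt']} on {event['host']}"


@register("provider_b")
def _provider_b(event: dict) -> str:
    return f"{event['EventName']} on {event['Computer']}"


def to_model_text(provider: str, event: dict) -> str:
    """Render a raw event as model input, failing loudly for unknown providers."""
    try:
        adapter = _ADAPTERS[provider]
    except KeyError:
        raise ValueError(f"no adapter registered for {provider!r}") from None
    return adapter(event)


if __name__ == "__main__":
    print(to_model_text("provider_a", {"evt": "proc_launch", "host": "host-1"}))
    print(to_model_text("provider_b", {"EventName": "Process Created", "Computer": "host-2"}))
```

With this layout, the steps for adding a provider reduce to writing and registering one adapter function, then evaluating the model on that provider's events to decide whether retraining is needed.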