Open-source packages and libraries are permitted, provided that you use a stable version, comply with all licensing terms, and give proper attribution.

We encourage the use of popular, modern programming languages and tools. The final solution should be deployable on a cloud platform such as AWS, GCP, Azure, or OCI. Please include enough documentation for evaluators to build your final submission and run a basic test on the given data.

Your idea submission must include the following:

Idea Title: Written on cover page
Approach: How will you solve the problem? How will you make it extensible for future needs?
Architecture: Share your solution/model design in a diagram, indicating the components and their relationships to one another
Future: What are the strengths and weaknesses of your approach? What is required to maintain and extend your solution over time as more data providers are added?

Upon successful idea submission, you can begin coding.

Final project submission. You may submit your project as many times as you like. Only the final submission will be judged.

Project overview (3-4 sentences explaining what you built)
Final solution design diagram
Invite to repo(s) on GitHub
Video presentation of your work, no longer than 5 minutes
Test results vs known data set
Test results vs unseen data set

Scored on:	%
Accuracy v trained dataset	30%
Accuracy v unseen dataset	15%
Effort to maintain the model	15%
Inference compute time	15%
Extensibility to other device types	15%
Code quality	5%
Future recommendations	5%
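
As a rough illustration of how the weights above could combine into one total, here is a minimal sketch. It assumes each criterion is rated on a 0 to 1 scale and that the total is a simple weighted sum; the brief does not state how judges actually combine the criteria, and the key names and the weighted_score function are hypothetical.

```python
# Weights copied from the scoring table above (percent of the total score).
WEIGHTS = {
    "accuracy_trained": 30,
    "accuracy_unseen": 15,
    "maintenance_effort": 15,
    "inference_time": 15,
    "extensibility": 15,
    "code_quality": 5,
    "future_recommendations": 5,
}

# Sanity check: the published weights add up to 100%.
assert sum(WEIGHTS.values()) == 100


def weighted_score(ratings):
    """Combine per-criterion ratings (each in [0, 1]) into a 0-100 total.

    ASSUMPTION: a linear weighted sum. The brief does not say how the
    judges aggregate the individual criteria.
    """
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
```

Under this assumption, the two accuracy criteria together account for 45 of the 100 available points.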

We envision the solution to this problem as one that uses AI, specifically natural language processing (NLP). This may include frameworks, libraries, or packages designed to learn and understand the meaning of ambiguous text based on context. Example solutions might use an encoder-decoder strategy, a graph attention network, or other deep learning models, graph-based or otherwise. Ultimately, we are agnostic about the specific technical solution. The scoring criteria focus on the accuracy of the models, their performance, the quality of the code, and each team’s future recommendations.

The high-level architecture of the data flow is represented in the following diagram.

In summary, various endpoint security systems constantly report on activity occurring on the endpoints that they’re monitoring. Some of the events are benign while others indicate suspicious or even malicious activity. Each endpoint security provider reports events in its own terms and language. It’s up to human security analysts to sift through the events across these systems, quickly assess the potential for a threat, and take the appropriate action. Your solution should create a common event ontology that a security analyst can query to understand potential threats across all endpoint security providers, without needing to understand the syntax of any specific provider.
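
To make the target of a common event ontology concrete, the sketch below shows two differently shaped provider events translated into one shared shape. Everything in it is assumed for illustration: the provider names, the field names, and the hand-written lookup table are hypothetical stand-ins. A real submission would be expected to learn or infer these mappings rather than hard-code them.

```python
from typing import Any, Dict

# HYPOTHETICAL mappings: provider field name -> common ontology field name.
# A hand-written table stands in here for whatever the model would learn.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "provider_a": {"hostName": "host", "eventType": "action", "sev": "severity"},
    "provider_b": {"device": "host", "activity": "action", "risk_level": "severity"},
}


def normalize(provider: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one provider-specific event into the common shape.

    Fields with no known mapping are kept under "unmapped" so that no
    information is silently dropped.
    """
    mapping = FIELD_MAPS[provider]
    common: Dict[str, Any] = {"provider": provider}
    unmapped: Dict[str, Any] = {}
    for key, value in event.items():
        if key in mapping:
            common[mapping[key]] = value
        else:
            unmapped[key] = value
    common["unmapped"] = unmapped
    return common


a = normalize("provider_a", {"hostName": "ws-01", "eventType": "process_start"})
b = normalize("provider_b", {"device": "ws-01", "activity": "process_start"})
# Both events now expose the same "host" and "action" keys to the analyst.
```

With the events in one shape, an analyst can filter on "host" or "action" without knowing which provider produced the record.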

When designing your solution, please keep in mind that the static training data sets we provide were collected from three different security providers, but in production the data will not be stored in static files; it will stream constantly. Your solution should therefore be able to ingest high volumes of data and process them through your model quickly.
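
One common way to handle constantly streaming input is to group events into small batches before calling the model, so that inference cost is amortized without loading the whole stream into memory. The following is a minimal sketch of that idea using only the standard library. The names batched and process_stream, the model_fn callable, and the default batch size are assumptions for illustration, not requirements of the challenge.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items from a (possibly endless) stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def process_stream(
    stream: Iterable[T],
    model_fn: Callable[[List[T]], List[R]],
    batch_size: int = 256,  # ASSUMPTION: tune for your model and hardware
) -> Iterator[R]:
    """Run model_fn on each batch and yield results one at a time."""
    for batch in batched(stream, batch_size):
        yield from model_fn(batch)
```

Because both functions are generators, only one batch is held in memory at a time, and the same code works whether the source is a static file or a live feed.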

The production solution will require both a high level of accuracy and a high level of precision. Said another way, your model should produce results that are both close to a known value and repeatable. Sensitivity to false positives and false negatives is often a consideration. In this context, a false positive is an event flagged as potentially malicious that is not. A false negative, on the other hand, is an event flagged as not malicious that in reality is. While creating a model to produce a common ontology does not in itself assess the risk of the translated events, note that although the goal is to keep all false results to a minimum, false negatives are unacceptable.
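
The definitions above can be checked with a small counting routine. This is a sketch only, assuming that labels are booleans where True means "malicious"; the function names are hypothetical, and the challenge does not prescribe any particular evaluation code.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, int]:
    """Count outcomes, where True means "malicious" (an assumed convention)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["tp"] += 1  # flagged malicious, is malicious
        elif predicted and not actual:
            counts["fp"] += 1  # false positive: flagged malicious, is benign
        elif actual:
            counts["fn"] += 1  # false negative: flagged benign, is malicious
        else:
            counts["tn"] += 1  # flagged benign, is benign
    return counts


def error_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """False-negative and false-positive rates; 0.0 when a class is empty."""
    positives = counts["tp"] + counts["fn"]
    negatives = counts["tn"] + counts["fp"]
    return {
        "false_negative_rate": counts["fn"] / positives if positives else 0.0,
        "false_positive_rate": counts["fp"] / negatives if negatives else 0.0,
    }
```

Since false negatives are the costlier error here, the false-negative rate is the number to watch most closely when comparing candidate models.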

For this challenge, we have chosen events captured from three major endpoint security providers. Your model should be extensible, allowing additional providers to be included easily in the future without significant rework. In your recommendations, please include the steps required to add a new endpoint security provider, as well as guidance on how, and how often, your model may need updating or retraining.
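
One way to keep the cost of adding a provider low is a registry pattern, in which each provider contributes a single translation function and the rest of the pipeline stays unchanged. The sketch below is an assumption about how such a design might look, not a required architecture; the names register and translate, the "provider_c" label, and its fields are all hypothetical.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]
Adapter = Callable[[Event], Event]

_ADAPTERS: Dict[str, Adapter] = {}


def register(provider: str) -> Callable[[Adapter], Adapter]:
    """Decorator that registers a translation function for one provider."""
    def decorator(fn: Adapter) -> Adapter:
        if provider in _ADAPTERS:
            raise ValueError(f"adapter already registered for {provider!r}")
        _ADAPTERS[provider] = fn
        return fn
    return decorator


def translate(provider: str, event: Event) -> Event:
    """Dispatch an event to the adapter registered for its provider."""
    if provider not in _ADAPTERS:
        raise KeyError(f"no adapter registered for provider {provider!r}")
    return _ADAPTERS[provider](event)


# Adding a new provider is one decorated function; nothing else changes.
@register("provider_c")  # HYPOTHETICAL provider and field names
def _provider_c(event: Event) -> Event:
    return {"host": event.get("machine"), "action": event.get("what")}
```

In a learned system the adapter body could wrap a model call instead of a fixed mapping; the registration step would stay the same.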

NuHarbor Hackathon

Winners

Polyglot: Breaking the Language Barrier

"Unifying Heterogeneous Data for Effective Cybersecurity Management: The Multilingual Initiative"

Recursive search through JSON file to match GPT-generated cybersecurity common ontology to the corresponding key alert pairings
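
For readers unfamiliar with the technique named above, here is a generic sketch of a recursive search through nested JSON for a given key. It is not the winning team's code, which is not shown here; the function name find_key, the path notation, and the sample document are assumptions for illustration.

```python
import json
from typing import Any, Iterator, Tuple


def find_key(node: Any, target: str, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every occurrence of `target` at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == target:
                yield child, value
            # Keep descending: the value may contain further matches.
            yield from find_key(value, target, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key(item, target, f"{path}[{index}]")
    # Scalars (str, int, None, ...) have no children, so recursion stops.


# HYPOTHETICAL sample alert, invented for this example.
doc = json.loads('{"alert": {"severity": "high", "items": [{"severity": "low"}]}}')
matches = list(find_key(doc, "severity"))
```

Returning the path alongside each value makes it possible to record where in a provider's alert structure a given ontology field was found.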

What is Expected?