Working towards an automated security tool typology¹

The goal of this project was to create a typology of open source security tools. This was to help us identify any blind spots within our own tooling arsenal and to keep up to date with trending and popular areas.

Sourcing data

It was decided to focus on tools from GitHub, with plans to add GitLab support in the future. A database of security tools was created, with automation that adds new tools based on GitHub topics.
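As a rough illustration of topic-based sourcing, the sketch below queries GitHub's public repository search endpoint for a single topic and keeps only the fields used later. The topic name `security-tools`, the function names and the selected fields are assumptions made for this example, not a description of the project's actual automation.

```python
import json
import urllib.parse
import urllib.request

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(topic: str, page: int = 1, per_page: int = 100) -> str:
    """Build a repository search URL for a single GitHub topic."""
    query = urllib.parse.urlencode(
        {
            "q": f"topic:{topic}",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        }
    )
    return f"{GITHUB_SEARCH_URL}?{query}"


def extract_tools(payload: dict) -> list[dict]:
    """Keep only the fields the rest of the pipeline needs."""
    return [
        {
            "name": item["full_name"],
            # GitHub returns null for repositories without a description.
            "description": item.get("description") or "",
            "topics": item.get("topics", []),
        }
        for item in payload.get("items", [])
    ]


def fetch_tools(topic: str, page: int = 1) -> list[dict]:
    """Fetch one page of results (unauthenticated requests are rate limited)."""
    request = urllib.request.Request(
        build_search_url(topic, page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_tools(json.load(response))
```

Calling `fetch_tools("security-tools")` on a schedule and inserting unseen `name` values into a database would be one way to keep the tool list current.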

Processing

Initially, we attempted a pure Natural Language Processing (NLP) approach using the Python NLP library spaCy. A few different experiments were performed to find the solution that would most efficiently achieve our goal.

tok2vec classification

Our first NLP attempt used tok2vec to turn each GitHub tool description into a vector representing the semantic meaning of the description. An algorithm from the obligatory StackOverflow answer was then used to condense these vectors into two easily visualisable dimensions.

[Figure: Naive word2vec on GitHub description]

Predictably, clustering on the two-dimensional vectors did not reveal any meaningful categories. However, the experiment proved useful in an unexpected way: it showed that the outliers in our data fell into one of two main categories:

The GitHub description is in a language other than English
The GitHub description is blank or not semantically useful e.g. a link to a personal blog
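The two outlier types above could be flagged before any clustering with a crude heuristic such as the one sketched below. The thresholds and label names are assumptions for illustration. The non-ASCII ratio only catches descriptions written in non-Latin scripts, so it is not real language detection.

```python
import re

# Matches descriptions that consist of nothing but a single link.
URL_ONLY = re.compile(r"^\s*(https?://\S+|www\.\S+)\s*$", re.IGNORECASE)


def flag_description(description, min_words=3, max_non_ascii_ratio=0.3):
    """Return 'blank', 'link_only', 'possibly_non_english', 'too_short' or 'ok'."""
    text = (description or "").strip()
    if not text:
        return "blank"
    if URL_ONLY.match(text):
        return "link_only"
    # Share of alphabetic characters outside ASCII (assumed proxy for script).
    letters = [ch for ch in text if ch.isalpha()]
    non_ascii = sum(1 for ch in letters if not ch.isascii())
    if letters and non_ascii / len(letters) > max_non_ascii_ratio:
        return "possibly_non_english"
    if len(text.split()) < min_words:
        return "too_short"
    return "ok"
```

Descriptions flagged as anything other than `ok` are the ones that would need translation or enrichment before classification.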

The addition of Gemma AI

After hitting these outliers, it was decided to add Google’s Gemma AI model for miscellaneous data wrangling tasks such as translation and reading small code snippets.

It was important that beyond data processing, the actual classification was performed using a deterministic function rather than prompting the AI model.

Semantic clustering using a priori categories

This approach was inspired by the Medium article Topic Modeling and Semantic Clustering with spaCy.

The method involved two steps:

Identify pre-existing security categories
Use the zero shot classification method to label tools within these pre-defined categories
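The two steps above can be sketched as a deterministic similarity lookup: embed the description and each predefined category, then pick the closest category. This is only one common way to do zero shot labelling and may differ from the method in the referenced article. `toy_embed` is a stand-in that counts words and carries no semantic meaning, so a real embedding model returning the same dict-like sparse format would be swapped in, and the category texts in the usage note are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_embed(text):
    """Stand-in embedding (assumption): lowercase word counts."""
    return Counter(text.lower().split())


def classify(description, categories, embed=toy_embed, threshold=0.0):
    """Pick the predefined category most similar to the description.

    `categories` maps a label to a short text describing it. Returns
    (label, score), or (None, score) when no score exceeds `threshold`.
    """
    if not categories:
        return None, 0.0
    vector = embed(description)
    scores = {
        label: cosine(vector, embed(text)) for label, text in categories.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] <= threshold:
        return None, scores[best]
    return best, scores[best]
```

For example, with `categories = {"vulnerability scanner": "vulnerability scanner static analysis", "osint": "osint reconnaissance intelligence"}`, the description "Agent-less vulnerability scanner for Linux" is labelled `vulnerability scanner`, while a description sharing no words with any category returns `None`. The same input always yields the same label, which is the property wanted from the classification step.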

The zero shot classification technique appeared to run consistently faster and yield more accurate results than prompting a generative AI. This confirmed our reservations about using a pure AI approach.

[Figure: Comparing zero shot classification against Gemma]

Since this method produced reasonably accurate classification results, the data wrangling prompts could be fine-tuned during this phase. As seen below, the AI model we used was quite accurate when translating from different languages. However, when presented with a non-meaningful description, it would occasionally hallucinate information about the tool.

[Figure: Using Gemma to wrangle data]

Ultimately it was determined that while zero shot classification proved to be quite accurate, this approach did not satisfy the main goal of this project, which was to create empirical categories rather than to impose our own understanding onto the data.

Semantic clustering using a posteriori categories

Our next approach added a Bag of Words (BoW) method to create novel categories. To do this, each tool description was crunched into a list of noun phrases using a generative AI in the process shown below.

[Figure: GitHub data processing flow]

Text clustering was then performed following scikit-learn's K-means and TF-IDF method. This produced some meaningful categories, such as the “Vulnerability Scanner” category below.

Category: Security Vulnerabilities & Vulnerability Scanner
name                                               description
future-architect/vuls                              Agent-less vulnerability scanner for Linux, Fr...
yogeshojha/rengine                                 reNgine is an automated reconnaissance framewo...
presidentbeef/brakeman                             A static analysis security vulnerability scann...
PyCQA/bandit                                       Bandit is a tool designed to find common secur...
google/osv-scanner                                 Vulnerability scanner written in Go which uses...
...                                                ...
CyVers-AI/oswar                                    Comprehensive framework that identifies, categ...
thearrival/IsmailScript                            Is a tool written by using python programming ...
yogsec/OneLinerBounty                              OneLinerBounty is a collection of quick, actio...
tarunKoyalwar/Sandman                              A Target Tracking , NoteTaking , CheckLists an...
h33tlit/SniffCon-Ultimate-Recon-Dashboard-For-...  Sniffcon has a wide list of powerful online bu...
...

Some drawbacks of this method:

Large, generic categories titled “Cyber Security” or “Open Source Tool” often arose, despite our best efforts at filtering out such words
The number of categories needed to be manually decided upon
It lacked a “miscellaneous” category

For fun, here is a category of categories of things that do not belong in a category:

Kingdom Protozoa
The Altaic language family
Domestic Shorthair cats

[Figure: Domestic Shorthair cat]

High level results

A high level set of results is listed below. These show some of the categories which emerged from the first attempt at security tool classification:

Automated CIS Benchmark Compliance Remediation & Automated STIG Benchmark Compliance Remediation: e.g. cloud testing tools, configuration checkers
Cyber Security & Technical Guidelines: e.g. OSINT for threat intelligence, security checklists
Ethical Hacking & Open Source: also included secret scanning and steganography tools
Nuclei: e.g. packet classification tools, search tips
Patrowl - Open Source & Sensitive Information: e.g. operating system and platform level tools, command and control tools
Security Vulnerabilities & Vulnerability Scanner: e.g. static code scanners, network vulnerability scanners
Source Code & Git Repositories: e.g. secret scanning and bug bounty tools
Social Media Accounts & Social Networks: e.g. OSINT for recon, asset discovery
The Center’s Mappings Explorer Project & Cheat Sheet: e.g. CVE lists, red team tools, pen test tools
Your Wireless World & Pcap Files: e.g. port scanning, proxies

Improvements

By far the greatest improvement that could be made to this project would be to use a specialised AI to read the actual source code of each tool, rather than just the tool description, to determine the tool's features. However, the state-of-the-art special-purpose AIs all seemed to be closed source (for now). The amount of data to process would also grow quickly with the size of the codebases analysed.

Realistically, it would be useful to research improvements to the final semantic clustering method discussed in the previous section.

Conclusion

Further analysis and refinement are required, but we hope the techniques from our first experimental phase are useful to anyone looking to classify security tools and keep up to date with newly published open source security software.

Footnote: It was ultimately decided that we were establishing a typology rather than a taxonomy, as we were looking for groups of tools centered around the intent of the tool.
