Training a model using Natural Language Processing (NLP) is challenging. Training one adapted to the unique vocabulary of malicious actors becomes even more difficult. This complex process highlights the need of having a continuously adaptive lexical able to follow new trends in illicit communities.
To overcome the challenge of the distinct vocabulary used by malicious actors, we’ve created and made public the first open-source tokenizer trained on a corpus containing years of content from interactions on the Dark Web. The tokenizer and lexical are in the format of Byte-Pair-Encoding and will be available on GitHub.
We will demonstrate two applications of this model, applied to real world challenges and highlight some insights found by them. First, we will demonstrate how the ML auto-extractor is able to extract contents from a wide variety of illicit forums, without human configuration. Then, we will show how we were able to regroup multiple actors’ monikers based on their writing style.
What you would learn during this talk:
How to avoid the pitfalls when training NLP on slang/jargon.
How to continuously adapt the lexical to follow new trends in illicit communities.
With the release of this open-source model, malware researchers and threat hunters will be able to automate interactions with cybercriminals, support infiltration engagements, and analyze communications and data leaks. Additionally, by constantly updating the lexical based on cybercriminals’ interactions, the model allows researchers to discover and track rising trends.