Cross-Lingual Cybersecurity Analytics in the International Dark Web with Adversarial Deep Representation Learning

Release time: 2023-10-26

It's crucial to focus on the dark web

Cybercrime was estimated to cost the global economy $6 trillion annually by 2021. A large portion of cybercrime stems from the dark web, a conglomerate of international and ever-evolving online platforms mainly characterized by hacker forums and Dark Net Markets (DNMs). The dark web is rife with malicious assets that hackers can leverage to launch attacks that compromise the cybersecurity of individuals and organizations. Hacker assets include malware, hacking tools (e.g., phishing/carding tools), hacking tutorials (e.g., procedures to monetize stolen credit cards), and malicious source code. Dark web content is a valuable cybersecurity resource as it provides reconnaissance on hacker assets. These assets reflect hackers’ tools, techniques, and procedures (TTPs) and provide a unique opportunity to understand adversaries’ arsenals and capabilities.

Challenges in detecting hacker assets

The popularity of dark web platforms among cybercriminals has led to an increase in the number of illicit items from several thousand in 2013 to hundreds of thousands in 2018. Given the magnitude of the dark web, it is impractical for human analysts to manually sift through the content to identify hacker assets. However, automatically detecting hacker assets among thousands of other similar illegal items (e.g., pirated e-books, digital goods) is a non-trivial task. Keyword-based search approaches are prone to inclusion and exclusion errors. Recognizing this issue, recent cybersecurity reports suggest utilizing automated machine learning (ML) techniques to monitor the dark web for hacker assets. While ML approaches hold significant promise in automating hacker asset detection, their training procedures require human-labeled data, which is expensive and time-consuming to obtain. This issue becomes more pronounced when performing hacker asset detection in foreign languages. The language barrier makes acquiring human-labeled training data more expensive for non-English dark web platforms. ML models’ performance often suffers in low-resource environments that lack human-labeled data.
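
The inclusion and exclusion errors of keyword-based search can be illustrated with a small sketch. The listing titles, labels, and keyword list below are hypothetical and invented purely for illustration; they are not drawn from the study's data.

```python
# Hypothetical listings (title, is_hacker_asset) and keywords, for illustration only.
LISTINGS = [
    ("Fresh CVV dump with carding tutorial", True),
    ("Silent crypter, FUD, bypasses AV", True),      # hacker asset with no keyword
    ("E-book: history of phishing scams", False),    # benign item containing a keyword
    ("Premium streaming account, lifetime", False),
]
KEYWORDS = {"carding", "phishing", "malware"}


def keyword_match(title: str) -> bool:
    """Flag a listing if any keyword appears as a whole word in its title."""
    words = {w.strip(".,:;!?").lower() for w in title.split()}
    return not KEYWORDS.isdisjoint(words)


def error_report(listings):
    """Return (inclusion_errors, exclusion_errors) as lists of titles."""
    # Inclusion error: flagged by a keyword but not actually a hacker asset.
    inclusion = [t for t, is_asset in listings if keyword_match(t) and not is_asset]
    # Exclusion error: a real hacker asset that no keyword catches.
    exclusion = [t for t, is_asset in listings if not keyword_match(t) and is_asset]
    return inclusion, exclusion


if __name__ == "__main__":
    false_hits, misses = error_report(LISTINGS)
    print("Inclusion errors:", false_hits)
    print("Exclusion errors:", misses)
```

In this toy setting the e-book is wrongly flagged and the crypter is missed, which is the kind of mismatch that motivates learned representations over fixed keyword lists.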

Detecting multilingual hacker assets is essential

Today, Russian, French, and Italian are among the most common languages in the dark web. Nation-specific dark web platforms differ in the type of hacker assets they host. Thus, analyzing non-English content helps security analysts and others better understand the global cybersecurity landscape. Accordingly, scholars have emphasized the critical need for multilingual dark web cybersecurity analytics research. One promising approach for responding to this need is to leverage knowledge from human-labeled English content to analyze low-resource non-English content, known as Cross-Lingual Knowledge Transfer (CLKT).

We introduce an innovative method for detecting multilingual hacker assets

We adopt the computational design science paradigm to develop a novel CLKT framework, Cross-Lingual Hacker Asset Detection (CLHAD), to automatically detect hacker assets in non-English dark web platforms without machine translation. At the core of CLHAD stands the Adversarial Deep Representation Learning (ADREL) method. Drawing upon state-of-the-art methodologies in Generative Adversarial Networks (GANs), ADREL automatically extracts language-invariant representations from English contexts and transfers them to non-English contexts without requiring external resources (e.g., human- or machine-translated corpora) or extensive human-labeled training data. Rather than relying on lexicons to translate words while ignoring their context, ADREL constructs representations from textual descriptions of dark web hacker assets that embed salient features from the source and target language. Figure 1 shows the proposed research design for cross-lingual hacker asset detection.
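
To give a rough sense of how adversarial training can push representations toward language invariance, the sketch below computes the two opposing losses in one common, generic formulation. It is not the ADREL method itself, whose details are in the paper. The function names, the weight `lam`, and the example probabilities are assumptions made for illustration. A language discriminator tries to tell English from non-English representations, while the shared encoder is rewarded both for classifying labeled English items correctly and for making the discriminator's task harder.

```python
import math


def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one predicted probability p and label y."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(lang_probs, lang_labels):
    """Loss minimized by the language discriminator.

    lang_probs: predicted probability that each representation is non-English.
    lang_labels: 0 = English, 1 = non-English.
    """
    return mean(bce(p, y) for p, y in zip(lang_probs, lang_labels))


def encoder_loss(asset_probs, asset_labels, lang_probs, lang_labels, lam=1.0):
    """Loss minimized by the shared encoder (hypothetical weighting `lam`).

    The task term uses labeled English items only; the adversarial term is
    subtracted, so the encoder gains when the discriminator does poorly.
    """
    task = mean(bce(p, y) for p, y in zip(asset_probs, asset_labels))
    adversarial = discriminator_loss(lang_probs, lang_labels)
    return task - lam * adversarial


if __name__ == "__main__":
    asset_probs, asset_labels = [0.8, 0.2], [1, 0]   # made-up English predictions
    lang_labels = [0, 1]
    separable = [0.1, 0.9]   # discriminator easily tells the languages apart
    confused = [0.5, 0.5]    # representations look language-invariant
    print(encoder_loss(asset_probs, asset_labels, separable, lang_labels))
    print(encoder_loss(asset_probs, asset_labels, confused, lang_labels))
```

With the same task accuracy, the encoder's loss is lower when the discriminator is reduced to guessing, which is the signal that steers the representations toward features shared across languages.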

Figure 1. Proposed Research Design for Cross-Lingual Hacker Asset Detection

Our main contributions

Our study makes a significant contribution to cybersecurity analytics by introducing a novel multilingual hacker asset detection framework known as CLHAD with ADREL. We have also conducted multilingual hacker asset profiling on the international dark web. From the results of this profiling, security managers can derive valuable insights. Specifically, they may find it more beneficial to concentrate on Russian platforms when identifying sophisticated hacking assets. On the other hand, identifying financial hacker assets may necessitate attention to several dominant languages in the dark web.

Figure 2. Explanation of Examples Identified by CLHAD

The managerial implications

Our study benefits Information Security Officers (ISOs) and practitioners in cybersecurity analytics organizations. First, discovering that cybercriminals on Russian platforms are more likely to be equipped with sophisticated hacking skills can provide insights into cyber attack attribution (a crucial task in incident response). Second, discovering that cybercriminals across dark web platforms concentrate almost equally on financial fraud as a lucrative business suggests that cybersecurity analytics organizations that protect financial firms need to monitor non-English platforms in addition to English ones. As such, automated multilingual hacker asset detection helps prioritize hacker assets based on the security needs of firms and the global cybersecurity landscape.

Article Information

Ebrahimi, Mohammadreza, Chai, Yidong, Samtani, Sagar, & Chen, Hsinchun. (2022). Cross-lingual cybersecurity analytics in the international dark web with adversarial deep representation learning. MIS Quarterly, 46(2), 1209-1226.
