Importance of Collection Planning in CTI

Robindimyan
Oct 18, 2023

In Cyber Threat Intelligence (CTI), high-quality data is vital, and so is a solid plan for collecting it. A plan alone is not enough, though: you also need to understand the challenges inherent in each collection approach. In this article, we’ll explore this by examining how to collect data on Initial Access markets, highlighting the hurdles and why they need to be addressed properly.

Let’s begin with a seemingly simple question: What is the size of the Initial Access market? For a better understanding of the problem, you can further break down the question as follows:

  1. How many initial accesses are sold in a year?
  2. What is the total value of initial accesses sold in a year?
  3. How many suppliers exist in the market?

After refining the questions to pinpoint what needs answering, you must determine ‘how’ to answer them. While there are various sources and methods available, which one will yield the most accurate response? Let’s explore a few options and evaluate their potential limitations.

Approach 1: Scrape all the posts advertising initial access

Challenges with this approach:

  • Advertisements don’t equate to confirmed sales.
  • Not every claim is genuine; actors might falsely advertise access.
  • Repetitive advertisements may skew numbers.
  • Different platforms yield varying volumes. A forum might have hundreds of posts monthly, whereas Telegram can boast over 10,000.
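
The repeated-advertisement problem can be reduced, though not eliminated, by deduplicating posts before counting them. The snippet below is a minimal sketch, assuming each scraped post is a dictionary with hypothetical `actor` and `text` fields; the sample posts are made up. It only catches near-identical reposts by the same actor, not reworded ads or the same access resold under another handle.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so reposts compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dedupe_ads(ads):
    """Keep the first occurrence of each (actor, normalized text) pair."""
    seen = set()
    unique = []
    for ad in ads:
        raw_key = ad["actor"].lower() + "|" + normalize(ad["text"])
        key = hashlib.sha256(raw_key.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique


# Hypothetical scraped posts, for illustration only.
ads = [
    {"actor": "seller_a", "text": "RDP access, US retail, $500"},
    {"actor": "seller_a", "text": "rdp access - US retail - $500!!"},
    {"actor": "seller_b", "text": "VPN access, EU manufacturing"},
]

unique_ads = dedupe_ads(ads)
print(len(ads), "raw ads ->", len(unique_ads), "unique ads")
```

Even after deduplication, the count is still a count of advertisements, not of confirmed sales.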

Approach 2: Identify actors selling initial access and scrape only their posts

Challenges with this approach:

  • Determining the veracity of an actor’s claims is difficult.
  • If you restrict collection to known, high-reputation actors, you capture only part of the broader market.

Approach 3: Engage with sellers or escrows and try to elicit how many accesses they currently possess

Challenges with this approach:

  • It is time-intensive and relies on manual, human engagement.
  • There’s a high possibility of not receiving any useful information.
  • Acquired information might still be inaccurate.
  • Engaging with a limited number of sellers provides just a sliver of the overarching market.

Approach 4: Compile threat intelligence reports and tally the number of breaches attributed to Initial Access Brokers (IABs)

Challenges with this approach:

  • Only a small portion of breaches are publicly reported, and even fewer provide detailed accounts of how the breach occurred.
  • You are capturing only those initial accesses that resulted in breaches, but many more accesses are sold without necessarily leading to a breach.
  • Because breaches happen on various dates, it’s hard to determine the number of initial accesses sold in a set period.
  • Similarly, the date a breach occurs and when initial access is sold can be quite different, making it hard to create an accurate timeline.

As you can see, there are multiple solutions to the same problem, each with its own drawbacks. As it’s often said, there are no solutions, there are only trade-offs. This is especially true in intelligence collection. Analysts need to consider these limitations when assessing any gathered information. Each collection method can yield different results, influencing the direction of analysis, for better or worse.

Data Sampling and Selection Bias

Given these challenges and the sheer volume of data, narrowing the scope of collection becomes inevitable. This is where sampling and extrapolation come into play.

One important consideration is to sample data in a manner that accurately represents the market. Otherwise, we risk encountering selection bias, a bias stemming from choosing a cohort that doesn’t accurately reflect the broader population the study intends to represent.

Markets have their ups and downs. Just like you wouldn’t judge an entire year’s weather by a single day’s forecast, you shouldn’t estimate the entire initial access market based on a limited set of data. For instance, in the graph below, sampling solely from the orange areas would result in an understated representation of the market. Conversely, sampling only from the blue areas would lead to an inflated representation.

Graph depicting Initial Access advertisements over time
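
The same effect can be shown with a few lines of arithmetic. The following is a rough sketch using made-up monthly advertisement counts (not real market data): it extrapolates a yearly total from only the quiet months, only the busy months, and from all months.

```python
from statistics import mean

# Hypothetical monthly advertisement counts (made-up numbers, for illustration only).
monthly_ads = [120, 95, 60, 40, 55, 130, 170, 150, 80, 45, 50, 110]

overall_mean = mean(monthly_ads)
quiet_months = [n for n in monthly_ads if n < overall_mean]
busy_months = [n for n in monthly_ads if n >= overall_mean]


def annual_estimate(sample):
    """Extrapolate a yearly total from the average of the sampled months."""
    return mean(sample) * 12


actual_total = sum(monthly_ads)
print("actual yearly total:     ", actual_total)
print("estimate from quiet only:", round(annual_estimate(quiet_months)))
print("estimate from busy only: ", round(annual_estimate(busy_months)))
```

Sampling only the quiet months understates the yearly total, and sampling only the busy months overstates it, even though every individual data point is accurate.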

In our case of Initial Access markets, a suitable sampling approach might be as follows:

  • Fixed Data Points: For instance, data from specific months in 2023 and 2022 (e.g., January, April, August, October).
  • Random Data Points: A randomly selected set of additional months from both 2023 and 2022.
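
One way to turn this into a reproducible collection plan is sketched below. It assumes sampling is done per calendar month and that two extra random months per year are enough; both are illustrative choices, and `build_sample` is a hypothetical helper rather than part of any tool.

```python
import random

YEARS = [2022, 2023]
FIXED_MONTHS = [1, 4, 8, 10]  # January, April, August, October


def build_sample(years, fixed_months, random_per_year, seed=0):
    """Return (year, month) pairs: the fixed months plus random extras per year."""
    rng = random.Random(seed)  # seeded so the plan can be reproduced later
    periods = []
    for year in years:
        # Draw the random months only from those not already fixed.
        others = [m for m in range(1, 13) if m not in fixed_months]
        picked = rng.sample(others, random_per_year)
        months = sorted(set(fixed_months) | set(picked))
        periods.extend((year, month) for month in months)
    return periods


sample = build_sample(YEARS, FIXED_MONTHS, random_per_year=2)
for year, month in sample:
    print(f"{year}-{month:02d}")
```

Drawing the random months separately for each year guarantees that both years are represented, and fixing the seed lets another analyst regenerate the same plan.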

Additionally, sampling should span multiple sources. Relying on a single source, such as dark web forums, won’t yield a comprehensive picture. Analysts should diversify sources by including platforms like Telegram, Discord, CTI reports, and others.

Another aspect to consider is the variation in data volumes across sources. A handful of messages from a source with thousands of monthly posts carries different weight than the same number from a source with only a few posts monthly. Therefore, when extrapolating data, an analyst should:

  • Consider the total message volume of each platform.
  • Account for the total number of active participants in each source.
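
To illustrate the first point, here is a minimal sketch that extrapolates per platform and weights each by its total message volume, then compares the result with a naive estimate that pools all samples together. The platform names and every number are hypothetical, and the sketch ignores the number of active participants, which would need its own adjustment.

```python
from dataclasses import dataclass


@dataclass
class PlatformSample:
    name: str
    total_messages: int    # estimated messages posted on the platform in the period
    sampled_messages: int  # messages the analyst actually reviewed
    access_ads: int        # reviewed messages that advertise initial access


def estimate_ads(s: PlatformSample) -> float:
    """Scale the observed ad rate up to the platform's total volume."""
    return s.total_messages * s.access_ads / s.sampled_messages


# Hypothetical figures, for illustration only.
samples = [
    PlatformSample("forum", total_messages=400, sampled_messages=200, access_ads=20),
    PlatformSample("telegram", total_messages=10_000, sampled_messages=200, access_ads=4),
]

# Per-platform extrapolation, summed across platforms.
weighted_total = sum(estimate_ads(s) for s in samples)

# Naive extrapolation: pool every sample and apply one rate to the combined volume.
total_volume = sum(s.total_messages for s in samples)
total_sampled = sum(s.sampled_messages for s in samples)
total_ads = sum(s.access_ads for s in samples)
naive_total = total_volume * total_ads / total_sampled

print("per-platform estimate:", weighted_total)
print("pooled estimate:      ", naive_total)
```

With these made-up figures, the per-platform estimate is 240 ads while the pooled estimate is 624, because pooling applies the forum’s high ad rate to Telegram’s much larger volume.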

Conclusion

Collecting relevant data is crucial in CTI, but how we collect it matters just as much. While there’s no one-size-fits-all approach, being aware of the drawbacks inherent to each method can help in refining our strategies. Analysts must always account for these limitations; otherwise, they risk compromising the integrity of their analysis.
