LLM Attribution Engine
CTI research workflow for linking pseudonymous text profiles, clustering aliases, and scoring likely matches using stylometry, embeddings, and entity overlap on synthetic data.
Overview
What it does
This project looks at a CTI problem that comes up whenever attribution gets messy: you have text, aliases, and scattered profiles, but very little metadata. The goal is not "prove identity." The goal is narrower and more useful for analysis: rank likely matches, surface evidence, and group related profiles for review.
The pipeline runs locally and follows one flow:
extract -> embed -> stylometry -> score -> retrieve -> calibrate -> cluster -> evaluate -> report
In practice, that means the project pulls several kinds of evidence from each profile, scores every query-candidate pair, ranks the strongest matches, then writes reports an analyst can inspect instead of hiding everything behind one similarity score.
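To make the score-and-rank step concrete, here is a minimal sketch of multi-signal pair scoring. The function names are illustrative, not the project's actual API, and only two lexical signals stand in for the six the pipeline actually combines:

```python
import math
import re
from collections import Counter

def _tokens(text):
    # Lowercase alphanumeric tokens; the real pipeline uses richer extraction
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_jaccard(a, b):
    # Set overlap of tokens: |A ∩ B| / |A ∪ B|
    sa, sb = set(_tokens(a)), set(_tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def bow_cosine(a, b):
    # Cosine similarity over raw bag-of-words counts
    ca, cb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query_text, candidates):
    """Score each query-candidate pair on multiple signals, rank by the mean.

    `candidates` is a list of (id, text) pairs. Returning the per-signal
    breakdown alongside the combined score is what lets a report show
    evidence instead of one opaque number.
    """
    scored = []
    for cid, text in candidates:
        signals = {
            "keyword_jaccard": keyword_jaccard(query_text, text),
            "bow_cosine": bow_cosine(query_text, text),
        }
        combined = sum(signals.values()) / len(signals)
        scored.append((cid, combined, signals))
    return sorted(scored, key=lambda r: r[1], reverse=True)
```

The per-signal dictionary survives into the output on purpose: the ranking can be audited signal by signal rather than trusted as a single similarity score.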
Why I added it
Most of my other projects sit closer to SOC workflows, ATT&CK coverage, and detection visibility. This one adds a different CTI angle. It is more about attribution support, alias clustering, and text-based correlation when infrastructure or telemetry is limited.
It also gave me a chance to work through a problem with more uncertainty than a typical rules-and-alerts project. That part was useful. Attribution work gets sloppy fast if the pipeline cannot explain why it ranked one profile above another.
Results and limits
On the synthetic dataset used in the project, the workflow reached an AUC of 0.950, and the correct match ranked first in 4/4 matchable queries. Those numbers are promising for a lab setting, but they are not the main point.
The part I trust more is the structure:
- multiple scoring signals instead of one embedding similarity
- calibration for p(match) instead of raw confidence theater
- cluster output for alias review
- markdown reports with overlap summaries and evidence tables
The project is upfront about its limits: the data is synthetic, real-world recall would be lower, and a production use case would need stronger abstention logic, more careful lexicons for technical identifiers, and tighter evaluation on realistic samples.
Ethics
All profiles in the project are synthetic. No real people are analyzed, and the workflow is presented as an analyst support tool for authorized CTI research only. That boundary matters here.
Repo note
This project lives inside the broader CTI-Lab repository rather than a standalone repo. The portfolio entry points to the lab repo, and the project README is here:
Objectives
- Explore how text-only CTI attribution workflows can support alias linking and analyst review
- Score likely query-candidate matches using multiple independent signals instead of one model output
- Cluster related aliases to support operator tracking and cross-platform correlation
- Keep the workflow local and synthetic so the project stays safe to test and easy to reproduce
Tools Used
Methodology
- Built a single-run pipeline that extracts entities and keywords, embeds raw and engineered text features, computes stylometric features, and scores each query-candidate pair
- Combined six signals for retrieval: entity overlap, keyword Jaccard, TF-IDF cosine, stylometry cosine, raw embedding cosine, and feature embedding cosine
- Calibrated match probabilities with logistic regression and optional isotonic calibration to make thresholding more practical
- Used HDBSCAN to cluster candidate profiles into likely alias groups for analyst review
- Evaluated the workflow on synthetic profiles and exported pair scores plus per-query markdown reports
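The calibration step can be illustrated with a Platt-style logistic fit over a raw combined score. The project uses logistic regression with optional isotonic calibration; this hand-rolled gradient-descent version is only a dependency-free sketch of the idea, mapping a score to p(match) so a threshold means something:

```python
import math

def fit_platt(scores, labels, lr=0.5, steps=2000):
    """Fit p(match) = sigmoid(a*score + b) by gradient descent on log loss.

    `scores` are raw combined similarities, `labels` are 1 for true matches
    and 0 otherwise. Returns the fitted (a, b).
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s   # gradient of log loss w.r.t. a
            gb += (p - y)       # gradient of log loss w.r.t. b
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def p_match(score, a, b):
    # Calibrated probability for one query-candidate pair
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

The payoff is thresholding: "review pairs above p(match) = 0.8" is an actionable rule for an analyst, while "cosine above 0.73" is not.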
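The project clusters with HDBSCAN; as a dependency-free stand-in with the same output shape (alias groups plus singletons), here is a simple threshold-plus-connected-components grouping. Names and the threshold are hypothetical, and this deliberately swaps HDBSCAN for a much cruder technique just to show what the alias-review output looks like:

```python
def group_aliases(profiles, similarity, threshold=0.6):
    """Group profiles whose pairwise similarity meets `threshold`.

    Union-find over the "similar enough" graph: any chain of above-threshold
    pairs ends up in one alias group for analyst review.
    """
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if similarity(profiles[i], profiles[j]) >= threshold:
                union(i, j)

    groups = {}
    for i in range(len(profiles)):
        groups.setdefault(find(i), []).append(profiles[i])
    return list(groups.values())
```

HDBSCAN earns its place in the real pipeline because it handles varying cluster density and marks noise points explicitly; a flat threshold like this one chains loosely related profiles together far too eagerly on real data.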
Key Findings
- Attribution work is easier to defend when the score comes from several interpretable signals instead of a single black-box similarity number
- Stylometry and content similarity are both useful, but they become more practical when entities and keywords are surfaced as evidence
- Calibration matters because ranking alone does not tell an analyst where confidence should drop off
- Synthetic data is good enough for demonstrating workflow design, but real-world performance would need stricter evaluation and abstention thresholds