LLM Attribution Engine
CTI research workflow for linking pseudonymous text profiles, clustering aliases, and scoring likely matches using stylometry, embeddings, and entity overlap on synthetic data.
Overview
What it does
This project looks at a CTI problem that comes up whenever attribution gets messy: you have text, aliases, and scattered profiles, but very little metadata. The goal is not "prove identity." The goal is narrower and more useful for analysis: rank likely matches, surface evidence, and group related profiles for review.
The pipeline runs locally and follows one flow:
extract -> embed -> stylometry -> score -> retrieve -> calibrate -> cluster -> evaluate -> report
In practice, that means the project pulls several kinds of evidence from each profile, scores every query-candidate pair, ranks the strongest matches, then writes reports an analyst can inspect instead of hiding everything behind one similarity score.
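To make the score-and-rank step concrete, here is a minimal sketch of multi-signal pair scoring. The function names are illustrative, not the project's actual API, and only two lexical signals stand in for the six the pipeline actually combines:

```python
import math
import re
from collections import Counter

def _tokens(text):
    # Lowercase alphanumeric tokens; the real pipeline uses richer extraction
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_jaccard(a, b):
    # Set overlap of tokens: |A ∩ B| / |A ∪ B|
    sa, sb = set(_tokens(a)), set(_tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def bow_cosine(a, b):
    # Cosine similarity over raw bag-of-words counts
    ca, cb = Counter(_tokens(a)), Counter(_tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query_text, candidates):
    """Score each query-candidate pair on multiple signals, rank by the mean.

    `candidates` is a list of (id, text) pairs. Returning the per-signal
    breakdown alongside the combined score is what lets a report show
    evidence instead of one opaque number.
    """
    scored = []
    for cid, text in candidates:
        signals = {
            "keyword_jaccard": keyword_jaccard(query_text, text),
            "bow_cosine": bow_cosine(query_text, text),
        }
        combined = sum(signals.values()) / len(signals)
        scored.append((cid, combined, signals))
    return sorted(scored, key=lambda r: r[1], reverse=True)
```

The per-signal dictionary survives into the output on purpose: the ranking can be audited signal by signal rather than trusted as a single similarity score.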
Why I added it
Most of my other projects sit closer to SOC workflows, ATT&CK coverage, and detection visibility. This one adds a different CTI angle. It is more about attribution support, alias clustering, and text-based correlation when infrastructure or telemetry is limited.
It also gave me a chance to work through a problem with more uncertainty than a typical rules-and-alerts project. That part was useful. Attribution work gets sloppy fast if the pipeline cannot explain why it ranked one profile above another.
Results and limits
On the synthetic dataset used in the project, the workflow reached an AUC of 0.950, and the correct match ranked first in 4/4 matchable queries. Those numbers are promising for a lab setting, but they are not the main point.
The part I trust more is the structure:
- multiple scoring signals instead of one embedding similarity
- calibration for p(match) instead of raw confidence theater
- cluster output for alias review
- markdown reports with overlap summaries and evidence tables
The project is upfront about its limits: the data is synthetic, real-world recall would be lower, and a production use case would need stronger abstention logic, more careful lexicons for technical identifiers, and tighter evaluation on realistic samples.
Ethics
All profiles in the project are synthetic. No real people are analyzed, and the workflow is presented as an analyst support tool for authorized CTI research only. That boundary matters here.
Repo note
This project lives inside the broader CTI-Lab repository rather than a standalone repo. The portfolio entry points to the lab repo, and the project README is here:
Objectives
- Explore how text-only CTI attribution workflows can support alias linking and analyst review
- Score likely query-candidate matches using multiple independent signals instead of one model output
- Cluster related aliases to support operator tracking and cross-platform correlation
- Keep the workflow local and synthetic so the project stays safe to test and easy to reproduce
Tools Used
Methodology
- Built a single-run pipeline that extracts entities and keywords, embeds raw and engineered text features, computes stylometric features, and scores each query-candidate pair
- Combined six signals for retrieval: entity overlap, keyword Jaccard, TF-IDF cosine, stylometry cosine, raw embedding cosine, and feature embedding cosine
- Calibrated match probabilities with logistic regression and optional isotonic calibration to make thresholding more practical
- Used HDBSCAN to cluster candidate profiles into likely alias groups for analyst review
- Evaluated the workflow on synthetic profiles and exported pair scores plus per-query markdown reports
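The calibration step can be illustrated with a Platt-style logistic fit over a raw combined score. The project uses logistic regression with optional isotonic calibration; this hand-rolled gradient-descent version is only a dependency-free sketch of the idea, mapping a score to p(match) so a threshold means something:

```python
import math

def fit_platt(scores, labels, lr=0.5, steps=2000):
    """Fit p(match) = sigmoid(a*score + b) by gradient descent on log loss.

    `scores` are raw combined similarities, `labels` are 1 for true matches
    and 0 otherwise. Returns the fitted (a, b).
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s   # gradient of log loss w.r.t. a
            gb += (p - y)       # gradient of log loss w.r.t. b
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def p_match(score, a, b):
    # Calibrated probability for one query-candidate pair
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

The payoff is thresholding: "review pairs above p(match) = 0.8" is an actionable rule for an analyst, while "cosine above 0.73" is not.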
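The project clusters with HDBSCAN; as a dependency-free stand-in with the same output shape (alias groups plus singletons), here is a simple threshold-plus-connected-components grouping. Names and the threshold are hypothetical, and this deliberately swaps HDBSCAN for a much cruder technique just to show what the alias-review output looks like:

```python
def group_aliases(profiles, similarity, threshold=0.6):
    """Group profiles whose pairwise similarity meets `threshold`.

    Union-find over the "similar enough" graph: any chain of above-threshold
    pairs ends up in one alias group for analyst review.
    """
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if similarity(profiles[i], profiles[j]) >= threshold:
                union(i, j)

    groups = {}
    for i in range(len(profiles)):
        groups.setdefault(find(i), []).append(profiles[i])
    return list(groups.values())
```

HDBSCAN earns its place in the real pipeline because it handles varying cluster density and marks noise points explicitly; a flat threshold like this one chains loosely related profiles together far too eagerly on real data.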
Key Findings
- Attribution work is easier to defend when the score comes from several interpretable signals instead of a single black-box similarity number
- Stylometry and content similarity are both useful, but they become more practical when entities and keywords are surfaced as evidence
- Calibration matters because ranking alone does not tell an analyst where confidence should drop off
- Synthetic data is good enough for demonstrating workflow design, but real-world performance would need stricter evaluation and abstention thresholds