
WitFoo opens 100-million-record cyber attack dataset

Mon, 20th Apr 2026

WitFoo has released the Precinct 6 Cybersecurity Dataset as open source. It contains 100 million structured, labelled security event records drawn from live attack traffic.

Created in collaboration with the University of Canterbury, the collection is intended to support research in cybersecurity, artificial intelligence, and data science.

The release marks a major increase in scale from WitFoo's earlier 2-million-record dataset. The new collection is 50 times larger and is based on activity observed in production environments rather than controlled test systems.

The records were derived from attack traffic seen over a two-month period in 2024. The data was sanitised to protect the organisations involved while preserving the timing, structure and patterns of the underlying attacks.

Four Subsets

The dataset is divided into four parts. The Signals subset contains 100 million normalised security events from sources including syslog, Windows Security Auditing, VPC flow logs and endpoint telemetry.

Each record includes fields such as timestamps, network metadata, hostnames, usernames, actions, severity levels, vendor codes and sanitised message content. Two further subsets, Graph Edges and Graph Nodes, map relationships between hosts, users, processes and network connections.
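As a rough illustration of how the Graph Edges subset might be used, the sketch below groups edges into an adjacency map so the neighbours of a user or host can be looked up. The `GraphEdge` class, its field names and the sample values are assumptions made for this example and are not taken from the published schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphEdge:
    # Hypothetical field names; the published schema may differ.
    source: str
    target: str
    relation: str


def build_adjacency(edges):
    """Map each source node to the set of (relation, target) pairs it links to."""
    adjacency = defaultdict(set)
    for edge in edges:
        adjacency[edge.source].add((edge.relation, edge.target))
    return dict(adjacency)


# Made-up sample edges for illustration only, not records from the dataset.
sample_edges = [
    GraphEdge("user:alice", "host:web-01", "logged_on"),
    GraphEdge("host:web-01", "host:db-01", "connected_to"),
    GraphEdge("user:alice", "host:db-01", "logged_on"),
]
adjacency = build_adjacency(sample_edges)
```

A map of this kind is one simple starting point for graph-based analysis, such as tracing which hosts a single account touched.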

The Incidents subset adds correlated security incidents with binary classification labels, confidence scores, MITRE ATT&CK technique and tactic mappings, suspicion scores and security orchestration lifecycle stage metadata. WitFoo said this structure makes the data suitable for machine learning research and threat hunting.
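The sketch below shows one way labelled incidents of this kind could be prepared for a machine learning experiment: keep only incidents above a confidence threshold, then count how often each technique appears. The `Incident` class, its field names, the 0.8 threshold and the sample values are all assumptions for illustration; only the technique identifiers follow the public MITRE ATT&CK numbering.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical field names; the published schema may differ.
    incident_id: str
    is_malicious: bool   # binary classification label
    confidence: float    # assumed to lie in the range 0.0 to 1.0
    techniques: tuple    # MITRE ATT&CK technique IDs


def filter_by_confidence(incidents, min_confidence=0.8):
    """Keep incidents whose confidence is at or above the threshold."""
    return [i for i in incidents if i.confidence >= min_confidence]


def count_techniques(incidents):
    """Count how many incidents reference each technique ID."""
    return Counter(t for i in incidents for t in i.techniques)


# Made-up sample incidents for illustration only, not records from the dataset.
sample_incidents = [
    Incident("inc-1", True, 0.95, ("T1110", "T1078")),
    Incident("inc-2", False, 0.40, ("T1046",)),
    Incident("inc-3", True, 0.85, ("T1110",)),
]
confident = filter_by_confidence(sample_incidents)
technique_counts = count_techniques(confident)
```

The threshold is a design choice: a higher value gives cleaner labels at the cost of fewer training examples.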

WitFoo has also published the sanitisation codebase as open source, allowing researchers and reviewers to inspect how the dataset was produced and how sensitive information was removed.

Research Use

Open cybersecurity datasets often rely on simulated or lab-based activity, which researchers have long argued can limit the performance of detection models in operational settings. Precinct 6 differs in that it reflects real adversary behaviour recorded as attacks unfolded against production systems.

The data can be used for intrusion detection, anomaly detection, graph-based threat detection, research on automated incident response, benchmarking, log reduction studies, and teaching. It is available under an Apache 2.0 licence and is free for academic, commercial and government use.

"For a decade, WitFoo ran over 4,000 experiments with Fortune 500 companies, universities, and government agencies to develop Empathetic Processing. This dataset is the product of that research, and we believe it belongs in the hands of the academic community," said Charles Herring, Chairman and Co-Founder, WitFoo.

He contrasted the release with other commonly used sources of cybersecurity training data.

"Most publicly available cybersecurity datasets were generated in lab environments with scripted attacks and synthetic traffic. That's useful for basic benchmarking, but it doesn't teach you what real adversaries actually look like in a production network. This data comes from live attack traffic observed in 2024. The attackers didn't know they were being recorded, and they weren't following a script. We've sanitised the data to protect the organisations involved, and we've published the sanitisation code itself as open source so researchers can verify exactly how we did it. Cybersecurity's biggest bottleneck isn't compute or clever algorithms. It's the lack of realistic data that researchers can actually train against," said Herring.

"One of the persistent challenges in cybersecurity research is that most available datasets are either synthetic or derived from controlled laboratory exercises, which limits how well models trained on them generalise to real-world conditions," said Dr Etienne Borde, Associate Professor in Computer Science and Software Engineering, University of Canterbury.

He said the scale and provenance of the release could broaden the scope of work available to researchers and students.

"A dataset of this scale built from live attack traffic is genuinely rare. It opens up research pathways that simply weren't feasible before, from graph-based threat modelling to evaluating AI-driven detection systems against authentic adversary behaviour. We look forward to incorporating this resource into our research and teaching programmes at Canterbury," said Borde.
