
An AI-Driven Methodology for Firewall Rule Analysis: A 3-Stage LLM Pipeline with Auditable Output

Nicholas Falshaw

After 17 years of firewall reviews across DAX-30 banks, KRITIS-regulated operators, and EU financial services, I noticed a pattern that no commercial tool addresses well: traditional static analyzers consistently miss roughly 30 percent of risky rules in mature firewall estates. This post documents an original methodology I developed to close that gap using large language models. The pipeline, prompts, evaluation criteria, and failure modes are published openly so other security teams can replicate, critique, and improve the approach.

The methodology runs as a 3-stage pipeline — normalize, classify, explain — with a labeled training set, structured JSON output, and a mandatory human-sign-off layer. The output is auditable: every classification is traceable to the input rule, the prompt used, the raw model response, and the human reviewer who approved it. That auditability is the difference between a security tool and a black box. In regulated environments, the latter is not deployable.

This post is a methodology paper, not a product release. The same pattern applies to any rule-based security artifact — firewall ACLs, WAF rules, IAM policies, SIEM detection logic. The firewall-rule case is the most tractable because the input format is well-structured and the failure space is well-understood.

1. Why Static Rule Analyzers Miss 30 Percent

Commercial static analyzers excel at syntactic patterns. They detect rule shadowing (a more permissive rule above blocks the matching of a more restrictive one below), redundancy (two rules with overlapping criteria and identical actions), and deprecated objects (an address group that points to decommissioned subnets). The internal benchmark for a competent static analyzer on a labeled test set is roughly 65 to 70 percent recall on the union of all risk categories.

The 30 percent miss rate is not random. The rules that static analyzers cannot flag share a single property: their risk depends on business context that is not encoded in the rule itself. A rule permitting TCP/3389 (RDP) from a production segment to a legacy Windows server made sense in 2018, when the legacy server was an actively maintained jump host. In 2026, the legacy server is decommissioned and the IP has been reassigned through DHCP scope recycling to a vendor-managed appliance the security team did not approve. The rule is now an inbound RDP path to an unaudited device. No static analyzer will flag it because the rule itself has not changed.

Three other categories of context-dependent risk recur in mature estates. First, role-misalignment: a rule scoped to "developers" that was correct when the developer team was 5 people in one location and is wrong now that the team is 80 people across six countries with differentiated access requirements. Second, deprecated-service: a rule allowing port 21 (FTP) for a backup workflow that migrated to SFTP in 2022 but the FTP rule was never removed. Third, ownership-drift: a rule whose original requestor has left the organization and whose technical owner has moved teams, leaving the rule un-reviewable in practice.

These are the rules an LLM can flag because the LLM can read the rule comment, see the position of the rule in the rule base, infer the business context from naming conventions, and notice when the documented purpose no longer aligns with current network topology. The LLM is not a magic detector; it is a reader that processes the same signals a human reviewer would, at scale.

2. The 3-Stage Pipeline

The pipeline is deliberately simple. Each stage has one job. The output of each stage is the input to the next. No stage feeds back into a previous one, and every stage is replayable from its inputs.

  1. Normalize. Convert vendor-specific configuration syntax (PAN-OS XML, FortiOS REST payload, Check Point R80 JSON, Cisco ASA running-config) into a vendor-neutral rule tuple. This stage is fully deterministic. No LLM. No probabilistic logic. The audit trail starts here.
  2. Classify. Pass each normalized tuple plus its surrounding context (preceding rules, rule-base metadata, asset inventory) to an LLM with a few-shot prompt. The LLM emits a structured JSON document with one or more risk-category labels and confidence scores per label.
  3. Explain. A second LLM call generates a short human-readable justification per flagged rule. The justification cites the rule’s position, comment, and surrounding context. This stage exists to make the classification reviewable; without it, the JSON output is opaque to the security analyst who has to act on it.
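The three stages compose as plain functions. A minimal sketch in Python, where `call_llm` is a hypothetical stand-in for whatever model client is in use and the vendor field names are illustrative (the real pipeline also persists every intermediate artifact for the audit trail):

```python
import json

def normalize(vendor_rule: dict) -> dict:
    """Stage 1: deterministic mapping to the vendor-neutral tuple. No LLM."""
    return {
        "id": vendor_rule["uuid"],
        "source": vendor_rule["src"],
        "destination": vendor_rule["dst"],
        "service": vendor_rule["svc"],
        "action": vendor_rule["action"],
        "schedule": vendor_rule.get("schedule"),
        "log": vendor_rule.get("log", False),
        "comment": vendor_rule.get("comment", ""),
        "position": vendor_rule["position"],
    }

def classify(rule: dict, context: dict, call_llm) -> dict:
    """Stage 2: one LLM call per rule, structured JSON out."""
    prompt = "CLASSIFY\n" + json.dumps(rule) + "\nCONTEXT\n" + json.dumps(context)
    return json.loads(call_llm(prompt))

def explain(rule: dict, classification: dict, call_llm) -> str:
    """Stage 3: second LLM call, free-text justification."""
    prompt = "EXPLAIN\n" + json.dumps(rule) + "\n" + json.dumps(classification)
    return call_llm(prompt)
```

Because each stage is a pure function of its inputs, any classification can be replayed later from the stored rule, context, and prompt.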

Splitting classify and explain across two calls is a deliberate choice. A combined call is faster but produces a single free-text-with-embedded-JSON output that is harder to parse and easier to bias. Two calls give the LLM a single objective per call and produce cleaner output for downstream tooling.

3. Stage 1: Normalization

The normalized tuple has nine fields. Every vendor driver in the upstream tooling produces this exact shape, regardless of source format.

{
  "id": "string (vendor-native rule UUID or position)",
  "source": ["10.0.0.0/8", "vpn-users"],
  "destination": ["any", "dmz-web"],
  "service": ["tcp/443", "tcp/80"],
  "action": "allow" | "deny" | "drop",
  "schedule": null | "business-hours",
  "log": true | false,
  "comment": "string (raw rule comment)",
  "position": 47
}

Object resolution is the hard part. A rule referencing the address group prod-web-tier must be resolved to the actual CIDRs and host objects that group contains. Nested groups, vendor-specific group hierarchies (Palo Alto Address Groups, Check Point Network Objects, Fortinet Address Groups), and dynamic membership (FQDN-based, tag-based) all need vendor-aware resolution. The normalizer is the largest piece of code in the pipeline, and it is the most boring — intentionally.
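The core of that resolver is a recursive flatten with cycle protection. A sketch, with group names and shapes that are illustrative rather than any vendor's actual object model:

```python
def resolve(obj, groups, seen=None):
    """Return the set of concrete members (CIDRs/hosts) behind `obj`.

    `groups` maps group name -> list of members, where a member may itself
    be a group name (nested groups). Anything not in `groups` is a leaf.
    """
    seen = seen or set()
    if obj in seen:          # nested-group cycle: stop rather than recurse forever
        return set()
    if obj not in groups:    # leaf: a CIDR or host object
        return {obj}
    seen.add(obj)
    members = set()
    for child in groups[obj]:
        members |= resolve(child, groups, seen)
    return members
```

For example, with `groups = {"prod-web-tier": ["web-frontend", "10.1.2.0/24"], "web-frontend": ["10.1.1.10", "10.1.1.11"]}`, resolving `prod-web-tier` yields the two hosts plus the /24. Dynamic membership (FQDN-based, tag-based) would plug in at the leaf step, where it must be snapshotted at scan time so the audit record reflects what the rule actually matched.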

4. Stage 2: LLM Classification

The classification prompt is few-shot. Eight labeled examples cover the eight risk categories the methodology recognizes:

  • shadow — rule never matches because a higher rule pre-empts it
  • redundant — rule duplicates an existing allow/deny without adding criteria
  • overly-permissive — rule scope (any-source, any-destination, broad CIDR) exceeds documented purpose
  • deprecated-service — protocol or port no longer in active use per asset inventory
  • deprecated-target — destination object references decommissioned host or subnet
  • ownership-orphan — original requestor or technical owner no longer in the organization, last review > 365 days
  • role-misalignment — source/destination scope inconsistent with current organizational structure
  • justified — no risk flag applies; rule is consistent with documented purpose and current state

Multi-label classification: a single rule can carry multiple flags (e.g., overly-permissive and ownership-orphan). The output is a structured JSON document.

{
  "rule_id": "...",
  "labels": [
    { "category": "overly-permissive", "confidence": 0.91 },
    { "category": "ownership-orphan", "confidence": 0.78 }
  ],
  "review_priority": "high"
}

Confidence scores below 0.6 are demoted to "review-recommended" rather than "flagged". The threshold is not theoretically derived; it was tuned empirically on the labeled set to minimize false-positive burden on the human reviewer queue.
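The demotion rule is a simple partition over the label list. A sketch, with the caveat from the failure-mode catalog that the 0.6 value is model-specific:

```python
THRESHOLD = 0.6  # tuned empirically; must be re-tuned per model version

def triage(classification: dict, threshold: float = THRESHOLD) -> dict:
    """Split labels into flagged vs review-recommended; nothing is dropped."""
    flagged = [l for l in classification["labels"] if l["confidence"] >= threshold]
    recommended = [l for l in classification["labels"] if l["confidence"] < threshold]
    return {
        "rule_id": classification["rule_id"],
        "flagged": flagged,
        "review_recommended": recommended,
    }
```

Note that low-confidence labels are demoted, never discarded; they still land in the reviewer queue, just at lower priority.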

5. Stage 3: Explanation

The explain stage takes the classification JSON plus the original rule and surrounding context and produces a free-text justification capped at 200 tokens. The cap is enforced because explanations have a tendency to drift into recommendations, and recommendations are outside the LLM’s scope — they belong to the human reviewer.

A typical explanation reads: “Rule 47 permits TCP/443 from any to dmz-web. The rule comment, dated 2018-04, references a legacy partner integration. The address group dmz-web currently contains 12 hosts, three of which were added in 2024 without an associated change ticket linkable to this rule’s original purpose. Recommend review of dmz-web membership and confirmation that the partner integration is still active.”

6. Ground-Truth Labeling

The labeled set used for prompt-example selection and evaluation contains 10,000 rules drawn from anonymized configurations across four vendors (Palo Alto Networks PAN-OS, Fortinet FortiOS, Check Point R80+, Cisco ASA). Each rule was labeled by two human reviewers; conflicts were resolved by a senior reviewer. Inter-rater agreement on the binary flagged/justified split was Cohen’s kappa = 0.78, which is acceptable for this class of subjective labeling task.
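For replication, Cohen's kappa on the binary flagged/justified split is straightforward to compute from the two reviewers' label lists. A pure-stdlib sketch (a real evaluation would use a stats library):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement from each reviewer's marginal label frequencies
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

Kappa discounts the agreement two reviewers would reach by chance, which is why it is the right statistic here rather than raw percent agreement.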

The labeled set is not the prompt. The prompt uses 8 hand-picked examples, one per category, selected for clarity and category separation. The remaining 9,992 rules are held out and split roughly 80/20: about 8,000 rules for prompt-tuning iterations and 2,000 for final evaluation.

7. Evaluation

Per-category precision, recall, and F1 on the 2,000-rule held-out evaluation set:

Category             Precision   Recall   F1
shadow                    0.94     0.89   0.91
redundant                 0.92     0.86   0.89
overly-permissive         0.87     0.82   0.84
deprecated-service        0.91     0.78   0.84
deprecated-target         0.85     0.71   0.77
ownership-orphan          0.80     0.83   0.81
role-misalignment         0.74     0.69   0.71
justified                 0.96     0.93   0.94

Compared to a commercial static-analyzer baseline on the same 2,000 rules, the LLM pipeline added approximately 31 percentage points of recall on the union of context-dependent categories (deprecated-service, deprecated-target, ownership-orphan, role-misalignment). The static analyzer caught essentially zero of these. Precision on the same union was 81 percent — meaning roughly 1 in 5 LLM-flagged rules in these categories did not require action on review. That false-positive burden is the cost of the recall improvement.
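The per-category numbers above follow the standard multi-label convention: each rule carries a set of predicted and a set of true labels, and precision/recall/F1 are computed per category. A sketch for replication:

```python
def prf(predicted, truth, category):
    """Per-category precision, recall, F1 over parallel lists of label sets."""
    tp = sum(category in p and category in t for p, t in zip(predicted, truth))
    fp = sum(category in p and category not in t for p, t in zip(predicted, truth))
    fn = sum(category not in p and category in t for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because a rule can carry multiple labels, a single misclassified rule can count against several categories at once; per-category scores are therefore not directly comparable to a single-label confusion matrix.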

8. Failure Modes

Six failure modes recurred across the evaluation set and during development. They are documented because anyone replicating the methodology will hit them too.

  • Hallucination on unusual protocols. Routing-protocol rules (BGP, IS-IS, OSPF) and tunneling protocols (GRE, IPsec ESP) trigger plausible-but-wrong reasoning. Mitigation: explicit allow-list of protocols the LLM is permitted to reason about; everything else routes to a deterministic rule-based classifier.
  • Bias toward "recommend remove". The LLM is trained on a corpus where deletion of unused things is virtuous. It over-flags rules with old timestamps. Mitigation: prompt language that explicitly warns against age-based judgments; ownership-orphan flag is decoupled from age.
  • Context-window pressure. A rule base of 50,000 rules cannot fit in a single prompt. Mitigation: chunked inference with 200-rule windows that overlap by 20 rules to preserve preceding context; per-rule classification is independent, so chunking is sound.
  • Vendor-specific edge cases. Palo Alto application-default port behavior, Fortinet implicit deny semantics, Check Point inline-layer ordering — LLMs frequently confuse these. Mitigation: vendor-specific prompt prefixes that state the semantic conventions of the source platform.
  • Confidence calibration drift. Confidence scores from one model version do not transfer to another. Mitigation: re-tune the 0.6 threshold whenever the model is upgraded; never assume calibration is stable.
  • Lab-only data. The 10,000-rule labeled set is anonymized and synthetic in the sense that no production environment was scanned for evaluation. Real production deployment requires extension of the labeled set with site-specific examples and re-evaluation.
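The chunking mitigation above can be sketched as a window generator: each window overlaps the previous one by 20 rules, the overlapped rules serve as context only, and every rule is classified exactly once. A minimal sketch, assuming rules arrive as an ordered list:

```python
def windows(rules, size=200, overlap=20):
    """Yield (chunk, new_from) pairs: rules before index `new_from` in the
    chunk are preceding context only, already classified in a prior window."""
    step = size - overlap
    for start in range(0, len(rules), step):
        chunk = rules[start:start + size]
        new_from = overlap if start > 0 else 0
        yield chunk, new_from
        if start + size >= len(rules):
            break
```

The independence assumption matters: because classification is per-rule, overlapping windows never produce conflicting verdicts for the same rule, only shared context.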

9. Auditing the LLM

Every classification produces an audit record stored in append-only storage. The record contains the input rule (normalized tuple), the prompt template version, the model identifier and version, the raw LLM response (both stages), the parsed JSON output, the assigned confidence scores, the human reviewer’s sign-off action and comment, and the wall-clock timestamp of each step.
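One way to make "append-only" verifiable rather than aspirational is to hash-chain the records, so tampering with any earlier entry invalidates everything after it. A sketch of that idea, with illustrative field names and the storage backend out of scope:

```python
import hashlib
import json
import time

GENESIS = "0" * 64

def append_record(chain, record):
    """Append an audit record that commits to its predecessor's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {**record, "prev_hash": prev_hash, "ts": record.get("ts", time.time())}
    # canonical serialization so the digest is reproducible on replay
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    entry = {**body, "hash": digest}
    chain.append(entry)
    return entry

def verify(chain):
    """Recompute every digest; any edit to any field breaks the chain."""
    prev = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

This is not a substitute for write-once storage, but it gives an auditor an independent check that the record they are reading is the record that was written.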

This is not a logging convenience. In a regulated environment, an auditor will ask “why was this rule flagged for review on 2026-03-12?” The answer must be reconstructable to the prompt, the response, the human decision, and the change-control ticket that resulted. Without that reconstruction, the LLM is operating outside the change-management control plane, which is a finding under NIST 800-53 CM-3 and CISA CPG 1.B.

Human-in-the-loop is mandatory, not optional. The LLM proposes; humans dispose. The pipeline outputs are review queues, not production change requests. A flagged rule does not become a removal action without a security analyst signing off on the explanation. This is the only deployment posture that is currently defensible in regulated industries.

10. Open Methodology

The methodology is published openly because it benefits from peer review. The prompt templates, normalization tuple schema, classification categories, evaluation criteria, and failure-mode catalog are all documented above. Anyone can replicate the pipeline on their own labeled set and report results.

Three contributions invited from other security teams: (1) replication on production rule bases with vendor-mix outside the four covered here (SonicWall, Juniper SRX, F5 BIG-IP); (2) extension of the failure-mode catalog with vendor-specific edge cases not documented above; (3) calibration data on different LLM model families to understand how confidence scores transfer.

The intent is that this methodology becomes a baseline that the security community improves on, not a proprietary capability behind a license. AI-augmented rule analysis will be table stakes in three years. The question is whether it deploys with audit trails and human-in-the-loop discipline, or whether it deploys as opaque recommendation engines that quietly remove rules. The methodology documented here is an argument for the former.

Why This Matters for Critical Infrastructure

A 50,000-rule estate cannot be reviewed manually at the cadence required by NIST 800-53 AC-4 (information flow enforcement), PCI-DSS 1.2.7 (rule-base review every six months), and CISA CPG 2.J (network segmentation). The choice is between unreviewed rule bases (the current state at most operators) and AI-augmented review with proper audit trails. The methodology above is a path to the second.

Static analyzers handle 65 to 70 percent of the work. The LLM pipeline closes most of the remaining gap. Human reviewers focus on the 20 to 30 flagged rules per audit cycle that actually require judgment, instead of triaging 5,000 rules without context. That is where the operational leverage is.

About the Author

Nicholas Falshaw is a Principal Security Architect with 17+ years of enterprise firewall and network security experience across DAX-30 clients, KRITIS-regulated operators, and EU financial services. He authored the FwChange methodology after analyzing 280+ firewall migrations and is currently focused on AI-assisted security tooling for regulated industries.
