The IT Auditor's Field Guide to Auditing AI Systems

Name: AAIA Prep
Author: Dr. Baz Abouelenein

By Dr. Baz Abouelenein (AAIA, CISA, CISM, CRISC, CISSP, PMP) · May 19, 2026 · 16 min read

You have been asked to add AI to the audit plan. The board mentioned it last quarter. Legal is nervous about the EU AI Act. The CIO already deployed three copilots. The CISO is asking who owns model risk. Now it is on your desk, and the existing audit programs — built for ERP, identity, and change management — do not bend cleanly around what a “model” actually is.

This guide is the roadmap I use. Six stages, evidence-first, mapped to the four frameworks that an IT auditor typically has to satisfy when the organization operates in or sells into US/EU markets: ISO/IEC 42001:2023, the NIST AI Risk Management Framework (currently version 1.0), the EU AI Act (Regulation 2024/1689), and the IIA’s 2024 Global Internal Audit Standards. It is built for the auditor who has been handed the assignment and has eight weeks to produce something defensible.

Why most AI audit checklists fail you

The genre has a problem. Search “AI audit checklist” and you get ten pages of the same content: ten or twelve principles (fairness, transparency, accountability, robustness), a paragraph each, and a CTA to a governance platform. They are not wrong. They are at the wrong altitude. An auditor cannot test against a principle. An auditor can test against a control, and can write a finding that says: “Management has not defined the protected attributes for which the candidate-screening model is evaluated; consequently, bias testing performed in Q1 did not cover age or disability status.” That is a control failure tied to a criterion tied to a consequence. Principle-level findings get redlined by management; control-level findings get fixed.

Stage 1 — Build the inventory before you build the program

You cannot audit what you cannot list. The first deliverable is an AI system inventory. Three sources cover most of what you will otherwise miss: vendor-shipped AI features (copilots in Microsoft 365, Salesforce, ServiceNow, Workday), shadow AI use by employees (ChatGPT, Claude, Gemini on personal accounts), and in-house systems with embedded models. Build the intake form: system name and owner, build vs. buy vs. embedded, decisioning vs. recommending, data sensitivity, population affected, and regulator-relevant flag. Populate it from CIO/CISO interviews, procurement records for the past 24 months, and a network/SaaS discovery scan. One detection limit is worth stating explicitly: personal-device use of consumer AI tools is outside the boundary of corporate network and SaaS scans, so address that population through policy attestation and manager interviews. If you have only three rows when you finish, you have not finished Stage 1.
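The intake form described above can be sketched as a small record type. This is a minimal illustration, not a prescribed schema: the class and field names, the three-level sensitivity scale, and the `discovered_via` labels are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Sourcing(Enum):
    BUILD = "build"
    BUY = "buy"
    EMBEDDED = "embedded"


class Role(Enum):
    DECISIONING = "decisioning"
    RECOMMENDING = "recommending"


@dataclass(frozen=True)
class InventoryRecord:
    """One row of the AI system inventory (field names are illustrative)."""
    system_name: str
    owner: str
    sourcing: Sourcing
    role: Role
    data_sensitivity: str     # assumed scale: "low", "moderate", "high"
    population_affected: int
    regulator_relevant: bool
    discovered_via: str       # assumed labels: "interview", "procurement", "scan", "attestation"


def missing_fields(record: InventoryRecord) -> list[str]:
    """Return the names of free-text fields left blank on the intake form."""
    text_fields = ("system_name", "owner", "data_sensitivity", "discovered_via")
    return [name for name in text_fields if not getattr(record, name).strip()]


# Hypothetical row: a vendor-shipped copilot found in procurement records, owner not yet named.
row = InventoryRecord("Sales copilot", "", Sourcing.EMBEDDED, Role.RECOMMENDING,
                      "moderate", 450, False, "procurement")
print(missing_fields(row))
```

A row with a blank owner is worth flagging early, because the named owner is what later evidence requests and findings attach to.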

Stage 2 — Triage to a risk tier

Tier 1 (Deep audit): decisioning systems, high-sensitivity data, regulator-relevant, or population over 10,000 affected (10,000 is a working threshold, not a regulatory bright line; adjust to your organization’s risk appetite). In my experience, Tier-1 engagements run 80–120 hours per system; adjust for scope, system maturity, and evidence availability.

Tier 2 (Control walkthrough): recommender systems with human-in-the-loop, moderate data sensitivity, or internal-only populations. Typical range: 20–40 hours (author estimate; scope-dependent).

Tier 3 (Inventory-only): low-risk experiments, sandboxes, personal-productivity copilots. Confirm they are inventoried and revisit in twelve months.

Document the tiering criteria in the audit plan. If a system qualifies for two tiers depending on use case (for example, a recommender that becomes decisioning in certain contexts, or a vendor-shipped copilot with an in-house fine-tuned layer), assign the higher tier and document the rationale in the audit plan.
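The tiering rules can be expressed as a short function, which also makes the "assign the higher tier" rule mechanical. This is one possible reading of the criteria, not a definitive encoding: the `UseCase` type, its field names, and the choice to treat anything that is neither Tier 1 nor a flagged low-risk experiment as Tier 2 are assumptions made for illustration.

```python
from dataclasses import dataclass

POPULATION_THRESHOLD = 10_000  # working threshold from the text, not a regulatory bright line


@dataclass(frozen=True)
class UseCase:
    """One way a system is used (illustrative fields)."""
    decisioning: bool
    data_sensitivity: str        # assumed scale: "low", "moderate", "high"
    regulator_relevant: bool
    population_affected: int
    low_risk_experiment: bool    # sandbox, experiment, or personal-productivity copilot


def tier_for(use_case: UseCase) -> int:
    """Return 1 (deep audit), 2 (control walkthrough), or 3 (inventory-only)."""
    if (use_case.decisioning
            or use_case.data_sensitivity == "high"
            or use_case.regulator_relevant
            or use_case.population_affected > POPULATION_THRESHOLD):
        return 1
    if use_case.low_risk_experiment:
        return 3
    return 2  # assumption: everything else gets a control walkthrough


def system_tier(use_cases: list[UseCase]) -> int:
    """A system with several use cases takes the highest tier (the lowest number)."""
    return min(tier_for(uc) for uc in use_cases)


# Hypothetical recommender that becomes decisioning in one context.
recommending = UseCase(False, "moderate", False, 800, False)
auto_reject = UseCase(True, "moderate", False, 800, False)
print(system_tier([recommending, auto_reject]))
```

Recording the inputs alongside the resulting tier gives the audit plan the documented rationale the paragraph above asks for.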

Stage 3 — Map the controls across the four frameworks

Four frameworks set the obligation surface: ISO/IEC 42001:2023 (38 controls, Annex A.2–A.10), NIST AI RMF 1.0 (Govern, Map, Measure, Manage), EU AI Act Regulation 2024/1689 (under the Digital Omnibus political agreement of 7 May 2026, stand-alone high-risk AI obligations are now expected to apply from 2 December 2027 and high-risk AI embedded in regulated products from 2 August 2028; provisional, pending formal adoption), and IIA Global Internal Audit Standards 2024 (effective 9 January 2025).

As of May 2026, the IIA has not issued an AI-specific Topical Requirement; the three issued are Cybersecurity (effective 5 February 2026), Third-Party Risk Management (effective 15 September 2026), and Organizational Behavior (effective 15 December 2026); Organizational Resilience is in development, expected 30 April 2027.

Sector-specific regimes layer on top for certain domains: SR 11-7 (Federal Reserve model risk guidance) for financial institutions; Colorado SB 24-205 (algorithmic discrimination; original February 2026 effective date delayed, then stayed by a federal court in April 2026; enforcement suspended as of May 2026, with replacement legislation that would set a January 2027 effective date pending governor signature) and NYC Local Law 144 (automated employment decisions) for HR and hiring AI; and FDA guidance for AI-enabled medical devices.
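Because several of these dates fall inside or just after a typical audit period, it can help to keep them in one place and query them by fieldwork date. The sketch below uses only the dates stated above; the data layout and the function names are assumptions, and the EU AI Act entries are marked provisional exactly as the text describes them. Organizational Resilience is left out because the text gives an expected date without saying whether it is an issuance or an effective date.

```python
from datetime import date

# (obligation, effective date, provisional?) as stated in the text above.
OBLIGATIONS = [
    ("IIA Global Internal Audit Standards 2024", date(2025, 1, 9), False),
    ("IIA Topical Requirement: Cybersecurity", date(2026, 2, 5), False),
    ("IIA Topical Requirement: Third-Party Risk Management", date(2026, 9, 15), False),
    ("IIA Topical Requirement: Organizational Behavior", date(2026, 12, 15), False),
    ("EU AI Act: stand-alone high-risk AI obligations", date(2027, 12, 2), True),
    ("EU AI Act: high-risk AI embedded in regulated products", date(2028, 8, 2), True),
]


def in_effect(as_of: date) -> list[str]:
    """Obligations whose effective date is on or before the given date."""
    return [name for name, effective, _ in OBLIGATIONS if effective <= as_of]


def upcoming(as_of: date) -> list[tuple[str, date, bool]]:
    """Obligations not yet effective, with their provisional flag preserved."""
    return [(name, effective, provisional)
            for name, effective, provisional in OBLIGATIONS
            if effective > as_of]


fieldwork_date = date(2026, 5, 19)
for name in in_effect(fieldwork_date):
    print("in effect:", name)
for name, effective, provisional in upcoming(fieldwork_date):
    note = " (provisional)" if provisional else ""
    print(f"upcoming {effective.isoformat()}{note}: {name}")
```

Anything flagged provisional should be re-checked against the adopted text before it is cited as a criterion in a finding.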

Stage 4 — Write evidence requests that actually work

Replace abstract PBC requests with specific, dated, name-the-artifact requests. Instead of “Provide your AI governance documentation,” write: “For the candidate-screening model (System ID 03), provide the current AI risk assessment, the most recent quarterly review minutes, and the named owner per the RACI as of 2026-03-31.” Name the system, name the artifact, name the date, name the dependency. The PBC request is itself a control-design diagnostic. When an artifact does not exist, document it as ‘control absent’ rather than ‘evidence not provided’ — the distinction matters for finding severity and management response.
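The "name the system, name the artifact, name the date" pattern lends itself to a template. The following is a minimal sketch under stated assumptions: `PBCRequest`, `render`, and `classify_response` are hypothetical names, and the wording of the rendered request is one possible phrasing rather than a required format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PBCRequest:
    """A single evidence request tied to one system and one as-of date."""
    system_name: str
    system_id: str
    artifacts: tuple[str, ...]
    as_of: date

    def render(self) -> str:
        items = "; ".join(self.artifacts)
        return (f"For the {self.system_name} (System ID {self.system_id}), "
                f"provide, as of {self.as_of.isoformat()}: {items}.")


def classify_response(artifact_exists: bool, artifact_received: bool) -> str:
    """Separate a missing control from a missing delivery."""
    if not artifact_exists:
        return "control absent"
    if not artifact_received:
        return "evidence not provided"
    return "evidence received"


request = PBCRequest(
    system_name="candidate-screening model",
    system_id="03",
    artifacts=(
        "the current AI risk assessment",
        "the most recent quarterly review minutes",
        "the named owner per the RACI",
    ),
    as_of=date(2026, 3, 31),
)
print(request.render())
print(classify_response(artifact_exists=False, artifact_received=False))
```

Keeping the classification separate from the request makes it harder to record a nonexistent artifact as a late one.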

Stage 5 — Test what matters

This stage is process-level by design: it confirms controls are designed, documented, and operating as described. Substantive testing of model outputs is a separate engagement scope requiring ML engineering access and is not covered here.

Six test areas:

1. Model cards: confirm one exists, dated, signed off; compare stated thresholds to production.
2. Drift monitoring: ask for the actual dashboard, not the policy.
3. Bias and fairness testing: are protected attributes defined, tested at stated cadence, disparities escalated?
4. Human-in-the-loop verification: walk it with an operator; watch three real cases.
5. Vendor AI: SOC 2 plus any AI-specific addendum. If no addendum exists, request the vendor’s AI governance policy, model card, and contractual commitments on training-data use and model change notification; absence of these is itself a finding against your vendor risk management program.
6. Logging retention: EU AI Act Article 12 for logging design; Article 26(6) for ≥6-month deployer retention.

The strongest engagements pull from all six.

For generative AI systems (LLM-based copilots, summarizers, code generators, and retrieval-augmented generation pipelines), add four test areas on top of the standard six to address their specific risk profile: hallucination rate (request the benchmark methodology and most recent results; if none exists, that is a finding), prompt injection resistance, output toxicity filtering, and training-data memorization controls for in-house or fine-tuned models.
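Two of the test areas above reduce to simple comparisons once the evidence is in hand: model-card thresholds versus production settings, and the age of the oldest retained log versus the six-month minimum. The sketch below is illustrative only. The function names, the threshold names in the example, and the dictionary format are assumptions, and the retention check only makes sense for a system that has been in operation for at least six months.

```python
from datetime import date
from typing import Optional


def threshold_mismatches(
    model_card: dict[str, float],
    production: dict[str, float],
    tolerance: float = 1e-9,
) -> dict[str, tuple[float, Optional[float]]]:
    """Map each model-card threshold that is missing or different in production
    to a (stated, actual) pair; actual is None when production has no such setting."""
    mismatches: dict[str, tuple[float, Optional[float]]] = {}
    for name, stated in model_card.items():
        actual = production.get(name)
        if actual is None or abs(actual - stated) > tolerance:
            mismatches[name] = (stated, actual)
    return mismatches


def whole_months_between(start: date, end: date) -> int:
    """Count completed calendar months from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months


def meets_retention(oldest_log: date, as_of: date, minimum_months: int = 6) -> bool:
    """True when the oldest retained log is at least minimum_months old."""
    return whole_months_between(oldest_log, as_of) >= minimum_months


# Hypothetical evidence: threshold names and values are made up for illustration.
card = {"approval_score": 0.70, "drift_alert": 0.20}
prod = {"approval_score": 0.65}
print(threshold_mismatches(card, prod))
print(meets_retention(date(2026, 1, 15), date(2026, 5, 19)))
```

A mismatch is a prompt for a question, not a finding by itself: the model card may be stale, or production may have changed without sign-off, and those lead to different control failures.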

Stage 6 — Write findings that move

A finding that gets closed is written against a control with a named owner, a clear criterion, and a remediation the owner can execute. The sample finding covers insufficient definition of protected attributes for fairness testing: condition observable, criterion cited (ISO/IEC 42001:2023 Annex A.5), cause specific, consequence bounded, recommendation executable. Findings that read like opinion columns get pushed back to draft.
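The elements named above (owner, condition, criterion, cause, consequence, recommendation) can be held in a small structure that refuses to call a finding complete while any element is blank. This is a sketch, not a required template: the `Finding` type and `blank_elements` helper are hypothetical, the condition and consequence reuse the candidate-screening example from earlier in this guide, and the owner is a placeholder. Cause and recommendation are deliberately left empty because they depend on fieldwork.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Finding:
    """A control-level finding; every element must be filled before issue."""
    control_owner: str
    condition: str
    criterion: str
    cause: str
    consequence: str
    recommendation: str


def blank_elements(finding: Finding) -> list[str]:
    """Return the names of elements that are still empty."""
    return [f.name for f in fields(finding) if not getattr(finding, f.name).strip()]


draft = Finding(
    control_owner="Model owner (placeholder)",
    condition=("Management has not defined the protected attributes for which "
               "the candidate-screening model is evaluated."),
    criterion="ISO/IEC 42001:2023 Annex A.5",
    cause="",  # to be established in fieldwork
    consequence="Bias testing performed in Q1 did not cover age or disability status.",
    recommendation="",  # to be agreed with the control owner
)
print(blank_elements(draft))
```

A draft that still lists blank elements is the kind that reads like an opinion column; filling each one forces the finding back onto a control.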

Seven mistakes IT auditors make on AI

1. Treating AI as a single auditable thing.
2. Auditing the policy instead of the system.
3. Asking for bias testing without scoping the protected attributes.
4. Missing the third-party AI.
5. Confusing observability with explainability.
6. Citing ISO 42001 clauses without mapping to a control.
7. Writing findings against principles, not controls.
