This article walks through the architecture, the AI-specific extensions to traditional API security, and the operational controls that distinguish a mature serving platform from one that has been built incrementally without security in mind.
Authentication: who is calling
Authentication for model serving draws on the same options as any other API: API keys, signed JSON Web Tokens (JWTs), mutual Transport Layer Security (mTLS), Open Authorization 2.0 (OAuth) bearer tokens, and cloud-platform-native identity systems (AWS Identity and Access Management roles, Azure Managed Identities, Google Cloud Service Accounts). The choice depends on the calling context. Internal service-to-service traffic should use mTLS or platform-native identity; external partner traffic typically uses OAuth or signed JWTs; rate-limited public access for trial users typically uses API keys. The novelty for model serving is that the authentication decision feeds two downstream controls — the authorization decision and the rate-limit decision — both of which depend on knowing the caller’s identity at fine granularity.
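As a concrete illustration of the JWT option, a minimal sketch of gateway-side token verification might look like the following; the issuer, audience, and claim names are assumptions, and the PyJWT library stands in for whatever the platform already uses.

```python
# Minimal sketch: verify a signed JWT at the serving gateway before any
# inference work happens. Issuer, audience, and claim names are illustrative.
import jwt  # PyJWT

TRUSTED_ISSUER = "https://idp.example.internal"   # assumed internal identity provider
EXPECTED_AUDIENCE = "model-serving"               # assumed audience claim for this platform

def authenticate(token: str, public_key: str) -> dict:
    """Return the verified caller identity, or raise if it cannot be established.

    The identity returned here feeds the two downstream controls: the
    authorization decision and the per-caller rate-limit decision.
    """
    try:
        claims = jwt.decode(
            token,
            public_key,
            algorithms=["RS256"],        # pin the algorithm; never accept "none"
            audience=EXPECTED_AUDIENCE,
            issuer=TRUSTED_ISSUER,
        )
    except jwt.InvalidTokenError as exc:
        # Zero trust: reject regardless of the network the request arrived from.
        raise PermissionError(f"authentication failed: {exc}") from exc
    return {
        "subject": claims["sub"],
        "tenant": claims.get("tenant"),
        "scopes": claims.get("scope", ""),
    }
```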
The NIST AI Risk Management Framework Cybersecurity profile https://www.nist.gov/itl/ai-risk-management-framework specifies that AI systems must enforce authentication on inference endpoints with the same rigor as any other production system. ISO/IEC 42001:2023 Annex A.6 https://www.iso.org/standard/81230.html requires AI Management System operators to apply identity and access management to AI components.
A common failure pattern is the deployment of model serving infrastructure on internal networks under the assumption that “internal traffic is trusted” — an assumption that has been wrong since the first compromised laptop and is wrong now. Zero-trust principles apply to model serving: authenticate every request regardless of origin, verify the authentication independently of the network position, and never confuse network connectivity with authorization.
Authorization: what they are allowed to ask the model to do
Authorization for model serving is more nuanced than for traditional APIs because the action a caller is requesting is contextual to the model’s domain. A call to a fraud-detection model is asking for a risk score on a specific transaction; the authorization question is whether the caller is allowed to score transactions for the merchant the transaction belongs to. A call to a recommendation model is asking for product suggestions for a specific user; the authorization question is whether the caller is allowed to act on behalf of that user. A call to a generative model is asking for content generation; the authorization question is whether the caller’s policy allows the type of content being requested.
Authorization for AI serving therefore requires three levels of policy. Endpoint-level policy decides which callers can call which models at all. Tenant-level policy decides which data partitions a caller can access through any model. Action-level policy decides which specific operations a caller can request — which classes of input the model will accept, which output paths the response will take, which downstream actions the response can trigger. The OWASP Top 10 for Large Language Model Applications https://owasp.org/www-project-top-10-for-large-language-model-applications/ catalogs Excessive Agency as LLM06 — the failure mode in which an LLM is granted authority to take actions the caller’s policy did not authorize — and the cure is rigorous action-level authorization at the boundary, not within the model’s reasoning.
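To make the three levels concrete, a minimal sketch follows; the in-memory policy tables are illustrative stand-ins for whatever policy store the platform actually uses.

```python
# Minimal sketch: deny-by-default authorization at endpoint, tenant, and action
# level. The policy tables and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    caller: str
    model: str
    tenant: str
    action: str  # e.g. "score_transaction", "generate_text"

ENDPOINT_POLICY = {"svc-checkout": {"fraud-scoring-v3"}}            # caller -> callable models
TENANT_POLICY = {"svc-checkout": {"merchant-001", "merchant-002"}}  # caller -> data partitions
ACTION_POLICY = {"svc-checkout": {"score_transaction"}}             # caller -> permitted operations

def authorize(req: InferenceRequest) -> None:
    """Check each level in turn and raise on the first failure."""
    if req.model not in ENDPOINT_POLICY.get(req.caller, set()):
        raise PermissionError("endpoint-level: caller may not invoke this model")
    if req.tenant not in TENANT_POLICY.get(req.caller, set()):
        raise PermissionError("tenant-level: caller may not access this data partition")
    if req.action not in ACTION_POLICY.get(req.caller, set()):
        raise PermissionError("action-level: caller may not request this operation")
```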
The reference architecture for authorization is the externalized policy decision: an authorization service or sidecar that the serving stack consults for every consequential decision, rather than if-statements embedded in the inference code. This architecture supports the audit trail (Article 13), lets policy evolve without redeploying the model, and enables the centralized, risk-based policy decisions that mature platforms eventually require.
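As a sketch of what consulting an externalized policy decision point can look like, the following assumes an OPA-style sidecar reachable over local HTTP; the address, document path, and input shape are assumptions for illustration.

```python
# Minimal sketch: delegate the authorization decision to a policy sidecar
# instead of embedding if-statements in the inference code. The sidecar URL
# and document path are assumptions.
import requests

POLICY_URL = "http://127.0.0.1:8181/v1/data/serving/allow"  # assumed sidecar address

def is_authorized(caller: str, model: str, tenant: str, action: str) -> bool:
    decision = requests.post(
        POLICY_URL,
        json={"input": {"caller": caller, "model": model,
                        "tenant": tenant, "action": action}},
        timeout=0.05,  # the check sits on the request path, so keep it fast
    )
    decision.raise_for_status()
    # Fail closed: anything other than an explicit allow is a deny.
    return decision.json().get("result") is True
```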
Rate limiting: how often, how much, and at what cost
Rate limiting is the third leg of the security stool, and the one that AI serving stresses in the most distinctive ways. Rate limits serve four purposes.
Abuse prevention. Rate limits constrain the volume an individual caller can extract from the system, capping the harm any single compromised credential can cause. Per-account, per-IP, and aggregate rate limits are all required.
Extraction-attack resistance. As discussed in Article 4, model-extraction attacks require many queries; rate limits raise the attacker’s cost. Rate limits should be calibrated against legitimate-usage baselines and should be tighter for high-value models.
Cost containment. Inference is expensive — measured in dollars per request for large models, and in some cases dollars per token of input or output. Rate limits prevent a runaway client (compromised, buggy, or malicious) from incurring unbounded cost on the operator’s bill. Cost-aware rate limits use units the cost model cares about (tokens, compute-seconds, dollars) rather than just request counts; a sketch after this list illustrates the approach.
Fairness across tenants. Multi-tenant serving stacks must prevent a single tenant from starving others of capacity. Rate limits at the tenant level and traffic-shaping policies that prioritize service tiers fairly are required for production multi-tenant operation.
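One way to implement the cost-aware limit mentioned above is a per-caller token bucket denominated in model tokens rather than request counts; the sketch below assumes illustrative capacities and refill rates.

```python
# Minimal sketch: a per-caller token bucket measured in model tokens, so the
# limit tracks the unit the cost model cares about. Capacity and refill rate
# are illustrative assumptions.
import time
from collections import defaultdict

CAPACITY = 100_000        # maximum model tokens a caller may hold in reserve
REFILL_PER_SECOND = 500   # sustained model tokens per second per caller

_buckets = defaultdict(lambda: [CAPACITY, time.monotonic()])  # caller -> [level, last_refill]

def admit(caller: str, estimated_tokens: int) -> bool:
    """Return True if the request fits in the caller's budget, False to reject (HTTP 429)."""
    bucket = _buckets[caller]
    level, last = bucket
    now = time.monotonic()
    level = min(CAPACITY, level + (now - last) * REFILL_PER_SECOND)
    if estimated_tokens > level:
        bucket[:] = [level, now]
        return False
    bucket[:] = [level - estimated_tokens, now]
    return True
```

The same structure applies per tenant and in aggregate; only the key and the budget change.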
The European Union’s AI Act, Article 15 https://artificialintelligenceact.eu/article/15/, requires high-risk AI systems to be resilient against attempts to disrupt or manipulate the system; rate limiting is a primary control for the disruption case. The Gartner AI TRiSM framework https://www.gartner.com/en/articles/gartner-top-strategic-technology-trends-for-2024 treats AI-specific rate limiting as a defining capability of mature AI gateway tooling.
AI-specific extensions to the serving stack
Beyond the traditional triad, secure model serving requires three AI-specific capabilities; a sketch after the three descriptions shows them working together around a single inference call.
Input validation specific to the model. The serving stack should reject inputs that the model is not expected to handle — out-of-schema requests, unsupported content types, requests that exceed the model’s context window — before the inference engine sees them. Out-of-distribution detection (Article 2) and prompt-injection detection (Article 3) integrate as input-validation layers.
Output handling specific to the model. The serving stack should treat model outputs as untrusted, validate them against the expected schema, run content-policy checks where applicable, and route policy-violating outputs to handling paths the application designed for. Action-gating ensures that any output that triggers a downstream action passes through the authorization layer with the original caller’s credentials, not the model’s.
Inference logging for audit and detection. Every inference request and response is logged with sufficient fidelity to support audit (Article 13) and security monitoring (Articles 13, 14). The log includes the authenticated caller, the input (or its hash, where the input is sensitive), the model version, the output (or its hash), the policy decisions that were taken, and the rate-limit and validation outcomes. Inference logs are the source-of-truth artifact for both compliance evidence and incident response.
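A minimal sketch of the three capabilities wired around a single inference call follows; the request schema, response schema, model handle, and log destination are all illustrative assumptions.

```python
# Minimal sketch: input validation, output handling, and inference logging
# around one inference call. Schema fields and the model interface are assumed.
import hashlib
import json
import logging
import time

logger = logging.getLogger("inference.audit")

MAX_INPUT_TOKENS = 8_192                   # assumed context-window budget
EXPECTED_OUTPUT_KEYS = {"label", "score"}  # assumed response schema

def _digest(payload: dict) -> str:
    # Hash sensitive payloads so the log can attest to what was sent without storing it.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def serve(caller: dict, request: dict, model, model_version: str) -> dict:
    # Input validation: reject what the model is not expected to handle.
    if not isinstance(request.get("features"), dict):
        raise ValueError("out-of-schema request")
    if request.get("token_count", 0) > MAX_INPUT_TOKENS:
        raise ValueError("request exceeds the model's context window")

    started = time.time()
    output = model.predict(request["features"])  # assumed model interface

    # Output handling: treat the model's output as untrusted.
    if not EXPECTED_OUTPUT_KEYS.issubset(output):
        raise RuntimeError("model output failed schema validation")

    # Inference logging: the source-of-truth record for audit and detection.
    logger.info(json.dumps({
        "caller": caller.get("subject"),
        "tenant": caller.get("tenant"),
        "model_version": model_version,
        "input_sha256": _digest(request),
        "output_sha256": _digest(output),
        "latency_ms": round((time.time() - started) * 1000),
        "validation": "pass",
    }))
    return output
```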
NIST SP 800-218A https://csrc.nist.gov/pubs/sp/800/218/a/final prescribes input validation, output handling, and logging as required Secure Software Development Framework practices for AI systems. The MITRE ATLAS knowledge base https://atlas.mitre.org/ catalogs the attacks that each of these controls defends against.
Maturity Indicators
Foundational. Inference endpoints are deployed without authentication or with a single shared credential. Authorization is implicit (the network position is the authorization). No rate limiting exists or rate limits are applied uniformly without regard to caller, tenant, or model value. Inputs and outputs are not validated. Inference logs are not retained or are retained without sufficient fidelity for audit.
Applied. Inference endpoints require authentication. Per-account rate limits exist. Inputs are schema-validated. Outputs are returned with content-type and structure verified. Inference logs are retained at least for short-term operational use. Authorization is enforced but may still be coarse-grained (endpoint-level only).
Advanced. Authentication, authorization, and rate limiting are externalized into a serving gateway that every inference request passes through. Authorization decisions are made at endpoint, tenant, and action levels with externalized policy. Rate limits are calibrated against legitimate-usage baselines and tightened for high-value models. Inference logs include sufficient fidelity for both audit and security analytics. Out-of-distribution and prompt-injection detection are integrated as input-validation layers.
Strategic. The serving platform is a first-class governance surface. Policy decisions are auditable, reversible, and consumable by enterprise risk reporting. Rate-limit decisions reflect cost models and fairness across tenants. Output handling integrates with downstream authorization so that LLM-driven actions are re-authorized at the boundary. The platform itself is audited on a regular schedule by external specialists. Red-team exercises (Article 11) include attempts to bypass each layer of the serving stack.
Practical Application
A team operating an inference endpoint without a serving gateway should adopt one this quarter. Several mature options exist — Kong, Envoy with authorization extensions, AWS API Gateway, Azure API Management, Google Apigee, or AI-specific gateways from the AI TRiSM vendor space. The gateway centralizes the authentication, authorization, and rate-limiting controls so they can be configured, audited, and evolved without changing the inference code.
Once the gateway is in place, the priorities are: enforce per-account authentication; configure per-account and aggregate rate limits calibrated against the previous quarter’s legitimate usage; add structured input validation; emit inference logs into the SIEM (Article 13); and externalize the authorization policy into a separate service so the policy can evolve without redeploying the gateway. These steps in sequence convert a serving endpoint from an undefended attack surface into a controllable platform asset.
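For the baseline-calibration step, a minimal sketch that derives per-account hourly token limits from the previous quarter's usage is shown below; the record shape and the headroom multiplier are assumptions.

```python
# Minimal sketch: set per-account limits at the 99th percentile of observed
# legitimate usage plus headroom. Record shape and multiplier are assumptions.
import statistics
from collections import defaultdict

def calibrate_limits(usage_records, headroom=1.5):
    """usage_records: iterable of (account_id, tokens_used_in_hour) tuples."""
    per_account = defaultdict(list)
    for account, tokens in usage_records:
        per_account[account].append(tokens)

    limits = {}
    for account, samples in per_account.items():
        if len(samples) >= 100:
            p99 = statistics.quantiles(samples, n=100)[98]  # ~99th percentile
        else:
            p99 = max(samples)  # too little history: fall back to the observed peak
        limits[account] = int(p99 * headroom)
    return limits
```

Limits set this way stay invisible to legitimate callers while a runaway or extraction-style client hits them quickly.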
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.