CISO Survey: Growing Concern Over AI Agent Risks

Top AI and Cybersecurity news you should check out today

Welcome Back to The AI Trust Letter

Once a week, we distill the most critical AI & cybersecurity stories for builders, strategists, and researchers. Let’s dive in!

🤖 Most CISOs Fear AI Agents, Few Have a Plan

The Story:

Our new industry report on The State of AI Agent Security reveals that while 73% of CISOs believe AI agents introduce new security and compliance risks, only 30% say their organizations are adequately prepared to address them.

The details:

  • Most security leaders agree that AI agents (tools that can take autonomous actions or interact with systems) have already entered enterprise workflows without sufficient oversight.

  • The top concerns cited include data leakage through third-party integrations, prompt injection leading to unauthorized access, and lack of visibility into agent decisions.

  • Nearly half of the respondents said their organizations have no clear policies governing AI agent behavior or permissions.

  • The report also found that organizations with AI-specific security frameworks in place are twice as likely to detect and mitigate agent misuse early.

Why it matters:

AI agents are moving from pilot tests to production environments faster than governance frameworks can keep up. Without visibility, access control, and continuous monitoring, even well-intentioned agents can create entry points for attackers or disrupt critical workflows.

Building readiness will require coordination between security, compliance, and AI development teams—and a focus on detection and containment, not just prevention.
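
To make "clear policies governing AI agent behavior or permissions" concrete, here is a minimal, hypothetical sketch of a permission gate that sits between an agent and the tools it may call. The policy schema, tool names, and audit log are illustrative assumptions, not anything prescribed by the report.

```python
# Hypothetical sketch: a minimal permission gate between an AI agent and its tools.
# The policy schema and tool names are illustrative, not from the report.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class AgentPolicy:
    allowed_tools: set[str]                                   # tools the agent may invoke
    require_approval: set[str] = field(default_factory=set)   # tools needing human sign-off

class PermissionDenied(Exception):
    pass

class GatedAgent:
    def __init__(self, policy: AgentPolicy, tools: dict[str, Callable[..., Any]]):
        self.policy = policy
        self.tools = tools
        self.audit_log: list[dict] = []                        # every attempted call is recorded

    def call_tool(self, name: str, **kwargs) -> Any:
        decision = (
            "denied" if name not in self.policy.allowed_tools
            else "needs_approval" if name in self.policy.require_approval
            else "allowed"
        )
        self.audit_log.append({"tool": name, "args": kwargs, "decision": decision})
        if decision == "denied":
            raise PermissionDenied(f"Agent is not permitted to call {name!r}")
        if decision == "needs_approval":
            raise PermissionDenied(f"{name!r} requires human approval before execution")
        return self.tools[name](**kwargs)

# Example: the agent may read tickets but must never call the refund tool on its own.
agent = GatedAgent(
    AgentPolicy(allowed_tools={"read_ticket"}),
    tools={"read_ticket": lambda ticket_id: f"ticket {ticket_id}",
           "issue_refund": lambda amount: f"refunded {amount}"},
)
print(agent.call_tool("read_ticket", ticket_id=42))   # allowed and logged
```

Even a gate this simple delivers two of the controls the report flags as missing: an explicit allowlist of permitted actions and an audit trail of everything the agent attempted.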

👀 Cisco Finds Leading AI Models Vulnerable to Jailbreaks

The Story:

A new Cisco study found that several popular open-weight AI models from Meta, Google, OpenAI, Microsoft, and Mistral are highly susceptible to multi-turn jailbreak techniques. These vulnerabilities allow attackers to bypass safety rules and manipulate models into producing restricted outputs.

The details:

  • Cisco tested models including Meta’s Llama 3.3, Google’s Gemma 3-1B, OpenAI’s GPT-OSS-20B, and Mistral Large-2.

  • Success rates for jailbreaks ranged from 26% (Gemma) to 93% (Mistral), showing major variance in model alignment and safety layers.

  • Multi-turn jailbreaks, which unfold over a sequence of prompts, proved more effective than single-turn methods.

  • Models that prioritize developer flexibility, such as Meta’s, were more exposed than those with stricter alignment safeguards.

  • Researchers warned that compromised models could leak sensitive data, enable misinformation, or disrupt integrated systems.

Why it matters:

Open-weight models make innovation more accessible, but they also expose organizations to security and compliance risks when deployed without proper guardrails. As jailbreak methods evolve, continuous red teaming and layered defenses are becoming essential for any team adopting open or fine-tuned models.
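
As a rough illustration of what continuous red teaming against multi-turn jailbreaks can look like, the sketch below replays a scripted sequence of user turns against any chat-style model and flags where the conversation slips past a refusal. The `ask_model` callable, the scenario prompts, and the keyword-based refusal heuristic are placeholders, not Cisco's actual methodology.

```python
# Minimal sketch of a multi-turn red-team loop. `ask_model` is any callable that
# takes a list of chat messages and returns the assistant's reply as a string;
# the scenarios and refusal heuristic below are illustrative placeholders.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")

def looks_like_refusal(reply: str) -> bool:
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def run_multi_turn_probe(ask_model: Callable[[list[dict]], str],
                         turns: list[str]) -> dict:
    """Feed a scripted sequence of user turns and record where the model complies."""
    messages: list[dict] = []
    findings = []
    for i, user_turn in enumerate(turns, start=1):
        messages.append({"role": "user", "content": user_turn})
        reply = ask_model(messages)
        messages.append({"role": "assistant", "content": reply})
        findings.append({"turn": i, "refused": looks_like_refusal(reply)})
    # From the attacker's view, the probe "succeeds" if any turn gets compliance.
    return {"bypassed": any(not f["refused"] for f in findings), "per_turn": findings}

# Usage with a stand-in model that caves on the third turn:
fake_model = lambda msgs: "I can't help with that." if len(msgs) < 5 else "Sure, here is how..."
print(run_multi_turn_probe(fake_model, ["benign setup", "reframed ask", "escalated ask"]))
```

Running probes like this on a schedule, against every model version you deploy, is one practical way to turn "continuous red teaming" from a slogan into a regression test.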

🚨 Google Detects First Use of AI Models in Active Malware Campaigns

The Story:

Google’s Threat Intelligence Group (GTIG) has uncovered the first known case of large language models (LLMs) being deployed in real malware operations. The discovery marks a shift from proof-of-concept attacks to live use, where AI models help generate and disguise malicious code during execution.

The details:

  • The malware families, named PROMPTFLUX and PROMPTSTEAL, use LLMs “just-in-time” to create and modify scripts in real time, helping evade detection.

  • PROMPTSTEAL, linked to Russian state-backed actors, queries the Hugging Face API and uses the Qwen2.5-Coder-32B-Instruct model to generate system commands dynamically.

  • PROMPTSTEAL poses as an image generation tool but secretly collects and exfiltrates system data.

  • GTIG researchers found that LLMs can also be socially engineered, tricked by attackers impersonating researchers or students to gain elevated access.

  • While the techniques are still experimental, Google notes that they represent a “significant step toward autonomous, adaptive malware.”

Why it matters:

This marks a new frontier in cyber operations, where attackers leverage AI in real time to alter their tactics mid-execution. Security teams must prepare for malware that learns, rewrites, and conceals itself dynamically. The finding underscores the urgency of integrating AI-aware detection and continuous model behavior monitoring into modern threat defense strategies.
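
One concrete angle on "AI-aware detection," offered as an assumption rather than anything GTIG prescribes: PROMPTSTEAL-style malware has to reach a hosted model API at runtime, so unexpected egress to LLM inference endpoints is itself a signal. The sketch below checks observed (process, destination) pairs against an allowlist; the domains, process names, and log format are illustrative.

```python
# Illustrative sketch: flag processes that contact LLM inference endpoints without
# being on an allowlist. Domains, process names, and record format are assumptions.
LLM_API_DOMAINS = {"api-inference.huggingface.co", "generativelanguage.googleapis.com",
                   "api.openai.com"}
ALLOWED_PROCESSES = {"chrome.exe", "code.exe", "approved-ai-gateway"}

def flag_suspicious_egress(connections: list[dict]) -> list[dict]:
    """connections: [{'process': str, 'dest_host': str}, ...] from your EDR or netflow source."""
    return [c for c in connections
            if c["dest_host"] in LLM_API_DOMAINS
            and c["process"] not in ALLOWED_PROCESSES]

# Example: an "image tool" quietly calling the Hugging Face inference API stands out.
observed = [
    {"process": "chrome.exe", "dest_host": "api.openai.com"},
    {"process": "image_viewer.exe", "dest_host": "api-inference.huggingface.co"},
]
for hit in flag_suspicious_egress(observed):
    print(f"ALERT: {hit['process']} contacted {hit['dest_host']}")
```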

⛓️‍💥 Most AI Safety Measures Can Be Bypassed in Minutes, Study Finds

The Story:

A new study has found that the majority of safety systems built into popular AI models can be bypassed in under five minutes using publicly available jailbreak methods. The research tested several major commercial and open-weight models, exposing how fragile existing guardrails remain even under basic adversarial pressure.

The details:

  • Researchers from EPFL and University College London evaluated more than a dozen widely used models, including those from OpenAI, Anthropic, Google, and Meta.

  • Simple prompt engineering techniques and rephrased instructions were often enough to override restrictions on harmful or sensitive content.

  • In some cases, models continued to refuse certain outputs but leaked restricted data through indirect responses.

  • Jailbreak success rates ranged between 60% and 95%, depending on the model and domain.

  • The study suggests that safety filters designed at the model layer alone are not sufficient without layered controls or continuous retraining.

Why it matters:

The findings highlight how current safety architectures—especially those relying solely on static prompt filtering—fail to stop determined attackers or even casual users from exploiting models. As generative AI becomes integrated into critical systems, the focus must shift from reactionary patching to dynamic, multi-layered protection that continuously adapts to new bypass techniques.
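
A hedged sketch of what a "layered" control beyond the model's own filter can look like: an independent check applied to the model's output before it reaches the user, so a jailbroken generation still has to clear a second gate. The keyword screen below is deliberately naive and stands in for a dedicated moderation model or policy classifier.

```python
# Minimal sketch of a second, model-independent output gate. The `generate` callable
# stands in for any LLM call; the keyword screen is a naive placeholder for a real
# moderation classifier.
from typing import Callable

BLOCKED_TOPICS = ("explosive synthesis", "credential dumping", "malware payload")

def output_gate(reply: str) -> bool:
    """Return True if the reply is safe to release under this toy policy."""
    lowered = reply.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)

def guarded_generate(generate: Callable[[str], str], prompt: str) -> str:
    reply = generate(prompt)            # model-layer safety has already been applied here
    if not output_gate(reply):          # independent post-generation check
        return "This response was withheld by an output policy check."
    return reply

# Usage with a stand-in model:
print(guarded_generate(lambda p: "Here is a weather summary.", "What's the weather?"))
```

The point is not the keyword list; it is that the second check fails independently of the first, so a single successful jailbreak no longer decides what reaches the user.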

🔒 Microsoft Uncovers “Whisper Leak” Attack on Encrypted AI Chats

The Story:

Microsoft researchers have revealed a new side-channel attack called Whisper Leak, capable of identifying the topics of AI chatbot conversations by analyzing encrypted traffic patterns. The attack shows that even when communications are protected by HTTPS, adversaries monitoring network traffic can infer what users are discussing with surprising accuracy.

The details:

  • Whisper Leak targets streaming large language models (LLMs), which send responses token by token instead of all at once.

  • Attackers observing packet size and timing can train classifiers to match these patterns to specific topics such as finance or politics.

  • Microsoft tested the method on models from OpenAI, Mistral, xAI, DeepSeek, and Alibaba, with classifiers reaching over 98% accuracy in identifying topics.

  • Models from Google and Amazon were somewhat more resistant due to token batching but still not immune.

  • OpenAI, Microsoft, and Mistral have since introduced mitigations, such as adding random text sequences to mask packet lengths.

Why it matters:

This research highlights how data leakage risks can emerge even in encrypted AI interactions, not from the content itself but from its transmission patterns.

Organizations using LLMs for sensitive tasks should assume metadata-level exposure is possible and apply network-level protections, traffic obfuscation, and model-side randomization to safeguard conversations.
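
To make the mitigation idea concrete: the side channel works because packet sizes track token lengths, so breaking that correspondence breaks the classifier's features. Providers reportedly do this by adding random text sequences to responses; the sketch below shows a simpler variant of the same idea, padding each streamed chunk up to a fixed bucket size at a hypothetical proxy layer. The bucket sizes and framing are assumptions, not any vendor's actual scheme.

```python
# Simplified sketch of length-obfuscating a token stream: pad each outgoing chunk up to
# a bucket size so packet sizes no longer mirror token lengths. Bucket sizes and the
# framing are illustrative assumptions, not any provider's actual mitigation.
import secrets
from typing import Iterator

BUCKETS = (64, 128, 256)  # candidate on-the-wire frame sizes, in bytes

def pad_chunk(token_bytes: bytes) -> bytes:
    target = next((b for b in BUCKETS if b >= len(token_bytes) + 2), len(token_bytes) + 2)
    padding = secrets.token_bytes(target - len(token_bytes) - 2)
    # A 2-byte big-endian length prefix lets the receiver strip the padding.
    return len(token_bytes).to_bytes(2, "big") + token_bytes + padding

def unpad_chunk(wire_bytes: bytes) -> bytes:
    length = int.from_bytes(wire_bytes[:2], "big")
    return wire_bytes[2:2 + length]

def obfuscated_stream(tokens: Iterator[str]) -> Iterator[bytes]:
    for tok in tokens:
        yield pad_chunk(tok.encode("utf-8"))

# Usage: short and long tokens now leave the proxy in identically sized frames.
for frame in obfuscated_stream(iter(["Hi", " there", ", how can I help?"])):
    print(len(frame), unpad_chunk(frame).decode())
```

Padding trades a little bandwidth for removing the size signal; timing jitter or token batching (as Google and Amazon already do) addresses the timing half of the channel.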

What's next?

Thanks for reading! If this brought you value, share it with a colleague or post it to your feed. For more curated insight into the world of AI and security, stay connected.