
AI Agent Workflows as GPU Architecture: A Parallel Computing Analogy

I’m taking a Parallel Systems class and trying to understand GPU architecture through the lens of my own AI agent workflows. I already think in terms of serial versus parallelizable work because I constantly try to spread as much of my development work as possible across AI agents. So I know how hard it is to break down tasks and redesign them to maximize what can run in parallel, along with all the pitfalls and pain points that come with it. As I learn about GPU design principles, it turns out the two have a decent bit in common.

The Core Analogy: CPU ↔ Me, GPU ↔ AI Agents

CPU = The Human Engineer (Me)

The CPU handles complex, sequential, often unpredictable work. It has strong single-thread performance but only a few cores. It excels at orchestrating, making high-level decisions, and resolving control hazards.

My role in the workflow:

  • I’m the “single-threaded bottleneck”
  • I’m great at strategy, creative leaps, and ambiguity resolution
  • But I’m slower at grinding through thousands of repetitive subtasks
  • I handle the hard sequential work: strategy, decomposition, resolving ambiguity

GPU = My Pool of AI Agents

The GPU has thousands of simple cores, great at massively parallel, regular, well-defined workloads. It needs clear, uniform instructions to shine (SIMD).

My AI agents:

  • Thrive when given well-scoped, parallelizable tasks (e.g., “generate test cases for 100 functions”)
  • Struggle with irregular/ambiguous problems (like high-level feature scoping)
  • Need lots of throughput, not complex single-thread reasoning
  • Do the massively parallel work: repetition, structured generation, large batch subtasks

I’ve explained to others that I think of AI agents as being really really good at very specific, well-defined tasks, and the better you can break up your work into a composition of these well-defined, specific tasks, the better they can be parallelized.

The Pipeline: Workflow as Execution Pipeline

In CPUs/GPUs, pipelines turn instructions into throughput, but stalls happen if stages are empty or blocked.

In my work:

  • The pipeline is the flow of tasks between me and my agents
  • A stall = when my agents are idle, waiting for me to resolve ambiguity or feed them new specs
  • Keeping the “pipeline full” = always ensuring there’s a backlog of parallelizable subtasks for the agents (a minimal sketch of this follows below)
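
To make the “keep the pipeline full” idea concrete, here’s a minimal Python sketch. The `run_agent` coroutine is a hypothetical stand-in for however work actually gets dispatched to an agent; the point is just that a pre-filled backlog keeps the workers from stalling while they wait on me.

```python
import asyncio

# Hypothetical stand-in for however a pre-scoped subtask gets dispatched to an agent.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.1)              # simulated agent latency
    return f"done: {task}"

async def agent_worker(name: str, backlog: asyncio.Queue) -> None:
    while True:
        task = await backlog.get()        # an empty backlog here = a pipeline stall
        print(name, await run_agent(task))
        backlog.task_done()

async def main() -> None:
    backlog: asyncio.Queue = asyncio.Queue()
    # Keep the pipeline full: pre-scope a batch of subtasks before any agent goes idle.
    for i in range(12):
        backlog.put_nowait(f"write tests for module_{i}")

    workers = [asyncio.create_task(agent_worker(f"agent-{n}", backlog)) for n in range(4)]
    await backlog.join()                  # all queued work has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(main())
```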

Hazards and Stalls ↔ Bottlenecks in My Workflow

Memory Stall

GPU: Waiting for data from memory
My workflow: Agents waiting for missing context, unclear specs, or inaccessible data

Control Stall

GPU: Waiting for branch resolution
My workflow: Agents paused because I haven’t decided which design direction to take

Fix: Just as GPUs hide latency with massive multithreading, I hide my bottlenecks by pre-scoping and queuing enough agent-friendly work. While some agents wait for me to update them, others keep running in the background, which hides the latency caused by my need to manually intervene.

Task Decomposition ↔ Load Balancing

GPU efficiency: Depends on dividing work into equally sized, parallel tasks (e.g., each warp gets a uniform job).

My workflow:

  • If I give agents 5 tiny tickets and 1 giant one, most sit idle—just like GPU cores left underutilized while one straggler thread grinds through an oversized chunk of work.
  • If I break the giant task into sub-tasks that are all generally evenly scoped and can be done in parallel, all agents can stay busy—just like keeping all GPU cores fed.

This also implies that tasks parallelize best when they don’t depend on each other or on previous tasks. The more work that can run in parallel and be collected later, the better. In the GPU case, that means launching a single batch of work across the warps and collecting all the results at once, rather than a staggered cycle of some computation on the warps, a collection step back to the CPU, more computation on the warps, and so on. In the AI agent case, setting up 5 tasks so they can all run in parallel is far superior to, say, 4 tasks that can run in parallel plus 1 task that can’t start until one of the others finishes.
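
A tiny sketch of the load-balancing difference, using made-up file names: with 5 agents, an uneven split leaves four of them nearly idle, while an even split of the same 40 files keeps everyone fed.

```python
# A hypothetical "giant" ticket: refactor 40 modules, with 5 agents available.
files = [f"src/module_{i}.py" for i in range(40)]
agents = 5

# Uneven split: one agent gets almost everything, the rest get a file each and go idle.
uneven = [files[:36]] + [[f] for f in files[36:]]

# Even split: each agent gets a similarly sized, independent slice of the work.
even = [files[i::agents] for i in range(agents)]

print("uneven:", [len(s) for s in uneven])   # [36, 1, 1, 1, 1]
print("even:  ", [len(s) for s in even])     # [8, 8, 8, 8, 8]
```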

CPU–GPU Transfers ↔ Human–Agent Handoffs

Data transfer overhead: Moving data between CPU and GPU is slow (PCIe bottleneck).

My analogy: Context-switching between me and the agents costs time:

  • Writing specs
  • Explaining requirements
  • Validating outputs and giving further instructions.

Optimization principle: Minimize back-and-forth by:

  • Bundling instructions (batching tasks; see the sketch after this list)
  • Defining interfaces so agents can run longer without intervention.
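
Here’s a minimal sketch of what bundling looks like in practice. The module path and function names are hypothetical; the point is that one well-specified handoff replaces several small ones, each of which would cost me spec-writing and validation overhead.

```python
# Hypothetical functions in a module I want covered by tests.
functions = [f"parse_{name}" for name in ("csv", "json", "yaml", "xml")]

# Many tiny handoffs: I pay prompting / validation overhead once per function.
per_function_prompts = [f"Write unit tests for {fn} in utils/parsers.py" for fn in functions]

# One bundled handoff: a single spec the agent can work through without me in the loop.
bundled_prompt = (
    "Write unit tests for every function in utils/parsers.py.\n"
    "Cover: " + ", ".join(functions) + ".\n"
    "Follow the existing pytest conventions; don't ask for clarification per function."
)

print(f"{len(per_function_prompts)} handoffs vs. 1 handoff")
```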

Expanded GPU Primitives in My Workflow

Threads, Warps, and Blocks

GPU primitive:

  • Thread = smallest unit of execution
  • Warp (NVIDIA) = group of 32 threads scheduled together in lockstep
  • Block = collection of warps that share local resources

My workflow:

  • Thread = one micro-task assigned to an agent (“write a single set of unit tests”)
  • Warp = a batch of agents working on uniform subtasks with the same control flow (“generate all test cases at once, across each of these 10 folders”)
  • Block = an entire project component (e.g., frontend module), subdivided into agent batches that share a local “context” (design spec, shared dataset)

Lesson: Organize agents into batches of uniform work to avoid “warp divergence” (agents spinning off into different, incompatible workflows).
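
One way I think about this mapping is as a simple data model. This is just an illustrative sketch (the class names and fields are mine, not any agent framework’s API): a micro-task per agent, homogeneous batches of micro-tasks, and a component that carries the shared context its batches rely on.

```python
from dataclasses import dataclass, field

@dataclass
class MicroTask:          # "thread": one micro-task for one agent
    prompt: str

@dataclass
class AgentBatch:         # "warp": uniform micro-tasks with the same control flow
    kind: str             # e.g. "write-tests"
    tasks: list[MicroTask]

@dataclass
class Component:          # "block": a project component with shared local context
    name: str
    shared_context: str   # design spec / dataset shared by its batches
    batches: list[AgentBatch] = field(default_factory=list)

frontend = Component(
    name="frontend",
    shared_context="design_spec.md",
    batches=[AgentBatch("write-tests", [MicroTask(f"tests for components/{i}") for i in range(10)])],
)
print(frontend.name, len(frontend.batches[0].tasks), "uniform micro-tasks in one batch")
```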

Warp Divergence

GPU primitive: Within a warp, all threads execute the same instruction. If some take a branch (“if” statement) and others don’t, execution serializes → performance drop.

My workflow: If I give multiple agents slightly different subtasks that require different reasoning paths (“5 write tests, 5 write docs, 5 refactor code”), it’s tougher to inspect their results. The analogy breaks down slightly here: the agents themselves can still do well even if some are doing code development and some are doing code review. It picks back up once you consider that I, as the individual, have to inspect their results. Reviewing the work of 5 agents is much harder if all 5 are doing different things (some docs, some code development, some tests), so a team of 5 agents (a “warp”) that are all doing the same thing (“writing tests”) is much easier to review.

Optimization: Group tasks of the same type per batch of agents → keep warps coherent.
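
A minimal sketch of that grouping, with a hypothetical mixed backlog: sort by task type first, then dispatch each homogeneous group as its own “warp” of agents.

```python
from itertools import groupby

# Hypothetical mixed backlog: uniform batches are easier to review than mixed ones.
backlog = [
    ("tests", "auth module"), ("docs", "auth module"), ("tests", "billing module"),
    ("refactor", "api client"), ("docs", "billing module"), ("tests", "api client"),
]

# Sort by task type, then dispatch each homogeneous group as one "warp" of agents.
for kind, group in groupby(sorted(backlog), key=lambda t: t[0]):
    batch = [target for _, target in group]
    print(f"warp '{kind}': {batch}")
```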

Shared Memory vs. Global Memory

GPU primitive:

  • Shared memory: small, fast memory accessible by threads in the same block
  • Global memory: huge but slow, accessible by all threads

My workflow:

  • Shared memory: context/resources I provide for a specific group of agents (e.g., shared prompt, dataset slice)
  • Global memory: general documentation or knowledge base accessible to all agents

Optimization: Use shared memory (tight, task-specific context packets) when I want fast coordination. Avoid every agent pulling from the “global doc” (slow, as it takes time for them to find the right context, determine relevance, and make sure it’s applicable to whatever they’re doing).
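
Sketching the idea with hypothetical context names: each batch gets a small, task-specific packet instead of being pointed at the entire knowledge base, and the global docs are only referenced as a fallback.

```python
# Hypothetical global knowledge base: huge, and slow for an agent to sift through.
global_docs = {"architecture.md": "...", "style_guide.md": "...", "api_reference.md": "..."}

def context_packet(batch_kind: str) -> dict:
    """Build a small, task-specific 'shared memory' packet for one batch of agents."""
    if batch_kind == "write-tests":
        return {"conventions": "pytest, one file per module", "fixtures": "tests/conftest.py"}
    if batch_kind == "write-docs":
        return {"tone": "concise", "template": "docs/_template.md"}
    # Fall back to pointing at the global docs only when a batch really needs them.
    return {"see": list(global_docs)}

print(context_packet("write-tests"))
```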

Occupancy

GPU primitive: the fraction of the hardware’s warp slots that are actually occupied by active warps, i.e., how busy the hardware is executing threads. Low occupancy = wasted GPU potential.

My workflow: % of agents that are productively working. If only 2/6 are busy, occupancy = 33%.

Optimization: Break down big tasks to improve load balancing, keep occupancy high.
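
The bookkeeping is trivial, but writing it down makes the 2/6 example concrete (the agent names and statuses here are made up):

```python
# Hypothetical status snapshot of my agent pool.
agents = {"a1": "busy", "a2": "busy", "a3": "idle", "a4": "idle", "a5": "idle", "a6": "idle"}

busy = sum(status == "busy" for status in agents.values())
occupancy = busy / len(agents)
print(f"occupancy: {occupancy:.0%}")   # 2/6 busy -> 33%, a sign to split the big task further
```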

Memory Coalescing

GPU primitive: Threads in a warp should access consecutive memory locations for bandwidth efficiency. Random access = bandwidth waste.

My workflow: Agents should consume data in aligned, contiguous ways.

Good: “Each agent writes tests for each of the 10 files in its folder, in the order specified in the instructions.”
Bad: “Each agent follows the instructions in a random order.”
Here, the analogy for my workflow slightly breaks down.

Kernel Launch Overhead

GPU primitive: Every GPU kernel launch has overhead—launching too many tiny kernels wastes time.

My workflow: Every time I hand off a tiny micro-task to agents, I pay overhead (prompting, context prep, my own context switching, validation).

Optimization: Batch tasks into larger kernels. For example, it’s easier to say “write all tests for this module” than to hand off “write a test for this function” over and over: I don’t have to context switch and recall what each function does, and I can instead point the agent at the entire module.
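
Some rough, made-up numbers show how quickly the overhead dominates when every function is its own handoff versus one batched handoff for the whole module:

```python
# Rough, hypothetical numbers: per-handoff overhead (prompting, context prep, validation)
# versus the agent's actual working time per function.
overhead_min = 5          # minutes I spend per handoff
work_min = 3              # minutes of agent work per function
functions = 20

per_function = functions * (overhead_min + work_min)   # one tiny "kernel" per function
per_module = overhead_min + functions * work_min       # one batched "kernel" for the module
print(per_function, "min vs.", per_module, "min")      # 160 vs. 65
```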

Streams & Concurrency

GPU primitive: Multiple streams allow independent kernel execution and overlap between computation and memory transfers.

My workflow:

  • Stream A: agents generate docs
  • Stream B: agents generate tests
  • I overlap their work instead of serializing them.

Optimization: Run independent task streams in parallel to maximize utilization, rather than bottlenecking everything behind one type of task. For example, having multiple agents means that Stream A, which writes docs, doesn’t have to wait for Stream B, which writes tests, to finish.
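
A minimal asyncio sketch of two independent streams, again with a hypothetical `run_agent` stand-in: the docs stream and the tests stream overlap instead of one waiting for the other.

```python
import asyncio

# Hypothetical stand-in for dispatching one task to an agent.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.1)
    return f"done: {task}"

async def docs_stream() -> list[str]:
    return [await run_agent(f"write docs for module_{i}") for i in range(3)]

async def tests_stream() -> list[str]:
    return [await run_agent(f"write tests for module_{i}") for i in range(3)]

async def main() -> None:
    # The two streams are independent, so they overlap instead of serializing.
    docs, tests = await asyncio.gather(docs_stream(), tests_stream())
    print(len(docs), "doc tasks and", len(tests), "test tasks finished concurrently")

asyncio.run(main())
```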

Synchronization Barriers

GPU primitive: Threads in a block may need to synchronize before moving on (e.g., after partial sum computation).

My workflow: After parallel subtasks (e.g., agents generating partial datasets), I must pause and validate before proceeding to the next stage (e.g., aggregating results). Agents can’t keep going until I’ve actually validated their results. That matters most across multiple agents: if they’re working on things that relate to each other, I have to synchronize their work before they continue.

Optimization: Insert synchronization points only when necessary. With too many synchronization barriers (touchpoints where you have to manually sync what the agents are doing), you end up with too many idle agents, because the number of times an agent wraps up its work and sits waiting for you goes up. With too few, you’re not checking their work enough, so you might see race conditions (AI agents tripping over each other) or quality issues.
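
Here’s a small sketch of a single barrier, with a hypothetical `run_agent` and a deliberately trivial `validate` step standing in for my manual review: stage 2 (aggregation) can’t start until every stage-1 partial has come back and passed the check.

```python
import asyncio

# Hypothetical stand-in for one agent generating a partial dataset.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.1)
    return f"partial dataset from {task}"

def validate(results: list[str]) -> bool:
    # Stand-in for my manual review: trivially cheap here, but this is where agents wait on me.
    return all(r.startswith("partial dataset") for r in results)

async def main() -> None:
    # Stage 1: agents generate partial datasets in parallel.
    partials = await asyncio.gather(*(run_agent(f"slice_{i}") for i in range(4)))

    # Barrier: nothing moves on until every partial is in and validated.
    assert validate(partials)

    # Stage 2: aggregation only happens after the barrier.
    print(" | ".join(partials))

asyncio.run(main())
```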

Asynchronous Execution

GPU primitive: CPU launches GPU kernels asynchronously; doesn’t wait for completion unless needed.

My workflow: I shouldn’t block waiting on agents. Instead:

  • Kick off a batch
  • Do CPU-side sequential work (e.g., more planning, reviewing some results, anticipating the outputs of the AI agents, reviewing dashboards) while agents grind.
  • Sync only when I need results (e.g., once an AI agent is done).

This allows me (the CPU) and the AI agents (the GPU) to work asynchronously without either awaiting completion of the other.
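
A minimal sketch of that pattern, with `run_agent` again standing in for a real dispatch: launch the batch, do my own sequential work while it runs, and only await the results when I actually need them.

```python
import asyncio

# Hypothetical stand-in for dispatching a subtask to an agent.
async def run_agent(task: str) -> str:
    await asyncio.sleep(0.2)
    return f"done: {task}"

def cpu_side_work() -> None:
    # The sequential work only I can do: planning, reviewing earlier results, scoping the next batch.
    print("planning the next batch while the agents grind...")

async def main() -> None:
    # "Launch the kernel" without blocking on it: the tasks run once the event loop gets control.
    batch = [asyncio.create_task(run_agent(f"subtask_{i}")) for i in range(4)]

    # Do my own sequential work in the meantime (off the event loop, so the agents keep running).
    await asyncio.to_thread(cpu_side_work)

    # Sync only when I actually need the results.
    print(await asyncio.gather(*batch))

asyncio.run(main())
```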

Cache Coherence (or Lack Thereof)

GPU primitive: GPUs often relax cache coherence; work proceeds independently, without tight synchronization, to reduce overhead.

My workflow: Don’t force all agents to constantly “check in” with me (cache coherence). I don’t need to watch what the AI agents are doing at all times, and they don’t need to keep following up with me. The warps don’t need to watch each other either, or keep verifying that their copy of the code matches some “master copy” that I maintain and reconcile across all of them. Instead, each AI agent, especially a remote agent running in its own sandbox, should have its own environment, slice of work, and memory, and reconcile with me only at checkpoints. Only once all the agents are done do I go in and reconcile across them, rather than having them constantly check whether their sandbox copy of the code matches mine.

Synthesis: “Throughput Thinking” for AI-Agent Workflows

The key mental reframe: Treat my workflow like a GPU workload:

  • Batch and vectorize tasks so agents can do them in parallel
  • Hide latency by keeping a queue of pre-scoped work ready
  • Balance load so no agent is idle while one huge task drags
  • Minimize transfers by packaging context and tasks to reduce back-and-forth

Practical Implementation

Redesigning Workflows ↔ Kernel Optimization

GPU programming is about reshaping the problem so it fits GPU hardware. My workflow design is about reshaping tasks so they fit agent capabilities.

Examples:

  • Instead of handing an agent “build a full backend,” break into kernels: “scaffold endpoints,” “write CRUD tests,” “add logging”
  • Just as GPUs excel at uniform kernels, agents excel at well-scoped, modular subtasks

The Workflow Design Framework

  1. Avoid warp divergence (assign homogeneous batches)
  2. Maximize occupancy (keep all agents busy)
  3. Batch tasks to amortize kernel launch overhead
  4. Use shared context wisely (shared memory)
  5. Only sync when necessary (barriers)
  6. Run independent streams concurrently

Why This Analogy Matters

The way I think about parallelizing AI agent work and the way GPUs parallelize work share some core concepts. For creating AI agent workflows, that ends up looking like:

  • Identifying bottlenecks before they happen
  • Designing (often in unintuitive ways) tasks that maximize parallel execution
  • Minimizing the overhead of human-agent coordination (e.g., minimizing the amount of time that I have to manually do stuff myself)
  • Scaling my productivity by keeping all agents busy

As it turns out, “how do I do work efficiently and in parallel” is a problem that shows up everywhere. There are versions of it across fields (you can imagine, for example, that production lines, Michelin-star kitchens, and GPU architects all have their own way of solving it), but there seem to be common patterns nonetheless. As I’m learning about GPUs, I’m thinking about the analogies that make sense to me.