Designed a multi-agent AI workflow to streamline qualitative analysis, introducing structured evaluation and critique to improve the reliability of research insights.

Timeline

Spring 2025

Role

AI pipeline design

For

Healthcare Researchers

Scope

UX Research, AI Systems, Workflow Design, Evaluation Strategy
Workflow diagram showing LLM agents processing an interview transcript through generate, merge, evaluate, critic, and refine steps to produce structured themes.
Overview

Qualitative thematic analysis is widely used in healthcare research to uncover patterns in patient and caregiver experiences.

However, traditional workflows are manual, time-consuming, and often inconsistent across researchers, making it difficult to scale insights.

In this project, we designed a multi-agent AI system to automate the thematic analysis process. By structuring the workflow into specialized agents for generation, evaluation, critique, and refinement, we aimed to improve both the efficiency and reliability of qualitative analysis in clinical research contexts.

My Role

As the AI Product & System Designer, my contributions included:

– Defining the overall system architecture and agent roles
– Designing prompt strategies (zero-shot vs. one-shot) and workflow variations
– Developing evaluation frameworks to measure output quality (e.g., similarity metrics, LLM-based scoring)
– Analyzing results to identify performance gaps and inform iterative improvements

System Design

Pipeline Design
Designed a modular multi-agent architecture to simulate collaborative qualitative analysis.

– Generation Agent → drafts candidate themes from the transcript
– Merge Agent → consolidates and deduplicates outputs
– Evaluation Agent → scores theme quality
– Critic Agent → audits the reasoning behind each theme
– Refinement Agent → revises themes using the scores and critique
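The five-stage pipeline above can be sketched as a simple chain of agents. This is a minimal illustration, not the project's implementation: the stage prompts are placeholders, and `llm` stands in for whatever LLM API call the pipeline would use.

```python
from dataclasses import dataclass
from typing import Callable

# Injected stand-in for a real LLM API call: (system_prompt, text) -> response.
LLMFn = Callable[[str, str], str]

@dataclass
class Agent:
    name: str
    system_prompt: str
    llm: LLMFn

    def run(self, text: str) -> str:
        return self.llm(self.system_prompt, text)

def build_pipeline(llm: LLMFn) -> list[Agent]:
    # One agent per stage; prompts here are illustrative placeholders.
    stages = [
        ("generation", "Extract candidate themes from this interview transcript."),
        ("merge", "Consolidate overlapping themes into a deduplicated set."),
        ("evaluation", "Score each theme for clarity and grounding."),
        ("critic", "Audit the reasoning: flag unsupported or biased themes."),
        ("refinement", "Revise the theme set using the scores and critique."),
    ]
    return [Agent(name, prompt, llm) for name, prompt in stages]

def analyze(transcript: str, llm: LLMFn) -> str:
    output = transcript
    for agent in build_pipeline(llm):
        output = agent.run(output)  # each stage consumes the previous stage's output
    return output
```

Keeping each stage as a separate agent with its own system prompt is what makes the workflow modular: a stage can be swapped, re-prompted, or run multiple times without touching the rest of the chain.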

Key Innovation

Multi-Agent Collaboration
Modularized the workflow into specialized agents to enable structured, scalable analysis

Critic Agent
Added a review layer to improve reliability and reduce bias

Parallel Runs
Used multi-temperature outputs to balance diversity and consistency
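One way to realize the parallel multi-temperature runs described above, sketched under the assumption that a `generate_themes(transcript, temperature)` function (hypothetical here) returns a set of theme strings; the exact temperature values are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_runs(transcript, generate_themes, temperatures=(0.2, 0.7, 1.0)):
    """Run the generation step at several temperatures in parallel.

    Low temperatures favor consistent, conservative themes; high
    temperatures surface more diverse candidates. The union of all runs
    is handed to the merge step for consolidation.
    """
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda t: generate_themes(transcript, t), temperatures)
    merged = set()
    for themes in results:
        merged |= set(themes)
    return merged
```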

Flowchart showing AI workflow from interview transcript through LLM agents generating, merging, evaluating, critiquing, refining to producing structured themes.

Experiment Design

Two tables of evaluation results. Table 1, Embedding-Based Similarity Results: Hit Rate 0–0.5 and Jaccard Similarity 0–0.1 across the zero-shot and one-shot workflows with one or two evaluation passes. Table 2, GPT-Based Scoring Results: higher Hit Rates (0.75–1.0) and Jaccard Similarity (0.30–0.60) for the same workflows.
Designing and testing different AI workflows

Tested four configurations by crossing two factors:

– Zero-shot vs. one-shot prompting
– Single vs. double evaluation

Metrics

Hit Rate: Measures the proportion of human-coded themes that have at least one sufficiently similar theme in the LLM-generated theme set

Jaccard Similarity: Measures the overlap between the LLM-generated theme set and the human-coded theme set, based on matched theme pairs
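The two metrics above can be computed from any pluggable theme-matching test. A minimal sketch: `is_match` is an assumed predicate (e.g., embedding cosine similarity above a threshold, or a GPT-based judgment), which is what distinguishes the embedding-based and GPT-based result tables.

```python
def hit_rate(human_themes, llm_themes, is_match):
    """Fraction of human-coded themes with at least one sufficiently
    similar LLM-generated theme, per is_match(human, llm) -> bool."""
    if not human_themes:
        return 0.0
    hits = sum(1 for h in human_themes
               if any(is_match(h, l) for l in llm_themes))
    return hits / len(human_themes)

def jaccard_similarity(human_themes, llm_themes, is_match):
    """Overlap between the two theme sets based on matched pairs:
    matched themes count as the intersection; the union is
    |human| + |llm| - |matched|."""
    matched = sum(1 for h in human_themes
                  if any(is_match(h, l) for l in llm_themes))
    union = len(human_themes) + len(llm_themes) - matched
    return matched / union if union else 0.0
```

For example, with exact string matching, human themes {cost, access, trust} against LLM themes {cost, trust, wait times} give a Hit Rate of 2/3 and a Jaccard Similarity of 2/4.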

Key Findings

– One-shot > Zero-shot
– Double evaluation improves quality
– Critic agent improves consistency

Reflection & Key Takeaways

🟣 The quality of AI outputs depends more on how tasks are structured and connected than on any single prompt, especially in multi-step reasoning tasks

🟣 Effective prompt engineering is not about ad-hoc tweaking, but about designing structured, role-specific instructions that align with the overall system workflow