feat(librarian): add documentation discovery workflow for targeted doc investigation #377

gtg7784 · 2025-12-31T08:04:31Z

Summary

Add Phase 0.5 (Documentation Discovery) workflow before TYPE A and D requests
Improves documentation lookup accuracy by discovering doc structure before searching

Changes

New Documentation Discovery Flow (Phase 0.5)

Find Official Documentation: Web search to identify the official docs URL
Version Check: If user specifies a version (e.g., "React 18", "Next.js 14"), verify versioned documentation
Sitemap Discovery: Fetch sitemap.xml to understand documentation structure
Targeted Investigation: Use sitemap knowledge to fetch specific relevant pages
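The sitemap and targeted-investigation steps above could look roughly like the sketch below. It is a minimal illustration, not the code in this PR: `extractSitemapUrls` and `pickRelevantPages` are hypothetical helper names, the sitemap is parsed with a simple regex rather than a real XML parser, and the sample URLs are placeholders.

```typescript
// Hypothetical helpers; not taken from src/agents/librarian.ts.

/** Extract every <loc> URL from a sitemap.xml document (regex-based, no XML parser). */
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

/** Keep only URLs that mention at least one keyword from the user's query. */
function pickRelevantPages(urls: string[], keywords: string[]): string[] {
  const lowered = keywords.map((k) => k.toLowerCase());
  return urls.filter((url) => {
    const haystack = url.toLowerCase();
    return lowered.some((k) => haystack.indexOf(k) !== -1);
  });
}

// Placeholder sitemap content for illustration only.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/hooks</loc></url>
  <url><loc>https://example.com/docs/routing</loc></url>
  <url><loc>https://example.com/blog/release-notes</loc></url>
</urlset>`;

const allUrls = extractSitemapUrls(sampleSitemap);
const targetedPages = pickRelevantPages(allUrls, ["hooks"]);
console.log(targetedPages);
```

Matching keywords against the whole URL is deliberately crude; a real implementation would probably match against the path only and rank the candidates.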

Why This Matters

Previously, the librarian would immediately fire parallel searches without understanding the documentation structure. This led to:

Random page fetching that might miss relevant content
No version awareness (could return wrong version's docs)
Inefficient searching without knowing where to look

Now the flow is:

Doc Discovery is SEQUENTIAL (websearch → version check → sitemap → investigate)
Main phase is PARALLEL once you know where to look
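One way to picture the sequential-then-parallel split is as a two-stage plan. The sketch below only builds that plan as data and does not call any tool. The tool labels are names that appear in this PR; the `ToolCall` and `Stage` shapes and the `buildPlan` helper are assumptions made for illustration.

```typescript
// Hypothetical planning helper; the data shapes are assumptions.
type ToolCall = { tool: string; input: string };
type Stage = { label: string; mode: "sequential" | "parallel"; calls: ToolCall[] };

function buildPlan(library: string, topic: string, version?: string): Stage[] {
  const discovery: ToolCall[] = [
    { tool: "websearch_exa", input: `${library} official documentation site` },
  ];
  if (version !== undefined) {
    // The version check only runs when the user named a version.
    discovery.push({ tool: "websearch_exa", input: `${library} v${version} documentation` });
  }
  discovery.push({ tool: "webfetch", input: "<docs base URL>/sitemap.xml" });

  // These calls do not depend on each other, so they can be fired together.
  const main: ToolCall[] = [
    { tool: "context7_query-docs", input: topic },
    { tool: "webfetch", input: "<page chosen from sitemap>" },
    { tool: "grep_app", input: `${library} ${topic} usage` },
  ];

  return [
    { label: "Phase 0.5: Documentation Discovery", mode: "sequential", calls: discovery },
    { label: "Phase 1: Main Investigation", mode: "parallel", calls: main },
  ];
}

const plan = buildPlan("React", "hooks", "18");
for (const stage of plan) {
  console.log(`${stage.label} (${stage.mode}): ${stage.calls.map((c) => c.tool).join(", ")}`);
}
```

Each discovery call needs the result of the previous one (the sitemap URL depends on the docs URL), which is why that stage cannot be parallelized.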

Updated Components

Classification table now shows "Doc Discovery →" for TYPE A and D
Tool reference includes sitemap discovery and doc page fetching
Failure recovery handles sitemap/versioned docs not found cases
Parallel execution requirements table now shows Doc Discovery column
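The sitemap half of the failure recovery could be sketched as an ordered list of fallbacks. This is an assumption-heavy illustration: the PR text does not list which fallback paths `librarian.ts` uses, so the candidate paths below are common conventions chosen as placeholders, and `locateSitemap` is a hypothetical helper. The reachability check is injected so the example needs no network access.

```typescript
// Hypothetical helper. The candidate paths are assumed for illustration;
// they are not taken from the PR.
const SITEMAP_CANDIDATES = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-0.xml"];

type SitemapResult =
  | { kind: "found"; url: string }
  | { kind: "not_found"; tried: string[] };

/** Try each candidate in order; `exists` stands in for an actual fetch. */
function locateSitemap(baseUrl: string, exists: (url: string) => boolean): SitemapResult {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const tried: string[] = [];
  for (const path of SITEMAP_CANDIDATES) {
    const url = base + path;
    tried.push(url);
    if (exists(url)) {
      return { kind: "found", url };
    }
  }
  // Caller can fall back to navigating from the docs index page.
  return { kind: "not_found", tried };
}

const reachableUrls = ["https://example.com/sitemap_index.xml"];
const sitemapResult = locateSitemap(
  "https://example.com/",
  (url) => reachableUrls.indexOf(url) !== -1,
);
console.log(sitemapResult);
```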

Summary by cubic

Adds a Documentation Discovery phase before TYPE A and D requests to locate official, versioned docs and fetch targeted pages. Improves accuracy and cuts down random, unfocused searches.

New Features
- Sequential doc discovery: web search → version check → sitemap → targeted fetch.
- Version-aware lookups when a version is specified.
- Targeted page fetching using sitemap; main investigation runs in parallel after discovery.
- Failure recovery for missing sitemaps or versioned docs.
Refactors
- Update classification to show “Doc Discovery →” for TYPE A and D.
- Switch from context7_get-library-docs to context7_query-docs.
- Expand tool reference for sitemap and doc page fetching.
- Add Doc Discovery column to parallel execution requirements.

Written for commit 3853456. Summary will update on new commits.

…c investigation

- Add Phase 0.5 (Documentation Discovery) before TYPE A and D requests
- Sequential flow: websearch → version check → sitemap → targeted investigation
- Enables version-specific documentation lookup when user specifies version
- Sitemap discovery helps understand doc structure before searching
- Update tool reference with sitemap and doc page fetching
- Add failure recovery for sitemap/versioned docs not found cases

github-actions · 2025-12-31T08:04:42Z

All contributors have signed the CLA. Thank you! ✅
Posted by the CLA Assistant Lite bot.

gtg7784 · 2025-12-31T08:05:25Z

I have read the CLA Document and I hereby sign the CLA

greptile-apps · 2025-12-31T08:07:45Z

Greptile Summary

Adds Phase 0.5 Documentation Discovery workflow to improve documentation lookup accuracy before TYPE A (Conceptual) and TYPE D (Comprehensive) requests. The new workflow discovers official docs URL, verifies versioned documentation, fetches sitemap.xml to understand doc structure, then performs targeted investigation instead of random searching.

Sequential Doc Discovery: websearch → version check → sitemap → targeted investigation
Main Phase Remains Parallel: Once doc structure is known, parallel tool execution continues
Classification table updated: Shows "Doc Discovery →" for TYPE A and D
New tool entries: Added sitemap discovery and doc page fetching with webfetch
Failure recovery: Handles sitemap/versioned docs not found cases

Issue Found: Changed tool name from context7_get-library-docs to context7_query-docs in three locations (lines 114, 181, 237), but src/hooks/agent-usage-reminder/constants.ts:18 still references the old tool name. This creates an inconsistency where the agent usage reminder hook won't recognize the tool being used.

Confidence Score: 3/5

This PR is safe to merge after fixing the tool name inconsistency
The workflow logic and documentation improvements are sound, but the tool name change from context7_get-library-docs to context7_query-docs creates a runtime inconsistency with the agent-usage-reminder hook that still references the old name. This must be corrected either by reverting the tool name changes or updating the hook configuration.
Check src/hooks/agent-usage-reminder/constants.ts to ensure tool name matches the changes in src/agents/librarian.ts
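A mismatch like this can be caught mechanically by comparing the tool names mentioned in the prompt against the names the hook recognizes. The sketch below is hypothetical: the helper names, the regex, and the sample strings are illustrative, and the actual contents of `constants.ts` are not shown in this PR.

```typescript
// Hypothetical consistency check; the sample inputs are illustrative.

/** Collect every distinct context7_* tool name mentioned in a block of prompt text. */
function findContext7Tools(promptText: string): string[] {
  const found: string[] = [];
  const pattern = /context7_[a-z0-9-]+/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(promptText)) !== null) {
    if (found.indexOf(match[0]) === -1) {
      found.push(match[0]);
    }
  }
  return found;
}

/** Return tool names used in the prompt that the hook's list does not contain. */
function findUnrecognizedTools(promptText: string, hookToolNames: string[]): string[] {
  return findContext7Tools(promptText).filter((tool) => hookToolNames.indexOf(tool) === -1);
}

const promptSnippet = 'context7_query-docs(libraryId, query: "specific topic")';
const staleHookTools = ["context7_get-library-docs"];
console.log(findUnrecognizedTools(promptSnippet, staleHookTools));
```

Running a check of this kind in a unit test would flag the prompt and the hook drifting apart before the change is merged.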

Important Files Changed

| Filename | Overview |
|----------|----------|
| src/agents/librarian.ts | Added Phase 0.5 Documentation Discovery workflow with sitemap-based targeted investigation. Changed tool name from `context7_get-library-docs` to `context7_query-docs` which creates inconsistency with agent-usage-reminder hook. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Librarian
    participant WebSearch as websearch_exa
    participant Sitemap as webfetch(sitemap)
    participant Docs as webfetch(doc_pages)
    participant Context7
    participant GrepApp as grep_app
    
    Note over User,GrepApp: TYPE A or TYPE D Request
    
    User->>Librarian: "How do I use React 18?"
    
    rect rgb(230, 240, 255)
        Note right of Librarian: Phase 0.5: Documentation Discovery (SEQUENTIAL)
        Librarian->>WebSearch: "React official documentation site"
        WebSearch-->>Librarian: https://react.dev
        
        Librarian->>WebSearch: "React v18 documentation"
        WebSearch-->>Librarian: Versioned URL confirmed
        
        Librarian->>Sitemap: GET /sitemap.xml
        Sitemap-->>Librarian: Parse doc structure
        
        Note right of Librarian: Identify relevant sections from sitemap
    end
    
    rect rgb(240, 255, 240)
        Note right of Librarian: Phase 1: Main Investigation (PARALLEL)
        par Parallel Execution
            Librarian->>Context7: resolve-library-id("react")
            Context7-->>Librarian: library_id
            Librarian->>Context7: query-docs(id, "hooks")
            Context7-->>Librarian: Official docs
        and
            Librarian->>Docs: GET /docs/hooks.html (from sitemap)
            Docs-->>Librarian: Targeted doc page
        and
            Librarian->>GrepApp: searchGitHub("React hooks usage")
            GrepApp-->>Librarian: Code examples
        end
    end
    
    Librarian->>User: Synthesized answer with permalinks

greptile-apps

Additional Comments (3)

src/agents/librarian.ts, line 114

syntax: tool name changed from context7_get-library-docs to context7_query-docs but src/hooks/agent-usage-reminder/constants.ts:18 still references the old name

src/agents/librarian.ts, line 181

syntax: tool name changed from context7_get-library-docs to context7_query-docs but src/hooks/agent-usage-reminder/constants.ts:18 still references the old name

src/agents/librarian.ts, line 237

syntax: tool name changed from context7_get-library-docs to context7_query-docs but src/hooks/agent-usage-reminder/constants.ts:18 still references the old name

1 file reviewed, 3 comments

cubic-dev-ai

4 issues found across 1 file

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/agents/librarian.ts">

<violation number="1" location="src/agents/librarian.ts:53">
P2: TYPE A table says `context7 + websearch_exa`, but TYPE A instructions now use `webfetch(...)` from the sitemap instead. Align the table with the actual workflow so the agent follows a single, consistent playbook.</violation>

<violation number="2" location="src/agents/librarian.ts:76">
P3: The `webfetch(... + &quot;/versions&quot; or &quot;/v{version}&quot;)` example is ambiguous/invalid pseudo-code. Split into explicit alternatives like you do for sitemap fallbacks.</violation>

<violation number="3" location="src/agents/librarian.ts:96">
P1: `context7_query-docs` appears to be an incorrect/non-existent tool name (only used in this prompt, while the repo references `context7_get-library-docs`). This will likely break the librarian’s Context7 step or confuse tool usage; use the real tool name consistently.</violation>

<violation number="4" location="src/agents/librarian.ts:114">
P2: Context7 call examples are inconsistent (`query:` vs `topic:`). Pick one argument shape and use it everywhere to avoid invalid tool-call schemas.</violation>
</file>
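The second violation above asks for the two versioned-URL shapes to be listed as explicit alternatives. A minimal sketch of that split follows; `versionedDocCandidates` is a hypothetical helper, and whether a given docs site serves either path is not guaranteed.

```typescript
// Hypothetical helper: lists the two URL shapes separately instead of
// combining them in one ambiguous expression.
function versionedDocCandidates(docsBaseUrl: string, version: string): string[] {
  const base = docsBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return [
    `${base}/versions`,    // page listing available versions
    `${base}/v${version}`, // version-prefixed docs root
  ];
}

const candidates = versionedDocCandidates("https://example.com/docs/", "14");
console.log(candidates);
```

Each candidate would then be fetched in order, with the first successful response treated as the versioned docs entry point.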


src/agents/librarian.ts

cubic-dev-ai · 2025-12-31T08:13:46Z

 | Type | Trigger Examples | Tools |
 |------|------------------|-------|
-| **TYPE A: CONCEPTUAL** | "How do I use X?", "Best practice for Y?" | context7 + websearch_exa (parallel) |
+| **TYPE A: CONCEPTUAL** | "How do I use X?", "Best practice for Y?" | Doc Discovery → context7 + websearch_exa |


P2: TYPE A table says context7 + websearch_exa, but TYPE A instructions now use webfetch(...) from the sitemap instead. Align the table with the actual workflow so the agent follows a single, consistent playbook.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At src/agents/librarian.ts, line 53:

<comment>TYPE A table says `context7 + websearch_exa`, but TYPE A instructions now use `webfetch(...)` from the sitemap instead. Align the table with the actual workflow so the agent follows a single, consistent playbook.</comment>

<file context>
@@ -50,10 +50,56 @@ Classify EVERY request into one of these categories before taking action:
 | Type | Trigger Examples | Tools |
 |------|------------------|-------|
-| **TYPE A: CONCEPTUAL** | "How do I use X?", "Best practice for Y?" | context7 + websearch_exa (parallel) |
+| **TYPE A: CONCEPTUAL** | "How do I use X?", "Best practice for Y?" | Doc Discovery → context7 + websearch_exa |
 | **TYPE B: IMPLEMENTATION** | "How does X implement Y?", "Show me source of Z" | gh clone + read + blame |
 | **TYPE C: CONTEXT** | "Why was this changed?", "History of X?" | gh issues/prs + git log/blame |
</file context>

The table shows "Doc Discovery →" which includes the webfetch sitemap workflow. The context7 + websearch_exa still applies after Doc Discovery phase. The table is a summary, while the detailed workflow is in PHASE 0.5 and TYPE A sections.

src/agents/librarian.ts

cubic-dev-ai · 2025-12-31T08:13:46Z

+With sitemap knowledge, fetch the SPECIFIC documentation pages relevant to the query:
+\`\`\`
+webfetch(specific_doc_page_from_sitemap)
+context7_query-docs(libraryId, query: "specific topic")


P1: context7_query-docs appears to be an incorrect/non-existent tool name (only used in this prompt, while the repo references context7_get-library-docs). This will likely break the librarian’s Context7 step or confuse tool usage; use the real tool name consistently.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At src/agents/librarian.ts, line 96:

<comment>`context7_query-docs` appears to be an incorrect/non-existent tool name (only used in this prompt, while the repo references `context7_get-library-docs`). This will likely break the librarian’s Context7 step or confuse tool usage; use the real tool name consistently.</comment>

<file context>
@@ -50,10 +50,56 @@ Classify EVERY request into one of these categories before taking action:
+With sitemap knowledge, fetch the SPECIFIC documentation pages relevant to the query:
+\`\`\`
+webfetch(specific_doc_page_from_sitemap)
+context7_query-docs(libraryId, query: "specific topic")
+\`\`\`
+
</file context>

Fixed in commit 3853456. The tool name context7_query-docs is correct (matches the MCP tool definition). Also updated src/hooks/agent-usage-reminder/constants.ts to use the same name for consistency.

code-yeongyu · 2026-01-03T00:58:09Z

@sisyphus-dev-ai review this

And how do you think this will change the librarian's behavior? Will it be slower? More precise? More likely to answer with deprecated material?

ulw to share your opinion

sisyphus-dev-ai · 2026-01-03T00:59:41Z

Review: Documentation Discovery Workflow for Librarian

Great question @code-yeongyu! Let me break down the behavioral impact of this change.

TL;DR

| Aspect | Impact | Why |
|--------|--------|-----|
| Speed | Slightly slower | Sequential discovery phase (3-4 calls) before parallel execution |
| Precision | Significantly better | Targeted searching informed by doc structure |
| Deprecated content | Less likely, not more | Version-aware lookups actively prevent wrong-version docs |

Detailed Analysis

Speed Trade-off

Before: Immediate parallel fire (3+ calls for TYPE A, 6+ for TYPE D)

After: Sequential discovery → then parallel fire

websearch (find docs URL)
   ↓
webfetch (version check if specified)
   ↓
webfetch (sitemap.xml)
   ↓
[NOW parallel phase begins]

This adds ~2-4 seconds latency on TYPE A/D requests. The main investigation phase remains parallel.
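The latency argument can be made concrete with a little arithmetic: sequential calls add up, while a parallel batch costs roughly as much as its slowest call. The sketch below uses made-up per-call latencies purely as placeholders (they are not measurements from this PR), and the helper names are hypothetical.

```typescript
// Per-call latencies below are illustrative placeholders, not measurements.

/** Sequential calls wait on each other, so their latencies add up. */
function sequentialLatency(callsMs: number[]): number {
  return callsMs.reduce((total, ms) => total + ms, 0);
}

/** Parallel calls overlap, so the batch takes about as long as the slowest call. */
function parallelLatency(callsMs: number[]): number {
  return callsMs.length === 0 ? 0 : Math.max(...callsMs);
}

const discoveryCallsMs = [900, 800, 700]; // websearch, version check, sitemap fetch
const mainCallsMs = [1200, 1500, 1000];   // calls fired together in the main phase

const latencyBefore = parallelLatency(mainCallsMs);
const latencyAfter = sequentialLatency(discoveryCallsMs) + parallelLatency(mainCallsMs);
console.log({ latencyBefore, latencyAfter, added: latencyAfter - latencyBefore });
```

With these placeholder numbers the added cost equals the sum of the discovery calls (2400 ms), which falls inside the ~2-4 second range quoted above; the parallel phase itself is unchanged.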

Precision Improvement (The Real Win)

The current librarian does "hopeful parallel searching" - it fires multiple tools simultaneously without knowing WHERE the relevant documentation lives. This leads to:

Random page hits that miss relevant content
No understanding of doc organization
Luck-dependent results

The new flow:

Finds official docs URL - not blogs, not tutorials
Verifies version - React 18 docs, not React 17
Parses sitemap - understands the doc structure
Fetches targeted pages - informed by sitemap knowledge

This is like the difference between grepping blindly vs. understanding the codebase structure first.

Deprecated Content (Counter-intuitive!)

You asked if this increases deprecated content risk. It's actually the opposite.

The new Phase 0.5 explicitly:

Checks for versioned documentation when user specifies a version
Falls back to latest with explicit acknowledgment: "Versioned docs not found → Fall back to latest version, note this in response"

Before this change, the librarian had zero version awareness. It would happily return Next.js 13 App Router docs when the user was asking about Next.js 14.
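The fallback rule quoted above ("Versioned docs not found → Fall back to latest version, note this in response") could be expressed as a small decision function. This is a sketch under assumptions: `chooseDocs` and the `DocsChoice` shape are hypothetical, and the wording of the note is illustrative.

```typescript
// Hypothetical decision helper; names and note wording are illustrative.
type DocsChoice = { url: string; note?: string };

function chooseDocs(
  latestUrl: string,
  requestedVersion: string | undefined,
  versionedUrl: string | undefined, // undefined when no versioned docs were found
): DocsChoice {
  if (requestedVersion === undefined) {
    // No version requested: latest docs are the right answer, no note needed.
    return { url: latestUrl };
  }
  if (versionedUrl !== undefined) {
    return { url: versionedUrl };
  }
  // Version requested but not found: fall back and say so explicitly.
  return {
    url: latestUrl,
    note: `Versioned docs for ${requestedVersion} not found; using latest documentation.`,
  };
}

console.log(chooseDocs("https://example.com/docs", "14", undefined));
```

Carrying the note in the return value makes it hard for the caller to fall back silently, which is the behavior the review describes.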

Trade-off Summary

| Before | After |
|--------|-------|
| Lower latency | Higher latency (+2-4s) |
| Random parallel hits | Informed targeted hits |
| No version awareness | Explicit version checking |
| Variable precision | Consistent precision |
| Lower token usage | Higher token usage (sitemap parsing) |

Verdict

This is a good trade-off. Documentation questions benefit more from precision than raw speed. Users asking "How do I use X in version Y?" care about getting the right answer, not getting any answer 2 seconds faster.

The sequential discovery is a one-time upfront cost that dramatically improves the quality of the parallel phase that follows.

Minor Suggestion

Consider adding a skip condition for well-known libraries with consistent doc structures (React, Next.js, etc.) where we already know the sitemap location. This could recover some latency for common cases.

// Potential optimization
if (library in KNOWN_DOC_STRUCTURES) {
  // Skip websearch, go directly to sitemap
  webfetch(KNOWN_DOC_STRUCTURES[library].sitemap)
}

But this is a nice-to-have, not a blocker.
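A runnable variant of the suggestion above might look like the following sketch. The registry entry and its URL are placeholders rather than verified sitemap locations, and `knownSitemapFor` is a hypothetical helper. It uses an own-property check instead of the `in` operator, because `in` also matches inherited keys such as `toString`.

```typescript
// Hypothetical registry; the entry and URL are placeholders.
const KNOWN_DOC_STRUCTURES: Record<string, { sitemap: string }> = {
  "example-lib": { sitemap: "https://example.com/docs/sitemap.xml" },
};

/** Return a known sitemap URL so the websearch step can be skipped, if available. */
function knownSitemapFor(library: string): string | undefined {
  const key = library.toLowerCase();
  // Own-property check: avoids false positives on inherited keys like "toString".
  if (Object.prototype.hasOwnProperty.call(KNOWN_DOC_STRUCTURES, key)) {
    return KNOWN_DOC_STRUCTURES[key].sitemap;
  }
  return undefined; // unknown library: run the full discovery flow
}

console.log(knownSitemapFor("Example-Lib"));
console.log(knownSitemapFor("some-other-lib"));
```

A hard-coded registry trades latency for maintenance: entries go stale when a docs site moves, so the full discovery flow should remain the fallback.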

LGTM - The precision improvement justifies the latency cost.

github-actions bot added a commit that referenced this pull request Dec 31, 2025

@gtg7784 has signed the CLA in #377

ea6121e

greptile-apps bot reviewed Dec 31, 2025

cubic-dev-ai bot reviewed Dec 31, 2025

sisyphus-dev-ai mentioned this pull request Jan 1, 2026

Issue re-labeling #252

Open

gtg7784 changed the base branch from master to dev January 2, 2026 05:33

fix(librarian): address PR review feedback

3853456

- Fix ambiguous webfetch pseudo-code (split /versions or /v{version})
- Unify context7_query-docs parameter format (query: instead of topic:)
- Update agent-usage-reminder constants to use context7_query-docs

sisyphus-dev-ai added the sisyphus: working Sisyphus is currently working on this label Jan 3, 2026

sisyphus-dev-ai removed the sisyphus: working Sisyphus is currently working on this label Jan 3, 2026

code-yeongyu force-pushed the dev branch from 1b4532e to f61e1a5 Compare January 3, 2026 06:36
