semcod/docval

docval

Python License: Apache-2.0 Tests

AI Cost Tracking

  • 🤖 LLM usage: $0.5083 (6 commits)
  • 👤 Human dev: ~$424 (4.2h @ $100/h, 30min dedup)

Generated on 2026-04-20 using openrouter/qwen/qwen3-coder-next


Validate and refactor Markdown documentation against source code: detect outdated, orphaned, duplicate, and invalid docs using heuristics plus an optional LLM.

docval_architecture.svg

How it works

docs/  ──→  chunk by heading  ──→  heuristic checks  ──→  cross-ref with code  ──→  (optional) LLM  ──→  report/fix
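
The "chunk by heading" step can be pictured with a minimal sketch. This is not docval's actual chunker: the `Chunk` dataclass and the `chunk_by_heading` function are hypothetical names chosen for illustration, and the sketch assumes ATX-style (`#`) headings only.

```python
import re
from dataclasses import dataclass, field

# Hypothetical illustration, not docval's real implementation.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # fenced-code marker, built this way to keep the example readable


@dataclass
class Chunk:
    heading: str
    level: int
    line_start: int
    body: list[str] = field(default_factory=list)


def chunk_by_heading(text: str) -> list[Chunk]:
    """Split Markdown text into one chunk per ATX heading."""
    # The first chunk collects any preamble that precedes the first heading.
    chunks = [Chunk(heading="", level=0, line_start=1)]
    in_fence = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside fenced code are not headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            chunks.append(Chunk(match.group(2), len(match.group(1)), lineno))
        else:
            chunks[-1].body.append(line)
    # Drop the preamble chunk if it holds nothing but blank lines.
    return [c for c in chunks if c.heading or any(l.strip() for l in c.body)]
```

Each chunk keeps its starting line number, so later checks can report issues as `file:line`.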

Three validation layers, each progressively deeper:

  1. Heuristic validator (fast, free): empty sections, broken internal links, TODO/FIXME markers, duplicate detection via difflib, stale version references, archive path detection, explicit deprecation markers
  2. Cross-reference validator (fast, free): checks that backtick-quoted symbols (ClassName, function_name), import paths in code blocks, and CLI commands actually exist in the project source
  3. LLM validator (optional, paid): semantic validation via litellm for chunks that heuristics couldn't resolve with high confidence
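
To make the cross-reference layer concrete, here is a rough sketch of the idea using only the standard library. It is an assumption about how such a check could work, not docval's code: `defined_names` and `orphaned_refs` are hypothetical helpers, and only simple backtick-quoted identifiers are considered.

```python
import ast
import re

# Matches a single Python identifier wrapped in backticks, e.g. `ClassName`.
BACKTICK_RE = re.compile(r"`([A-Za-z_][A-Za-z0-9_]*)`")


def defined_names(source: str) -> set[str]:
    """Collect class and function names defined anywhere in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    }


def orphaned_refs(doc_text: str, known: set[str]) -> list[str]:
    """Return backtick-quoted identifiers in the docs that the source never defines."""
    orphans: list[str] = []
    for name in BACKTICK_RE.findall(doc_text):
        if name not in known and name not in orphans:
            orphans.append(name)  # keep first-seen order, skip repeats
    return orphans


source = "class Parser:\n    def parse(self):\n        pass\n"
print(orphaned_refs("Call `Parser`, not `OldParser`.", defined_names(source)))
```

A real validator would also need to handle dotted paths, imports, and CLI commands, which this sketch leaves out.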

Installation

pip install docval

With LLM support:

pip install "docval[llm]"

From source:

git clone https://github.com/wronai/docval.git
cd docval
pip install -e ".[dev]"

CLI Usage

Scan and report issues

docval scan docs/
docval scan docs/ --project /path/to/repo -v
docval scan docs/ -o report.md
docval scan docs/ -o report.json

Fix documentation (dry-run by default)

docval fix docs/                           # preview changes
docval fix docs/ --no-dry-run              # apply fixes
docval fix docs/ --no-dry-run --llm        # with LLM validation

Generate a patch file

docval patch docs/ -o fixes.txt
docval patch docs/ --llm --model gpt-4o -o fixes.txt

View documentation statistics

docval stats docs/

LLM validation

export OPENAI_API_KEY=sk-...
docval scan docs/ --llm --model gpt-4o-mini
docval scan docs/ --llm --model anthropic/claude-sonnet-4-20250514
docval scan docs/ --llm --model groq/llama-3.3-70b-versatile

Any model supported by litellm works.

Python API

from pathlib import Path
from docval.pipeline import scan
from docval.reporters import ConsoleReporter, MarkdownReporter

# Run validation
result = scan(
    docs_dir=Path("docs/"),
    project_root=Path("."),
    use_llm=False,
)

# Print to console
ConsoleReporter(verbose=True).report(result)

# Write markdown report
MarkdownReporter().report(result, Path("validation-report.md"))

Using individual validators

from pathlib import Path

from docval.chunker import chunk_directory
from docval.context import build_context
from docval.validators import HeuristicValidator, CrossRefValidator

# Chunk docs
doc_files = chunk_directory(Path("docs/"))

# Build project context
ctx = build_context(Path("."))

# Run heuristics
heuristic = HeuristicValidator(ctx=ctx)
heuristic.validate(doc_files)

# Cross-reference check
crossref = CrossRefValidator(ctx=ctx)
crossref.validate(doc_files)

# Inspect results
for f in doc_files:
    for chunk in f.chunks:
        if chunk.issues:
            print(f"{f.relative_path}:{chunk.line_start} [{chunk.status.value}] {chunk.heading}")
            for issue in chunk.issues:
                print(f"  {issue.severity.value}: {issue.message}")

What it detects

| Check | Layer | Example |
|---|---|---|
| Empty sections | Heuristic | Heading with no body text |
| Broken internal links | Heuristic | [guide](./deleted-file.md) |
| Deprecated markers | Heuristic | DEPRECATED, OBSOLETE, DO NOT USE |
| Archive path | Heuristic | Files in docs/archive/ directories |
| Stale versions | Heuristic | References to v1.x when project is v3.x |
| Duplicates | Heuristic | >80% similar content across files |
| TODO/FIXME | Heuristic | Unfinished documentation markers |
| Orphaned code refs | CrossRef | `NonExistentClass` in backticks |
| Broken imports | CrossRef | from mypackage.deleted import X in code blocks |
| Semantic accuracy | LLM | Content that doesn't match actual project behavior |
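
The duplicate check is described as difflib-based with a threshold of more than 80% similarity. The sketch below shows one way such a comparison could look. The `find_duplicates` function and the chunk-id keys are hypothetical, and docval may normalize or compare text differently.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicates(chunks: dict[str, str], threshold: float = 0.8):
    """Return (id_a, id_b, ratio) for every pair of chunks above the similarity threshold.

    `chunks` maps a hypothetical chunk id (e.g. "file.md#heading") to its text.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(chunks.items(), 2):
        # ratio() is 2*M/T: M matched characters, T total characters in both texts.
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio > threshold:
            pairs.append((id_a, id_b, ratio))
    return pairs


docs = {
    "a.md#install": "Run pip install docval to install the package.",
    "b.md#setup": "Run pip install docval to install the package!",
    "c.md#usage": "Completely different text about reporters.",
}
for id_a, id_b, ratio in find_duplicates(docs):
    print(f"{id_a} ~ {id_b}: {ratio:.2f}")
```

Pairwise comparison grows quadratically with the number of chunks, so a larger documentation set would likely need a cheaper pre-filter before running `SequenceMatcher`.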

Architecture

src/docval/
├── cli.py                  # Click CLI: scan, fix, patch, stats
├── pipeline.py             # Orchestrates: discover → chunk → validate → report
├── models.py               # Data models: DocChunk, DocFile, ValidationResult
├── chunker.py              # MD → heading-based semantic chunks
├── context.py              # Build project context (AST, git, .toon files)
├── validators/
│   ├── heuristic.py        # Rule-based checks (free, fast)
│   ├── crossref.py         # Code ↔ docs cross-reference
│   └── llm_validator.py    # Semantic validation via litellm
├── actions/
│   └── executor.py         # Apply fixes: delete, archive, patch
└── reporters/
    ├── console.py          # Rich CLI output
    ├── markdown_report.py  # .md report
    └── json_report.py      # .json for CI/CD

Integration with .toon files

docval understands .toon.yaml files from the code2llm ecosystem. When present, it extracts module names, class names, and exported functions for cross-referencing, giving more accurate orphaned-reference detection.

License

Licensed under Apache-2.0.

Status

Last updated by taskill at 2026-04-25 13:37 UTC

| Metric | Value |
|---|---|
| HEAD | 4fba32f |
| Coverage | n/a |
| Failing tests | n/a |
| Commits in last cycle | 6 |

Latest change: added Markdown output for documentation generation (docs feature). The same commit also includes commit messages in the Markdown output.
