mineru-pdf-extractor

Coding Agents & IDEs
v1.0.5
Benign

Extract PDF content to Markdown using MinerU API.

2635 downloads635 installsby @a-i-r

Setup & Installation

Install command

clawhub install a-i-r/mineru-pdf-extractor

If the CLI is not installed:

Install command

npx clawhub@latest install a-i-r/mineru-pdf-extractor

Or install with OpenClaw CLI:

Install command

openclaw skills install a-i-r/mineru-pdf-extractor

or paste the repo link into your assistant's chat

Install command

https://github.com/openclaw/skills/tree/main/skills/a-i-r/mineru-pdf-extractor

What This Skill Does

MinerU PDF Extractor converts PDF documents to structured Markdown via the MinerU API. It handles formula recognition, table extraction, and OCR. Two methods are available: uploading a local file in 4 steps, or submitting a public URL in 2 steps.

Handles complex PDF content like LaTeX formulas and embedded tables that basic text extractors miss.

When to Use It

  • Converting arXiv research papers to Markdown for note-taking
  • Extracting tables from financial reports into structured text
  • Batch processing academic PDFs to build a searchable knowledge base
  • Parsing scanned documents with OCR into plain text
  • Converting technical manuals with embedded formulas to Markdown
View original SKILL.md file
# MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

> **Note**: This is a community skill, not an official MinerU product. You need to obtain your own API key from [MinerU](https://mineru.net/).

---

## ๐Ÿ“ Skill Structure

```
mineru-pdf-extractor/
โ”œโ”€โ”€ SKILL.md                          # English documentation
โ”œโ”€โ”€ SKILL_zh.md                       # Chinese documentation
โ”œโ”€โ”€ docs/                             # Documentation
โ”‚   โ”œโ”€โ”€ Local_File_Parsing_Guide.md   # Local PDF parsing detailed guide (English)
โ”‚   โ”œโ”€โ”€ Online_URL_Parsing_Guide.md   # Online PDF parsing detailed guide (English)
โ”‚   โ”œโ”€โ”€ MinerU_ๆœฌๅœฐๆ–‡ๆกฃ่งฃๆžๅฎŒๆ•ดๆต็จ‹.md  # Local parsing complete guide (Chinese)
โ”‚   โ””โ”€โ”€ MinerU_ๅœจ็บฟๆ–‡ๆกฃ่งฃๆžๅฎŒๆ•ดๆต็จ‹.md  # Online parsing complete guide (Chinese)
โ””โ”€โ”€ scripts/                          # Executable scripts
    โ”œโ”€โ”€ local_file_step1_apply_upload_url.sh    # Local parsing Step 1
    โ”œโ”€โ”€ local_file_step2_upload_file.sh         # Local parsing Step 2
    โ”œโ”€โ”€ local_file_step3_poll_result.sh         # Local parsing Step 3
    โ”œโ”€โ”€ local_file_step4_download.sh            # Local parsing Step 4
    โ”œโ”€โ”€ online_file_step1_submit_task.sh        # Online parsing Step 1
    โ””โ”€โ”€ online_file_step2_poll_result.sh        # Online parsing Step 2
```

---

## ๐Ÿ”ง Requirements

### Required Environment Variables

Scripts automatically read MinerU Token from environment variables (choose one):

```bash
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```

### Required Command-Line Tools

- `curl` - For HTTP requests (usually pre-installed)
- `unzip` - For extracting results (usually pre-installed)

### Optional Tools

- `jq` - For enhanced JSON parsing and security (recommended but not required)
  - If not installed, scripts will use fallback methods
  - Install: `apt-get install jq` (Debian/Ubuntu) or `brew install jq` (macOS)

### Optional Configuration

```bash
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```

> ๐Ÿ’ก **Get Token**: Visit https://mineru.net/apiManage/docs to register and obtain an API Key

---

## ๐Ÿ“„ Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

### Quick Start

```bash
cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```

### Script Descriptions

#### local_file_step1_apply_upload_url.sh

Apply for upload URL and batch_id.

**Usage:**
```bash
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
```

**Parameters:**
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`

**Output:**
```
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
```

---

#### local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

**Usage:**
```bash
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
```

---

#### local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

**Usage:**
```bash
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
```

**Output:**
```
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
```

---

#### local_file_step4_download.sh

Download result ZIP and extract.

**Usage:**
```bash
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
```

**Output Structure:**
```
extracted/
โ”œโ”€โ”€ full.md              # ๐Ÿ“„ Markdown document (main result)
โ”œโ”€โ”€ images/              # ๐Ÿ–ผ๏ธ Extracted images
โ”œโ”€โ”€ content_list.json    # Structured content
โ””โ”€โ”€ layout.json          # Layout analysis data
```

### Detailed Documentation

๐Ÿ“š **Complete Guide**: See `docs/Local_File_Parsing_Guide.md`

---

## ๐ŸŒ Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Only 2 steps, more concise and efficient.

### Quick Start

```bash
cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```

### Script Descriptions

#### online_file_step1_submit_task.sh

Submit parsing task for online PDF.

**Usage:**
```bash
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
```

**Parameters:**
- `pdf_url`: Complete URL of the online PDF (required)
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`

**Output:**
```
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```

---

#### online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

**Usage:**
```bash
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
```

**Output Structure:**
```
extracted/
โ”œโ”€โ”€ full.md              # ๐Ÿ“„ Markdown document (main result)
โ”œโ”€โ”€ images/              # ๐Ÿ–ผ๏ธ Extracted images
โ”œโ”€โ”€ content_list.json    # Structured content
โ””โ”€โ”€ layout.json          # Layout analysis data
```

### Detailed Documentation

๐Ÿ“š **Complete Guide**: See `docs/Online_URL_Parsing_Guide.md`

---

## ๐Ÿ“Š Comparison of Two Parsing Methods

| Feature | **Local PDF Parsing** | **Online PDF Parsing** |
|---------|----------------------|------------------------|
| **Steps** | 4 steps | 2 steps |
| **Upload Required** | โœ… Yes | โŒ No |
| **Average Time** | 30-60 seconds | 10-20 seconds |
| **Use Case** | Local files | Files already online (arXiv, websites, etc.) |
| **File Size Limit** | 200MB | Limited by source server |

---

## โš™๏ธ Advanced Usage

### Batch Process Local Files

```bash
for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"
    
    # Step 1
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2)
    
    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"
    
    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2)
    
    # Step 4
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done
```

### Batch Process Online Files

```bash
for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
    echo "Processing: $url"
    
    # Step 1
    result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
    task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)
    
    # Step 2
    filename=$(basename "$url" .pdf)
    ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done
```

---

## โš ๏ธ Notes

1. **Token Configuration**: Scripts prioritize `MINERU_TOKEN`, fall back to `MINERU_API_KEY` if not found
2. **Token Security**: Do not hard-code tokens in scripts; use environment variables
3. **URL Accessibility**: For online parsing, ensure the provided URL is publicly accessible
4. **File Limits**: Single file recommended not exceeding 200MB, maximum 600 pages
5. **Network Stability**: Ensure stable network when uploading large files
6. **Security**: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
7. **Optional jq**: Installing `jq` provides enhanced JSON parsing and additional security checks

---

## ๐Ÿ“š Reference Documentation

| Document | Description |
|----------|-------------|
| `docs/Local_File_Parsing_Guide.md` | Detailed curl commands and parameters for local PDF parsing |
| `docs/Online_URL_Parsing_Guide.md` | Detailed curl commands and parameters for online PDF parsing |

External Resources:
- ๐Ÿ  **MinerU Official**: https://mineru.net/
- ๐Ÿ“– **API Documentation**: https://mineru.net/apiManage/docs
- ๐Ÿ’ป **GitHub Repository**: https://github.com/opendatalab/MinerU

---

*Skill Version: 1.0.0*  
*Release Date: 2026-02-18*  
*Community Skill - Not affiliated with MinerU official*

Example Workflow

Here's how your AI assistant might use this skill in practice.

INPUT

User asks: Converting arXiv research papers to Markdown for note-taking

AGENT
  1. 1Converting arXiv research papers to Markdown for note-taking
  2. 2Extracting tables from financial reports into structured text
  3. 3Batch processing academic PDFs to build a searchable knowledge base
  4. 4Parsing scanned documents with OCR into plain text
  5. 5Converting technical manuals with embedded formulas to Markdown
OUTPUT
Extract PDF content to Markdown using MinerU API.

Share this skill

Security Audits

VirusTotalBenign
OpenClawBenign
View full report

These signals reflect official OpenClaw status values. A Suspicious status means the skill should be used with extra caution.

Details

LanguageMarkdown
Last updatedFeb 26, 2026