mineru-pdf-extractor
Extract PDF content to Markdown using MinerU API.
Setup & Installation
Install command
clawhub install a-i-r/mineru-pdf-extractorIf the CLI is not installed:
Install command
npx clawhub@latest install a-i-r/mineru-pdf-extractorOr install with OpenClaw CLI:
Install command
openclaw skills install a-i-r/mineru-pdf-extractoror paste the repo link into your assistant's chat
Install command
https://github.com/openclaw/skills/tree/main/skills/a-i-r/mineru-pdf-extractorWhat This Skill Does
MinerU PDF Extractor converts PDF documents to structured Markdown via the MinerU API. It handles formula recognition, table extraction, and OCR. Two methods are available: uploading a local file in 4 steps, or submitting a public URL in 2 steps.
Handles complex PDF content like LaTeX formulas and embedded tables that basic text extractors miss.
When to Use It
- Converting arXiv research papers to Markdown for note-taking
- Extracting tables from financial reports into structured text
- Batch processing academic PDFs to build a searchable knowledge base
- Parsing scanned documents with OCR into plain text
- Converting technical manuals with embedded formulas to Markdown
View original SKILL.md file
# MinerU PDF Extractor
Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.
> **Note**: This is a community skill, not an official MinerU product. You need to obtain your own API key from [MinerU](https://mineru.net/).
---
## ๐ Skill Structure
```
mineru-pdf-extractor/
โโโ SKILL.md # English documentation
โโโ SKILL_zh.md # Chinese documentation
โโโ docs/ # Documentation
โ โโโ Local_File_Parsing_Guide.md # Local PDF parsing detailed guide (English)
โ โโโ Online_URL_Parsing_Guide.md # Online PDF parsing detailed guide (English)
โ โโโ MinerU_ๆฌๅฐๆๆกฃ่งฃๆๅฎๆดๆต็จ.md # Local parsing complete guide (Chinese)
โ โโโ MinerU_ๅจ็บฟๆๆกฃ่งฃๆๅฎๆดๆต็จ.md # Online parsing complete guide (Chinese)
โโโ scripts/ # Executable scripts
โโโ local_file_step1_apply_upload_url.sh # Local parsing Step 1
โโโ local_file_step2_upload_file.sh # Local parsing Step 2
โโโ local_file_step3_poll_result.sh # Local parsing Step 3
โโโ local_file_step4_download.sh # Local parsing Step 4
โโโ online_file_step1_submit_task.sh # Online parsing Step 1
โโโ online_file_step2_poll_result.sh # Online parsing Step 2
```
---
## ๐ง Requirements
### Required Environment Variables
Scripts automatically read MinerU Token from environment variables (choose one):
```bash
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"
# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```
### Required Command-Line Tools
- `curl` - For HTTP requests (usually pre-installed)
- `unzip` - For extracting results (usually pre-installed)
### Optional Tools
- `jq` - For enhanced JSON parsing and security (recommended but not required)
- If not installed, scripts will use fallback methods
- Install: `apt-get install jq` (Debian/Ubuntu) or `brew install jq` (macOS)
### Optional Configuration
```bash
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```
> ๐ก **Get Token**: Visit https://mineru.net/apiManage/docs to register and obtain an API Key
---
## ๐ Feature 1: Parse Local PDF Documents
For locally stored PDF files. Requires 4 steps.
### Quick Start
```bash
cd scripts/
# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx
# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf
# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx
# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```
### Script Descriptions
#### local_file_step1_apply_upload_url.sh
Apply for upload URL and batch_id.
**Usage:**
```bash
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
```
**Parameters:**
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`
**Output:**
```
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
```
---
#### local_file_step2_upload_file.sh
Upload PDF file to the presigned URL.
**Usage:**
```bash
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
```
---
#### local_file_step3_poll_result.sh
Poll extraction results until completion or failure.
**Usage:**
```bash
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
```
**Output:**
```
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
```
---
#### local_file_step4_download.sh
Download result ZIP and extract.
**Usage:**
```bash
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
```
**Output Structure:**
```
extracted/
โโโ full.md # ๐ Markdown document (main result)
โโโ images/ # ๐ผ๏ธ Extracted images
โโโ content_list.json # Structured content
โโโ layout.json # Layout analysis data
```
### Detailed Documentation
๐ **Complete Guide**: See `docs/Local_File_Parsing_Guide.md`
---
## ๐ Feature 2: Parse Online PDF Documents (URL Method)
For PDF files already available online (e.g., arXiv, websites). Only 2 steps, more concise and efficient.
### Quick Start
```bash
cd scripts/
# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx
# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```
### Script Descriptions
#### online_file_step1_submit_task.sh
Submit parsing task for online PDF.
**Usage:**
```bash
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
```
**Parameters:**
- `pdf_url`: Complete URL of the online PDF (required)
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`
**Output:**
```
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
---
#### online_file_step2_poll_result.sh
Poll extraction results, automatically download and extract when complete.
**Usage:**
```bash
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
```
**Output Structure:**
```
extracted/
โโโ full.md # ๐ Markdown document (main result)
โโโ images/ # ๐ผ๏ธ Extracted images
โโโ content_list.json # Structured content
โโโ layout.json # Layout analysis data
```
### Detailed Documentation
๐ **Complete Guide**: See `docs/Online_URL_Parsing_Guide.md`
---
## ๐ Comparison of Two Parsing Methods
| Feature | **Local PDF Parsing** | **Online PDF Parsing** |
|---------|----------------------|------------------------|
| **Steps** | 4 steps | 2 steps |
| **Upload Required** | โ
Yes | โ No |
| **Average Time** | 30-60 seconds | 10-20 seconds |
| **Use Case** | Local files | Files already online (arXiv, websites, etc.) |
| **File Size Limit** | 200MB | Limited by source server |
---
## โ๏ธ Advanced Usage
### Batch Process Local Files
```bash
for pdf in /path/to/pdfs/*.pdf; do
echo "Processing: $pdf"
# Step 1
result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2)
# Step 2
./local_file_step2_upload_file.sh "$upload_url" "$pdf"
# Step 3
zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2)
# Step 4
filename=$(basename "$pdf" .pdf)
./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done
```
### Batch Process Online Files
```bash
for url in \
"https://arxiv.org/pdf/2410.17247.pdf" \
"https://arxiv.org/pdf/2409.12345.pdf"; do
echo "Processing: $url"
# Step 1
result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)
# Step 2
filename=$(basename "$url" .pdf)
./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done
```
---
## โ ๏ธ Notes
1. **Token Configuration**: Scripts prioritize `MINERU_TOKEN`, fall back to `MINERU_API_KEY` if not found
2. **Token Security**: Do not hard-code tokens in scripts; use environment variables
3. **URL Accessibility**: For online parsing, ensure the provided URL is publicly accessible
4. **File Limits**: Single file recommended not exceeding 200MB, maximum 600 pages
5. **Network Stability**: Ensure stable network when uploading large files
6. **Security**: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
7. **Optional jq**: Installing `jq` provides enhanced JSON parsing and additional security checks
---
## ๐ Reference Documentation
| Document | Description |
|----------|-------------|
| `docs/Local_File_Parsing_Guide.md` | Detailed curl commands and parameters for local PDF parsing |
| `docs/Online_URL_Parsing_Guide.md` | Detailed curl commands and parameters for online PDF parsing |
External Resources:
- ๐ **MinerU Official**: https://mineru.net/
- ๐ **API Documentation**: https://mineru.net/apiManage/docs
- ๐ป **GitHub Repository**: https://github.com/opendatalab/MinerU
---
*Skill Version: 1.0.0*
*Release Date: 2026-02-18*
*Community Skill - Not affiliated with MinerU official*
Example Workflow
Here's how your AI assistant might use this skill in practice.
User asks: Converting arXiv research papers to Markdown for note-taking
- 1Converting arXiv research papers to Markdown for note-taking
- 2Extracting tables from financial reports into structured text
- 3Batch processing academic PDFs to build a searchable knowledge base
- 4Parsing scanned documents with OCR into plain text
- 5Converting technical manuals with embedded formulas to Markdown
Extract PDF content to Markdown using MinerU API.
Security Audits
These signals reflect official OpenClaw status values. A Suspicious status means the skill should be used with extra caution.