kv extractor mcp server
Extracts structured key-value pairs from arbitrary, noisy, or unstructured text using LLMs and provides output in multiple formats (JSON, YAML, TOML) with type safety.
Version: 0.3.1
This MCP server extracts key-value pairs from arbitrary, noisy, or unstructured text using LLMs (GPT-4.1-mini) and pydantic-ai. It ensures type safety and supports multiple output formats (JSON, YAML, TOML). The server is designed to accept any input and always attempts to structure the data as far as possible; however, perfect extraction is not guaranteed.
While many Large Language Model (LLM) services offer structured output capabilities, this MCP server provides distinct advantages for key-value extraction, especially from challenging real-world text.
- /extract_json: Extracts type-safe key-value pairs in JSON format from input text.
- /extract_yaml: Extracts type-safe key-value pairs in YAML format from input text.
- /extract_toml: Extracts type-safe key-value pairs in TOML format from input text.
Note:
- Supported languages: Japanese, English, and Chinese (Simplified: zh-cn / Traditional: zh-tw).
- Extraction relies on pydantic-ai and LLMs. Perfect extraction is not guaranteed.
- Longer inputs take more time to process. Please be patient.
- On first launch, the server downloads the spaCy models, so startup takes longer than usual.
Input Tokens | Input Characters (approx.) | Measured Processing Time (sec) | Model Configuration |
---|---|---|---|
200 | ~400 | ~15 | gpt-4.1-mini |
Actual processing time may vary significantly depending on API response, network conditions, and model load. Even short texts may take 15 seconds or more.
The server has been tested with various inputs, including:
- Simple key-value pairs
- Noisy or unstructured text with important information buried within
- Different data formats (JSON, YAML, TOML) for output
Below is a flowchart representing the processing flow of the key-value extraction pipeline as implemented in server.py:
flowchart TD
A[Input Text] --> B[Step 0: Preprocessing with spaCy Lang Detect then NER]
B --> C[Step 1: Key-Value Extraction - LLM]
C --> D[Step 2: Type Annotation - LLM]
D --> E[Step 3: Type Evaluation - LLM]
E --> F[Step 4: Type Normalization - Static Rules + LLM]
F --> G[Step 5: Final Structuring with Pydantic]
G --> H[Output in JSON/YAML/TOML]
This server uses spaCy with automatic language detection to extract named entities from the input text before passing it to the LLM. Supported languages are Japanese (ja_core_news_md), English (en_core_web_sm), and Chinese (Simplified/Traditional, zh_core_web_sm).
- The language of the input text is detected with langdetect.
- If the detected language is not supported, the server reports Unsupported lang detected.

[Preprocessing Candidate Phrases (spaCy NER)] The following is a list of phrases automatically extracted from the input text using spaCy's detected language model. These phrases represent detected entities such as names, dates, organizations, locations, numbers, etc. This list is for reference only and may contain irrelevant or incorrect items. The LLM uses its own judgment and considers the entire input text to flexibly infer the most appropriate key-value pairs.
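The language routing and hint block described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in server.py: the `SPACY_MODELS` table and the function names `select_spacy_model` and `build_hint_block` are hypothetical, the language code is assumed to come from langdetect (which reports codes such as `ja`, `en`, `zh-cn`, and `zh-tw`), and the bullet layout of the hint block is an assumption.

```python
# Hypothetical mapping from langdetect-style language codes to the spaCy
# models named in this README. The real server may organize this differently.
SPACY_MODELS = {
    "ja": "ja_core_news_md",
    "en": "en_core_web_sm",
    "zh-cn": "zh_core_web_sm",
    "zh-tw": "zh_core_web_sm",
}


def select_spacy_model(lang_code: str) -> str:
    """Return the spaCy model name for a detected language code."""
    try:
        return SPACY_MODELS[lang_code.lower()]
    except KeyError:
        # Error text taken from the README; the exception type is an assumption.
        raise ValueError("Unsupported lang detected") from None


def build_hint_block(phrases: list[str]) -> str:
    """Format NER phrases as a reference-only hint block for the LLM prompt."""
    header = "[Preprocessing Candidate Phrases (spaCy NER)]"
    # Strip whitespace, drop empty entries, and de-duplicate while keeping order.
    unique = list(dict.fromkeys(p.strip() for p in phrases if p.strip()))
    return "\n".join([header, *(f"- {p}" for p in unique)])
```

Keeping the model lookup separate from the detection call makes the unsupported-language path easy to exercise without loading any spaCy model.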
This project's key-value extraction pipeline consists of multiple steps. Each step's details are as follows:
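- Step 0: Preprocessing with spaCy (ja_core_news_md, en_core_web_sm, zh_core_web_sm) to extract named entities.
- Step 1: Key-value extraction (LLM). Example: key: person, value: ["Tanaka", "Sato"]
- Step 2: Type annotation (LLM). Example: key: person, value: ["Tanaka", "Sato"] -> list[str]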
This pipeline is designed to accommodate future list format support and Pydantic schema extensions.
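To make the type-related steps more concrete, here is a minimal sketch of what a type label and a static normalization rule might look like. In the server these steps are driven by the LLM together with static rules that this README does not spell out, so the helpers `annotate_type` and `normalize_scalar` and the specific rules below are assumptions for illustration only.

```python
import re


def annotate_type(value) -> str:
    """Return a type label such as 'int' or 'list[str]' for a Python value."""
    if isinstance(value, list):
        inner = sorted({annotate_type(v) for v in value})
        if not inner:
            return "list"
        return "list[" + " | ".join(inner) + "]"
    # bool, int, float, str, dict, ... are labeled by their class name.
    return type(value).__name__


def normalize_scalar(text: str):
    """Apply simple static rules to a raw string value (illustrative only)."""
    stripped = text.strip()
    lowered = stripped.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    # Integers, optionally with thousands separators, e.g. "89,800".
    if re.fullmatch(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)", stripped):
        return int(stripped.replace(",", ""))
    if re.fullmatch(r"-?\d+\.\d+", stripped):
        return float(stripped)
    # Anything else (phone numbers, IDs, free text) stays a string.
    return stripped
```

With these rules, `normalize_scalar("89,800")` yields the integer `89800`, while `"090-1234-5678"` stays a string. A purely static rule would also turn a value like `"1234"` into an integer even when it is really an identifier, which illustrates why a judgment step beyond fixed rules can be useful.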
- Simple arrays (e.g. items = ["A", "B"]) can be represented natively, but arrays of objects (dicts) and deeply nested structures are not represented directly in the TOML output.
- Arrays of objects (e.g. [{"name": "A"}, {"name": "B"}]) are stored as "JSON strings" in TOML values.

Input:
Thank you for your order (Order Number: ORD-98765). Product: High-Performance Laptop, Price: 89,800 JPY (tax excluded), Delivery: May 15-17. Shipping address: 1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101. Phone: 090-1234-5678. Payment: Credit Card (VISA, last 4 digits: 1234). For changes, contact [email protected].
Output (JSON):
{
"order_number": "ORD-98765",
"product_name": "High-Performance Laptop",
"price": 89800,
"price_currency": "JPY",
"tax_excluded": true,
"delivery_start_date": "20240515",
"delivery_end_date": "20240517",
"shipping_address": "1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101",
"phone_number": "090-1234-5678",
"payment_method": "Credit Card",
"card_type": "VISA",
"card_last4": "1234",
"customer_support_email": "[email protected]"
}
Output (YAML):
order_number: ORD-98765
product_name: High-Performance Laptop
price: 89800
price_currency: JPY
tax_excluded: true
delivery_start_date: '20240515'
delivery_end_date: '20240517'
shipping_address: 1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101
phone_number: 090-1234-5678
payment_method: Credit Card
card_type: VISA
card_last4: '1234'
customer_support_email: [email protected]
Output (TOML, simple case):
order_number = "ORD-98765"
product_name = "High-Performance Laptop"
price = 89800
price_currency = "JPY"
tax_excluded = true
delivery_start_date = "20240515"
delivery_end_date = "20240517"
shipping_address = "1-2-3 Shinjuku, Shinjuku-ku, Tokyo, Apartment 101"
phone_number = "090-1234-5678"
payment_method = "Credit Card"
card_type = "VISA"
card_last4 = "1234"
Output (TOML, complex case):
items = '[{"name": "A", "qty": 2}, {"name": "B", "qty": 5}]'
addresses = '[{"city": "Tokyo", "zip": "160-0022"}, {"city": "Osaka", "zip": "530-0001"}]'
Note: Arrays of objects or nested structures are stored as JSON strings in TOML.
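The JSON-string fallback can be sketched with the standard library alone. This is a minimal sketch, not the server's serializer: `toml_value` and `dumps_flat_toml` are hypothetical helpers, and the sketch assumes a flat dictionary whose keys are valid bare TOML keys.

```python
import json

PRIMITIVES = (str, int, float, bool)


def toml_value(value) -> str:
    """Render one value as TOML, falling back to a JSON string when nested."""
    if isinstance(value, bool):  # check bool before int (bool subclasses int)
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # JSON string escaping is used here as a stand-in for TOML basic strings.
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, list) and all(isinstance(v, PRIMITIVES) for v in value):
        return "[" + ", ".join(toml_value(v) for v in value) + "]"
    # Arrays of objects and other nested structures: store as a JSON string.
    text = json.dumps(value, ensure_ascii=False)
    if "'" not in text:
        return f"'{text}'"  # TOML literal string, no escaping needed
    return json.dumps(text, ensure_ascii=False)  # escaped basic string


def dumps_flat_toml(data: dict) -> str:
    """Serialize a flat dict to TOML lines (keys assumed to be bare keys)."""
    return "\n".join(f"{key} = {toml_value(val)}" for key, val in data.items())
```

A consumer that reads such a document would parse the TOML first and then call `json.loads` on the string values it knows to hold nested data.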
extract_json
- input_text (string): Input string containing noisy or unstructured data.
- Returns { "success": True, "result": ... } on success, or { "success": False, "error": ... } on failure. Example:
{
"success": true,
"result": { "foo": 1, "bar": "baz" }
}
extract_yaml
- input_text (string): Input string containing noisy or unstructured data.
- Returns { "success": True, "result": ... } on success, or { "success": False, "error": ... } on failure. Example:
{ "success": true, "result": "foo: 1\nbar: baz" }
extract_toml
- input_text (string): Input string containing noisy or unstructured data.
- Returns { "success": True, "result": ... } on success, or { "success": False, "error": ... } on failure. Example:
{ "success": true, "result": "foo = 1\nbar = \"baz\"" }
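All three tools share the same success/error envelope. One way to produce it is a small wrapper around the extraction call. The sketch below is an assumption about how such a wrapper could look; `run_extraction` and the `extract` callable are hypothetical names, not part of the server's API.

```python
from typing import Any, Callable


def run_extraction(extract: Callable[[str], Any], input_text: str) -> dict[str, Any]:
    """Call an extractor and wrap its outcome in the success/error envelope."""
    try:
        return {"success": True, "result": extract(input_text)}
    except Exception as exc:
        # Any failure is reported in the envelope instead of propagating.
        return {"success": False, "error": str(exc)}
```

A client can then branch on the `success` field alone and never needs to handle an exception from the tool call itself.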
To install kv-extractor-mcp-server for Claude Desktop automatically via Smithery:
npx -y @smithery/cli install @KunihiroS/kv-extractor-mcp-server --client claude
An OpenAI API key is required (set OPENAI_API_KEY in settings.json under env).

To run the server manually:
python server.py
When running this MCP Server, you must explicitly specify the log output mode and (if enabled) the absolute log file path via command-line arguments.
- --log=off: Disable all logging (no logs are written).
- --log=on --logfile=/absolute/path/to/logfile.log: Enable logging and write logs to the specified absolute file path.
"kv-extractor-mcp-server": {
"command": "pipx",
"args": ["run", "kv-extractor-mcp-server", "--log=off"],
"env": {
"OPENAI_API_KEY": "{apikey}"
}
}
"kv-extractor-mcp-server": {
"command": "pipx",
"args": ["run", "kv-extractor-mcp-server", "--log=on", "--logfile=/workspace/logs/kv-extractor-mcp-server.log"],
"env": {
"OPENAI_API_KEY": "{apikey}"
}
}
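The rules for the --log and --logfile arguments can be sketched with argparse. This is a minimal sketch of the documented behavior, not the server's actual argument handling: `parse_log_args` is a hypothetical helper and the error message text is an assumption.

```python
import argparse
import os


def parse_log_args(argv: list[str]) -> argparse.Namespace:
    """Parse --log/--logfile and enforce the absolute-path rule."""
    parser = argparse.ArgumentParser(prog="kv-extractor-mcp-server")
    # --log is mandatory and must be either "on" or "off".
    parser.add_argument("--log", choices=["on", "off"], required=True)
    parser.add_argument("--logfile", default=None)
    args = parser.parse_args(argv)
    if args.log == "on":
        # A missing or relative path is rejected before the server starts.
        if not args.logfile or not os.path.isabs(args.logfile):
            parser.error("--logfile must be an absolute path when --log=on")
    return args
```

On invalid input, `parser.error` prints a usage message and exits with a non-zero status, which matches the documented "will not start" behavior.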
Note:
- When logging is enabled, logs are written only to the specified absolute file path. A relative path, or omitting --logfile, causes an error.
- When logging is disabled, no logs are output.
- If the required arguments are missing or invalid, the server will not start and will print an error message.
- The log file must be accessible and writable by the MCP Server process.
- If you have trouble running this server, a cached older version of kv-extractor-mcp-server may be the cause. Try pinning the latest version (set x.y.z to the latest version) with the setting below.
"kv-extractor-mcp-server": {
"command": "pipx",
"args": ["run", "kv-extractor-mcp-server==x.y.z", "--log=off"],
"env": {
"OPENAI_API_KEY": "{apikey}"
}
}
GPL-3.0-or-later
KunihiroS (and contributors)