Parquet MCP Server
A powerful MCP (Model Context Protocol) server that provides tools for manipulating and analyzing Parquet files. This server is designed to work with Claude Desktop and offers five main functionalities:
This server is particularly useful for:
- Data scientists working with large Parquet datasets
- Applications requiring vector embeddings for text data
- Projects needing to analyze or convert Parquet files
- Workflows that benefit from DuckDB's fast querying capabilities
- Applications requiring vector similarity search with PostgreSQL and pgvector
To install Parquet MCP Server for Claude Desktop automatically via Smithery:
npx -y @smithery/cli install @DeepSpringAI/parquet_mcp_server --client claude
git clone ...
cd parquet_mcp_server
uv venv
.venv\Scripts\activate # On Windows
source .venv/bin/activate # On macOS/Linux
uv pip install -e .
Create a .env file with the following variables:
EMBEDDING_URL= # URL for the embedding service
OLLAMA_URL= # URL for Ollama server
EMBEDDING_MODEL=nomic-embed-text # Model to use for generating embeddings
# PostgreSQL Configuration
POSTGRES_DB=your_database_name
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
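For reference, here is a minimal sketch of how these variables can be read in Python with the python-dotenv package; the server's actual configuration loading may differ, and the variable names simply mirror the .env file above:

# Minimal sketch: loading the .env configuration with python-dotenv.
# The server may handle these differently; names match the .env above.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

embedding_url = os.getenv("EMBEDDING_URL")
ollama_url = os.getenv("OLLAMA_URL")
embedding_model = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
postgres_dsn = (
    f"dbname={os.getenv('POSTGRES_DB')} "
    f"user={os.getenv('POSTGRES_USER')} "
    f"password={os.getenv('POSTGRES_PASSWORD')} "
    f"host={os.getenv('POSTGRES_HOST', 'localhost')} "
    f"port={os.getenv('POSTGRES_PORT', '5432')}"
)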
Add this to your Claude Desktop configuration file (claude_desktop_config.json):
{
  "mcpServers": {
    "parquet-mcp-server": {
      "command": "uv",
      "args": [
        "--directory",
        "/home/${USER}/workspace/parquet_mcp_server/src/parquet_mcp_server",
        "run",
        "main.py"
      ]
    }
  }
}
The server provides five main tools:
1. Embed Parquet: Add embeddings to a text column in a Parquet file
   Required parameters:
   - input_path: Path to the input Parquet file
   - output_path: Path to save the output
   - column_name: Column containing text to embed
   - embedding_column: Name for the new embedding column
   - batch_size: Number of texts to process in each batch (for better performance)

2. Parquet Information: Get details about a Parquet file
   Required parameters:
   - file_path: Path to the Parquet file to analyze

3. Convert to DuckDB: Convert a Parquet file to a DuckDB database
   Required parameters:
   - parquet_path: Path to the input Parquet file
   Optional parameters:
   - output_dir: Directory to save the DuckDB database (defaults to the same directory as the input file)

4. Convert to PostgreSQL: Convert a Parquet file to a PostgreSQL table with pgvector support
   Required parameters:
   - parquet_path: Path to the input Parquet file
   - table_name: Name of the PostgreSQL table to create or append to

5. Process Markdown: Convert markdown files into structured chunks with metadata
   Required parameters:
   - file_path: Path to the markdown file to process
   - output_path: Path to save the output Parquet file

Here are some example prompts you can use with the agent:
"Please embed the column 'text' in the parquet file '/path/to/input.parquet' and save the output to '/path/to/output.parquet'. Use 'embeddings' as the final column name and a batch size of 2"
"Please give me some information about the parquet file '/path/to/input.parquet'"
"Please convert the parquet file '/path/to/input.parquet' to DuckDB format and save it in '/path/to/output/directory'"
"Please convert the parquet file '/path/to/input.parquet' to a PostgreSQL table named 'my_table'"
"Please process the markdown file '/path/to/input.md' and save the chunks to '/path/to/output.parquet'"
The project includes a comprehensive test suite in the src/tests directory. You can run all tests using:
python src/tests/run_tests.py
Or run individual tests:
# Test embedding functionality
python src/tests/test_embedding.py
# Test parquet information tool
python src/tests/test_parquet_info.py
# Test DuckDB conversion
python src/tests/test_duckdb_conversion.py
# Test PostgreSQL conversion
python src/tests/test_postgres_conversion.py
# Test Markdown processing
python src/tests/test_markdown_processing.py
You can also test the server using the client directly:
from parquet_mcp_server.client import (
    convert_to_duckdb,
    embed_parquet,
    get_parquet_info,
    convert_to_postgres,
    process_markdown_file,
)

# Test DuckDB conversion
result = convert_to_duckdb(
    parquet_path="input.parquet",
    output_dir="db_output"
)

# Test embedding
result = embed_parquet(
    input_path="input.parquet",
    output_path="output.parquet",
    column_name="text",
    embedding_column="embeddings",
    batch_size=2
)

# Test parquet information
result = get_parquet_info("input.parquet")

# Test PostgreSQL conversion
result = convert_to_postgres(
    parquet_path="input.parquet",
    table_name="my_table"
)

# Test markdown processing
result = process_markdown_file(
    file_path="input.md",
    output_path="output.parquet"
)
Make sure the variables in your .env file are correct.

The embeddings are returned in the following format:
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "embedding": [0.123, 0.456, ...],
    "index": 0
  }],
  "model": "llama2",
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 4
  }
}
Each embedding vector is stored in the Parquet file as a NumPy array in the specified embedding column.
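As an illustration, here is a minimal sketch of pulling the vectors out of a response with the shape above; the field names follow the example, and the server's internal handling may differ:

# Minimal sketch: extracting embedding vectors from a response
# shaped like the example above into NumPy arrays.
import numpy as np

response = {
    "object": "list",
    "data": [{"object": "embedding", "embedding": [0.123, 0.456], "index": 0}],
    "model": "llama2",
    "usage": {"prompt_tokens": 4, "total_tokens": 4},
}

vectors = [np.array(item["embedding"]) for item in response["data"]]
print(vectors[0].shape)  # one vector per input text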
The DuckDB conversion tool returns a success message with the path to the created database file or an error message if the conversion fails.
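Once converted, the database can be queried directly with the duckdb Python package. A minimal sketch, where both the database path and the table name are hypothetical and based on the earlier example:

# Minimal sketch: querying a converted DuckDB database.
# The path and table name below are assumptions; inspect the
# database with SHOW TABLES to find the actual table name.
import duckdb

con = duckdb.connect("db_output/input.duckdb")
print(con.execute("SHOW TABLES").fetchall())                  # list tables
print(con.execute("SELECT COUNT(*) FROM input").fetchall())   # hypothetical table name
con.close()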
The PostgreSQL conversion tool returns a success message indicating whether a new table was created or data was appended to an existing table.
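To check the result, you can query the table with psycopg2, reusing the connection settings from the .env file; a minimal sketch:

# Minimal sketch: verifying the converted table with psycopg2,
# using the same connection settings as the .env file.
import os
import psycopg2

conn = psycopg2.connect(
    dbname=os.getenv("POSTGRES_DB"),
    user=os.getenv("POSTGRES_USER"),
    password=os.getenv("POSTGRES_PASSWORD"),
    host=os.getenv("POSTGRES_HOST", "localhost"),
    port=os.getenv("POSTGRES_PORT", "5432"),
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM my_table")  # table name from the example above
    print(cur.fetchone()[0])
conn.close()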
The markdown chunking tool processes markdown files into chunks and saves them as a Parquet file with the following columns:
- text: The text content of each chunk
- metadata: Additional metadata about the chunk (e.g., headers, section info)
The tool returns a success message with the path to the created Parquet file or an error message if the processing fails.
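For a quick sanity check, the chunk file can be inspected with pandas; a minimal sketch, assuming the output path from the earlier example:

# Minimal sketch: inspecting the chunked output with pandas.
import pandas as pd

chunks = pd.read_parquet("output.parquet")
print(chunks.columns.tolist())    # expect ["text", "metadata"]
print(chunks.loc[0, "text"])      # first chunk's text
print(chunks.loc[0, "metadata"])  # its metadata (headers, section info)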