dataset viewer

An MCP server for the Hugging Face Dataset Viewer API. It lets users browse, search, filter, and analyze datasets hosted on the Hugging Face Hub.

Features

Resources

Uses dataset:// URI scheme for accessing Hugging Face datasets
Supports dataset configurations and splits
Provides paginated access to dataset contents
Handles authentication for private datasets
Supports searching and filtering dataset contents
Provides dataset statistics and analysis

Tools

The server provides the following tools:

validate
Check if a dataset exists and is accessible
Parameters:
- dataset: Dataset identifier (e.g. 'stanfordnlp/imdb')
- auth_token (optional): For private datasets
get_info
Get detailed information about a dataset
Parameters:
- dataset: Dataset identifier
- auth_token (optional): For private datasets
get_rows
Get paginated contents of a dataset
Parameters:
- dataset: Dataset identifier
- config: Configuration name
- split: Split name
- page (optional): Page number (0-based)
- auth_token (optional): For private datasets
get_first_rows
Get first rows from a dataset split
Parameters:
- dataset: Dataset identifier
- config: Configuration name
- split: Split name
- auth_token (optional): For private datasets
get_statistics
Get statistics about a dataset split
Parameters:
- dataset: Dataset identifier
- config: Configuration name
- split: Split name
- auth_token (optional): For private datasets
search_dataset
Search for text within a dataset
Parameters:
- dataset: Dataset identifier
- config: Configuration name
- split: Split name
- query: Text to search for
- auth_token (optional): For private datasets
filter
Filter rows using SQL-like conditions
Parameters:
- dataset: Dataset identifier
- config: Configuration name
- split: Split name
- where: SQL WHERE clause (e.g. "score > 0.5")
- orderby (optional): SQL ORDER BY clause
- page (optional): Page number (0-based)
- auth_token (optional): For private datasets
get_parquet
Download entire dataset in Parquet format
Parameters:
- dataset: Dataset identifier
- auth_token (optional): For private datasets
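
The parameter lists above can be checked on the client side before a tool is called. The following is a minimal sketch, not part of the server: `REQUIRED` and `check_arguments` are hypothetical names, and the `owner/dataset` pattern is the one given in the tool schemas.

```python
import re

# Pattern for dataset identifiers, taken from the tool schemas (owner/dataset).
DATASET_PATTERN = re.compile(r"^[^/]+/[^/]+$")

# Required parameters per tool, as listed above (hypothetical helper table).
REQUIRED = {
    "validate": ["dataset"],
    "get_info": ["dataset"],
    "get_rows": ["dataset", "config", "split"],
    "get_first_rows": ["dataset", "config", "split"],
    "get_statistics": ["dataset", "config", "split"],
    "search_dataset": ["dataset", "config", "split", "query"],
    "filter": ["dataset", "config", "split", "where"],
    "get_parquet": ["dataset"],
}


def check_arguments(tool: str, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    if tool not in REQUIRED:
        return [f"unknown tool: {tool}"]
    problems = [
        f"missing parameter: {name}"
        for name in REQUIRED[tool]
        if name not in arguments
    ]
    dataset = arguments.get("dataset")
    if isinstance(dataset, str) and not DATASET_PATTERN.match(dataset):
        problems.append("dataset must look like owner/dataset")
    # page is optional and 0-based wherever a tool accepts it.
    page = arguments.get("page", 0)
    if not isinstance(page, int) or page < 0:
        problems.append("page must be a non-negative integer")
    return problems
```

An empty result for `check_arguments("validate", {"dataset": "stanfordnlp/imdb"})` means the call can be sent as is.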

Installation

Prerequisites

Python 3.12 or higher
uv - Fast Python package installer and resolver

Setup

Clone the repository:

git clone https://github.com/privetin/dataset-viewer.git
cd dataset-viewer

Create a virtual environment and install:

# Create virtual environment
uv venv

# Activate virtual environment
# On Unix:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Install in development mode
uv add -e .

Configuration

Environment Variables

HUGGINGFACE_TOKEN: Your Hugging Face API token for accessing private datasets

Claude Desktop Integration

Add the following to your Claude Desktop config file:

On Windows: %APPDATA%\Claude\claude_desktop_config.json

On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "dataset-viewer": {
      "command": "uv",
      "args": [
        "run",
        "dataset-viewer"
      ]
    }
  }
}
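
If the config file already lists other servers, the entry has to be merged rather than pasted over the whole file. A small sketch of that merge follows; `add_server` is a hypothetical helper, and the path passed to it is whichever of the two locations above applies.

```python
import json
from pathlib import Path


def add_server(config_path: Path) -> dict:
    """Merge the dataset-viewer entry into a Claude Desktop config file."""
    # Start from the existing config so other servers are preserved.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["dataset-viewer"] = {"command": "uv", "args": ["run", "dataset-viewer"]}
    config_path.write_text(json.dumps(config, indent=2))
    return config
```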

Usage Examples

Validate a dataset:

```
{
  "dataset": "stanfordnlp/imdb"
}
```

Get dataset information:

```
{
  "dataset": "stanfordnlp/imdb"
}
```

Search dataset contents:

{
  "dataset": "stanfordnlp/imdb",
  "config": "plain_text",
  "split": "train",
  "query": "great movie"
}
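
For orientation, the search arguments above correspond closely to the query parameters of the Dataset Viewer API's `/search` endpoint. The sketch below only builds the URL; that the server forwards the arguments in exactly this way is an assumption, and `search_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Public base URL of the Hugging Face Dataset Viewer API.
API_BASE = "https://datasets-server.huggingface.co"


def search_url(dataset: str, config: str, split: str, query: str) -> str:
    """Build a /search URL from the same four arguments the tool takes."""
    params = {"dataset": dataset, "config": config, "split": split, "query": query}
    # urlencode percent-encodes the slash in the dataset id and the space in the query.
    return f"{API_BASE}/search?{urlencode(params)}"
```

With the example arguments, `search_url("stanfordnlp/imdb", "plain_text", "train", "great movie")` returns `https://datasets-server.huggingface.co/search?dataset=stanfordnlp%2Fimdb&config=plain_text&split=train&query=great+movie`.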

Filter and sort rows:

{
  "dataset": "stanfordnlp/imdb",
  "config": "plain_text",
  "split": "train",
  "where": "label = 'positive'",
  "orderby": "text DESC",
  "page": 0
}
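
The `page` argument is 0-based, and the tool schema states a page size of 100 rows. The Dataset Viewer API's `/filter` endpoint takes `offset` and `length` instead, so a translation along the following lines is plausible. This is a sketch under that assumption, with `filter_params` as a hypothetical name, not the server's actual code.

```python
ROWS_PER_PAGE = 100  # page size stated in the tool schema


def filter_params(arguments: dict) -> dict:
    """Translate filter tool arguments into offset/length style query parameters."""
    page = arguments.get("page", 0)
    params = {
        "dataset": arguments["dataset"],
        "config": arguments["config"],
        "split": arguments["split"],
        "where": arguments["where"],
        # Page 0 starts at row 0, page 1 at row 100, and so on.
        "offset": page * ROWS_PER_PAGE,
        "length": ROWS_PER_PAGE,
    }
    # orderby is optional, so it is only sent when the caller provided it.
    if arguments.get("orderby"):
        params["orderby"] = arguments["orderby"]
    return params
```

For `"page": 2`, the resulting window is `offset` 200 with `length` 100.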

Get dataset statistics:

{
  "dataset": "stanfordnlp/imdb",
  "config": "plain_text",
  "split": "train"
}

License

MIT License - see LICENSE for details

[
  {
    "description": "Get detailed information about a Hugging Face dataset including description, features, splits, and statistics. Run validate first to check if the dataset exists and is accessible.",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        }
      },
      "required": [
        "dataset"
      ],
      "type": "object"
    },
    "name": "get_info"
  },
  {
    "description": "Get paginated rows from a Hugging Face dataset",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "config": {
          "description": "Dataset configuration/subset name. Use get_info to list available configs",
          "examples": [
            "default",
            "en",
            "es"
          ],
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        },
        "page": {
          "default": 0,
          "description": "Page number (0-based), returns 100 rows per page",
          "type": "integer"
        },
        "split": {
          "description": "Dataset split name. Splits partition the data for training/evaluation",
          "examples": [
            "train",
            "validation",
            "test"
          ],
          "type": "string"
        }
      },
      "required": [
        "dataset",
        "config",
        "split"
      ],
      "type": "object"
    },
    "name": "get_rows"
  },
  {
    "description": "Get first rows from a Hugging Face dataset split",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "config": {
          "description": "Dataset configuration/subset name. Use get_info to list available configs",
          "examples": [
            "default",
            "en",
            "es"
          ],
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        },
        "split": {
          "description": "Dataset split name. Splits partition the data for training/evaluation",
          "examples": [
            "train",
            "validation",
            "test"
          ],
          "type": "string"
        }
      },
      "required": [
        "dataset",
        "config",
        "split"
      ],
      "type": "object"
    },
    "name": "get_first_rows"
  },
  {
    "description": "Search for text within a Hugging Face dataset",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "config": {
          "description": "Dataset configuration/subset name. Use get_info to list available configs",
          "examples": [
            "default",
            "en",
            "es"
          ],
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        },
        "query": {
          "description": "Text to search for in the dataset",
          "type": "string"
        },
        "split": {
          "description": "Dataset split name. Splits partition the data for training/evaluation",
          "examples": [
            "train",
            "validation",
            "test"
          ],
          "type": "string"
        }
      },
      "required": [
        "dataset",
        "config",
        "split",
        "query"
      ],
      "type": "object"
    },
    "name": "search_dataset"
  },
  {
    "description": "Filter rows in a Hugging Face dataset using SQL-like conditions",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "config": {
          "description": "Dataset configuration/subset name. Use get_info to list available configs",
          "examples": [
            "default",
            "en",
            "es"
          ],
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        },
        "orderby": {
          "description": "SQL-like ORDER BY clause to sort results",
          "examples": [
            "column ASC",
            "score DESC",
            "name ASC, id DESC"
          ],
          "optional": true,
          "type": "string"
        },
        "page": {
          "default": 0,
          "description": "Page number for paginated results (100 rows per page)",
          "minimum": 0,
          "type": "integer"
        },
        "split": {
          "description": "Dataset split name. Splits partition the data for training/evaluation",
          "examples": [
            "train",
            "validation",
            "test"
          ],
          "type": "string"
        },
        "where": {
          "description": "SQL-like WHERE clause to filter rows",
          "examples": [
            "column = \"value\"",
            "score > 0.5",
            "text LIKE \"%query%\""
          ],
          "type": "string"
        }
      },
      "required": [
        "dataset",
        "config",
        "split",
        "where"
      ],
      "type": "object"
    },
    "name": "filter"
  },
  {
    "description": "Get statistics about a Hugging Face dataset",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "config": {
          "description": "Dataset configuration/subset name. Use get_info to list available configs",
          "examples": [
            "default",
            "en",
            "es"
          ],
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        },
        "split": {
          "description": "Dataset split name. Splits partition the data for training/evaluation",
          "examples": [
            "train",
            "validation",
            "test"
          ],
          "type": "string"
        }
      },
      "required": [
        "dataset",
        "config",
        "split"
      ],
      "type": "object"
    },
    "name": "get_statistics"
  },
  {
    "description": "Export Hugging Face dataset split as Parquet file",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        }
      },
      "required": [
        "dataset"
      ],
      "type": "object"
    },
    "name": "get_parquet"
  },
  {
    "description": "Check if a Hugging Face dataset exists and is accessible",
    "inputSchema": {
      "properties": {
        "auth_token": {
          "description": "Hugging Face auth token for private/gated datasets",
          "optional": true,
          "type": "string"
        },
        "dataset": {
          "description": "Hugging Face dataset identifier in the format owner/dataset",
          "examples": [
            "ylecun/mnist",
            "stanfordnlp/imdb"
          ],
          "pattern": "^[^/]+/[^/]+$",
          "type": "string"
        }
      },
      "required": [
        "dataset"
      ],
      "type": "object"
    },
    "name": "validate"
  }
]