A governed MCP server for Saudi open data sources (SAMA, stats.gov.sa, Ministry of Finance, data.gov.sa), providing typed contracts, registry-backed metadata, and CLI/API access for dataset search, preview, and controlled export.
saudi-open-data-mcp is a production-minded MCP server for Saudi open data sources.
git clone https://github.com/raheb77/saudi-open-data-mcp.git
cd saudi-open-data-mcp
uv sync --no-editable
export HTTP_AUTH_TOKEN="$(openssl rand -hex 32)"
uv run --no-editable saudi-open-data-mcp --version
uv run --no-editable saudi-open-data-mcp check-startup
uv run --no-editable saudi-open-data-mcp list
Live refresh and query paths depend on current upstream source availability and
local snapshots. For evaluation, start with the startup and catalog commands
above; then run uv run --no-editable saudi-open-data-mcp refresh only when
testing live source access.
The project is not just an MCP wrapper around upstream websites. Its value is in the layers underneath MCP:
The current implementation includes curated official-source coverage across SAMA, stats.gov.sa, Ministry of Finance, and one narrow data.gov.sa pilot dataset. The current baseline is an internal, container-first MCP service, with stdio still available for local development and command-based host integration.
The repository also includes an Arabic RTL dashboard package under dashboard/.
That package remains optional, but on main it is now a thin live consumer of
the governed backend over /mcp and /startupz with /readyz kept as a
startup-only compatibility alias, not a separate backend or required runtime
dependency for the core.
See ARCHITECTURE.md for the architecture, ADR-001 for the initial source decision, GOVERNANCE.md for the current core auth/audit/data-access model, OPERATIONS.md for runtime and durability guidance, DEPLOYMENT.md for the current local/container/runtime topology, RUNBOOKS.md for concise failure/recovery handling, PERSISTENCE.md for current persistence and backup/restore boundaries, and CHANGELOG.md for current baseline change visibility and migration notes.
- query_dataset a live remote query surface
- query_dataset and download_dataset stay local-only, while preview_dataset is the only hybrid path and exposes freshness/origin/degradation context explicitly

| Surface | Purpose | Current state |
|---|---|---|
| Backend/core | Governed MCP service over /mcp with /startupz | primary runtime |
| CLI | Thin operator/engineer façade over the same core | supported |
| Dashboard | Arabic RTL UI package under dashboard/ | optional live consumer of /mcp and /startupz |
| Exports | Institutional artifacts over governed query results | CLI-governed path today |
The current deployment fit is intentionally narrow and practical:
- /mcp, plus CLI/stdio for local host integration

In current-state terms, "self-hostable" or "sovereign-hosting-friendly" means:
What this repository does not claim today:
- stats.gov.sa
- data.gov.sa pilot path
- stats.gov.sa headline CPI monthly
- stats.gov.sa total unemployment rate quarterly
- stats.gov.sa real GDP growth quarterly

See DATASETS.md for the current canonical dataset direction and current narrow-contract limits.
Stable enough to evaluate and operate now:
Still intentionally evolving:
The current codebase is organized around three layers.
- connectors/ defines the typed connector contract and the current source-specific connectors for SAMA, stats.gov.sa, Ministry of Finance, and one narrow data.gov.sa pilot dataset. Connector resolution dispatches by descriptor.source.
- storage/ provides raw snapshot persistence and local freshness helpers for connector payloads.
- httpx is the only HTTP client in the core path.
- normalization/ contains source-aware field mapping, source-aware validation, the normalization pipeline, and the current minimal canonical record layer.
- registry/ owns dataset descriptors, health metadata, SQLite persistence, and deterministic bootstrap data.
- rows list of objects
- resources/ exposes read-only registry-backed resource views.
- tools/ exposes deterministic MCP tool layers over the registry, local snapshots, and preview/query paths.
- server.py wires the current MCP surface into FastMCP.

The current exposed MCP surface is intentionally small:

- resource://catalog
- resource://observability
- resource://policies
- dataset_metadata
- dataset_health
- download_dataset
- materialize_hot_set
- query_dataset
- search_datasets
- preview_dataset

What each one does now:

- resource://catalog: read-only summary of the bootstrapped registry catalog
- resource://observability: read-only grouped summary of current process-local counters, plus the raw counter snapshot for internal operators
- resource://policies: read-only summary of current data-facing semantics, including why query_dataset remains the primary analytical surface and preview_dataset remains hybrid
- dataset_metadata: exact lookup of registry-backed dataset metadata by dataset_id
- dataset_health: exact lookup of registry-backed health metadata by dataset_id, with local snapshot freshness evidence when available
- download_dataset: local-only raw snapshot availability lookup by dataset_id
- materialize_hot_set: explicit Wave 1 hot-set fetch and local snapshot persistence for the safe SAMA subset
- query_dataset: local-only exact-match query over canonical records derived from local snapshots
- search_datasets: deterministic registry-backed substring search over dataset metadata
- preview_dataset: exact preview by canonical dataset_id, using explicit local/live hybrid resolution metadata and the registry-owned source_locator internally for source access

Concise example of the current surface:
resource://catalog
resource://observability
resource://policies
dataset_metadata({"dataset_id": "sama-money-supply-weekly"})
dataset_health({"dataset_id": "sama-money-supply-weekly"})
download_dataset({"dataset_id": "sama-money-supply-weekly"})
materialize_hot_set({"include_optional": false})
query_dataset({"dataset_id": "sama-money-supply-weekly", "filters": {"week_end_date": "2024-01-13"}, "limit": 5})
search_datasets({"query": "money"})
preview_dataset({"dataset_id": "sama-money-supply-weekly"})
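As a rough illustration of what "deterministic registry-backed substring search" can mean, here is a minimal sketch. It is not the project's implementation: `DatasetEntry`, the titles, the case-insensitive matching, and the sort order are all assumptions made for the example; only the dataset ids come from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical stand-in for a registry descriptor (not the project's model)."""

    dataset_id: str
    title: str  # illustrative titles, not taken from the real registry
    source: str


def search_datasets(entries: list[DatasetEntry], query: str) -> list[DatasetEntry]:
    """Return entries whose id, title, or source contains the query text."""
    needle = query.strip().lower()
    if not needle:
        return []
    hits = [
        entry
        for entry in entries
        if needle in entry.dataset_id.lower()
        or needle in entry.title.lower()
        or needle in entry.source.lower()
    ]
    # Sorting by dataset_id makes the result independent of input order.
    return sorted(hits, key=lambda entry: entry.dataset_id)


REGISTRY = [
    DatasetEntry("sama-pos-weekly", "Point-of-sale transactions", "sama"),
    DatasetEntry("sama-money-supply-weekly", "Money supply", "sama"),
    DatasetEntry("mof-budget-balance-quarterly", "Budget balance", "mof"),
]

money_hits = [entry.dataset_id for entry in search_datasets(REGISTRY, "money")]
```

Because the search only touches in-memory metadata, the same query against the same registry always yields the same ordered result, which is the property the tool description emphasizes.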
- stats.gov.sa, Ministry of Finance, and one narrow data.gov.sa pilot dataset.
- dataset_id from source-specific source_locator.
- record_derivable
- limited
- failed
- /startupz startup probe, and a /readyz compatibility alias with the same startup-only semantics.
- upstream-canary command and scheduled workflow now exercise live approved dataset paths for source families with a registered queryable canary dataset.
- download_dataset or query_dataset
- dataset_id plus single source_locator

One important limitation to keep explicit: preview_dataset uses the real connector and normalization path, but the normalization layer may still return limited results for HTML/text payloads and does not yet claim final normalized domain records.
Another important limitation: query_dataset only works on local snapshots that can be normalized into the current narrow canonical record shapes. Unsupported JSON shapes and HTML/text payloads remain explicit rather than queryable.
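The local-only, exact-match behavior of query_dataset can be pictured with a small sketch. This is an assumption-laden illustration, not the server's code: `query_records`, the in-memory `SNAPSHOTS` mapping, the `"ok"` status string, and the sample values are hypothetical; `snapshot_missing` and the `week_end_date` filter are the names used elsewhere in this README.

```python
from typing import Any


def query_records(
    snapshots: dict[str, list[dict[str, Any]]],
    dataset_id: str,
    filters: dict[str, Any],
    limit: int,
) -> dict[str, Any]:
    """Exact-match query over records already normalized from a local snapshot."""
    records = snapshots.get(dataset_id)
    if records is None:
        # No local snapshot: report that explicitly instead of fetching remotely.
        return {"status": "snapshot_missing", "rows": []}
    matched = [
        record
        for record in records
        if all(key in record and record[key] == value for key, value in filters.items())
    ]
    return {"status": "ok", "rows": matched[:limit]}


# Placeholder records for illustration only; these are not real SAMA figures.
SNAPSHOTS = {
    "sama-money-supply-weekly": [
        {"week_end_date": "2024-01-06", "value": 1.0},
        {"week_end_date": "2024-01-13", "value": 2.0},
    ]
}

result = query_records(
    SNAPSHOTS, "sama-money-supply-weekly", {"week_end_date": "2024-01-13"}, limit=5
)
missing = query_records(SNAPSHOTS, "sama-pos-weekly", {}, limit=5)
```

The point of the sketch is the failure mode: a dataset without a local snapshot yields an explicit status rather than a silent remote fetch.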
This repo uses a src/ layout. uv sync --no-editable installs the local
package and exposes the saudi-open-data-mcp console script through
uv run --no-editable; local commands do not require manually setting
PYTHONPATH.
Install and sync with uv:
uv sync --no-editable
Then either use uv run --no-editable as shown below or activate the local environment:
source .venv/bin/activate
or call the installed tools from .venv/bin/... explicitly.
Lint:
uv run --no-editable ruff check .
Tests:
uv run --no-editable pytest
The supported local development path is the local console script through
uv run --no-editable:
uv run --no-editable saudi-open-data-mcp check-startup
uv run --no-editable saudi-open-data-mcp run-stdio
HTTP_AUTH_TOKEN="$(openssl rand -hex 32)" uv run --no-editable saudi-open-data-mcp run-http --host 127.0.0.1 --port 8000
After activating .venv, the same console script is available without uv run:
saudi-open-data-mcp check-startup
saudi-open-data-mcp run-stdio
HTTP_AUTH_TOKEN="$(openssl rand -hex 32)" saudi-open-data-mcp run-http --host 127.0.0.1 --port 8000
The same CLI also provides a thin non-interactive local façade over
the current core operations. These commands emit structured JSON by default and
support --output for file writes. --quiet only applies when --output is
set. --format remains json for the read/health/config commands, while
export now also supports excel and pdf artifacts over the governed
query_dataset result:
uv run --no-editable saudi-open-data-mcp list
uv run --no-editable saudi-open-data-mcp query sama-pos-weekly --filter week_end_date=2024-01-13 --limit 5
uv run --no-editable saudi-open-data-mcp preview stats-gov-sa-cpi-headline-monthly
uv run --no-editable saudi-open-data-mcp download sama-money-supply-weekly
uv run --no-editable saudi-open-data-mcp export sama-money-supply-weekly --output money_supply.json
uv run --no-editable saudi-open-data-mcp export sama-money-supply-weekly --format excel --output money_supply.xml
uv run --no-editable saudi-open-data-mcp export sama-money-supply-weekly --format pdf --output money_supply.pdf
uv run --no-editable saudi-open-data-mcp health mof-budget-balance-quarterly
uv run --no-editable saudi-open-data-mcp refresh --dataset sama-money-supply-weekly
uv run --no-editable saudi-open-data-mcp refresh --include-optional
uv run --no-editable saudi-open-data-mcp config
The Excel artifact is an Excel-compatible XML workbook with visible metadata and records worksheets. The PDF artifact is a metadata-first text PDF that keeps status, origin, freshness, and limitations explicit instead of adding decorative reporting layers.
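Since the read commands above emit structured JSON by default, they are straightforward to drive from a script. The following is a minimal sketch under stated assumptions: `build_query_argv` and `run_json` are hypothetical helper names, and repeating `--filter` once per key is an assumption (the README only shows a single filter).

```python
import json
import subprocess
from typing import Any

# Command prefix copied from the examples in this README.
BASE = ["uv", "run", "--no-editable", "saudi-open-data-mcp"]


def build_query_argv(dataset_id: str, filters: dict[str, str], limit: int) -> list[str]:
    """Build the argv for the `query` subcommand shown above."""
    argv = [*BASE, "query", dataset_id]
    for key, value in filters.items():
        # Assumption: one --filter key=value pair per filter.
        argv += ["--filter", f"{key}={value}"]
    argv += ["--limit", str(limit)]
    return argv


def run_json(argv: list[str]) -> Any:
    """Run a CLI command and parse its stdout as JSON (raises on non-zero exit)."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


query_argv = build_query_argv("sama-pos-weekly", {"week_end_date": "2024-01-13"}, 5)
```

Calling `run_json(query_argv)` from the repository root would then return the parsed payload, provided a local snapshot for the dataset exists.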
Use the local console script or helper scripts for development and local host integration.
run-stdio remains the primary local host/operator path for Claude Desktop and
other command-based MCP hosts.
run-http starts the same app over streamable HTTP. Treat that path as
MCP-aware and session-aware only. It is suitable for MCP inspectors and MCP
clients, not generic browser probing. It now requires
Authorization: Bearer <token> using HTTP_AUTH_TOKEN, plus an explicit HTTP
role from HTTP_AUTH_ROLE. The configured role resolves to the allowed
capability bundle, and HTTP_AUTH_CAPABILITIES may be left implicit or set to
the same role bundle explicitly.
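To make the role and capability model concrete, here is a minimal sketch of how a bearer check and a role-to-capability lookup could fit together. It is an illustration, not the project's middleware: the function names are hypothetical, the role and capability names follow the container/runtime expectations in this README, and mapping a missing capability to 403 is an assumption.

```python
import hmac
from typing import Optional

# Role bundles as described in this README: admin currently matches operator.
ROLE_CAPABILITIES = {
    "viewer": frozenset({"read"}),
    "operator": frozenset({"read", "refresh", "materialize"}),
    "admin": frozenset({"read", "refresh", "materialize"}),
}

# Tools that need more than the default "read" capability.
TOOL_CAPABILITY = {
    "preview_dataset": "refresh",
    "materialize_hot_set": "materialize",
}


def required_capability(tool_name: str) -> str:
    """Resources and local read/query/search tools fall back to "read"."""
    return TOOL_CAPABILITY.get(tool_name, "read")


def authorize(auth_header: Optional[str], expected_token: str, role: str, tool_name: str) -> int:
    """Return an HTTP-style status: 200 allowed, 401 bad token, 403 missing capability."""
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return 401
    presented = auth_header[len(prefix):]
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(presented.encode(), expected_token.encode()):
        return 401
    if required_capability(tool_name) not in ROLE_CAPABILITIES.get(role, frozenset()):
        return 403  # assumption: an insufficient role maps to 403 Forbidden
    return 200
```

Under this sketch a viewer can call query_dataset but gets 403 for preview_dataset, while an operator is allowed both.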
By default, local registry and snapshot state resolve under the repo's
.local/ directory; set REGISTRY_PATH or SNAPSHOT_DIR to override them
explicitly. For reproducible host runs, prefer explicit REGISTRY_PATH,
SNAPSHOT_DIR, SAMA_BASE_URL, and DATA_GOV_SA_BASE_URL values.
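The override behavior described above can be sketched as a small resolver. This is a hedged illustration: `resolve_state_paths` is a hypothetical name, and the default file names under `.local/` are assumptions borrowed from the container defaults rather than confirmed local defaults.

```python
from pathlib import PurePosixPath
from typing import Mapping


def resolve_state_paths(env: Mapping[str, str], repo_root: PurePosixPath) -> dict[str, PurePosixPath]:
    """Prefer explicit REGISTRY_PATH / SNAPSHOT_DIR, else fall back to <repo>/.local/."""
    local = repo_root / ".local"
    registry = env.get("REGISTRY_PATH")
    snapshots = env.get("SNAPSHOT_DIR")
    return {
        # Assumed default file name; only the .local/ directory is documented.
        "registry_path": PurePosixPath(registry) if registry else local / "registry.sqlite",
        # Assumed default directory name under .local/.
        "snapshot_dir": PurePosixPath(snapshots) if snapshots else local / "snapshots",
    }


defaults = resolve_state_paths({}, PurePosixPath("/srv/saudi-open-data-mcp"))
overridden = resolve_state_paths(
    {"REGISTRY_PATH": "/data/registry.sqlite"}, PurePosixPath("/srv/saudi-open-data-mcp")
)
```

Anchoring the defaults to the repository root rather than the working directory is what makes host runs reproducible regardless of where the process is launched.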
Local state expectations:
- download_dataset reports only what exists in the local snapshot store. It does not fetch remotely.
- query_dataset only works when a local snapshot exists and the normalization layer can derive canonical records from that snapshot.
- download_dataset returns artifact_missing and query_dataset returns snapshot_missing.
- data_origin, freshness_status, failure_stage, and degradation_reason to make degraded and failed paths easier to interpret.

The helper script remains available for local HTTP development:
./scripts/run_local_http.sh
The official internal serving path for this phase is containerized streamable HTTP.
Chosen serving mode:
- run-http over FastMCP streamable HTTP

Why this mode:
The canonical container entrypoint is:
saudi-open-data-mcp run-http
The image sets container-specific runtime defaults:
- HTTP_HOST=0.0.0.0
- HTTP_PORT=8000
- HTTP_AUTH_TOKEN must be provided by the operator
- HTTP_AUTH_ROLE=operator
- HTTP_AUTH_CAPABILITIES=read,refresh,materialize
- TIER_A_REFRESH_ENABLED=false
- TIER_A_REFRESH_INTERVAL_SECONDS=3600
- REGISTRY_PATH=/var/lib/saudi-open-data-mcp/registry.sqlite
- SNAPSHOT_DIR=/var/lib/saudi-open-data-mcp/snapshots
- CACHE_DIR=/var/lib/saudi-open-data-mcp/cache

Persistence expectations for that runtime:

- REGISTRY_PATH and SNAPSHOT_DIR should live on durable storage if you need state to survive replacement
- CACHE_DIR is recreatable scratch space
- resource://observability counters, in-memory rate limits, and refresh loop state are process-local

Build and serve with Docker Compose:
docker compose up --build
The provided compose file publishes the service on 127.0.0.1:8000 on the
host, persists runtime state in a Docker-managed volume mounted at
/var/lib/saudi-open-data-mcp, enables init: true, and applies the same
/startupz startup-probe contract as the image. It also requires
HTTP_AUTH_TOKEN to be set in the operator environment before startup.
Internal observability remains intentionally simple:
- resource://observability to inspect the current grouped in-process counters in one place
- server.startup.*, preview.request.*, connector.request.*, materialize.*, and tier_a_refresh.*

For operator startup, shutdown, refresh, backup, and restore guidance, see OPERATIONS.md.
Direct container run example:
docker build -t saudi-open-data-mcp .
docker run --rm \
-p 127.0.0.1:8000:8000 \
-e HTTP_AUTH_TOKEN="$(openssl rand -hex 32)" \
-v saudi-open-data-mcp-data:/var/lib/saudi-open-data-mcp \
saudi-open-data-mcp
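After starting the container, a script usually needs to wait until the startup probe answers before pointing an MCP client at it. The following is a minimal polling sketch, assuming the probe is reachable without a bearer token as in the curl example in this README; `http_status` and `wait_for_startup` are hypothetical helper names, and the retry counts are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, timeout, DNS failure, and similar


def wait_for_startup(
    url: str,
    attempts: int = 30,
    delay: float = 1.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the startup probe until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be `wait_for_startup("http://127.0.0.1:8000/startupz")`. A `True` result only confirms the startup-only contract; real session readiness on /mcp still needs an MCP-aware client.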
Container/runtime expectations:
- viewer for read/query/metadata/health/policies/observability
- operator for viewer access plus preview_dataset and materialize_hot_set
- admin as the highest current role with the same operational bundle as operator
- read for resources and local read/query/search tools
- refresh for preview_dataset
- materialize for materialize_hot_set
- HTTP_AUTH_TOKEN, HTTP_AUTH_ROLE, HTTP_AUTH_CAPABILITIES, TIER_A_REFRESH_ENABLED, TIER_A_REFRESH_INTERVAL_SECONDS, SAMA_BASE_URL, DATA_GOV_SA_BASE_URL, and LOG_LEVEL are the main operator-facing overrides
- These base-URL overrides remain explicitly source-specific in the current config because the runtime still carries SAMA-specific and data.gov.sa-pilot assumptions.

Startup/readiness contract:

- GET /startupz is the canonical machine-friendly startup probe for this phase
- GET /readyz remains a compatibility alias for the same startup-only payload
- /startupz and /readyz mean only:
- Authorization: Bearer <token> header are rejected with 401 Unauthorized
- 403 Forbidden
- /startupz and /readyz do not claim:
- /mcp must be checked with an MCP-aware client if you want real session readiness validation
- GET / or GET /mcp probing can still return 404 or 406 and that is not, by itself, a serving failure

Curated live canary contract:

- uv run --no-editable saudi-open-data-mcp upstream-canary performs a live connector fetch plus normalization on:
  - sama-exchange-rates-current
  - stats-gov-sa-cpi-headline-monthly
  - mof-budget-balance-quarterly
- data.gov.sa is skipped with an explicit log line until a queryable data.gov.sa dataset is registered; the catalog-only pilot dataset is not used as a canary target.

For local desktop MCP host registration, use stdio. That remains the supported development/operator path for command-based hosts and is separate from the official internal container serving path.
For stdio-based MCP hosts, use the source-tree CLI directly with absolute paths.
The repo's server.json is repository-level MCP metadata for this
internal/evaluator alpha. It intentionally declares no PyPI, npm, or MCP
registry package until a package is actually published.
Claude Desktop example:
{
  "mcpServers": {
    "saudi-open-data-mcp": {
      "command": "/absolute/path/to/saudi-open-data-mcp/.venv/bin/python",
      "args": [
        "/absolute/path/to/saudi-open-data-mcp/src/saudi_open_data_mcp/cli.py",
        "run-stdio"
      ],
      "cwd": "/absolute/path/to/saudi-open-data-mcp",
      "env": {
        "LOG_LEVEL": "ERROR",
        "SAMA_BASE_URL": "https://www.sama.gov.sa",
        "DATA_GOV_SA_BASE_URL": "https://open.data.gov.sa"
      }
    }
  }
}
cwd is optional here. The default registry and snapshot paths are now anchored to the repo rather than the process working directory, but keeping cwd set to the repo root can still make local config and path reasoning easier.
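Because the host entry above needs absolute paths, it can be less error-prone to generate it than to edit it by hand. Here is a minimal sketch that reproduces the same structure from a repository path; `stdio_host_entry` is a hypothetical helper, and `/opt/saudi-open-data-mcp` is only an example location.

```python
import json
from pathlib import PurePosixPath


def stdio_host_entry(repo_root: PurePosixPath) -> dict:
    """Build the mcpServers entry shown above for a given absolute repo path."""
    if not repo_root.is_absolute():
        raise ValueError("repo_root must be an absolute path")
    return {
        "mcpServers": {
            "saudi-open-data-mcp": {
                "command": str(repo_root / ".venv" / "bin" / "python"),
                "args": [
                    str(repo_root / "src" / "saudi_open_data_mcp" / "cli.py"),
                    "run-stdio",
                ],
                "cwd": str(repo_root),
                "env": {
                    "LOG_LEVEL": "ERROR",
                    "SAMA_BASE_URL": "https://www.sama.gov.sa",
                    "DATA_GOV_SA_BASE_URL": "https://open.data.gov.sa",
                },
            }
        }
    }


# Example location only; substitute the real checkout path.
entry = stdio_host_entry(PurePosixPath("/opt/saudi-open-data-mcp"))
config_text = json.dumps(entry, indent=2)
```

The resulting `config_text` can be merged into the host's configuration file; where that file lives depends on the host and is not covered here.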
After installing the package, command-based hosts can also launch the console script directly:
{
  "mcpServers": {
    "saudi-open-data-mcp": {
      "command": "saudi-open-data-mcp",
      "args": ["run-stdio"],
      "env": {
        "LOG_LEVEL": "ERROR",
        "SAMA_BASE_URL": "https://www.sama.gov.sa",
        "DATA_GOV_SA_BASE_URL": "https://open.data.gov.sa"
      }
    }
  }
}
If you prefer a single command path, the repo also includes a stdio helper script:
{
  "mcpServers": {
    "saudi-open-data-mcp": {
      "command": "/absolute/path/to/saudi-open-data-mcp/scripts/run_local_stdio.sh"
    }
  }
}
Current limitation to keep explicit: local host registration remains stdio through the source-tree CLI. The official container serving path is HTTP, not a desktop stdio-host replacement.
HTTP is the official internal container serving mode, but it is not a plain
REST surface. /mcp is an MCP endpoint, not a normal browser page.
export HTTP_AUTH_TOKEN="$(openssl rand -hex 32)"
uv run --no-editable saudi-open-data-mcp run-http --host 127.0.0.1 --port 8000
curl -s http://127.0.0.1:8000/startupz
- http://127.0.0.1:8000/mcp.
- Authorization: Bearer <the value of HTTP_AUTH_TOKEN>.
- resource://catalog, search_datasets({"query": "money"}), dataset_metadata({"dataset_id": "sama-money-supply-weekly"}), and dataset_health({"dataset_id": "sama-money-supply-weekly"}).

Naive probing can look broken even when the server is healthy:

- GET / can return 404
- GET /mcp without the expected MCP headers may return 406, which is expected
- GET /startupz is the health/startup probe and should return 200 with a narrow startup-only payload

That behavior is expected for the current streamable HTTP setup. Browser or curl checks are useful only as a negative smoke test here, not as a real MCP session test.
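The probing expectations above can be captured in a small lookup so that a smoke script does not misreport a healthy server. This is a sketch under assumptions: `interpret_probe` and the result labels are hypothetical, and treating 401 on /mcp as expected assumes the probe was sent without a valid bearer token.

```python
# Status codes this README describes as normal for plain (non-MCP) GET probes.
EXPECTED_PROBE_STATUS = {
    "/": {404, 406},
    # 401 is included on the assumption that the probe carries no bearer token.
    "/mcp": {401, 404, 406},
    "/startupz": {200},
    "/readyz": {200},  # compatibility alias with the same startup-only payload
}


def interpret_probe(path: str, status: int) -> str:
    """Classify a plain GET result as expected, worth investigating, or unknown."""
    expected = EXPECTED_PROBE_STATUS.get(path)
    if expected is None:
        return "unknown path"
    return "expected" if status in expected else "investigate"


checks = {
    ("/", 404): interpret_probe("/", 404),
    ("/mcp", 406): interpret_probe("/mcp", 406),
    ("/startupz", 200): interpret_probe("/startupz", 200),
    ("/startupz", 503): interpret_probe("/startupz", 503),
}
```

Only the /startupz and /readyz rows say anything positive about the server; the other rows merely confirm that an error status on a naive probe is not a serving failure by itself.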
Programmatic MCP-aware test example:
import asyncio
import os

from fastmcp import Client
from fastmcp.client.transports import StreamableHttpTransport


async def main() -> None:
    # Send the same bearer token the server was started with.
    async with Client(
        transport=StreamableHttpTransport(
            "http://127.0.0.1:8000/mcp",
            headers={"Authorization": f"Bearer {os.environ['HTTP_AUTH_TOKEN']}"},
        )
    ) as client:
        # Exact registry lookup by dataset_id over a real MCP session.
        result = await client.call_tool(
            "dataset_metadata",
            {"dataset_id": "sama-money-supply-weekly"},
        )
        print(result.structured_content)


asyncio.run(main())
The repo includes four test layers:
- tests/unit/: typed contracts, tool behavior, repository behavior, connector behavior, and normalization behavior
- tests/integration/: small cross-module composition checks
- tests/contracts/: architectural boundary checks, including the rule that tool modules must not import connectors directly
- tests/smoke/: basic CLI/importability verification

The main code lives under src/saudi_open_data_mcp/.

- connectors/: source access contracts and per-source connectors (SAMA, stats.gov.sa, MoF, data.gov.sa)
- normalization/: field mapping, validators, pipeline, and minimal canonical records
- registry/: typed metadata models, SQLite repository, and bootstrap
- storage/: snapshots and local freshness helpers
- resources/: registry-backed MCP resources
- tools/: registry-backed and local-only MCP tools, plus preview over the connector path
- observability/: structured logging, process-local counters, upstream canary
- security/: HTTP auth middleware, readiness probes, rate limiting, input sanitization
- config.py: runtime configuration and environment variable resolution
- cli.py: thin non-interactive CLI over the MCP core
- server.py: FastMCP wiring
- docs/: architecture, ADR, roadmap, and dataset notes

Near-term work should stay aligned with the current architecture: