Vector Indexing & Metadata

This guide covers the schema and best practices for populating the Cloudflare Vectorize index.

1. Index Schema

| Property | Value | Description |
| --- | --- | --- |
| Dimensions | `1024` | Required by the `bge-m3` embedding model. |
| Metric | `cosine` | Cosine similarity is recommended for text embeddings. |

Metadata Fields

To enable efficient filtering, the following metadata fields are indexed:

| Field Name | Type | Description |
| --- | --- | --- |
| `lang` | `string` | The language of the article chunk (e.g., `ar`, `en`). |
| `source` | `string` | The key of the source (e.g., `SANA`, `verify-sy`). |
| `published_at_bucket` | `number` | The publication month, formatted as `YYYYMM` (e.g., `202508`). |
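
As a rough illustration of how these indexed fields might be used, the sketch below derives a `YYYYMM` bucket from a publication date and builds a metadata filter over the three fields. The helper names `toPublishedAtBucket` and `buildChunkFilter` are hypothetical, the use of UTC for the month boundary is an assumption, and the `$gte`/`$lte` range operators assume the index version in use supports range filters.

```typescript
// Filter shape over the three indexed metadata fields (hypothetical type).
interface ChunkFilter {
  lang: string;
  source: string;
  published_at_bucket: { $gte: number; $lte: number };
}

// Convert a date to a YYYYMM number, e.g. 15 August 2025 -> 202508.
// Assumption: buckets are computed in UTC.
function toPublishedAtBucket(publishedAt: Date): number {
  return publishedAt.getUTCFullYear() * 100 + (publishedAt.getUTCMonth() + 1);
}

// Build a filter for one language and source within an inclusive month range.
function buildChunkFilter(
  lang: string,
  source: string,
  from: Date,
  to: Date,
): ChunkFilter {
  return {
    lang,
    source,
    published_at_bucket: {
      $gte: toPublishedAtBucket(from),
      $lte: toPublishedAtBucket(to),
    },
  };
}
```

An object like this could then be passed as the `filter` option of a Vectorize query. Because the bucket is a plain number, a month range becomes a simple numeric comparison.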

2. Vector & Metadata Structure

Each vector stored in Vectorize represents a single chunk of an article.

Vector ID Naming Convention

Vector IDs follow a strict naming convention to ensure uniqueness and traceability: `article:<uuid>:<chunk_no>`

`article`: A static prefix.
`<uuid>`: The unique identifier of the parent article.
`<chunk_no>`: The 0-indexed number of the chunk within the article.
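
The convention above can be captured in a pair of small helpers. This is a minimal sketch: the function names `buildVectorId` and `parseVectorId` are hypothetical, and the validation rules (no colons in the article ID, non-negative integer chunk number) are assumptions that follow from the format rather than rules stated elsewhere in this guide.

```typescript
// Matches article:<uuid>:<chunk_no>, capturing the article ID and chunk number.
const VECTOR_ID_PATTERN = /^article:([^:]+):(\d+)$/;

// Build a vector ID from its parts.
function buildVectorId(articleId: string, chunkNo: number): string {
  if (articleId.length === 0 || articleId.indexOf(":") !== -1) {
    throw new RangeError("articleId must be non-empty and contain no ':'");
  }
  if (!Number.isInteger(chunkNo) || chunkNo < 0) {
    throw new RangeError("chunkNo must be a non-negative integer");
  }
  return `article:${articleId}:${chunkNo}`;
}

// Recover the parts from a vector ID, or null if it does not follow the convention.
function parseVectorId(
  vectorId: string,
): { articleId: string; chunkNo: number } | null {
  const match = VECTOR_ID_PATTERN.exec(vectorId);
  if (match === null) {
    return null;
  }
  return { articleId: match[1], chunkNo: Number(match[2]) };
}
```

Keeping construction and parsing in one place makes it harder for ingestion code and cleanup code to drift apart on the ID format.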

Recommended Metadata Payload

This is the metadata object that should be stored alongside each vector.

{
  "article_id": "<uuid>",
  "chunk_no": 0,
  "lang": "ar",
  "source": "SANA",
  "published_at_bucket": 202508,
  "url": "https://example.com/article/123",
  "title": "Article Title",
  "text": "A short snippet of the article text (2-4 KiB max)."
}
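
One way to keep the payload consistent is to type it and enforce the snippet size before writing. The sketch below is illustrative only: the `ChunkMetadata` interface mirrors the JSON above, while `truncateUtf8`, `withSnippetLimit`, and the choice of 4 KiB (the upper end of the stated range) as the default cap are assumptions.

```typescript
// Mirrors the recommended metadata payload shown above.
interface ChunkMetadata {
  article_id: string;
  chunk_no: number;
  lang: string;
  source: string;
  published_at_bucket: number;
  url: string;
  title: string;
  text: string;
}

// Assumed default cap: the upper end of the 2-4 KiB guidance.
const MAX_SNIPPET_BYTES = 4 * 1024;

// Truncate a string to at most maxBytes of UTF-8 without splitting a character.
function truncateUtf8(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) {
    return text;
  }
  let end = maxBytes;
  // If the first excluded byte is a continuation byte (10xxxxxx), the cut
  // falls inside a multi-byte character, so step back to its lead byte.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  return new TextDecoder().decode(bytes.subarray(0, end));
}

// Return a copy of the metadata with the text snippet capped in size.
function withSnippetLimit(
  metadata: ChunkMetadata,
  maxBytes: number = MAX_SNIPPET_BYTES,
): ChunkMetadata {
  return { ...metadata, text: truncateUtf8(metadata.text, maxBytes) };
}
```

Measuring in bytes rather than characters matters for Arabic text, where most characters take two bytes in UTF-8.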

Chunk Size

Aim for chunks of approximately 600-800 tokens, one vector per chunk, for good retrieval performance.
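
A minimal chunking sketch follows, under one loud assumption: it counts whitespace-separated words as a stand-in for tokens. The real token count depends on the embedding model's tokenizer and will usually be higher than the word count, especially for Arabic, so the hypothetical `maxTokens` default of 700 should be tuned against actual tokenizer output.

```typescript
// Split text into chunks of at most maxTokens whitespace-separated words.
// Assumption: words approximate tokens; a real tokenizer would be more accurate.
function chunkByApproxTokens(text: string, maxTokens: number = 700): string[] {
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new RangeError("maxTokens must be a positive integer");
  }
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += maxTokens) {
    chunks.push(words.slice(start, start + maxTokens).join(" "));
  }
  return chunks;
}
```

The array index of each chunk can serve directly as its 0-indexed `chunk_no`.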

3. Backfill & Data Management

Pilot Backfill

For initial data loading or small backfills, use the `/rag/dev/upsert-batch` development endpoint or a one-off script that calls `VECTORIZE.upsert()` directly.
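
A one-off backfill script might look roughly like the sketch below. It is written against a minimal `VectorIndexLike` interface so it stays self-contained; the assumption is that the real `VECTORIZE` binding can be passed in its place. The names `toBatches` and `backfill` and the default batch size of 500 are hypothetical, and the batch size should be checked against the current per-call limits for the binding.

```typescript
// Minimal shape of a vector record: ID, embedding values, optional metadata.
interface VectorRecord {
  id: string;
  values: number[];
  metadata?: Record<string, string | number | boolean>;
}

// The only part of the index this sketch relies on (assumed to match the binding).
interface VectorIndexLike {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
}

// Split an array into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// Upsert all vectors batch by batch and return how many were submitted.
async function backfill(
  index: VectorIndexLike,
  vectors: VectorRecord[],
  batchSize: number = 500,
): Promise<number> {
  let submitted = 0;
  for (const batch of toBatches(vectors, batchSize)) {
    await index.upsert(batch);
    submitted += batch.length;
  }
  return submitted;
}
```

Because upserts overwrite by ID, rerunning the script after a partial failure should be safe as long as the vector IDs are deterministic.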

Data Updates & Deletes

Database Mapping: Maintain a mapping in the primary PostgreSQL database between `(article_id, chunk_no)` and the corresponding `vector_id`, so that updates and deletes can target the right vectors.
Updates: To update a vector, re-embed the content and call `VECTORIZE.upsert()` with the same vector ID. This overwrites the existing vector.
Deletes: To remove vectors, call `VECTORIZE.deleteByIds(['id1', 'id2', ...])`. For large-scale deletions, consider maintaining a "tombstone" list of IDs and purging them periodically in batches.
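
The tombstone approach could be sketched as follows. This is an illustration under assumptions, not a prescribed design: the `TombstoneList` class is hypothetical, it keeps pending IDs in memory (a real deployment would more likely persist them, for example in the PostgreSQL mapping table), and the batch size of 100 is an arbitrary default. It depends only on a `deleteByIds` method, which is assumed to match the `VECTORIZE` binding.

```typescript
// The only part of the index this sketch relies on.
interface DeletableIndex {
  deleteByIds(ids: string[]): Promise<unknown>;
}

// Collects vector IDs marked for deletion and purges them in batches.
class TombstoneList {
  private readonly pending = new Set<string>();

  // Mark IDs for later deletion; duplicates are ignored.
  mark(ids: string[]): void {
    for (const id of ids) {
      this.pending.add(id);
    }
  }

  // Number of IDs still waiting to be purged.
  get size(): number {
    return this.pending.size;
  }

  // Delete all pending IDs in batches and return how many were submitted.
  // An ID is only dropped from the list after its batch call has resolved.
  async purge(index: DeletableIndex, batchSize: number = 100): Promise<number> {
    const ids = Array.from(this.pending);
    let submitted = 0;
    for (let start = 0; start < ids.length; start += batchSize) {
      const batch = ids.slice(start, start + batchSize);
      await index.deleteByIds(batch);
      for (const id of batch) {
        this.pending.delete(id);
      }
      submitted += batch.length;
    }
    return submitted;
  }
}
```

If a purge fails partway through, the IDs from unfinished batches stay on the list and are retried on the next run.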