
Vector Indexing & Metadata

This guide covers the schema and best practices for populating the Cloudflare Vectorize index.


1. Index Schema

| Property | Value | Description |
| --- | --- | --- |
| Dimensions | 1024 | Required by the bge-m3 embedding model. |
| Metric | cosine | Cosine similarity is recommended for text embeddings. |

Metadata Fields

To enable efficient filtering, the following metadata fields are indexed:

| Field Name | Type | Description |
| --- | --- | --- |
| lang | string | The language of the article chunk (e.g., ar, en). |
| source | string | The key of the source (e.g., SANA, verify-sy). |
| published_at_bucket | number | The publication month, formatted as YYYYMM (e.g., 202508). |
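
As an illustration of how these indexed fields might be used, the sketch below derives the YYYYMM bucket from a publication date and assembles a filter object. The helper names (`toPublishedAtBucket`, `buildChunkFilter`) are hypothetical, the choice of UTC for bucketing is an assumption, and the `$gte`/`$lte` operators assume Vectorize's range-filter syntax; verify them against the current Vectorize documentation before relying on them.

```typescript
// Hypothetical helpers; not part of the existing codebase.
type BucketRange = { $gte?: number; $lte?: number };

interface ChunkFilter {
  lang?: string;
  source?: string;
  published_at_bucket?: BucketRange;
}

// Convert a publication date to a YYYYMM number, e.g. August 2025 -> 202508.
// Assumption: buckets are computed in UTC so they do not depend on server timezone.
function toPublishedAtBucket(publishedAt: Date): number {
  return publishedAt.getUTCFullYear() * 100 + (publishedAt.getUTCMonth() + 1);
}

// Build a filter over the three indexed metadata fields.
function buildChunkFilter(opts: {
  lang?: string;
  source?: string;
  from?: Date;
  to?: Date;
}): ChunkFilter {
  const filter: ChunkFilter = {};
  if (opts.lang !== undefined) filter.lang = opts.lang;
  if (opts.source !== undefined) filter.source = opts.source;
  if (opts.from !== undefined || opts.to !== undefined) {
    const range: BucketRange = {};
    if (opts.from !== undefined) range.$gte = toPublishedAtBucket(opts.from);
    if (opts.to !== undefined) range.$lte = toPublishedAtBucket(opts.to);
    filter.published_at_bucket = range;
  }
  return filter;
}
```

The resulting object would typically be passed as the `filter` option of a Vectorize query call, alongside `topK`.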

2. Vector & Metadata Structure

Each document stored in Vectorize represents a single chunk of an article.

Vector ID Naming Convention

Vector IDs follow a strict naming convention to ensure uniqueness and traceability: article:<uuid>:<chunk_no>

  • article: A static prefix.
  • <uuid>: The unique identifier of the parent article.
  • <chunk_no>: The 0-indexed number of the chunk within the article.
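
The convention above can be sketched as a pair of helpers that build and parse IDs. The function names are hypothetical, and the parser assumes the article identifier itself contains no colon (true for standard UUIDs).

```typescript
// Hypothetical helpers illustrating the article:<uuid>:<chunk_no> convention.
function buildVectorId(articleId: string, chunkNo: number): string {
  if (!Number.isInteger(chunkNo) || chunkNo < 0) {
    throw new RangeError("chunk_no must be a non-negative integer");
  }
  return `article:${articleId}:${chunkNo}`;
}

// Returns null when the ID does not follow the convention.
function parseVectorId(
  vectorId: string
): { articleId: string; chunkNo: number } | null {
  const match = /^article:([^:]+):(\d+)$/.exec(vectorId);
  if (match === null) return null;
  return { articleId: match[1], chunkNo: Number(match[2]) };
}
```

Parsing the ID back into its parts is useful when a query result needs to be traced to its parent article without an extra lookup.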

Store the following metadata object alongside each vector. Of these fields, only lang, source, and published_at_bucket are indexed for filtering.

{
  "article_id": "<uuid>",
  "chunk_no": 0,
  "lang": "ar",
  "source": "SANA",
  "published_at_bucket": 202508,
  "url": "https://example.com/article/123",
  "title": "Article Title",
  "text": "A short snippet of the article text (2-4 KiB max)."
}
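
One way to keep the text snippet within the stated size is to truncate it by UTF-8 byte length before storing it. The sketch below is a minimal illustration: the `ChunkMetadata` interface mirrors the object above, while `truncateUtf8`, `buildChunkMetadata`, and the 4 KiB default are assumptions chosen for the example.

```typescript
// Mirrors the metadata object shown above.
interface ChunkMetadata {
  article_id: string;
  chunk_no: number;
  lang: string;
  source: string;
  published_at_bucket: number;
  url: string;
  title: string;
  text: string;
}

// Assumption: use the upper end of the 2-4 KiB range as the cap.
const MAX_SNIPPET_BYTES = 4 * 1024;

// Truncate to at most maxBytes of UTF-8 without splitting a character.
function truncateUtf8(text: string, maxBytes: number): string {
  let bytes = 0;
  let end = 0;
  while (end < text.length) {
    const code = text.charCodeAt(end);
    const next = end + 1 < text.length ? text.charCodeAt(end + 1) : 0;
    // A high surrogate followed by a low surrogate is one 4-byte character.
    const isPair =
      code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    const size = isPair ? 4 : code <= 0x7f ? 1 : code <= 0x7ff ? 2 : 3;
    if (bytes + size > maxBytes) break;
    bytes += size;
    end += isPair ? 2 : 1;
  }
  return text.slice(0, end);
}

// Return a copy of the metadata with the snippet capped in size.
function buildChunkMetadata(
  fields: ChunkMetadata,
  maxSnippetBytes: number = MAX_SNIPPET_BYTES
): ChunkMetadata {
  return { ...fields, text: truncateUtf8(fields.text, maxSnippetBytes) };
}
```

Counting bytes rather than characters matters here because Arabic text takes two bytes per letter in UTF-8, so a character-based cap would overshoot.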

Chunk Size

Aim for a chunk size of approximately 600-800 tokens per vector for optimal retrieval performance.
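
A rough way to hit this target without running the model's tokenizer is to split on whitespace and estimate tokens from the word count. The sketch below is only an approximation: the function name and the `tokensPerWord` ratio of 1.3 are assumptions, not measured values, and the ratio will differ between languages such as Arabic and English. A production chunker would count tokens with the embedding model's own tokenizer.

```typescript
// Hypothetical chunker: splits text into pieces of roughly maxTokens tokens,
// estimating tokens as words * tokensPerWord (an unverified heuristic).
function chunkByApproxTokens(
  text: string,
  maxTokens: number = 800,
  tokensPerWord: number = 1.3
): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  // Always keep at least one word per chunk so the loop makes progress.
  const wordsPerChunk = Math.max(1, Math.floor(maxTokens / tokensPerWord));
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push(words.slice(i, i + wordsPerChunk).join(" "));
  }
  return chunks;
}
```

The index of each element in the returned array can serve directly as the 0-indexed chunk number.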


3. Backfill & Data Management

Pilot Backfill

For initial data loading or small backfills, use the /rag/dev/upsert-batch development endpoint or a one-off script that calls VECTORIZE.upsert directly.
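
A one-off backfill script along these lines might send vectors in fixed-size batches. In this sketch, `UpsertTarget` is a hypothetical minimal interface standing in for the Vectorize binding's `upsert` method, and the default batch size of 500 is an arbitrary assumption; check the current Vectorize limits before choosing a value.

```typescript
// Minimal shapes for the example; the real binding types come from Cloudflare.
interface VectorRecord {
  id: string;
  values: number[];
  metadata?: Record<string, string | number | boolean>;
}

// Hypothetical stand-in for the part of the VECTORIZE binding used here.
interface UpsertTarget {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
}

// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Upsert all vectors sequentially, batch by batch; returns the number sent.
async function backfillVectors(
  index: UpsertTarget,
  vectors: VectorRecord[],
  batchSize: number = 500
): Promise<number> {
  let sent = 0;
  for (const batch of toBatches(vectors, batchSize)) {
    await index.upsert(batch);
    sent += batch.length;
  }
  return sent;
}
```

Sending batches sequentially keeps the script simple and makes it easy to resume from a known offset if a batch fails.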

Data Updates & Deletes

  • Database Mapping: Maintain a mapping in the primary PostgreSQL database between (article_id, chunk_no) and the corresponding vector_id, so that the vectors belonging to an article can be located for later updates and deletes.
  • Updates: To update a vector, re-embed the content and call VECTORIZE.upsert() with the same vector ID. This will overwrite the existing vector.
  • Deletes: To remove vectors, use VECTORIZE.deleteByIds(['id1', 'id2', ...]). For large-scale deletions, consider maintaining a "tombstone" list and purging periodically.
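
The tombstone approach for large-scale deletions could look roughly like the sketch below. `DeleteTarget` is a hypothetical minimal interface covering only the binding's `deleteByIds` method, the function name is illustrative, and the batch size of 500 is an assumption. How the tombstone list is stored (for example, a PostgreSQL table) is left to the caller.

```typescript
// Hypothetical stand-in for the part of the VECTORIZE binding used here.
interface DeleteTarget {
  deleteByIds(ids: string[]): Promise<unknown>;
}

// Delete all tombstoned vector IDs in batches.
// Returns the de-duplicated IDs that were submitted for deletion, so the
// caller can clear them from the tombstone list afterwards.
async function purgeTombstones(
  index: DeleteTarget,
  tombstonedIds: string[],
  batchSize: number = 500
): Promise<string[]> {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  // Remove duplicates while preserving first-seen order.
  const uniqueIds = Array.from(new Set(tombstonedIds));
  const purged: string[] = [];
  for (let i = 0; i < uniqueIds.length; i += batchSize) {
    const batch = uniqueIds.slice(i, i + batchSize);
    await index.deleteByIds(batch);
    purged.push(...batch);
  }
  return purged;
}
```

Running this from a scheduled job keeps individual delete calls small while still removing stale vectors in bulk.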