TopK provides a data frame-like syntax for querying documents. It features built-in semantic search, text search, vector search, and metadata filtering, as well as reranking capabilities.

With TopK’s declarative query builder, you can easily select fields, chain filters, and apply vector/text search in a composable manner.

Query structure

In TopK, a query consists of multiple stages:

  • Select stage - Select static or computed fields that will be returned in the query results
    • these fields can be used in stages such as Filter, TopK or Rerank
  • Filter stage - Filter the documents that will be returned in the query results
    • filters can be applied to static fields, computed fields such as vector_distance() or semantic_similarity() or custom properties computed inside select()
  • TopK stage - Return the top k results based on the provided logical expression
  • Count stage - Return the total number of documents matching the query
  • Rerank stage - Rerank the results

All queries must have either a TopK or a Count collection stage.

You can stack multiple select, filter and rerank stages in a single query.
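
To make the order of the stages concrete, here is a minimal plain-Python model of a select, filter, and TopK pipeline over a list of dictionaries. It is only a sketch for intuition and does not use the TopK SDK: `run_query`, its parameters, the sample `books` data, and the descending default order are all assumptions made for illustration.

```python
# Plain-Python model of the query stages; NOT the TopK SDK.
# `run_query` and the sample data below are hypothetical.
books = [
    {"_id": "1", "title": "The Catcher in the Rye", "published_year": 1951},
    {"_id": "2", "title": "To Kill a Mockingbird", "published_year": 1960},
    {"_id": "3", "title": "Harry Potter and the Philosopher's Stone", "published_year": 1997},
]


def run_query(docs, select, where, sort_key, k, asc=False):
    # Select stage: compute every projected field for every document.
    projected = [
        {"_id": d["_id"], **{name: expr(d) for name, expr in select.items()}}
        for d in docs
    ]
    # Filter stage: keep rows whose predicate holds; it may use computed fields.
    kept = [row for row in projected if where(row)]
    # TopK stage: order by the sort key and keep the first k rows.
    return sorted(kept, key=lambda row: row[sort_key], reverse=not asc)[:k]


results = run_query(
    books,
    select={
        "title": lambda d: d["title"],                        # static field
        "year_plus_ten": lambda d: d["published_year"] + 10,  # computed field
    },
    where=lambda row: row["year_plus_ten"] > 1965,
    sort_key="year_plus_ten",
    k=2,
)
```

The filter here refers to `year_plus_ten`, a field that exists only because the select stage computed it, which mirrors how computed fields can feed later stages.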

A typical query in TopK looks as follows:

Select

The select() function is used to initialize the select stage of a query. It accepts a key-value pair of field names and field expressions:

from topk_sdk.query import select, field

client.collection("books").query(
  select(
    "published_year",  # select static fields directly
    title=field("title"),
  )
  ...
)

Select expressions

Use a field() function to select fields from a document. In the select stage, you can also rename existing fields or define computed fields using function expressions.

from topk_sdk.query import select, field

docs = client.collection("books").query(
    select(
        "title", # the actual "title" field from the document
        year=field("published_year"), # renamed field
        year_plus_ten=field("published_year") + 10, # computed field
    )
)

Function expressions

Function expressions are used to define computed fields that will be included in your query results. TopK currently supports three main function expressions:

  • bm25_score(): Calculates relevance scores using the BM25 algorithm for keyword search
  • vector_distance(field, vector): Computes distance between vectors for vector search
  • semantic_similarity(field, query): Measures semantic similarity between the provided text query and the field’s embedding

BM25 Score

The BM25 score is a relevance score that can be used to score documents based on their text content.

To use the fn.bm25_score() in your query, you must include a match predicate in your filter stage.

To use the fn.bm25_score() function, you must have a keyword index defined in your collection schema.

from topk_sdk.query import select, field, fn, match

docs = client.collection("books").query(
    select(
        "title",
        text_score=fn.bm25_score()
    )
    .filter(match("Good")) # must include a match predicate
    .topk(field("text_score"), 10)
)

# Example result:
[
  {
    "_id": "1",
    "title": "Good Night, Bat! Good Morning, Squirrel!",
    "text_score": 0.2447269707918167
  },
  {
    "_id": "2",
    "title": "Good Girl, Bad Blood",
    "text_score": 0.20035339891910553
  }
]
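
For intuition about what a BM25 score measures, the sketch below implements the textbook Okapi BM25 formula in plain Python. It is not TopK's implementation: the tokenizer, the `k1` and `b` defaults, and the helper names are assumptions, so the numbers it produces will not match the example result above.

```python
import math
import re


def tokenize(text):
    # Assumption: lowercase and split on any non-alphanumeric character.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized if term in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = tokens.count(term)
            # Length normalization penalizes long documents.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


titles = [
    "Good Night, Bat! Good Morning, Squirrel!",
    "Good Girl, Bad Blood",
    "The Catcher in the Rye",
]
scores = bm25_scores("Good", titles)
```

In this toy corpus the first title ranks highest because it contains the term twice, and the third title scores zero because it does not contain the term at all.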

Vector distance

The vector_distance() function is used to compute the distance between a query vector and a vector field in a collection.

To use the vector_distance() function, you must have a vector index defined on the field you’re computing the vector distance against:

from topk_sdk.query import select, field, fn

docs = client.collection("books").query(
    select(
        "title",
        title_similarity=fn.vector_distance("title_embedding", [0.1, 0.2, 0.3, ...])
         # embedding for "animal"
    )
    .topk(field("title_similarity"), 10)
)

# Example result:
[
  {
    "_id": "2",
    "title": "To Kill a Mockingbird",
    "title_similarity": 0.7484796643257141
  },
  {
    "_id": "1",
    "title": "The Catcher in the Rye",
    "title_similarity": 0.5471329569816589
  }
]
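
As a rough illustration of what a vector comparison computes, the sketch below implements cosine similarity in plain Python. Which metric TopK actually applies is not stated here and presumably depends on how the vector index is configured, so treat cosine similarity and the helper name `cosine_similarity` as assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
same_direction = cosine_similarity(query, [2.0, 0.0, 0.0])  # parallel vectors
orthogonal = cosine_similarity(query, [0.0, 3.0, 0.0])      # unrelated vectors
```

Parallel vectors score 1 regardless of their length, while orthogonal vectors score 0, which is why the query vector must come from the same embedding model as the indexed field.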

Semantic similarity

The semantic_similarity() function is used to compute the similarity between a text query and a text field in a collection.

To use the semantic_similarity() function, you must have a semantic index defined on the field you’re computing the similarity on.

from topk_sdk.query import select, field, fn

docs = client.collection("books").query(
    select(
        "title",
        title_similarity=fn.semantic_similarity("title", "animal")
    )
    .topk(field("title_similarity"), 10)
)

# Example result:
[
  {
    "_id": "2",
    "title": "To Kill a Mockingbird",
    "title_similarity": 0.7484796643257141
  },
  {
    "_id": "1",
    "title": "The Catcher in the Rye",
    "title_similarity": 0.5471329569816589
  }
]

Advanced select expressions

TopK lets you select more than static fields and computed fields built from function expressions. You can also use TopK’s powerful expression language to define fields by chaining arbitrary logical expressions:

select(
  weight_in_grams=field("weight").mul(1000),
  is_adult=field("age").gt(18),
  published_in_nineteenth_century=(field("published_year") >= 1800) & (field("published_year") < 1900),
)

Filtering

You can filter documents by metadata, keywords, custom properties computed inside select() (e.g. vector similarity or BM25 score) and more. Filter expressions support all comparison operators: ==, !=, >, >=, <, <=, arithmetic operations: +, -, *, /, and boolean operators: | and &.
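
The sketch below shows, in plain Python, how an expression builder of this kind can be implemented with operator overloading, and why the boolean operators are `&` and `|`. It is a toy model, not the TopK SDK: `Expr` and `toy_field` are hypothetical names, and the expressions are evaluated locally against a dictionary.

```python
import operator


class Expr:
    """Hypothetical stand-in for a query expression; not the TopK SDK class."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # callable: document dict -> value

    def _binary(self, other, op):
        # Wrap plain Python values so both sides can be evaluated per document.
        rhs = other if isinstance(other, Expr) else Expr(lambda doc: other)
        return Expr(lambda doc: op(self.evaluate(doc), rhs.evaluate(doc)))

    def __ge__(self, other):
        return self._binary(other, operator.ge)

    def __lt__(self, other):
        return self._binary(other, operator.lt)

    def __and__(self, other):
        return self._binary(other, lambda a, b: bool(a and b))

    def __or__(self, other):
        return self._binary(other, lambda a, b: bool(a or b))


def toy_field(name):
    # Build an expression that reads one field from a document.
    return Expr(lambda doc: doc[name])


year = toy_field("published_year")
nineteenth_century = (year >= 1800) & (year < 1900)
```

Python's `and` and `or` keywords cannot be overloaded, so builders like this rely on `&` and `|`. Those operators bind more tightly than comparisons, which is why each comparison needs its own parentheses.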

Metadata filtering

.filter(
    field("published_year") > 1980
)

Keyword search

The match() function is the backbone of keyword search in TopK. It allows you to search for documents that contain specific keywords or phrases.

You can configure the match() function to:

  • Match on multiple terms
  • Match only on specific fields
  • Use weights to prioritize certain terms

The match() function accepts the following parameters:

token
string
required

String token to match. It can also contain multiple terms separated by a delimiter, which is any non-alphanumeric character.

options.field
string

Field to match on. If not provided, the function will match on all fields.

options.weight
number

Weight to use for matching. If not provided, the function will use the default weight (1.0).

options.all
boolean

Use the all parameter when a text must contain all terms (separated by a delimiter).

  • when all is false (default), it is equivalent to the OR operator
  • when all is true, it is equivalent to the AND operator

Searching for a term like "catcher" in your documents is as simple as using the match() function in the filter stage of your query:

from topk_sdk.query import match

.filter(
    match("catcher")
)

Match multiple terms

The match() function can be configured to match all terms when using a delimiter.

A term delimiter is any non-alphanumeric character.

To ensure that all terms are matched, use the all parameter:

from topk_sdk.query import match

.filter(match("catcher|rye", field="title", all=True))
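
The delimiter rule and the effect of the all parameter can be modeled in a few lines of plain Python. This is only a sketch of the semantics described above, not TopK's tokenizer: the lowercasing step and the helper names `split_terms` and `matches` are assumptions.

```python
import re


def split_terms(text):
    # Any non-alphanumeric character acts as a term delimiter.
    # Assumption: matching is case-insensitive, so everything is lowercased.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def matches(token, text, require_all=False):
    """Return True if `text` contains any (or all) of the terms in `token`."""
    wanted = split_terms(token)
    if not wanted:
        return False
    present = set(split_terms(text))
    check = all if require_all else any  # AND semantics vs. OR semantics
    return check(term in present for term in wanted)


both = matches("catcher|rye", "The Catcher in the Rye", require_all=True)
only_one_strict = matches("catcher|rye", "Catcher's Mitt", require_all=True)
only_one_loose = matches("catcher|rye", "Catcher's Mitt", require_all=False)
```

Because `|` is just another non-alphanumeric character, "catcher|rye", "catcher rye", and "catcher,rye" all split into the same two terms in this model.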

Give weight to specific terms

You can give weight to specific terms by using the weight parameter:

from topk_sdk.query import match

.filter(match("catcher", weight=2.0) | match("rye", weight=1.0))

Combine keyword search and metadata filtering

You can combine metadata filtering and keyword search in a single query by stacking multiple filter stages.

In the example below, we’re searching for documents that contain the keyword "catcher" and were either published in 1997 or published between 1920 and 1980.

.filter(
    match("catcher")
)
.filter(
    (field("published_year") == 1997) | ((field("published_year") >= 1920) & (field("published_year") <= 1980))
)

Operators

When writing queries, you can use the following operators for field selection or filtering:

Logical operators

and

The and operator can be used to combine multiple logical expressions.

.filter(
    (field("published_year") == 1997) & (field("title") == "The Catcher in the Rye")
)

# or

.filter(
    field("published_year").eq(1997).and_(field("title").eq("The Catcher in the Rye"))
)

or

The or operator can be used to combine multiple logical expressions.

.filter(
    (field("published_year") == 1997) | (field("title") == "The Catcher in the Rye")
)

# or

.filter(
    field("published_year").eq(1997).or_(field("title").eq("The Catcher in the Rye"))
)

Comparison operators

eq

The eq operator can be used to match documents that have a field with a specific value.

.filter(
    field("published_year") == 1997
)

# or

.filter(
    field("published_year").eq(1997)
)

ne

The ne operator can be used to match documents that have a field with a value that is not equal to a specific value.

.filter(
    field("published_year") != 1997
)

# or

.filter(
    field("published_year").ne(1997)
)

gt

The gt operator can be used to match documents that have a field with a value greater than a specific value.

.filter(
    field("published_year") > 1997
)

# or

.filter(
    field("published_year").gt(1997)
)

gte

The gte operator can be used to match documents that have a field with a value greater than or equal to a specific value.

.filter(
    field("published_year") >= 1997
)

# or

.filter(
    field("published_year").gte(1997)
)

lt

The lt operator can be used to match documents that have a field with a value less than a specific value.

.filter(
    field("published_year") < 1997
)

# or

.filter(
    field("published_year").lt(1997)
)

lte

The lte operator can be used to match documents that have a field with a value less than or equal to a specific value.

.filter(
    field("published_year") <= 1997
)

# or

.filter(
    field("published_year").lte(1997)
)

starts_with

The starts_with operator can be used on string fields to match documents that start with a given prefix. This is especially useful in multi-tenant applications where document IDs can be structured as {tenant_id}/{document_id} and starts_with can then be used to scope the query to a specific tenant.

.filter(
    field("_id").starts_with("tenant_123/")
)
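
The tenant-scoping pattern depends on how the IDs are built, so here is a small plain-Python sketch of it. The helpers `document_id` and `tenant_prefix` are hypothetical, and the prefix check uses Python's `str.startswith` as a local stand-in for the starts_with operator.

```python
def document_id(tenant_id, local_id):
    # A "/" inside the tenant ID would make prefixes ambiguous.
    if "/" in tenant_id:
        raise ValueError("tenant_id must not contain '/'")
    return f"{tenant_id}/{local_id}"


def tenant_prefix(tenant_id):
    # Keep the trailing "/" so that "tenant_123" does not match "tenant_1234".
    return f"{tenant_id}/"


ids = [
    document_id("tenant_123", "a"),
    document_id("tenant_1234", "b"),
    document_id("tenant_123", "c"),
]
scoped = [i for i in ids if i.startswith(tenant_prefix("tenant_123"))]
unsafe = [i for i in ids if i.startswith("tenant_123")]  # no trailing "/"
```

Without the trailing slash the prefix also matches documents from `tenant_1234`, which is why the filter above passes "tenant_123/" rather than "tenant_123".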

contains

The contains operator can be used on string fields to match documents that include a specific substring. It is case-sensitive and is particularly useful in scenarios where you need to filter results based on a portion of a string.

.filter(
    field("title").contains("Catcher")
)

Arithmetic operators

add

The add operator can be used to add two numbers.

.filter(
    field("published_year") + 1997
)

# or

.filter(
    field("published_year").add(1997)
)

sub

The sub operator can be used to subtract two numbers.

.filter(
    field("published_year") - 1997
)

# or

.filter(
    field("published_year").sub(1997)
)

mul

The mul operator can be used to multiply two numbers.

.filter(
    field("published_year") * 1997
)

# or

.filter(
    field("published_year").mul(1997)
)

div

The div operator can be used to divide two numbers.

.filter(
    field("published_year") / 1997
)

# or

.filter(
    field("published_year").div(1997)
)

Unary operators

not

The not helper can be used to negate a logical expression. It takes an expression as an argument and inverts its logic.

from topk_sdk.query import field, not_

.filter(
    not_(field("title").contains("Catcher"))
)

is_null

The is_null operator can be used to match documents that have a field with a value that is null.

.filter(
    field("title").is_null()
)

is_not_null

The is_not_null operator can be used to match documents that have a field with a value that is not null.

.filter(
    field("title").is_not_null()
)

Collection

All queries must have a collection stage. Currently, we only support topk() and count() collectors.

topk

Use the topk() function to return the top k results. The topk() function accepts the following parameters:

field
LogicalExpression
required

The logical expression to sort the results by.

k
number
required

The number of results to return.

asc
boolean
required

Whether to sort the results in ascending order.

To get the top 10 results ordered by the title_similarity field, you can use the following query:

# Return the top 10 results ordered by `title_similarity` ascending
.topk(field("title_similarity"), 10, asc=True)
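
To illustrate how the asc flag changes which documents are kept, the sketch below selects the k best rows from an in-memory list with the standard library's `heapq` module. It models the semantics only: `top_k` and the sample rows are hypothetical, and nothing here reflects how TopK executes the stage internally.

```python
import heapq


def top_k(rows, key, k, asc):
    """Keep the k rows with the smallest (asc) or largest (not asc) `key`."""
    pick = heapq.nsmallest if asc else heapq.nlargest
    return pick(k, rows, key=lambda row: row[key])


rows = [
    {"_id": "1", "title_similarity": 0.55},
    {"_id": "2", "title_similarity": 0.75},
    {"_id": "3", "title_similarity": 0.10},
]
highest = top_k(rows, "title_similarity", 2, asc=False)
lowest = top_k(rows, "title_similarity", 2, asc=True)
```

The two calls keep different documents, not the same documents in a different order, so the sort direction should match the meaning of the score: ascending suits distance-like values where smaller is better, descending suits similarity-like values.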

count

Use the count() function to get the total number of documents matching the query. If there are no filters then count() will return the total number of documents in the collection.

# Count the total number of documents in the collection
.count()

When writing queries, remember that they all require the topk or count function at the end.

Rerank

The rerank() function is used to rerank the results of a query. Read more about it in our reranking guide.

.rerank()