Summarize
The summarize endpoint is used to run calculations on your data. This guide runs through the key concepts, parameters, and how you should use the summarize endpoint.
Summarizing or performing calculations on your data can reveal information to you or the users of your product. You could request the total number of sales in January, the profit on a basketball, or to find the store in your city with the best sales numbers.
After reading this document, you'll have a far better understanding of what the summarize endpoint can do, and where it fits in to the Xata ecosystem.
The endpoint has two concepts that are important to understand: columns and summaries. If you're already familiar with aggregations and GROUP BY
in SQL, you can skip this section.
Consider the following table. Whenever a sale occurs in an e-commerce store, a new row is added to this table, denoting the item sold and its sale price.
product | sale_price |
---|---|
basketball | 10.0 |
basketball | 15.0 |
basketball | 25.0 |
shirt | 40.0 |
shirt | 80.0 |
In the context of the given table, a column represents an attribute that you want to perform a calculation on. Assigning product
as a column yields one result row for each unique value within the product
column. This implies you retrieve two rows: one representing basketball
and the other, shirt
.
A summary has two parts. These consist of a function (the summary) and a parameter (the column name). The function can be called sum
, or average
. The parameter is the column name on which you would like to execute the function. A question I can ask is to {"sum": "sale_price"}
to retrieve the total sale price.
Referring to the table above, if you would like to understand how the store has been going, you might ask some tough questions that can be solved with the summarize endpoint.
Here, the column is product
. The summary
- in it's two parts - will be average
as the function, and sale_price
as the column. In plain language, you might ask "give me the average sale price for each product". The request would look like:
const records = await xata.db.sales.summarize({
filter: { product: 'basketball' },
columns: ['product'],
summaries: {
average_sales_price: { average: 'sale_price' }
}
});
records = xata.data().summarize("sales", {
"filter": { "product": "basketball" },
"columns": ["product"],
"summaries": {
"average_sale_price": { "average": "sale_price" }
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT product,AVG(sale_price) FROM \"sales\" WHERE product = $1 GROUP BY product;",
"params": ["basketball"]
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"filter": { "product": "basketball" },
"columns": ["product"],
"summaries": {
"average_sale_price": { "average": "sale_price" }
}
}
Since we specified product
in columns
, we get one row for each unique product in our table. In the request above, we have a key average_sale_price
. This is the name of your summary; you will see this appear in the result.
{
"summaries": [{ "product": "basketball", "average_sale_price": 16.6 }]
}
Let's try one more example to help clarify the concepts.
Here, as above, the column is product
. We will also want to run a calculation on sale_price
, however, we want to add up each sale price rather than average them, so our summary
equation will be sum
. Our request looks like:
const records = await xata.db.sales.summarize({
columns: ['product'],
summaries: {
total_sales: { sum: 'sale_price' }
}
});
records = xata.data().summarize("sales", {
"columns": ["product"],
"summaries": {
"total_sales": { "sum": "sale_price" }
}
})
searchClient, _ := xata.NewSearchAndFilterClient()
records, _ := searchClient.Summarize(context.TODO(), xata.SummarizeTableRequest{
TableName: "sales",
Payload: xata.SummarizeTableRequestPayload{
Columns: []string{"product"},
Summaries: map[string]map[string]any{
"total_sales": {"sum": "sale_price"},
},
},
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT product,SUM(sale_price) FROM \"sales\" GROUP BY product;"
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"columns": ["product"],
"summaries": {
"total_sales": { "sum": "sale_price" }
}
}
And response:
{
"summaries": [
{ "product": "basketball", "total_sales": 50.0 },
{ "product": "shirt", "total_sales": 120.0 }
]
}
Here, we've chosen to call the summarized column total_sales
and added up all of the sale_price
of each product
.
Before we get started, keep in mind that we keep reference material that can be found here.
In its most basic form summarize
can be used as a "distinct" method to get the unique values or value combinations of the summarized columns.
const records = await xata.db.sales.summarize({
columns: ['username'],
summaries: {}
});
records = xata.data().summarize("sales", {
"columns": ["username"],
"summaries": {}
})
searchClient, _ := xata.NewSearchAndFilterClient()
records, _ := searchClient.Summarize(context.TODO(), xata.SummarizeTableRequest{
TableName: "sales",
Payload: xata.SummarizeTableRequestPayload{
Columns: []string{"username"},
},
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT username,COUNT(username) FROM \"sales\" GROUP BY username;"
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"columns": ["username"],
"summaries": {}
}
In the above example, we get the unique values of the column "username".
The summarize
API is limited to one page of results. Refer to the page section below for more details.
Let's take a look at the different parameters.
In the summaries key, you set which columns you'd like summarized, and how you'd like them summarized. Xata supports count, sum, min, max, and average.
All summary functions take a column as an argument. The functions sum, min, max, and average can operate on integer
and float
columns. They will return an integer or float back; this is determined by which function you use, and the column type it's operating on.
The count function is slightly different. Asking for a summary like: {"count_names": {"count": "name"}}
will return the total number of names that are not null. If you'd like to know the total - including values that are null - you can use the wildcard operator like so: {"count_all_names": :{"count": "*"}}
.
All summaries support links. If you have a link set up, you specify it using the standard dotted notation; {"average_pet_size": {"average": {"pet.size"}}
One can specify several summaries in the same request. As long as the names are unique, do not contain special characters, and do not conflict with columns on your table, you are free to set the names as you wish.
const records = await xata.db.sales.summarize({
summaries: {
// count all rows
"all_sales": {"count": "*"},
// count all rows where `store_id` is not null
"all_sales_from_store": {"count": "store_id"},
// sum the `profit` column
"total_profit": {"sum": "profit"},
// average the `shipping.time_days` column
"average_shipping_time": {"average": "shipping.time_days"},
// count the total rows where `has_arrived` is not null
"total_has_arrived": {"count": "has_arrived"},
// retrieve the smallest value in the `package_volume` column
"smallest_package": {"min": "package_volume"},
// retrieve the largest value in the `sale_price` column
"largest_sale_price": {"max": "sale_price"}
}
});
records = xata.data().summarize("sales", {
"summaries": {
# count all rows
"all_sales": { "count": "*" },
# count all rows where `store_id` is not null
"all_sales_from_store": { "count": "store_id" },
# sum the `profit` column
"total_profit": { "sum": "profit" },
# average the `shipping.time_days` column
"average_shipping_time": { "average": "shipping.time_days" },
# count the total rows where `has_arrived` is not null
"total_has_arrived": { "count": "has_arrived" },
# retrieve the smallest value in the `package_volume` column
"smallest_package": { "min": "package_volume" },
# retrieve the largest value in the `sale_price` column
"largest_sale_price": { "max": "sale_price" }
}
})
searchClient, _ := xata.NewSearchAndFilterClient()
records, _ := searchClient.Summarize(context.TODO(), xata.SummarizeTableRequest{
Payload: xata.SummarizeTableRequestPayload{
Columns: []string{"product"},
Summaries: map[string]map[string]any{
// count all rows
"all_sales": {"count": "*"},
// count all rows where `store_id` is not null
"all_sales_from_store": {"count": "store_id"},
// sum the `profit` column
"total_profit": {"sum": "profit"},
// average the `shipping.time_days` column
"average_shipping_time": {"average": "shipping.time_days"},
// count the total rows where `has_arrived` is not null
"total_has_arrived": {"count": "has_arrived"},
// retrieve the smallest value in the `package_volume` column
"smallest_package": {"min": "package_volume"},
// retrieve the largest value in the `sale_price` column
"largest_sale_price": {"max": "sale_price"},
},
},
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"summaries": {
// count all rows
"all_sales": { "count": "*" },
// count all rows where `store_id` is not null
"all_sales_from_store": { "count": "store_id" },
// sum the `profit` column
"total_profit": { "sum": "profit" },
// average the `shipping.time_days` column
"average_shipping_time": { "average": "shipping.time_days" },
// count the total rows where `has_arrived` is not null
"total_has_arrived": { "count": "has_arrived" },
// retrieve the smallest value in the `package_volume` column
"smallest_package": { "min": "package_volume" },
// retrieve the largest value in the `sale_price` column
"largest_sale_price": { "max": "sale_price" }
}
}
In the previous example, it was indicated that using "all_sales": {"count": "*"}
yields the number of records from the chosen table. To maintain simplicity, the subsequent example demonstrates how to obtain the total count of records in the sales
table.
const records = await xata.db.sales.summarize({
summaries: {
"total": {"count": "*"}
}
});
console.log(records.summaries[0].total)
records = xata.data().summarize("sales", {
"summaries": {
"total": { "count": "*" }
}
})
print(records["summaries"][0]["count"])
searchClient, _ := xata.NewSearchAndFilterClient()
records, _ := searchClient.Summarize(context.TODO(), xata.SummarizeTableRequest{
TableName: "sales",
Payload: xata.SummarizeTableRequestPayload{
Columns: []string{"product"},
Summaries: map[string]map[string]any{
"total": {"count": "*"},
},
},
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT COUNT(*) as total FROM \"sales\""
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"summaries": {
"total": { "count": "*" }
}
}
The columns
key lets you specify how you'd like to group your data. Columns work exactly the same as with the query endpoint.
You may submit explicit column names, use wildcards, and reference linked columns as needed.
const records = await xata.db.sales.summarize({
columns: [
'settings.*', // group by all columns in the `settings` object
'username', // group by the username field
'user.hobbies.name' // group by a linked column
]
});
records = xata.data().summarize("sales", {
"columns": [
"settings.*", # group by all columns in the `settings` object
"username", # group by the username field
"user.hobbies.name" # group by a linked column
]
})
searchClient, _ := xata.NewSearchAndFilterClient()
records, _ := searchClient.Summarize(context.TODO(), xata.SummarizeTableRequest{
TableName: "sales",
Payload: xata.SummarizeTableRequestPayload{
Columns: []string{
"settings.*", // group by all columns in the `settings` object
"username", // group by the username field
"user.hobbies.name", // group by a linked column
},
},
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"columns": [
"settings.*", // group by all columns in the `settings` object
"username", // group by the username field
"user.hobbies.name" // group by a linked column
]
}
If you would like to see results with no groups, you may submit columns
as an empty array.
Filter lets you filter data before grouping and summarizing. Check out our guide here.
Keep in mind: filter
does not support filtering on the result of a summary. In order to filter on summaries, one must use summariesFilter
.
Sort lets you decide how you'd like your results sorted. You may sort on columns
you've explicitly referenced, as well as on summaries
you've requested. When wanting to sort by a summary, you use the name you've chosen.
const records = await xata.db.sales.summarize({
columns: ["store_name"],
summaries: {
"total_sales": {"count": "*"}
},
sort: [
// put the highest total_sales at the top
{"total_sales": "desc"},
// if total_sales is equal for two rows, order them by the store_name
{"store_name": "asc"}
]
});
records = xata.data().summarize("sales", {
"columns": ["store_name"],
"summaries": {
"total_sales": {"count": "*"}
},
"sort": [
# put the highest total_sales at the top
{"total_sales": "desc"},
# if total_sales is equal for two rows, order them by the store_name
{"store_name": "asc"}
]
})
searchClient, _ := xata.NewSearchAndFilterClient()
records, _ := searchClient.Summarize(context.TODO(), xata.SummarizeTableRequest{
TableName: "sales",
Payload: xata.SummarizeTableRequestPayload{
Columns: []string{"store_name"},
Summaries: map[string]map[string]any{
"total": {"count": "*"},
},
Sort: xata.NewSortExpressionFromStringSortOrderMapList(
[]map[string]xata.SortOrder{
// put the highest total_sales at the top
{"total_sales": xata.SortOrderDesc},
// if total_sales is equal for two rows, order them by the store_name
{"store_name": xata.SortOrderAsc},
},
),
},
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT store_name,COUNT(*) AS total_sales FROM \"sales\" GROUP BY store_name ORDER BY total_sales desc, store_name asc;"
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"columns": ["store_name"],
"summaries": {
"total_sales": { "count": "*" }
},
"sort": [
// put the highest total_sales at the top
{ "total_sales": "desc" },
// if total_sales is equal for two rows, order them by the store_name
{ "store_name": "asc" }
]
}
Page lets you determine how many results are returned. We currently only support one parameter: size
. This allows you to set the amount of rows returned back to you.
The default setting for size
, if not specified in the request, is 20 results. We allow you to set this between 1 and 1000.
If this limitation is too small, check if a similar request can be achieved using the aggregate endpoint as an alternative approach.
An example of a page request looks like:
const records = await xata.db.sales.summarize({
pagination: {
size: 50 // return up to 50 results
}
});
records = xata.data().summarize("sales", {
"page": {
"size": 50 # return up to 50 results
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"page": {
"size": 50 // return up to 50 results
}
}
In order to filter on the results of a summary operation, Xata supports summariesFilter
. This has the same syntax and options as the filter
key defined above.
The key feature summariesFilter
adds is the ability to take a summary and filter by it.
Using summariesFilter
is much slower than using filter
. With this in mind, we recommend that you do as much
work as possible using filter
and use summariesFilter
only when you really need it.
An example looks like:
const records = await xata.db.sales.summarize({
columns: ['product_name'],
summaries: {
revenue: { sum: 'price' },
items_sold: { count: '*' }
},
summariesFilter: {
// only return product names where the total revenue is >= 100 AND
// where sold >= 1000 of them
revenue: { $ge: 100.0 },
items_sold: { $ge: 1000 }
}
});
records = xata.data().summarize("sales", {
"columns": ["product_name"],
"summaries": {
"revenue": { "sum": "price" },
"items_sold": { "count": "*" }
},
"summariesFilter": {
# only return product names where the total revenue is >= 100 AND
# where sold >= 1000 of them
"revenue": { "$ge": 100.0 },
"items_sold": { "$ge": 1000 }
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT product_name, SUM(price) AS revenue, COUNT(*) AS items_sold FROM \"sales\" GROUP BY product_name HAVING SUM(price) >= 100 AND COUNT(*) >= 1000;"
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"columns": ["product_name"],
"summaries": {
"revenue": { "sum": "price" },
"items_sold": { "count": "*" }
},
"summariesFilter": {
// only return product names where the total revenue is >= 100 AND
// where sold >= 1000 of them
"revenue": { "$ge": 100.0 },
"items_sold": { "$ge": 1000 }
}
}
By specifying the option consistency: eventual
the summarize request will be serviced by the Replica Store which has a small, typically insignificant, propagation delay compared to the Primary Store as outlined in the Data Model guide.
The default value for the consistency option is strong
, which retrieves data from the Primary Store and guarantees that the response is up to date with the latest state of the record content.
It is recommended to use the Replica Store for summarize requests wherever possible, in order to get the best possible performance out of your current plan.
const records = await xata.db.sales.summarize({
filter: { product: 'basketball' },
columns: ['product'],
summaries: {
average_sales_price: { average: 'sale_price' }
},
consistency: 'eventual'
});
records = xata.data().summarize("sales", {
"filter": {
"product": "basketball"
},
"columns": ["product"],
"summaries": {
"average_sale_price": {
"average": "sale_price"
}
},
"consistency": "eventual"
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
"statement": "SELECT product, AVG(sale_price) AS average_sale_price FROM \"sales\" WHERE product = $1 GROUP BY product;",
"params": ["basketball"],
"consistency": "eventual"
}
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/summarize
{
"filter": { "product": "basketball" },
"columns": ["product"],
"summaries": {
"average_sale_price": { "average": "sale_price" }
},
"consistency": "eventual"
}
If you want to improve the performance of your calls to summarize, here's a few things to look for, and a few to avoid.
Filter as much as possible
The more data you filter out using filter
, the faster your query will be. Use it to filter the exact values you wish to summarize.
Reduce cardinality from the columns
call
If you request a summary for a column with many unique values, the request will take longer. If you're finding your request slow, check the columns you've specified. Can you remove any of them? Filter out any values you're not interested in?
Generally, having as few unique values as possible will speed up your request.
Avoid links
Links always add time to the request. If you do not need to reference a link in the columns
field nor in the summaries
field, you query will be quicker.
Use filters
instead of summariesFilter
if possible
Our advice is always to use filters
where you can. This will give you a much quicker response. Use summariesFilter
only to filter on summary results, or in complex cases where you need to combine a filter between a column and a summary result.
Use consistency: "eventual"
if possible
Eventual consistency utilizes the dedicated concurrency connections of the Read Replica store, leaving more concurrency slots available for operations that can only operate on the Primary store, such as data inserts and updates. Which makes it less likely to reach the concurrency limit of either Store type.
Use the aggregate endpoint
The aggregate endpoint will almost always perform better. It's quicker to handle larger tables, with more cardinality, and does not have a hard limit on the amount of results that one can request in a single request. See the below section "Summarize or Aggregate" to understand why.
Summarize and aggregate both return similar results. They differ in terms of the underlying store the data is served from. When using summarize, one retrieves data from PostgreSQL. This means that you will retrieve consistent results.
However, PostgreSQL storage is not optimized to work on these types of queries. Running these workloads will tend to be slower and consume for of your account's concurrency limits. For larger tables and for cases where eventual consistency is acceptable, we recommend using the aggregate endpoint. The aggregate endpoint services the same kinds of requests, but from our column-store rather than from Postgres. This results in far faster responses to your requests.
Note: there are some minor differences to how the two endpoints function. Please see the documentation for aggregate for more information.