Almost complete feature but still under development. There may be changes as we address issues and feedback. Use with caution.
Aggregations
The /aggregate
API allows you to use the search/analytics engine to perform aggregations on your data. Similar to the search API aggregations run in the optional Search store, which means: it is eventually consistent with the main transactional store,
and it cannot access the linked fields from a table. The Search store is enabled by default and can be disabled in the Database Settings from the Web UI. The aggregate API and features described in this page require the Search store to be enabled. If these limitations are not acceptable for your use case, you should use the Summarize API. The advantages of using the Aggregation API, over the Summarize API, are:
- it generally offers better performance, because the underlying store is column oriented.
- it offers composable aggregations that can be combined into complex aggregations / visualizations.
Operations which are available both in the Aggregation and the Summarize API, such as sum
, may present small deviations due to differences in the order of reading data from storage and subsequent rounding, as well as in case of in-flight data until consistency is achieved across the different stores.
An example of a relatively complex visualization that can be created with the Aggregation API could be: a multi-line chart, where each line represents a movie genre, and the Y axis represents the average rating of the movies in that genre, per year. This chart can be obtained with a single aggregation request, looking something like this:
const results = await xata.db.titles.aggregate({
movieGenres: {
topValues: {
column: 'genre',
size: 50,
aggs: {
byReleaseDate: {
dateHistogram: {
column: 'releaseDate',
calendarInterval: 'year',
aggs: {
avgRating: {
average: {
column: 'rating'
}
}
}
}
}
}
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"movieGenres": {
"topValues": {
"column": "genre",
"size": 50,
"aggs": {
"byReleaseDate": {
"dateHistogram": {
"column": "releaseDate",
"calendarInterval": "year",
"aggs": {
"avgRating": {
"average": {
"column": "rating"
}
}
}
}
}
}
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"movieGenres": {
"topValues": {
"column": "genre",
"size": 50,
"aggs": {
"byReleaseDate": {
"dateHistogram": {
"column": "releaseDate",
"calendarInterval": "year",
"aggs": {
"avgRating": {
"average": {
"column": "rating"
}
}
}
}
}
}
}
}
}
}
The code above combines three aggregations: topValues
, dateHistogram
and average
. The topValues
aggregations splits the data into buckets based on the genre
column, and takes the top 50 genres by the number of movies in each. The dateHistogram
aggregation splits the data into buckets based on the releaseDate
column, and groups the data by calendaristic year. The average
aggregation is a metric aggregation that is executed on the resulting buckets.
The /aggregate
API can be combined with top-level filters using the syntax of the filtering API.
const results = await xata.db.titles.aggregate({
movieGenres: {
topValues: {
column: 'genre',
size: 50,
aggs: {
byReleaseDate: {
dateHistogram: {
column: 'releaseDate',
calendarInterval: 'year',
aggs: {
avgRating: {
average: {
column: 'rating'
}
}
}
}
}
}
}
}
},
{
"director":"Peter Jackson"
})
results = xata.data().aggregate("titles", {
"aggs": {
"movieGenres": {
"topValues": {
"column": "genre",
"size": 50,
"aggs": {
"byReleaseDate": {
"dateHistogram": {
"column": "releaseDate",
"calendarInterval": "year",
"aggs": {
"avgRating": {
"average": {
"column": "rating"
}
}
}
}
}
}
}
}
},
"filter": {
"director":"Peter Jackson"
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"movieGenres": {
"topValues": {
"column": "genre",
"size": 50,
"aggs": {
"byReleaseDate": {
"dateHistogram": {
"column": "releaseDate",
"calendarInterval": "year",
"aggs": {
"avgRating": {
"average": {
"column": "rating"
}
}
}
}
}
}
}
}
},
"filter": {
"director": "Peter Jackson"
}
}
Additionally, certain aggregation types support internal filters to apply on records before calculating aggregate values. Refer to the aggregation type's documentation sections below for more details.
There are two types of aggregations:
- bucket aggregations, which split the data into buckets based on a column value. Examples of bucket aggregations are:
topValues
,dateHistogram
andnumericHistogram
. - metric aggregations, which compute a value based on the data in the bucket. Examples of metric aggregations are:
average
,sum
,min
,max
,count
anduniqueCount
.
Aggregations cannot be used with link columns.
The response for the example aggregation above has the following form:
{
"aggs": {
"movieGenres": {
"values": [
{
"$key": "Drama",
"$count": 1123,
"byReleaseDate": {
"values": [{
"$key": "1970-01-01T00:00:00.000Z",
"$count": 78,
"avgRating": 7.5
}, {
"$key": "1971-01-01T00:00:00.000Z",
"$count": 53,
"avgRating": 7.6
}
...
]
}
},
...
]
}
}
}
The metric aggregation (avgRating
of type average
) returns a single value, in this case a floating point value.
The bucket aggregations (movieGenres
of type topValues
and byReleaseDate
of type dateHistogram
) return a list of objects in the values
array, with the following fields:
$key
is the key of the bucket, typically representing the start of the interval. In case of adateHistogram
it is a date, in case ofnumericHistogram
it is a number, and in case oftopValues
it is a string.$count
is the number of records in that bucket.- the rest of the keys are the sub-aggregations requested, which can be both metric or bucket aggregations.
topValues
is a bucket-aggregation that splits the data into buckets by the unique values in a column. It is configured with the following parameters:
column
: the name of the column to split the data by. Accepted types arestring
,email
,int
,float
, orbool
.size
: the maximum number of buckets to return. Default is 10, maximum is 1000.aggs
: sub-aggregations to execute on the resulting buckets.
Example:
const records = await xata.db.Teams.aggregate({
topTeams: {
topValues: {
column: 'name'
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"topTeams": {
"topValues": {
"column": "name"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"topTeams": {
"topValues": {
"column": "name"
}
}
}
}
dateHistogram
is a bucket-aggregation that splits the data into buckets by a timestamptz (datetime) column. It is configured with the following parameters:
column
: The column to use for bucketing. Must be of type timestamptz (datetime).interval
: The fixed interval to use when bucketing. It is formatted as number + unit of time, for example:5d
,20m
,10s
.calendarInterval
: The calendar-aware interval to use when bucketing. Possible values are:minute
,hour
,day
,week
,month
,quarter
,year
.timezone
: The timezone to use for bucketing. By default, UTC is assumed. The accepted format is as an ISO 8601 UTC offset. For example:+01:00
or-08:00
.aggs
: sub-aggregations to execute on the resulting buckets.
Either interval
or calendarInterval
must be specified. The calendarInterval
always starts at the boundary of the calendar unit, for example, if the
calendarInterval
is month
, the buckets will start at the beginning of the month.
Example:
const records = await xata.db.titles.aggregate({
byReleaseDate: {
dateHistogram: {
column: "releaseDate",
calendarInterval: "year",
},
},
});
results = xata.data().aggregate("titles", {
"aggs": {
"byReleaseDate": {
"dateHistogram": {
"column": "releaseDate",
"calendarInterval": "year"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"byReleaseDate": {
"dateHistogram": {
"column": "releaseDate",
"calendarInterval": "year"
}
}
}
}
numericHistogram
is a bucket-aggregation that splits the data into buckets by dynamic numeric ranges.. It is configured with the following parameters:
column
: The column to use for bucketing. Must be of numeric type.interval
: The numeric interval to use for bucketing. The resulting buckets will be ranges with this value as size.offset
: By default the bucket keys start with 0 and then continue ininterval
steps. The bucket boundaries can be shifted by using theoffset
option. For example, if theinterval
is 100, but you prefer the bucket boundaries to be[50, 150), [150, 250), etc.
, you can setoffset
to 50.
Example:
const records = await xata.db.titles.aggregate({
viewsHistogram: {
numericHistogram: {
column: 'views',
interval: 100
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"viewsHistogram": {
"numericHistogram": {
"column": "views",
"interval": 100
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"viewsHistogram": {
"numericHistogram": {
"column": "views",
"interval": 100
}
}
}
}
The Date histogram aggregation cannot be used on xata metadata columns xata.createdAt
and xata.updatedAt
.
count
is a metric-aggregation that counts the number of records.
It accepts a filter
setting to filter the records before counting. The filter object accepts the same syntax as the query filter.
Example:
const records = await xata.db.titles.aggregate({
ratingsAboveEight: {
count: {
filter: {
rating: { $gt: 8 }
}
}
},
ratingsBelowEight: {
count: {
filter: {
rating: { $le: 8 }
}
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"ratingsAboveEight": {
"count": {
"filter": {
"rating": { "$gt": 8 }
}
}
},
"ratingsBelowEight": {
"count": {
"filter": {
"rating": { "$le": 8 }
}
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"ratingsAboveEight": {
"count": {
"filter": {
"rating": { "$gt": 8 }
}
}
},
"ratingsBelowEight": {
"count": {
"filter": {
"rating": { "$le": 8 }
}
}
}
}
}
If you don't need to specify any filter you can use the special count: "*"
syntax:
const records = await xata.db.titles.aggregate({
totalCount: {
count: '*'
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"totalCount": {
"count": "*"
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"totalCount": {
"count": "*"
}
}
}
uniqueCount
is a metric-aggregation that counts the number of distinct values in a column. It uses
an approximate algorithm (HyperLogLog++)
in order to reduce the amount of memory required, but it is exact for low values (up to the preicisionThreshold
).
It accepts the following parameters:
column
: The column from where to count the unique values.precisionThreshold
: The threshold under which the unique count is exact. If the number of unique values in the column is higher than this threshold, the results are approximate. Maximum value is 40,000, default value is 3000.
Example:
const records = await xata.db.titles.aggregate({
uniqueGenres: {
uniqueCount: {
column: "genre",
},
},
});
results = xata.data().aggregate("titles", {
"aggs": {
"uniqueGenres": {
"uniqueCount": {
"column": "genre"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"uniqueGenres": {
"uniqueCount": {
"column": "genre"
}
}
}
}
average
is a metric-aggregation that computes the average value of a numeric column.
Example:
const records = await xata.db.titles.aggregate({
avgRating: {
average: {
column: 'rating'
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"avgRating": {
"average": {
"column": "rating"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"avgRating": {
"average": {
"column": "rating"
}
}
}
}
max
is a metric-aggregation that computes the maximum value of a numeric column.
Example:
const records = await xata.db.titles.aggregate({
maxRating: {
max: {
column: 'rating'
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"maxRating": {
"max": {
"column": "rating"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"maxRating": {
"max": {
"column": "rating"
}
}
}
}
min
is a metric-aggregation that computes the minimum value of a numeric column.
Example:
const records = await xata.db.titles.aggregate({
minRating: {
min: {
column: 'rating'
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"minRating": {
"min": {
"column": "rating"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"minRating": {
"min": {
"column": "rating"
}
}
}
}
sum
is a metric-aggregation that computes the sum of a numeric column.
Example:
const records = await xata.db.titles.aggregate({
sumRating: {
sum: {
column: 'rating'
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"sumRating": {
"sum": {
"column": "rating"
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"sumRating": {
"sum": {
"column": "rating"
}
}
}
}
percentiles
is a metric-aggregation that computes the percentiles of a numeric column. You can specify an array with the percentiles you want to compute. For example, you can request the median by asking for the 50th percentile ([50]
), or you can request the 50th, 90th, 99th, and 99.9th percentiles by asking for [50, 90, 99, 99.9]
.
Example:
const records = await xata.db.titles.aggregate({
latencyPercentiles: {
percentiles: {
column: 'latency',
percentiles: [50, 90, 99, 99.9]
}
}
});
results = xata.data().aggregate("titles", {
"aggs": {
"latencyPercentiles": {
"sum": {
"column": "latency",
"percentiles": [50, 90, 99, 99.9]
}
}
}
})
// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/aggregate
{
"aggs": {
"latencyPercentiles": {
"percentiles": {
"column": "latency",
"percentiles": [50, 90, 99, 99.9]
}
}
}
}
The result will be an object containing a 'values' sub-object. This sub-object will have percentiles as keys and their corresponding values as the percentile values.
For example:
{
"aggs": {
"latencyPercentiles": {
"values": {
"50.0": 30,
"90:0": 78,
"99.0": 120,
"99.9": 124
}
}
}
}
Default values added to existing records by creating new columns with default values, are not available for Aggregations. These default values are available for new records and become accessible in existing records only in case of a subsequent record update.