PostgreSQL

TimescaleDB with Tessell for PostgreSQL

Rajnikant Rakesh

December 27, 2023

Introduction

TimescaleDB is a time-series database built on top of PostgreSQL. It is designed to handle large volumes of time-stamped data and provide efficient and scalable query performance for time-series data.

The main difference between TimescaleDB and regular PostgreSQL is that TimescaleDB is optimized for storing and querying time-series data. At the same time, PostgreSQL is a general-purpose database that can handle various types of data. TimescaleDB extends PostgreSQL to provide additional features and capabilities specifically for time-series data management, such as automatic time partitioning, optimized indexing, and compression.

TimescaleDB uses a distributed hypertable architecture that partitions data based on time intervals, enabling efficient querying of large volumes of data over time. It also provides advanced analytics and visualization capabilities for time-series data, including continuous aggregates and window functions.

In this post, we’ll walk through a general overview of time series data, how TimescaleDB turns PostgreSQL into a time series database, and how to set up and perform various tasks.

Tessell for PostgreSQL supports TimescaleDB in all configurations, whether using a Single instance or High Availability setup.

Timescale Use Cases

As human beings, we produce data as we move and operate through time. Therefore, everything we do in our business and personal lives creates time series data.

1. Internet of Things (IoT): IoT is an obvious example of time series data in practice. Everything from mobile devices, home appliances, thermostats, and security cameras to automobiles can continuously collect, store, use, and transmit data. With the proliferation of IoT devices, there is a growing need to capture and analyze time-series data generated by these devices. TimescaleDB is an ideal solution for storing and analyzing IoT data, such as sensor data, device logs, and telemetry data.

2. Financial Services: Financial services firms often deal with large volumes of time-series data, such as stock prices, trading data, and transaction data. TimescaleDB can help these firms store and analyze this data more efficiently and accurately, enabling better decision-making.

3. DevOps and Monitoring: DevOps teams and system administrators often need to monitor and analyze time-series data related to system performance, network traffic, and application metrics. TimescaleDB can help to store and analyze this data in real-time, enabling faster and more accurate troubleshooting.

4. Digital Marketing: Digital marketing firms often need to analyze large volumes of time-series data related to user behavior, website traffic, and ad campaigns. TimescaleDB can help to store and analyze this data more efficiently and accurately, enabling better campaign optimization.

5. Energy and Utilities: Energy and utility companies often need to analyze time-series data related to power consumption, network outages, and equipment performance. TimescaleDB can help to store and analyze this data more efficiently, enabling better predictive maintenance and outage prevention.

TimescaleDB is useful in any scenario where large volumes of time-series data need to be stored and analyzed efficiently and accurately.

Feature & Benefits of TimescaleDB

1. Time partitioning: TimescaleDB automatically partitions data by time, which means that data is stored in separate partitions based on time intervals. This enables more efficient querying and faster analysis of time-series data.

2. Advanced indexing: TimescaleDB provides advanced indexing features that enable fast and efficient querying of time-series data. TimescaleDB uses multi-dimensional indexing to optimize queries that involve both time and other dimensions.

3. Compression: TimescaleDB provides built-in compression for time-series data, which reduces storage requirements and improves query performance. TimescaleDB uses a combination of run-length encoding, delta encoding, and bit-packing to achieve high levels of compression.

4. Continuous aggregates: TimescaleDB provides continuous aggregate functions that enable real-time analytics of time-series data. This enables users to see trends and patterns in real-time, which can be useful for proactive decision-making.

5. Window functions: TimescaleDB provides window functions that enable more advanced analytics of time-series data. Window functions allow users to calculate aggregates over a sliding window of time, which can be useful for identifying trends and patterns in time-series data.

So how does TimescaleDB’s architecture update PostgreSQL’s capabilities?

Best of PostgreSQL with TimescaleDB.

The best thing about TimescaleDB being a PostgreSQL extension is that backup and streaming replication automatically apply to it: pg_basebackup will be able to backup data stored in TimescaleDB. Similarly, restore and point-in-time recovery also automatically apply.

The same goes for synchronous and asynchronous streaming replication. Therefore, Tessell PostgreSQL with standby instances provides read scalability and high availability and would do so for TimescaleDB as well.

The above is key, especially in an environment where popular time series databases have chosen the open-core model, making features that are necessary for production environments only available under proprietary licenses. So, you continue to use the best of PostgreSQL without any additional overheads.

Now that there is an alternative, let’s check it out.

TimescaleDB: Hypertables

Essentially, you create a table in TimescaleDB just as you would in vanilla PostgreSQL. The only requirement is that the table has a timestamp (or date) column. (You can get a more in-depth overview of TimescaleDB’s Architecture here.)

Once this table is created, the built-in helper function, create_hypertable(), performs all the necessary tasks to create the hypertable and its child objects, e.g. partitions and chunks. The hypertable holds the information about its child objects, e.g. partitions and key spaces, within partitions called chunks.

Data ingested into a time series table ultimately ends up in chunks, its final resting place.

Chunks can be managed (resized, dropped, etc) from the command line. Dropping chunks to purge old data is much more efficient than the usual alternatives. But, the true elegance of TimescaleDB lies in the fact that it is a PostgreSQL extension.

How to use TimescaleDB with Tessell for PostgreSQL

Tessell for PostgreSQL supports Timescale as a first-class integration, to create a PostgreSQL database login to the Tessell console or You can create a Tessell for PostgreSQL database in minutes for free on AWS or Azure.

To provision a PostgreSQL database in Tessell. Go to Provisioning.

Select the required version of PostgreSQL

Select the choice of cloud and region.

Select the required shape/configuration needed for the database instance.

Follow the subsequent steps on the console to provision the database. Once the database is created use the choice of client to connect to the database.

Connect to the database tdb (sample database)

Copied to clipboard!


\c tdb

To use timescaleDB in an existing database run the following command.

Copied to clipboard!


CREATE EXTENSION IF NOT EXISTS timescaledb;

Now that we are all set, let’s explore the capabilities of timescaleDB with a use case.

A TimescaleDB use case with Tessell PostgreSQL

Let’s walk through a use case of TimescaleDB within Tessell for PostgreSQL by examining solar power plant generation data. This dataset is a time-series dataset having a sampling frequency of 15 minutes i.e. the time-series data from the power plant is generated every 15 mins.

This data has been gathered at two solar power plants in India over a 34 day period. It has two pairs of files - each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level - each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at a plant level - a single array of sensors optimally placed at the plant.

To showcase the capabilities of TimescaleDB, we’ll perform the following tasks for this use case:

Create the schema in TimescaleDB
Ingest data
Query data
Purge old data
Compression

1. Creating Schema

First, create a new database.

Copied to clipboard!


create database solar_gen;

Now, create a TimescaleDB extension in the created database.

Copied to clipboard!


\c solar_gen
CREATE EXTENSION IF NOT EXISTS timescaledb;

Create a table for storing power plant generation data.

Copied to clipboard!


CREATE TABLE power_plant_generation_data (
  DATE_TIME             TIMESTAMP not null,
  PLANT_ID              INT not null,
  SOURCE_KEY            TEXT not null,
  DC_POWER              FLOAT not null,
  AC_POWER              FLOAT not null,
  DAILY_YIELD           FLOAT not null,
  TOTAL_YIELD           FLOAT not nulL
);

Create the corresponding hypertable in TimescaleDB. Also create chunks to store 1 day’s worth of data (this is specified in the chunk_time_intervalparameter). The interval maps well to the generation data we will be ingesting and analyzing.

Copied to clipboard!


solar_gen=> SELECT create_hypertable('power_plant_generation_data', 'DATE_TIME',
    chunk_time_interval => interval '1 day');

2. Ingesting data

Let’s Ingest data into the table.

Since the data is split into multiple files for multiple plants, used a bash script to load each CSV file using psql.

Copied to clipboard!


#! /bin/bash
for i in *.csv; do
    echo "item: $i"
    time psql \
        "postgres://user:password@****.tessell.com:port/defaultdb?sslmode=require" \
        -c "\COPY power_plant_generation_data FROM $i CSV HEADER"
done

After data is ingested into the table, the partition and chunk information becomes evident.

Copied to clipboard!


solar_gen=> \d+ power_plant_generation_data
Indexes:
    "power_plant_generation_data_time_idx" btree ("DATE_TIME" DESC)
Child tables: _timescaledb_internal._hyper_67_100_chunk,
              _timescaledb_internal._hyper_67_101_chunk,
              _timescaledb_internal._hyper_67_102_chunk,
...
              _timescaledb_internal._hyper_67_215_chunk,
              _timescaledb_internal._hyper_67_216_chunk

The query plan for a simple count(*) of the records in the table after all the data is ingested looks like this:

Copied to clipboard!


solar_gen=> explain select count(*) from power_plant_generation_data; Finalize Aggregate  (cost=227717.58..227717.59 rows=1 width=8)
   ->  Gather  (cost=227717.47..227717.58 rows=1 width=8)
         Workers Planned: 1
         ->  Partial Aggregate  (cost=226717.47..226717.48 rows=1 width=8)
               ->  Append  (cost=0.00..206762.98 rows=7981797 width=0)
                     ->  Parallel Seq Scan on power_plant_generation_data  (cost=0.00..0.00 rows=1 width=0)
                     ->  Parallel Seq Scan on _hyper_67_100_chunk  (cost=0.00..948.12 rows=36612 width=0)
…
                     ->  Parallel Seq Scan on _hyper_67_215_chunk  (cost=0.00..1480.92 rows=57192 width=0)
                     ->  Parallel Seq Scan on _hyper_67_216_chunk  (cost=0.00..253.96 rows=9796 width=0)

*As you can see, the query needs to scan each and every chunk to resolve the query.

3. Querying data

The optimal query in TimescaleDB hits a single chunk within a hypertable.

With our data and corresponding chunk configuration (remember, we configured a chunk to contain a day’s worth of data), the most efficient query would be one that includes a specific date in the WHERE clause. Here’s an example of such a query and its corresponding query plan.

Copied to clipboard!


solar_gen=> SELECT  date_trunc('day', DATE_TIME) as day,
                    avg(DAILY_YIELD), avg(DC_POWER)
                FROM power_plant_generation_data
                WHERE time = '2020-05-15'
                GROUP BY day ORDER BY day;          day           |         avg         | avg_1
------------------------+---------------------+-----
 2014-02-28 00:00:00+00 | 2586.279926 |  2979.8562766
(1 row)solar_gen=> EXPLAIN SELECT  date_trunc('day', DATE_TIME) as day,
                    avg(DAILY_YIELD), avg(DC_POWER)
                FROM power_plant_generation_data
                WHERE time = '2020-05-15'
                GROUP BY day ORDER BY day; Sort  (cost=370.26..370.76 rows=200 width=48)
   Sort Key: (date_trunc('day'::text, power_plant_generation_data."DATE_TIME"))
   ->  HashAggregate  (cost=359.62..362.62 rows=200 width=48)
         Group Key: date_trunc('day'::text, power_plant_generation_data."DATE_TIME")
         ->  Result  (cost=0.00..356.57 rows=407 width=16)
               ->  Append  (cost=0.00..351.48 rows=407 width=16)
                     ->  Seq Scan on power_plant_generation_data  (cost=0.00..0.00 rows=1 width=16)
                           Filter: ("time" = '2020-05-15 00:00:00+00'::timestamp with time zone)
                     ->  Index Scan using _hyper_67_115_chunk_power_plant_generation_data_time_idx on _hyper_67_115_chunk  (cost=0.42..351.48 rows=406 width=16)
                           Index Cond: ("time" = '2020-05-15 00:00:00+00'::timestamp with time zone)

*As can be seen above, one (and only one) chunk is ever considered for the query.

For range scans on time, multiple chunks will be considered for the query. But, here’s where the PostgreSQL parallel query scan comes into play and provides big benefits.

TimescaleDB also has other built-in time functions that are more flexible and convenient to use, such as time_bucket() which provides for arbitrary time intervals as compared to PostgreSQL’s built-in date_trunc() function.

Copied to clipboard!


solar_gen=> SELECT  time_bucket('1 hour', DATE_TIME) as one_hr,
                    avg(DAILY_YIELD), avg(DC_POWER)
                FROM power_plant_generation_data
                WHERE DAILY_YIELD > 1000
                GROUP BY one_hr ORDER BY one_hr;
         one_hr         |         DAILY_YIELD         |  DC_POWER
------------------------+---------------------+-------
 2015-05-15 11:00:00+00 | 1065.1486310299869622 |        1578
 2015-05-15 12:00:00+00 | 64.8786089238845144 |          1281

4. Purging old data

When dealing with time series data, it is convenient to have a mechanism to purge old data without impacting incoming data. This means options like DELETE FROM Table WHERE time > ‘data’ are problematic since they may involve intrusive locking and therefore impact ingest performance.

In TimescaleDB, you can drop a chunk that contains old data, and the best part is, that it does not impact ongoing ingest. Here’s an example, where we drop chunks containing data older than a specific date.

Copied to clipboard!


solar_gen=> SELECT drop_chunks('2020-05-15 12:00:00-5'::timestamp,
                               'power_plant_generation_data');

5. Compression

Data can be compressed to reduce the amount of storage required and increase the speed of some queries. This is a cornerstone feature of Timescale. When new data is added to your database, it is in the form of uncompressed rows. Timescale uses a built-in job scheduler to convert this data to the form of compressed columns. This occurs across chunks of Timescale hypertables.

To enable compression on an existing hypertable use the following command.

Copied to clipboard!


ALTER TABLE power_plant_generation_data SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'plant_id'
);

Add a compression policy to compress chunks that are older than seven days:

Copied to clipboard!


SELECT add_compression_policy('power_plant_generation_data', INTERVAL '7 days');

These storage savings directly translate into saving costs.

Check out the real-world use cases where compression had greatly reduced the storage requirements and in turn the cost.

Conclusion

In summary, TimescaleDB provides a specialized solution for managing and analyzing time-series data, while regular PostgreSQL is a general-purpose database that can handle various types of data.

When starting with TimescaleDB on Tessell for PostgreSQL, it is best to understand your time series data, the retention period, and most importantly what kind of queries you will run against it. These will help you determine the chunk interval for the data which closely correlates to ingest performance, query performance, and the ability to easily purge old data.

TimescaleDB scales very effectively to store time series data compared to vanilla PostgreSQL. This, along with PostgreSQL features such as backup and high availability, renders it a powerful open-source alternative to other time series databases.

The example shown in this post will help you get started quickly with TimescaleDB in Tessell PostgreSQL. To take advantage of its ability to be deployed as a high availability time series database, enable it within a Tessell for PostgreSQL.

And if you’re looking for an enterprise-grade production PostgreSQL database for your Time series workloads, try Tessell for PostgreSQL.