I always thought that working for a company in Greenwich would solve all my time problems. I would store my company data in GMT+0 (Greenwich Mean Time), and my time would always be correct. But man, was I wrong! During summer, the time in Greenwich is GMT+1. In other words, while you are in Greenwich, you are actually one hour ahead of Greenwich Mean Time.
 
There are many things to take into account when setting up a data warehouse that deals with time. Many BI teams like to start counting from 1900-01-01 00:00:00. Are you sure that this timestamp is defined in your current timezone?
 

The companies that I work for operate in the Europe/Amsterdam timezone. Dutch clocks run at GMT+0100 during winter and GMT+0200 during summer. This was not always the case.

The Europe/Amsterdam timezone was defined in the year 1937 and has changed its definition three times since (1940: daylight saving time added; 1946: daylight saving time removed; 1977: daylight saving time added again). The timestamp 1900-01-01 00:00:00 therefore predates the timezone and simply is not defined in it. Database vendors like Oracle and Snowflake nevertheless support working with these timestamps. Be aware that the timezone offset may differ from what you expect: 1900-01-01 00:00:00 +0019 in the case of the Europe/Amsterdam timezone.
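You can inspect this odd offset yourself. The sketch below uses Python's standard `zoneinfo` module as a stand-in for Snowflake, purely to illustrate: for dates before 1937 the tz database falls back to Amsterdam's local mean time (historically +00:19:32, which Snowflake renders rounded to +0019). Note that the exact pre-1937 value depends on the installed tz database; recent tzdata releases derive Amsterdam's pre-1970 history from Europe/Brussels.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+, reads the system tz database

ams = ZoneInfo("Europe/Amsterdam")

# Before the timezone was formally defined, the tz database falls back to a
# local-mean-time offset. The exact value depends on your tzdata version,
# so we only print it here instead of asserting it.
print(datetime(1900, 1, 1, tzinfo=ams).utcoffset())

# After the modern definition, the offsets are the familiar whole hours:
print(datetime(1990, 1, 1, tzinfo=ams).utcoffset())  # 1:00:00 (CET, winter)
print(datetime(1990, 7, 1, tzinfo=ams).utcoffset())  # 2:00:00 (CEST, summer)
```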

Data experts often assume that timezones are constant. Because of this false assumption, mistakes are easily made.

My name is Paul, and I’m a cofounder of Acheron. For the past few years I have been working with Snowflake to store all kinds of data. In this blog I will take you on my adventure of storing time-related data in Snowflake.

In this blog, I will cover the following questions:
  • When are timezones important to store?
  • How do end-users like to visualize time?
  • How do data engineers like to deal with time?
  • How are the timestamp-formats different from each other? (TS_TZ vs. TS_NTZ)
  • When should TS_LTZ be favoured over TS_TZ, and vice versa?
  • In what timezone is Snowflake running?
  • Weighing TS_LTZ against TS_TZ

When are timezones important to store?

Always.
 
It might be tempting to ignore timezones. Just store the wall-clock time, what can go wrong? You will soon find yourself in a situation where you didn’t take daylight saving time into account. As a consequence, it is very difficult to correct for the one day a year where the clock covers only 23 hours, and the other day where it covers 25. This example is just the tip of the iceberg.
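To see these odd days concretely, here is a small Python sketch (using the standard `zoneinfo` module; an illustration outside Snowflake) that counts the elapsed hours in a calendar day in the Europe/Amsterdam timezone:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

AMS = ZoneInfo("Europe/Amsterdam")

def hours_in_day(year, month, day):
    """Elapsed hours between local midnight and the next local midnight."""
    start = datetime(year, month, day, tzinfo=AMS)
    end = start + timedelta(days=1)  # wall-clock "next midnight"
    # Convert to UTC before subtracting, so the DST jump becomes visible.
    return (end.astimezone(timezone.utc)
            - start.astimezone(timezone.utc)).total_seconds() / 3600

print(hours_in_day(2021, 3, 28))   # 23.0 -- clocks spring forward
print(hours_in_day(2021, 10, 31))  # 25.0 -- clocks fall back
print(hours_in_day(2021, 6, 1))    # 24.0 -- an ordinary day
```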
 
In general, when storing data on a data platform, the questions of the end-user are often not yet clear. Losing information will only limit the questions that the end-user can ask.
 
My clients are all Dutch companies that operate solely in the Netherlands. They use more and more cloud services. These services are not always located in the Netherlands, but their data is valuable to report on. It follows that timezone information becomes relevant in order to join data sources on the time dimension.
My advice: store timezone information. Hence, don’t store time in Snowflake’s TS_NTZ format, because it keeps no timezone information, only the wall-clock time.
 
NTZ is not favoured:
ALTER SESSION SET TIMEZONE = 'Europe/Amsterdam';
SELECT TO_TIMESTAMP_NTZ('04/05/2013 01:02:03', 'mm/dd/yyyy hh24:mi:ss');
--> 2013-04-05 01:02:03.000
ALTER SESSION SET TIMEZONE = 'UTC';
SELECT TO_TIMESTAMP_NTZ('04/05/2013 01:02:03', 'mm/dd/yyyy hh24:mi:ss');
--> 2013-04-05 01:02:03.000
 
Instead, choose a timestamp format that takes timezones into account:
ALTER SESSION SET TIMEZONE = 'Europe/Amsterdam';
SELECT TO_TIMESTAMP_TZ('04/05/2013 01:02:03', 'mm/dd/yyyy hh24:mi:ss');
--> 2013-04-05 01:02:03.000 +0200
SELECT TO_TIMESTAMP_LTZ('04/05/2013 01:02:03', 'mm/dd/yyyy hh24:mi:ss');
--> 2013-04-05 01:02:03.000 +0200
 
 
 

Preferences of end-users

Most companies operate in one timezone. When I ask business users how they like to have time presented to them, they always require time to be presented in their local timezone (must have)
 
A business representative added the following to this:
When time is relevant, the business likes to see it in their local timezone. Hence, when a source system uses a different timezone and its data is relevant for reporting, that timezone is required in order to standardize time towards the report’s timezone.

Preferences of data engineers

Within my current data engineering team, we have consensus that time should be stored and processed in a uniform way for all data sources we extract (must have)
 
My team has a strong preference to have time presented in the local timezone (nice to have)
 
Last but not least, some engineers like to add the requirement that no information should be lost when time is stored in Snowflake. As we will see, this last requirement is a bit harder to fulfil (should have)

How are the timestamp-formats different from each other?

I’d like to start by quoting Snowflake’s documentation (to keep this blog concise, abbreviations are used):
TIMESTAMP_TZ (TS_TZ) internally stores UTC time together with an associated time zone offset. When a time zone is not provided, the session time zone offset is used. All operations are performed with the time zone offset specific to each record. TS_TZ currently only stores the offset of a given time zone, not the actual time zone
TIMESTAMP_LTZ (TS_LTZ) internally stores UTC time with a specified precision. However, all operations are performed in the current session’s time zone, controlled by the TIMEZONE session parameter.
 
From these definitions it was not directly clear to me how the two timestamp types differ. Both internally store a timestamp as UTC: the former with an offset, the latter with a specified precision.
 

In order to illustrate the difference, I reuse the response from David Gardner in this thread. David takes the timestamp ‘2010-11-18 08:21:54.260’ without timezone and shows what happens when the timezone switches between the time the value is written to disk (INSERT_TIMEZONE) and the time it is queried (QUERY_TIMEZONE). Notice the following:

  • All values in the column TS_NTZ are the same. This illustrates that TS_NTZ is ignorant of whatever timezone is set. 
  • The rows annotated with (==) show that TS_LTZ and TS_TZ are the same if QUERY_TIMEZONE and INSERT_TIMEZONE are the same.
  • Notice that the order of colors matches between the columns QUERY_TIMEZONE and TS_LTZ. This illustrates that the UTC timestamp internally stored for TS_LTZ is converted at query time to the current timezone.
  • Notice that the order of colors matches between the columns INSERT_TIMEZONE and TS_TZ. When a timestamp is written to disk, the offset at write time is written to disk as well; hence, this offset is returned at query time.
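The table is easier to internalize with a toy model. The Python sketch below is my own approximation of the documented semantics, not Snowflake internals: both types keep a UTC instant, but TS_TZ also keeps the offset captured at insert time, while TS_LTZ picks up whatever timezone the querying session happens to use.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def insert(wall_clock: datetime, insert_tz: str):
    """Model writing a naive wall-clock timestamp under a session timezone."""
    aware = wall_clock.replace(tzinfo=ZoneInfo(insert_tz))
    return {"utc": aware.astimezone(timezone.utc),  # both types keep this
            "insert_offset": aware.utcoffset()}     # only TS_TZ keeps this

def query_ltz(stored, query_tz: str):
    """TS_LTZ: rendered in the *current* session timezone at query time."""
    return stored["utc"].astimezone(ZoneInfo(query_tz))

def query_tz(stored):
    """TS_TZ: rendered with the offset captured at *insert* time."""
    return stored["utc"].astimezone(timezone(stored["insert_offset"]))

row = insert(datetime(2010, 11, 18, 8, 21, 54), "America/Los_Angeles")
print(query_tz(row))                       # 2010-11-18 08:21:54-08:00
print(query_ltz(row, "Europe/Amsterdam"))  # 2010-11-18 17:21:54+01:00
```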
     

When should TS_LTZ be favoured over TS_TZ, and vice versa?

Shifted offsets
Daylight saving time introduces days with 23 hours and days with 25 hours. This leads to different behaviour between TS_LTZ and TS_TZ, as the example below illustrates:
  • TS_TZ only takes the offset into account at insert time, which makes TS_TZ unaware of daylight saving time. Hence, when a day is added to a TS_TZ timestamp, exactly 24 hours are added.
  • TS_LTZ takes a different approach. I assumed it would add 24 hours on top of the UTC time it internally stores. However, Snowflake is aware of daylight saving time, and adds one hour less. Since TS_LTZ doesn’t store the offset at insert time, the explanation lies in the documentation quoted above: all operations on TS_LTZ are performed in the current session’s timezone.
 
 

Query example illustrating shifted offsets

alter session set timezone = 'Europe/Amsterdam';
select column1 date_1, column2 date_2,
datediff(year, column1, column2) diff_years,
datediff(month, column1, column2) diff_months,
datediff(day, column1, column2) diff_days,
column2::date - column1::date as diff_days_via_minus,
datediff(hour, column1, column2) diff_hours
from values
( TO_TIMESTAMP_[L]TZ('2000-03-26 00:00:00'),
TO_TIMESTAMP_[L]TZ('2000-03-27 00:00:00')),
( DATEADD(day,1,TO_TIMESTAMP_[L]TZ('2000-03-26 00:00:00')),
TO_TIMESTAMP_[L]TZ('2000-03-27 00:00:00'));
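The same effect can be reproduced outside Snowflake. In this hedged Python sketch, a fixed UTC offset plays the role of TS_TZ (only the insert-time offset is remembered) and a full `zoneinfo` timezone plays the role of TS_LTZ; 2000-03-26 was the 23-hour spring-forward day in Europe/Amsterdam:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

AMS = ZoneInfo("Europe/Amsterdam")

# TS_TZ-like value: only the insert-time offset (+01:00 on 2000-03-26) is kept.
fixed = datetime(2000, 3, 26, tzinfo=timezone(timedelta(hours=1)))
# TS_LTZ-like value: the full timezone rules apply when operating on it.
zoned = datetime(2000, 3, 26, tzinfo=AMS)

plus_fixed = fixed + timedelta(days=1)  # adds exactly 24 absolute hours
plus_zoned = zoned + timedelta(days=1)  # wall-clock add across the DST switch

target = datetime(2000, 3, 27, tzinfo=AMS)  # midnight after the 23-hour day
print(plus_fixed == target)  # False: one absolute hour too late
print(plus_zoned == target)  # True: lands exactly on local midnight
```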

Risk of corrupted LTZ Timestamps

When timestamps are written to disk in LTZ format, they are converted to UTC using the current session timezone. In Snowflake, a timezone can be set at account level and overwritten at session level; the account-level timezone thus acts as a default. This introduces a risk: when a timezone is accidentally set at session level, a stored timestamp may not represent the actual timestamp of the source system. I call such LTZ timestamps corrupted.
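A corrupted LTZ timestamp is easy to reproduce conceptually. In this Python sketch (my own illustration, not Snowflake code), the same wall-clock value is converted to UTC once with the intended session timezone and once with an accidental one:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

wall = datetime(2021, 6, 1, 12, 0, 0)  # naive wall-clock time from a source system

# Intended session timezone at load time:
good = wall.replace(tzinfo=ZoneInfo("Europe/Amsterdam")).astimezone(timezone.utc)
# The same value loaded while the session timezone was accidentally UTC:
bad = wall.replace(tzinfo=timezone.utc)

print(good)        # 2021-06-01 10:00:00+00:00
print(bad)         # 2021-06-01 12:00:00+00:00
print(bad - good)  # 2:00:00 -- every instant in the table silently shifts
```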
 
Randy Pitcher warns about this behaviour in his blog post “The Hitchhiker’s Guide to Timestamps in Snowflake: Part 1”:
“Any SQL code interacting with TIMESTAMP_LTZ datatypes will be non-deterministic, so buckle up tight when using this bad boy.”
 
I reached out to Randy, since Part 2 had not been written at the time of this post. I asked him if he could explain the non-deterministic behaviour introduced by TS_LTZ. Our conversation follows below.
 
 

Randy: “The issue with using TS_LTZ in your tables is that it can result in non-deterministic code. So if my local timezone is different from the service account I use in prod, or different from my teammate’s, using TS_LTZ can quietly result in SQL that works but is quietly shifting timestamps around.”

Paul: “If I translate it into my own words, you say that it is an easy mistake to use another timezone somewhere in your Snowflake landscape (hence the non-deterministic nature). Your first example is of multiple accounts (yes, you need to make sure that the timezones are set the same). I can also imagine that a timezone is accidentally overwritten at session level. If this happened on read, I think it wouldn’t matter (you would see the correct time, but in another timezone). However, on write, you would internally store an incorrect UTC time.”

Randy: “Yeah, exactly.”

I agree with Randy that this risk exists, but at the same time I assess its probability to be low:
  • Corrupted LTZ timestamps only impact the production account.
  • The only way that LTZ timestamps can get corrupted is when the session timezone accidentally overwrites the account timezone. There are three scenarios I can think of that could cause this to happen:
    • A Snowflake developer can change the session’s timezone. However, developers should not be privileged to load data into Snowflake tables, so there is no risk that corrupted LTZ timestamps are loaded. 
    • The Snowflake ACCOUNTADMIN has the rights to change the timezone on the production Snowflake account. In case the timezone changes due to new insights, the ACCOUNTADMIN needs to be aware that changing the timezone can cause new TS_LTZ timestamps to get corrupted. The workaround I recommend is to set the timezone explicitly in the session where data is loaded.
    • An ELT tool should be used for production loads. For each load, a new session is used, and there is a risk that a timezone is set at session level. To make sure that this setting has no destructive effects, I strongly recommend storing the timezone that was used in the session.
  • A JDBC/ODBC bug could set a timezone. This risk can be mitigated with proper testing.

In what timezone is Snowflake running?

Snowflake allows every session to use a different timezone. The timezone set at account level is just a default that can be overwritten.
 
Given the insights in the previous section, an accidental overwrite of the timezone can cause timestamps to be incorrectly stored in Snowflake as TS_LTZ values.
Don’t be fooled by the following query:
 
SHOW PARAMETERS LIKE 'TIMEZONE' IN ACCOUNT;
-- value 'America/Los_Angeles'
 
You might think that Snowflake runs in the above timezone, but this is only a default that can be overwritten per session.

Summary - TS_TZ vs. TS_LTZ

TIMESTAMP_TZ
This timestamp type is only concerned with the UTC offset at the time of writing. We lose the original timezone of the source that wrote these timestamps to Snowflake.
 
Because TS_TZ is unaware of any timezone, it is also unaware of daylight saving time. We have seen that this introduces unexpected behaviour when we add or subtract time from these timestamps.
TIMESTAMP_LTZ
This timestamp type stores one UTC value internally. We lose the original timezone of the source that wrote these timestamps to Snowflake.
 
We have seen that if an LTZ timestamp is computed with an incorrect timezone at write time, the stored timestamp will be corrupted.

Final thoughts

My personal preference after writing this blog: on the one hand, I recognize that the LTZ approach satisfies the need of business users and data engineers to visualize timestamps in their current timezone. On the other hand, I recognize the risk that LTZ timestamps may get corrupted when writes happen under different timezones. To overcome this risk, I am open to the idea of introducing an extra column that stores the timezone at write time for every record being written. Since Snowflake has columnar compression, this extra field should come at low storage and compute cost.
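As a sketch of that idea (hypothetical helper names, written in Python rather than SQL, just to make the design concrete): keep the UTC instant per record together with the timezone that was in effect at write time, so the original wall-clock reading can always be reconstructed:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_storage(wall: datetime, source_tz: str):
    """Store the UTC instant together with the source timezone per record."""
    utc = wall.replace(tzinfo=ZoneInfo(source_tz)).astimezone(timezone.utc)
    return {"utc": utc, "source_tz": source_tz}

def original_wall_clock(record):
    """Reconstruct the wall-clock time exactly as the source system saw it."""
    return record["utc"].astimezone(ZoneInfo(record["source_tz"]))

rec = to_storage(datetime(2021, 6, 1, 9, 0), "Europe/Amsterdam")
print(rec["utc"])                # 2021-06-01 07:00:00+00:00
print(original_wall_clock(rec))  # 2021-06-01 09:00:00+02:00
```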
 
Timezones are not as old as time: how would Snowflake represent a timestamp from before the timezone was defined?
According to Wikipedia, the official start of the Europe/Amsterdam timezone was 1937-07-01 00:00:28.000 +0120. In the context of building data marts, what values could be used to denote the start of an active record for Kimball Slowly Changing Dimensions Type 2?
 
Below, I convert UTC time to the Europe/Amsterdam timezone in LTZ and TZ. Interestingly, they differ before the timezone was defined!
 
‘0001-01-01 00:00:00’ (UTC) -> 0000-12-31 23:40:28.000 +0019 (LTZ) 0000-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1800-01-01 00:00:00’ (UTC) -> 1799-12-31 23:40:28.000 +0019 (LTZ) 1799-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1800-07-01 00:00:00’ (UTC) -> 1800-06-30 23:40:28.000 +0019 (LTZ) 1800-06-30 23:39:56.000 +0019 (TZ) FALSE
‘1900-01-01 00:00:00’ (UTC) -> 1899-12-31 23:40:28.000 +0019 (LTZ) 1899-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1900-01-01 00:59:59’ (UTC) -> 1900-01-01 00:40:27.000 +0019 (LTZ) 1900-01-01 00:39:55.000 +0019 (TZ) FALSE
‘1900-01-01 01:00:00’ (UTC) -> 1900-01-01 00:40:28.000 +0019 (LTZ) 1900-01-01 00:39:56.000 +0019 (TZ) FALSE
‘1900-01-01 01:59:59’ (UTC) -> 1900-01-01 01:40:27.000 +0019 (LTZ) 1900-01-01 01:39:55.000 +0019 (TZ) FALSE
‘1900-01-01 02:00:00’ (UTC) -> 1900-01-01 01:40:28.000 +0019 (LTZ) 1900-01-01 01:39:56.000 +0019 (TZ) FALSE
‘1915-01-01 00:00:00’ (UTC) -> 1914-12-31 23:40:28.000 +0019 (LTZ) 1914-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1920-01-01 00:00:00’ (UTC) -> 1919-12-31 23:40:28.000 +0019 (LTZ) 1919-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1925-01-01 00:00:00’ (UTC) -> 1924-12-31 23:40:28.000 +0019 (LTZ) 1924-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1930-01-01 00:00:00’ (UTC) -> 1929-12-31 23:40:28.000 +0019 (LTZ) 1929-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1935-01-01 00:00:00’ (UTC) -> 1934-12-31 23:40:28.000 +0019 (LTZ) 1934-12-31 23:39:56.000 +0019 (TZ) FALSE
‘1937-07-01 01:20:28’ (UTC) -> 1937-07-01 00:00:28.000 +0120 (LTZ) 1937-07-01 00:00:28.000 +0120 (TZ) TRUE
‘1940-07-01 00:00:00’ (UTC) -> 1940-06-30 22:00:00.000 +0200 (LTZ) 1940-06-30 22:00:00.000 +0200 (TZ) TRUE
‘1946-07-01 00:00:00’ (UTC) -> 1946-06-30 23:00:00.000 +0100 (LTZ) 1946-06-30 23:00:00.000 +0100 (TZ) TRUE
‘1975-07-01 00:00:00’ (UTC) -> 1975-06-30 23:00:00.000 +0100 (LTZ) 1975-06-30 23:00:00.000 +0100 (TZ) TRUE
‘1977-07-01 00:00:00’ (UTC) -> 1977-06-30 22:00:00.000 +0200 (LTZ) 1977-06-30 22:00:00.000 +0200 (TZ) TRUE
‘1990-01-01 00:00:00’ (UTC) -> 1989-12-31 23:00:00.000 +0100 (LTZ) 1989-12-31 23:00:00.000 +0100 (TZ) TRUE
 
Snowflake support responded as follows:
As per my investigation and internal research, the behaviour mentioned by you is a known one. The difference between TZ and LTZ comes from the offset set in the database, meaning that even if the displayed offset is +0019 (19 minutes), the difference is <60 seconds.
