Which one the techniques are fact tables improves data access performance and saves data storage space?

Understanding the difference between Fact and Dimension tables in the context of the Star Schema and Data Warehousing

Photo by Reign Abarintos on Unsplash

Introduction

One of the most challenging aspects of data management is related to how data is stored, due to the increasing volume of data being generated (and even consumed) by organisations throughout the years.

Therefore, it is very important to be able to manage the data effectively — and in my opinion the first step towards this direction is to figure out a proper structure for the data that will eventually reside in Databases and Data Warehouses.

In today’s article we will discuss about a commonly used principle in Data Engineering, called Star Schema. Additionally, we will discuss about its two main components, namely Fact and Dimension Tables and how they differ to each other. We will also go through an end-to-end example in order to demonstrate how these concepts are utilised in real-world projects.

The Star Schema in a nutshell

The star schema is the most commonly used model when it comes to modelling data warehouses and dimensional data marts. It consists of one fact table, which in turn reference(s) any number of dimension tables.

A star schema is a special case of the snowflake schema and is commonly applied to facilitate a more simple set of queries — and obviously this comes with its pros and cons (more on these in the following sections of the article). This model was given the name of a star schema since the fact table exists at the centre of the model diagram and is surrounded by dimension tables.

Fact and Dimension Tables

A Fact Table is one that holds the primary keys of the referenced dimension tables along with some quantitative metrics (i.e. measurements) over which some sort of calculation can be performed. Some common examples of facts tables include orders, logs and time-series financial data.

On the other hand, Dimension Tables hold the descriptive information for all related fields that are included in the fact table’s records. A few common examples of Dimension Tables are physical entities such as a Customer and Product tables or even Time Tables. In general, the Dimension Tables are expected to be much smaller in size compared to Fact Tables.

An easy way to distinguish fact from dimension tables is to evaluate whether a table refers to a noun (say a physical person or an object). For instance, a product or a customer can exist without ever being involved in a specific event in the context of the business. Nouns are therefore represented by dimension tables since they represent something that either does something or it has something done over it (e.g. a physical store, a customer or a product). On the other hand, a verb usually corresponds to a Fact Table. In other words, every record listed in a fact table corresponds to an event in which entries from dimension tables are involved. For example, an order is an event in that involves a customer and a product (or maybe more). An order is being made on a specific product and by a specific customer.

As an example, let’s consider a use-case where customers purchase products in physical stores. The star schema is illustrated below.

Example Star Schema with a Fact table in orange and Dimension tables in blue — Source: Author

Dimension tables — in blue colour — correspond to the tables containing information about the Customers, Stores, Products and Dates. These are the nouns of the business case.

The Fact table shown in orange, contains all the Primary Keys (PK) of the dimension tables — which are the Foreign Keys (FK) in the fact table — along with two quantitative fields, namely quantity and amount.

Note that a fact table could exist even without a Primary Key but usually they are assigned a Surrogate Key.

Benefits of Star Schema

Due to the denormalised nature of the model, Star Schema tends to be faster in terms of performance. At the same time the Star Schema tends to be fairly simple and thus the overall structure is easier to be designed. Additionally, it is much more readable even though not as maintainable — more on this in the next section.

Due to its simple design, the data aggregation is much easier — a fact table is typically joined to only a single level of dimension tables. Therefore, Data Engineers and Scientists don’t need to worry too much about the complexity of the queries that they’ll need to write. And perhaps testing also becomes fairly easy!

Finally, since the query performance of the Star Schema is good enough, the chance of affecting other OLAP products is also minimised.

When (not) to use the Star Schema

As mentioned already, the dimensions in Star Schema are denormalised. This means that there may be repeating values within a table. Therefore, the storage required to implement star schema is relatively bigger when compared with other schemas — say a snowflake schema whose dimension tables are normalised. If for any reason the storage size is a problem then this data redundancy should make you think twice before applying the star schema.

Due to the data redundancy in the Starch Schema, the Data Integrity is more at risk. Since the data are repeated in multiple records, new updates, deletions and insertions may affect the overall integrity of the data.

Star schemas are easy to be designed and implemented since the relationships between the tables are fairly simple. However, due to the data integrity concern we discussed about, it could be quite challenging to maintain them. As new data is ingested and potentially new tables need to be created, it may be a nightmare to validate and maintain data integrity throughout the Data Warehouse.

Final Thoughts

In today’s article we discussed about the importance of having a proper structure when storing our data so that we can then manage them in an effective way and one of the concepts that can eventually help organisations build an efficient structure is known as the Star Schema.

We also discussed about the two main table types involved in this concept, called Fact and Dimension tables and what type of data each is supposed to store. Additionally, we went through a real-world example to demonstrate how these concepts work in practise.

Finally, we discussed about the pros and cons of the Star Schema, when it should be used and under what circumstances other alternative approaches should be considered (e.g. the Snowflake Schema).

Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.

Related articles you may also like

What is fact table and what are different types of fact table?

A fact table stores quantitative information for analysis and is often denormalized. A fact table works with dimension tables and it holds the data to be analyzed and a dimension table stores data about the ways in which the data can be analyzed. Thus, a fact table consists of two types of columns.

What is the most important points to take care while designing a fact table?

It is essential to declare the Fact grain at the design stage. Every fact table should have one grain and you should be having multiple grain levels in the same fact table. For example, students' attendance will be daily while their marks are on a term basis.

How many types of fact tables are there?

Fact table and entity types. There are three types of fact tables and entities: Transaction. A transaction fact table or transaction fact entity records one row per transaction.

Which type of data describes numerical facts or measures that can be counted ordered and aggregated?

Quantitative data is anything that can be counted or measured; it refers to numerical data.