How to design a Data Warehouse(Part 1)

Image for post
Image for post

Last blog I wrote why we need a Data Warehouse.

First, what is the data warehouse?

It stores current and historical data are used for reporting and analysis.

However, here is the problem:

How we can design a Data Warehouse?

Because it touches all areas of a company, all departments need to be onboard with the design. Each department needs to understand what the benefits of data warehouse and what results they can expect from it.

What objectives we can focus on:

  • Determine the scope of the whole project
  • Find out what data is useful for analysis and where our dat is current siloed
  • Create a backup plan in case of failure
  • Security: monitoring, etc.

2 Choose a data warehouse platform

There are four types of data warehouse platforms:

  • Traditional database management systems: Row-based relational platforms, e.g., Microsoft SQL Server
  • Specialized Analytics DBMS: Columnar data stores designed specifically for managing and running analytics, e.g., Teradata
  • Out-of-box data warehouse appliances: the combination with a software and hardware with a DBMS pre-installed, e.g., Oracle Exadata
  • Cloud-hosted data warehouse tools

We can choose the suitable one for the company according to budget, employees and infrastructure.

We can choose between cloud or on-premise?

Cloud solution pros:

  • Scalability: easy, cost-effective, simple and flexible to scale with cloud services
  • Low entry cost: no servers, hardware and operational cost
  • Connectivity: easy to connect to other cloud services
  • Security: cloud providers supply security patches and protocols to keep customers safe

Choices:

  • Amazon Redshift
  • Microsoft Azure SQL Data Warehouse
  • Google Bigquery
  • Snowflake Computing

On-premise solution pros:

  • Reliability: With good staff and exceptional hardware, on-premise solutions can be highly available and reliable
  • Security: Organizations have full control of the security and access

Choices:

What we can choose between on-premise and cloud solution, in the big picture, it depends on our budget and existing system.

If we look for control, then we can choose on-premise solution. Conversely, if we look for scalability, we can choose a cloud service.

3 Set up the physical environments

There are three physical environments in Data Warehouse: development, testing and production.

  • We need to test changes before they move into the production environment
  • Running tests against data typically uses extreme data sets or random sets of data from the production environment.
  • Data integrity is much easier to track and issues are easier to contain if we have three environments running.

4 Data Modelling

It is the most complex phase of Data Warehouse design. It is the process of visualizing data distribution in the warehouse.

  • Visualize the relationships between data
  • Set standardized naming conventions
  • Create relationships between data sets
  • Establish compliance and security processes

There are bunches of data modeling techniques that businesses use for data warehouse design. Here are top 3 popular ones:

4 Choosing the ETL solution

ETL, stands for Extract, Transform and Load is the process we pull data from the storage solutions to warehouse.

We need to build an easy, replicable and consistent data pipeline because a poor ETL process can break the entire data warehouse.

Wrapping up

If you are interested in or have any problems with Data Warehouse, feel free to contact me.

Or you can connect with me through my LinkedIn.

Originally published at http://jacquiwu.com on May 24, 2020.

Written by

A current Data Analyst in a subsidiary under Webjet, with experience in applying data science techniques to business.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store