Seed Scientific [2021] noted that every day, we create roughly 2.5 quintillion bytes of data. With the growing popularity of IoT (Internet of Things), this data creation rate will become even greater.
With this vast amount of data being produced every day, it has necessitated data-driven companies to have robust solutions for managing and analyzing the large quantities of data generated. These systems must be scalable, reliable, and secure enough, as well as flexible to support a wide variety of data types and use cases.
These requirements go way beyond the capabilities of any traditional database solutions hence Data warehouses.
A Data warehouse is a scalable data processing system that supports analytical processes and reporting of insights from data.
In data science, when we are discussing data processing systems, there are two main types:
Online Transaction Processing (OLTP) - This kind of system captures, stores and maintains and processes data from transactions. They are classic databases that control and run essential business operations in real-time.
Online Analytical Processing (OLAP) - This system is catered to for advanced data analytics purposes. They use complex queries to analyse aggregated historical data.
A data warehouse is an OLAP solution, and unlike Data lakes, which follow the ELT (Extract, Load, Transform) model, Data warehouses commonly use the ETL (Extract, Transform and Load)
Requirements of Data Warehouse System
The Data warehouse system must make information easily accessible and its contents must be understandable.
The system must present information consistently, and the data must be carefully assembled from a variety of sources, cleansed, quality assured and released when it is fit for consumption.
The data warehouse system must adapt to change. User needs, business conditions, data, and technology are all subject to change and the data warehouse must be designed to handle this inevitable change so that it doesn’t invalidate existing data or applications.
The system must present information in a timely way because it is used more intensively for operational decisions, raw data may need to be converted into actionable information within hours, minutes or even seconds.
The data warehouse system must be a secure bastion that protects the information assets.
The data warehouse system must serve as the authoritative and trustworthy foundation for improved decision-making, it must have the right data to support decision making
A data warehouse receives data from different data sources and then it is processed in the staging area before being ingested into the actual warehouse (a database). The Data warehouses may then feed data to separate Data marts; smaller database systems which end users may use for different purposes.