This page provides an overview view about key terms and phrases relating to data warehousing and big data. Each topic has a link that provides more information
Every organization has information that it must store and manage to meet its requirements. For example, a corporation must collect and maintain human resources records for its employees. This information must be available to those who need it.
An information system is a formal system for storing and processing information. An information system could be a set of cardboard boxes containing manila folders along with rules for how to store and retrieve the folders. However, most companies today use a database to automate their information systems. A database is an organized collection of information treated as a unit. The purpose of a database is to collect, store, and retrieve related information for use by database applications.
Left to their own devices, business users will fend for themselves. More times than not, we see a chasm between data and information; a chasm filled by books and books full of spreadsheets. On their own, spreadsheets are not the issue. There is simply to too much reliance on spreadsheets as a form of Swiss army knife.
Though it may work in the short-term, calling this approach a “process” seems to be a stretch, at best. Spreadsheets are fantastic personal productivity tools; unfortunately, everyone tends to overuse them.
More to the point, the spreadsheets are not really being used properly. Time and time again, analysts and business users create massive workbooks, filled with dozens - if not hundreds - of sheets turning them into “reporting applications”. So a spread-mart is really a data mart built using a series of spreadsheet workbooks.
A data mart serves the same role as a data warehouse, but it is intentionally limited in scope. It may serve one particular department or line of business. The advantage of a data mart versus a data warehouse is that it can be created much faster due to its limited coverage. However, data marts also create problems with inconsistency.
It takes tight discipline to keep data and calculation definitions consistent across data marts. This problem has been widely recognized, so data marts exist in two styles. Independent data marts are those which are fed directly from source data. They can turn into islands of inconsistent information. Dependent data marts are fed from an existing data warehouse. Dependent data marts can avoid the problems of inconsistency, but they require that an enterprise-level data warehouse already exist.
Data marts can be physically instantiated or implemented purely logically though views. Furthermore, data marts can be co-located with the enterprise data warehouse or built as separate systems.
Operational data stores exist to support daily operations. The ODS data is cleaned and validated, but it is not historically deep: it may be just the data for the current day. Rather than support the historically rich queries that a data warehouse can handle, the ODS gives data warehouses a place to get access to the most current data, which has not yet been loaded into the data warehouse.
The ODS may also be used as a source to load the data warehouse. As data warehousing loading techniques have become more advanced, data warehouses may have less need for ODS as a source for loading data. Instead, constant trickle-feed systems can load the data warehouse in near real time.
A data warehouse is a database designed to enable business intelligence activities: it exists to help users understand and enhance their organization's performance. It is designed for query and analysis rather than for transaction processing, and usually contains historical data derived from transaction data, but can include data from other sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources.
This helps in:
The consolidated storage of the raw data as the center of your data warehousing architecture is often referred to as an Enterprise Data Warehouse (EDW). An EDW provides a 360-degree view into the business of an organization by holding all relevant business information in the most detailed format.
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant information of an enterprise is consolidated.
An Oracle Autonomous Data Warehouse brings together decades of database automation, decades of automating database infrastructure, and new technology in the cloud to deliver a fully autonomous database. The data warehouse is self-driving, self-securing, and self-repairing. This means:
An autonomous database has four overarching goals:
Data warehouses are distinct from online transaction processing (OLTP) systems. With a data warehouse you separate analysis workload from transaction workload. Thus data warehouses are very much read-oriented systems. They have a far higher amount of data reading versus writing and updating.
This enables far better analytical performance and avoids impacting your transaction systems. A data warehouse system can be optimized to consolidate data from many sources to achieve a key goal: it becomes your organization's "single source of truth".
There is great value in having a consistent source of data that all users can look to; it prevents many disputes and enhances decision-making efficiency.
The data discovery lab is a separate environment built to allow your analysts and data scientists to figure out the value hidden in your data. The data lab helps you find the right questions to ask and, of course, put those answers to work for your business. It also referred to as a “sandbox”.
The lab is not the end result. Rather, it’s a way to generate new insights that can be put to productive use. It’s important to figure out upfront how you’re going to turn insight into value. And if you’re starting a data lab project for the first time, you want that value to be visible quickly to maintain or gain organizational support for the work
See data discovery lab.
Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.
A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources. Watch this video to go a bit deeper.
Data lakes are becoming increasingly important as people, especially in business and technology, want to perform broad data exploration and discovery. Bringing data together into a single place or most of it in a single place can be useful for that.
The key difference between a data lake and a data warehouse is that the data lake tends to ingest data very quickly and prepare it later on the fly as people access it. With a data warehouse, on the other hand, you prepare the data very carefully upfront before you ever let it in the data warehouse.
Artificial intelligence as an academic discipline was founded in 1956. The goal then, as now, was to get computers to perform tasks regarded as uniquely human: things that required intelligence. Initially, researchers worked on problems like playing checkers and solving logic problems.
Artificial intelligence, then, refers to the output of a computer. The computer is doing something intelligent, so it’s exhibiting intelligence that is artificial.
Machine learning is the subset of artificial intelligence (AI) that focuses on building systems that learn—or improve performance—based on the data they consume.
Artificial intelligence is a broad term that refers to systems or machines that mimic human intelligence. Machine learning and AI are often discussed together, and the terms are sometimes used interchangeably, but they don’t mean the same thing. An important distinction is that although all machine learning is AI, not all AI is machine learning.
Today, machine learning is at work all around us. When we interact with banks, shop online, or use social media, machine learning algorithms come into play to make our experience efficient, smooth, and secure. Machine learning and the technology around it are developing rapidly, and we're just beginning to scratch the surface of its capabilities.
Put simply, deep learning is all about using neural networks with more neurons, layers, and interconnectivity. We’re still a long way off from mimicking the human brain in all its complexity, but we’re moving in that direction. And when you read about advances in computing from autonomous cars to Go-playing supercomputers to speech recognition, that’s deep learning under the covers.
You experience some form of artificial intelligence. Behind the scenes, that AI is powered by some form of deep learning.
A subject area is a single-topic-centric slice through an entire data warehouse data model. A data mart or departmental mart is typically used to analyze a single subject area such as finance, or sales, or HR. Within a database a subject area groups all tables together that cover a specific (logical) concept, business process or question. A data warehouse and enterprise data warehouse will typically contain multiple subject areas, creating what is sometimes referred to as a 360-degree view of the business.
A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways.
The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters—size of computer, number of users, storage capacity, type of network, and software.
Star schemas are often found in data warehousing systems with embedded logical or physical data marts. The term star schema is another way of referring to a "dimensional modeling" approach to defining your data model. Most descriptions of dimensional modeling use terminology drawn from the work of Ralph Kimball, the pioneering consultant and writer in this field. Dimensional modeling creates multiple star schemas, each based on a business process such as sales tracking or shipments.
Each star schema can be considered a data mart, and perhaps as few as 20 data marts can cover the business intelligence needs of an enterprise.
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.
Dimension tables provide category data to give context to the fact data. For instance, a star schema for sales data will have dimension tables for product, date, sales location, promotion and more. Dimension tables act as lookup or reference tables because their information lets you choose the values used to constrain your queries.
The values in many dimension tables may change infrequently. As an example, a dimension of geographies showing cities may be fairly static. But when dimension values do change, it is vital to update them fast and reliably. Of course, there are situations where data warehouse dimension values change frequently. The customer dimension for an enterprise will certainly be subject to a frequent stream of updates and deletions.
Fact tables have measurement data. They have many rows but typically not many columns. Fact tables for a large enterprise can easily hold billions of rows. For many star schemas, the fact table will represent well over 90 percent of the total storage space. A fact table has a composite key made up of the primary keys of the dimension tables of the schema.
A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all