Data integration
John Bloor, iStockphotoFor many, information technology (IT) can seem like an enigmatic world full of complex programming languages and costly equipment. Yet, despite the technical jargon that often feels like an unknown language, it is essential for business and organizational decision-makers to grasp the fundamentals of IT. One key concept to understand is data integration.
At first glance, data integration might seem like a straightforward concept. Many organizations manage information across various databases, which requires an effective method to retrieve and combine data from these different sources into a cohesive whole.
In reality, data integration solutions can be quite intricate. There is no one-size-fits-all method for data integration, and many of the techniques used by IT professionals are still advancing. A specific data integration tool may be more suitable for one organization over another, based on their unique requirements.
So, what are the fundamental aspects of data integration? Let's dive into the specifics!
How Does Data Integration Function?
This simple chart displays customer transactions.
Mytour.comData integration primarily focuses on databases. A database is a structured collection of data, much like a file system, which organizes files for easy access, retrieval, and management.
There are various ways to classify databases. One approach is to categorize them based on the type of data they store. For instance, a database could be categorized as a media database if it contains video or audio files.
Another way to classify databases is by how they organize their data. The structure of a database is known as its schema. A common method for organization is through tables, which illustrate the relationships between different data elements. Tables are similar to spreadsheets, with columns representing data categories and rows representing individual records. A database using this method is called a relational database.
To illustrate data integration in practice, let’s consider an electronics company launching a new mobile device. The marketing team might need to access customer data from the sales department and combine it with information from the product department to create a specific sales list. An efficient data integration system would enable the marketing team to view data from both departments in one unified view, excluding irrelevant information and simplifying business operations.
Object-oriented programming (OOP) databases use a unique method for organizing data. In contrast to traditional programming models, which focus on entering data into a set of instructions to produce output, OOP languages define data as objects and explore the relationships and interactions between these objects.
To build an OOP database, you start by defining the objects you want to store. Next, you determine how each object is related to others within the system. Once you've identified an object, you categorize it into a class, which is a collection of related objects. To define a class, you need to specify the data each object in the class will hold and which methods (logical sequences) will impact those objects. Objects within the system can communicate with one another or with users through interfaces called messages.
A concrete example makes it easier to grasp. Imagine you’re designing a database to track American sports. You begin by defining baseball teams. Once you’ve established the structure of a baseball team, you can generalize it into a class. The Atlanta Braves would be a specific object or instance of that class. The baseball team class would be part of a broader superclass, American sports teams, which also includes other classes such as football and soccer teams.
To retrieve information from a database (regardless of its organizational structure), you use a query. A query is a request for specific information. Both humans and applications can submit queries to databases, which respond by providing data that meets the query's criteria. Queries are written in specialized languages like Structured Query Language (SQL). If you've ever used a search engine on the Internet, you've submitted a query by entering search terms.
Databases respond to queries by generating a data view. A view is simply a specific way of presenting data. In a data integration system, the view returned only includes the data directly relevant to the original query. For instance, if you query for customers who made purchases exceeding $100, the database would return this specific information:
This view displays only the data relevant to the query "customers who made purchases over $100 in products."
Mytour.comThis view presents only the data associated with the query "customers who spent more than $100 on products." It does not include information on what specific products were purchased, nor does it display customers who made purchases under $100.
What are the various methods of data integration? We will explore that next.
Data refers to any type of information. It could be the value in a cell of a spreadsheet, an audio or video file, a series of words in a document, raw output from a computer program, or even metadata describing a file. Data integration is concerned with information itself, not the files that contain it.
Data Integration Tools
From the previous section, you may assume that databases are quite intricate. This is a reasonable conclusion and helps explain why data integration continues to evolve, despite being around for decades. The goal of data integration is to collect data from various sources, merge it, and present it as if it were a unified whole. However, the success of this process is heavily dependent on the quality of the data, as poor data can result in misleading conclusions or insights.
Imagine you're about to head out on a trip and want to check the traffic conditions before deciding which route to take. Here's how the different data integration methods would handle your request.
With manual integration, you would be responsible for all the work. First, you'd need to know where to find your data. You would have to locate the traffic report and the map for your town. Then, you would retrieve the traffic report and map data directly from their respective databases, comparing them to figure out the best route out of town.
Using a common user interface approach would ease your task somewhat. You could query an interface, such as the internet. The query results would appear as a view on the interface. While you'd still need to compare the traffic report with the map to determine the optimal route, the interface would handle the retrieval and location of the data.
Some integration methods use applications to do all the work for you. These applications, known as data integration tools, are specialized programs designed to locate, retrieve, and merge the data for you. Data scientists often create these applications to ensure smooth and accurate data integration processes.
During the integration process, applications must adjust the data to ensure that the information from one source aligns with the data from another. In this case, you would send a query to an application, which would then provide a view that combines your town's map with traffic report data. The challenge with this method is that as the number of data sources and formats grows, the complexity of programming applications increases.
Next, there's the data warehousing approach, or common data storage method. With this technique, data from all the databases you're integrating is extracted, transformed, and loaded. First, the data warehouse pulls data from the different sources. Then, it standardizes the data into a consistent format, ensuring compatibility across all data. Finally, the warehouse stores the new data in its own database. When you run a query, the warehouse locates, retrieves, and presents the integrated data.
In our scenario, the data warehouse would pull the most recent traffic and map data it has on your town. It would then combine this information and send you an integrated view. This system has its benefits and drawbacks, which we will explore in the next section.
Most designers of data integration systems aim to minimize the amount of work required from the end user, which is why they often prioritize applications and data warehousing strategies.
What exactly do data warehouses do? Find out in the next section!
Portal websites like Google or Yahoo serve as prime examples of common user interfaces. While these portals pull data from various sources, they do not combine this information into a cohesive view.
The Data Warehouse
As we mentioned earlier, a data warehouse is a database that stores data from other databases, converting it into a standardized format. This is the simplest description of a data warehouse. There isn't a single definition for what constitutes a data warehouse or how it should be constructed, leading to a variety of different implementations. Each data warehouse may have a unique structure and operation.
Typically, queries sent to a data warehouse are resolved quickly. This is because the heavy lifting of extracting, transforming, and merging the data has already been completed. The user interface of a data warehouse is referred to as the front end, and from a user perspective, it offers a fast way to access integrated data.
From the back-end perspective, however, things are more complex. Database administrators need to carefully design the data warehouse system to ensure it is both effective and efficient. One of the most challenging tasks is converting data from various sources into a common format, which requires a consistent method for describing and encoding the data.
A data warehouse must be large enough to accommodate data collected from multiple sources. Some data warehouses also include a component called a data mart. In this setup, the data warehouse handles the aggregation of the data, while the data mart retrieves and combines the relevant data to respond to user queries.
One issue with data warehouses is that the data they contain is not always up to date. This happens because data warehouses pull information from other databases at scheduled intervals. If the data in those databases changes between extractions, the information provided by the data warehouse may not reflect the most current or accurate view. This may not be an issue if the data doesn't change often, but it can be problematic for other applications.
Referring to the earlier example of the traffic report and map, you can see how this issue arises. While a town's map may not need frequent updates, traffic conditions can shift drastically in a short period. Since a data warehouse may not extract data frequently, it could lead to outdated or unreliable time-sensitive information. In such cases, a different data integration approach might be more effective.
What other options exist aside from data warehousing? Let's explore the alternatives!
Metadata refers to descriptions of data. It serves as a tool for naming and defining data, as well as illustrating the relationships between different sets of data. Data integration systems rely on metadata to help locate the data relevant to specific queries.
Networked Databases
For data integration systems that handle information which is frequently updated, a data warehouse isn't the best solution. In these situations, data virtualization can be a more adaptable alternative by enabling access to data from multiple sources without the need for physical integration. Other methods, like streaming data integration or real-time data processing, can also help organizations that need to manage rapidly changing information.
To tackle the challenge of continuously changing information, IT specialists design systems that fetch data directly from the individual data sources. Without a centralized database for analyzing, categorizing, and integrating the data in preparation for user queries, these tasks are handled by other parts of the system.
IT professionals define data integration systems through schemata. The global schema is the unified view that results from a processed query. The source schema represents the structure of the individual data sources and their relationships. Mapping describes how the global schema and source schema connect. Consider the source schema as a blueprint for all the data within the system, and the global schema as a blueprint for the response generated by a query.
There are two primary strategies for resolving queries within a data integration system: global-as-view and local-as-view. Each method emphasizes different parts of the overall system, offering distinct advantages and disadvantages.
The global-as-view approach emphasizes the global schema. As long as the data sources remain unchanged, this method works effectively. Modifying the global schema setup is relatively straightforward, making it easy to analyze the same set of data in various ways. However, adding or removing data sources is a challenge, as it impacts the system as a whole.
The local-as-view approach, in contrast, focuses on the individual data sources. As long as the global schema stays the same, adding or removing data sources is easy. The schema is designed to seek the same types of data and relationships within the new data sources. However, modifying the global schema is difficult in this case. Any new way of analyzing the data would require redefining the entire system.
So, that's the overview of data integration. Next time you check the weather or filter through a set of data, you'll have a better understanding of the complex processes happening behind the scenes to make it all work.
Federated database systems (FDBMS) are networks of independent databases. These systems take on several challenging tasks:
- Handling user queries
- Breaking them into sub-queries
- Using wrappers to define the sub-queries for each database's understanding
- Sending those wrapped sub-queries to the correct databases
- Receiving the data returned by the databases
- Integrating the data into a single view
- Displaying the unified view to the user
These systems are difficult to build and maintain due to the inherent complexity of managing databases.
