This document discusses the general concepts that apply to any data warehousing technology, and it describes a framework that you can use to organize and structure your migration to BigQuery. Show
TerminologyWe use the following terminology when discussing data warehouse migration: Use case A use case consists of all the datasets, data processing, and system and user interactions required in order to achieve business value, such as tracking sales volumes for a product over time. In data warehousing, the use case often consists of:
Migration offloading can increase the complexity and work involved in migrating data pipelines. Full migrationA migration approach similar to an offload migration, but instead of copying and then synchronizing the schema and data, you configure the migration to ingest data directly into the new cloud data warehouse from the upstream source systems. In other words, the data pipelines required for the use case are also migrated. Enterprise data warehouse (EDW) A data warehouse that consists not only of an analytical database, but multiple critical analytical components and procedures. These include data pipelines, queries, and business applications that are required in order to fulfill the organization's workloads. Cloud data warehouse (CDW) A data warehouse that has the same characteristics as an EDW, but runs on a fully managed service in the cloud—in this case, BigQuery. Data pipeline A process that connects data systems through a series of functions and tasks that perform various types of data transformation. For details, see What is a data pipeline? in this series.Why migrate to BigQuery?Over the past few decades, organizations have mastered the science of data warehousing. They have increasingly applied descriptive analytics to large quantities of stored data, gaining insight into their core business operations. Conventional Business Intelligence (BI), which focuses on querying, reporting, and Online Analytical Processing, might have been a differentiating factor in the past, either making or breaking a company, but it's no longer sufficient. Today, not only do organizations need to understand past events using descriptive analytics, they need predictive analytics, which often uses machine learning (ML) to extract data patterns and make probabilistic claims about the future. The ultimate goal is to develop prescriptive analytics that combine lessons from the past with predictions about the future to automatically guide real-time actions. Traditional data warehouse practices capture raw data from various sources, which are often Online Transactional Processing (OLTP) systems. Then, a subset of data is extracted in batches, transformed based on a defined schema, and loaded into the data warehouse. Because traditional data warehouses capture a subset of data in batches and store data based on rigid schemas, they are unsuitable for handling real-time analysis or responding to spontaneous queries. Google designed BigQuery in part in response to these inherent limitations. Innovative ideas are often slowed by the size and complexity of the IT organization that implements and maintains these traditional data warehouses. It can take years and substantial investment to build a scalable, highly available, and secure data warehouse architecture. BigQuery offers sophisticated software as a service (SaaS) technology that can be used for serverless data warehouse operations. This lets you focus on advancing your core business while delegating infrastructure maintenance and platform development to Google Cloud. BigQuery offers access to structured data storage, processing, and analytics that's scalable, flexible, and cost effective. These characteristics are essential when your data volumes are growing exponentially—to make storage and processing resources available as needed, as well as to get value from that data. Furthermore, for organizations that are just starting with big data analytics and machine learning, and that want to avoid the potential complexities of on-premises big data systems, BigQuery offers a pay-as-you-go way to experiment with managed services. With BigQuery, you can find answers to previously intractable problems, apply machine learning to discover emerging data patterns, and test new hypotheses. As a result, you have timely insight into how your business is performing, which enables you to modify processes for better results. In addition, the end user's experience is often enriched with relevant insights gleaned from big data analysis, as we explain later in this series. What and how to migrate: The migration frameworkUndertaking a migration can be a complex and lengthy endeavor. Therefore, we recommend adhering to a framework to organize and structure the migration work in phases:
Prepare and discoverIn the initial phase, the focus is on preparation and discovery. It's about affording yourself and your stakeholders an early opportunity to discover the existing use cases and raise initial concerns. Importantly, you also conduct an initial analysis around the expected benefits. These include performance gains (for example, improved concurrency) and reductions in total cost of ownership (TCO). This phase is crucial in helping you establish the value of the migration. A data warehouse typically supports a wide range of use cases and has a large number of stakeholders, from data analysts to business decision-makers. We recommend that you involve representatives from these groups to get a good understanding of which use cases exist, whether these use cases perform well, and whether stakeholders are planning new use cases. The discovery phase process consists of the following tasks:
To get insights into the use cases, you can develop a questionnaire to gather information from your subject matter experts (SMEs), end users, and stakeholders. The questionnaire should gather the following information:
The following diagram shows a high-level legacy architecture before the migration. It illustrates the catalog of available data sources, legacy data pipelines, legacy operational pipelines and feedback loops, and legacy BI reports and dashboards that are accessed by your end users. PlanThe planning phase is about taking the input from the preparation and discovery phase, assessing that input, and then using it to plan for the migration. This phase can be broken down into the following tasks:
Migrate using an iterative approachWhen migrating a large data warehousing operation to the cloud, it's a good idea to take an iterative approach. Therefore we recommend that you make the transition to BigQuery in iterations. Dividing the migration effort into iterations makes the overall process easier, reduces risk, and provides opportunities for learning and for improving after each iteration. An iteration consists of all the work required to offload or fully migrate one or more related use cases within a bound period of time. You can think of an iteration as a sprint cycle in the agile methodology, consisting of one or more user stories. For convenience and ease of tracking, you might consider associating an individual use case with one or more user stories. For example, consider the following user story: "As a pricing analyst, I want to analyze product price changes over the last year so that I can calculate future prices." The corresponding use case might be:
The business value of this use case is to support pricing analysis. As with most use cases, this use case probably will support multiple user stories. An offloaded use case will likely be followed by a subsequent iteration to fully migrate the use case. Otherwise, you might still have a dependency on the existing, legacy data warehouse, because the data is copied from there. The subsequent full migration is the delta between the offload and a full migration that hasn't been preceded by an offload—in other words, the migration of the data pipeline(s) to extract, transform, and load the data into the data warehouse. Prioritize use casesWhere you start and end your migration depends on your specific business needs. Deciding the order in which you migrate use cases is important because early success during a migration is crucial to continuing on your cloud adoption path. Experiencing failure at an early stage can become a serious setback to the overall migration effort. You might be on board with the benefits of Google Cloud and BigQuery, but processing all of the datasets and data pipelines that have been created or managed in your legacy data warehouse for different use cases can be complicated and time-consuming. Although there's no one-size-fits-all answer, there are best practices you can use as you're evaluating your on-premises use cases and business applications. This kind of up-front planning can make the migration process easier and the entire transition to BigQuery smoother. The following sections explore possible approaches for prioritizing use cases. Approach: Exploit current opportunitiesLook at current opportunities that could help you maximize the return on investment of a specific use case. This approach is especially useful if you're under pressure to justify the business value of migrating to the cloud. It also provides an opportunity to gather additional datapoints to help assess the total migration cost. Here are some example questions to ask to help you identify which use cases to prioritize:
Identifying opportunities to migrate can create some quick wins that yield tangible, immediate benefits for users and the business. Approach: Migrate analytical workloads firstMigrate Online Analytical Processing (OLAP) workloads before Online Transaction Processing (OLTP) workloads. A data warehouse is often the only place in the organization where you have all the data to create a single, global view of the organization's operations. Thererfore, it's common for organizations to have some data pipelines that feed back into the transactional systems to update status or trigger processes—for example, to buy more stock when a product's inventory is low. OLTP workloads tend to be more complex and have more stringent operational requirements and service-level agreements (SLAs) than OLAP workloads, so it also tends to be easier to migrate OLAP workloads first. Approach: Focus on the user experienceIdentify opportunities for enhancing the user experience by migrating specific datasets and enabling new types of advanced analytics. For example, one way to enhance the user experience is with real-time analytics. You can build sophisticated user experiences around a real-time stream of data when it's fused with historical data. For example:
You can also enhance the user experience with predictive and prescriptive analytics. For that, you can use BigQuery ML, Vertex AI AutoML tabular, or Google's pre-trained models for image analysis, video analysis, speech recognition, natural language, and translation. Or you can serve your custom-trained model using Vertex AI for use cases tailored to your business needs. This might involve the following:
Approach: Prioritize least-risky use casesThere are a number of questions IT can ask to help evaluate which use cases are the least risky to migrate, which makes them the most attractive to migrate in the early phases of the migration. For example:
Going through this list of questions can help you rank datasets and data pipelines from lowest to highest risk. Low-risk assets should be migrated first, and higher-risk ones should come later. ExecuteAfter you've gathered information about your legacy systems, and created a prioritized backlog of use cases, you can group the use cases into workloads and proceed with the migration in iterations. An iteration can consist of a single use case, a few separate use cases, or a number of use cases pertaining to a single workload. Which of these options you choose for the iteration depends on the interconnectivity of the use cases, any shared dependencies, and the resources you have available to undertake the work. A migration typically contains the following steps: These steps are described in more detail in the following sections. You might not need to go through all of these steps in each iteration. For example, in one iteration you might decide to focus on copying some data from your legacy data warehouse to BigQuery. In contrast, in a subsequent iteration you might focus on modifying the ingestion pipeline from an original data source directly to BigQuery. 1. Setup and data governanceSetup is the foundational work that's required in order to enable the use cases to run on Google Cloud. Setup can include configuration of your Google Cloud projects, network, virtual private cloud (VPC), and data governance. It also includes developing a good understanding of where you are today—what works and what doesn't. This helps you understand the requirements for your migration effort. You can use the BigQuery migration assessment feature to assist you with this step. Data governance is a principled approach to manage data during its lifecycle, from acquisition to use to disposal. Your data governance program clearly outlines policies, procedures, responsibilities, and controls surrounding data activities. This program helps to ensure that information is collected, maintained, used, and disseminated in way that both meets your organization's data integrity and its security needs. It also helps empower your employees to discover and use the data to its fullest potential. The data governance documention helps you understand data governance and the controls that you need when migrating your on-premises data warehouse to BigQuery. 2. Migrate schema and dataThe data warehouse schema defines how your data is structured and defines the relationships between your data entities. The schema is at the core of your data design, and it influences many processes, both upstream and downstream. The schema and data transfer documentation provides extensive information on how you can move your data to BigQuery and recommendations for updating your schema to take full advantage of BigQuery's features. 3. Translate queriesUse batch SQL translation to migrate your SQL code in bulk, or interactive SQL translation to translate ad hoc queries. Some legacy data warehouses include extensions to the SQL standard to enable functionality for their product. BigQuery does not support these proprietary extensions; instead, it conforms to the ANSI/ISO SQL:2011 standard. This means that some of your queries might still need manual refactoring if the SQL translators can't interpret them. 4. Migrate business applicationsBusiness applications can take many forms—from dashboards to custom applications to operational data pipelines that provide feedback loops to transactional systems. To learn more about the analytics options when working with BigQuery, see Overview of BigQuery analytics. This topic provides an overview of the reporting and analysis tools that you can use to get compelling insights from your data. The section on feedback loops in the data pipeline documention describes how you can use a data pipeline to create a feedback loop to provision upstream systems. 5. Migrate data pipelinesThe data pipelines documention presents procedures, patterns, and technologies to migrate your legacy data pipelines to Google Cloud. It helps you understand what a data pipeline is, what procedures and patterns it can employ, and which migration options and technologies are available in relation to the larger data warehouse migration. 6. Optimize performanceBigQuery processes data efficiently for both small and petabyte-scale datasets. With the help of BigQuery, your data analytics jobs should perform well without modification in your newly migrated data warehouse. If you find that under certain circumstances query performance doesn't match your expectations, see Introduction to optimizing query performance for guidance. 7. Verify and validateAt the end of each iteration, validate that the use-case migration was successful by verifying that:
You can get started with the Data validation tool, an open source Python CLI tool that compares data from source and target environments to ensure they match. It supports multiple connection types along with multi-level validation functionality. It's also a good idea to measure the impact of the use-case migration—for example, in terms of improving performance, reducing cost, or enabling new technical or business opportunities. Then you can more accurately quantify the value of the return on investment and compare the value against your success criteria for the iteration. After the iteration is validated, you can release the migrated use case to production and give your users access to migrated datasets and business applications. Finally, take notes and document lessons learned from this iteration, so you can apply these lessons in the next iteration and accelerate the migration. Summarizing the migration effortDuring migration, you run both your legacy data warehouse and BigQuery, as detailed in this document. The reference architecture in the following diagram highlights that both data warehouses offer similar functionality and paths—both can ingest from the source systems, integrate with the business applications, and provide the required user access. Importantly, the diagram also highlights that data is synchronized from your data warehouse to BigQuery. This allows use cases to be offloaded during the entire duration of the migration effort. Assuming that your intent is to fully migrate from your data warehouse to BigQuery, the end state of the migration looks like the following: What's nextLearn more about the following steps in data warehouse migration:
You can also learn about moving from specific data warehouse technologies to BigQuery:
Which of the following principles includes the concept of writing all or nothing to the database?The ACID principle of database transaction refers to atomicity (all or nothing), consistency, isolation (transactions operate independently), and durability (data is maintained).
Which of the following is the best definition of a data mart?A data mart is a subset of a data warehouse focused on a particular line of business, department, or subject area. Data marts make specific data available to a defined group of users, which allows those users to quickly access critical insights without wasting time searching through an entire data warehouse.
What is the primary objective of post implementation review quizlet?The objective of a postimplementation review is to reveal whether the implementation of a system has achieved planned objectives (i.e., meets business objectives and risk acceptance criteria). An organization is considering making a major investment in upgrading technology.
|