翻译于 2018/08/15 18:43
0 人 顶 此译文
Extract, Transform, Load (ETL) is a mature process that enables organizations to make sense out of disparate data, regardless of where the data resides or the format in which it has been stored. Over time, ETL has evolved with business requirements to support streaming data and unstructured data in NoSQL databases as well as structured data in relational databases.
Why is ETL so important? Data quality impacts the reliability of business decisions whether those decisions are based on spreadsheets, business intelligence systems, analytics dashboards or machine learning.
ETL involves three discrete processes: Extract, Transfer, and Load. The workflow tends to be implemented as an ongoing process.
The process begins by extracting data from the target sources, which may include an ERP, CRM, and other enterprise systems, and data from third-party sources. The different systems tend to use different data formats which are normalized into a common format for further processing.
The extract, transform and load process is a key part of data management in today's enterprise.
The transformation stage ensures data usability for the purpose it will be used. Transformation processes may include:
· Aggregation – summarizing the data
· Applying business rules – for example, ensuring that the data can be used or used for its intended purpose
· Cleansing – identifying and correcting errors
· Deduping – removing duplicate records
· Filtering – selecting and/or excluding data based on some criteria, such as residents of Palo Alto
· Mapping – determining the relationship between two pieces of data from different systems, such as J. Smith and John Smith
· Joining – combining data from two or more records
· Sorting – numerically, alphabetically, chronologically, reverse chronologically, by region, etc.
· Transposing – rotating rows into columns or vice versa
· Splitting – dividing one column into two columns or more
· Validating – ensuring the data has been cleansed and meets pre-defined data quality standards
During the Load phase, the data is loaded into its target destination which could be a data warehouse, a cloud environment, or a delimited flat file.
映射——确定来自不同系统(如J. Smith和John Smith)的两段数据之间的关系
聚合 —— 合并数据
运用业务规则 —— 比如，保证数据可用或者对特殊目的可用
净化 —— 识别和纠正错误
删除重复 —— 删除重复的记录
过滤 —— 根据一些规则选择/排除数据，比如只选择或排除 Palo Alto 居民的数据
映射 —— 判断来自不同系统的不同数据之间的关系，比如 J.Smith 和 John Smith
关联 —— 合并两条或多条记录
排序 —— 按字母顺序，按字母顺序排列，按时间顺序排列，按时间顺序，按地区，等
转置 —— 把行变成列或者反过来
划分 —— 把一列分成两份或多份
验证 —— 保证数据被清洗的质量达到了预期的标准
ETL systems can use temporary data or permanent data, depending on the use case. For example, temporary data (a subset of data) may be used for reporting or analytics. Permanent data might be migrated from one database to another or from one data warehouse to another.
The ETL cycle is processed in thefollowmanner:
· Initiate cycle
· Build reference data
· Extract data from sources
· Audit reports
· Publish to target tables
· Clean up
Data migration is the movement of data from one computer, database, application or storage device to another. It is commonly done when organizations modernize their IT capabilities, replace one vendor's offerings with another vendor's offerings, or consolidate technology ecosystems, such as during mergers and acquisitions. In a consolidation scenario, data mapping is necessary since the existing and new systems tend to use different data formats and organize data differently.
Although data migration tends to be done automatically to accelerate the process and reduce the likelihood of manual errors, it still has to be handled carefully to avoid unnecessary business disruption. Before executing a migration, any technical constraints should be understood and dealt with to avoid unnecessary delays or failures. Meanwhile, end users should be provided with ample advance notice, and the migration should be planned at a time when it will have the least impact on the business (e.g., in the middle of the night, over the weekend, etc.)
Data flows are the movement ofdataone place to another. In the ETL process, dataflowsfrom the target sources through transformations to its intended destination.
Control flowsensurethat processing tasks have been completed and in the right order. The tasks are prioritized in a linear fashion which requires one process to complete beforeanother initiates.
Although control flows execute dataflowsas a task, a task within data flow (a data flow task) can trigger another task before it has completed, which means that data flow tasks can execute in parallel.
Data flow mapping tools have become important as a result of the EU's General Data Protection Regulation (GDPR) because organizations need to explain what personally identifiable information (PII) they are processing and how they are processing it.
The scalability of data stores has become important given the rapid growth of data, so it's not surprising that ETL software is also now taking advantage of parallel processing. The advantage, of course, is increased speed and scalability, which is necessaryinlight of growing data volumes. Parallel processing enables:
· Parallel data access
· Parallel task execution
· Parallel process execution
Data warehousing procedures take failures into account so ETL processes can be rerun and recover from run errors. Rather than running one huge ETL process, the process is decomposed into smaller parts that run in sequence or in parallel. Rows and the elements of a process are tagged for each data flow, so if a failure occurs, the failure point is known. If a failure occurs, then the ETL process is rolled back to the point of failure so the process can be rerun from that point.
Relational databases use keys to establish and identify relationships between tables. Keys are also used to identify a record or row of data inside a table:
· A unique key is a column that identifies a particular entity
· Composite keys comprise several columns
· A foreign key is a key in another table that refers to a primary key
· The primary key is a single attribute, or multiple attributes, that uniquely identify a row in a table
In a data warehouse, keys are important because data is loaded into the data warehouse from different data sources. The different data sources may use different attributes as the primary key. If they do, the data warehouse may require them to be consolidated into a single dimension that includes all of the attributes represented by the various primary keys.
Virtual ETL takes advantage of virtual data to overcome some of the challenges with data migration and application integration with dealing with multiple data sources. The objects or entities collected from various data sources are abstracted for the purpose of creating an in-memory or persistent metadata repository. A persistent metadata repository enables ETL tools to persist as middleware so data harmonization and profiling can be performed continuously, in near-real time.
ETL and ELT involve the same processes, albeit in a different order. ETL is the traditional order of processes thathasbeen used since before data warehouses existed. Big data analysis has fueled the popularity of ELT, as has data diversity.
ELT loads the extracted data into a data warehouse where the transformation takes place. Unlike ETL, ELT does not use a separate transformation engine, it uses the processing power of the data store for the purposes of simplifying the architecture and accelerating the transformation process.