Extract, Transform, Load (ETL) is a mature process that enables organizations to make sense out of disparate data, regardless of where the data resides or the format in which it has been stored.  Over time, ETL has evolved with business requirements to support streaming data and unstructured data in NoSQL databases as well as structured data in relational databases.

Why is ETL so important? Data quality impacts the reliability of business decisions whether those decisions are based on spreadsheets, business intelligence systems, analytics dashboards or machine learning. 

How ETL Works

ETL involves three discrete processes: Extract, Transform, and Load. The workflow tends to be implemented as an ongoing process.

The process begins by extracting data from the target sources, which may include ERP, CRM, and other enterprise systems, as well as data from third-party sources. The different systems tend to use different data formats, which are normalized into a common format for further processing.


The extract, transform and load process is a key part of data management in today's enterprise.

The transformation stage ensures the data is usable for its intended purpose.  Transformation processes may include:

·  Aggregation – summarizing the data

·  Applying business rules – for example, verifying that the data conforms to business rules and is fit for its intended purpose

·  Cleansing – identifying and correcting errors

·  Deduping – removing duplicate records

·  Filtering – selecting and/or excluding data based on some criteria, such as residents of Palo Alto

·  Mapping – determining the relationship between two pieces of data from different systems, such as J. Smith and John Smith

·  Joining – combining data from two or more records

·  Sorting – numerically, alphabetically, chronologically, reverse chronologically, by region, etc.

·  Transposing – rotating rows into columns or vice versa

·  Splitting – dividing one column into two columns or more

·  Validating – ensuring the data has been cleansed and meets pre-defined data quality standards

During the Load phase, the data is loaded into its target destination, which could be a data warehouse, a cloud environment, or a delimited flat file.
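As a minimal sketch of the transformation steps listed above, the following Python snippet walks through cleansing, deduping, filtering, mapping, aggregation, and sorting; the records, field names, and mapping rules are hypothetical examples, not a real ETL tool's API:

```python
# Hypothetical records extracted from a source system.
records = [
    {"name": "J. Smith", "city": "Palo Alto", "amount": "100"},
    {"name": "John Smith", "city": "Palo Alto", "amount": "250"},
    {"name": "John Smith", "city": "Palo Alto", "amount": "250"},  # duplicate
    {"name": "A. Jones", "city": "Oakland", "amount": "75"},
]

# Cleansing: correct the amount field's type from string to integer.
for r in records:
    r["amount"] = int(r["amount"])

# Deduping: remove exact duplicate records.
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Filtering: keep only residents of Palo Alto.
filtered = [r for r in deduped if r["city"] == "Palo Alto"]

# Mapping: resolve name variants (J. Smith vs. John Smith) to one form.
name_map = {"J. Smith": "John Smith"}
for r in filtered:
    r["name"] = name_map.get(r["name"], r["name"])

# Aggregation: summarize amounts per canonical name.
totals = {}
for r in filtered:
    totals[r["name"]] = totals.get(r["name"], 0) + r["amount"]

# Sorting: order the summary alphabetically by name.
summary = sorted(totals.items())
print(summary)  # [('John Smith', 350)]
```

The final `summary` would then be loaded into the target destination, for example written out as a delimited flat file.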

Temporary vs. Permanent Data

ETL systems can use temporary data or permanent data, depending on the use case.  For example, temporary data (a subset of data) may be used for reporting or analytics.  Permanent data might be migrated from one database to another or from one data warehouse to another.

Real-life ETL 

The ETL cycle is processed in the following manner:

·  Initiate cycle

·  Build reference data

·  Extract data from sources

·  Validate

·  Transform

·  Stage

·  Audit reports

·  Publish to target tables

·  Archive

·  Clean up
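The cycle above can be sketched as an ordered pipeline; the step names follow the list, but the step functions here are hypothetical placeholders rather than real implementations:

```python
# A sketch of one ETL cycle run as an ordered sequence of steps.
# Each step is a hypothetical placeholder function.
def run_cycle(steps, log):
    for name, step in steps:
        log.append(name)  # record which step is running, in order
        step()

log = []
steps = [
    ("initiate_cycle", lambda: None),
    ("build_reference_data", lambda: None),
    ("extract_from_sources", lambda: None),
    ("validate", lambda: None),
    ("transform", lambda: None),
    ("stage", lambda: None),
    ("audit_reports", lambda: None),
    ("publish_to_target_tables", lambda: None),
    ("archive", lambda: None),
    ("clean_up", lambda: None),
]
run_cycle(steps, log)
print(log[0], "->", log[-1])  # initiate_cycle -> clean_up
```

In a real system each placeholder would call out to the extraction, transformation, and loading logic, and the cycle would be scheduled to run on an ongoing basis.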

Data Migration

Data migration is the movement of data from one computer, database, application or storage device to another. It is commonly done when organizations modernize their IT capabilities, replace one vendor's offerings with another vendor's offerings, or consolidate technology ecosystems, such as during mergers and acquisitions. In a consolidation scenario, data mapping is necessary since the existing and new systems tend to use different data formats and organize data differently.

Although data migration tends to be done automatically to accelerate the process and reduce the likelihood of manual errors, it still has to be handled carefully to avoid unnecessary business disruption. Before executing a migration, any technical constraints should be understood and dealt with to avoid unnecessary delays or failures.  Meanwhile, end users should be provided with ample advance notice, and the migration should be planned for a time when it will have the least impact on the business (e.g., in the middle of the night, over the weekend, etc.).

Data Flow and Control Flow

Data flows are the movement of data from one place to another. In the ETL process, data flows from the target sources through transformations to its intended destination.

Control flows ensure that processing tasks are completed, and in the right order.  The tasks are prioritized in a linear fashion, which requires one process to complete before another initiates.

Although control flows execute data flows as tasks, a task within a data flow (a data flow task) can trigger another task before it has completed, which means that data flow tasks can execute in parallel.
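The contrast can be sketched in Python: a control flow runs its tasks strictly in sequence, while the data flow task it executes may launch concurrent work internally. The task names and sources are illustrative assumptions:

```python
import threading
import queue

control_log = []
results = queue.Queue()  # thread-safe collector for data flow output

def extract(source):
    # A data flow task: pull records from one hypothetical source.
    results.put(f"extracted:{source}")

def run_data_flow():
    # Data flow tasks launched concurrently within one control-flow task.
    threads = [threading.Thread(target=extract, args=(s,))
               for s in ("crm", "erp")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Control flow: each task must complete before the next one initiates.
for name, task in [("prepare", lambda: None),
                   ("run_data_flow", run_data_flow),
                   ("finalize", lambda: None)]:
    control_log.append(name)
    task()

extracted = sorted(results.get() for _ in range(results.qsize()))
print(control_log, extracted)
```

The control flow's order is guaranteed by the sequential loop; the order in which the two extract tasks finish is not, which is why the results are collected through a thread-safe queue.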

Data flow mapping tools have become important as a result of the EU's General Data Protection Regulation (GDPR) because organizations need to explain what personally identifiable information (PII) they are processing and how they are processing it.

Parallel Processing

The scalability of data stores has become important given the rapid growth of data, so it's not surprising that ETL software is also now taking advantage of parallel processing.  The advantage, of course, is increased speed and scalability, which is necessary in light of growing data volumes. Parallel processing enables:

·  Parallel data access

·  Parallel task execution

·  Parallel process execution
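As a small illustration of parallel task execution, the sketch below transforms hypothetical partitions of source data concurrently using Python's standard thread pool; a real ETL engine would typically parallelize across processes or machines as well:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of extracted source data.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def transform(partition):
    # A toy transformation: double every value in the partition.
    return [x * 2 for x in partition]

# Each partition is transformed as a separate concurrent task;
# Executor.map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    transformed = list(pool.map(transform, partitions))

print(transformed)  # [[2, 4, 6], [8, 10], [12, 14, 16, 18]]
```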

Rerunnability and Recoverability

Data warehousing procedures take failures into account so ETL processes can be rerun and recover from run errors. Rather than running one huge ETL process, the process is decomposed into smaller parts that run in sequence or in parallel.  Rows and the elements of a process are tagged for each data flow, so if a failure occurs, the failure point is known.  If a failure occurs, then the ETL process is rolled back to the point of failure so the process can be rerun from that point.
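A minimal sketch of this idea: the pipeline records a checkpoint after each completed part, so when a part fails, the rerun resumes from the failure point rather than starting over. The step names and the injected failure are hypothetical:

```python
def run_pipeline(steps, checkpoint, state):
    # Skip parts already completed in a previous run.
    for i in range(checkpoint["done"], len(steps)):
        steps[i](state)
        checkpoint["done"] = i + 1  # persist progress after each part

fail_once = {"stage": True}  # simulate a one-time failure in staging

def extract(state): state["log"].append("extract")
def transform(state): state["log"].append("transform")
def stage(state):
    if fail_once["stage"]:
        fail_once["stage"] = False
        raise RuntimeError("stage failed")
    state["log"].append("stage")
def load(state): state["log"].append("load")

state = {"log": []}
checkpoint = {"done": 0}
steps = [extract, transform, stage, load]

try:
    run_pipeline(steps, checkpoint, state)
except RuntimeError:
    pass  # the failure point is known from the checkpoint

resumed_from = checkpoint["done"]       # 2 parts completed before failure
run_pipeline(steps, checkpoint, state)  # rerun resumes at the failed part
print(resumed_from, state["log"])
```

In a production system the checkpoint would be persisted durably (e.g., in a metadata table) rather than held in memory, so the process can recover even after a crash.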

The Importance of Keys

Relational databases use keys to establish and identify relationships between tables.  Keys are also used to identify a record or row of data inside a table:

·  A unique key is a column that identifies a particular entity

·  Composite keys comprise several columns

·  A foreign key is a key in another table that refers to a primary key

·  The primary key is a single attribute, or multiple attributes, that uniquely identify a row in a table

In a data warehouse, keys are important because data is loaded into the data warehouse from different data sources.  The different data sources may use different attributes as the primary key.  If they do, the data warehouse may require them to be consolidated into a single dimension that includes all of the attributes represented by the various primary keys.
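The consolidation described above can be sketched as follows: two hypothetical source systems key the same customer differently (a CRM ID and an ERP number), and the warehouse merges them into one dimension row via a shared natural key, assigning its own surrogate key. All names and data here are illustrative assumptions:

```python
# Hypothetical source records keyed differently in each system.
crm_customers = [{"crm_id": "C-1", "email": "jo@example.com",
                  "name": "J. Smith"}]
erp_customers = [{"erp_no": 42, "email": "jo@example.com",
                  "region": "West"}]

dimension = {}       # the consolidated customer dimension
next_surrogate = [1]  # warehouse-owned surrogate key counter

def upsert(natural_key, attrs):
    # Merge rows from both sources on a shared natural key (email),
    # assigning a surrogate key the first time the entity is seen.
    if natural_key not in dimension:
        dimension[natural_key] = {"customer_sk": next_surrogate[0]}
        next_surrogate[0] += 1
    dimension[natural_key].update(attrs)

for row in crm_customers:
    upsert(row["email"], {"crm_id": row["crm_id"], "name": row["name"]})
for row in erp_customers:
    upsert(row["email"], {"erp_no": row["erp_no"], "region": row["region"]})

print(dimension["jo@example.com"])
```

The resulting dimension row carries both source systems' primary keys (`crm_id` and `erp_no`) as attributes, with the surrogate key serving as the warehouse's own identifier for the consolidated entity.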

Virtual ETL

Virtual ETL takes advantage of virtual data to overcome some of the challenges of data migration and application integration when dealing with multiple data sources.  The objects or entities collected from various data sources are abstracted for the purpose of creating an in-memory or persistent metadata repository.  A persistent metadata repository enables ETL tools to persist as middleware so data harmonization and profiling can be performed continuously, in near-real time.

ETL Versus ELT

ETL and ELT involve the same processes, albeit in a different order.  ETL is the traditional order of processes that has been used since before data warehouses existed.  Big data analysis has fueled the popularity of ELT, as has data diversity.

ELT loads the extracted data into a data warehouse where the transformation takes place.  Unlike ETL, ELT does not use a separate transformation engine; instead, it uses the processing power of the data store to simplify the architecture and accelerate the transformation process.
