Can Elasticsearch be used as a "NoSQL"-database? NoSQL means different things in different contexts, and interestingly it's not really about SQL. We will start out with a "Maybe!", and look into the various properties of Elasticsearch as well as those it has sacrificed, in order to become one of the most flexible, scalable and performant search and analytics engines yet.

What is a NoSQL Database Anyway?

nosql-database.org defines NoSQL as “Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.” In other words, it’s not a very precise definition.

It’s not about SQL in particular. For example, Hive’s query language is clearly inspired by SQL. The same is true for Esper’s query language, which operates on streams instead of relations. Also, did you know PostgreSQL was originally named “Postgres” and had “Quel” as its query language back in the day? While first and foremost an ORDBMS, it now also has many features that make it viable as a schemaless document store.

It’s not about ACID-ity either. HyperDex is one example of a NoSQL-database that aims to provide ACID-transactions. MySQL, certainly an SQL-database, has a history of dubious interpretations of what ACID really means.

Relations? While most of the NoSQL-databases do not support joining in the same sense as traditional relational databases and leave that as an exercise for the user, there are those that do. RethinkDB, Hive and Pig, to name a few. Neo4j, a graph-oriented database, certainly deals with relations - it’s excellent at traversing relations (i.e. edges) in graphs. Elasticsearch has a concept of “query time” joining with parent/child-relations and “index time” joining with nested types.

Distributed? While there are some distributed SQL-databases around, and some projects aiming to be something like a NoSQLite, newer generation databases tend to be distributed in some way or another.

To summarize the summary, it neither makes sense to precisely define NoSQL, nor to simply say that Elasticsearch is a “document store”-type NoSQL-database. At the time of writing, nosql-database.org lists more than 20 of those.

In the next sections, we’ll have a look at some important properties and see how Elasticsearch does or does not implement them.

No Transactions

Lucene, which Elasticsearch is built on, has a notion of transactions. Elasticsearch, on the other hand, does not have transactions in the typical sense. There is no way to roll back a submitted document, and you cannot submit a group of documents and have either all or none of them indexed. What it does have, however, is a write-ahead log to ensure the durability of operations without having to do an expensive Lucene commit. You can also specify the consistency level of index operations, in terms of how many replicas must acknowledge the operation before returning. This defaults to a quorum, i.e. ⌊n/2⌋ + 1.
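
The quorum arithmetic, read as floor(n/2) + 1, can be sketched in a few lines of Python. The function name `quorum` and the reading of n as the total number of copies of a shard are assumptions made for illustration; this is not Elasticsearch's own code.

```python
def quorum(copies: int) -> int:
    """Smallest strict majority of `copies`: floor(copies / 2) + 1.

    `copies` is assumed to be the total number of copies of a shard
    (a hypothetical reading of "n"; not taken from Elasticsearch itself).
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return copies // 2 + 1


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} copies -> quorum of {quorum(n)}")
```

Note that with two copies the quorum is two, so a majority rule only tolerates a lost copy once there are three or more.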

Visibility of changes is controlled by when an index is refreshed, which by default happens once per second, on a shard-by-shard basis.
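
To make the idea of refresh-controlled visibility concrete, here is a toy in-memory model. The class `ToyShard` and its methods are hypothetical and greatly simplified; the sketch only shows that a document is accepted first and becomes searchable after the next refresh, not how Elasticsearch or Lucene implement it.

```python
class ToyShard:
    """Toy model of one shard: writes become searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # accepted, but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, doc):
        # The write is accepted immediately but is not searchable yet.
        self._buffer.append(doc)

    def refresh(self):
        # In the real system this would run periodically for each shard;
        # here it is called by hand.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, predicate):
        return [doc for doc in self._searchable if predicate(doc)]


if __name__ == "__main__":
    shard = ToyShard()
    shard.index({"title": "hello"})
    print(shard.search(lambda d: d["title"] == "hello"))  # nothing visible yet
    shard.refresh()
    print(shard.search(lambda d: d["title"] == "hello"))  # now visible
```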

Optimistic concurrency control is done by specifying the version of the submitted documents.
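
A minimal sketch of version-based optimistic concurrency control, using a toy in-memory store. `ToyIndex` and `VersionConflict` are hypothetical names introduced for this example, and the rule "the write succeeds only if the supplied version matches the stored one" is a simplification of the general technique, not a description of Elasticsearch's API.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of a document."""


class ToyIndex:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # If the caller states which version it read, reject the write
        # when someone else has updated the document in the meantime.
        if version is not None and version != current_version:
            raise VersionConflict(
                f"expected version {version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index("1", {"stock": 10})             # creates version 1
    idx.index("1", {"stock": 9}, version=1)   # first writer wins, version 2
    try:
        idx.index("1", {"stock": 8}, version=1)  # second writer is stale
    except VersionConflict as err:
        print("conflict:", err)
```

The losing writer is expected to re-read the document and retry, which is what makes the scheme "optimistic": no locks are held between the read and the write.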

Elasticsearch is built for speed. Doing distributed transactions is a lot of work. Not providing them makes a lot of things easier. By accepting that what we read can be somewhat stale, and that everyone sees the same timeline, Elasticsearch can serve a lot of things from caches - which is paramount for the mind-boggling performance we love it for.

Schema Flexible

Elasticsearch does not require you to specify a schema upfront. Throw a JSON-document at it, and it will do some educated guessing to infer its type. It does a good job at things like numerics, booleans and timestamps. For strings, it will use the “standard”-analyzer, which is usually good to get started.
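
The kind of educated guessing described above can be illustrated with a small Python sketch. The function `guess_type`, the type labels it returns and the ISO-8601 check for timestamps are all assumptions chosen for the example; the real inference rules in Elasticsearch are more involved.

```python
import json
from datetime import datetime


def guess_type(value):
    """Guess a field type from a decoded JSON value (illustrative labels only)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # looks like a timestamp?
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    return "unknown"


doc = json.loads(
    '{"name": "Widget", "price": 9.5, "in_stock": true, '
    '"qty": 3, "created": "2014-03-01T12:00:00"}'
)
inferred = {field: guess_type(value) for field, value in doc.items()}

if __name__ == "__main__":
    for field, field_type in inferred.items():
        print(f"{field}: {field_type}")
```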

While it’s arguably “schema free”, in the sense that you don’t have to specify a schema, we like to think of it as “schema flexible” instead. To develop great search and/or analytics, you really need to tweak your schemas. Elasticsearch has an extensive set of powerful tools to help you, like dynamic templates, multi-field objects, etc. This is covered in more detail in our article on mapping.

Relations and Constraints

Elasticsearch is a document oriented database. The entire object graph you want to search needs to be indexed, so before indexing your documents, they must be denormalized. Denormalization increases retrieval performance (since no query joining is necessary) and uses more space (because things must be stored several times), but it makes keeping things consistent and up-to-date more difficult (as any change must be applied to all instances). Denormalized documents are excellent for write-once-read-many workloads, however.

For example, say you have set up a database containing customers, orders and products, and you want to search for orders given the name of a product and of a user. This could be solved by indexing orders with all the necessary information about the user and the products. Searching is then easy, but what happens when you want to change the name of the product? In a relational design with proper normalization, you would simply update the product and be done. That’s what relational databases are really good at. With a denormalized document database, every order containing the product would have to be updated.
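
The trade-off can be sketched with plain Python dictionaries standing in for documents. All names and sample data here (`products`, `customers`, `build_order_doc`, `search_orders`, `rename_product`) are hypothetical and only illustrate the shape of the problem, not an Elasticsearch API.

```python
# Normalized source data: each fact is stored once.
products = {"p1": {"name": "Blue Widget"}, "p2": {"name": "Red Gadget"}}
customers = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}


def build_order_doc(order_id, customer_id, product_ids):
    """Denormalize: copy customer and product details into the order document."""
    return {
        "order_id": order_id,
        "customer": dict(customers[customer_id], id=customer_id),
        "products": [dict(products[p], id=p) for p in product_ids],
    }


orders = [
    build_order_doc(1, "c1", ["p1"]),
    build_order_doc(2, "c1", ["p1", "p2"]),
    build_order_doc(3, "c2", ["p2"]),
]


def search_orders(docs, product_name, customer_name):
    """Searching needs no join: everything is inside each order document."""
    return [
        d["order_id"]
        for d in docs
        if d["customer"]["name"] == customer_name
        and any(p["name"] == product_name for p in d["products"])
    ]


def rename_product(docs, product_id, new_name):
    """Renaming is one write in the normalized data, but one per affected order here."""
    products[product_id]["name"] = new_name
    touched = 0
    for d in docs:
        changed = False
        for p in d["products"]:
            if p["id"] == product_id:
                p["name"] = new_name
                changed = True
        if changed:
            touched += 1
    return touched


if __name__ == "__main__":
    print(search_orders(orders, "Blue Widget", "Alice"))
    print(rename_product(orders, "p1", "Azure Widget"), "orders rewritten")
```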

In other words, with document oriented databases like Elasticsearch, we design our mappings and store our documents such that it’s optimized for search and retrieval.

As mentioned in the introduction, Elasticsearch has a concept of “query time” joining with parent/child-relations, and “index time” joining with nested types. We’ll probably cover this in more depth in a future article. In the meantime, we can recommend Martijn van Groningen’s presentation “Document relations with Elasticsearch”.

Most relational databases also let you specify constraints to define what is and isn’t consistent. For example, referential integrity and uniqueness can be enforced. You can require that the sum of account movements must be positive and so on. Document oriented databases tend not to do this, and Elasticsearch is no different.
