Translated on 2014/02/13 09:32
Can Elasticsearch be used as a "NoSQL"-database? NoSQL means different things in different contexts, and interestingly it's not really about SQL. We will start out with a "Maybe!", and look into the various properties of Elasticsearch as well as those it has sacrificed, in order to become one of the most flexible, scalable and performant search and analytics engines yet.
nosql-database.org defines NoSQL as “Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.” In other words, it’s not a very precise definition.
It’s not about SQL in particular. For example, Hive’s query language is clearly inspired by SQL. The same is true for Esper’s query language, which operates on streams instead of relations. Also, did you know PostgreSQL was once named “Postgres” and had “Quel” as its query language back in the day? While first and foremost an ORDBMS, it now also has many features that make it viable as a schemaless document store.
It’s not about ACID-ity either. Hyperdex is one example of a NoSQL-database that aims to provide ACID-transactions. MySQL, certainly an SQL-database, has a history of dubious interpretations of what ACID really means.
Relations? While most of the NoSQL-databases do not support joining in the same sense as traditional relational databases and leave that as an exercise for the user, there are those that do. RethinkDB, Hive and Pig, to name a few. Neo4j, a graph-oriented database, certainly deals with relations - it’s excellent at traversing relations (i.e. edges) in graphs. Elasticsearch has a concept of “query time” joining with parent/child-relations and “index time” joining with nested types.
To summarize the summary, it neither makes sense to precisely define NoSQL, nor to simply say that Elasticsearch is a “document store”-type NoSQL-database. At the time of writing, nosql-database.org lists more than 20 of those.
In the next sections, we’ll have a look at some important properties and see how Elasticsearch does or does not implement them.
Lucene, which Elasticsearch is built on, has a notion of transactions. Elasticsearch, on the other hand, does not have transactions in the typical sense. There is no way to roll back a submitted document, and you cannot submit a group of documents and have either all or none of them indexed. What it does have, however, is a write-ahead log to ensure the durability of operations without having to do an expensive Lucene commit. You can also specify the consistency level of index operations, in terms of how many replicas must acknowledge the operation before returning. This defaults to a quorum, i.e. ⌊n/2⌋ + 1.
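As a small illustration of the quorum rule above, the following sketch computes the majority size for a given number of copies. The function name `quorum` and the reading of n as the total number of shard copies are assumptions made for this example. It shows only the arithmetic; Elasticsearch's own handling of small replica counts may differ.

```python
def quorum(copies: int) -> int:
    """Smallest majority of `copies` shard copies: floor(copies / 2) + 1.

    `copies` is assumed here to count every copy of a shard; this is an
    illustrative helper, not an Elasticsearch API.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return copies // 2 + 1


# Majority sizes for a few copy counts.
for n in (1, 2, 3, 5):
    print(f"{n} copies -> quorum of {quorum(n)}")
```

With three copies, two must acknowledge the write; with five, three must.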
Changes become visible to searches when an index is refreshed, which by default happens once per second, on a shard-by-shard basis.
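The effect of refreshing can be pictured with a toy model. The `ToyShard` class below is purely hypothetical and says nothing about how Elasticsearch or Lucene are implemented internally; it only shows the observable behaviour that a document is accepted first and becomes searchable at the next refresh.

```python
class ToyShard:
    """Hypothetical in-memory shard: writes are buffered until refresh()."""

    def __init__(self):
        self._buffer = []      # accepted, but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Make everything indexed since the last refresh searchable.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, predicate):
        return [doc for doc in self._searchable if predicate(doc)]


shard = ToyShard()
shard.index({"title": "hello"})
before = shard.search(lambda doc: True)  # nothing visible yet
shard.refresh()
after = shard.search(lambda doc: True)   # the document is now visible
```

Because each shard refreshes on its own schedule, two searches issued at the same moment against different shards may see slightly different states.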
Optimistic concurrency control is done by specifying the version of the submitted documents.
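A minimal sketch of the idea, using a hypothetical in-memory store rather than Elasticsearch itself: every document carries a version number, and a write that names a stale version is rejected instead of silently overwriting a newer one. The names `ToyStore`, `VersionConflict` and `expected_version` are invented for this example.

```python
class VersionConflict(Exception):
    """Raised when a write names a version that is no longer current."""


class ToyStore:
    """Hypothetical versioned document store (not an Elasticsearch client)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(
                f"expected version {expected_version}, found {current}"
            )
        self._docs[doc_id] = (current + 1, source)
        return current + 1


store = ToyStore()
store.put("1", {"stock": 10})      # creates version 1
version, doc = store.get("1")      # two writers both read version 1

# First writer wins and bumps the document to version 2.
store.put("1", {"stock": 9}, expected_version=version)

# Second writer still holds version 1, so its write is rejected.
try:
    store.put("1", {"stock": 8}, expected_version=version)
    conflict = False
except VersionConflict:
    conflict = True
```

The rejected writer would typically re-read the document, reapply its change and try again.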
Elasticsearch is built for speed. Doing distributed transactions is a lot of work. Not providing them makes a lot of things easier. By accepting that what we read can be somewhat stale, and that not everyone sees the same timeline, Elasticsearch can serve a lot of things from caches, which is paramount for the mind-boggling performance we love it for.
Elasticsearch does not require you to specify a schema upfront. Throw a JSON-document at it, and it will do some educated guessing to infer its type. It does a good job at things like numerics, booleans and timestamps. For strings, it will use the “standard”-analyzer, which is usually good to get started.
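To make the "educated guessing" concrete, here is a rough sketch of inferring field types from a JSON document. It is not Elasticsearch's actual detection logic; the function `guess_type`, the ISO-8601-only date check and the type labels are simplifying assumptions for illustration. `datetime.fromisoformat` requires Python 3.7 or later.

```python
import json
from datetime import datetime


def guess_type(value):
    """Guess a field type for a parsed JSON value (illustrative only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return {key: guess_type(item) for key, item in value.items()}
    if isinstance(value, list):
        return guess_type(value[0]) if value else "unknown"
    return "unknown"


doc = json.loads(
    '{"name": "Widget", "price": 9.5, "in_stock": true, '
    '"count": 3, "created": "2014-02-13T09:32:00"}'
)
mapping = guess_type(doc)
```

Guessing like this is a convenient starting point, but as soon as a field needs a particular analyzer or a stricter type, an explicit mapping is the better tool.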
While it’s arguably “schema free”, in the sense that you don’t have to specify a schema, we like to think of it as “schema flexible” instead. To develop great search and/or analytics, you really need to tweak your schemas. Elasticsearch has an extensive set of powerful tools to help you, like dynamic templates, multi-field objects, etc. This is covered in more detail in our article on mapping.
Elasticsearch is a document-oriented database. The entire object graph you want to search needs to be indexed, so before indexing your documents, they must be denormalized. Denormalization increases retrieval performance (since no query-time joining is necessary), but it uses more space (because things must be stored several times) and makes keeping things consistent and up-to-date more difficult (as any change must be applied to all instances). Denormalized document stores are excellent for write-once-read-many workloads, however.
For example, say you have set up a database containing customers, orders and products, and you want to search for orders given the name of a product and a customer. This could be solved by indexing orders with all the necessary information about the customer and the products. Searching is then easy, but what happens when you want to change the name of a product? In a relational design with proper normalization, you would simply update the product and be done. That’s what relational databases are really good at. With a denormalized document database, every order containing the product would have to be updated.
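The trade-off can be sketched in a few lines of plain Python. The data, field names and helper functions below are made up for illustration and do not involve Elasticsearch: orders are flattened into self-contained documents, which makes searching trivial but means a product rename has to touch every order that contains the product.

```python
# Normalized source data (hypothetical).
customers = {1: {"name": "Alice"}}
products = {10: {"name": "Blue Widget"}, 11: {"name": "Red Gadget"}}
orders = [
    {"id": 100, "customer_id": 1, "product_ids": [10, 11]},
    {"id": 101, "customer_id": 1, "product_ids": [10]},
]


def denormalize(order):
    """Build a self-contained order document with names copied in."""
    return {
        "order_id": order["id"],
        "customer_name": customers[order["customer_id"]]["name"],
        "product_names": [products[p]["name"] for p in order["product_ids"]],
    }


index = [denormalize(order) for order in orders]


def search(product_name, customer_name):
    """Find order ids by product and customer name; no joins needed."""
    return [
        doc["order_id"]
        for doc in index
        if product_name in doc["product_names"]
        and doc["customer_name"] == customer_name
    ]


def rename_product(product_id, new_name):
    """Rename a product and rebuild every order document that contains it.

    Returns the number of order documents that had to be rewritten.
    """
    products[product_id]["name"] = new_name
    rebuilt = 0
    for position, order in enumerate(orders):
        if product_id in order["product_ids"]:
            index[position] = denormalize(order)
            rebuilt += 1
    return rebuilt


hits_before = search("Blue Widget", "Alice")
rebuilt = rename_product(10, "Green Widget")  # one rename, two rewrites
hits_after = search("Green Widget", "Alice")
```

In the normalized data the rename is a single update; in the denormalized index its cost grows with the number of orders that mention the product.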
In other words, with document-oriented databases like Elasticsearch, we design our mappings and store our documents such that they are optimized for search and retrieval.
As mentioned in the introduction, Elasticsearch has a concept of “query time” joining with parent/child-relations, and “index time” joining with nested types. We’ll probably cover this in more depth in a future article. In the meantime, we can recommend Martijn van Groningen’s presentation “Document relations with Elasticsearch”.
Most relational databases also let you specify constraints to define what is and isn’t consistent. For example, referential integrity and uniqueness can be enforced. You can require that the sum of account movements must be positive and so on. Document oriented databases tend not to do this, and Elasticsearch is no different.