How to use MongoDB as a pure in-memory DB (Redis style)

The idea

There has been a growing interest in using MongoDB as an in-memory database, meaning that the data is not stored on disk at all. This can be super useful for applications like:

  • a write-heavy cache in front of a slower RDBMS system
  • embedded systems
  • PCI compliant systems where no data should be persisted
  • unit testing where the database should be light and easily cleaned

That would be really neat indeed if it were possible: one could leverage the advanced querying / indexing capabilities of MongoDB without hitting the disk. As you probably know, disk IO (especially random IO) is the system bottleneck in 99% of cases, and if you are writing data you cannot avoid hitting the disk.

One sweet design choice of MongoDB is that it uses memory-mapped files to handle access to data files on disk. This means that MongoDB does not know the difference between RAM and disk, it just accesses bytes at offsets in giant arrays representing files and the OS takes care of the rest! It is this design decision that allows MongoDB to run in RAM with no modification.

How it is done

This is all achieved by using a special type of filesystem called tmpfs. Linux makes it appear as a regular FS, but it lives entirely in RAM (unless its contents grow larger than RAM, in which case it can swap, which can be useful!). I have 32GB of RAM on this server; let’s create a 16GB tmpfs:

# mkdir /ramdata
# mount -t tmpfs -o size=16000M tmpfs /ramdata/
# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/xvde1             5905712   4973924    871792  86% /
none                  15344936         0  15344936   0% /dev/shm
tmpfs                 16384000         0  16384000   0% /ramdata

Now let’s start MongoDB with the appropriate settings. The smallfiles and noprealloc options should be used to reduce the amount of RAM wasted, and they will not affect performance since everything is RAM based. The nojournal option should be used as well, since keeping a journal makes no sense in this context!

dbpath = /ramdata
nojournal = true
smallFiles = true
noprealloc = true
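
Assuming the four options above are saved to a config file (the path /etc/mongod-ram.conf below is just a hypothetical choice), mongod can then be started against it; --fork and --logpath simply daemonize the process and send its output to a log file:

# mongod -f /etc/mongod-ram.conf --fork --logpath /var/log/mongod-ram.log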

After starting MongoDB, you will find that it works just fine and the files are as expected in the FS:

# mongo
MongoDB shell version: 2.3.2
connecting to: test
> db.test.insert({a:1})
> db.test.find()
{ "_id" : ObjectId("51802115eafa5d80b5d2c145"), "a" : 1 }

# ls -l /ramdata/
total 65684
-rw-------. 1 root root 16777216 Apr 30 15:52 local.0
-rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root        5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root 16777216 Apr 30 15:52 test.0
-rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root       40 Apr 30 15:52 _tmp

Now let’s add some data and make sure it behaves properly. We will create a 1KB document and add 4 million of them:

> str = ""

> aaa = "aaaaaaaaaa"
aaaaaaaaaa
> for (var i = 0; i < 100; ++i) { str += aaa; }

> for (var i = 0; i < 4000000; ++i) { db.foo.insert({a: Math.random(), s: str});}
> db.foo.stats()
{
        "ns" : "test.foo",
        "count" : 4000000,
        "size" : 4544000160,
        "avgObjSize" : 1136.00004,
        "storageSize" : 5030768544,
        "numExtents" : 26,
        "nindexes" : 1,
        "lastExtentSize" : 536600560,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 129794000,
        "indexSizes" : {
                "_id_" : 129794000
        },
        "ok" : 1
}

The average document size is 1136 bytes and the data takes up about 5GB of storage. The index on _id takes about 130MB. Now we need to verify something very important: is the data duplicated in RAM, existing both within MongoDB and the filesystem? Remember that MongoDB does not buffer any data within its own process; instead, data is cached in the FS cache. Let’s drop the FS cache and see what is in RAM:

# echo 3 > /proc/sys/vm/drop_caches 
# free
             total       used       free     shared    buffers     cached
Mem:      30689876    6292780   24397096          0       1044    5817368
-/+ buffers/cache:     474368   30215508
Swap:            0          0          0

As you can see, there is 6.3GB of used RAM, of which 5.8GB sits in the FS cache (the cached column). Why is there still 5.8GB of FS cache even after all caches were dropped? The reason is that Linux is smart and does not duplicate the pages between tmpfs and its cache… Bingo! That means your data exists as a single copy in RAM. Let’s access all documents and verify RAM usage is unchanged:

> db.foo.find().itcount()
4000000

# free
             total       used       free     shared    buffers     cached
Mem:      30689876    6327988   24361888          0       1324    5818012
-/+ buffers/cache:     508652   30181224
Swap:            0          0          0
# ls -l /ramdata/
total 5808780
-rw-------. 1 root root  16777216 Apr 30 15:52 local.0
-rw-------. 1 root root  16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root         5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root  16777216 Apr 30 16:00 test.0
-rw-------. 1 root root  33554432 Apr 30 16:00 test.1
-rw-------. 1 root root 536608768 Apr 30 16:02 test.10
-rw-------. 1 root root 536608768 Apr 30 16:03 test.11
-rw-------. 1 root root 536608768 Apr 30 16:03 test.12
-rw-------. 1 root root 536608768 Apr 30 16:04 test.13
-rw-------. 1 root root 536608768 Apr 30 16:04 test.14
-rw-------. 1 root root  67108864 Apr 30 16:00 test.2
-rw-------. 1 root root 134217728 Apr 30 16:00 test.3
-rw-------. 1 root root 268435456 Apr 30 16:00 test.4
-rw-------. 1 root root 536608768 Apr 30 16:01 test.5
-rw-------. 1 root root 536608768 Apr 30 16:01 test.6
-rw-------. 1 root root 536608768 Apr 30 16:04 test.7
-rw-------. 1 root root 536608768 Apr 30 16:03 test.8
-rw-------. 1 root root 536608768 Apr 30 16:02 test.9
-rw-------. 1 root root  16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root        40 Apr 30 16:04 _tmp
# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/xvde1             5905712   4973960    871756  86% /
none                  15344936         0  15344936   0% /dev/shm
tmpfs                 16384000   5808780  10575220  36% /ramdata

And that verifies it! :)

What about replication?

You probably want to use replication, since a server loses its RAM data upon reboot! With a standard replica set you get automatic failover and more read capacity. If a server is rebooted, MongoDB will automatically rebuild its data by pulling it from another server in the same replica set (a resync). This should be fast enough even with a lot of data and indices, since all operations are RAM only :)
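
As a minimal sketch (the replica set name, host names and log path below are made up for illustration), every RAM-only mongod would be started with the same --replSet name and the set initiated once from any member:

# mongod -f /etc/mongod-ram.conf --replSet ramset --fork --logpath /var/log/mongod-ram.log    # on each node
# mongo --eval 'rs.initiate({_id: "ramset", members: [{_id: 0, host: "node1:27017"}, {_id: 1, host: "node2:27017"}, {_id: 2, host: "node3:27017"}]})'    # once, on node1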

It is important to remember that write operations get recorded in a special capped collection called the oplog, which resides in the local database and by default takes up 5% of the volume. In my case the oplog would take 5% of 16GB, which is 800MB. If in doubt, it is safer to choose a fixed oplog size using the oplogSize option. If a secondary is down for longer than the window the oplog covers, it will have to be resynced. To set it to 1GB, use:

oplogSize = 1000
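
Once the set is running, the configured oplog size and the time window it currently covers can be checked with the standard shell helper (the output will of course vary):

# mongo --eval 'db.printReplicationInfo()'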

What about sharding?

Now that you have all the querying capabilities of MongoDB, what if you want to implement a large service with it? Well, you can freely use sharding to implement a large, scalable in-memory store. The config servers (which store the chunk distribution) should still be disk based, since their activity is small and rebuilding a cluster from scratch is no fun.
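
As a rough sketch, reusing the test.foo collection from the example above and an illustrative shard key on its random field a (the mongos host name is made up, and a real deployment deserves more thought about the shard key):

# mongo --host mongos1 test --eval 'db.foo.ensureIndex({a: 1}); sh.enableSharding("test"); sh.shardCollection("test.foo", {a: 1})'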

What to watch for

RAM is a scarce resource, and in this case you definitely want the entire data set to fit in RAM. Even though tmpfs can resort to swapping, performance would drop dramatically. To make the best use of the RAM you should consider:

  • using the usePowerOf2Sizes option to normalize the storage buckets (see the sketch after this list)
  • running the compact command or resyncing the node periodically
  • using a schema design that is fairly normalized (avoid large document growth)
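
A minimal sketch of the first two items from the shell, using the foo collection from the example above (note that compact blocks the database while it runs, and on a replica-set primary it additionally requires force: true):

# mongo test --eval 'db.runCommand({collMod: "foo", usePowerOf2Sizes: true})'    # new record allocations use power-of-2 bucket sizes
# mongo test --eval 'db.runCommand({compact: "foo"})'                            # rewrite and defragment the collection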

Conclusion

Sweet, you can now use MongoDB and all its features as an in-memory, RAM-only store! Its performance should be pretty impressive: during the test, with a single thread / core, I was achieving 20k writes per second, and it should scale linearly with the number of cores.
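
For what it is worth, a crude way to reproduce that kind of single-threaded measurement from the shell (just a rough client-side loop, not a proper benchmark; the collection name bench is arbitrary and the numbers depend entirely on your hardware):

# mongo test --eval 'var n = 100000, t0 = Date.now(); for (var i = 0; i < n; ++i) { db.bench.insert({a: Math.random()}); } print(n / ((Date.now() - t0) / 1000) + " inserts/sec")'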

Comments (32)

hwclegend
[quoting 王瑞平's comment]
"Put it this way: as long as the memory is reliable and sufficient, it is not a problem. But once the memory fails, any data that has not yet made it to disk is gone for good and cannot be recovered. I once had a memory module go bad; the disk RAID was written full of corrupted data and the system would not boot, and in the end we could only restore from a separate offline backup. You can pick an in-memory database only when you truly do not mind losing a window of data; that risk of data loss, whether 1% or even 0.01%, cannot be ignored. Choosing a technical architecture is not about chasing fashion; think through whether it fits before you commit."
This usage is super practical for scenarios like: a write-heavy cache in front of a slower RDBMS, embedded systems, PCI compliant systems where no data should be persisted, and unit testing that needs a lightweight database whose data can easily be wiped. It only suits specific scenarios, though; using it purely to boost performance is a bit much. For example, storing user sessions for a mobile game would be a good fit.

ktprime
Since Redis 2.6 added Lua scripting support, a lot of data analysis work can be done on the Redis server itself, and the bandwidth/performance gains are substantial.

智能矩阵
[quoting 0小小鱼0's comment]
"A timid question: pricing is the real key issue... it determines what you compare it against."
Right. It's like playing badminton in a gym versus by the roadside: the energy you burn may well be the same.

0小小鱼0
A timid question: pricing is the real key issue... it determines what you compare it against.

IUnKnown
Come to think of it, I had never even heard of such an awesome home-grown product. I'm out of the loop...

智能矩阵
[quoting M_Ittrue's comment]
"How is the concurrency?"
Concurrency is gained through fast response times, and the database has QoS built in. We have that; I can't speak for others.

M_Ittrue
How is the concurrency?

一头面向对象的牛
MemSQL is actually mediocre; it is not that fast.

智能矩阵
[quoting IUnKnown's comment]
"That Matrix thing looks pretty impressive; does it crush Redis?"
Our single core beats its eight cores.

智能矩阵
[quoting 刀尖红叶's comment]
"Is this really interesting? Put MySQL's data directory on tmpfs and the speed is just as good, isn't it? Besides, Linux already tries to use memory for data caching as much as it can. In our own tests with 128GB of RAM, inserting 1 billion records, MongoDB reached about 70k inserts per second. MongoDB is really already half an in-memory database on its own."
Back in 2006 we hit 800,000 per second on a Celeron 2.4...