加载中

The idea

There has been a growing interest in using MongoDB as an in-memory database, meaning that the data is not stored on disk at all. This can be super useful for applications like:

  • a write-heavy cache in front of a slower RDBMS system
  • embedded systems
  • PCI compliant systems where no data should be persisted
  • unit testing where the database should be light and easily cleaned

That would be really neat indeed if it was possible: one could leverage the advanced querying / indexing capabilities of MongoDB without hitting the disk. As you probably know the disk IO (especially random) is the system bottleneck in 99% of cases, and if you are writing data you cannot avoid hitting the disk.

One sweet design choice of MongoDB is that it uses memory-mapped files to handle access to data files on disk. This means that MongoDB does not know the difference between RAM and disk, it just accesses bytes at offsets in giant arrays representing files and the OS takes care of the rest! It is this design decision that allows MongoDB to run in RAM with no modification.

基本思想

将MongoDB用作内存数据库(in-memory database),也即,根本就不让MongoDB把数据保存到磁盘中的这种用法,引起了越来越多的人的兴趣。这种用法对于以下应用场合来讲,超实用:

  • 置于慢速RDBMS系统之前的写操作密集型高速缓存
  • 嵌入式系统
  • 无需持久化数据的PCI兼容系统
  • 需要轻量级数据库而且库中数据可以很容易清除掉的单元测试(unit testing)

如果这一切可以实现就真是太优雅了:我们就能够巧妙地在不涉及磁盘操作的情况下利用MongoDB的查询/检索功能。可能你也知道,在99%的情况下,磁盘IO(特别是随机IO)是系统的瓶颈,而且,如果你要写入数据的话,磁盘操作是无法避免的。

MongoDB有一个非常酷的设计决策,就是她可以使用内存影射文件(memory-mapped file)来处理对磁盘文件中数据的读写请求。这也就是说,MongoDB并不对RAM和磁盘这两者进行区别对待,只是将文件看作一个巨大的数组,然后按照字节为单位访问其中的数据,剩下的都交由操作系统(OS)去处理!就是这个设计决策,才使得MongoDB可以无需任何修改就能够运行于RAM之中。

How it is done

This is all achieved by using a special type of filesystem called tmpfs. Linux will make it appear as a regular FS but it is entirely located in RAM (unless it is larger than RAM in which case it can swap, which can be useful!). I have 32GB RAM on this server, let’s create a 16GB tmpfs:

# mkdir /ramdata
# mount -t tmpfs -o size=16000M tmpfs /ramdata/
# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/xvde1             5905712   4973924    871792  86% /
none                  15344936         0  15344936   0% /dev/shm
tmpfs                 16384000         0  16384000   0% /ramdata

Now let’s start MongoDB with the appropriate settings. smallfiles and noprealloc should be used to reduce the amount of RAM wasted, and will not affect performance since it’s all RAM based. nojournal should be used since it does not make sense to have a journal in this context!

dbpath=/ramdata
nojournal = true
smallFiles = true
noprealloc = true

After starting MongoDB, you will find that it works just fine and the files are as expected in the FS:

# mongo
MongoDB shell version: 2.3.2
connecting to: test
> db.test.insert({a:1})
> db.test.find()
{ "_id" : ObjectId("51802115eafa5d80b5d2c145"), "a" : 1 }

# ls -l /ramdata/
total 65684
-rw-------. 1 root root 16777216 Apr 30 15:52 local.0
-rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root        5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root 16777216 Apr 30 15:52 test.0
-rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root       40 Apr 30 15:52 _tmp

Now let’s add some data and make sure it behaves properly. We will create a 1KB document and add 4 million of them:

> str = ""

> aaa = "aaaaaaaaaa"
aaaaaaaaaa
> for (var i = 0; i < 100; ++i) { str += aaa; }

> for (var i = 0; i < 4000000; ++i) { db.foo.insert({a: Math.random(), s: str});}
> db.foo.stats()
{
        "ns" : "test.foo",
        "count" : 4000000,
        "size" : 4544000160,
        "avgObjSize" : 1136.00004,
        "storageSize" : 5030768544,
        "numExtents" : 26,
        "nindexes" : 1,
        "lastExtentSize" : 536600560,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 129794000,
        "indexSizes" : {
                "_id_" : 129794000
        },
        "ok" : 1
}

实现方法

这一切都是通过使用一种叫做tmpfs的特殊类型文件系统实现的。在Linux中它看上去同常规的文件系统(FS)一样,只是它完全位于RAM中(除非其大小超过了RAM的大小,此时它还可以进行swap,这个非常有用!)。我的服务器中有32GB的RAM,下面让我们创建一个16GB的 tmpfs:

# mkdir /ramdata
# mount -t tmpfs -o size=16000M tmpfs /ramdata/
# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/xvde1             5905712   4973924    871792  86% /
none                  15344936         0  15344936   0% /dev/shm
tmpfs                 16384000         0  16384000   0% /ramdata

接下来要用适当的设置启动MongoDB。为了减小浪费的RAM数量,应该把smallfilesnoprealloc设置为true。既然现在是基于RAM的,这么做完全不会降低性能。此时再使用journal就毫无意义了,所以应该把nojournal设置为true。

dbpath=/ramdata
nojournal = true
smallFiles = true
noprealloc = true

MongoDB启动之后,你会发现她运行得非常好,文件系统中的文件也正如期待的那样出现了:

# mongo
MongoDB shell version: 2.3.2
connecting to: test
> db.test.insert({a:1})
> db.test.find()
{ "_id" : ObjectId("51802115eafa5d80b5d2c145"), "a" : 1 }

# ls -l /ramdata/
total 65684
-rw-------. 1 root root 16777216 Apr 30 15:52 local.0
-rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root        5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root 16777216 Apr 30 15:52 test.0
-rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root       40 Apr 30 15:52 _tmp

现在让我们添加一些数据,证实一下其运行完全正常。我们先创建一个1KB的document,然后将它添加到MongoDB中4百万次:

> str = ""

> aaa = "aaaaaaaaaa"
aaaaaaaaaa
> for (var i = 0; i < 100; ++i) { str += aaa; }

> for (var i = 0; i < 4000000; ++i) { db.foo.insert({a: Math.random(), s: str});}
> db.foo.stats()
{
        "ns" : "test.foo",
        "count" : 4000000,
        "size" : 4544000160,
        "avgObjSize" : 1136.00004,
        "storageSize" : 5030768544,
        "numExtents" : 26,
        "nindexes" : 1,
        "lastExtentSize" : 536600560,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 129794000,
        "indexSizes" : {
                "_id_" : 129794000
        },
        "ok" : 1
}
The document average size is 1136 bytes and it takes up about 5GB of storage. The index on _id takes about 130MB. Now we need to verify something very important: is the data duplicated in RAM, existing both within MongoDB and the filesystem? Remember that MongoDB does not buffer any data within its own process, instead data is cached in the FS cache. Let’s drop the FS cache and see what is in RAM:

# echo 3 > /proc/sys/vm/drop_caches 
# free
             total       used       free     shared    buffers     cached
Mem:      30689876    6292780   24397096          0       1044    5817368
-/+ buffers/cache:     474368   30215508
Swap:            0          0          0

As you can see there is 6.3GB of used RAM of which 5.8GB is in FS cache (buffers). Why is there still 5.8GB of FS cache even after all caches were dropped?? The reason is that Linux is smart and it does not duplicate the pages between tmpfs and its cache… Bingo! That means your data exists with a single copy in RAM. Let’s access all documents and verify RAM usage is unchanged:

> db.foo.find().itcount()
4000000

# free
             total       used       free     shared    buffers     cached
Mem:      30689876    6327988   24361888          0       1324    5818012
-/+ buffers/cache:     508652   30181224
Swap:            0          0          0
# ls -l /ramdata/
total 5808780
-rw-------. 1 root root  16777216 Apr 30 15:52 local.0
-rw-------. 1 root root  16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root         5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root  16777216 Apr 30 16:00 test.0
-rw-------. 1 root root  33554432 Apr 30 16:00 test.1
-rw-------. 1 root root 536608768 Apr 30 16:02 test.10
-rw-------. 1 root root 536608768 Apr 30 16:03 test.11
-rw-------. 1 root root 536608768 Apr 30 16:03 test.12
-rw-------. 1 root root 536608768 Apr 30 16:04 test.13
-rw-------. 1 root root 536608768 Apr 30 16:04 test.14
-rw-------. 1 root root  67108864 Apr 30 16:00 test.2
-rw-------. 1 root root 134217728 Apr 30 16:00 test.3
-rw-------. 1 root root 268435456 Apr 30 16:00 test.4
-rw-------. 1 root root 536608768 Apr 30 16:01 test.5
-rw-------. 1 root root 536608768 Apr 30 16:01 test.6
-rw-------. 1 root root 536608768 Apr 30 16:04 test.7
-rw-------. 1 root root 536608768 Apr 30 16:03 test.8
-rw-------. 1 root root 536608768 Apr 30 16:02 test.9
-rw-------. 1 root root  16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root        40 Apr 30 16:04 _tmp
# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/xvde1             5905712   4973960    871756  86% /
none                  15344936         0  15344936   0% /dev/shm
tmpfs                 16384000   5808780  10575220  36% /ramdata

And that verifies it! :)

可以看出,其中的document平均大小为1136字节,数据总共占用了5GB的空间。_id之上的索引大小为130MB。现在我们需要验证一件 非常重要的事情:RAM中的数据有没有重复,是不是在MongoDB和文件系统中各保存了一份?还记得MongoDB并不会在她自己的进程内缓存任何数据,她的数据只会缓存到文件系统的缓存之中。那我们来清除一下文件系统的缓存,然后看看RAM中还有有什么数据:
# echo 3 > /proc/sys/vm/drop_caches 
# free
             total       used       free     shared    buffers     cached
Mem:      30689876    6292780   24397096          0       1044    5817368
-/+ buffers/cache:     474368   30215508
Swap:            0          0          0

可以看到,在已使用的6.3GB的RAM中,有5.8GB用于了文件系统的缓存(缓冲区,buffer)。为什么即使在清除所有缓存之后,系统中仍然还有5.8GB的文件系统缓存??其原因是,Linux非常聪明,她不会在tmpfs和缓存中保存重复的数据。太棒了!这就意味着,你在RAM只有一份数据。下面我们访问一下所有的document,并验证一下,RAM的使用情况不会发生变化:

> db.foo.find().itcount()
4000000

# free
             total       used       free     shared    buffers     cached
Mem:      30689876    6327988   24361888          0       1324    5818012
-/+ buffers/cache:     508652   30181224
Swap:            0          0          0
# ls -l /ramdata/
total 5808780
-rw-------. 1 root root  16777216 Apr 30 15:52 local.0
-rw-------. 1 root root  16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root         5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root  16777216 Apr 30 16:00 test.0
-rw-------. 1 root root  33554432 Apr 30 16:00 test.1
-rw-------. 1 root root 536608768 Apr 30 16:02 test.10
-rw-------. 1 root root 536608768 Apr 30 16:03 test.11
-rw-------. 1 root root 536608768 Apr 30 16:03 test.12
-rw-------. 1 root root 536608768 Apr 30 16:04 test.13
-rw-------. 1 root root 536608768 Apr 30 16:04 test.14
-rw-------. 1 root root  67108864 Apr 30 16:00 test.2
-rw-------. 1 root root 134217728 Apr 30 16:00 test.3
-rw-------. 1 root root 268435456 Apr 30 16:00 test.4
-rw-------. 1 root root 536608768 Apr 30 16:01 test.5
-rw-------. 1 root root 536608768 Apr 30 16:01 test.6
-rw-------. 1 root root 536608768 Apr 30 16:04 test.7
-rw-------. 1 root root 536608768 Apr 30 16:03 test.8
-rw-------. 1 root root 536608768 Apr 30 16:02 test.9
-rw-------. 1 root root  16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root        40 Apr 30 16:04 _tmp
# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/xvde1             5905712   4973960    871756  86% /
none                  15344936         0  15344936   0% /dev/shm
tmpfs                 16384000   5808780  10575220  36% /ramdata

果不其然! :)

What about replication?

You probably want to use replication since a server loses its RAM data upon reboot! Using a standard replica set you will get automatic failover and more read capacity. If a server is rebooted MongoDB will automatically rebuild its data by pulling it from another server in the same replica set (resync). This should be fast enough even in cases with a lot of data and indices since all operations are RAM only :)

It is important to remember that write operations get written to a special collection called oplog which resides in the local database and takes 5% of the volume by default. In my case the oplog would take 5% of 16GB which is 800MB. In doubt, it is safer to choose a fixed oplog size using the oplogSize option. If a secondary server is down for a longer time than the oplog contains, it will have to be resynced. To set it to 1GB, use:

oplogSize = 1000

复制(replication)呢?

既然服务器在重启时RAM中的数据都会丢失,所以你可能会想使用复制。采用标准的副本集(replica set)就能够获得自动故障转移(failover),还能够提高数据读取能力(read capacity)。如果有服务器重启了,它就可以从同一个副本集中另外一个服务器中读取数据从而重建自己的数据(重新同步,resync)。即使在大量数据和索引的情况下,这个过程也会足够快,因为索引操作都是在RAM中进行的 :)

有一点很重要,就是写操作会写入一个特殊的叫做oplog的collection,它位于local数据库之中。缺省情况下,它的大小是总数据量的5%。在我这种情况下,oplog会占有16GB的5%,也就是800MB的空间。在拿不准的情况下,比较安全的做法是,可以使用oplogSize这个选项为oplog选择一个固定的大小。如果备选服务器宕机时间超过了oplog的容量,它就必须要进行重新同步了。要把它的大小设置为1GB,可以这样:

oplogSize = 1000

What about sharding?

Now that you have all the querying capabilities of MongoDB, what if you want to implement a large service with it? Well you can use sharding freely to implement a large scalable in-memory store. Still the config servers (that contain the chunk distribution) should be disk based since their activity is small and rebuilding a cluster from scratch is not fun.

What to watch for

RAM is a scarce resource, and in this case you definitely want the entire data set to fit in RAM. Even though tmpfs can resort to swapping the performance would drop dramatically. To make best use of the RAM you should consider:

  • usePowerOf2Sizes option to normalize the storage buckets
  • run a compact command or resync the node periodically.
  • use a schema design that is fairly normalized (avoid large document growth)

Conclusion

Sweet, you can now use MongoDB and all its features as an in-memory RAM-only store! Its performance should be pretty impressive: during the test with a single thread / core I was achieving 20k writes per second, and it should scale linearly over the number of cores.

分片(sharding)呢?

既然拥有了MongoDB所有的查询功能,那么用它来实现一个大型的服务要怎么弄?你可以随心所欲地使用分片来实现一个大型可扩展的内存数据库。配置服务器(保存着数据块分配情况)还还是用过采用基于磁盘的方案,因为这些服务器的活动数量不大,老从头重建集群可不好玩。

注意事项

RAM属稀缺资源,而且在这种情况下你一定想让整个数据集都能放到RAM中。尽管tmpfs具有借助于磁盘交换(swapping)的能力,但其性能下降将非常显著。为了充分利用RAM,你应该考虑:

  • 使用usePowerOf2Sizes选项对存储bucket进行规范化
  • 定期运行compact命令或者对节点进行重新同步(resync)
  • schema的设计要相当规范化(以避免出现大量比较大的document)

结论

宝贝,你现在就能够将MongoDB用作内存数据库了,而且还能使用她的所有功能!性能嘛,应该会相当惊人:我在单线程/核的情况下进行测试,可以达到每秒20K个写入的速度,而且增加多少个核就会再增加多少倍的写入速度。

返回顶部
顶部