MPI + pthreads problem: two weeks of debugging on barely 90 lines of code, hoping someone can give me a lead

yu71941628 posted on 2014/03/07 20:05

I have been debugging the program below for two weeks and still cannot find what is wrong. It is only about 90 lines, so I would really appreciate any advice.

The program runs fine on my own Linux machine, but it misbehaves on our lab cluster, at the Shanghai Supercomputer Center, and at the Jinan Supercomputer Center.

The logic is simple: two MPI processes are started. The master process (id = 0) listens on tag 0; every second the slave (id = 1) sends a message to the master on tag 0 and then listens on tag 100 for an ack; when the master receives a message it sends an ack back to the slave on tag 100.

Symptom: after running for a while the program deadlocks. The master is stuck in the MPI_Recv listening on tag 0, and the slave is stuck sending to the master on tag 0.
In theory this pair of stuck MPI calls matches, so I do not understand why the program hangs after running for a while.


Some clues:
The program does NOT hang in any of the following cases:
1. Adding a sleep() right after pthread_create(&tid,&attr,master_server_handler,NULL); in void *master_server(void *null_arg): the program does not hang.

2. Changing pthread_create(&tid,&attr,master_server_handler,NULL); to create a joinable thread and reclaiming it with pthread_join: the program does not hang.
3. Calling master_server_handler directly instead of creating a thread with pthread_create(&tid,&attr,master_server_handler,NULL);: the program does not hang.
4. Replacing the MPI_Ssend in void *master_server_handler(void *arg) with MPI_Send: the program does not hang.
(All of these variants can be found as comments in the source code below.)


But why does my current program hang while running? I have no idea. The same behavior occurs with both Open MPI and MPICH2.
Any hint or lead would be appreciated; I have genuinely run out of ideas.

If the code runs fine on your machines and you need my actual environment, I can give you VPN access to my lab so you can log in to our machines and take a look. (QQ: 674322022)

PS: I built Open MPI and MPICH2 with MPI thread support enabled. Open MPI configure options: --with-threads=poxis --enable-mpi-thread-multiple; MPICH2 option: --enable-threads
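(As a sanity check, and assuming the standard MPI-2 call MPI_Query_thread is available in both libraries, a minimal standalone sketch like the one below can print the thread level the runtime actually grants; this is not part of the program in question.)

#include <stdio.h>
#include "mpi.h"

/* Minimal sketch: report which thread support level the MPI runtime
   grants when the program is started with plain MPI_Init. */
int main(int argc, char *argv[])
{
        int provided;

        MPI_Init(&argc, &argv);
        MPI_Query_thread(&provided);
        printf("granted thread level: %d (MPI_THREAD_MULTIPLE = %d)\n",
               provided, MPI_THREAD_MULTIPLE);
        MPI_Finalize();

        return 0;
}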


--------------------------------------------------------------------------------------------------------------------
The lab machines run CentOS; uname -a output is as follows.
Linux node5 2.6.18-238.12.1.el5xen #1 SMP Tue May 31 14:02:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Command used to run it: mpiexec -n 2 ./a.out

Below are the source code, the program output, and the gdb stack traces taken while the program was hung.



Source code:
--------------------------------------------------------------------------------------------------------------------
#include "stdio.h"
#include "pthread.h"
#include "stdlib.h"
#include "string.h"
#include "mpi.h"

void send_heart_beat();
void *heart_beat_daemon(void *null_arg);
void *master_server(void *null_arg);
void *master_server_handler(void *arg);

int main(int argc,char *argv[])
{
        int p,id;
        pthread_t tid;

        MPI_Init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD,&p);
        MPI_Comm_rank(MPI_COMM_WORLD,&id);

        if(id==0)
        {
                //master
                pthread_create(&tid,NULL,master_server,NULL);
                pthread_join(tid,NULL);
        }
        else
        {
                //slave
                pthread_create(&tid,NULL,heart_beat_daemon,NULL);
                pthread_join(tid,NULL);
        }

        MPI_Finalize();

        return 0;
}

/* slave thread: send a heartbeat to the master once per second */
void *heart_beat_daemon(void *null_arg)
{
        while(1)
        {
                sleep(1);
                send_heart_beat();
        }
}

/* send one heartbeat to the master (tag 0) and wait for its ack (tag 100) */
void send_heart_beat()
{
        char send_msg[5];
        char ack_msg[5];

        strcpy(send_msg,"AAAA");

        MPI_Ssend(send_msg,5,MPI_CHAR,0,0,MPI_COMM_WORLD);

        MPI_Recv(ack_msg,5,MPI_CHAR,0,100,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
}

/* master thread: receive heartbeats on tag 0 and spawn a detached handler for each */
void *master_server(void *null_arg)
{
        char msg[5];

        pthread_t tid;
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);

        while(1)
        {

                MPI_Recv(msg,5,MPI_CHAR,1,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

                pthread_create(&tid,&attr,master_server_handler,NULL);
//                sleep(2);
//                master_server_handler(NULL);
//                pthread_create(&tid,NULL,master_server_handler,fun_arg);
//                pthread_join(tid,NULL);
        }
}

/* handler thread: count the heartbeat and send an ack back to the slave on tag 100 */
void *master_server_handler(void *arg)
{
        static int count;
        char ack[5];

        count ++;
        printf("recved a msg %d\n",count);

        strcpy(ack,"ACK:");
        MPI_Ssend(ack,5,MPI_CHAR,1,100,MPI_COMM_WORLD);
//        MPI_Send(ack,5,MPI_CHAR,1,100,MPI_COMM_WORLD);
        return NULL;
}





Program output:
--------------------------------------------------------------------------------------------------------------------

recved a msg 1
recved a msg 2
recved a msg 3
recved a msg 4
recved a msg 5
recved a msg 6
recved a msg 7
recved a msg 8
recved a msg 9
recved a msg 10
recved a msg 11
recved a msg 12
recved a msg 13
recved a msg 14
recved a msg 15

(The program hangs here; there is no further output.)


Stack trace of the master (id = 0) while hung:
--------------------------------------------------------------------------------------------------------------------

(gdb) bt
#0  opal_progress () at runtime/opal_progress.c:175
#1  0x00002b17ed288f75 in opal_condition_wait (addr=<value optimized out>,
    count=<value optimized out>, datatype=<value optimized out>, src=1, tag=0,
    comm=0x601520, status=0x0) at ../../../../opal/threads/condition.h:99
#2  ompi_request_wait_completion (addr=<value optimized out>,
    count=<value optimized out>, datatype=<value optimized out>, src=1, tag=0,
    comm=0x601520, status=0x0) at ../../../../ompi/request/request.h:377
#3  mca_pml_ob1_recv (addr=<value optimized out>, count=<value optimized out>,
    datatype=<value optimized out>, src=1, tag=0, comm=0x601520, status=0x0)
    at pml_ob1_irecv.c:105
#4  0x00002b17ed1ef049 in PMPI_Recv (buf=0x2b17f2495120, count=5,
    type=0x601320, source=1, tag=0, comm=0x601520, status=0x0) at precv.c:78
#5  0x0000000000400d75 in master_server (null_arg=0x0) at main.c:73
#6  0x0000003b5a00683d in start_thread () from /lib64/libpthread.so.0
#7  0x0000003b594d526d in clone () from /lib64/libc.so.6



Stack trace of the slave (id = 1) while hung:
--------------------------------------------------------------------------------------------------------------------
(gdb) bt
#0  0x00002adff87ef975 in opal_atomic_cmpset_32 (btl=<value optimized out>, endpoint=<value optimized out>,
    registration=0x0, convertor=0x124e46a8, order=0 '\000', reserve=32, size=0x2adffda74fe8, flags=3)
    at ../../../../opal/include/opal/sys/amd64/atomic.h:85
#1  opal_atomic_lifo_pop (btl=<value optimized out>, endpoint=<value optimized out>, registration=0x0,
    convertor=0x124e46a8, order=0 '\000', reserve=32, size=0x2adffda74fe8, flags=3)
    at ../../../../opal/class/opal_atomic_lifo.h:100
#2  mca_btl_sm_prepare_src (btl=<value optimized out>, endpoint=<value optimized out>, registration=0x0,
    convertor=0x124e46a8, order=0 '\000', reserve=32, size=0x2adffda74fe8, flags=3) at btl_sm.c:697
#3  0x00002adff8877678 in mca_bml_base_prepare_src (sendreq=0x124e4600, bml_btl=0x124ea860, size=5, flags=0)
    at ../../../../ompi/mca/bml/bml.h:339
#4  mca_pml_ob1_send_request_start_rndv (sendreq=0x124e4600, bml_btl=0x124ea860, size=5, flags=0)
    at pml_ob1_sendreq.c:815
#5  0x00002adff8869e82 in mca_pml_ob1_send_request_start (buf=0x2adffda75100, count=5,
    datatype=<value optimized out>, dst=0, tag=0, sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x601520)
    at pml_ob1_sendreq.h:363
#6  mca_pml_ob1_send (buf=0x2adffda75100, count=5, datatype=<value optimized out>, dst=0, tag=0,
    sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x601520) at pml_ob1_isend.c:119
#7  0x00002adff87d2be6 in PMPI_Ssend (buf=0x2adffda75100, count=5, type=0x601320, dest=0, tag=0,
    comm=0x601520) at pssend.c:76
#8  0x0000000000400cf4 in send_heart_beat () at main.c:55
#9  0x0000000000400cb6 in heart_beat_daemon (null_arg=0x0) at main.c:44
#10 0x0000003b5a00683d in start_thread () from /lib64/libpthread.so.0
#11 0x0000003b594d526d in clone () from /lib64/libc.so.6

yu71941628

Solved. An expert abroad answered it. The reply is below.


MPI provides four different levels of thread support: MPI_THREAD_SINGLE, MPI_THREAD_SERIALIZED, MPI_THREAD_FUNNELED, and MPI_THREAD_MULTIPLE. In order to be able to make MPI calls from different threads concurrently, you have to initialise MPI with the MPI_THREAD_MULTIPLE level of thread support and make sure that the library actually provides that level:

int provided;

MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE)
{
    printf("Error: the MPI library doesn't provide the required thread level\n");
    MPI_Abort(MPI_COMM_WORLD, 0);
}

If you call MPI_Init instead of MPI_Init_thread, the library is free to choose whatever default thread support level its creators deemed best. For Open MPI that is MPI_THREAD_SINGLE, i.e. no support for threads. You can control the default level by setting the environment variable OMPI_MPI_THREAD_LEVEL, but that is not recommended; MPI_Init_thread should be used instead.
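Applied to the program above, a minimal sketch of the corrected initialization would look like this (only the start of main() changes; everything else stays as posted):

int main(int argc, char *argv[])
{
        int p, id, provided;
        pthread_t tid;

        /* Request full multithreaded support instead of plain MPI_Init. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
        {
                printf("Error: the MPI library doesn't provide the required thread level\n");
                MPI_Abort(MPI_COMM_WORLD, 0);
        }

        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        /* ... the rest (master_server / heart_beat_daemon threads) is unchanged ... */
}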

王咚咚
Hello, may I ask which site you posted this question on? Also, could you spare a moment to look at my question as well? Thank you. http://www.oschina.net/question/814135_161191