Python 不是 C 已翻译 100%

oschina 投递于 2015/07/03 07:40 (共 7 段, 翻译完成于 07-03)
阅读 16561
收藏 86
3
加载中

I have been using Python a lot lately, for various data  science projects.  Python is known for its ease of use.  Someone with  coding experience can use it effectively in few days.

This sounds good but there may be an issue if you program in Python as you would in another language, say C.

Let me give an example based on my own experience.  I have a  strong background with imperative languages like C and C++.  I also  have substantial experience with oldies but goodies like Lisp and  Prolog.  I also used Java, Javascript, and PHP quite a bit.  Python  should be a no brainer for me, right?  Actually it looked too easy, and I  fell in a trap: I tried to use Python as I would use C.

已有 1 人翻译此段
我来翻译

Here are some details.

As part of a recent project I  had to deal with geospatial data.  I was given a gps track with about  25,000 points, and I repeatedly needed to find the closest point on that  track given a latitude and a longitude.  My first reaction has been to  look for a code snippet that would compute the distance between two  points given their latitudes and longitudes.  It so happens that John D.  Cook made such code available in the public domain

I was all set!  All I needed to do was to write a little  Python function that returns the index of the point closest to the input  coordinates:

def closest_distance(lat,lon,trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = distance_on_unit_sphere(lat, lon, lati, loni)
        if d > md
            best = i
            d = md
    return best

where distance_on_unit_sphere is John D. Cook's function, and trkpts is an array containing the coordinates of the gps track (actually it is a pandas data frame). 

The above function is similar to what I would have written in C.  It iterates over the trkpts array, and it keeps the index of the closest point so far in the local variable best

已有 1 人翻译此段
我来翻译

So far so good, Python syntax is very different from C, but at least I could write the code I needed very quickly.

The code was quite to write, but it is rather slow to  execute.  For instance I was also given 428 waypoints with names.  I  wanted to find for each of the waypoints the closest point on the  track.  Running the search for the closest point for all the 428  waypoints takes 3 minutes 6 seconds on my laptop.

I then looked for an approximation called the Manhattan  distance.  Rather than computing the exact distance between points, I  compute the distance along the East-West axis plus the distance along  the South-North axis.  Here is the function that computes Manhattan  distance:

def manhattan_distance(lat1, lon1, lat2, lon2):
    lat = (lat1+lat2)/2.0
    return abs(lat1-lat2)+abs(math.cos(math.radians(lat))*(lon1-lon2))

已有 1 人翻译此段
我来翻译

Actually, I used an even simpler function, ignoring the  fact that 1 degree difference in latitude yields a larger distance than 1  degree difference in longitude.  Here is my simple function:

def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1-lat2)+abs(lon1-lon2)

Our closest function becomes: 

def closest_manhattan_distance1(lat,lon,trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = manhattan_distance1(lat, lon, lati, loni)
        if d > md
            best = i
            d = md
    return best

We can speed it a bit via inlining the Manhattan distance function:

def closest_manhattan_distance2(lat,lon,trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = abs(lat-lati)+abs(lon-loni)
        if d > md
            best = i
            d = md
    return best

Using that function yields the same closest points than  using John's function.  I was hoping for that, and was pleased to see my  intuition was right.  This simpler function is also faster.  Finding  the closest points for all my waypoints now takes 2minutes 37 seconds on  my laptop.  It is a nice 18% speedup.  Nice, but nothing dramatic.

已有 1 人翻译此段
我来翻译

I then decided to use Python correctly.  In this case it  means leveraging the array operations that pandas support.  These arrays  operations come from the numpy package.  Using these array operations we can get the closest point in a very concise code:

def closest(lat,lon,trkpts):
    cl = numpy.abs(trkpts.Lat - lat) + numpy.abs(trkpts.Lon - lon)
    return cl.idxmin()

This function returns the same closests points as before  for all my waypoints.  Running it for all the waypoints takes 0.5 second  on my laptop.  This is a 300x speedup over the above code!  300x, i.e.  30,000 %. This is dramatic.  The reason for that speedup is that numpy  array operations are written in C.  Hence, we get the best of both  world: we get speed comparable to C, with the conciseness of Python.

已有 1 人翻译此段
我来翻译

The lesson is clear: do not write Python code as you would  do in C.  Use numpy array operations rather than iterate on arrays.  For  me it meant a mental shift.

Update on July 2, 2015.  That post is discussed on Hacker News.   Some of the comments missed the fact that I am using pandas data  frames.  i am using them because they are useful for data analysis.  If  my problem was just to quickly find the closest point, and if I had  enough time, then I would code some quad trees in C or C++.

Second update on July 2, 2015.  Several readers made a comment that numba would provide substantial speedup as well.  I did give it a try.

Here is my, non conclusive, experience.  First, let me say  that experiment with another Python install may yield different result.   I am using Anaconda on a Windows machine, and I have installed quite a  few packages.  Maybe some of these packages interfere with numba. 

First thing I did was to follow the instructions for installing numba:

$ conda install numba

Here is what I got:

image

Then I discovered that numba was already installed in my  anaconda distribution.  Maybe someone working on numba should update the  install instructions.

I then used numba as recommended:

@jit
def closest_func(lat,lon,trkpts,func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best

I did not see running time improvement.  I also tried the more aggressive compilation setting:

@jit(nopython=True)
def closest_func(lat,lon,trkpts,func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best

This time I got an error when running the code:

image

Seems that pandas code is a bit too smart for numba to compile.

已有 1 人翻译此段
我来翻译

Sure, I could spend time changing my data structure until  numba can compile it.   But why would I want to do that?  I got a code  that runs fast enough using numpy.  I was already using numpy via pandas  anyway.  Why not leverage it?

I also got suggestions to use pypy.  This certainly makes sense but... I had to use Jupyter notebooks on a hosted server.  I have to use the kernels they provide, i.e. a regular Python 2.7.x kernel. Pypy was not an option.

I also got suggestions about Cython.   Well, if I am back to compiling code, then I'd rather use C or C++  directly.  I used Python for its interactive nature in notebooks, and  the fast prototyping it enables. This is not what Cython is designed  for.

已有 2 人翻译此段
我来翻译
本文中的所有译文仅用于学习和交流目的,转载请务必注明文章译者、出处、和本文链接。
我们的翻译工作遵照 CC 协议,如果我们的工作有侵犯到您的权益,请及时联系我们。
加载中

评论(39)

布客飞龙
布客飞龙
自己写个共享库,然后import ctypes。完事。
老牛拉货车
老牛拉货车

引用来自“fireflyc”的评论

这个兄弟是瞎搞,命名自己的时间复杂度是一个线性级的非要说语言问题,即便换了语言,你同样的算法也是同样的“慢”。只有一句话是靠谱的,自己实现一个四叉树索引,但是——亲,你难道只会用C实现四叉树?话说来,你如果用C去做这个事情,万一碰到注入正则表达式,多线程,xml解析,json解析,http通讯,我保证你会想念python。所以你这是脑抽,是病,得吃药。。。。。建议没事多出去走走,勤动脑。
同意
crossmix
crossmix
good,对Python不很精
slver888
slver888
牛人之所以牛是因为精通数据结构和算法,和语言木有关系。我们不是牛人,python就是为你准备优秀的数据结构和算法的,这样我也可以装逼了,这样就会激怒那些不会装逼的。。。
码工许师傅
码工许师傅
了解算法,汇编,编译器、解释器的原理很重要
shady
shady
一直在玩泥巴的路过。。。
orangleliu
orangleliu

引用来自“Xtay”的评论

会玩个C还玩出优越感来了
有些人就这样。井底之蛙而已
haitaosoft
haitaosoft
说实话,这怎么叫c的思维?明明是最直接自然的思维。
只是遇到python的循环执行特别弱(所以搞出一些c实现的包)而已。。。
梁选
梁选

引用来自“IdleMan”的评论

感觉不会C的pythoner都是在玩泥巴
不会汇编的Cer都是在玩泥巴
中山野鬼
中山野鬼

引用来自“gzwxn”的评论

总之,不要用 Python 的循环来做高计算量的工作,所以作者说要改变思维模式。P.S. 这里确实是语言的问题,因为算法都是一样的。
这个是基本尝试。每种语言都是工具,围绕工具自然有一套应用思路,对应也是编程思维。把螺丝刀当作锤子用。效果可想而知。哈。
返回顶部
顶部