I have been using Python a lot lately, for various data science projects. Python is known for its ease of use. Someone with coding experience can use it effectively within a few days.

This sounds good but there may be an issue if you program in Python as you would in another language, say C.

Let me give an example based on my own experience. I have a strong background with imperative languages like C and C++. I also have substantial experience with oldies but goodies like Lisp and Prolog. I also used Java, Javascript, and PHP quite a bit. Python should be a no-brainer for me, right? Actually, it looked too easy, and I fell into a trap: I tried to use Python as I would use C.

Here are some details.

As part of a recent project I had to deal with geospatial data. I was given a gps track with about 25,000 points, and I repeatedly needed to find the closest point on that track given a latitude and a longitude. My first reaction was to look for a code snippet that would compute the distance between two points given their latitudes and longitudes. It so happens that John D. Cook made such code available in the public domain.
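
His function computes the great circle distance between two points on a unit sphere. Roughly, it looks like the sketch below (a paraphrase of the idea, spherical law of cosines, not his exact code):

import math

# Sketch of a great circle distance, after John D. Cook's public domain code.
# It returns the arc length on a unit sphere; multiply by the Earth's radius
# (about 6371 km) to get an actual distance.
def distance_on_unit_sphere(lat1, lon1, lat2, lon2):
    # phi is measured from the pole, theta is the longitude, both in radians.
    phi1 = math.radians(90.0 - lat1)
    phi2 = math.radians(90.0 - lat2)
    theta1 = math.radians(lon1)
    theta2 = math.radians(lon2)
    # Spherical law of cosines, clamped to guard acos against rounding.
    c = (math.sin(phi1) * math.sin(phi2) * math.cos(theta1 - theta2)
         + math.cos(phi1) * math.cos(phi2))
    return math.acos(max(-1.0, min(1.0, c)))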

I was all set! All I needed to do was to write a little Python function that returns the index of the point closest to the input coordinates:

def closest_distance(lat,lon,trkpts):
    # Linear scan over all track points, keeping the index of the best one.
    d = 100000.0   # larger than any distance we will see
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = distance_on_unit_sphere(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best

where distance_on_unit_sphere is John D. Cook's function, and trkpts is an array containing the coordinates of the gps track (actually it is a pandas data frame). 

The above function is similar to what I would have written in C. It iterates over the trkpts array, and it keeps the index of the closest point so far in the local variable best.
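
To make the data layout concrete, here is a hypothetical miniature setup (the coordinates below are made up; my actual trkpts came from the gps track): a pandas data frame with one row per track point and 'Lat' and 'Lon' columns, indexed exactly as in the function above. Note that .ix is what these snippets use; in recent pandas versions it has been replaced by .loc.

import pandas

# Hypothetical miniature track, just to show the shape of the data.
trkpts = pandas.DataFrame({'Lat': [45.5017, 45.5030, 45.5089],
                           'Lon': [-73.5673, -73.5650, -73.5617]})

idx = closest_distance(45.5040, -73.5660, trkpts)
closest_point = trkpts.ix[idx]   # row of the closest track point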

So far so good. Python syntax is very different from C's, but at least I could write the code I needed very quickly.

The code was quick to write, but it is rather slow to execute. For instance, I was also given 428 waypoints with names. I wanted to find for each of the waypoints the closest point on the track. Running the search for the closest point for all 428 waypoints takes 3 minutes 6 seconds on my laptop.
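
The timing code itself is not shown above; a minimal sketch of how such a measurement might be done, assuming the waypoints sit in a hypothetical waypts data frame with the same 'Lat' and 'Lon' columns, is:

import time

start = time.time()
# One call per waypoint; each call scans the whole 25,000 point track.
matches = [closest_distance(waypts.ix[i, 'Lat'], waypts.ix[i, 'Lon'], trkpts)
           for i in waypts.index]
print(time.time() - start)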

I then looked for an approximation called the Manhattan distance. Rather than computing the exact distance between points, I compute the distance along the East-West axis plus the distance along the South-North axis. Here is the function that computes Manhattan distance:

import math

def manhattan_distance(lat1, lon1, lat2, lon2):
    lat = (lat1+lat2)/2.0
    # Scale the longitude difference by cos(latitude) to get comparable units.
    return abs(lat1-lat2)+abs(math.cos(math.radians(lat))*(lon1-lon2))

Actually, I used an even simpler function, ignoring the fact that 1 degree difference in latitude yields a larger distance than 1 degree difference in longitude. Here is my simple function:

def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1-lat2)+abs(lon1-lon2)

Our closest function becomes: 

def closest_manhattan_distance1(lat,lon,trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = manhattan_distance1(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best

We can speed it up a bit by inlining the Manhattan distance function:

def closest_manhattan_distance2(lat,lon,trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = abs(lat-lati)+abs(lon-loni)
        if d > md:
            best = i
            d = md
    return best

Using that function yields the same closest points as using John's function. I was hoping for that, and was pleased to see my intuition was right. This simpler function is also faster. Finding the closest points for all my waypoints now takes 2 minutes 37 seconds on my laptop. That is an 18% speedup. Nice, but nothing dramatic.

I then decided to use Python correctly. In this case it means leveraging the array operations that pandas supports. These array operations come from the numpy package. Using these array operations we can get the closest point with very concise code:

import numpy

def closest(lat,lon,trkpts):
    # Vectorized Manhattan distance to every track point, then the index of the minimum.
    cl = numpy.abs(trkpts.Lat - lat) + numpy.abs(trkpts.Lon - lon)
    return cl.idxmin()

This function returns the same closest points as before for all my waypoints. Running it for all the waypoints takes 0.5 seconds on my laptop. This is a 300x speedup over the above code! 300x, i.e. 30,000%. This is dramatic. The reason for that speedup is that numpy array operations are written in C. Hence, we get the best of both worlds: speed comparable to C, with the conciseness of Python.
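
For completeness, the comparison over the waypoints has the same shape as before, reusing the hypothetical waypts frame from the earlier sketch; only the inner function changes, and the per-call work over the 25,000 track points now happens inside numpy:

# Same benchmark loop as before; the heavy lifting per call is now done by
# numpy over the whole track at once.
matches = [closest(waypts.ix[i, 'Lat'], waypts.ix[i, 'Lon'], trkpts)
           for i in waypts.index]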

The lesson is clear: do not write Python code as you would in C. Use numpy array operations rather than iterating over arrays. For me it meant a mental shift.

Update on July 2, 2015. That post is discussed on Hacker News. Some of the comments missed the fact that I am using pandas data frames. I am using them because they are useful for data analysis. If my problem was just to quickly find the closest point, and if I had enough time, then I would code some quad trees in C or C++.

Second update on July 2, 2015. Several readers commented that numba would provide a substantial speedup as well. I did give it a try.

Here is my, non-conclusive, experience. First, let me say that experimenting with another Python install may yield different results. I am using Anaconda on a Windows machine, and I have installed quite a few packages. Maybe some of these packages interfere with numba.

The first thing I did was to follow the instructions for installing numba:

$ conda install numba

Here is what I got:

[screenshot of the conda install output]

Then I discovered that numba was already installed in my Anaconda distribution. Maybe someone working on numba should update the install instructions.

I then used numba as recommended:

from numba import jit

@jit
def closest_func(lat,lon,trkpts,func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best

I did not see any running time improvement. I also tried the more aggressive compilation setting:

@jit(nopython=True)
def closest_func(lat,lon,trkpts,func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i,'Lat']
        loni = trkpts.ix[i,'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best

This time I got an error when running the code:

[screenshot of the numba error]

Seems that pandas code is a bit too smart for numba to compile.
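
My understanding is that the data frame indexing inside the loop is what nopython mode chokes on. A hypothetical variant, which I have not benchmarked, would pull the coordinates out as plain numpy arrays (e.g. trkpts['Lat'].values) and loop over those:

from numba import jit

# Hypothetical numba-friendly variant, not something I actually measured:
# nopython mode handles numpy arrays and scalars, so the data frame columns
# are extracted before the call.
@jit(nopython=True)
def closest_arrays(lat, lon, lats, lons):
    d = 100000.0
    best = -1
    for i in range(lats.shape[0]):
        md = abs(lat - lats[i]) + abs(lon - lons[i])
        if d > md:
            best = i
            d = md
    return best   # positional index into the arrays, not a data frame label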

Sure, I could spend time changing my data structure until numba can compile it. But why would I want to do that? I got code that runs fast enough using numpy. I was already using numpy via pandas anyway. Why not leverage it?

I also got suggestions to use pypy. This certainly makes sense but... I had to use Jupyter notebooks on a hosted server. I had to use the kernels they provide, i.e. a regular Python 2.7.x kernel. Pypy was not an option.

I also got suggestions about Cython. Well, if I am back to compiling code, then I'd rather use C or C++ directly. I used Python for its interactive nature in notebooks, and the fast prototyping it enables. This is not what Cython is designed for.
