# Python is not C



I have been using Python a lot lately, for various data science projects. Python is known for its ease of use: someone with coding experience can use it effectively in a few days.

This sounds good but there may be an issue if you program in Python as you would in another language, say C.

Let me give an example based on my own experience. I have a strong background with imperative languages like C and C++. I also have substantial experience with oldies but goodies like Lisp and Prolog. I also used Java, Javascript, and PHP quite a bit. Python should be a no-brainer for me, right? Actually it looked too easy, and I fell into a trap: I tried to use Python as I would use C.

Here are some details.

As part of a recent project I had to deal with geospatial data. I was given a gps track with about 25,000 points, and I repeatedly needed to find the closest point on that track given a latitude and a longitude. My first reaction was to look for a code snippet that would compute the distance between two points given their latitudes and longitudes. It so happens that John D. Cook made such code available in the public domain.
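His snippet computes the great-circle distance between two points on a unit sphere using the spherical law of cosines. A self-contained sketch along those lines (the exact code is on his site; the clamp on the cosine is my addition to guard against rounding errors at nearly-identical points):

```python
import math

def distance_on_unit_sphere(lat1, lon1, lat2, lon2):
    # Convert latitude to a polar angle and longitude to an azimuth, in radians.
    phi1 = math.radians(90.0 - lat1)
    phi2 = math.radians(90.0 - lat2)
    theta1 = math.radians(lon1)
    theta2 = math.radians(lon2)
    # Spherical law of cosines: arc length between the two points on a
    # sphere of radius 1 (multiply by the Earth's radius for kilometers).
    cos_arc = (math.sin(phi1) * math.sin(phi2) * math.cos(theta1 - theta2)
               + math.cos(phi1) * math.cos(phi2))
    return math.acos(min(1.0, max(-1.0, cos_arc)))
```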

I was all set!  All I needed to do was to write a little  Python function that returns the index of the point closest to the input  coordinates:

```python
def closest_distance(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = distance_on_unit_sphere(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best
```

where distance_on_unit_sphere is John D. Cook's function, and trkpts is an array containing the coordinates of the gps track (actually it is a pandas data frame).
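For concreteness, here is a hypothetical miniature version of trkpts (the real track has about 25,000 rows); the loop above reads the 'Lat' and 'Lon' columns row by row. Note that `.ix` has since been removed from pandas; `.loc` does the same labeled row/column lookup in current versions:

```python
import pandas as pd

# Hypothetical three-point track; column names match the ones used in the loop.
trkpts = pd.DataFrame({
    'Lat': [45.10, 45.11, 45.12],
    'Lon': [6.50, 6.52, 6.54],
})

# Modern equivalent of trkpts.ix[i, 'Lat'] inside the loop:
lati = trkpts.loc[0, 'Lat']
loni = trkpts.loc[0, 'Lon']
```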

The above function is similar to what I would have written in C. It iterates over the trkpts array, and it keeps the index of the closest point so far in the local variable best.

So far so good. Python's syntax is very different from C's, but at least I could write the code I needed very quickly.

The code was quick to write, but it is rather slow to execute. For instance, I was also given 428 waypoints with names. I wanted to find, for each of the waypoints, the closest point on the track. Running the search for the closest point for all 428 waypoints takes 3 minutes 6 seconds on my laptop.

I then looked for an approximation called the Manhattan  distance.  Rather than computing the exact distance between points, I  compute the distance along the East-West axis plus the distance along  the South-North axis.  Here is the function that computes Manhattan  distance:

```python
import math

def manhattan_distance(lat1, lon1, lat2, lon2):
    lat = (lat1 + lat2) / 2.0
    # Scale the longitude difference by cos(latitude); the body below is an
    # assumed completion, as the snippet is truncated in this copy of the post.
    return abs(lat1 - lat2) + abs(lon1 - lon2) * math.cos(math.radians(lat))
```

Actually, I used an even simpler function, ignoring the  fact that 1 degree difference in latitude yields a larger distance than 1  degree difference in longitude.  Here is my simple function:

```python
def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1 - lat2) + abs(lon1 - lon2)
```
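The simplification distorts absolute distances, but for a nearest-point search only the ranking matters, and on a reasonably dense track the ranking usually survives. A quick illustration with hypothetical coordinates at 45°N, where 0.1° of latitude is about 11.1 km but 0.1° of longitude is only about 7.9 km:

```python
def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1 - lat2) + abs(lon1 - lon2)

# The simplified metric scores these two offsets the same (up to rounding),
# even though they correspond to different physical distances at 45N.
north = manhattan_distance1(45.0, 6.0, 45.1, 6.0)  # 0.1 deg of latitude
east = manhattan_distance1(45.0, 6.0, 45.0, 6.1)   # 0.1 deg of longitude
```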

Our closest function becomes:

```python
def closest_manhattan_distance1(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = manhattan_distance1(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best
```

We can speed it up a bit by inlining the Manhattan distance function:

```python
def closest_manhattan_distance2(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            best = i
            d = md
    return best
```

Using that function yields the same closest points as John's function. I was hoping for that, and was pleased to see my intuition was right. This simpler function is also faster: finding the closest points for all my waypoints now takes 2 minutes 37 seconds on my laptop. It is a nice 18% speedup, but nothing dramatic.

I then decided to use Python correctly. In this case it means leveraging the array operations that pandas supports. These array operations come from the numpy package. Using them, we can get the closest point with very concise code:

```python
def closest(lat, lon, trkpts):
    cl = numpy.abs(trkpts.Lat - lat) + numpy.abs(trkpts.Lon - lon)
    return cl.idxmin()
```

This function returns the same closest points as before for all my waypoints. Running it for all the waypoints takes 0.5 seconds on my laptop. This is a 300x speedup over the above code! 300x, i.e. 30,000%. This is dramatic. The reason for that speedup is that numpy array operations are written in C. Hence, we get the best of both worlds: speed comparable to C, with the conciseness of Python.
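On a hypothetical mini track, the vectorized version picks the same point a hand-written loop would: the whole-column arithmetic builds a Series of Manhattan distances in one shot, and `idxmin` returns the index label of the smallest entry.

```python
import numpy
import pandas as pd

def closest(lat, lon, trkpts):
    # One vectorized pass over both columns, then pick the index of the minimum.
    cl = numpy.abs(trkpts.Lat - lat) + numpy.abs(trkpts.Lon - lon)
    return cl.idxmin()

# Hypothetical three-point track; the query point sits nearest the middle row.
trkpts = pd.DataFrame({'Lat': [45.10, 45.11, 45.12],
                       'Lon': [6.50, 6.52, 6.54]})
idx = closest(45.112, 6.521, trkpts)
```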

The lesson is clear: do not write Python code as you would in C. Use numpy array operations rather than iterating over arrays. For me it meant a mental shift.

Update on July 2, 2015. This post is discussed on Hacker News. Some of the comments missed the fact that I am using pandas data frames. I am using them because they are useful for data analysis. If my problem were just to quickly find the closest point, and if I had enough time, then I would code a quad tree in C or C++.

Second update on July 2, 2015.  Several readers made a comment that numba would provide substantial speedup as well.  I did give it a try.

Here is my inconclusive experience. First, let me say that experimenting with another Python install may yield different results. I am using Anaconda on a Windows machine, and I have installed quite a few packages. Maybe some of these packages interfere with numba.

The first thing I did was to follow the instructions for installing numba:

`$ conda install numba`

Running that command showed me that numba was already installed in my Anaconda distribution. Maybe someone working on numba should update the install instructions.

I then used numba as recommended:

```python
from numba import jit

@jit
def closest_func(lat, lon, trkpts, func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best
```

I did not see running time improvement.  I also tried the more aggressive compilation setting:

```python
@jit(nopython=True)
def closest_func(lat, lon, trkpts, func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best
```

This time I got an error when running the code. It seems that pandas data frames are a bit too smart for numba to compile.

Sure, I could spend time changing my data structures until numba can compile the code. But why would I want to do that? I have code that runs fast enough using numpy, and I was already using numpy via pandas anyway. Why not leverage it?
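For the record, the change numba wants is to hand it plain numpy arrays instead of a data frame. A sketch of that rewrite (the function name is mine, and the try/except falls back to a no-op decorator when numba is absent, so the loop still runs as ordinary Python):

```python
import numpy as np

try:
    from numba import njit          # numba compiles plain-array loops well
except ImportError:                 # without numba, just run the loop as-is
    def njit(f):
        return f

@njit
def closest_on_arrays(lat, lon, lats, lons):
    # Same Manhattan-distance scan, but over raw float arrays numba understands.
    d = 100000.0
    best = -1
    for i in range(lats.shape[0]):
        md = abs(lat - lats[i]) + abs(lon - lons[i])
        if md < d:
            best = i
            d = md
    return best

# Usage: extract the columns once, e.g.
#   closest_on_arrays(lat, lon, trkpts['Lat'].values, trkpts['Lon'].values)
```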

I also got suggestions to use PyPy. This certainly makes sense, but I had to use Jupyter notebooks on a hosted server, with the kernels they provide, i.e. a regular Python 2.7.x kernel. PyPy was not an option.

I also got suggestions about Cython. Well, if I am back to compiling code, then I'd rather use C or C++ directly. I used Python for its interactive nature in notebooks and the fast prototyping it enables. This is not what Cython is designed for.
