pysal / gwr Goto Github PK

This is a repository for the geographically-weighted regression submodule of the Python Spatial Analysis Library

License: BSD 2-Clause "Simplified" License

Python 7.75% Jupyter Notebook 92.25%

gwr's Issues

Excessive computation in golden_section search can be avoided

In golden_section() search, it is like either b will become d(d <- b when score_b <= score_d) or d will become b (b <- d when score_b > score_d) for the next iteration, thus either gwr_func(b) or gwr_func(d) has been computed in previous iteration and do not need to be re-computed.

We could add a dictionary with key-value as {bw : gwr_func(bw)} to store previously computed gwr_func(bw) and in every iteration we check if for a given bw, gwr_func(bw) has been computed or not. If not, just compute it. If bw exists in dictionary, just get the value. This should approximately save half of the time when searching for optimal bw.

def golden_section(a, c, delta, function, tol, max_iter, int_score=True):
    ...
    dict = {} # {bw:aicc}
    for iters in range(max_iter):
        ...
        #score_b = function(b)
        #score_d = function(d)
        if b in dict:
            score_b = dict[b]
        else:
            score_b = function(b)
            dict[b] = score_b
        if d in dict:
            score_d = dict[d]
        else:
            score_d = function(d)
            dict[d] = score_d
        ...

Thoughts about adaptive gaussian

The adaptive gaussian and adaptive bisquare show totally different curves for the model PctBach~FB+Black+Pov. The optimal bandwidth found for adaptive gaussian would be local and for adaptive bisquare would be global. Test against GWModel is showing exactly same curves. Test against ArcGIS adaptive gaussian is showing similar curve as PySAL's adaptive bisqaure. BTW, ArcGIS claims their gaussian kernel is w = exp(-(d/b)**2), while we are using exp(-0.5*(d/b)**2) which is from the book. Will look into to see what it implies.

#Imports
import numpy as np
import matplotlib.pyplot as plt
import pysal
from pysal.contrib.gwr.gwr import GWR
from pysal.contrib.gwr.sel_bw import Sel_BW
from pysal.contrib.glm.family import Gaussian, Poisson, Binomial


#Load the GA data
data = pysal.open(pysal.examples.get_path('GData_utm.csv'))
y = np.array(data.by_col('PctBach')).reshape((-1,1))
fb  = np.array(data.by_col('PctFB')).reshape((-1,1))
pov = np.array(data.by_col('PctPov')).reshape((-1,1)) 
blk = np.array(data.by_col('PctBlack')).reshape((-1,1))
X = np.hstack([fb,pov,blk])
coords = list(zip(data.by_col('X'), data.by_col('Y')))

#Fit using each bandwidth
pysal_aicc_gau = []
pysal_aicc_bisq = []
for bw in range(20,160):
    results = GWR(coords, y, X, bw,fixed=False,kernel='gaussian',family=Gaussian()).fit()
    pysal_aicc_gau += [results.aicc]
    results = GWR(coords, y, X, bw,fixed=False,kernel='bisquare',family=Gaussian()).fit()
    pysal_aicc_bisq += [results.aicc]

#Plot
bw = range(20,160)
plt.figure(figsize=(12,8))
#plt.plot(bw,arc_aiccs,'r-',label="ArcGIS Adaptive 'Gaussian'") #Calculated from ArcGIS
plt.plot(bw,pysal_aicc_bisq,'b-',label='PySAL Adaptive Bisquare')
plt.plot(bw,pysal_aicc_gau,'y-',label='PySAL Adaptive Gaussian')
plt.xlabel('Bandwidth',fontsize=14)
plt.ylabel('AICc',fontsize=14)
plt.title('Georgia: Bach~ForeignBorn+Poverty+Black',fontsize=20)
plt.legend(fontsize=14)
plt.show()

Resolve divide-by-zero issue for Gaussian GWR

See here for details. Need to confirm proposed solution.

calling `search` modifies the method `search`, so it can only be called once.

When you call Sel_BW(...).search(criterion='something', search='something'), the search method actually overwrites itself on line 172. So you can't run it again without instantiating it again, and the error is really cryptic (since search is now a string, you get "str object is not callable" when you've not done anything with strings?)

I'd like to change this to assign to self.search_method or something.

#Offset does not yet do anyhting and needs to be implemented

Offset functionality for Poisson distibution is still dead?
It seems, that it is taken in account, but returning results are strange:

result of fit() funtion is in log scale
residuals are calculated as natural values of exog and log scale of fitted.

thanks for answer

For fit parameters, use sklearn trailing `_` notation?

Sklearn has standardized around using trailing underscores to denote quantities estimated from data.

I think this convention could be helpful? Sometimes, I'm not sure what quantities on a Sel_BW object versus a GWR object are estimated, especially since a gwr object receives a bw and does not, itself, estimate that bw.

Predition jpynb is broken

https://github.com/pysal/gwr/blob/master/gwr/notebooks/GWR_Georgia_Prediction.ipynb

Support for projected coordinates

Code currently only support projected coordinates (Euclidian distances) and users may only have lat/long data or may be using a global-scale dataset that requires spherical coordinates and requires distances computed by the haversine formula.

Add "Global" R-squared

Currently, results include a local R-squared that is based on weighted diagnostics and not directly comparable to the R-squared of a an OLS. Would be useful to have global R-squared implemented as:

R-squared = 1 - (RSS / TSS)

where

TSS = np.sum((results.y.reshape((815,1)) - np.mean(y))**2)
RSS = np.sum((results.y.reshape((815,1)) - results.predy.reshape((815,1)))**2)

Possibility for adding a summary text for GWR diagnostics

Like what statsmodel.OLS.fit.summary() does? Can follow GWR4's summary .txt.

Profiling for speed-up

Did some profiling before and after implementing my optimization. Here is a notebook. Note that the argument in Sel_BW(..., Ziqi=True/False) indicates whether using my speed-up or not.

Take a fixed gaussian example
Before:
From line profiler: 1) The iwls() takes 70.3% of total runtime with 1494 ns per hit. 2) Returning GWRResults class takes 19.1% of total runtime. So basically those two are the most heaviest. Total runtime is 185.313 s.


Total time: 185.313 s
File: /Users/Ziqi/Desktop/developer/gwr/gwr/gwr.py
Function: fit at line 229

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   229                                               def fit(self, ini_params=None, tol=1.0e-5, max_iter=20, solve='iwls'):
   230                                                   """
   231                                                   Method that fits a model with a particular estimation routine.
   232                                           
   233                                                   Parameters
   234                                                   ----------
   235                                           
   236                                                   ini_betas     : array
   237                                                                   k*1, initial coefficient values, including constant.
   238                                                                   Default is None, which calculates initial values during
   239                                                                   estimation
   240                                                   tol:            float
   241                                                                   Tolerence for estimation convergence
   242                                                   max_iter      : integer
   243                                                                   Maximum number of iterations if convergence not
   244                                                                   achieved
   245                                                   solve         : string
   246                                                                   Technique to solve MLE equations.
   247                                                                   'iwls' = iteratively (re)weighted least squares (default)
   248                                                   """
   249        31          131      4.2      0.0          self.fit_params['ini_params'] = ini_params
   250        31           39      1.3      0.0          self.fit_params['tol'] = tol
   251        31           32      1.0      0.0          self.fit_params['max_iter'] = max_iter
   252        31           29      0.9      0.0          self.fit_params['solve']= solve
   253        31          104      3.4      0.0          if solve.lower() == 'iwls':
   254        31          100      3.2      0.0              m = self.W.shape[0]
   255        31          853     27.5      0.0              params = np.zeros((m, self.k))
   256        31          162      5.2      0.0              predy = np.zeros((m, 1))
   257        31          158      5.1      0.0              v = np.zeros((m, 1))
   258        31          147      4.7      0.0              w = np.zeros((m, 1))
   259        31       420474  13563.7      0.2              z = np.zeros((m, self.n))
   260        31       427699  13796.7      0.2              S = np.zeros((m, self.n))
   261        31       408756  13185.7      0.2              R = np.zeros((m, self.n))
   262        31         1291     41.6      0.0              CCT = np.zeros((m, self.k))
   263                                                       #f = np.zeros((n, n))
   264        31          293      9.5      0.0              p = np.zeros((m, 1))
   265     87234       125547      1.4      0.1              for i in range(m):
   266     87203       328111      3.8      0.2                  wi = self.W[i].reshape((-1,1))
   267     87203       144996      1.7      0.1                  rslt = iwls(self.y, self.X, self.family, self.offset, None,
   268     87203    130337171   1494.6     70.3                  ini_params, tol, max_iter, wi=wi)
   269     87203       754817      8.7      0.4                  params[i,:] = rslt[0].T
   270     87203       250628      2.9      0.1                  predy[i] = rslt[1][i]
   271     87203       175738      2.0      0.1                  v[i] = rslt[2][i]
   272     87203       175021      2.0      0.1                  w[i] = rslt[3][i]
   273     87203      1065945     12.2      0.6                  z[i] = rslt[4].flatten()
   274     87203      2158717     24.8      1.2                  R[i] = np.dot(self.X[i], rslt[5])
   275     87203      1562018     17.9      0.8                  ri = np.dot(self.X[i], rslt[5])
   276     87203      2709734     31.1      1.5                  S[i] = ri*np.reshape(rslt[4].flatten(), (1,-1))
   277                                                           #dont need unless f is explicitly passed for
   278                                                           #prediction of non-sampled points
   279                                                           #cf = rslt[5] - np.dot(rslt[5], f)
   280                                                           #CCT[i] = np.diag(np.dot(cf, cf.T/rslt[3]))
   281     87203      7677708     88.0      4.1                  CCT[i] = np.diag(np.dot(rslt[5], rslt[5].T))
   282        31      1244227  40136.4      0.7              S = S * (1.0/z)
   283        31     35342666 1140086.0     19.1          return GWRResults(self, params, predy, S, CCT, w)

After implementing my optimization:
Calling _compute_betas_gwr directly takes 280.6 per hit, which is one fifth of iwls(). Total runtime is 26.9759 s, which is a 6x speed-up in this case.

Timer unit: 1e-06 s

Total time: 26.9759 s
File: /Users/Ziqi/Desktop/developer/gwr/gwr/gwr.py
Function: _fast_search at line 285

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   285                                               def _fast_search(self, ini_params=None, tol=1.0e-5, max_iter=20, solve='iwls'):
   286        31          105      3.4      0.0          trS = 0 #trace of S
   287        31           45      1.5      0.0          RSS = 0
   288        31           29      0.9      0.0          dev = 0
   289        31           31      1.0      0.0          CV_score = 0
   290        31           40      1.3      0.0          n = self.n
   291     87234        90185      1.0      0.3          for i in range(n):
   292     87203       148537      1.7      0.6              if self.kernel == 'bisquare': #Truncated kernel, taking out none-zero weights observations
   293                                                           nonzero_i = np.nonzero(self.W[i])
   294                                                           wi = self.W[i,nonzero_i].reshape((-1,1))
   295                                                           X_new = self.X[nonzero_i]
   296                                                           Y_new = self.y[nonzero_i]
   297                                                           offset_new = self.offset[nonzero_i]
   298                                                           current_i = np.where(wi==1)[0][0] #index of current regression point
   299                                                       
   300                                                       else: #non-truncated kernel
   301     87203       288208      3.3      1.1                  wi = self.W[i].reshape((-1,1))
   302     87203        85832      1.0      0.3                  X_new = self.X
   303     87203        83646      1.0      0.3                  Y_new = self.y
   304     87203        82309      0.9      0.3                  offset_new = self.offset
   305     87203        75867      0.9      0.3                  current_i = i
   306                                                               
   307     87203       141917      1.6      0.5              if isinstance(self.family, Gaussian):
   308     87203     24466433    280.6     90.7                  betas, inv_xtx_xt = _compute_betas_gwr(Y_new,X_new,wi)
   309     87203       543991      6.2      2.0                  hat = np.dot(X_new[current_i],inv_xtx_xt[:,current_i]) #influ
   310     87203       235841      2.7      0.9                  yhat = np.dot(X_new[current_i],betas)[0] #yhat
   311     87203       182077      2.1      0.7                  err = Y_new[current_i][0]-yhat #residual
   312     87203       130966      1.5      0.5                  RSS += err*err
   313     87203        94273      1.1      0.3                  trS += hat
   314     87203       324563      3.7      1.2                  CV_score += (err/(1-hat))**2
   315                                                           
   316                                                       elif isinstance(self.family, (Poisson, Binomial)):
   317                                                           rslt = iwls(Y_new, X_new, self.family, offset_new, None, ini_params, tol, max_iter, wi=wi)
   318                                                           inv_xtx_xt = rslt[5]
   319                                                           hat = np.dot(X_new[current_i],inv_xtx_xt[:,current_i])*rslt[3][current_i][0]
   320                                                           yhat = rslt[1][current_i][0]
   321                                                           err = Y_new[current_i][0]-yhat
   322                                                           trS += hat
   323                                                           dev += self.family.resid_dev(Y_new[current_i][0], yhat)**2
   324                                           
   325        31           56      1.8      0.0          if isinstance(self.family, Gaussian):
   326        31          520     16.8      0.0              ll = -np.log(RSS)*n/2 - (1+np.log(np.pi/n*2))*n/2 #log likelihood
   327        31           76      2.5      0.0              aic = -2*ll + 2.0 * (trS + 1)
   328        31           75      2.4      0.0              aicc = -2.0*ll + 2.0*n*(trS + 1.0)/(n - trS - 2.0)
   329        31          201      6.5      0.0              bic = -2*ll + (trS+1) * np.log(n)
   330        31           40      1.3      0.0              cv = CV_score/n
   331                                                   elif isinstance(self.family, (Poisson, Binomial)):
   332                                                       aic = dev + 2.0 * trS
   333                                                       aicc = aic + 2.0 * trS * (trS + 1.0)/(n - trS - 1.0)
   334                                                       bic = dev + trS * np.log(n)
   335                                                       cv = None
   336                                           
   337        31           58      1.9      0.0          return {'AICc': aicc,'AIC':aic, 'BIC': bic,'CV': cv}

gwr is using example files that are not in libpysal (but in old pysal)

For example https://github.com/pysal/gwr/blob/master/gwr/tests/test_gwr.py#L18

In meta pysal the tests fail as those datasets are not part of libpysal. It would be great if gwr could test on its own datasets that were distributed with its source.

GWR speed-up

Two speed-up ideas

Calculate only AIC/AICc/BIC/CV (search method) in bandwidth searching and calculate full diagnostics in the final GWR.fit(). I am planning to submit a PR about this.
When calculating kernel for each bandwidth, distance matrix is calculated and sorted (if fixed = False) every time, which can be calculated for only once and passed to kernel calculation function.

TypeError: ('Unsupported kernel function ', 'bisquare')

import pysal as ps
from gwr.gwr import GWR
from spglm.family import Gaussian, Poisson


y = CASES_POOL_A.Call2017.reshape((-1,1))
X=shp[['cx3','cx5','cx6']].values
labels = ['Intercept', 'cx3', 'cx5', 'cx6']
u = shp.LAT
v = shp.LON
coords = zip(u,v)

model = GWR(coords, y, X, bw=12, family=Poisson(),fixed=False)
results = model.fit()

Traceback (most recent call last):

  File "<ipython-input-2-f5f6fbc75e94>", line 1, in <module>
    runfile('/home/www_adm/ownCloud/NNIIEM_SCIENCE/00_MAIN_THEME/01_GEN_FOR_R.py', wdir='/home/www_adm/ownCloud/NNIIEM_SCIENCE/00_MAIN_THEME')

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/home/www_adm/ownCloud/NNIIEM_SCIENCE/00_MAIN_THEME/01_GEN_FOR_R.py", line 141, in <module>
    model = GWR(coords, y, X, bw=12, family=Poisson(),fixed=False)

  File "/usr/local/lib/python3.5/dist-packages/gwr/gwr.py", line 205, in __init__
    self.W = self._build_W(fixed, kernel, coords, bw)

  File "/usr/local/lib/python3.5/dist-packages/gwr/gwr.py", line 221, in _build_W
    raise TypeError('Unsupported kernel function  ', kernel)

TypeError: ('Unsupported kernel function  ', 'bisquare')

Move MGWR

We should probably also think about moving mgwr into the private version of this package.

If we need to, I can create the private pysal/mgwr or something, to store work until we're ready to make it public. That'd be a copy of this package + the relevant improvements for the multiscale implementation.

pysal / gwr Goto Github PK

gwr's Issues

Excessive computation in golden_section search can be avoided

Thoughts about adaptive gaussian

Resolve divide-by-zero issue for Gaussian GWR

calling `search` modifies the method `search`, so it can only be called once.

#Offset does not yet do anyhting and needs to be implemented

For fit parameters, use sklearn trailing `_` notation?

Predition jpynb is broken

Support for projected coordinates

Add "Global" R-squared

Possibility for adding a summary text for GWR diagnostics

Profiling for speed-up

gwr is using example files that are not in libpysal (but in old pysal)

GWR speed-up

TypeError: ('Unsupported kernel function ', 'bisquare')

Move MGWR

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent