pysal / gwr Goto Github PK
View Code? Open in Web Editor NEWThis is a repository for the geographically-weighted regression submodule of the Python Spatial Analysis Library
License: BSD 2-Clause "Simplified" License
This is a repository for the geographically-weighted regression submodule of the Python Spatial Analysis Library
License: BSD 2-Clause "Simplified" License
In golden_section()
search, it is like either b will become d(d <- b when score_b <= score_d)
or d will become b (b <- d when score_b > score_d)
for the next iteration, thus either gwr_func(b)
or gwr_func(d)
has been computed in previous iteration and do not need to be re-computed.
We could add a dictionary with key-value as {bw : gwr_func(bw)}
to store previously computed gwr_func(bw)
and in every iteration we check if for a given bw, gwr_func(bw)
has been computed or not. If not, just compute it. If bw exists in dictionary, just get the value. This should approximately save half of the time when searching for optimal bw.
def golden_section(a, c, delta, function, tol, max_iter, int_score=True):
...
dict = {} # {bw:aicc}
for iters in range(max_iter):
...
#score_b = function(b)
#score_d = function(d)
if b in dict:
score_b = dict[b]
else:
score_b = function(b)
dict[b] = score_b
if d in dict:
score_d = dict[d]
else:
score_d = function(d)
dict[d] = score_d
...
The adaptive gaussian
and adaptive bisquare
show totally different curves for the model PctBach~FB+Black+Pov
. The optimal bandwidth found for adaptive gaussian
would be local and for adaptive bisquare
would be global. Test against GWModel
is showing exactly same curves. Test against ArcGIS adaptive gaussian
is showing similar curve as PySAL's adaptive bisqaure
. BTW, ArcGIS claims their gaussian kernel is w = exp(-(d/b)**2)
, while we are using exp(-0.5*(d/b)**2)
which is from the book. Will look into to see what it implies.
#Imports
import numpy as np
import matplotlib.pyplot as plt
import pysal
from pysal.contrib.gwr.gwr import GWR
from pysal.contrib.gwr.sel_bw import Sel_BW
from pysal.contrib.glm.family import Gaussian, Poisson, Binomial
#Load the GA data
data = pysal.open(pysal.examples.get_path('GData_utm.csv'))
y = np.array(data.by_col('PctBach')).reshape((-1,1))
fb = np.array(data.by_col('PctFB')).reshape((-1,1))
pov = np.array(data.by_col('PctPov')).reshape((-1,1))
blk = np.array(data.by_col('PctBlack')).reshape((-1,1))
X = np.hstack([fb,pov,blk])
coords = list(zip(data.by_col('X'), data.by_col('Y')))
#Fit using each bandwidth
pysal_aicc_gau = []
pysal_aicc_bisq = []
for bw in range(20,160):
results = GWR(coords, y, X, bw,fixed=False,kernel='gaussian',family=Gaussian()).fit()
pysal_aicc_gau += [results.aicc]
results = GWR(coords, y, X, bw,fixed=False,kernel='bisquare',family=Gaussian()).fit()
pysal_aicc_bisq += [results.aicc]
#Plot
bw = range(20,160)
plt.figure(figsize=(12,8))
#plt.plot(bw,arc_aiccs,'r-',label="ArcGIS Adaptive 'Gaussian'") #Calculated from ArcGIS
plt.plot(bw,pysal_aicc_bisq,'b-',label='PySAL Adaptive Bisquare')
plt.plot(bw,pysal_aicc_gau,'y-',label='PySAL Adaptive Gaussian')
plt.xlabel('Bandwidth',fontsize=14)
plt.ylabel('AICc',fontsize=14)
plt.title('Georgia: Bach~ForeignBorn+Poverty+Black',fontsize=20)
plt.legend(fontsize=14)
plt.show()
See here for details. Need to confirm proposed solution.
When you call Sel_BW(...).search(criterion='something', search='something')
, the search method actually overwrites itself on line 172. So you can't run it again without instantiating it again, and the error is really cryptic (since search is now a string, you get "str object is not callable" when you've not done anything with strings?)
I'd like to change this to assign to self.search_method
or something.
Offset functionality for Poisson distibution is still dead?
It seems, that it is taken in account, but returning results are strange:
thanks for answer
Sklearn has standardized around using trailing underscores to denote quantities estimated from data.
I think this convention could be helpful? Sometimes, I'm not sure what quantities on a Sel_BW
object versus a GWR
object are estimated, especially since a gwr object receives a bw and does not, itself, estimate that bw.
Code currently only support projected coordinates (Euclidian distances) and users may only have lat/long data or may be using a global-scale dataset that requires spherical coordinates and requires distances computed by the haversine formula.
Currently, results include a local R-squared that is based on weighted diagnostics and not directly comparable to the R-squared of a an OLS. Would be useful to have global R-squared implemented as:
R-squared = 1 - (RSS / TSS)
where
TSS = np.sum((results.y.reshape((815,1)) - np.mean(y))**2)
RSS = np.sum((results.y.reshape((815,1)) - results.predy.reshape((815,1)))**2)
Like what statsmodel.OLS.fit.summary()
does? Can follow GWR4's summary .txt
.
Did some profiling before and after implementing my optimization. Here is a notebook. Note that the argument in Sel_BW(..., Ziqi=True/False)
indicates whether using my speed-up or not.
Take a fixed gaussian example
Before:
From line profiler: 1) The iwls()
takes 70.3% of total runtime with 1494 ns per hit. 2) Returning GWRResults
class takes 19.1% of total runtime. So basically those two are the most heaviest. Total runtime is 185.313 s.
Total time: 185.313 s
File: /Users/Ziqi/Desktop/developer/gwr/gwr/gwr.py
Function: fit at line 229
Line # Hits Time Per Hit % Time Line Contents
==============================================================
229 def fit(self, ini_params=None, tol=1.0e-5, max_iter=20, solve='iwls'):
230 """
231 Method that fits a model with a particular estimation routine.
232
233 Parameters
234 ----------
235
236 ini_betas : array
237 k*1, initial coefficient values, including constant.
238 Default is None, which calculates initial values during
239 estimation
240 tol: float
241 Tolerence for estimation convergence
242 max_iter : integer
243 Maximum number of iterations if convergence not
244 achieved
245 solve : string
246 Technique to solve MLE equations.
247 'iwls' = iteratively (re)weighted least squares (default)
248 """
249 31 131 4.2 0.0 self.fit_params['ini_params'] = ini_params
250 31 39 1.3 0.0 self.fit_params['tol'] = tol
251 31 32 1.0 0.0 self.fit_params['max_iter'] = max_iter
252 31 29 0.9 0.0 self.fit_params['solve']= solve
253 31 104 3.4 0.0 if solve.lower() == 'iwls':
254 31 100 3.2 0.0 m = self.W.shape[0]
255 31 853 27.5 0.0 params = np.zeros((m, self.k))
256 31 162 5.2 0.0 predy = np.zeros((m, 1))
257 31 158 5.1 0.0 v = np.zeros((m, 1))
258 31 147 4.7 0.0 w = np.zeros((m, 1))
259 31 420474 13563.7 0.2 z = np.zeros((m, self.n))
260 31 427699 13796.7 0.2 S = np.zeros((m, self.n))
261 31 408756 13185.7 0.2 R = np.zeros((m, self.n))
262 31 1291 41.6 0.0 CCT = np.zeros((m, self.k))
263 #f = np.zeros((n, n))
264 31 293 9.5 0.0 p = np.zeros((m, 1))
265 87234 125547 1.4 0.1 for i in range(m):
266 87203 328111 3.8 0.2 wi = self.W[i].reshape((-1,1))
267 87203 144996 1.7 0.1 rslt = iwls(self.y, self.X, self.family, self.offset, None,
268 87203 130337171 1494.6 70.3 ini_params, tol, max_iter, wi=wi)
269 87203 754817 8.7 0.4 params[i,:] = rslt[0].T
270 87203 250628 2.9 0.1 predy[i] = rslt[1][i]
271 87203 175738 2.0 0.1 v[i] = rslt[2][i]
272 87203 175021 2.0 0.1 w[i] = rslt[3][i]
273 87203 1065945 12.2 0.6 z[i] = rslt[4].flatten()
274 87203 2158717 24.8 1.2 R[i] = np.dot(self.X[i], rslt[5])
275 87203 1562018 17.9 0.8 ri = np.dot(self.X[i], rslt[5])
276 87203 2709734 31.1 1.5 S[i] = ri*np.reshape(rslt[4].flatten(), (1,-1))
277 #dont need unless f is explicitly passed for
278 #prediction of non-sampled points
279 #cf = rslt[5] - np.dot(rslt[5], f)
280 #CCT[i] = np.diag(np.dot(cf, cf.T/rslt[3]))
281 87203 7677708 88.0 4.1 CCT[i] = np.diag(np.dot(rslt[5], rslt[5].T))
282 31 1244227 40136.4 0.7 S = S * (1.0/z)
283 31 35342666 1140086.0 19.1 return GWRResults(self, params, predy, S, CCT, w)
After implementing my optimization:
Calling _compute_betas_gwr
directly takes 280.6 per hit, which is one fifth of iwls()
. Total runtime is 26.9759 s, which is a 6x speed-up in this case.
Timer unit: 1e-06 s
Total time: 26.9759 s
File: /Users/Ziqi/Desktop/developer/gwr/gwr/gwr.py
Function: _fast_search at line 285
Line # Hits Time Per Hit % Time Line Contents
==============================================================
285 def _fast_search(self, ini_params=None, tol=1.0e-5, max_iter=20, solve='iwls'):
286 31 105 3.4 0.0 trS = 0 #trace of S
287 31 45 1.5 0.0 RSS = 0
288 31 29 0.9 0.0 dev = 0
289 31 31 1.0 0.0 CV_score = 0
290 31 40 1.3 0.0 n = self.n
291 87234 90185 1.0 0.3 for i in range(n):
292 87203 148537 1.7 0.6 if self.kernel == 'bisquare': #Truncated kernel, taking out none-zero weights observations
293 nonzero_i = np.nonzero(self.W[i])
294 wi = self.W[i,nonzero_i].reshape((-1,1))
295 X_new = self.X[nonzero_i]
296 Y_new = self.y[nonzero_i]
297 offset_new = self.offset[nonzero_i]
298 current_i = np.where(wi==1)[0][0] #index of current regression point
299
300 else: #non-truncated kernel
301 87203 288208 3.3 1.1 wi = self.W[i].reshape((-1,1))
302 87203 85832 1.0 0.3 X_new = self.X
303 87203 83646 1.0 0.3 Y_new = self.y
304 87203 82309 0.9 0.3 offset_new = self.offset
305 87203 75867 0.9 0.3 current_i = i
306
307 87203 141917 1.6 0.5 if isinstance(self.family, Gaussian):
308 87203 24466433 280.6 90.7 betas, inv_xtx_xt = _compute_betas_gwr(Y_new,X_new,wi)
309 87203 543991 6.2 2.0 hat = np.dot(X_new[current_i],inv_xtx_xt[:,current_i]) #influ
310 87203 235841 2.7 0.9 yhat = np.dot(X_new[current_i],betas)[0] #yhat
311 87203 182077 2.1 0.7 err = Y_new[current_i][0]-yhat #residual
312 87203 130966 1.5 0.5 RSS += err*err
313 87203 94273 1.1 0.3 trS += hat
314 87203 324563 3.7 1.2 CV_score += (err/(1-hat))**2
315
316 elif isinstance(self.family, (Poisson, Binomial)):
317 rslt = iwls(Y_new, X_new, self.family, offset_new, None, ini_params, tol, max_iter, wi=wi)
318 inv_xtx_xt = rslt[5]
319 hat = np.dot(X_new[current_i],inv_xtx_xt[:,current_i])*rslt[3][current_i][0]
320 yhat = rslt[1][current_i][0]
321 err = Y_new[current_i][0]-yhat
322 trS += hat
323 dev += self.family.resid_dev(Y_new[current_i][0], yhat)**2
324
325 31 56 1.8 0.0 if isinstance(self.family, Gaussian):
326 31 520 16.8 0.0 ll = -np.log(RSS)*n/2 - (1+np.log(np.pi/n*2))*n/2 #log likelihood
327 31 76 2.5 0.0 aic = -2*ll + 2.0 * (trS + 1)
328 31 75 2.4 0.0 aicc = -2.0*ll + 2.0*n*(trS + 1.0)/(n - trS - 2.0)
329 31 201 6.5 0.0 bic = -2*ll + (trS+1) * np.log(n)
330 31 40 1.3 0.0 cv = CV_score/n
331 elif isinstance(self.family, (Poisson, Binomial)):
332 aic = dev + 2.0 * trS
333 aicc = aic + 2.0 * trS * (trS + 1.0)/(n - trS - 1.0)
334 bic = dev + trS * np.log(n)
335 cv = None
336
337 31 58 1.9 0.0 return {'AICc': aicc,'AIC':aic, 'BIC': bic,'CV': cv}
For example https://github.com/pysal/gwr/blob/master/gwr/tests/test_gwr.py#L18
In meta pysal the tests fail as those datasets are not part of libpysal. It would be great if gwr could test on its own datasets that were distributed with its source.
Two speed-up ideas
Calculate only AIC/AICc/BIC/CV (search method) in bandwidth searching and calculate full diagnostics in the final GWR.fit()
. I am planning to submit a PR about this.
When calculating kernel for each bandwidth, distance matrix is calculated and sorted (if fixed = False
) every time, which can be calculated for only once and passed to kernel calculation function.
import pysal as ps
from gwr.gwr import GWR
from spglm.family import Gaussian, Poisson
y = CASES_POOL_A.Call2017.reshape((-1,1))
X=shp[['cx3','cx5','cx6']].values
labels = ['Intercept', 'cx3', 'cx5', 'cx6']
u = shp.LAT
v = shp.LON
coords = zip(u,v)
model = GWR(coords, y, X, bw=12, family=Poisson(),fixed=False)
results = model.fit()
Traceback (most recent call last):
File "<ipython-input-2-f5f6fbc75e94>", line 1, in <module>
runfile('/home/www_adm/ownCloud/NNIIEM_SCIENCE/00_MAIN_THEME/01_GEN_FOR_R.py', wdir='/home/www_adm/ownCloud/NNIIEM_SCIENCE/00_MAIN_THEME')
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/www_adm/ownCloud/NNIIEM_SCIENCE/00_MAIN_THEME/01_GEN_FOR_R.py", line 141, in <module>
model = GWR(coords, y, X, bw=12, family=Poisson(),fixed=False)
File "/usr/local/lib/python3.5/dist-packages/gwr/gwr.py", line 205, in __init__
self.W = self._build_W(fixed, kernel, coords, bw)
File "/usr/local/lib/python3.5/dist-packages/gwr/gwr.py", line 221, in _build_W
raise TypeError('Unsupported kernel function ', kernel)
TypeError: ('Unsupported kernel function ', 'bisquare')
We should probably also think about moving mgwr into the private version of this package.
If we need to, I can create the private pysal/mgwr or something, to store work until we're ready to make it public. That'd be a copy of this package + the relevant improvements for the multiscale implementation.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.