clusterpy / clusterpy

Library of spatially constrained clustering algorithms

License: BSD 3-Clause "New" or "Revised" License

Python 99.91% Shell 0.09%

clusterpy's Introduction

ClusterPy

Analytical regionalization is a scientific way to decide how to group a large number of geographic areas or points into a smaller number of regions based on similarities in one or more variables (e.g. income, ethnicity, or environmental condition) that the researcher believes are important for the topic at hand. Conventional conceptions of how areas should be grouped into regions may either not be relevant to the information one is trying to illustrate (e.g. using political regions to map air pollution) or may actually be designed in ways that bias aggregated results.

Current algorithms

AZP: Openshaw and Rao (1995)
AZP-Simulated Annealing: Openshaw and Rao (1995)
AZP-Tabu: Openshaw and Rao (1995)
AZP-R-Tabu: Openshaw and Rao (1995)
Max-p-regions (Greedy): Duque, Anselin and Rey (2010)
Max-p-regions (Tabu): Duque, Anselin and Rey (2010)
Max-p-regions (Simulated Annealing): Duque, Anselin and Rey (2010)
AMOEBA: Aldstadt and Getis (2006)
SOM: Kohonen (1990)
geoSOM: Bacao (2004)
Random

Special Features

Customized 'Analytical' Regionalizations based on following user specifications/inputs:
Key areal attribute to regionalize on: The user regionalizes (or clusters) data based on the variables she considers important for the problem at hand (i.e. use your own 'analytical' regions instead of normative or administrative regions).
Maximum or minimum number of regions.
Threshold conditions: the maximum or minimum value that every regional cluster must meet for a given variable (e.g. a social or business project might require every region to have at least 100,000 people, and an ecological project might require every region to cover at least 100 square miles).
Spatial contiguity constraints (W matrix, GAL, GWT formats); if none are supplied, they will be created for you based on the shared geographic borders of your areal units.
Time-series signature clustering: areas can be clustered not only by a cross-sectional variable, but also by the correlation of the time-series signatures of that variable.
Create New ESRI shapefiles:

Related information

Citing Clusterpy

Please cite ClusterPy when using the software in your work:

Duque, J.C.; Dev, B.; Betancourt, A.; Franco, J.L. (2011). ClusterPy: Library of spatially constrained clustering algorithms, Version 0.9.9. RiSE-group (Research in Spatial Economics). EAFIT University. http://www.rise-group.org.

A BibTeX entry for LaTeX users is:

@Manual{ClusterPy,
title = {ClusterPy: {Library} of spatially constrained clustering algorithms,
{Version} 0.9.9.},
author = {Juan C. Duque and Boris Dev and Alejandro Betancourt and Jose L. Franco},
organization = {RiSE-group (Research in Spatial Economics). EAFIT University.},
address = {Colombia},
year = {2011},
url = {http://www.rise-group.org}
}

License information

See the file "LICENSE.txt" for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.

clusterpy's Issues

Gurobipy not present

Handle the case that the user doesn't have gurobipy.
Also handle the case when they try to use it without gurobipy.

This is important because gurobipy is a very specific import and a lot of people may not have it.
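
A minimal sketch of one way to handle both cases, assuming the solver is only needed by some algorithms. The helper names `optional_import` and `require_gurobipy` are hypothetical and are not part of ClusterPy.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Importing the package never fails; the solver is simply marked as absent.
gurobipy = optional_import("gurobipy")


def require_gurobipy():
    """Hypothetical guard to call at the start of any solver-backed algorithm."""
    if gurobipy is None:
        raise ImportError(
            "gurobipy is not installed; this algorithm needs it, "
            "but the other algorithms can still be used."
        )
    return gurobipy
```

With this layout the import of the library succeeds without gurobipy, and a user who calls a solver-backed algorithm gets a readable error instead of a failure at import time.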

Move arisel and maxp

The implementation of the algorithms does not allow for proper testing.
This relates to #30

random seed

When running an algorithm multiple times from the same script, the random number generator takes its seed from the process ID. This bug comes from the multicore implementation.
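
One possible direction, sketched under the assumption that each run can be handed its own generator: derive the seed from a user-supplied base seed and the run index, so it no longer depends on the process ID. `make_run_rng` and `toy_initial_solution` are hypothetical names used only for illustration.

```python
import random


def make_run_rng(base_seed, run_index):
    """Return a generator whose seed depends only on the base seed and run index."""
    return random.Random(base_seed * 1000003 + run_index)


def toy_initial_solution(rng, n_areas, n_regions):
    """Stand-in for an algorithm's random start: one region label per area."""
    return [rng.randrange(n_regions) for _ in range(n_areas)]


# Two passes over the same run indices give identical starting solutions.
first_pass = [toy_initial_solution(make_run_rng(42, i), 10, 3) for i in range(3)]
second_pass = [toy_initial_solution(make_run_rng(42, i), 10, 3) for i in range(3)]
```

Because every run owns a `random.Random` instance, worker processes do not share or reseed the module-level generator.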

Warning notice for unavailable required libraries

What steps will reproduce the problem?

Install Clusterpy without having any of the required libraries.
import clusterpy

What is the expected output? What do you see instead?
Clusterpy should not install if the required libraries are not installed.
At the moment the library installs without any warning or notice.

[This issue was raised by a user who was trying to install the library and
contacted the group directly]
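
A hedged sketch of a check that could run at import time and warn about missing dependencies. The issue does not list the required libraries, so the module names below are placeholders, and `missing_modules` and `warn_if_missing` are hypothetical helpers.

```python
import importlib.util
import warnings

# Placeholder list: the real set of required libraries is an assumption here.
REQUIRED_MODULES = ["numpy", "scipy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def warn_if_missing(names):
    """Emit one warning that lists every missing module, then return the list."""
    missing = missing_modules(names)
    if missing:
        warnings.warn("Missing required libraries: " + ", ".join(missing))
    return missing
```

Declaring the same dependencies in the packaging metadata would address the install-time half of the request; this sketch only covers the warning at import time.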

Using disjoint polygons or centroids will fail?

When my initial geometry contains non-touching polygons, or simply centroid points instead of polygons, it seems the algorithm does not work anymore?

See the example below, which modifies the California cluster example by:

making each county of California slightly smaller, hence not touching
using the counties' centroids

Note that the code will not break, but just runs indefinitely? Is there a problem with how the distance metric is measured? I thought that conceptually, the algorithms would run independently of the geometry of the initial dataset?

Thanks!

import os
os.chdir("path_to_clusterpy/clusterpy/")
os.getcwd()
#> '/home/matifou/gitReps/clusterpy/clusterpy'
if not os.path.exists("tempDir"):
    os.mkdir("tempDir")
import geopandas
import clusterpy
#> ClusterPy: Library of spatially constrained clustering algorithms

calif_gpd = geopandas.read_file("data_examples/CA_Polygons.shp")

## Smaller buffers
calif_gpd_buf = calif_gpd.copy()
calif_gpd_buf['geometry'] = calif_gpd_buf["geometry"].simplify(2000, preserve_topology=False).buffer(-6000)
calif_gpd_buf = calif_gpd_buf.set_geometry('geometry')
calif_gpd_buf.plot()
#> <matplotlib.axes._subplots.AxesSubplot at 0x7f54e9ec28d0>

## Centroid
calif_centroid = calif_gpd.copy()
calif_centroid['geometry'] = calif_gpd.centroid
calif_centroid = calif_centroid.set_geometry('geometry')

calif_centroid.plot();

## Write to disk
calif_gpd_buf.to_file("tempDir/CA_Polygons_buffer.shp")#
calif_centroid.to_file("tempDir/CA_Polygons_centroid.shp")#

## load with clusterPy
calif = clusterpy.importArcData("data_examples/CA_Polygons")
#> Loading data_examples/CA_Polygons.dbf
#> Loading data_examples/CA_Polygons.shp
#> Done
calif_buffer = clusterpy.importArcData("tempDir/CA_Polygons_buffer")
#> Loading tempDir/CA_Polygons_buffer.dbf
#> Loading tempDir/CA_Polygons_buffer.shp
#> Done
calif_centroid = clusterpy.importArcData("tempDir/CA_Polygons_centroid")
#> Loading tempDir/CA_Polygons_centroid.dbf
#> Loading tempDir/CA_Polygons_centroid.shp
#> Done

## Run

### Classic: works
calif.cluster('arisel', ['PCR2002'], 9, wType='rook', inits=10, dissolve=1)
#> Getting variables
#> Variables successfully extracted
#> Running original Arisel algorithm
#> Number of areas:  58
#> Number of regions:  9
#> initial Solution:  [8, 4, 4, 1, 4, 1, 8, 5, 7, 4, 1, 1, 4, 4, 4, 4, 1, 1, 0, 4, 3, 4, 1, 4, 1, 4, 0, 7, 7, 0, 7, 1, 4, 7, 4, 4, 0, 2, 4, 0, 2, 0, 8, 6, 1, 1, 1, 7, 7, 4, 1, 1, 1, 4, 4, 0, 7, 1]
#> initial O.F:  0.5022200552944863
#> FINAL SOLUTION:  [8, 4, 4, 1, 4, 1, 8, 5, 7, 4, 5, 1, 4, 4, 4, 4, 1, 5, 0, 4, 3, 4, 1, 4, 5, 4, 0, 7, 7, 0, 7, 1, 4, 7, 0, 4, 0, 2, 4, 0, 2, 0, 8, 6, 1, 5, 5, 7, 7, 4, 1, 5, 5, 4, 4, 0, 1, 5]
#> FINAL OF:  0.4011089126984128
#> Done
#> Adding variables
#> Done
#> Dissolving lines
#> Done
calif.results[0]
#> <clusterpy.core.layer.Layer instance at 0x7f54e9e43dc0>

Created on 2020-03-02 by the reprexpy package

Try now:

calif_buffer.cluster('arisel', ['PCR2002'], 9, wType='rook', inits=10, dissolve=1)
calif_buffer.results[0]

or:

calif_centroid.cluster('arisel', ['PCR2002'], 9, wType='rook', inits=10, dissolve=1)
calif_centroid.results[0]
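
A possible explanation, which this thread does not confirm: a rook contiguity structure is built from shared borders, so shrunken polygons or centroids would leave every area without neighbors, and a spatially constrained algorithm could then never merge areas into 9 contiguous regions. The sketch below only shows how to test that hypothesis on a neighbor dictionary; `connected_components` is a hypothetical helper and the two dictionaries are toy data, not the California layer.

```python
from collections import deque


def connected_components(neighbors):
    """Group area ids into components of the contiguity graph (breadth-first search)."""
    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = [start]
        while queue:
            area = queue.popleft()
            for other in neighbors[area]:
                if other not in seen:
                    seen.add(other)
                    component.append(other)
                    queue.append(other)
        components.append(sorted(component))
    return components


touching = {0: [1], 1: [0, 2], 2: [1]}   # polygons that share borders
disjoint = {0: [], 1: [], 2: []}         # buffered polygons or centroids

print(len(connected_components(touching)))  # one connected component
print(len(connected_components(disjoint)))  # one component per area
```

If the number of components exceeds the requested number of regions, no contiguous solution exists for that contiguity structure, whatever the algorithm.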

Move repository to clusterpy/clusterpy

Move to its own clusterpy/clusterpy

Matplotlib

Present the map with matplotlib on the interactive python console.
This will be useful when used inside things like the IPython notebook to present the map inline.

Change License headers to reflect the project licensing

File headers state a GPL license, but the project license is in fact BSD.

init method for RegionMaker too long

Refactor init method

[Idea] Each NumRegionType could be a separate function that grows the regions depending on the parameter.

Try a code formatter

https://github.com/google/yapf

Python 3.x support

I see clusterpy is not getting updated anymore, but I find its functionality is still useful. Are there any plans of supporting Python 3.x? or maybe development has moved somewhere else and I've missed it?

Anyway, I'm willing to give it a try and port this to Python 3.6, any pointers on what the roadblocks might be?

Thanks!

Tests for maxp alg.

Need to add tests for the maxp clustering algorithm.

Add Travis CI

Great way to show people the status of the project

A question about the realization function in the SAR data module.

I'm a beginner in spatial regression models, and I have a problem with the simulation formulation in the realization function in the SAR data module.
Why does the response variable Y equal the product of self.cvcv (the Cholesky factor of self.vcv) and the normally distributed random series e?

In the definition of the SAR class, (I-rho*W) and its inverse matrix AI are already calculated. So, can I get the response variable Y by simply multiplying AI and the normally distributed random series e? What is the meaning of the vcv matrix, and why is its Cholesky factorisation needed?

Finally, I found that the parameter meanY seems to be unused in the DGP and SAR initialization step. How can we give a basic mean value to the response variable Y during the simulation process?

Thanks for your time, and looking forward to your response.
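
The following derivation is a sketch of the standard SAR algebra, not a reading of the module's code. It assumes that vcv denotes the covariance matrix of the SAR process, that A = I - rho*W, and that C stands for the Cholesky factor stored in self.cvcv; these correspondences are assumptions.

```latex
\begin{align}
Y &= \rho W Y + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I) \\
Y &= A^{-1}\varepsilon, \qquad A = I - \rho W \\
\operatorname{Cov}(Y) &= \sigma^2 A^{-1} A^{-\top} = \sigma^2 (A^{\top} A)^{-1} = \Sigma \\
\Sigma = C C^{\top} &\;\Rightarrow\; \operatorname{Cov}(C e) = C \operatorname{Cov}(e)\, C^{\top} = C C^{\top} = \Sigma, \qquad e \sim N(0, I)
\end{align}
```

Under these assumptions, multiplying the Cholesky factor by e and multiplying the inverse of A by a noise vector with variance sigma^2 both produce draws from the same distribution N(0, Sigma), so either route is a valid way to simulate Y. A nonzero mean could then be introduced by adding a constant vector to the draw, although whether meanY was meant to be used that way is not stated here.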

maxpTabu not giving expected number of areas per region

The minimum size for a region under the maxptabu algorithm should be the same as the threshold. This is a bug
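
A small check that could back a regression test for this bug, assuming the threshold applies to the sum of a variable over each region. `regions_below_threshold` is a hypothetical helper and the numbers are made up.

```python
from collections import defaultdict


def regions_below_threshold(solution, values, threshold):
    """Return {region: total} for every region whose summed value is under the threshold."""
    totals = defaultdict(float)
    for region, value in zip(solution, values):
        totals[region] += value
    return {region: total for region, total in totals.items() if total < threshold}


solution = [0, 0, 1, 1, 2]          # region label of each area
values = [60, 50, 40, 70, 30]       # threshold variable of each area
violations = regions_below_threshold(solution, values, threshold=100)
```

An empty result means every region satisfies the minimum; here region 2 is reported because its total is below 100.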

Versioning

Fix versioning.

Add a simple way to check Clusterpy's version.

Multiple cores and only one core

When running arisel with the multicore version, but the system only finds one core, bad things happen.
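
A sketch of a serial fallback, assuming the parallel runs are independent calls of one function. `resolve_workers` and `run_many` are hypothetical names; only `multiprocessing.cpu_count` and `multiprocessing.Pool` come from the standard library.

```python
import multiprocessing


def resolve_workers(requested=None, available=None):
    """Clamp the requested worker count to the range [1, available cores]."""
    if available is None:
        available = multiprocessing.cpu_count()
    if requested is None:
        requested = available
    return max(1, min(requested, available))


def run_many(func, arguments, requested_workers=None):
    """Run func over arguments, using a process pool only when it can help."""
    workers = resolve_workers(requested_workers)
    if workers == 1:
        # Single core: plain loop, no pool, no inter-process overhead.
        return [func(argument) for argument in arguments]
    with multiprocessing.Pool(workers) as pool:
        return pool.map(func, arguments)
```

On a single-core machine the multicore entry point then behaves exactly like the serial one.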

Region maker as best possible solution

A common strategy with the algorithms is to create multiple instances to test for the best possible configuration and then work on that one. It would be better if the process of getting the best possible region was in the creation of the region maker itself.
That way avoiding the need to create multiple instances and deciding which to use.
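
The idea can be sketched as a single helper that builds several candidates and keeps the best one, assuming a lower objective value is better. `best_of`, `build_solution`, and `objective` are hypothetical names and do not mirror the RegionMaker interface.

```python
def best_of(build_solution, objective, attempts):
    """Build `attempts` candidate solutions and return (best_solution, best_value)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    best_solution = None
    best_value = None
    for attempt in range(attempts):
        candidate = build_solution(attempt)
        value = objective(candidate)
        # Strict comparison keeps the earliest candidate when values tie.
        if best_value is None or value < best_value:
            best_solution, best_value = candidate, value
    return best_solution, best_value


# Toy usage: candidates are pairs, the objective is their absolute difference.
solution, value = best_of(lambda i: [i, 3 - i], lambda s: abs(s[0] - s[1]), 4)
```

Callers would then receive one object to work on instead of creating several instances and comparing them by hand.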

Tests for gurobipy, minp and pregions

Some files in the tests directory should be in the correct form for nosetests.

examples of usage

An entry on the wiki with a list of examples (I suggest notebooks) showing different ways of using the library.
A big contribution to this is getting #14 done.

Curate data examples

Many sample files are in the data_examples directory.
For the pypi version we need to slim down this, since this would mean downloading many files.

Add debug flag

A debug flag would be very useful in case of setting a random seed and trying to get consistent results between executions.

Objective functions with 'f' added in the end?

When adding a new objective function, in some cases the method to compute
the objective function will try to fetch the method by appending an 'f' to the end.

def getObjectiveFast(self, region2AreaDict, modifiedRegions=[]):
 [code]
                _fun = objectiveFunctionTypeDispatcher[_objFunType+'f']
                distance=_fun(self, region2AreaDict, modifiedRegions, indexData)
 [code]

This has to be either documented and/or fixed somehow.
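
One way to make the convention explicit, sketched with hypothetical names: register the full and the fast variant together under one key, so nothing depends on a naming suffix. This is not how the current dispatcher is written.

```python
OBJECTIVE_FUNCTIONS = {}  # name -> {"full": callable, "fast": callable or None}


def register_objective(name, full, fast=None):
    """Register an objective function, optionally with an incremental variant."""
    OBJECTIVE_FUNCTIONS[name] = {"full": full, "fast": fast}


def get_objective(name, prefer_fast=False):
    """Return the fast variant when requested and available, else the full one."""
    try:
        entry = OBJECTIVE_FUNCTIONS[name]
    except KeyError:
        raise KeyError("unknown objective function: %r" % name)
    if prefer_fast and entry["fast"] is not None:
        return entry["fast"]
    return entry["full"]


def sum_of_squares(values):
    """Toy objective used only to demonstrate the registry."""
    return sum(value * value for value in values)


register_objective("SS", sum_of_squares)  # no fast variant supplied
```

A newly added objective function without a fast variant then falls back to the full computation instead of failing on a missing key.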

Layer variables for colors in Matplotlib

Use a specific variable in the layer as the value for the colormap.
This will allow to present the data with intensity levels with Matplotlib.

Very useful to present indexes data. Related #14

How to change versions

Entry in the wiki listing the different places where the version should be updated.

clusterpy/__init__ file
setup.py file
pypi version

Remove unnecessary output

Output like:

Done
Adding variables
Done
Dissolving lines
Done

Only makes it difficult to use Clusterpy in a unix fashion, using pipes and redirections.
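
A sketch of routing progress messages through the standard `logging` module, so they go to stderr and stay silent by default. The logger name and the functions `configure_output` and `dissolve_lines` are hypothetical.

```python
import logging
import sys

logger = logging.getLogger("clusterpy_sketch")  # hypothetical logger name


def configure_output(verbose=False):
    """Send progress messages to stderr and show them only when verbose is set."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.INFO if verbose else logging.WARNING)


def dissolve_lines():
    """Stand-in for a processing step that used to print its progress."""
    logger.info("Dissolving lines")
    result = "dissolved"
    logger.info("Done")
    return result
```

Standard output then carries only results, which keeps pipes and redirections clean, while a verbose option brings the progress messages back.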

maxpTabu using queen

When running the maxptabu algorithm, the execution assumes, and works, with a rook.
If only the queen is available, the execution will never get to a result.

Python random or numpy random

Decide between the two.
Would make it easier to set a seed and test.

Specifics of papers

Some functionality of Clusterpy, especially parameters, is specific to a publication or project. Developments of this kind, when they are not general to the usage of the library, should live on a branch of their own or be separated somehow.

Fix broken links

The links on the README are broken

Unknown function stdobs

A strange function used to standardize is referenced but it
does not exist.

In file clusterpy/core/toolboxes/cluster/componentsAlg/distanceFunctions.py
in the distanceA2AEuclideanSquared function.

...
    if std:
        x = nparray(x)
        x = stdobs(x)  #  standardize
        x = x.tolist()
...

NameError: name 'stdobs' is not defined
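
The intended behavior of stdobs is not documented in this issue. The sketch below assumes it was meant to standardize each variable (column) to zero mean and unit standard deviation; the name `standardize_columns`, the use of the population standard deviation, and the handling of constant columns are all assumptions.

```python
import statistics


def standardize_columns(rows):
    """Z-score every column of a list of observation rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    stds = [statistics.pstdev(column) for column in columns]
    return [
        # A constant column has zero spread, so it is mapped to 0.0.
        [(value - mean) / std if std else 0.0
         for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


observations = [[1.0, 10.0], [3.0, 30.0]]
standardized = standardize_columns(observations)
```

Whether the original code expected population or sample standard deviation cannot be determined from the snippet in this issue.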

dissolveLayer list.remove(x): x not in list

What steps will reproduce the problem?

Setting dissolve to 1 on some map instances.

The cause or type of map/configuration is not clear.

ERROR MESSAGE:
Dissolving lines
clusterPy is not able to dissolve your map based on this solution.Please execute the command Layer.exportArcData(dissolveProblem) and send us the resulting files to [email protected] to analyse the problem and give you a solution as soon asposible. Your feedback is important for us.

TRACEBACK:
Traceback (most recent call last):
  File "performance_script.py", line 46, in <module>
    instance.cluster('azp', ['SAR1'], pReg, dissolve=1)
  File "clusterpy/source/clusterpy/core/layer.py", line 1240, in cluster
    self.dissolveMap(dataOperations=dataOperations)
  File "clusterpy/source/clusterpy/core/layer.py", line 222, in dissolveMap
    dissolveLayer(self, sh, self.region2areas)
  File "clusterpy/source/clusterpy/core/geometry/dissolve.py", line 70, in dissolveLayer
    raise ve
ValueError: list.remove(x): x not in list

Tests for AZP algs.

Need to add tests for the AZP* clustering algorithms.

AZP
AZP Tabu
AZP R-Tabu
AZP SA

Problem dissolving solution

I can't dissolve the shapefile for a solution, am I missing anything? I'm posting here an example using a toy example data (from pysal), but I have not been able to get it to work with other datasets too.

import pysal as ps
import clusterpy as clp
col = clp.importArcData(ps.examples.get_path('columbus'))
col.cluster('azpRTabu', ['HOVAL', 'CRIME'], 2, dissolve=1)

Which prints out the following output:

Getting variables
Variables successfully extracted
Running original AZP-R-Tabu algorithm (Openshaw and Rao, 1995)
Number of areas:  49
Number of regions:  2
Constructing regions
initial Solution:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
initial O.F:  24864.1208571
Performing local search
FINAL SOLUTION:  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
FINAL OF:  24572.9644776
Done
Adding variables
Done
Dissolving lines
Problem: Amount of assigned regions does not match number of areas
Regions: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
Done

generateData : local variable 'y' referenced before assignment

What steps will reproduce the problem?

l = clusterpy.createGrid(4,4)
l.generateData("SAR", "rook", 1, 0.7)
l.generateData("SAR1", "rook", 1, 0.7)

GIVES:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "clusterpy/core/layer.py", line 539, in generateData
    self.Y[i] = self.Y[i] + y[i]
UnboundLocalError: local variable 'y' referenced before assignment

What is the expected output? What do you see instead?
A layer with two data vars, SAR and SAR1

Local search as functions

The local search procedures should be outside the region maker.
The region maker is handling more things than it should and having the local search outside will help towards #30

[Workflow] Cluster templates

All the algorithms in the cluster module should be presented as 'templates' rather than as the implementation of the algorithm itself.
A user should be able to recreate any algorithm with functions solely from the componentsAlg module.
E.g.

Creating a layer
Use any clustering algorithm
Run one or multiple times any local search algorithm on the layer.

This workflow is not possible with the current implementation.
