my-data-toolkit's Introduction

👋 Hi there

I'm 40%.

It's a nickname and, in a way, my real name too: my real name is pronounced much like 'forty percent' ('四成') in Chinese.

💬 About me

An informal developer. In Chinese, that is called a '野生的程序员' (a 'wild programmer').

Development is not my main job. It's my hobby.

I'm actually a data scientist, specializing in geographic data mining.

🙈 My creed

  • Do one thing and do it well. In other words, 'Less is more'.
  • See things through to the end. In other words, 'Never forget why you started'.

📈 My coding stats for the last 7 days

Other    35 hrs 14 mins  ███████████████████████▓░   95.31 %
Python   1 hr 4 mins     ▓░░░░░░░░░░░░░░░░░░░░░░░░   02.89 %
SQL      39 mins         ▒░░░░░░░░░░░░░░░░░░░░░░░░   01.80 %

my-data-toolkit's People

Contributors

dependabot[bot], github-actions[bot], pre-commit-ci[bot], zeroto521

my-data-toolkit's Issues

ENH: New accessor for Series `duplicated_groups`

Example:

>>> import pandas as pd
>>> s = pd.Series([1, 1, 2, 1])
>>> s
0    1
1    1
2    2
3    1
dtype: int64
>>> s.duplicated_groups()
0    0
1    0
2    1
3    0
dtype: int64
>>> s.value_counts()  # related method
1    3
2    1
dtype: int64

Implementation:

import pandas as pd


def duplicated_groups(s: pd.Series) -> pd.Series:
    # Map each distinct value to an integer label, then relabel the Series.
    base = set(s)
    return s.replace(dict(zip(base, range(len(base)))))
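
The set-based version above does not guarantee that labels follow the order in which values first appear, because a `set` has no defined order. A minimal alternative sketch, assuming labels should follow first appearance as in the example, uses `pandas.factorize`. The name `group_ids` is hypothetical and is not part of dtoolkit.

```python
import pandas as pd


def group_ids(s: pd.Series) -> pd.Series:
    # factorize assigns 0, 1, 2, ... in order of first appearance.
    # Missing values get the label -1 by default.
    codes, _ = pd.factorize(s)
    return pd.Series(codes, index=s.index)


s = pd.Series([1, 1, 2, 1])
print(group_ids(s).tolist())
```

Rows with equal values share a label, so `group_ids(s).value_counts()` would give the same counts as `s.value_counts()`, keyed by label instead of by value.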

Should the result of `values_to_dict` contain the column names?

Take a DataFrame like the following.

   x  y  z
0  A  a  1
1  A  b  2
2  B  c  3
3  B  d  3
4  B  d  4

`values_to_dict` could return the following.

{
    "A": {
        "a": ["1"],
        "b": ["2"],
    },
    "B": {
        "c": ["3"],
        "d": ["3", "4"],
    },
}
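
For reference, the nesting above can be reproduced with a short recursive `groupby`. This is only a sketch of the shape of the result, not dtoolkit's actual implementation, and the function name `nested_values` is hypothetical.

```python
import pandas as pd


def nested_values(df: pd.DataFrame, columns: list) -> "dict | list":
    # Group by the first column and recurse on the rest;
    # the last column becomes a plain list of values.
    head, *rest = columns
    if not rest:
        return df[head].tolist()
    return {
        key: nested_values(group, rest)
        for key, group in df.groupby(head, sort=False)
    }


df = pd.DataFrame(
    {
        "x": ["A", "A", "B", "B", "B"],
        "y": ["a", "b", "c", "d", "d"],
        "z": ["1", "2", "3", "3", "4"],
    }
)
print(nested_values(df, ["x", "y", "z"]))
```

The column names `x`, `y` and `z` appear only as arguments here and never in the returned dict, which is exactly the gap this issue raises.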

The result above loses the column names.
So should the result carry them, like the following?

[
    {
        "column": "x",
        "value": "A",
        "next": [
            {
                "column": "y",
                "value": "a",
                "next": {"column": "z", "value": "1"},
            },
            {
                "column": "y",
                "value": "b",
                "next": {"column": "z", "value": "2"},
            },
        ],
    },
    {
        "column": "x",
        "value": "B",
        "next": [
            {
                "column": "y",
                "value": "c",
                "next": {"column": "z", "value": "3"},
            },
            {
                "column": "y",
                "value": "d",
                "next": [
                    {"column": "z", "value": "3"},
                    {"column": "z", "value": "4"},
                ],
            },
        ],
    },
]

BUG: `GeoDataFrame.repeat` returns a `DataFrame`

>>> import dtoolkit.geoaccessor
>>> import geopandas as gpd
>>> from shapely import Point
>>> df = (
...     gpd.GeoDataFrame(
...         {
...             "label": ["a", "b"],
...             "geometry": [Point(100, 1), Point(122, 55)],
...         },
...         crs=4326,
...     )
...     .repeat(2)
... )
>>> df
  label        geometry
0     a   POINT (100 1)
0     a   POINT (100 1)
1     b  POINT (122 55)
1     b  POINT (122 55)
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
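
The issue does not say where the type is lost. As a rough illustration of one approach that keeps the subclass, the sketch below repeats rows through positional indexing, which goes through the frame's own constructor. `TaggedFrame` is a hypothetical stand-in for `GeoDataFrame` so that the example needs only pandas, and `repeat_rows` is not dtoolkit's API.

```python
import numpy as np
import pandas as pd


class TaggedFrame(pd.DataFrame):
    # Stand-in for a DataFrame subclass such as GeoDataFrame.
    @property
    def _constructor(self):
        return TaggedFrame


def repeat_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Positions 0, 0, 1, 1, ... select each row n times in a row.
    positions = np.repeat(np.arange(len(df)), n)
    return df.iloc[positions]


df = TaggedFrame({"label": ["a", "b"]})
out = repeat_rows(df, 2)
print(type(out).__name__, out.index.tolist())
```

Rebuilding the result with `pd.DataFrame(...)` at any step would drop the subclass, which is one possible (unconfirmed) explanation for the behaviour reported here.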

ENH: A function to rebuild a DataFrame from the dict that `values_to_dict` returns

>>> import json
>>> import dtoolkit.accessor
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "x" : ["A", "A", "B", "B", "B"],
...         "y" : ["a", "b", "c", "d", "d"],
...         "z" : ["1", "2", "3", "3", "4"],
...     }
... )
>>> df
   x  y  z
0  A  a  1
1  A  b  2
2  B  c  3
3  B  d  3
4  B  d  4

The requested function does the reverse: it takes the nested dict and returns the DataFrame above.

>>> import json
>>> import dtoolkit.accessor
>>> import pandas as pd
>>> df = func(
...     {
...         "A": {
...             "a": ["1"],
...             "b": ["2"],
...         },
...         "B": {
...             "c": ["3"],
...             "d": ["3", "4"],
...         },
...     }
... )
>>> df
   x  y  z
0  A  a  1
1  A  b  2
2  B  c  3
3  B  d  3
4  B  d  4
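
A minimal sketch of such a reverse function is shown below. The name `dict_to_frame` and its `columns` parameter are assumptions: the nested dict carries no column names, so the caller has to supply them.

```python
import pandas as pd


def dict_to_frame(data: dict, columns: list) -> pd.DataFrame:
    rows = []

    def walk(node, prefix):
        if isinstance(node, dict):
            # Each key is one value of the current column.
            for key, child in node.items():
                walk(child, prefix + [key])
        else:
            # A leaf list holds the values of the last column.
            for leaf in node:
                rows.append(prefix + [leaf])

    walk(data, [])
    return pd.DataFrame(rows, columns=columns)


nested = {
    "A": {"a": ["1"], "b": ["2"]},
    "B": {"c": ["3"], "d": ["3", "4"]},
}
print(dict_to_frame(nested, ["x", "y", "z"]))
```

The sketch assumes every branch has the same depth as the number of column names passed in.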

BUG: `DataFrame.filterin` doesn't work when filtering on a single column

from dtoolkit.accessor import FilterInAccessor

import pandas as pd

df = pd.DataFrame(
    {
        "num_legs": [2, 4, 2],
        "num_wings": [2, 0, 0],
    },
    index=["falcon", "dog", "cat"],
)

df.filterin({'num_legs': [2]})

# result:
# Empty DataFrame
# Columns: [num_legs, num_wings]
# Index: []

# expected:
#         num_legs  num_wings
# falcon         2          2
# cat            2          0
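
One plausible cause, which the issue does not confirm, is that `DataFrame.isin` with a dict marks every column missing from the dict as `False`, so requiring all columns to match leaves no rows. The sketch below restricts the check to the columns named in the condition. `filter_in` is a hypothetical helper, not dtoolkit's `filterin`.

```python
import pandas as pd


def filter_in(df: pd.DataFrame, condition: dict) -> pd.DataFrame:
    # Only test the columns that appear in the condition.
    mask = df[list(condition)].isin(condition).all(axis=1)
    return df[mask]


df = pd.DataFrame(
    {
        "num_legs": [2, 4, 2],
        "num_wings": [2, 0, 0],
    },
    index=["falcon", "dog", "cat"],
)

# Checking all columns: "num_wings" is not in the dict, so it is all False.
naive_mask = df.isin({"num_legs": [2]}).all(axis=1)
print(naive_mask.any())

print(filter_in(df, {"num_legs": [2]}))
```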

BUG: `feature_union` can't concatenate DataFrames with different indexes

if all(istype(i, PandasTypeList) for i in Xs):
    return pd.concat(Xs, axis=1)

When transformers process DataFrames, some of them may drop rows.
Using the feature union transformer on their outputs then causes the following problem.

0   1.0  0.0  0.0  1.0  0.0  ...    0.070607     0.0   1.0    1.0     1.0
1   0.0  1.0  0.0  1.0  0.0  ...    0.000000     0.0   1.0    1.0     1.0
2   0.0  0.0  1.0  0.0  1.0  ...    0.853865     1.0   1.0    1.0     1.0
3   0.0  0.0  1.0  0.0  1.0  ...    0.279593     0.0   0.0    1.0     0.0
4   0.0  0.0  1.0  1.0  0.0  ...    1.000000     0.0   1.0    1.0     0.0
5   1.0  0.0  0.0  0.0  1.0  ...    0.566105     0.0   0.0    1.0     0.0
6   0.0  1.0  0.0  1.0  0.0  ...    0.007911     0.0   1.0    0.0     1.0
7   0.0  1.0  0.0  1.0  0.0  ...    0.220168     0.0   1.0    0.0     1.0
8   0.0  1.0  0.0  1.0  0.0  ...    0.242736     0.0   1.0    0.0     1.0
9   1.0  0.0  0.0  1.0  0.0  ...    0.491557     0.0   1.0    0.0     1.0
10  1.0  0.0  0.0  0.0  1.0  ...         NaN     NaN   NaN    NaN     NaN
11  NaN  NaN  NaN  NaN  NaN  ...    0.184352     0.0   1.0    0.0     1.0

As shown, the rows with index 10 and 11 contain NaN.

To fix this, add a parameter that ignores the index when concatenating the data.
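
The effect and the proposed fix can be sketched with two small frames. `concat_ignore_index` is a hypothetical helper, not the actual parameter added to dtoolkit, and it assumes that all inputs have the same number of rows in the same order.

```python
import pandas as pd


def concat_ignore_index(frames: list) -> pd.DataFrame:
    # Drop each frame's index so that rows are paired by position.
    return pd.concat([f.reset_index(drop=True) for f in frames], axis=1)


# "left" lost its row 1 in an earlier step; "right" kept a default index.
left = pd.DataFrame({"a": [1.0, 2.0]}, index=[0, 2])
right = pd.DataFrame({"b": [3.0, 4.0]}, index=[0, 1])

naive = pd.concat([left, right], axis=1)  # aligns on index labels
fixed = concat_ignore_index([left, right])

print(naive)
print(fixed)
```

Pairing by position is only safe when every transformer dropped the same rows. Otherwise rows from different samples would be joined silently.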

ENH: use squash merge to keep a clean git commit history

When v0.0.8 was released, dtoolkit had 1812 commits.

The git commit history is long and messy.
Code development is not linear. Several PRs may be merged into the master branch at the same time.
Some PRs even contain merge commits from other branches.
When these PRs are merged into master, the master branch becomes a mess.

So, to keep the git commit history easy to follow, PRs should be merged with the squash merge method instead of the regular merge method.

ENH: New geoaccessor `radius`

Get the radius of a polygon, for example the radius of a city.

import geopandas as gpd
import pandas as pd
from shapely import LinearRing, points

from dtoolkit.geoaccessor.register import register_geoseries_method


@register_geoseries_method
def radius(s: gpd.GeoSeries, /) -> pd.DataFrame:
    if s.crs != 4326:
        raise ValueError(f"Only support 'EPSG:4326' CRS, but got {s.crs!r}.")

    # `convex_hull` is used to merge multi-polygon to one polygon.
    return s.convex_hull.exterior.apply(_radius)


def _radius(ring: LinearRing, /) -> pd.Series:
    # Summary statistics of the distances from the centroid to each vertex.
    return (
        gpd.GeoSeries(points(ring.coords), crs=4326)
        .geodistance(ring.centroid)
        .describe()
    )

The result will be larger than the real radius.

Take shapely.box(0, 0, 4, 2), a rectangle.
Its exterior has only four vertices, the corners.
The distance between the centroid (2, 1) and each corner is $\sqrt {5}$.
That is larger than the real radius.
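
The arithmetic can be checked with plain planar geometry. The issue does not define "real radius", so the two comparison values below (the radius of a circle with the same area, and the inscribed radius) are assumed interpretations, included only to show that the corner distance exceeds both.

```python
import math

# Rectangle from shapely.box(0, 0, 4, 2), treated as planar coordinates.
width, height = 4.0, 2.0

# Distance from the centroid (2, 1) to any corner.
corner_distance = math.hypot(width / 2, height / 2)

# Assumed interpretations of a "real" radius.
equal_area_radius = math.sqrt(width * height / math.pi)
inscribed_radius = min(width, height) / 2

print(corner_distance, equal_area_radius, inscribed_radius)
```

Vertex-based distances describe the circumscribed circle, which is the largest of these three measures for any rectangle.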

BUG: `DataFrame.to_geoframe` leaves a GeoDataFrame unchanged

>>> import dtoolkit.geoaccessor
>>> import geopandas as gpd
>>> from shapely import Point
>>> df = gpd.GeoDataFrame(
...     {
...         "label": ["a", "b"],
...         "geometry": [Point(100, 1), Point(122, 50)],
...     },
...     crs=4326,
... )
>>> df
  label                    geometry
0     a   POINT (100.00000 1.00000)
1     b  POINT (122.00000 50.00000)
>>> df.to_geoframe(gpd.GeoSeries([Point(0, 0), Point(1, 1)], crs=3857))
# the geometry is not replaced at all
  label                    geometry
0     a   POINT (100.00000 1.00000)
1     b  POINT (122.00000 50.00000)
