
Conda Package Download Data

Creative Commons License

This repository describes the conda package download data provided by Anaconda, Inc. It includes package download counts starting from Jan. 2017 for the following download sources:

  • Anaconda Distribution: The default channels hosted on repo.anaconda.com (and historically on repo.continuum.io)
  • Select Anaconda.org channels: Currently this includes conda-forge and bioconda.

Check out an example notebook using this data on Binder.

Data Format

The download data is provided as a record for every unique combination of:

  • data_source: anaconda for Anaconda distribution, conda-forge for the conda-forge channel on Anaconda.org, and bioconda for the bioconda channel on Anaconda.org.
  • time: UTC time, binned by hour
  • pkg_name: Package name (Ex: pandas)
  • pkg_version: Package version (Ex: 0.23.0)
  • pkg_platform: One of linux-32, linux-64, osx-64, win-32, win-64, linux-armv7, linux-ppc64le, linux-aarch64, or noarch
  • pkg_python: Python version required by the package, if any (Ex: 3.7)
  • counts: Number of downloads for this combination of attributes
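
As an illustrative sketch (the rows below are synthetic, not real download counts), records in this schema can be aggregated with pandas:

```python
import pandas as pd

# Synthetic rows mimicking the schema above; the counts are made up.
df = pd.DataFrame({
    "data_source": ["anaconda", "conda-forge", "conda-forge"],
    "time": pd.to_datetime(
        ["2019-12-01 00:00", "2019-12-01 00:00", "2019-12-01 01:00"], utc=True
    ),
    "pkg_name": ["pandas", "pandas", "pandas"],
    "pkg_version": ["0.23.0", "0.23.0", "0.23.0"],
    "pkg_platform": ["linux-64", "osx-64", "noarch"],
    "pkg_python": ["3.7", "3.7", "3.7"],
    "counts": [10, 5, 7],
})

# Total downloads per package, summed over all other attributes.
totals = df.groupby("pkg_name")["counts"].sum()
```

The same pattern (group by any subset of the attribute columns and sum `counts`) applies to the real files.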

The storage format is Parquet, one file per day, with SNAPPY compression. Files are hosted on S3, with the naming convention:

  • s3://anaconda-package-data/conda/hourly/[year]/[month]/[year]-[month]-[day].parquet
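
For example, a single day's file can be read directly (a sketch assuming anonymous S3 access with dask and s3fs installed; the `daily_path` helper is ours, not part of the dataset):

```python
from datetime import date

def daily_path(d: date) -> str:
    """Build the S3 key for one day's file, following the naming
    convention above (month and day are zero-padded)."""
    return (
        "s3://anaconda-package-data/conda/hourly/"
        f"{d.year}/{d.month:02d}/{d.year}-{d.month:02d}-{d.day:02d}.parquet"
    )

# With dask and s3fs installed, the public bucket can be read anonymously:
# import dask.dataframe as dd
# df = dd.read_parquet(daily_path(date(2018, 12, 31)),
#                      storage_options={"anon": True})
```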

Data Catalog

To simplify using the dataset, we have also created an Intake catalog file. You can either load it directly from the repository, if you have the intake, intake-parquet, and python-snappy packages installed:

import intake

cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
monthly = cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()

Or you can install the data package directly with conda, which will also fetch the required dependencies:

conda install -c intake anaconda-package-data

And then the data source will appear in the global catalog of your conda environment:

import intake

monthly = intake.cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()

To minimize bandwidth usage, these catalogs are configured so that Intake will cache data locally to your system on first use.

Known Issues

There are some known gaps in the dataset, and Anaconda.org data doesn't appear in the dataset until April 2017. See KNOWN_ISSUES.md for more details.

Updates

This data will be updated approximately monthly. Note that we may revise historical data if processing issues are discovered, or to add additional data (like new Anaconda.org channels). We will update the change log when new or revised data is posted.

License

This dataset is licensed under a Creative Commons Attribution 4.0 International License. We are offering this data to help the community understand the usage of conda packages, but with no warranty. If you use this data, please acknowledge Anaconda as the source and link back to this GitHub repository.

Feedback

If you have questions or find problems in the data, please open an issue on this repository. Thanks!

anaconda-package-data's People

Contributors

bkreider, datapythonista, jsignell, mariusvniekerk, seibert, sophiamyang, tswast


anaconda-package-data's Issues

Access denied when trying to load 2019 data

When running this (both from Binder and in my own conda environment, Python 3.7, on both Windows and Linux):

cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
df = cat.anaconda_package_data_by_year(year=2019).to_dask()

I get the following error:

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

This used to work a month ago or so. Any ideas what's wrong?
It seems to work fine if I set year=2018.

Thanks!

Anaconda for Linux Mint

Installing Anaconda on Linux Mint (a distro based on Ubuntu) runs into problems because the keyword "linuxmint" is missing from vscode.py, where the OS type is looked up. Presently, only "debian" and "ubuntu" are listed in this file for the branch of Linux distros using deb package managers. As a result, running anaconda-navigator fails without this keyword.

The file supplied has the additions needed for Linux Mint. It runs perfectly. Location:
~/anaconda3/lib/python3.7/site-packages/anaconda_navigator/api/external_apps/vscode.py
vscode.py.zip

Anything wrong with the data from March to today?

  • condastats version: 1.2.1
  • Python version: 3.12.3
  • Operating System: manjaro

Description

Using condastats, the data show an exponential increase in downloads over the last few months. While we're confident in the quality of our package ;-), this seems unrealistic and, in any case, unexpected (×100 between 2023-12 and 2024-05!).

Do you have any idea why these variations are occurring?

What I Did

condastats overall pyagrum --monthly

[...]
          2023-08      2484
          2023-09      2433
          2023-10      4560
          2023-11      3154
          2023-12      1114
          2024-01      2829
          2024-02      2812
          2024-03     12573
          2024-04     66098
          2024-05    110944

Thank you for any hints, explanations, or information on this subject.

(Copy of conda-incubator/condastats#22)

Add plotly channel

pyviz.org fetches data from here to display the download stats of various visualization and dashboard packages. Among those, plotly has its own conda channel and gets downloaded from there a non-negligible number of times. Could the plotly channel be added to the dataset?

Pandas/pyArrow/read_parquet error

As requested by @sophiamyang, I am passing on an issue I opened for condastats, since that package depends on the data pipeline in this very repo:

  • condastats version: 0.2.1
  • Python version: Python 3.11.3
  • Operating System: linux (Manjaro/Plasma)

Description

Unable to use condastats.cli.overall (internal error on pandas->pyArrow)

    dataconda = condastats.cli.overall([conda_module], monthly=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/lib/python3.11/site-packages/condastats/cli.py", line 62, in overall
    df = dd.read_parquet(
         ^^^^^^^^^^^^^^^^
  File "[...]/python3.11/site-packages/dask/backends.py", line 138, in wrapper
    raise type(e)(
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type

Recent versions of Pandas experience PyArrow errors when `intake` and `condastats` use anaconda-package-data

Thank you for making this data and the documented methods available - fantastic stuff!

I noticed, when attempting to use the intake methods from the README.md, that there are Pandas PyArrow errors with recent versions of Pandas (>=v2.0.0). This also appears to affect condastats, though perhaps through different means. I imagine, but don't know, that this could be a Pandas or Dask DataFrame issue at the core; I also wondered about data type management within the Parquet files in this repo (for example, are there incompatible types which users should be made aware of?). While the fix might be external, maybe this issue could help with increased or updated documentation here.

Specifically, the errors I most often saw were:

ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type

There also may have been errors regarding "Pandas categorical types".

I worked around the issue by looking at the last modified date of the README.md (around January 2020) and installing a version of Pandas from around that time (v1.3.5 worked for me).

PermissionError: Access Denied

Hi,

I am using the condastats package, which relies on anaconda-package-data. When running

import condastats.cli
condastats.cli.overall('numpy')

I get the error message

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 110, in _error_wrapper
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/aiobotocore/client.py", line 265, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/condastats", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/condastats/cli.py", line 387, in main
    overall(
  File "/usr/local/lib/python3.8/site-packages/condastats/cli.py", line 87, in overall
    df = df.compute()
  File "/usr/local/lib/python3.8/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dask/base.py", line 598, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 89, in __call__
    return read_parquet_part(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 587, in read_parquet_part
    dfs = [
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 588, in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 435, in read_partition
    arrow_table = cls._read_table(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 1518, in _read_table
    arrow_table = _read_table_from_path(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 239, in _read_table_from_path
    return pq.ParquetFile(fil, **pre_buffer).read(
  File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 277, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1213, in pyarrow._parquet.ParquetReader.open
  File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1578, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/usr/local/lib/python3.8/site-packages/fsspec/caching.py", line 41, in _fetch
    return self.fetcher(start, stop)
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2030, in _fetch_range
    return _fetch_range(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2173, in _fetch_range
    resp = fs.call_s3(
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 332, in _call_s3
    return await _error_wrapper(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 137, in _error_wrapper
    raise err
PermissionError: Access Denied

The owner of condastats asked me to open an issue here (see conda-incubator/condastats#16).

Thank you very much for your kind help,
Cheers,
Tom.

Access Issue with April 2023 files.

We are facing access issues when accessing March data files from the S3 path below.

s3://anaconda-package-data/conda/hourly/2023/03/

Error: fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

[Feature Request] Daily update of DB

It seems that the database for a specific month's daily download data is populated monthly, not daily.

As of today (2022-06-16), download data for 2022-06-01 to 2022-06-15 is not available, which makes it hard to collect statistics (e.g., the download count for the last 30 days).


It would be great if the database is updated daily.

Include `.conda` packages

It would be helpful to include both .conda and .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to count these separately, to track the transition to the newer format.

Possible to get the same data for other channels?

Firstly: this is a great data source. Thanks for providing it!

I'd love to be able to get the same type of data for a specific anaconda cloud channel that isn't one of the big ones (i.e. not anaconda, conda-forge, or bioconda) so that I can more easily track adoption by OS and Python version for the packages we distribute. Is there an API (or scripts) that I can use for this?

Binder examples not working

The examples in the binder notebook are failing with this error:

>>> df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
...                      storage_options={'anon': True})
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-37350afb994b> in <module>
      1 df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
----> 2                      storage_options={'anon': True})

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, **kwargs)
    135     if hasattr(path, "name"):
    136         path = stringify_path(path)
--> 137     fs, _, paths = get_fs_token_paths(path, mode="rb", storage_options=storage_options)
    138 
    139     paths = sorted(paths, key=natural_sort_key)  # numeric rather than glob ordering

/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
    313         cls = get_filesystem_class(protocol)
    314 
--> 315         options = cls._get_kwargs_from_urls(urlpath)
    316         path = cls._strip_protocol(urlpath)
    317         update_storage_options(options, storage_options)

AttributeError: type object 'S3FileSystem' has no attribute '_get_kwargs_from_urls'

I guess s3fs changed the API in a recent version and should be pinned in environment.yml.

September & October data?

It appears the last data uploaded was for August. Would it be possible to include the last two months?

How can I get last month's data?

It looks like I can only get data from 3 months ago or earlier.
How can I get last month's data?

condastats overall condastats --start_month 2021-01 --end_month 2021-05 --monthly

condastats overall condastats --start_month 2021-01 --end_month 2021-06 --monthly
