
Conda Package Download Data

Creative Commons License

This repository describes the conda package download data provided by Anaconda, Inc. It includes package download counts starting from Jan. 2017 for the following download sources:

  • Anaconda Distribution: The default channels hosted on repo.anaconda.com (and historically on repo.continuum.io)
  • Select Anaconda.org channels: Currently this includes conda-forge and bioconda.

Check out an example notebook using this data on Binder.

Data Format

The download data is provided as a record for every unique combination of:

  • data_source: anaconda for Anaconda distribution, conda-forge for the conda-forge channel on Anaconda.org, and bioconda for the bioconda channel on Anaconda.org.
  • time: UTC time, binned by hour
  • pkg_name: Package name (Ex: pandas)
  • pkg_version: Package version (Ex: 0.23.0)
  • pkg_platform: One of linux-32, linux-64, osx-64, win-32, win-64, linux-armv7, linux-ppc64le, linux-aarch64, or noarch
  • pkg_python: Python version required by the package, if any (Ex: 3.7)
  • counts: Number of downloads for this combination of attributes
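
As an illustrative sketch (the rows below are synthetic, not real download counts), records in this schema can be aggregated with pandas:

```python
import pandas as pd

# Synthetic rows mimicking the schema above; the counts are made up.
df = pd.DataFrame({
    "data_source": ["anaconda", "conda-forge", "conda-forge"],
    "time": pd.to_datetime(
        ["2019-12-01 00:00", "2019-12-01 00:00", "2019-12-01 01:00"], utc=True
    ),
    "pkg_name": ["pandas", "pandas", "pandas"],
    "pkg_version": ["0.23.0", "0.23.0", "0.23.0"],
    "pkg_platform": ["linux-64", "osx-64", "noarch"],
    "pkg_python": ["3.7", "3.7", "3.7"],
    "counts": [10, 5, 7],
})

# Total downloads per package, summed over all other attributes.
totals = df.groupby("pkg_name")["counts"].sum()
```

The same pattern (group by any subset of the attribute columns and sum `counts`) applies to the real files.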

The storage format is Parquet, one file per day, with SNAPPY compression. Files are hosted on S3, with the naming convention:

  • s3://anaconda-package-data/conda/hourly/[year]/[month]/[year]-[month]-[day].parquet
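
For example, a single day's file can be read directly (a sketch assuming anonymous S3 access with dask and s3fs installed; the `daily_path` helper is ours, not part of the dataset):

```python
from datetime import date

def daily_path(d: date) -> str:
    """Build the S3 key for one day's file, following the naming
    convention above (month and day are zero-padded)."""
    return (
        "s3://anaconda-package-data/conda/hourly/"
        f"{d.year}/{d.month:02d}/{d.year}-{d.month:02d}-{d.day:02d}.parquet"
    )

# With dask and s3fs installed, the public bucket can be read anonymously:
# import dask.dataframe as dd
# df = dd.read_parquet(daily_path(date(2018, 12, 31)),
#                      storage_options={"anon": True})
```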

Data Catalog

To simplify using the dataset, we have also created an Intake catalog file. You can either load it directly from the repository, if you have the intake, intake-parquet, and python-snappy packages installed:

import intake

cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
monthly = cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()

Or you can install the data package directly with conda, which will also fetch the required dependencies:

conda install -c intake anaconda-package-data

And then the data source will appear in the global catalog of your conda environment:

import intake

monthly = intake.cat.anaconda_package_data_by_month(year=2019, month=12).to_dask()

To minimize bandwidth usage, these catalogs are configured so that Intake will cache data locally to your system on first use.

Known Issues

There are some known gaps in the dataset, and Anaconda.org data doesn't appear in the dataset until April 2017. See KNOWN_ISSUES.md for more details.

Updates

This data will be updated approximately monthly. Note that we may revise historical data if processing issues are discovered, or to add additional data (like new Anaconda.org channels). We will update the change log when new or revised data is posted.

License

This dataset is licensed under a Creative Commons Attribution 4.0 International License. We are offering this data to help the community understand the usage of conda packages, but with no warranty. If you use this data, please acknowledge Anaconda as the source and link back to this GitHub repository.

Feedback

If you have questions or find problems in the data, please open an issue on this repository. Thanks!

anaconda-package-data's People

Contributors

bkreider, datapythonista, jsignell, mariusvniekerk, seibert, sophiamyang, tswast


anaconda-package-data's Issues

Access denied when trying to load 2019 data

When running this (both from Binder and in my own conda environment, Python 3.7, on both Windows and Linux):

cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
df = cat.anaconda_package_data_by_year(year=2019).to_dask()

I get the following error:

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

This used to work a month ago or so. Any ideas what's wrong?
It seems to work fine if I set year=2018.

Thanks!

Anaconda for Linux Mint

Installing Anaconda on Linux Mint (a distro based on Ubuntu) runs into problems because the keyword "linuxmint" is missing from vscode.py, where the OS type is looked up. Presently, only "debian" and "ubuntu" are listed in this file for the branch of Linux distros using deb package managers. As a result, running anaconda-navigator fails without this keyword.

The file supplied has the additions needed for Linux Mint. It runs perfectly. Location:
~/anaconda3/lib/python3.7/site-packages/anaconda_navigator/api/external_apps/vscode.py
vscode.py.zip

Anything wrong with the data from March to today?

  • condastats version: 1.2.1
  • Python version: 3.12.3
  • Operating System: manjaro

Description

Using condastats, the data show an exponential increase in downloads over the last few months. While we're confident in the quality of our package ;-), this seems unrealistic and, in any case, unexpected (×100 between 2023-12 and 2024-05!).

Do you have any idea why these variations are occurring?

What I Did

condastats overall pyagrum --monthly

[...]
          2023-08      2484
          2023-09      2433
          2023-10      4560
          2023-11      3154
          2023-12      1114
          2024-01      2829
          2024-02      2812
          2024-03     12573
          2024-04     66098
          2024-05    110944

Thank you for any hints, explanations, or information on this subject.

(Copy of conda-incubator/condastats#22)

Add plotly channel

pyviz.org fetches data from here to display the download stats of various visualization and dashboard packages. Among those, plotly has its own conda channel and gets downloaded from there a non-negligible number of times. Could the plotly channel be added to the dataset?

Pandas/pyArrow/read_parquet error

As requested by @sophiamyang, I am passing on an issue I opened for condastats, since that package depends on the data pipeline in this very repo:

  • condastats version: 0.2.1
  • Python version: Python 3.11.3
  • Operating System: linux (Manjaro/Plasma)

Description

Unable to use condastats.cli.overall (internal error on pandas->pyArrow)

    dataconda = condastats.cli.overall([conda_module], monthly=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[...]/lib/python3.11/site-packages/condastats/cli.py", line 62, in overall
    df = dd.read_parquet(
         ^^^^^^^^^^^^^^^^
  File "[...]/python3.11/site-packages/dask/backends.py", line 138, in wrapper
    raise type(e)(
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type

Recent versions of Pandas experience PyArrow errors when `intake` and `condastats` use anaconda-package-data

Thank you for making this data and the documented methods available - fantastic stuff!

I noticed, when attempting to use the intake methods from the README.md, that there are Pandas PyArrow errors with recent versions of Pandas (>=v2.0.0). This also appears to affect condastats, though perhaps through different means. I imagine, but don't know, that this could be a Pandas or Dask DataFrame issue at the core; I also wondered about data type management within the Parquet files in this repo (for example, are there incompatible types which users should be made aware of?). While the fix might be external, maybe this issue could help with increased or updated documentation here.

Specifically, the errors I most often saw were:

ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of string type

There also may have been errors regarding "Pandas categorical types".

I worked around the issue by looking at the last modified date of the README.md (around January 2020) and installing a version of Pandas from around that time (v1.3.5 worked for me).

PermissionError: Access Denied

Hi,

I am using the condastats package, which relies on anaconda-package-data. When running

import condastats.cli
condastats.cli.overall('numpy')

I get the error message

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 110, in _error_wrapper
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/aiobotocore/client.py", line 265, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/condastats", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/condastats/cli.py", line 387, in main
    overall(
  File "/usr/local/lib/python3.8/site-packages/condastats/cli.py", line 87, in overall
    df = df.compute()
  File "/usr/local/lib/python3.8/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dask/base.py", line 598, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/usr/local/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.8/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/usr/local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 89, in __call__
    return read_parquet_part(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 587, in read_parquet_part
    dfs = [
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 588, in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 435, in read_partition
    arrow_table = cls._read_table(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 1518, in _read_table
    arrow_table = _read_table_from_path(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py", line 239, in _read_table_from_path
    return pq.ParquetFile(fil, **pre_buffer).read(
  File "/usr/local/lib/python3.8/site-packages/pyarrow/parquet/__init__.py", line 277, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1213, in pyarrow._parquet.ParquetReader.open
  File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1578, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/usr/local/lib/python3.8/site-packages/fsspec/caching.py", line 41, in _fetch
    return self.fetcher(start, stop)
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2030, in _fetch_range
    return _fetch_range(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2173, in _fetch_range
    resp = fs.call_s3(
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 332, in _call_s3
    return await _error_wrapper(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 137, in _error_wrapper
    raise err
PermissionError: Access Denied

The owner of condastats asked me to open an issue here (see conda-incubator/condastats#16).

Thank you very much for your kind help,
Cheers,
Tom.

Access Issue with April 2023 files.

We are facing access issues when accessing March data files from the S3 path below.

s3://anaconda-package-data/conda/hourly/2023/03/

Error: fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

[Feature Request] Daily update of DB

It seems that the database for a specific month's daily download data is populated monthly, not daily.

As of today (2022-06-16), download data for 2022-06-01 to 2022-06-15 is not available, which makes it hard to collect statistics (e.g., the download count for the last 30 days).


It would be great if the database is updated daily.

Include `.conda` packages

It would be helpful to include both .conda and .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to count these separately, to track the transition to the newer format.

Possible to get the same data for other channels?

Firstly: this is a great data source. Thanks for providing it!

I'd love to be able to get the same type of data for a specific anaconda cloud channel that isn't one of the big ones (i.e. not anaconda, conda-forge, or bioconda) so that I can more easily track adoption by OS and Python version for the packages we distribute. Is there an API (or scripts) that I can use for this?

Binder examples not working

The examples in the binder notebook are failing with this error:

>>> df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
...                      storage_options={'anon': True})
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-37350afb994b> in <module>
      1 df = dd.read_parquet('s3://anaconda-package-data/conda/hourly/2018/12/2018-12-31.parquet',
----> 2                      storage_options={'anon': True})

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, **kwargs)
    135     if hasattr(path, "name"):
    136         path = stringify_path(path)
--> 137     fs, _, paths = get_fs_token_paths(path, mode="rb", storage_options=storage_options)
    138 
    139     paths = sorted(paths, key=natural_sort_key)  # numeric rather than glob ordering

/srv/conda/envs/notebook/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
    313         cls = get_filesystem_class(protocol)
    314 
--> 315         options = cls._get_kwargs_from_urls(urlpath)
    316         path = cls._strip_protocol(urlpath)
    317         update_storage_options(options, storage_options)

AttributeError: type object 'S3FileSystem' has no attribute '_get_kwargs_from_urls'

I guess s3fs changed the API in a recent version and should be pinned in environment.yml.

September & October data?

It appears the last data uploaded was for August. Would it be possible to include the last two months?

How can I get last month's data?

It looks like I can only get data from 3 months ago or earlier.
How can I get last month's data?

condastats overall condastats --start_month 2021-01 --end_month 2021-05 --monthly

condastats overall condastats --start_month 2021-01 --end_month 2021-06 --monthly
