metacpan-api's Introduction

A Web Service for the CPAN

MetaCPAN aims to provide a free, open web service that serves metadata for CPAN modules.

REST API

MetaCPAN is based on Elasticsearch, so it provides a RESTful interface as well as the option to create complex queries. The docs/ directory provides a good starting point for REST access to MetaCPAN.

Expanding Your Author Info

MetaCPAN allows authors to add custom metadata about themselves to the index. Log in to MetaCPAN to add more information about yourself.

Installing Your Own MetaCPAN

If you want to run MetaCPAN locally, we encourage you to start with metacpan-docker. However, you may still find some info here:

Troubleshooting Elasticsearch

You can restart Elasticsearch (ES) manually if you need to troubleshoot.

sudo service elasticsearch restart

If you are unable to access http://localhost:9200 (give it a few seconds), kill the Elasticsearch process and run it in the foreground to see the debug output:

sudo service elasticsearch stop
cd /opt/elasticsearch
sudo bin/elasticsearch -f
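
The "give it a few seconds" step can be automated with a small polling loop. The following is a minimal Python sketch, not part of the MetaCPAN tooling: the function names, the number of attempts and the delay are all assumptions chosen for illustration. It only checks that something answers with HTTP 200 on the given URL.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, broken reply, ...
        return False


def wait_for_elasticsearch(url="http://localhost:9200", attempts=10, delay=1.0,
                           probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` run out.

    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_for_elasticsearch()` after the restart returns `False` if nothing answered within roughly ten seconds, which is the point at which running Elasticsearch in the foreground becomes worthwhile.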

If you get a "Can't start up: not enough memory" error when trying to start Elasticsearch, you likely need to update your JRE. On Ubuntu:

# fixes "not enough memory" errors
sudo apt-get install openjdk-6-jre

(Note: If you intend to try indexing a full MiniCPAN, you may find that Elasticsearch wants to use more open filehandles than your system allows by default. This script can be used to start ES with the appropriate ulimit adjustment).
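
To see whether the open filehandle limit is likely to be a problem, the current limit can be inspected from Python's standard `resource` module (Unix only). This is a rough sketch under assumptions: the helper names are hypothetical, and the number of filehandles Elasticsearch actually needs is not stated here, so any value passed as `required` is your own estimate. Raising the limit this way affects only the calling process and its children, not an Elasticsearch instance started elsewhere.

```python
import resource


def open_file_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_allows(required, soft):
    """True if a soft limit of `soft` permits `required` open files."""
    return soft == resource.RLIM_INFINITY or soft >= required


def raise_soft_limit(required):
    """Try to raise the soft limit to `required`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = open_file_limit()
    if limit_allows(required, soft):
        return soft
    # An unprivileged process may raise its soft limit up to the hard limit.
    target = required if hard == resource.RLIM_INFINITY else min(required, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target
```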

Run the test suite

The test suite accesses Elasticsearch on port 9900. The developer VM should have a dedicated test instance running in the background already, but if you want to run it manually:

cd /opt/elasticsearch
sudo bin/elasticsearch -f -Des.http.port=9900 -Des.cluster.name=testing

Then run the test suite:

cd /home/metacpan/metacpan-api
./bin/prove t

All tests must pass.

Create the ElasticSearch Index

./bin/run bin/metacpan mapping --delete

--delete drops all indices first, clearing any test data from the index.

Begin Indexing Your Modules

./bin/run bin/metacpan release /path/to/cpan/authors/id/

You should note that you can index either your CPAN mirror or a minicpan mirror. You can even index just parts of a mirror:

./bin/run bin/metacpan release /path/to/cpan/authors/id/{A,B}
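
The `{A,B}` form is shell brace expansion: the shell turns it into one path per letter before the command runs. If you drive the indexer from a script instead of an interactive shell, the same argument list can be built explicitly. A small Python sketch, with hypothetical helper names:

```python
import posixpath


def author_roots(cpan_root, letters):
    """Build one authors/id/<LETTER> path per requested first letter."""
    base = posixpath.join(cpan_root, "authors", "id")
    return [posixpath.join(base, letter.upper()) for letter in letters]


def release_command(cpan_root, letters):
    """Return the argument list for indexing only the given letters."""
    return ["./bin/run", "bin/metacpan", "release"] + author_roots(cpan_root, letters)
```

The resulting list can be handed to `subprocess.run` from the metacpan-api checkout.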

Tag the Latest Releases

./bin/run bin/metacpan latest --cpan /path/to/cpan/

Index Author Data

./bin/run bin/metacpan author --cpan /path/to/cpan/

Note that minicpan doesn't provide the 00whois.xml file which is used to generate the index; you will have to download it manually (it is in the authors/ directory) in order to index authors.

wget -O /path/to/cpan/authors/00whois.xml cpan.cpantesters.org/authors/00whois.xml

It also doesn't include author.json files, so that data will also be missing unless you get it from somewhere else.
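
Before running the author indexer against a minicpan mirror, it can help to check that the file is actually in place. A minimal Python sketch; the function name is hypothetical and it checks only for 00whois.xml, not for author.json files:

```python
import os


def missing_author_files(cpan_root):
    """Return the author index files that are absent under `cpan_root`."""
    required = [os.path.join("authors", "00whois.xml")]
    return [rel for rel in required
            if not os.path.isfile(os.path.join(cpan_root, rel))]
```

An empty list means the file was found; anything else names what still has to be downloaded.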

Set Up Proxy in Front of ElasticSearch

Start API server on port 5000

./bin/run plackup -p 5000 -r

This will start a single-threaded test server. If you need extra performance, use Starman instead.

Notes

For a full list of options:

./bin/run bin/metacpan release --help

Contributing

If you'd like to get involved, find us at #metacpan on irc.perl.org or open an issue on GitHub and let us know what you'd like to start working on.

IRC

You can find us at #metacpan on irc.perl.org. You can also access it via the web interface.

metacpan-api's People

Contributors

2shortplanks, andreeap, bodo-hugo-barwich, book, briandfoy, clintongormley, grantm, grinnz, gugod, haarg, jberger, metacpan-automation[bot], mickeyn, mohawk2, monken, mpeters, oalders, oiami, rafl, ranguard, rose, rwstauner, schwern, shlomif, ssoriche, szabgab, tsibley, wolfsage, zakame, zostay

metacpan-api's Issues

Which modules are in core?

Would be helpful to tag which modules are actually in core. Also helpful to know which modules are "dual-life".

IRC Channel

We need an IRC channel for general connectedness re: entire project.

fix ::Plack::Source to use Archive::Any

::Plack::Source is used to extract single files from a tarball and return the source. Right now it does that only for tar.gz files.
We might want to consider extracting the whole tarball and keeping that around instead of extracting only a single file. We can then replace http://cpansearch.perl.org/src/ with our service too (i.e. directory browsing). We won't have to extract all CPAN tarballs, only those that are requested.

Tagging

Dists and modules should be taggable. Perhaps authors as well. There needs to be discussion on the UI and about standard and user-defined tags.

MetaCPAN Road Map

We need to map out where the project is going in the coming months so that we have something to focus on and also so that we have a clear direction for anyone who wishes to contribute. The wiki would probably be a good home for this type of document.

CPANTS

Pass/Fail/Unknown data needs to be added to dist info.

Faster Index Updates

Frepan seems to be able to index dists within a few minutes of release. We need to explore how to do this as well. Frequent rsyncs would be overkill. We're probably fine with the daily rsync, but if we can regularly process a feed of released dists and then fetch those dists manually for indexing, that would likely get us there.

Documentation vs. Module vs. Package

Some ideas on how to differentiate between a Module and Documentation

These are some examples that caused me some headache. Please feel free to comment and look for inconsistencies on the CPAN.

Example:

http://cpansearch.perl.org/src/RJBS/perl-5.12.3/pod/perltoot.pod

  • .pod extension but no package declaration
    • set name to perltoot (as mentioned in the NAME section) and do not mark as module

http://cpansearch.perl.org/src/DOY/Moose-2.0000/lib/Moose/Manual.pod

  • .pod extension with package declaration Moose::Manual
    • set name to Moose::Manual but do not mark as module since it has a .pod extension

http://search.cpan.org/~perler/MooseX-Attribute-Deflator-2.1.2/README.pod

  • .pod extension with no package declaration and no NAME section
    • set name to README.pod (i.e. full path inside the tarball)

http://cpansearch.perl.org/src/MLEHMANN/AnyEvent-5.31/lib/AnyEvent.pm

  • multiple package declarations in one file (AnyEvent, AE, ...)
  • NAME section which says "AnyEvent"
    • set name to AnyEvent, AE, ..., thus users who are looking for AE still find the correct module

http://cpansearch.perl.org/src/MLEHMANN/AnyEvent-5.31/lib/AE.pm

  • NAME section says "AE"
  • contains a package AE declaration
    • set name to AE, users who search for AE will receive both files (AnyEvent.pm and AE.pm)
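
The five examples above can be read as a small decision procedure. The Python sketch below is one possible reading, not the indexer's actual logic: the function name and arguments are hypothetical, and the handling of cases the examples do not cover (a .pod file whose package and NAME disagree, a non-.pod file without any package declaration) is an assumption.

```python
import os


def index_names(path, packages, name_section):
    """Return (names, is_module) for one file in a distribution.

    path          -- file path inside the tarball, e.g. "lib/AE.pm"
    packages      -- package names declared in the file, in order
    name_section  -- the name given in the POD NAME section, or None
    """
    if os.path.splitext(path)[1] == ".pod":
        # .pod files are documentation and are never marked as modules.
        if packages:
            return [packages[0]], False
        if name_section:
            return [name_section], False
        # No package and no NAME section: fall back to the path itself.
        return [path], False
    # Other files: every declared package becomes a searchable name, so a
    # search for AE finds both AnyEvent.pm and AE.pm.
    names = list(packages)
    if name_section and name_section not in names:
        names.insert(0, name_section)
    if not names:
        names = [path]
    # Assumption: a file counts as a module only if it declares a package.
    return names, bool(packages)
```

For instance, `index_names("lib/AnyEvent.pm", ["AnyEvent", "AE"], "AnyEvent")` yields both names and marks the file as a module, while `index_names("README.pod", [], None)` falls back to the path.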

Add Complex ElasticSearch Queries to API Docs

The REST API is documented well enough, but we don't have any examples of how people can run more complex queries on the index. This could take the form of a blog post or a page in the wiki.

Store user searches in ElasticSearch

curl -XPUT http://api.metacpan.org/search/release/latest -d '{"query":{"match_all":{}},"filter":{"and":[{"term":{"status":"latest"}},{"term":{"release.distribution.raw":"$1"}}]}}'
curl -XGET http://api.metacpan.org/search/release/latest/Net-FreshBooks-API

This API allows users to store custom searches in MetaCPAN and execute them with parameters.
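
How the server fills in `$1` is not described here. The following Python sketch shows one way such positional placeholders could be substituted, using the stored query from the curl example as the template. The function name is hypothetical. Substituting into the parsed structure rather than into the raw JSON text is a design choice of this sketch: an argument containing quotes cannot then break the query.

```python
import json
import re

PLACEHOLDER = re.compile(r"\$(\d+)")


def fill_placeholders(template, args):
    """Return a copy of `template` with $1, $2, ... replaced by `args`.

    Raises IndexError if the template refers to an argument that was not given.
    """
    def substitute(value):
        if isinstance(value, str):
            return PLACEHOLDER.sub(lambda m: args[int(m.group(1)) - 1], value)
        if isinstance(value, list):
            return [substitute(item) for item in value]
        if isinstance(value, dict):
            return {key: substitute(item) for key, item in value.items()}
        return value
    return substitute(template)


# The stored search from the PUT request above.
STORED = json.loads(
    '{"query":{"match_all":{}},"filter":{"and":['
    '{"term":{"status":"latest"}},'
    '{"term":{"release.distribution.raw":"$1"}}]}}'
)
```

`fill_placeholders(STORED, ["Net-FreshBooks-API"])` then corresponds to the GET request above, and the stored template itself is left unchanged.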

Dist Comments

Commenting on dists (with version #) could be added once the Twitter auth etc is stable.

perldoc

Looks like those files can be found in something like: local/lib/perl5/5.12.2/pods. They should be added to the index after the actual module POD is also available.

Favourite Modules/Dists

Favourites are not the same as bookmarks. Bookmarks indicate you want to revisit the docs. Favourites indicate some satisfaction with the product.

Github Issues

The index should keep current data on # of open issues for repositories referred to in META.yml files. Probably open issues and issues tagged as "bug".

Fix /source endpoint to extract whole archive and support bz2

See https://github.com/CPAN-API/cpan-api/blob/master/lib/MetaCPAN/Plack/Source.pm

Currently, we open the tarball, extract the requested file, store it in a temp directory and serve it. Subsequent requests for that file are handled directly from the temp directory.

What we should do instead is extract the whole tarball, so we let users browse the directory and can do diffs more easily.

Also, currently we only extract .tar.gz files. Use Archive::Any instead.
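
The extract-once idea can be sketched as follows. This is an illustration in Python using the standard `tarfile` module as a stand-in for Archive::Any; it is not the code in MetaCPAN::Plack::Source, and the function names and cache layout are assumptions. A failed extraction would leave a partial directory behind, which a real implementation would need to handle.

```python
import os
import tarfile


def extract_once(archive_path, cache_root):
    """Unpack the whole archive into cache_root/<archive name>.d unless already done.

    Returns the directory holding the extracted files.
    """
    dest = os.path.join(cache_root, os.path.basename(archive_path) + ".d")
    if not os.path.isdir(dest):
        os.makedirs(dest)
        # "r:*" lets tarfile detect gzip, bzip2 or xz compression by itself.
        with tarfile.open(archive_path, mode="r:*") as archive:
            try:
                archive.extractall(dest, filter="data")  # refuses unsafe members
            except TypeError:
                archive.extractall(dest)  # older Python without `filter`
    return dest


def read_source(archive_path, cache_root, member):
    """Return the bytes of `member`, extracting the archive on first use."""
    dest = os.path.realpath(extract_once(archive_path, cache_root))
    target = os.path.realpath(os.path.join(dest, member))
    # Reject requests such as "../../etc/passwd".
    if not target.startswith(dest + os.sep):
        raise ValueError("path escapes the extracted archive")
    with open(target, "rb") as handle:
        return handle.read()
```

Because the whole tree stays on disk, directory browsing and diffs between files need no further access to the tarball.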

Package Download and Page View Analytics

We can easily provide page view stats as an indicator of popular modules. We can do the same for dists downloaded via HTTP from our CPAN mirror.

We may want to look at letting other trusted sources report download statistics so that we can aggregate this information. One way would be to encourage people to configure their favourite command line installer to use our CPAN mirror via HTTP. It would be a very easy (and painless) way for people to contribute information back to the system.

Github watchers

Track # of watchers of Github repos for distributions. Possibly also track changes over time.

.bz2 tarballs are not being indexed

This seems to be a problem with Archive::Any:

$ bin/metacpan release ~/CPAN/authors/id/R/RJ/RJBS/perl-5.12.3.tar.bz2
2011/04/25 12:08:03 I release: Processing /Users/mo/CPAN/authors/id/R/RJ/RJBS/perl-5.12.3.tar.bz2
No handler available for type 'application/x-bzip2' at /Users/mo/perl5/perlbrew/perls/perl-5.12.3/lib/site_perl/5.12.3/Archive/Any.pm line 179.
2011/04/25 12:08:03 F release: Can't call method "is_naughty" on an undefined value at /Users/mo/Documents/workspace/cpan-api/bin/../lib/MetaCPAN/Script/Release.pm line 129.

Monitor Services (Nagios?)

We should set up Nagios or something similar to send alerts if/when services go down, like the API, the CPAN mirror (if, for example, there's an issue with the Middleware) and cpanvote.

Module Dependents

Add data on dependents to the index, much as cpan-mangler uses this information.

Proxy bug reports through metacpan

Have a centralized endpoint to submit bugs to the appropriate place.

Use cases:

  • Module on GitHub

    • Reporter on GitHub:

      Report the bug directly to the GitHub repo.

    • Reporter not on GitHub:

      Report the bug using the metacpan identity on GitHub and send the reporter a link to the issue if they supply an email address.

  • Module not on GitHub

    Report to RT by sending an email (using the reporter's email address as sender) or redirect the user to RT (if they wish).

Up for discussion :-)
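
The use cases above amount to a small decision table. A Python sketch of that routing, purely for discussion: the function name, the argument names and the shape of the returned dictionary are hypothetical, and redirecting to RT when no email address is available is an assumption.

```python
def route_bug_report(module_on_github, reporter_on_github,
                     reporter_email=None, prefer_redirect=False):
    """Pick a destination for a bug report, following the use cases above."""
    if module_on_github:
        if reporter_on_github:
            # The reporter files the issue under their own GitHub account.
            return {"target": "github", "as": "reporter"}
        # Filed under the shared metacpan identity; the reporter gets a link
        # to the issue only if an email address was supplied.
        return {"target": "github", "as": "metacpan", "notify": reporter_email}
    # Module not on GitHub: fall back to RT.
    if reporter_email and not prefer_redirect:
        return {"target": "rt", "via": "email", "sender": reporter_email}
    return {"target": "rt", "via": "redirect"}
```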

Author Updating via Web App

Once the auth system is finished, we'll need to set up a site where anyone can create an account via a Twitter login. If the user has a PAUSE id, they can request one or more author roles to be added to their account. We will send an authentication email to [email protected]. Once authenticated, they'll be able to use the web app to update their author info.

Going forward authors could add metadata on modules and dists as well. Dists could be marked as:

Deprecated
Unloved (In need of new maintainer)
Looking for co-maintainers

POD Translation

Now that POD is in the index, it would be helpful to have a translation layer in the API. For example, /pod/Moose should return straight POD. /pod2html/Moose would return nicely formatted HTML and /pod2textile/Moose would return textile etc.
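
As a starting point for discussion, the proposed URL scheme could be parsed like this. A minimal Python sketch: the function name and the format table are hypothetical, and only the three prefixes named above are recognised. The conversion itself is left out.

```python
# URL prefix -> output format, taken from the three examples above.
FORMATS = {"pod": "pod", "pod2html": "html", "pod2textile": "textile"}


def parse_pod_path(path):
    """Split "/pod2html/Moose" into ("html", "Moose"); return None if no match."""
    parts = path.strip("/").split("/", 1)
    if len(parts) != 2 or parts[0] not in FORMATS or not parts[1]:
        return None
    return FORMATS[parts[0]], parts[1]
```

Adding a further output format would then only mean adding an entry to the table and a matching converter.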

StackOverflow

Integrate data available from the StackOverflow API. Questions/threads about modules etc.

Here are JSON results for a search against their API for "Data::Dumper":

curl --compressed -XGET "http://api.stackoverflow.com/1.0/search?intitle=Data::Dumper&key=EByZgXQ_-U-DuGKS38yYjA"

The 'key' argument is an API key that I've set up specifically for CPAN-API. Here's more info on their API keys if you're interested (we'll probably want to implement something similar):

http://stackapps.com/questions/67/how-api-keys-work

And here's the full API documentation:

http://stackapps.com/questions/1/api-documentation-and-help

Bookmarking

You should be able to bookmark modules/dists for viewing later. This is not the same as marking as a favourite.

search.metacpan.org

Basic Info: a sample search page as proof of concept for users to use the api.

Implementation: my idea is to create a single-page javascript engine for using the api, pulling data via ajax and managing the display of the search results/pod/source etc. through some js magic. No server required for running it.

Document general architecture

rafl:

i was wondering. do you guys have docs giving a brief architectural overview of things?

moonk: feel free to just explain things to me over beer. i'll be happy to volunteer to write it down

Don't forget to mention:

  • How are the identifiers built?
  • What role does ElasticSearchX::Model play? (MetaCPAN::Model namespace)
  • MetaCPAN::Script:: namespace
  • MetaCPAN::Plack:: namespace
  • $ bin/metacpan

SSL Cert

Free SSL certs at http://cert.startcom.org/ (work in most browsers and even iOS)

We should enable SSL for the API and encourage its use, especially if we start personalization (i.e. session cookies etc. should be encrypted).

MetaCPAN Status Page

It would be helpful to have a status page right on www.metacpan.org or www.metacpan.org/status. We could include stats on what is currently in the index (# of dists, authors etc.). We could also list our module coverage, whether the API is currently online etc.

Perhaps also a few sample stats on how many PAUSE authors in the index have listed their Github accounts etc.
