metacpan / metacpan-api

A free, open API for everything you want to know about CPAN

Home Page: http://www.metacpan.org/

License: Other

Perl 97.78% Shell 1.64% Dockerfile 0.11% HTML 0.09% Raku 0.38%

metacpan perl cpan hacktoberfest

metacpan-api's Introduction

A Web Service for the CPAN

MetaCPAN aims to provide a free, open web service which provides metadata for CPAN modules.

REST API

MetaCPAN is based on Elasticsearch, so it provides a RESTful interface as well as the option to create complex queries. The docs/ directory provides a good starting point for REST access to MetaCPAN.

Expanding Your Author Info

MetaCPAN allows authors to add custom metadata about themselves to the index. Log in to MetaCPAN to add more information about yourself.

Installing Your Own MetaCPAN

If you want to run MetaCPAN locally, we encourage you to start with metacpan-docker. However, you may still find some info here:

Troubleshooting Elasticsearch

You can restart Elasticsearch (ES) manually if you need to troubleshoot.

sudo service elasticsearch restart

If you are unable to access http://localhost:9200 (give it a few seconds), stop the Elasticsearch service and run it in the foreground to see the debug output:

sudo service elasticsearch stop
cd /opt/elasticsearch
sudo bin/elasticsearch -f

If you get a "Can't start up: not enough memory" error when trying to start Elasticsearch, you likely need to update your JRE. On Ubuntu:

# fixes "not enough memory" errors
sudo apt-get install openjdk-6-jre

(Note: If you intend to try indexing a full MiniCPAN, you may find that Elasticsearch wants to use more open filehandles than your system allows by default. This script can be used to start ES with the appropriate ulimit adjustment).
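The script referred to above is not reproduced here. As a rough sketch of the same idea, the following raises the open-filehandle limit before starting ES in the foreground. The `REQUIRED_FDS` value and the function names are assumptions made for illustration; the source does not say how many filehandles ES needs.

```sh
# Hypothetical threshold; adjust to whatever your indexing run requires.
REQUIRED_FDS=32000

# fd_limit_ok CURRENT REQUIRED
# Succeeds if the current limit is "unlimited" or at least REQUIRED.
fd_limit_ok() {
    if [ "$1" = "unlimited" ]; then
        return 0
    fi
    [ "$1" -ge "$2" ]
}

# Raise the limit if needed, then start ES in the foreground
# (same start command as shown earlier in this section).
start_es() {
    if ! fd_limit_ok "$(ulimit -n)" "$REQUIRED_FDS"; then
        ulimit -n "$REQUIRED_FDS" || return 1
    fi
    cd /opt/elasticsearch && bin/elasticsearch -f
}
```

Raising the limit above the hard limit generally requires root, so `start_es` would likely need to run under `sudo` or from an init script.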

Run the test suite

The test suite accesses Elasticsearch on port 9900. The developer VM should have a dedicated test instance running in the background already, but if you want to run it manually:

cd /opt/elasticsearch
sudo bin/elasticsearch -f -Des.http.port=9900 -Des.cluster.name=testing

Then run the test suite:

cd /home/metacpan/metacpan-api
./bin/prove t

The test suite has to pass all tests.

Create the ElasticSearch Index

./bin/run bin/metacpan mapping --delete

--delete drops all indices first, which clears any test data from the index.

Begin Indexing Your Modules

./bin/run bin/metacpan release /path/to/cpan/authors/id/

You should note that you can index either your CPAN mirror or a minicpan mirror. You can even index just parts of a mirror:

./bin/run bin/metacpan release /path/to/cpan/authors/id/{A,B}
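If you prefer to index one author directory at a time (for example, to resume after a failure), a loop along these lines may help. It is only a sketch: `author_dirs` is a hypothetical helper, and the mirror path is the same placeholder used above.

```sh
# author_dirs ROOT LETTER...
# Print the per-letter author directories under a CPAN mirror root.
author_dirs() {
    root=$1
    shift
    for letter in "$@"; do
        printf '%s/authors/id/%s\n' "$root" "$letter"
    done
}

# Usage (commented out; needs a MetaCPAN checkout and a local mirror):
# author_dirs /path/to/cpan A B | while read -r dir; do
#     ./bin/run bin/metacpan release "$dir"
# done
```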

Tag the Latest Releases

./bin/run bin/metacpan latest --cpan /path/to/cpan/

Index Author Data

./bin/run bin/metacpan author --cpan /path/to/cpan/

Note that minicpan doesn't provide the 00whois.xml file which is used to generate the index; you will have to download it manually (it is in the authors/ directory) in order to index authors.

wget -O /path/to/cpan/authors/00whois.xml cpan.cpantesters.org/authors/00whois.xml

It also doesn't include author.json files, so that data will also be missing unless you get it from somewhere else.

Set Up Proxy in Front of ElasticSearch

Start API server on port 5000

./bin/run plackup -p 5000 -r

This will start a single-threaded test server. If you need extra performance, use Starman instead.

Notes

For a full list of options:

./bin/run bin/metacpan release --help

Contributing

If you'd like to get involved, find us at #metacpan on irc.perl.org or open an issue on GitHub and let us know what you'd like to start working on.

IRC

You can find us at #metacpan on irc.perl.org. Access it via the web interface.

metacpan-api's Issues

Which modules are in core?

Would be helpful to tag which modules are actually in core. Also helpful to know which modules are "dual-life".

API versioning

rafl:

just go with http://api.metacpan.org/$n/, where $n isa Int

but spec out that metacpan consumers should follow redirects

Each API version has its own index in ElasticSearch.
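A consumer following this proposal might build versioned URLs and follow redirects roughly as below. This is a sketch of the proposal, not a deployed API: `api_url` is a hypothetical helper and the `release/Moose` path is only an example.

```sh
# api_url VERSION PATH
# Build a versioned API URL; VERSION must be a non-negative integer.
api_url() {
    version=$1
    path=$2
    case $version in
        ''|*[!0-9]*)
            echo "version must be an integer" >&2
            return 1
            ;;
    esac
    # Strip one leading slash so both "release/X" and "/release/X" work.
    printf 'http://api.metacpan.org/%s/%s\n' "$version" "${path#/}"
}

# Usage (commented out; -L tells curl to follow redirects):
# curl -sL "$(api_url 1 release/Moose)"
```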

IRC Channel

We need an IRC channel for general connectedness re: entire project.

Index CPAN scripts

See http://www.cpan.org/scripts/ and http://www.cpan.org/scripts/submitting.html.

Scripts are single Perl files that can be uploaded to PAUSE. search.cpan.org doesn't index them, and they don't seem to be widely used.

fix ::Plack::Source to use Archive::Any

::Plack::Source is used to extract single files from a tarball and return the source. Right now it does that only for tar.gz files.
We might want to consider extracting the whole tarball and keeping it around instead of extracting only a single file. We could then replace http://cpansearch.perl.org/src/ with our service too (i.e. directory browsing). We wouldn't have to extract all CPAN tarballs, only those that are requested.

Tagging

Dists and modules should be taggable. Perhaps authors as well. There needs to be discussion on the UI and about standard and user-defined tags.

Automate tagging

Use ElasticSearch's percolate feature to automate the tagging of modules:

http://www.elasticsearch.org/guide/reference/api/percolate.html

A release document is sent to the ES server and ES returns a list of matching tags.

Examples:

abstract contains "deprecated" => tag as deprecated
dependency contains XSLoader => tag as XS code
dependency contains Moose ...
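To make the example rules concrete, here is a local stand-in that applies the first two of them to an abstract and a list of dependencies. It does not use the ES percolate API; the function name, the case-sensitive matching, and the `xs` tag label are assumptions made for illustration.

```sh
# tags_for ABSTRACT DEPENDENCY...
# Print one matching tag per line.
tags_for() {
    abstract=$1
    shift
    # Rule: abstract contains "deprecated" => tag as deprecated
    case $abstract in
        *deprecated*) echo deprecated ;;
    esac
    # Rule: dependency contains XSLoader => tag as XS code
    for dep in "$@"; do
        case $dep in
            XSLoader) echo xs ;;
        esac
    done
}

# Usage:
# tags_for "This module is deprecated" XSLoader Moose
```

With percolation, the rules would live in ES as stored queries rather than in client code; the sketch only shows the input and output shape described above.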

Integrate CPAN Testers results

barbie:

search.cpan.org gets them from the cpanstats SQLite database available from the development site: http://devel.cpantesters.org. The DB is updated every 6 hours, and typically search.cpan.org takes a copy once a day.

CPAN-API/search-metacpan-org#27

MetaCPAN Road Map

We need to map out where the project is going in the coming months so that we have something to focus on and also so that we have a clear direction for anyone who wishes to contribute. The wiki would probably be a good home for this type of document.

CPANTS

Pass/Fail/Unknown data needs to be added to dist info.

Faster Index Updates

Frepan seems to be able to index dists within a few minutes of release. We need to explore how to do this as well. Frequent rsyncs would be overkill. We're probably fine with the daily rsync, but if we can regularly process a feed of released dists and then fetch those dists manually for indexing, that would likely get us there.

Fix YAML parsing

CPAN::Meta fails to parse some META.yml files

POD parse module/release abstract

Some abstracts (e.g. http://search.cpan.org/perldoc?MooseX::App::Cmd) contain POD in the abstract.
Not sure how we should handle this. Either we add a property "abstract_html" or remove the POD altogether from the abstract. IMHO, the abstract property should not contain any markup (neither pod nor html).

Documentation vs. Module vs. Package

Some ideas on how to differentiate between a Module and Documentation.

These are some examples that caused me some headache. Please feel free to comment and look for inconsistencies on the CPAN.

Example:

http://cpansearch.perl.org/src/RJBS/perl-5.12.3/pod/perltoot.pod

.pod extension but no package declaration
- set name to perltoot (as mentioned in the NAME section) and do not mark as module

http://cpansearch.perl.org/src/DOY/Moose-2.0000/lib/Moose/Manual.pod

.pod extension with package declaration Moose::Manual
- set name to Moose::Manual but do not mark as module since it has a .pod extension

http://search.cpan.org/~perler/MooseX-Attribute-Deflator-2.1.2/README.pod

.pod extension with no package declaration and no NAME section
- set name to README.pod (i.e. full path inside the tarball)

http://cpansearch.perl.org/src/MLEHMANN/AnyEvent-5.31/lib/AnyEvent.pm

multiple package declarations in one file (AnyEvent, AE, ...)
NAME section which says "AnyEvent"
- set name to AnyEvent, AE, ..., thus users who are looking for AE still find the correct module

http://cpansearch.perl.org/src/MLEHMANN/AnyEvent-5.31/lib/AE.pm

NAME section says "AE"
contains a package AE declaration
- set name to AE, users who search for AE will receive both files (AnyEvent.pm and AE.pm)
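The five examples above can be approximated with a small heuristic: collect the NAME section entry and every package declaration as names, fall back to the file path when there are none, and only mark `.pm` files that declare a package as modules. The sketch below is one possible reading of those rules, not the indexer's actual logic; `classify_file` is a hypothetical helper and its regexes are deliberately naive.

```sh
# classify_file PATH
# Print "NAMES<TAB>KIND": NAMES is a comma-separated list,
# KIND is "module" or "documentation".
classify_file() {
    file=$1
    names=$(
        {
            # First word after "=head1 NAME", skipping blank lines.
            awk '/^=head1[ \t]+NAME/ { found = 1; next }
                 found && NF { print $1; exit }' "$file"
            # Every "package Foo::Bar" declaration.
            sed -n 's/^[[:space:]]*package[[:space:]][[:space:]]*\([A-Za-z_][A-Za-z0-9_:]*\).*/\1/p' "$file"
        } | awk '!seen[$0]++' | paste -sd, -
    )
    # No NAME section and no package: use the path inside the tarball.
    if [ -z "$names" ]; then
        names=$file
    fi

    kind=documentation
    case $file in
        *.pm)
            if grep -q '^[[:space:]]*package[[:space:]]' "$file"; then
                kind=module
            fi
            ;;
    esac
    printf '%s\t%s\n' "$names" "$kind"
}
```

A real implementation would need a proper POD and Perl parser, since a POD line that happens to start with the word "package" would fool this version.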

Add Complex ElasticSearch Queries to API Docs

The REST API is documented well enough, but we don't have any examples of how people can run more complex queries on the index. This could take the form of a blog post or a page in the wiki.

Store user searches in ElasticSearch

curl -XPUT http://api.metacpan.org/search/release/latest -d '{"query":{"match_all":{}},"filter":{"and":[{"term":{"status":"latest"}},{"term":{"release.distribution.raw":"$1"}}]}}'
curl -XGET http://api.metacpan.org/search/release/latest/Net-FreshBooks-API

This API allows users to store custom searches in MetaCPAN and execute them with parameters.
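The `$1` in the stored query is a positional placeholder that is filled in from the request path. The issue does not specify how the server performs the substitution, so the following is only a client-side illustration of the idea; `render_query` is a hypothetical helper and it does no JSON escaping.

```sh
# render_query TEMPLATE VALUE
# Replace every literal "$1" in TEMPLATE with VALUE.
render_query() {
    template=$1
    value=$2
    # Escape characters that are special in sed's replacement text.
    escaped=$(printf '%s' "$value" | sed 's/[&|\\]/\\&/g')
    printf '%s\n' "$template" | sed "s|\\\$1|$escaped|g"
}

# Usage:
# render_query '{"term":{"release.distribution.raw":"$1"}}' Net-FreshBooks-API
```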

Dist Comments

Commenting on dists (with version #) could be added once the Twitter auth etc is stable.

MetaCPAN::Consumer::...

Perl api on top of the http api - for testing and also consumer authors.

Might also help mask api incompatibilities in at least perl space

http://search.cpan.org/~xsawyerx/MetaCPAN-API-0.02/ does exist - but might be an idea to have an official integrated version?

RT Issues

Integrate RT issue counts on a per-distribution basis: http://rt.cpan.org/Public/bugs-per-dist.tsv. This obviously goes hand in hand with adding Github issue counts.

perldoc

Looks like those files can be found in something like local/lib/perl5/5.12.2/pods. They should be added to the index after the actual module POD is also available.

Add PerlMongers groups to index

Would be very nice to have the PerlMongers group info in the ES index.

http://www.pm.org/groups/perl_mongers.xm

That would mean we'd only need the .pm group name in the author info rather than accompanying links etc.

Favourite Modules/Dists

Favourites are not the same as bookmarks. Bookmarks indicate you want to revisit the docs. Favourites indicate some satisfaction with the product.

Github Issues

The index should keep current data on # of open issues for repositories referred to in META.yml files. Probably open issues and issues tagged as "bug".

Fix /source endpoint to extract whole archive and support bz2

See https://github.com/CPAN-API/cpan-api/blob/master/lib/MetaCPAN/Plack/Source.pm

Currently, we open the tarball, extract the requested file, store it in a temp directory, and serve it. Subsequent requests for that file are handled directly from the temp directory.

What we should do instead is extract the whole tarball, so we let users browse the directory and can do diffs more easily.

Also, currently we only extract .tar.gz files. Use Archive::Any instead.
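To illustrate what "extract the whole archive and support bz2" amounts to, here is a shell stand-in that picks the decompression flag from the file extension and unpacks the entire archive. The real fix proposed above is to use Archive::Any in Perl; the function names here are hypothetical.

```sh
# extract_flags ARCHIVE
# Print the tar flags needed for the archive type.
extract_flags() {
    case $1 in
        *.tar.gz|*.tgz) echo "-xzf" ;;
        *.tar.bz2)      echo "-xjf" ;;
        *)
            echo "unsupported archive: $1" >&2
            return 1
            ;;
    esac
}

# extract_release ARCHIVE DEST
# Unpack the whole archive into DEST so files can be browsed and diffed.
extract_release() {
    flags=$(extract_flags "$1") || return 1
    mkdir -p "$2" && tar "$flags" "$1" -C "$2"
}

# Usage (commented out):
# extract_release ~/CPAN/authors/id/R/RJ/RJBS/perl-5.12.3.tar.bz2 /tmp/src/perl-5.12.3
```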

Package Download and Page View Analytics

We can easily provide page view stats as an indicator of popular modules. We can do the same for dists downloaded via HTTP from our CPAN mirror.

We may want to look at letting other trusted sources report download statistics so that we can aggregate this information. One way would be to encourage people to configure their favourite command line installer to use our CPAN mirror via HTTP. It would be a very easy (and painless) way for people to contribute information back to the system.

Github watchers

Track # of watchers of Github repos for distributions. Possibly also track changes over time.

.bz2 tarballs are not being indexed

This seems to be a problem with Archive::Any:

$ bin/metacpan release ~/CPAN/authors/id/R/RJ/RJBS/perl-5.12.3.tar.bz2
2011/04/25 12:08:03 I release: Processing /Users/mo/CPAN/authors/id/R/RJ/RJBS/perl-5.12.3.tar.bz2
No handler available for type 'application/x-bzip2' at /Users/mo/perl5/perlbrew/perls/perl-5.12.3/lib/site_perl/5.12.3/Archive/Any.pm line 179.
2011/04/25 12:08:03 F release: Can't call method "is_naughty" on an undefined value at /Users/mo/Documents/workspace/cpan-api/bin/../lib/MetaCPAN/Script/Release.pm line 129.

detection of current CGI::Application distribution is broken

Compare: http://search.metacpan.org/#/dist/CGI-Application (detects 3.31 as the latest version)

with: http://search.cpan.org/dist/CGI-Application/ (detects 4.31 as the latest version)

Monitor Services (Nagios?)

We should set up Nagios or something similar to send alerts if/when services go down, like the API, the CPAN mirror (if, for example, there's an issue with the Middleware) and cpanvote.

Module Dependents

Add data on dependents to the index, much as cpan-mangler uses this information.

search for last name of author that should match never returns

A search for stosberg gives me a progress-bar-of-death:

http://search.metacpan.org/#/author/STOSBERG

But a search for "MARKSTOS" brings up a page for me quickly, including showing that my name is "Mark Stosberg"

http://search.metacpan.org/#/author/MARKSTOS

Mark

Init Scripts for ES and Other Services

We need to set up some init scripts to ensure that all required services come back online after a reboot.

Patch Module::Metadata to use Safe.pm for version evaling

Right now, our indexer uses alarm to kill Module::Metadata when it parses modules like Acme::BadExample that do a while(1) loop in the version block. But there could be more destructive code.

The code to fix this is already there:

https://github.com/andk/pause/blob/master/lib/PAUSE/mldistwatch.pm#L2560

https://github.com/andk/pause/blob/master/lib/PAUSE/mldistwatch.pm#L2688

Update author mapping and author.json files

Write a better mapping for the author meta data. Need to update all existing .json files and index them.

Proxy bug reports through metacpan

Have a centralized endpoint to submit bugs to the appropriate place.

Use cases:

Module on GitHub
- Reporter on GitHub: report the bug directly to the GitHub repo.
- Reporter not on GitHub: report the bug using the MetaCPAN identity on GitHub and send the reporter a link to the issue if they supply an email address.

Module not on GitHub
- Report to RT by sending an email (using the reporter's email address as sender), or redirect the user to RT (if they wish).
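The use cases above reduce to a two-question decision. A minimal sketch, with hypothetical function and route names, might look like this:

```sh
# route_bug MODULE_ON_GITHUB REPORTER_ON_GITHUB
# Both arguments are "yes" or "no". Prints where the report should go.
route_bug() {
    if [ "$1" = yes ]; then
        if [ "$2" = yes ]; then
            # Reporter files the issue on the repo themselves.
            echo "github:reporter"
        else
            # File the issue under the MetaCPAN identity on GitHub.
            echo "github:metacpan-identity"
        fi
    else
        # Email RT on the reporter's behalf, or redirect them to RT.
        echo "rt"
    fi
}
```

Since the design is still up for discussion, the route names are placeholders rather than a proposal.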

Up for discussion :-)

Author Updating via Web App

Once the auth system is finished, we'll need to set up a site where anyone can create an account via a Twitter login. If the user has a PAUSE id, they can request one or more author roles to be added to their account. We will send an authentication email to [email protected]. Once authenticated, they'll be able to use the web app to update their author info.

Going forward authors could add metadata on modules and dists as well. Dists could be marked as:

Deprecated
Unloved (In need of new maintainer)
Looking for co-maintainers

POD Translation

Now that POD is in the index, it would be helpful to have a translation layer in the API. For example, /pod/Moose should return straight POD. /pod2html/Moose would return nicely formatted HTML and /pod2textile/Moose would return textile etc.
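As a sketch of how a client might choose among the proposed endpoints, the helper below maps an output format to the corresponding path. The paths come from the proposal above; `pod_path` and the short format names are assumptions.

```sh
# pod_path FORMAT MODULE
# Print the proposed API path for the requested output format.
pod_path() {
    case $1 in
        pod)     printf '/pod/%s\n' "$2" ;;
        html)    printf '/pod2html/%s\n' "$2" ;;
        textile) printf '/pod2textile/%s\n' "$2" ;;
        *)
            echo "unknown format: $1" >&2
            return 1
            ;;
    esac
}

# Usage:
# pod_path html Moose
```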

StackOverflow

Integrate data available from the StackOverflow API. Questions/threads about modules etc.

Here are JSON results for a search against their API for "Data::Dumper":

curl --compressed -XGET "http://api.stackoverflow.com/1.0/search?intitle=Data::Dumper&key=EByZgXQ_-U-DuGKS38yYjA"

The 'key' argument is an API key that I've set up specifically for CPAN-API. Here's more info on their API keys if you're interested (likely we'll probably want to implement something similar):

http://stackapps.com/questions/67/how-api-keys-work

And here's the full API documentation:

http://stackapps.com/questions/1/api-documentation-and-help

Some email addresses are marked as CENSORED

01mailrc.txt.gz contains CENSORED email addresses.
We should probably replace them with [email protected] or null them

02packages for specific date

So tools could be built to say:

"install DBIx::Simple with the knowledge of 2001-02-12"

Authentication System for Gathering Metadata from Humans

Ideally the index will eventually be expanded by allowing users to log in and tag modules, upvote, downvote etc. The architecture of this system is, as of yet, undefined.

Implement no_index of packages and namespaces

See http://search.cpan.org/~dagolden/CPAN-Meta-2.110580/lib/CPAN/Meta/Spec.pm#no_index

Bookmarking

You should be able to bookmark modules/dists for viewing later. This is not the same as marking as a favourite.

Schwartz Factor

Something like this could also be included in the index: http://babyl.dyndns.org/techblog/entry/schwartz-factor

search.metacpan.org

Basic Info: a sample search page as proof of concept for users to use the api.

Implementation: my idea is to create a single-page javascript engine for using the api, pulling data via ajax and managing the display of the search results/pod/source etc. through some js magic. No server required for running it.

Configure ES proxy to allow params in GET requests

https://github.com/CPAN-API/cpan-api/blob/master/lib/MetaCPAN/Plack/Base.pm#L64

via IRC: http://irclog.perlgeek.de/metacpan/2011-05-07#i_3689099

-> you would basically add something there
-> elsif PATH_INFO =~ _search and METHOD eq GET or someting
-> and then you need to Plack::App::Proxy to the es server
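The condition suggested in the IRC excerpt can be written down as a small predicate. This is a shell illustration of the check only; the actual change would go into the Perl middleware linked above and hand matching requests to Plack::App::Proxy. `should_proxy` is a hypothetical name.

```sh
# should_proxy METHOD PATH_INFO
# Succeeds when a GET request targets a _search path and should
# therefore be forwarded to the ES server.
should_proxy() {
    [ "$1" = GET ] || return 1
    case $2 in
        *_search*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Usage:
# if should_proxy "$REQUEST_METHOD" "$PATH_INFO"; then echo "proxy to ES"; fi
```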

Document general architecture

rafl:

i was wondering. do you guys have docs giving a brief architectural overview of things?

moonk: feel free to just explain things to me over beer. i'll be happy to volunteer to write it down

Don't forget to mention:

How are the identifiers built?
What role does ElasticSearchX::Model play? (MetaCPAN::Model namespace)
MetaCPAN::Script:: namespace
MetaCPAN::Plack:: namespace
$ bin/metacpan

SSL Cert

Free SSL certs at http://cert.startcom.org/ (work in most browsers and even iOS)

We should enable SSL for the API and encourage its use, especially if we start personalization (i.e. session cookies etc. should be encrypted).

Match search.cpan.org api

http://search.cpan.org/faq.html#Is_there_a_API?

Would help if anyone ever wants to migrate.

MetaCPAN Status Page

It would be helpful to have a status page right on www.metacpan.org or www.metacpan.org/status. We could include stats on what is currently in the index (# of dists, authors, etc.). We could also list our module coverage, whether the API is currently online, etc.

Perhaps also a few sample stats on how many PAUSE authors in the index have listed their Github accounts etc.