
open-ledger's Introduction

Creative Commons Search prototype


This is an in-progress prototype for a consolidated "front-door" to the Commons of visual imagery. The project has two near-term goals:

  • Seek to understand the requirements for building out a fully-realized "ledger" of all known Commons works across multiple content providers.
  • Provide a visually engaging prototype of what that front-door could be for end users seeking to find licensable materials, and to understand how their works have been re-used.

It is not the goal of this project to:

  • Produce "web-scale" code or implementations
  • Replace or compete with content providers or partners. We seek to make their works more visible, but defer to them for the hard work of generating, promoting, and disseminating content.

Ancillary benefits of this project may include:

  • A better understanding of the kinds of tooling we could provide to partners that would allow them (or more often, their users) to integrate Commons-licensed works into a larger whole. For example, APIs provided by Creative Commons that surface CC-licensed images for inclusion in original writing.
  • Early surfacing of the challenges inherent in integrating partners' metadata into a coherent whole.
  • Research into the feasibility of uniquely fingerprinting visual works across multiple providers to identify and measure re-use -- there will be many technical and privacy challenges here, and we seek to identify those early.

Installation for development

Configuration

Create some local configuration data by copying the example file:

cp openledger/local.py.example openledger/local.py

You will want to set the following settings:

# Make this a long random string
SECRET_KEY = 'CHANGEME'

# Get these from the AWS config for your account
AWS_ACCESS_KEY_ID = 'CHANGEME'
AWS_SECRET_ACCESS_KEY = 'CHANGEME'
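
`SECRET_KEY` just needs to be long and random. One way to generate a suitable value using only the Python standard library (the length here is arbitrary):

```python
# Generate a long random string suitable for use as SECRET_KEY.
# Any sufficiently long, high-entropy string works.
import secrets

print(secrets.token_urlsafe(50))
```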

Docker

The easiest way to run the application is through Docker Compose. Install Docker, then run:

docker-compose up

If everything is working, the following should produce some help output:

docker-compose exec web python3 manage.py

Elasticsearch

Create the Elasticsearch index named openledger. You can change its name in settings/openledger.py.

curl -XPUT 'localhost:9200/openledger?pretty' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}
'
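
If you would rather create the index from a script, the same settings can be kept as a Python dict; this simply mirrors the JSON body of the curl request above:

```python
import json

# Index settings mirroring the curl request body above.
OPENLEDGER_INDEX_SETTINGS = {
    "settings": {
        "index": {
            "number_of_shards": 3,
            "number_of_replicas": 2,
        }
    }
}

# Serialize for use as the body of a PUT to localhost:9200/openledger
body = json.dumps(OPENLEDGER_INDEX_SETTINGS)
```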

PostgreSQL

Set up the database:

docker-compose exec db createdb -U postgres openledger
docker-compose exec web python manage.py migrate
docker-compose exec web python manage.py createcachetable

This should create the database tables. Everything should work locally, though you won't have any content yet. Visit http://localhost:8000 to see the site.

Testing

Verify that the test suite runs:

docker-compose exec web python manage.py test

All tests should always pass. Tests assume that both Postgres and Elasticsearch are running locally.

Tests are set up to run automatically on master commits by Travis CI. When getting started with the app, it's still a good idea to run tests locally to avoid unnecessary pushes to master.

Deployment

Elastic Beanstalk deployment

Install the EC2 keypair associated with the Elastic Beanstalk instance (this will be shared privately among technical staff).

Install the AWS CLI tools: https://aws.amazon.com/cli/

In the openledger directory, run:

eb init

When you are ready to deploy, run the tests first.

If tests pass, commit your changes locally to git.

Then deploy to staging:

eb deploy open-ledger-3

Verify that your changes worked as expected on staging by clicking the thing you changed.

If that works out, deploy to production:

eb deploy open-ledger-prod

Don't forget to push your changes upstream!

EC2 Data Loader

At times it will be necessary to spin up purpose-built EC2 instances to perform certain one-off tasks like these large loading jobs.

Fabric is set up to do a limited amount of management of these instances. You'll need SSH keys that are registered with AWS:

fab launchloader

This will spin up a single instance of INSTANCE_TYPE, provision its packages, and install the latest version of the code from GitHub (make sure local changes are pushed!).

The code will expect a number of environment variables to be set, including:

export OPEN_LEDGER_LOADER_AMI="XXX" # The AMI name
export OPEN_LEDGER_LOADER_KEY_NAME="XXX" # An SSH key name registered with Amazon
export OPEN_LEDGER_LOADER_SECURITY_GROUPS="default,open-ledger-loader"
export OPEN_LEDGER_REGION="us-west-1"
export OPEN_LEDGER_ACCOUNT="XXX"  # The AWS account for CC
export OPEN_LEDGER_ACCESS_KEY_ID="XXX" # Use an IAM that can reach these hosts, like 'cc-openledger'
export OPEN_LEDGER_SECRET_ACCESS_KEY="XXX"

...and most of the same Django-level configuration variables expected in local.py.example. These values can be extracted from the Elastic Beanstalk config by using the AWS console.
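
A quick preflight check (hypothetical, not part of the repo) can confirm the variables above are actually set before launching instances:

```python
import os

def missing_loader_vars(env):
    """Return the names of required loader variables absent from env."""
    required = [
        "OPEN_LEDGER_LOADER_AMI",
        "OPEN_LEDGER_LOADER_KEY_NAME",
        "OPEN_LEDGER_LOADER_SECURITY_GROUPS",
        "OPEN_LEDGER_REGION",
        "OPEN_LEDGER_ACCOUNT",
        "OPEN_LEDGER_ACCESS_KEY_ID",
        "OPEN_LEDGER_SECRET_ACCESS_KEY",
    ]
    return [name for name in required if not env.get(name)]

# Report anything missing from the current shell environment
missing = missing_loader_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```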

Open Images dataset

To include the Google-provided Open Images dataset from https://github.com/openimages/dataset, you can either download the files locally (faster) or use the versions in the CC S3 bucket (used by the AWS deployments).

  1. Download the files linked as:
  • Image URLs and metadata
  • Human image-level annotations (validation set)
  2. Run the database import script as a Django management command:

The script expects:

  • Path to a data file (usually CSV)
  • A 'source' label (the source of the dataset)
  • An identifier of the object type (usually 'images', but potentially also 'tags' or files that map images to tags.)
. venv/bin/activate

./manage.py loader /path/to/openimages/images_2016_08/validation/images.csv openimages images
./manage.py loader /path/to/openimages/dict.csv openimages tags
./manage.py loader /path/to/openimages/human_ann_2016_08/validation/labels.csv openimages image-tags

(This loads the smaller "validation" subset; the "train" files are the full 9-million-image set.)

This loader is invoked in production using the Fabric task, above:

fab launchloader --set datasource=openimages-small

See fabfile.py for complete documentation on loader tasks, including loading of other image sets.

  3. Index the newly imported data in Elasticsearch:
./manage.py indexer

open-ledger's People

Contributors

afeld, aldenstpage, lizadaly, pa-w, ultimatecoder


open-ledger's Issues

by platform vs category

Would like to open this issue for discussion: does it make sense to have future iterations categorize by something like "cultural heritage institutions" or "museums" versus Rijksmuseum, Flickr, etc.? I can imagine future versions of search getting unwieldy as we add hundreds of platforms.

Consider mechanisms to identify reused works

Identifying near-identical images is the goal of #5, but what about reuse that's quite transformative and not subject to identification by automated processes?

I typed my Flickr handle into the prototype search box and found reuse of one of my CC-licensed photos that I didn't know about (cool!): https://www.flickr.com/photos/53133240@N00/5051004716

Then I tried searching for the part of a Flickr URL up to my username (www.flickr.com/photos/lizadaly/) and found another that way:
https://www.flickr.com/photos/93211492@N06/8478352802

(That URL search query gives spurious results from 500px, though.)

Though identifying derivative works is not a priority for this part of the project, evaluating techniques could be instructive.

Connect ElasticSearch instance to listen for database changes

Following #47, set up a mechanism to receive changes to the database.

We'll probably want two mechanisms:

  1. A simple function that takes an Image record and propagates all search data to the ES instance, clobbering the existing search data (it's tiny).

We can use this function to populate the initial index.
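
As a rough sketch of that propagation function (the field names `identifier`, `title`, and `tags` are assumptions based on the schema discussed in these issues, and the stand-in record below is not the real Django model):

```python
from types import SimpleNamespace

def image_to_search_doc(image):
    """Flatten an Image record into the document pushed to Elasticsearch.

    Assumes `image` exposes `identifier`, `title`, and an iterable of
    tag objects with a `name` attribute (field names are guesses).
    """
    return {
        "identifier": image.identifier,
        "title": image.title,
        # Denormalize tags into a single searchable text field
        "tags": " ".join(tag.name for tag in image.tags),
    }

# Demo with a stand-in record:
img = SimpleNamespace(
    identifier="abc123",
    title="A cat",
    tags=[SimpleNamespace(name="cat"), SimpleNamespace(name="animal")],
)
doc = image_to_search_doc(img)
# Indexing the doc would then clobber any existing copy, e.g.:
#   es.index(index="openledger", id=doc["identifier"], body=doc)
```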

Later:

  2. Use SQLAlchemy events to listen for changes at modification time:
    http://docs.sqlalchemy.org/en/latest/orm/session_events.html

Do not use Flask-SQLAlchemy's TRACK_MODIFICATIONS setting; it's deprecated and non-performant.

We could consider using Postgres's event model, but I don't think at this time the performance benefits are worth being tied to PG.

Note that neither of these implementations will probably "work" when a database is restored from a snapshot or backup; we'll need other approaches to ensure that the ES index stays in sync. (Presumably, also backing up ES at the same points-in-time.)

Rebrand "Open Images" search to "Open Ledger" search

Change the current UI on the "Open Images" search page to be:

  1. The primary page at / (move the partner API search to a subpage)
  2. Branded as the "Open Ledger" search

Maybe include a total number of records there and a list of current datasources.

Index image metadata for search

Using Elasticsearch, index first-order image metadata (like title) for full-text search.

We may also want to put author name into the index, but that should not be subject to full-text transformations like stemming, etc.

Provide HTTPS on both AWS instances

We'll be adding user-facing features soon, including accounts, which means it's a good time to add HTTPS to both hosts:

https://openledger.creativecommons.org/
https://openledger-dev.creativecommons.org/

From some quick reading, it looks like using Let's Encrypt with Elastic Beanstalk is a little tricky. AWS has a free certificate offering now, though: https://us-west-1.console.aws.amazon.com/acm/home?region=us-west-1#/firstrun/

I can probably figure out what buttons to click on if you enable that.

Of course one can also upload a cert from elsewhere too.

Once the certs are in, I'll require the webapp to run under HTTPS only.

Deploy application on AWS

It's time to grow up and move to CC-controlled infrastructure. Deploy the application on an AWS cluster.

This will make it feasible to:

  • Use a real search engine
  • Deploy much larger datasets (like the Flickr 100M)
  • Begin to accept user data

Ability to add a tag to a work

It should be possible to add a tag to a work.

There is already support for tags generated by [#18] in the database; this is extending that to a human-sourced tag.

As with [#19], it should be possible to add tags anonymously.

Add admin privileges to 'liza' IAM account on AWS?

I was able to spin up the openledger app successfully in my AWS environment, but the same steps aren't working when I switch to my IAM user in the CC account:

From the Elastic Beanstalk CLI:

INFO: Using elasticbeanstalk-us-west-2-194250649014 as Amazon S3 storage bucket for environment data.
ERROR: Service:AmazonCloudFormation, Message:The AWS Access Key Id needs a subscription for the service
ERROR: Failed to launch environment.

Ideally, I'd like to not use my own IAM at all, and set up one for the openledger app directly. But I don't have permission:

(screenshot)

Can my IAM user get upgraded to admin level, so I can create a brand new IAM for the app, and ensure that it has appropriate permissions to spin up an Elastic Beanstalk instance?

Add search-by-creator to the Open Ledger database

Clicking on a creator name should return all results that we know of by that creator.

Initially this will just be a 1:1 correspondence with their identity on the distinct services—we won't "know" that a creator is the same on two services.

(I don't think it's worth doing this for the API partners interface as it'll be a bit of work to implement the query for each API type, versus just doing it once for our own database. But TBD.)

Add NYPL handler

NYPL has a rich API and many CC0-licensed works. Include this in the prototype.

Set up ElasticSearch schema/data model

  1. Set up ES on our AWS cluster
  2. Create a good first approximation of a search schema:

models.Image.identifier: unique id, not tied to database auto-keys (primary key)
models.Image.title: searchable full-text
models.Image.tags: for each tag, concatenate to a denormalized 'tags' field and search that
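
The schema above might translate into an Elasticsearch mapping along these lines (the field types are assumptions: `keyword` fields are matched exactly with no analysis, while `text` fields get full-text treatment like stemming):

```python
# Rough Elasticsearch mapping for the proposed schema (a sketch, not
# the project's actual mapping).
IMAGE_MAPPING = {
    "properties": {
        "identifier": {"type": "keyword"},  # unique id, exact match only
        "title": {"type": "text"},          # searchable full-text
        "tags": {"type": "text"},           # denormalized tag names
    }
}
```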

Track version number of CC licenses

Most of these providers offer only one version of the licenses, but don't necessarily expose that in their API responses. We'll have to keep track of the fact that, e.g. Flickr is 2.0 and 500px is 3.0.

Index tag metadata for search

Using Elasticsearch, index all of the tag metadata for full-text search.

This will probably require some thought about denormalization—flattening all of the tags for a particular image as the image's full-text tag record.

Eventually will also require database triggers for indexing.

Add pagination for each resource

Currently, results only return the first n items. Add a simple pagination scheme (next/prev) for each item.

(Understanding the pagination requirements for each API will be necessary for the OL backend, regardless.)
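
A minimal next/prev scheme over offset-style results might look like this (purely illustrative; each partner API will have its own page or cursor parameters):

```python
def page_links(total, page, per_page=20):
    """Compute prev/next page numbers for simple offset pagination.

    Returns (prev, next), where either may be None at the boundaries.
    Illustrative only; not tied to any particular partner API.
    """
    last_page = max(1, -(-total // per_page))  # ceiling division
    prev_page = page - 1 if page > 1 else None
    next_page = page + 1 if page < last_page else None
    return prev_page, next_page
```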

Add Flickr endpoint

Using the existing framework, add a Flickr endpoint to the search prototype

Add ability to mark a work as a "favorite"

This would be recorded in our database as a "vote" for this image.

It should be possible to favorite a photo anonymously (i.e. without a login).

In the future we may want to consider the possibility of abuse of this feature through automation. That said, it's unclear that there's an obvious reason to game the service and I don't recommend investing a lot of time in preventing abuse right now. If anything, we should simply focus on abuse detection—looking for anomalous or disproportionate usage of a subset of photos.

500px images are returned with watermarks via the API

This wasn't obvious until I started implementing #22 and wanted to include a high-res representation of the image. The non-thumbnail images I get from the 500px API include a watermark:

(screenshot)

This is documented in their API, but I hadn't noticed it when I did my review:

You'll find the URLs to the image(s) for a photo in the images field in the returned JSON for a photo. The images provided with our standard API access will be watermarked with the 500px logo and attribution. For non-watermarked images please contact [email protected]

It might be good to have a conversation with them about next steps.

cc @ryanmerkley
