
open-ledger's Introduction

Creative Commons Search prototype


This is an in-progress prototype for a consolidated "front-door" to the Commons of visual imagery. The project has two near-term goals:

  • Seek to understand the requirements for building out a fully-realized "ledger" of all known Commons works across multiple content providers.
  • Provide a visually engaging prototype of what that front-door could be for end users seeking to find licensable materials, and to understand how their works have been re-used.

It is not the goal of this project to:

  • Produce "web-scale" code or implementations
  • Replace or compete with content providers or partners. We seek to make their works more visible, but defer to them for the hard work of generating, promoting, and disseminating content.

Ancillary benefits of this project may include:

  • A better understanding of the kinds of tooling we could provide to partners that would allow them (or more often, their users) to integrate Commons-licensed works into a larger whole. For example, APIs provided by Creative Commons that surface CC-licensed images for inclusion in original writing.
  • Early surfacing of the challenges inherent in integrating partners' metadata into a coherent whole.
  • Research into the feasibility of uniquely fingerprinting visual works across multiple providers to identify and measure re-use -- there will be many technical and privacy challenges here, and we seek to identify those early.

Installation for development

Configuration

Create some local configuration data by copying the example file:

cp openledger/local.py.example openledger/local.py

You will want to set the following settings:

# Make this a long random string
SECRET_KEY = 'CHANGEME'

# Get these from the AWS config for your account
AWS_ACCESS_KEY_ID = 'CHANGEME'
AWS_SECRET_ACCESS_KEY = 'CHANGEME'
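
`SECRET_KEY` just needs to be long and random. One way to generate a suitable value using only the Python standard library (the length here is arbitrary):

```python
# Generate a long random string suitable for use as SECRET_KEY.
# Any sufficiently long, high-entropy string works.
import secrets

print(secrets.token_urlsafe(50))
```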

Docker

The easiest way to run the application is through Docker Compose. Install Docker, then run:

docker-compose up

If everything is working, the following should produce some help output:

docker-compose exec web python3 manage.py

Elasticsearch

Create the Elasticsearch index named openledger. You can change its name in settings/openledger.py.

curl -XPUT 'localhost:9200/openledger?pretty' -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}
'
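
If you would rather create the index from a script, the same settings can be kept as a Python dict; this simply mirrors the JSON body of the curl request above:

```python
import json

# Index settings mirroring the curl request body above.
OPENLEDGER_INDEX_SETTINGS = {
    "settings": {
        "index": {
            "number_of_shards": 3,
            "number_of_replicas": 2,
        }
    }
}

# Serialize for use as the body of a PUT to localhost:9200/openledger
body = json.dumps(OPENLEDGER_INDEX_SETTINGS)
```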

PostgreSQL

Set up the database:

docker-compose exec db createdb -U postgres openledger
docker-compose exec web python manage.py migrate
docker-compose exec web python manage.py createcachetable

This should create the database tables. Everything should work locally, though you won't have any content yet. Visit http://localhost:8000 to see the site.

Testing

Verify that the test suite runs:

docker-compose exec web python manage.py test

All tests should always pass. Tests assume that both Postgres and Elasticsearch are running locally.

Tests are set up to run automatically on master commits by Travis CI. When getting started with the app, it's still a good idea to run tests locally to avoid unnecessary pushes to master.

Deployment

Elastic Beanstalk deployment

Install the EC2 keypair associated with the Elastic Beanstalk instance (this will be shared privately among technical staff).

Install the AWS CLI tools: https://aws.amazon.com/cli/

In the openledger directory, run:

eb init

When you are ready to deploy, run the tests first.

If tests pass, commit your changes locally to git.

Then deploy to staging:

eb deploy open-ledger-3

Verify that your changes worked as expected on staging by clicking the thing you changed.

If that works out, deploy to production:

eb deploy open-ledger-prod

Don't forget to push your changes upstream!

EC2 Data Loader

At times it will be necessary to spin up purpose-built EC2 instances to perform certain one-off tasks like these large loading jobs.

Fabric is set up to do a limited amount of management of these instances. You'll need SSH keys that are registered with AWS:

fab launchloader

This will spin up a single instance of INSTANCE_TYPE, provision its packages, and install the latest version of the code from GitHub (make sure local changes are pushed!).

The code will expect a number of environment variables to be set, including:

export OPEN_LEDGER_LOADER_AMI="XXX" # The AMI name
export OPEN_LEDGER_LOADER_KEY_NAME="XXX" # An SSH key name registered with Amazon
export OPEN_LEDGER_LOADER_SECURITY_GROUPS="default,open-ledger-loader"
export OPEN_LEDGER_REGION="us-west-1"
export OPEN_LEDGER_ACCOUNT="XXX"  # The AWS account for CC
export OPEN_LEDGER_ACCESS_KEY_ID="XXX" # Use an IAM that can reach these hosts, like 'cc-openledger'
export OPEN_LEDGER_SECRET_ACCESS_KEY="XXX"

...and most of the same Django-level configuration variables expected in local.py.example. These values can be extracted from the Elastic Beanstalk config by using the AWS console.
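
A quick preflight check (hypothetical, not part of the repo) can confirm the variables above are actually set before launching instances:

```python
import os

def missing_loader_vars(env):
    """Return the names of required loader variables absent from env."""
    required = [
        "OPEN_LEDGER_LOADER_AMI",
        "OPEN_LEDGER_LOADER_KEY_NAME",
        "OPEN_LEDGER_LOADER_SECURITY_GROUPS",
        "OPEN_LEDGER_REGION",
        "OPEN_LEDGER_ACCOUNT",
        "OPEN_LEDGER_ACCESS_KEY_ID",
        "OPEN_LEDGER_SECRET_ACCESS_KEY",
    ]
    return [name for name in required if not env.get(name)]

# Report anything missing from the current shell environment
missing = missing_loader_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```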

Open Images dataset

To include the Google-provided Open Images dataset from https://github.com/openimages/dataset, you can either download the files locally (faster) or use the versions in the CC S3 bucket (used by the AWS deployments).

  1. Download the files linked as:
  • Image URLs and metadata
  • Human image-level annotations (validation set)
  2. Run the database import script as a Django management command:

The script expects:

  • Path to a data file (usually CSV)
  • A 'source' label (the source of the dataset)
  • An identifier of the object type (usually 'images', but potentially also 'tags' or files that map images to tags.)
. venv/bin/activate

./manage.py loader /path/to/openimages/images_2016_08/validation/images.csv openimages images
./manage.py loader /path/to/openimages/dict.csv openimages tags
./manage.py loader /path/to/openimages/human_ann_2016_08/validation/labels.csv openimages image-tags

(This loads the smaller "validation" subset; the "train" files are the full 9-million-image set.)

This loader is invoked in production using the Fabric task, above:

fab launchloader --set datasource=openimages-small

See fabfile.py for complete documentation on loader tasks, including loading of other image sets.

  3. Index the newly imported data in Elasticsearch:
./manage.py indexer

open-ledger's People

Contributors

afeld, aldenstpage, lizadaly, pa-w, ultimatecoder


open-ledger's Issues

by platform vs category

Would like to open this issue for discussion: does it make sense to have future iterations categorize by something like "cultural heritage institutions" or "museums" versus Rijksmuseum, Flickr, etc.? I can imagine future versions of search getting unwieldy as we add hundreds of platforms.

Consider mechanisms to identify reused works

Identifying near-identical images is the goal of #5, but what about reuse that's quite transformative and not subject to identification by automated processes?

I typed my Flickr handle into the prototype search box and found reuse of one of my CC-licensed photos that I didn't know about (cool!): https://www.flickr.com/photos/53133240@N00/5051004716

Then I tried searching for the part of a Flickr URL up to my username (www.flickr.com/photos/lizadaly/) and found another that way:
https://www.flickr.com/photos/93211492@N06/8478352802

(That URL search query gives spurious results from 500px, though.)

Though identifying derivative works is not a priority for this part of the project, evaluating techniques could be instructive.

Connect ElasticSearch instance to listen for database changes

Following #47, set up a mechanism to receive changes to the database.

We'll probably want two mechanisms:

  1. A simple function that takes an Image record and propagates all search data to the ES instance, clobbering the existing search data (it's tiny).

We can use this function to populate the initial index.
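
As a rough sketch of that propagation function (the field names `identifier`, `title`, and `tags` are assumptions based on the schema discussed in these issues, and the stand-in record below is not the real Django model):

```python
from types import SimpleNamespace

def image_to_search_doc(image):
    """Flatten an Image record into the document pushed to Elasticsearch.

    Assumes `image` exposes `identifier`, `title`, and an iterable of
    tag objects with a `name` attribute (field names are guesses).
    """
    return {
        "identifier": image.identifier,
        "title": image.title,
        # Denormalize tags into a single searchable text field
        "tags": " ".join(tag.name for tag in image.tags),
    }

# Demo with a stand-in record:
img = SimpleNamespace(
    identifier="abc123",
    title="A cat",
    tags=[SimpleNamespace(name="cat"), SimpleNamespace(name="animal")],
)
doc = image_to_search_doc(img)
# Indexing the doc would then clobber any existing copy, e.g.:
#   es.index(index="openledger", id=doc["identifier"], body=doc)
```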

Later:

  2. Use SQLAlchemy events to listen for changes at modification time:
    http://docs.sqlalchemy.org/en/latest/orm/session_events.html

Do not use Flask-SQLAlchemy's TRACK_MODIFICATIONS setting; it's deprecated and non-performant.

We could consider using Postgres's event model, but I don't think at this time the performance benefits are worth being tied to PG.

Note that neither of these implementations will probably "work" when a database is restored from a snapshot or backup; we'll need other approaches to ensure that the ES index stays in sync. (Presumably, also backing up ES at the same points-in-time.)

Rebrand "Open Images" search to "Open Ledger" search

Change the current UI on the "Open Images" search page to be:

  1. The primary page at / (move the partner API search to a subpage)
  2. Branded as the "Open Ledger" search

Maybe include a total number of records there and a list of current datasources.

Index image metadata for search

Using Elasticsearch, index first-order image metadata (like title) for full-text search.

We may also want to put author name into the index, but that should not be subject to full-text transformations like stemming, etc.

Provide HTTPS on both AWS instances

We'll be adding user-facing features soon, including accounts, which means it's a good time to add HTTPS to both hosts:

https://openledger.creativecommons.org/
https://openledger-dev.creativecommons.org/

From some quick reading, it looks like using Let's Encrypt with Elastic Beanstalk is a little tricky. AWS has a free certificate offering now, though: https://us-west-1.console.aws.amazon.com/acm/home?region=us-west-1#/firstrun/

I can probably figure out what buttons to click on if you enable that.

Of course one can also upload a cert from elsewhere too.

Once the certs are in, I'll require the webapp to run under HTTPS only.

Deploy application on AWS

It's time to grow up and move to CC-controlled infrastructure. Deploy the application on an AWS cluster.

This will make it feasible to:

  • Use a real search engine
  • Deploy much larger datasets (like the Flickr 100M)
  • Begin to accept user data

Ability to add a tag to a work

It should be possible to add a tag to a work.

There is already support for tags generated by [#18] in the database; this is extending that to a human-sourced tag.

As with [#19], it should be possible to add tags anonymously.

Add admin privileges to 'liza' IAM account on AWS?

I was able to spin up the openledger app successfully in my AWS environment, but the same steps aren't working when I switch to my IAM user in the CC account:

From the Elastic Beanstalk CLI:

INFO: Using elasticbeanstalk-us-west-2-194250649014 as Amazon S3 storage bucket for environment data.
ERROR: Service:AmazonCloudFormation, Message:The AWS Access Key Id needs a subscription for the service
ERROR: Failed to launch environment.

Ideally, I'd like to not use my own IAM at all, and set up one for the openledger app directly. But I don't have permission:

(screenshot)

Can my IAM user get upgraded to admin level, so I can create a brand new IAM for the app, and ensure that it has appropriate permissions to spin up an Elastic Beanstalk instance?

Add search-by-creator to the Open Ledger database

Clicking on a creator name should return all results that we know of by that creator.

Initially this will just be a 1:1 correspondence with their identity on the distinct services—we won't "know" that a creator is the same on two services.

(I don't think it's worth doing this for the API partners interface as it'll be a bit of work to implement the query for each API type, versus just doing it once for our own database. But TBD.)

Add NYPL handler

NYPL has a rich API and many CC0-licensed works. Include this in the prototype.

Set up ElasticSearch schema/data model

  1. Set up ES on our AWS cluster
  2. Create a good first approximation of a search schema:

models.Image.identifier: unique id, not tied to database auto-keys (primary key)
models.Image.title: searchable full-text
models.Image.tags: for each tag, concatenate to a denormalized 'tags' field and search that
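
The schema above might translate into an Elasticsearch mapping along these lines (the field types are assumptions: `keyword` fields are matched exactly with no analysis, while `text` fields get full-text treatment like stemming):

```python
# Rough Elasticsearch mapping for the proposed schema (a sketch, not
# the project's actual mapping).
IMAGE_MAPPING = {
    "properties": {
        "identifier": {"type": "keyword"},  # unique id, exact match only
        "title": {"type": "text"},          # searchable full-text
        "tags": {"type": "text"},           # denormalized tag names
    }
}
```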

Track version number of CC licenses

Most of these providers offer only one version of the licenses, but don't necessarily expose that in their API responses. We'll have to keep track of the fact that, e.g. Flickr is 2.0 and 500px is 3.0.

Index tag metadata for search

Using Elasticsearch, index all of the tag metadata for full-text search.

This will probably require some thought about denormalization—flattening all of the tags for a particular image as the image's full-text tag record.

Eventually will also require database triggers for indexing.

Add pagination for each resource

Currently, results only return the first n items. Add a simple pagination scheme (next/prev) for each item.

(Understanding the pagination requirements for each API will be necessary for the OL backend, regardless.)
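
A minimal next/prev scheme over offset-style results might look like this (purely illustrative; each partner API will have its own page or cursor parameters):

```python
def page_links(total, page, per_page=20):
    """Compute prev/next page numbers for simple offset pagination.

    Returns (prev, next), where either may be None at the boundaries.
    Illustrative only; not tied to any particular partner API.
    """
    last_page = max(1, -(-total // per_page))  # ceiling division
    prev_page = page - 1 if page > 1 else None
    next_page = page + 1 if page < last_page else None
    return prev_page, next_page
```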

Add Flickr endpoint

Using the existing framework, add a Flickr endpoint to the search prototype

Add ability to mark a work as a "favorite"

This would be recorded in our database as a "vote" for this image.

It should be possible to favorite a photo anonymously (i.e. without a login).

In the future we may want to consider the possibility of abuse of this feature through automation. That said, it's unclear that there's an obvious reason to game the service and I don't recommend investing a lot of time in preventing abuse right now. If anything, we should simply focus on abuse detection—looking for anomalous or disproportionate usage of a subset of photos.

500px images are returned with watermarks via the API

This wasn't obvious until I started implementing #22 and wanted to include a high-res representation of the image. The non-thumbnail images I get from the 500px API include a watermark:

(screenshot)

This is documented in their API, but I hadn't noticed it when I did my review:

You'll find the URLs to the image(s) for a photo in the images field in the returned JSON for a photo. The images provided with our standard API access will be watermarked with the 500px logo and attribution. For non-watermarked images please contact [email protected]

It might be good to have a conversation with them about next steps.

cc @ryanmerkley
