ElasticMSD

This project enables you to convert the Million Song Dataset into an Elasticsearch index.

Why?

Elasticsearch is a distributed, RESTful search and analytics engine that supports powerful full-text search. Although the Million Song Dataset (MSD) focuses on audio features, it also contains metadata that is useful for research.

Installation

You need the Python elasticsearch and tables packages. Working in a Python virtual environment is recommended.

Set up your virtualenv:

pip install virtualenv
virtualenv ~/.env/elasticmsd
source ~/.env/elasticmsd/bin/activate

Install dependencies:

git clone https://github.com/deezer/elasticmsd
cd elasticmsd
pip install -r requirements.txt

Get hdf5_getters.py from the tbertinmahieux/MSongsDB repository. Because hdf5_getters uses an old tables naming convention, you must then run pt2to3 (a program shipped with the tables package) on this file, even if you stay on Python 2:

wget https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/master/PythonSrc/hdf5_getters.py -O hdf5_getters_2.py
pt2to3 hdf5_getters_2.py > hdf5_getters.py
rm hdf5_getters_2.py

Download the MSD summary file (~300 MB):

wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/msd_summary_file.h5 -O msd_summary_file.h5

Or download the full MSD (~200 GB) from OSDC:

rsync -avzuP publicdata.opensciencedatacloud.org::ark:/31807/osdc-c1c763e4/ /path/to/local_copy

If needed, you can run a local Elasticsearch server with Docker:

docker run --rm -p 9200:9200 -p 9300:9300 -d --name=local_elasticsearch elasticsearch:2.3
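
Once the container is running, you may want to confirm that the server answers before indexing anything. The sketch below is one way to do that with the Python standard library only. It assumes the server is reachable at http://localhost:9200 (the port mapped above) and relies on the root endpoint returning a JSON body with a version.number field. The function name cluster_version and the opener parameter are illustrative and not part of this project.

```python
import json
import urllib.request


def cluster_version(base_url="http://localhost:9200", opener=urllib.request.urlopen):
    """Return the version string reported by the Elasticsearch root endpoint.

    `opener` is injectable so the function can be exercised without a server.
    """
    with opener(base_url, timeout=5) as response:
        info = json.loads(response.read().decode("utf-8"))
    return info["version"]["number"]


if __name__ == "__main__":
    # Raises urllib.error.URLError if nothing is listening on the port yet.
    print("Elasticsearch version:", cluster_version())
```

The container can take a few seconds to start, so a connection error right after docker run does not necessarily mean the setup is wrong.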

Usage

This command reads the MSD summary file (a big h5 file) and indexes its content into an Elasticsearch index.

Note: To index the entire dataset and not just the summary, use the -d argument, e.g. -d /path/to/local/msd

python msd_to_es.py \
        -H localhost \
        -p 9200 \
        -i research_msd \
        -f \
        -m msd_summary_file.h5

Output logs will look like:

2018-03-13 11:01:13,702 Found 1000000 songs in summary file
2018-03-13 11:01:17,037 1000 files read. Bulk ingest.
2018-03-13 11:01:17,037 Last MSD id read: TRMMENV12903CDDA6A
2018-03-13 11:01:22,221 2000 files read. Bulk ingest.
2018-03-13 11:01:22,221 Last MSD id read: TRMWQUX12903CD7496
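
The log lines above suggest that songs are read in batches of 1000 and each batch is sent as one bulk request. The following is a minimal sketch of that pattern, not the actual code of msd_to_es.py. It assumes the helpers.bulk function of the elasticsearch Python client. The names batched, to_action and ingest, the shape of the song dicts (track_id plus fields), and the use of the MSD track id as the document id are all illustrative assumptions.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_action(song, index, doc_type=None):
    """Wrap one song as a bulk action (assumed layout: track_id + fields)."""
    action = {"_index": index, "_id": song["track_id"], "_source": song["fields"]}
    if doc_type is not None:
        # Elasticsearch 2.x expects a document type on every action.
        action["_type"] = doc_type
    return action


def ingest(es, songs, index, doc_type=None, batch_size=1000):
    """Send `songs` to Elasticsearch, one bulk request per batch."""
    # Imported lazily so the pure helpers above work without the package.
    from elasticsearch import helpers

    total = 0
    for chunk in batched(songs, batch_size):
        helpers.bulk(es, (to_action(s, index, doc_type) for s in chunk))
        total += len(chunk)
    return total
```

Batching keeps memory use bounded and avoids one HTTP round trip per song, which matters with a million documents.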

Parameters

python msd_to_es.py -h
usage: msd_to_es.py [-h] [-H ESHOST] [-p ESPORT] [-i ESINDEX] [-t ESTYPE]
                    [-m MSDSUMMARYFILE] [-d MSDDIRECTORY] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -H ESHOST, --eshost ESHOST
                        Host of elasticsearch.
  -p ESPORT, --esport ESPORT
                        Port of elasticsearch host.
  -i ESINDEX, --esindex ESINDEX
                        Name of index to store to.
  -t ESTYPE, --estype ESTYPE
                        Type of index to store to.
  -m MSDSUMMARYFILE, --msdsummaryfile MSDSUMMARYFILE
                        MSD summary file (one h5 file for 1M songs)
  -d MSDDIRECTORY, --msddirectory MSDDIRECTORY
                        MSD directory structure (one h5 file per song)
  -f, --force           Force writing in existing ES index.
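
For readers who want to script around these options, here is a hedged reconstruction of an argparse parser that accepts the same flags as the help text above. It is a sketch based only on that help text, not a copy of the parser in msd_to_es.py. The function name build_parser is illustrative, parsing the port as an integer is an assumption, and no defaults are set because the help text does not state any.

```python
import argparse


def build_parser():
    """Build a parser accepting the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="msd_to_es.py")
    parser.add_argument("-H", "--eshost", help="Host of elasticsearch.")
    # Assumption: the port is parsed as an integer.
    parser.add_argument("-p", "--esport", type=int, help="Port of elasticsearch host.")
    parser.add_argument("-i", "--esindex", help="Name of index to store to.")
    parser.add_argument("-t", "--estype", help="Type of index to store to.")
    parser.add_argument("-m", "--msdsummaryfile",
                        help="MSD summary file (one h5 file for 1M songs)")
    parser.add_argument("-d", "--msddirectory",
                        help="MSD directory structure (one h5 file per song)")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force writing in existing ES index.")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```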

Document in ES

A document in Elasticsearch will look like this:

{
    "msd_tempo" : 120.299,
    "msd_artist_name" : "Darrell Scott",
    "msd_artist_mbid" : "98063361-cdd8-4a9e-b95c-1f29bff780d6",
    "msd_title" : "Shattered Cross",
    "msd_artist_id" : "ARZKPUC1187B99052C",
    "msd_year" : 2006,
    "msd_duration" : 325.53751,
    "msd_mode" : 1,
    "msd_artist_location" : "London, KY",
    "msd_release" : "Transatlantic Sessions - Series 3: Volume One",
    "msd_key" : 9
}
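
Once documents like the one above are indexed, they can be searched by text and filtered by numeric fields. The sketch below builds a query that matches words in msd_title and optionally restricts msd_year to a range. It assumes an Elasticsearch client object whose search method accepts index and body arguments, as in the elasticsearch Python client versions that pair with a 2.x server. The function names build_query and search_titles are illustrative and not part of this project.

```python
def build_query(title_text, year_from=None, year_to=None):
    """Build a search body: full-text match on title, optional year range."""
    bool_query = {"must": [{"match": {"msd_title": title_text}}]}
    year_range = {}
    if year_from is not None:
        year_range["gte"] = year_from
    if year_to is not None:
        year_range["lte"] = year_to
    if year_range:
        # A filter clause narrows results without affecting relevance scores.
        bool_query["filter"] = [{"range": {"msd_year": year_range}}]
    return {"query": {"bool": bool_query}}


def search_titles(es, index, title_text, **years):
    """Run the query and return the _source of each hit."""
    response = es.search(index=index, body=build_query(title_text, **years))
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

For example, search_titles(es, "research_msd", "Shattered Cross", year_from=2000, year_to=2010) would return the stored fields of matching songs released in that period, given an es client connected to the index created in the Usage section.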
