cornel-movie-dialogs-corpus-storm

A set of Python modules for working with the Cornell Movie-Dialogs Corpus through the Storm ORM.

Abstract

This package includes classes that extend the Storm ORM to model the Cornell Movie-Dialogs Corpus data.

Install

pip install storm                # if not already installed
pip install cornel-movie-dialogs-corpus-storm

Setup

  1. Download the corpus and unzip it.
  2. Generate the database and insert the corpus data with generate-mdcorpus-database.py.

For example:

generate-mdcorpus-database.py --corpus-dir "cornell movie-dialogs corpus" corpus.db

Usage

from mdcorpus.orm import *
from mdcorpus.parser import *

...

Class List

  • MovieTitlesMetadata
  • Genre
  • MovieGenreLine
  • MovieCharactersMetadata
  • MovieConversation
  • MovieLine
  • RawScriptUrl

Corpus Problem

These are notes on the corpus problems I ran into and how I dealt with them.

movie_titles_metadata.txt

  • I ignored the letter that follows the year.
    • For example, line 34: 1989/I
  • I ignored duplicate genre entries.
    • For example, line 58: ['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']
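The two workarounds above can be sketched as follows. This is only an illustration of the idea, not the code used by generate-mdcorpus-database.py; the function names parse_year and parse_genres are hypothetical, and the sketch assumes the year and genre fields have already been split out of the line.

```python
import ast
import re


def parse_year(raw):
    """Return the leading four-digit year, dropping a suffix such as '/I'."""
    match = re.match(r"\d{4}", raw.strip())
    if match is None:
        raise ValueError("no year found in %r" % raw)
    return int(match.group(0))


def parse_genres(raw):
    """Parse a genre list literal and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for genre in ast.literal_eval(raw.strip()):
        if genre not in seen:
            seen.add(genre)
            unique.append(genre)
    return unique


year = parse_year("1989/I")
genres = parse_genres("['horror', 'mystery', 'mystery', 'sci-fi', 'sci-fi']")
```

Keeping first-seen order means the stored genre list still reads the same way as the corpus line, only without the repeats.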

Code Problem

I use Python 2.7, and I do not know how to use the codecs module properly. (See Unicode HOWTO, Python 2.7ja1 documentation.)

mime

Convert the text encoding to UTF-8 with Mi

before

cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt:                    text/plain; charset=iso-8859-1
chameleons.pdf:                application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=iso-8859-1
movie_conversations.txt:       text/plain; charset=us-ascii
movie_lines.txt:               text/plain; charset=us-ascii
movie_titles_metadata.txt:     text/plain; charset=iso-8859-1
raw_script_urls.txt:           text/plain; charset=iso-8859-1

after

cornell movie-dialogs corpus$ file --mime {(ls)}
README.txt:                    text/plain; charset=utf-8
chameleons.pdf:                application/pdf; charset=binary
movie_characters_metadata.txt: text/plain; charset=utf-8
movie_conversations.txt:       text/plain; charset=us-ascii
movie_lines.txt:               text/plain; charset=us-ascii
movie_titles_metadata.txt:     text/plain; charset=utf-8
raw_script_urls.txt:           text/plain; charset=utf-8
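The same conversion could also be done in Python instead of an editor. The sketch below is not part of this package; convert_to_utf8 is a hypothetical helper, and the source encoding is assumed to be iso-8859-1 because that is what file --mime reports above. It uses io.open, which accepts an encoding argument on Python 2.7 as well as Python 3.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Decode src_path with src_encoding and write the text to dst_path as UTF-8."""
    # newline="" keeps the original line endings untouched on read and write.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Writing to a separate destination path keeps the original corpus file available for comparison if the result looks wrong.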

movie_titles_metadata.txt

  • line 115, léon

movie_characters_metadata.txt

  • lines 1727-1736, léon

result

sqlite> select * from movie_titles_metadata where title = 'léon';
sqlite> select * from movie_titles_metadata where title = 'l駮n';
114|l駮n|1994|8.6|204901
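One possible explanation for this result, which is a guess and not something confirmed here, is that the ISO-8859-1 bytes of the title were decoded with a Japanese multi-byte encoding such as Shift_JIS somewhere along the way, so that the byte for "é" and the following "o" were read as a single character. The sketch below only shows how the same bytes come out differently under the two codecs; decode_title is a hypothetical helper.

```python
# Bytes of the title as they would appear in an ISO-8859-1 file.
RAW_TITLE = b"l\xe9on"


def decode_title(raw, encoding):
    """Decode raw bytes, replacing anything the codec cannot map."""
    return raw.decode(encoding, "replace")


as_latin1 = decode_title(RAW_TITLE, "iso-8859-1")  # intended reading
as_sjis = decode_title(RAW_TITLE, "shift_jis")     # assumed wrong reading
print(repr(as_latin1), repr(as_sjis))
```

If the two values differ, a query for the intended title will not match the row that was stored from the wrong reading.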
