pbs / pycaption
Python module to read/write popular video caption formats
License: Apache License 2.0
15 out of 71 tests are failing here. The reason seems to be related to the way unicode is handled. Here is the log: https://storage.googleapis.com/vimeo-dev-dra-us/texteas/pycaption_test_fail
The spec for DFXP allows times to be expressed as offset time in addition to clock time. The DFXP converter errors on input files that use offset time.
Example body fragment from Spec http://www.w3.org/TR/ttml1/#ttml-example-body
<p xml:id="subtitle3" begin="10.0s" end="16.0s" style="s2">
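The two time forms can be handled by a single parsing routine. Below is a minimal sketch (not pycaption's implementation) that accepts both clock-time expressions like the fragment above and offset-time expressions such as "10.0s"; the metric scales follow the TTML spec, with a 30 fps assumption for frame-based values:

```python
import re

def parse_dfxp_time(expr, frame_rate=30.0):
    """Parse a DFXP/TTML time expression into seconds.

    Handles clock time ("HH:MM:SS.fraction" or "HH:MM:SS:FF") as well as
    offset time ("10.0s", "1500ms", "2m", "1.5h", "45f").
    Illustrative sketch only, not pycaption's implementation.
    """
    offset = re.match(r'^([0-9.]+)(h|ms|m|s|f)$', expr)
    if offset:
        value, metric = float(offset.group(1)), offset.group(2)
        scale = {'h': 3600.0, 'm': 60.0, 's': 1.0,
                 'ms': 0.001, 'f': 1.0 / frame_rate}[metric]
        return value * scale
    hours, minutes, rest = expr.split(':', 2)
    if ':' in rest:                       # "SS:FF" frames form
        seconds, frames = rest.split(':')
        secs = int(seconds) + int(frames) / frame_rate
    else:
        secs = float(rest)
    return int(hours) * 3600 + int(minutes) * 60 + secs
```

With this, begin="10.0s" and begin="00:00:10.000" resolve to the same instant.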
The WebVTT reader strips all WebVTT inline tags (like <i>, <b>, etc.).
Hi,
I'm using the pycaption library in a quite outdated and rigid environment. In short, it's a plugin for a large application; Python 2.7, and the parent application has BeautifulSoup4==4.1.3 pinned and installed.
Is it possible to relax pycaption's requirement from beautifulsoup4<4.5.0,>=4.2.1 to beautifulsoup4<4.5.0,>=4.1.3?
Of course, I'm talking about "0.x" branch.
Other suggestions are much appreciated.
Best regards,
Alexander
When a DFXP file has multiple captions with the exact same timing (same start and same end), some players don't display any captions at all. It should be possible to generate output that merges concurrent captions.
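The merge described above can be sketched in a few lines. Captions are modelled here as plain (start, end, text) tuples in playback order; this is a standalone illustration of the idea, not pycaption code:

```python
from itertools import groupby

def merge_concurrent(captions):
    """Collapse captions that share the exact same (start, end) timing into
    a single caption whose lines are joined with newlines.
    `captions` is a list of (start, end, text) tuples in playback order.
    """
    merged = []
    for timing, group in groupby(captions, key=lambda c: (c[0], c[1])):
        start, end = timing
        merged.append((start, end, '\n'.join(c[2] for c in group)))
    return merged
```

Two captions sharing 0–1s collapse into one two-line caption, so players that choke on concurrent cues get a single cue instead.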
This is more of a question than an issue. I've been trying to write one SRT per language that exists inside of a DFXP. So if the DFXP has 5 languages, I'd like to end up with 5 different SRTs. With the get_languages and get_captions methods it seemed like it would be doable but get_captions is returning a CaptionList that doesn't translate to a CaptionSet. Or at least I can't seem to figure it out. And it seems that a writer only accepts a CaptionSet.
Any thoughts? Sorry if this isn't the right place to post this.
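The shape of the fix is to rebuild a one-language caption set around each per-language caption list before handing it to a writer. The sketch below models the caption set as a plain dict mapping language code to a caption list; the assumption (not confirmed by the source) is that pycaption's CaptionSet wraps the same mapping, so each single-language set could then go to SRTWriter:

```python
def split_languages(caption_set):
    """Split a multi-language caption collection into one single-language
    collection per language. Modelled as a dict {lang: [captions]};
    a standalone sketch, not pycaption API.
    """
    return {lang: {lang: captions} for lang, captions in caption_set.items()}
```

A DFXP with 5 languages would yield 5 one-language sets, each written to its own SRT.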
Although there's no comprehensive SAMI specification, SAMI files in the wild seem to apply the text-align attribute not to a cue's <p> tag but to any of a number of possible tags that may descend from <p> (e.g. span, div, etc.). According to usual HTML/CSS rules this makes no sense, but since most files seemed to use the text-align attribute applied to a single <span> within a <p>, we decided to determine a caption's alignment based on the first text-align value found on any child element of the <p>.
After implementing this solution, however, it turns out that although the caption's positioning is being preserved on DFXP, sometimes it is not preserved in WebVTT output.
I am currently doing a large-scale conversion of .srt to .vtt files. I have been successfully using pycaption 1.0.0 for months, and all of a sudden today one .srt file is just not working.
The code that has worked up to this point looks like this:
with open(tmp_srt_file, 'rU') as srt_file:
    converter = CaptionConverter()
    converter.read(srt_file.read().decode('utf-8'), SRTReader())
    vtt = converter.write(WebVTTWriter())
    return vtt
It is failing on the converter.read() (NOT the decode to utf-8) with CaptionReadNoCaptions(('empty caption file',)).
If I print srt_file.read().decode('utf-8'), it looks like it should look.
Unfortunately, I can't share this specific file as I am under NDA with a client. However, I can comment that as far as I can tell, there are no special characters. It looks like any other .srt file I have worked with. I have certainly seen weirder .srt files that worked.
Is there something I should be looking for, or is this potentially a real issue?
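When a file "looks like it should look" when printed but still fails to parse, the usual culprits are characters that are invisible on screen: a BOM, no-break spaces, zero-width spaces, or stray carriage returns. A small debugging aid (not part of pycaption) can surface them:

```python
def diagnose_invisibles(text):
    """List characters that commonly break caption parsing yet look fine
    when printed: BOMs, no-break spaces, zero-width spaces, and stray
    carriage returns. Returns (index, description) pairs.
    """
    suspects = {
        '\ufeff': 'BOM',
        '\u00a0': 'no-break space',
        '\u200b': 'zero-width space',
        '\r': 'carriage return',
    }
    return [(i, suspects[ch]) for i, ch in enumerate(text) if ch in suspects]
```

Running this over the decoded SRT text would show, for instance, a BOM hiding at position 0 that makes the first cue number unrecognizable.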
DFXP reader assigns different positioning to text and break nodes when they're defined both for a region and a style. This leads to WebVTT output being inconsistent. When there are no line breaks, the caption is aligned according to the style, which is the expected behavior. When there are line breaks, the caption is aligned according to the region setting.
The DFXP reader must clearly be fixed. The WebVTT writer could also be modified though, because even if the CaptionSet is incorrect, the behavior should be consistent.
This sample (notice the BOBY style, and the reference to it), when converted to DFXP with the writers in the extras module, will lose the styling information:
<tt xml:lang="en-us"
xmlns="http://www.w3.org/ns/ttml"
xmlns:tts='http://www.w3.org/ns/ttml#styling'
>
<head>
<layout>
<region xml:id="r0" tts:textAlign="center" tts:displayAlign="after" tts:origin="5% 5%" tts:extent="90% 90%"/>
</layout>
<styling>
<style tts:color="#ffeedd" tts:fontFamily="Arial" tts:fontSize="10pt" tts:textAlign="center" xml:id="BOBY"/>
</styling>
</head>
<body>
<div>
<p region="r0" begin="00:00:01.000" end="00:00:03.000" style="BOBY">
When we think
</p>
</div>
</body>
</tt>
As demonstrated by PR #52, list members in the __all__ module attribute should be strings, not unicode objects.
In addition to issue #71, apparently DFXP can also be cut out vertically. If for example the vertical origin is shifted down (tts:origin="0% 25%"), the vertical alignment is set to bottom (tts:displayAlign="after") and the extent is not specified (and therefore set to its default of 100%), the caption will simply not appear (the text will be positioned vertically at 125% and therefore out of screen). At least that's what happens in the IE implementation, but it is a valid interpretation according to the DFXP specs:
“The rectangular area of a region is explicitly not constrained to be contained within the Root Container Region. In particular, the origin components of a region may be negative, and the extent (width and height) components of a region may be greater than the width and height of the Root Container Region. Whether a presentation processor clips such a region to the Root Container Region is implementation dependent, and not prescribed by this specification.”
When a Layout object attached to a caption/node/etc contains a Padding object with some None values, for example:
<Padding (before: None, after: None, start: "29pt", end: "29pt")>
The DFXP Writer raises a ValueError with the message "The attribute order specified is invalid". CaptionSets with such Layout objects are generated by the SAMIReader. Conversions from SAMI to DFXP, therefore, sometimes fail.
The BaseWriter initializer should take a boolean parameter indicating whether the output should include positioning information or not.
SAMI spans with the CSS text-align property are converted to a DFXP span with the tts:textAlign property. This property, however, only applies to <p> tags in DFXP according to the documentation.
– http://www.w3.org/TR/ttaf1-dfxp/#style-attribute-textAlign
According to the WebVTT specification, "A WebVTT cue text span consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003C LESS-THAN SIGN characters (<)."
Am I incorrect in understanding that unless the span in question is one of a very limited set of elements (class, italics, bold, underline, ruby, voice, language, or timestamp¹), characters such as < and > should be escaped to &lt; and &gt;, respectively? The WebVTT sample used in testing currently does not have these escaped, and the JavaScript WebVTT parser throws an error when the < character is used like this.
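The required escaping is straightforward to apply when serializing cue text. A minimal sketch (ampersand first, so the other entities aren't doubly escaped; escaping > is not strictly required by the spec but is harmless and commonly done):

```python
def escape_cue_text(text):
    """Escape characters that may not appear literally in WebVTT cue text.
    '&' must be replaced first to avoid double-escaping the other entities.
    """
    return (text.replace('&', '&amp;')
                .replace('<', '&lt;')
                .replace('>', '&gt;'))
```

Text like "a < b" then serializes as "a &lt; b", which the JavaScript parser accepts.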
This is what the traceback looks like:
>>> from pycaption import SAMIReader
>>> s = open("example.sami").read()
>>> SAMIReader().read(s)
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == b'\xef\xbb\xbf':
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\x00\x00\xfe\xff':
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\xff\xfe\x00\x00':
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/shakkhar/subtitles/pycaption/pycaption/sami.py", line 36, in read
sami_soup = BeautifulSoup(content)
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/__init__.py", line 193, in __init__
self.builder.prepare_markup(markup, from_encoding)):
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/builder/_lxml.py", line 99, in prepare_markup
for encoding in detector.encodings:
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py", line 256, in encodings
self.chardet_encoding = chardet_dammit(self.markup)
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py", line 31, in chardet_dammit
return chardet.detect(s)['encoding']
File "/usr/local/shakkhar/lib/python2.7/site-packages/chardet/__init__.py", line 25, in detect
raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object
BeautifulSoup uses chardet to detect encoding. chardet requires that the input data be a bytes object.
This issue is probably library version / python version / platform specific. Here are the details of my setup:
CentOS release 6.4 (Final)
Python 2.7.5
beautifulsoup4==4.3.2
lxml==3.2.3
chardet==2.2.1
The test file can be obtained from here.
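Given the root cause (chardet expects bytes, but the reader is handed already-decoded text), one workaround is to re-encode before detection runs, or to declare the encoding so detection is skipped entirely. A sketch in Python 3 syntax for brevity; the original issue is on Python 2, where the check would be isinstance(content, unicode):

```python
def markup_for_soup(content, encoding='utf-8'):
    """chardet (called via BeautifulSoup's UnicodeDammit) expects bytes.
    Re-encode already-decoded text before handing it to the parser;
    pass bytes through untouched. Workaround sketch only.
    """
    if isinstance(content, str):          # already decoded text
        return content.encode(encoding)
    return content                        # already bytes: pass through
```

Alternatively, BeautifulSoup can be told the encoding up front (its from_encoding parameter), which avoids the chardet path altogether.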
Alignment is being kept (most likely via STYLE nodes).
The custom writer should ignore any positioning information from the input file and just output the default specified.
Later versions of the WebVTT specification accept an align parameter when defining the "position" cue setting. For example, in order to have a cue box that stretches from the middle to the far right of the screen, instead of having to calculate the position as a function of the computed align like this:
00:09.209 --> 00:12.312 position:75% size:50%
(computed align:middle, position calculated relative to the middle of the cue box)
You can override the reference relative to which the position will be calculated and write:
00:09.209 --> 00:12.312 position:50%,start size:50%
(the computed align is still middle, but the position is calculated relative to the left edge of the cue box because of the ,start parameter)
However, as of this date, only Firefox supports this while Chrome, Safari and all Apple devices don't. Since this is sort of a mere "shortcut" and doesn't add any positioning that cannot be expressed alternatively using the old syntax, it seems to be a reasonable solution to simply write WebVTT files in the old format instead of the new one for the time being.
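Writing the old syntax only requires shifting the position by half the cue size, depending on the anchor. A sketch of the arithmetic (percentages as plain floats; not pycaption's code):

```python
def to_legacy_position(position, size, anchor='middle'):
    """Translate the newer "position:P%,anchor" cue setting into the old
    syntax, where the position percentage always marks the centre of the
    cue box (computed align "middle"). Returns the equivalent old-style
    position percentage.
    """
    if anchor == 'start':   # P marked the left edge of the cue box
        return position + size / 2.0
    if anchor == 'end':     # P marked the right edge of the cue box
        return position - size / 2.0
    return position         # 'middle': P already marks the centre
```

For the example above, to_legacy_position(50, 50, 'start') gives 75.0, i.e. position:50%,start size:50% is equivalent to position:75% size:50%.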
I ran python -m unittest test_srt_conversion and got this.
Maybe it's specific to my BeautifulSoup / lxml / libxml versions? I have
beautifulsoup4==4.3.2
lxml==2.3.2
libxml2 version: 2.7.8.dfsg-5.1ubuntu4.6
Following is the sample WebVTT file I have:
WEBVTT
Id 72
11:01:15.200 --> 11:01:16.201
Some Subtitle - 1
NOTE
{
"message": "some note"
}
Mark 72
12:01:15.200 --> 12:01:16.201
NOTE
{
"message": "some note"
}
It is actually valid if you check here: https://quuz.org/webvtt/
However, I am getting an error:
CaptionReadSyntaxError: CaptionReadSyntaxError(('Cue without content. (line 12)',))
This sami file (notice the style on the <P> tag):
<SAMI>
<Head>
<Style Type="text/css">
<!-- P {margin-left: 10pt; margin-right: 10pt; margin-top: 1pt; margin-bottom: 1pt; font-family: Arial;font-size: 10pt; text-align: Center; font-weight: Normal; background-color: 000000;}.ENUSCC {Name: English; lang: en-US;}-->
</Style>
</Head>
<Body>
<SYNC Start=366>
<P Class=ENUSCC>
<SYNC Start=3833>
<P Class=ENUSCC style="text-align: right"> overridden? AND TODAY, WE'RE GOING TO DIG INTO RESTORING KITCHEN CABINETS
Generates this output DFXP (notice there's nothing specifying alignment on the right):
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head>
<styling>
<style tts:fontFamily="Arial" tts:fontSize="10pt" tts:textAlign="Center" xml:id="p"/>
</styling>
<layout>
<region tts:padding="1pt 10pt 1pt 10pt" xml:id="r0"/>
<region tts:padding="0.14% 1.04% 0.14% 1.04%" xml:id="r1"/>
</layout>
</head>
<body>
<div region="r0" xml:lang="en-US">
<p begin="00:00:03.833" end="00:00:07.833" region="r1" style="p">
overridden? AND TODAY, WE'RE GOING TO DIG INTO RESTORING KITCHEN CABINETS
</p>
</div>
</body>
</tt>
And also this VTT file:
WEBVTT
00:03.833 --> 00:07.833
overridden? AND TODAY, WE'RE GOING TO DIG INTO RESTORING KITCHEN CABINETS
So the alignment to the right is lost.
The fractions of seconds are incorrectly converted. As an example, timings such as '01:02:03.9' (representing hour:minute:second.fraction) are converted to '01:02:03.009'.
In this case fractions are divided by 100, but if the fraction were 2 decimals long (say '.84'), the resulting fraction would be '.084'.
The timing is only calculated properly for fractions with exactly 3 decimals.
And of course, if we have more than 3 decimals specified, the second counter might be affected. For a large enough number of decimals, we could even get the time specification in seconds to go above 60.
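The fix is to treat the fractional part positionally rather than as an integer count: right-pad to the target precision, then truncate any excess. A sketch of that normalization (not pycaption's code), using microseconds as the target unit:

```python
def fraction_to_microseconds(digits):
    """Interpret the fractional part of a timestamp correctly no matter how
    many digits it has: '9' means .9s (900000 µs), '84' means .84s
    (840000 µs); digits beyond microsecond precision are dropped.
    """
    padded = (digits + '000000')[:6]   # right-pad to 6 digits, then truncate
    return int(padded)
```

This handles 1-, 2-, and 3-digit fractions uniformly and prevents long fractions from spilling over into the seconds counter.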
Pycaption released its latest version, 0.5.5, which handles &apos; in DFXP files, but unfortunately this is not handled by the SRTReader, I guess.
Right now the pycaption WebVTT parser passes texts like this:
WEBVTT
1
00:00.000 --> 00:02.000
cue text
id1 id2
00:04.000 --> 00:05.000
Transcribed by Celestials™
According to the WebVTT spec, line 7 shall be interpreted as a cue identifier and must be followed by a cue. Since line 8 is empty, the above snippet is not valid WebVTT.
Lines such as line 7 above shall only pass when prefixed by the text NOTE, as follows:
WEBVTT
1
00:00.000 --> 00:02.000
cue text
NOTE id1 id2
00:04.000 --> 00:05.000
Transcribed by Celestials™
In this case, line 7 shall be interpreted as a comment and ignored.
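The spec's distinction between the two snippets above can be sketched as a small classifier over blank-line-delimited blocks. This is a simplified illustration of the rules, not pycaption's parser:

```python
def classify_block(block):
    """Classify a blank-line-delimited WebVTT block as a NOTE comment, a
    cue (optional identifier line followed by a timing line), or invalid,
    e.g. a bare identifier with no cue attached to it.
    """
    lines = block.strip().split('\n')
    if lines[0].startswith('NOTE'):
        return 'comment'
    timing_index = 0 if '-->' in lines[0] else 1
    if timing_index < len(lines) and '-->' in lines[timing_index]:
        return 'cue'
    return 'invalid'
```

"id1 id2" alone classifies as invalid, while "NOTE id1 id2" is a comment to be ignored, matching the behaviour the spec requires.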
Example:
from pycaption import detect_format, SRTWriter
data = '''1
00:00:00,333 --> 00:00:01,300
2
00:00:01,300 --> 00:00:02,400
DID YOU HAVE A GOOD SUMMER?
3
00:00:02,400 --> 00:00:03,833
I HAD A GREAT SUMMER, ACTUALLY.
'''
reader = detect_format(data)
new_data = SRTWriter().write(reader().read(data))
Crashes:
pycaption.exceptions.CaptionReadNoCaptions: CaptionReadNoCaptions(('empty caption file',))
Removing the extra line, or adding dialog to the blank line works as expected.
Also placing the blank line in the middle of the file will terminate the file early (without an exception)
If this is something you feel should be fixed, let me know and I can look into creating a pull request for this issue.
According to the spec, an empty cue is NOT a syntax error: http://dev.w3.org/html5/webvtt/#dfn-webvtt-cue-text. Emphasis mine.
4.3.2 WebVTT cue text
WebVTT cue text is cue payload that consists of zero or more WebVTT cue components, in any order, each optionally separated from the next by a WebVTT line terminator.
We should get rid of this exception. Also check out the WebVTT validator.
I don't see where it's used in the code. Is this leftover cruft?
This sample SCC
Scenarist_SCC V1.0
00:21:29;23 9420 9452 6161 94f4 97a2 6262 942c 942f
Should produce this WebVTT file:
WEBVTT
... positioning and timing spec...
aa
bb
Instead, the last line (bb) is missing.
Is there a way to only extract the text from caption?
Example:
caps = u'''1
00:00:01,500 --> 00:00:12,345
Small caption'''
reader = SRTReader()
reader.some_method(caps)  # Small caption
This section of SCC, when converted to DFXP, produces <span> tags that close in another <p> tag: invalid XML, basically.
Scenarist_SCC V1.0
00:01:28;09 9420 942f 94ae 9420 9452 97a2 f468 e520 73e5 e3f2 e5f4 7320 efe6 20d0 e5f4 f261 9470 9723 61f2 e520 6162 ef75 f420 f4ef 2062 e520 f2e5 76e5 61ec e564 ae80
00:01:31;10 9420 942f 94ae
00:01:31;18 9420 9454 d570 206e eff7 20ef 6e80 9458 97a1 91ae ce4f d6c1 2c80 9470 97a1 20a2 d0e5 f4f2 61ba 204c ef73 f420 43e9 f479 20ef e620 d3f4 ef6e e5ae a280
00:01:35;18 9420 942f 94ae
00:01:40;25 942c
Presently, creating a text node with the text "foo" goes like this:
node = CaptionNode(CaptionNode.TEXT)
node.content = u'foo'
It should be done like this instead:
node = CaptionNode(CaptionNode.TEXT, u'foo')
Support for Windows?
Following is a webvtt file I have:
Mark 72
11:01:15.200 --> 11:01:16.201
Some Subtitle - 1
NOTE
{
"message": "some note"
}
Id 72
12:01:15.200 --> 12:01:16.201
Some Subtitle - 2
NOTE
{
"message": "some note"
}
And it is valid.
Now cue 1 has the identifier Mark 72 and also a comment. How do I parse/read it?
Some SCC files don't explicitly end the last caption with an [EDM] (erase displayed memory) command (942c), but only with an [EOC] (end of caption: display the caption on the screen) command (942f).
If this is the last line, it means we don't know when to set the end time of the last caption. As a result, it gets set to 00:00:00.
We can add a default of 4 seconds so as not to cause any weird timing problems. We already do this for SAMI, because this problem is intrinsic to SAMI (no end time is specified there).
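The proposed behaviour is a small post-processing step. The sketch below models captions as dicts with 'start'/'end' in microseconds and a missing end time as end == 0, per the description above; it is an illustration of the proposal, not pycaption's implementation:

```python
DEFAULT_DURATION_US = 4 * 1000 * 1000   # proposed 4-second default, as for SAMI

def close_open_ended_caption(captions):
    """Give the last caption a default 4-second duration when no [EDM]
    command ever set its end time (modelled here as end == 0).
    Mutates and returns the caption list.
    """
    if captions and captions[-1]['end'] == 0:
        captions[-1]['end'] = captions[-1]['start'] + DEFAULT_DURATION_US
    return captions
```

A last caption starting at 00:00:01 with no [EDM] would then end at 00:00:05 instead of 00:00:00.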
All caption formats seem to have 3 tests that do exactly the same thing, with the difference that the samples they use are in some cases bytestrings (utf-8 encoded) and in others unicode strings. In any case the content is the same and, most importantly, the input that is finally sent to the reader is already converted to unicode. This seems pretty redundant, and the tests can probably be safely removed.
https://github.com/pbs/pycaption/blob/master/pycaption/scc.py#L755
This is incompatible with python lower than 2.7. Is that intentional?
I've searched the code for lxml but couldn't find any result. Do you use it?
lxml is a strong requirement, as it's not pure Python, and even if it makes parsing faster with BeautifulSoup I think it should be an optional (extras) requirement.
Using test.py with the code below. I'm trying to get the transcript (what I want to use the script for). Am I missing something?
This is running on Windows with Python 2.7.
If I leave out the transcript it works fine.
from pycaption import SRTReader, SAMIWriter, DFXPWriter, transcript
srt_caps = '''1
00:00:09,209 --> 00:00:12,312
This is an example SRT file,
which, while extremely short,
is still a valid SRT file.
'''
converter = CaptionConverter()
converter.read(srt_caps, SRTReader())
print converter.write(SAMIWriter())
print converter.write(DFXPWriter())
print converter.write(pycaption.transcript.TranscriptWriter())
I keep getting the error
Traceback (most recent call last):
File "C:\Python27\pycaption-master\test.py", line 3, in
from pycaption import SRTReader, SAMIWriter, DFXPWriter, transcript
File "C:\Python27\pycaption-master\pycaption\transcript.py", line 7, in
from pycaption import BaseWriter, CaptionNode
ImportError: cannot import name BaseWriter
I think CaptionNode will fail as well.
Are they missing from the build? Is there another requirement?
Here is the build and error.
C:\Python27\pycaption-master>C:\Python27\pycaption-master\setup.py install
running install
running bdist_egg
running egg_info
creating pycaption.egg-info
writing requirements to pycaption.egg-info\requires.txt
writing pycaption.egg-info\PKG-INFO
writing top-level names to pycaption.egg-info\top_level.txt
writing dependency_links to pycaption.egg-info\dependency_links.txt
writing manifest file 'pycaption.egg-info\SOURCES.txt'
reading manifest file 'pycaption.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'pycaption.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_py
creating build
creating build\lib
creating build\lib\pycaption
copying pycaption\base.py -> build\lib\pycaption
copying pycaption\dfxp.py -> build\lib\pycaption
copying pycaption\exceptions.py -> build\lib\pycaption
copying pycaption\sami.py -> build\lib\pycaption
copying pycaption\scc.py -> build\lib\pycaption
copying pycaption\srt.py -> build\lib\pycaption
copying pycaption\transcript.py -> build\lib\pycaption
copying pycaption\webvtt.py -> build\lib\pycaption
copying pycaption\__init__.py -> build\lib\pycaption
creating build\lib\tests
copying tests\mixins.py -> build\lib\tests
copying tests\samples.py -> build\lib\tests
copying tests\test_dfxp.py -> build\lib\tests
copying tests\test_dfxp_conversion.py -> build\lib\tests
copying tests\test_sami.py -> build\lib\tests
copying tests\test_sami_conversion.py -> build\lib\tests
copying tests\test_scc.py -> build\lib\tests
copying tests\test_scc_conversion.py -> build\lib\tests
copying tests\test_srt.py -> build\lib\tests
copying tests\test_srt_conversion.py -> build\lib\tests
copying tests\test_webvtt.py -> build\lib\tests
copying tests\test_webvtt_conversion.py -> build\lib\tests
copying tests\__init__.py -> build\lib\tests
copying pycaption\english.pickle -> build\lib\pycaption
creating build\bdist.win32
creating build\bdist.win32\egg
creating build\bdist.win32\egg\pycaption
copying build\lib\pycaption\base.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\dfxp.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\english.pickle -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\exceptions.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\sami.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\scc.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\srt.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\transcript.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\webvtt.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\__init__.py -> build\bdist.win32\egg\pycaption
creating build\bdist.win32\egg\tests
copying build\lib\tests\mixins.py -> build\bdist.win32\egg\tests
copying build\lib\tests\samples.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_dfxp.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_dfxp_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_sami.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_sami_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_scc.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_scc_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_srt.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_srt_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_webvtt.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_webvtt_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\__init__.py -> build\bdist.win32\egg\tests
byte-compiling build\bdist.win32\egg\pycaption\base.py to base.pyc
byte-compiling build\bdist.win32\egg\pycaption\dfxp.py to dfxp.pyc
byte-compiling build\bdist.win32\egg\pycaption\exceptions.py to exceptions.pyc
byte-compiling build\bdist.win32\egg\pycaption\sami.py to sami.pyc
byte-compiling build\bdist.win32\egg\pycaption\scc.py to scc.pyc
byte-compiling build\bdist.win32\egg\pycaption\srt.py to srt.pyc
byte-compiling build\bdist.win32\egg\pycaption\transcript.py to transcript.pyc
byte-compiling build\bdist.win32\egg\pycaption\webvtt.py to webvtt.pyc
byte-compiling build\bdist.win32\egg\pycaption\__init__.py to __init__.pyc
byte-compiling build\bdist.win32\egg\tests\mixins.py to mixins.pyc
byte-compiling build\bdist.win32\egg\tests\samples.py to samples.pyc
byte-compiling build\bdist.win32\egg\tests\test_dfxp.py to test_dfxp.pyc
byte-compiling build\bdist.win32\egg\tests\test_dfxp_conversion.py to test_dfxp_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_sami.py to test_sami.pyc
byte-compiling build\bdist.win32\egg\tests\test_sami_conversion.py to test_sami_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_scc.py to test_scc.pyc
byte-compiling build\bdist.win32\egg\tests\test_scc_conversion.py to test_scc_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_srt.py to test_srt.pyc
byte-compiling build\bdist.win32\egg\tests\test_srt_conversion.py to test_srt_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_webvtt.py to test_webvtt.pyc
byte-compiling build\bdist.win32\egg\tests\test_webvtt_conversion.py to test_webvtt_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\__init__.py to __init__.pyc
creating build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\PKG-INFO -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\SOURCES.txt -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\dependency_links.txt -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\requires.txt -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\top_level.txt -> build\bdist.win32\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
pycaption.transcript: module references __file__
creating dist
creating 'dist\pycaption-0.3.6-py2.7.egg' and adding 'build\bdist.win32\egg' to it
removing 'build\bdist.win32\egg' (and everything under it)
Processing pycaption-0.3.6-py2.7.egg
removing 'c:\python27\lib\site-packages\pycaption-0.3.6-py2.7.egg' (and everything under it)
creating c:\python27\lib\site-packages\pycaption-0.3.6-py2.7.egg
Extracting pycaption-0.3.6-py2.7.egg to c:\python27\lib\site-packages
pycaption 0.3.6 is already the active version in easy-install.pth
Installed c:\python27\lib\site-packages\pycaption-0.3.6-py2.7.egg
Processing dependencies for pycaption==0.3.6
Searching for cssutils==1.0
Best match: cssutils 1.0
Processing cssutils-1.0-py2.7.egg
cssutils 1.0 is already the active version in easy-install.pth
Installing csscombine-script.py script to C:\Python27\Scripts
Installing csscombine.exe script to C:\Python27\Scripts
Installing csscombine.exe.manifest script to C:\Python27\Scripts
Installing cssparse-script.py script to C:\Python27\Scripts
Installing cssparse.exe script to C:\Python27\Scripts
Installing cssparse.exe.manifest script to C:\Python27\Scripts
Installing csscapture-script.py script to C:\Python27\Scripts
Installing csscapture.exe script to C:\Python27\Scripts
Installing csscapture.exe.manifest script to C:\Python27\Scripts
Using c:\python27\lib\site-packages\cssutils-1.0-py2.7.egg
Searching for lxml==3.3.5
Best match: lxml 3.3.5
Processing lxml-3.3.5-py2.7-win32.egg
lxml 3.3.5 is already the active version in easy-install.pth
Using c:\python27\lib\site-packages\lxml-3.3.5-py2.7-win32.egg
Searching for beautifulsoup4==4.3.2
Best match: beautifulsoup4 4.3.2
Processing beautifulsoup4-4.3.2-py2.7.egg
beautifulsoup4 4.3.2 is already the active version in easy-install.pth
Using c:\python27\lib\site-packages\beautifulsoup4-4.3.2-py2.7.egg
Finished processing dependencies for pycaption==0.3.6
C:\Python27\pycaption-master>test.py
Traceback (most recent call last):
File "C:\Python27\pycaption-master\test.py", line 3, in
from pycaption import SRTReader, SAMIWriter, DFXPWriter, transcript
File "C:\Python27\pycaption-master\pycaption\transcript.py", line 7, in
from pycaption import BaseWriter, CaptionNode
ImportError: cannot import name BaseWriter
Would be great to have Python 3.3 and 3.4 support. There are not many changes, so I think that's doable.
The SCC Reader should specify default values for the alignment of the text.
This is needed because the writers don't have the proper knowledge of where to place the text (the defaults for the other formats might differ)
test_srt_to_dfxp_conversion and test_webvtt_to_dfxp_conversion fail with AssertionError. The expected value contains <span tts:textalign="right">we have this vision of Einstein</span>, although DFXPTestingMixIn should be removing spans.
<p begin="00:00:09:20" end="00:00:12:7" region="b1">
will convert to start = 90000000 and end = 120000000.
Technically it should be displayed in the 9th second at frame 20, but the approximate conversion from frame number to milliseconds is not working, and therefore the cue will be displayed more than half a second too early.
In the pull request I repaired the approximate frame-number-to-milliseconds conversion. The assumption of 30 fps is still in there; not correct, but reasonably precise.
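The repaired conversion amounts to dividing the frame count by the frame rate instead of reading it as a decimal fraction. A sketch of that arithmetic under the same 30 fps assumption (not the pull request's exact code):

```python
def frames_timestamp_to_microseconds(timestamp, fps=30.0):
    """Convert an 'HH:MM:SS:FF' timestamp to microseconds, converting the
    trailing frame count at the given frame rate. At 30 fps,
    '00:00:09:20' becomes 9s + 20/30s, not 9.2s.
    """
    hours, minutes, seconds, frames = (int(p) for p in timestamp.split(':'))
    total = hours * 3600 + minutes * 60 + seconds + frames / fps
    return int(total * 1000 * 1000)
```

With this, begin="00:00:09:20" lands at roughly 9.667 seconds instead of firing over half a second early.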
I tried to upgrade to the latest pycaption on my machine:
pip install pycaption --upgrade
Requirement already up-to-date: pycaption in /usr/local/lib/python2.7/site-packages/pycaption-0.5.4-py2.7.egg
Requirement already up-to-date: beautifulsoup4>=4.2.1 in /usr/local/lib/python2.7/site-packages/beautifulsoup4-4.4.1-py2.7.egg (from pycaption)
Collecting lxml>=3.2.3 (from pycaption)
Downloading lxml-3.5.0.tar.gz (3.8MB)
100% |████████████████████████████████| 3.8MB 132kB/s
Requirement already up-to-date: cssutils>=0.9.10 in /usr/local/lib/python2.7/site-packages/cssutils-1.0.1-py2.7.egg (from pycaption)
Installing collected packages: lxml
Found existing installation: lxml 3.5.0b1
Uninstalling lxml-3.5.0b1:
Successfully uninstalled lxml-3.5.0b1
Running setup.py install for lxml
Successfully installed lxml-3.5.0
You are using pip version 7.1.2, however version 8.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
It does upgrade lxml, but it did not upgrade the pycaption package itself. See above: it says "Requirement already up-to-date: pycaption in /usr/local/lib/python2.7/site-packages/pycaption-0.5.4-py2.7.egg".
We require your latest change, the workaround for the lack of &apos; support in html.parser (PR #124).
Kindly look into this issue and let us know if you could get around to packaging the latest pycaption changes for pip ASAP. Our internal automated system that uses pycaption requires the latest change. It would be of great help. Thanks in advance!
I wanted a dictionary similar to SUPPORTED_READERS to make it simple for a conversion tool to look up the writer class from a text name (e.g. --output-format=webvtt). This is obviously pretty easy to handle, but it means I have a hard-coded list, which would be nice to avoid:
https://gist.github.com/acdha/c9fd54d4dee67801a09b#file-convert-subtitles-py-L24-L30
On a DFXP to DFXP conversion, the text I'm is converted to I&apos;m instead of I'm as expected.
https://github.com/pbs/pycaption/blob/master/pycaption/sami.py#L415
Why is the code here just assuming that data will be UTF-8 encoded? This creates trouble with unicode strings.
Here is a code sample:
with open("unicode_problem_sample.sami") as f:
s = f.read()
sp = unicode(s, "utf-8")
pcc = SAMIReader().read(sp)
Of course this is an artificial example, but not all my data comes as UTF-8. I decode it and pass it to pycaption as unicode strings.
Here is a sample file:
https://dl.dropboxusercontent.com/u/32117554/unicode_problem_sample.sami
Some input files, though valid according to the specification, may cause captions to be cut out of the screen range in some players. This happens both for DFXP (IE11) and WebVTT files (Firefox), and potentially also SCC. For WebVTT it happens when "position" is specified but not "size". For DFXP it happens when "origin" is specified but not "extent". When one such file is converted, the output (DFXP and WebVTT) preserves the same problem.
This example produces captions that start and end at the same moment:
Scenarist_SCC V1.0
00:01:31;18 9420 9454 6162 9758 97a1 91ae 6261 9170 97a1 e362
00:01:35;18 9420 942f 94ae
00:01:40;25 942c
This is the output (string representation of the captions created):
u'00:01:35.666 --> 00:01:35.666\nab'
u'00:01:35.666 --> 00:01:35.666\nba'
u'00:01:35.666 --> 00:01:40.866\ncb'
ab = 6162
ba = 6261
cb = e362
There should indeed be 3 captions created, but they should all be displayed on screen at the same time (most likely starting at 00:01:35.666 and ending at 00:01:40.866).