pbs / pycaption
Python module to read/write popular video caption formats
License: Apache License 2.0
15 out of 71 tests are failing here. The reason seems to be related to the way unicode is handled. Here is the log: https://storage.googleapis.com/vimeo-dev-dra-us/texteas/pycaption_test_fail
The spec for DFXP allows times to be expressed as offset time in addition to clock time. The DFXP converter errors on input files that use offset time.
Example body fragment from Spec http://www.w3.org/TR/ttml1/#ttml-example-body
<p xml:id="subtitle3" begin="10.0s" end="16.0s" style="s2">
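The two time forms can be handled by a single parsing routine. Below is a minimal sketch (not pycaption's implementation) that accepts both clock-time expressions like the fragment above and offset-time expressions such as "10.0s"; the metric scales follow the TTML spec, with a 30 fps assumption for frame-based values:

```python
import re

def parse_dfxp_time(expr, frame_rate=30.0):
    """Parse a DFXP/TTML time expression into seconds.

    Handles clock time ("HH:MM:SS.fraction" or "HH:MM:SS:FF") as well as
    offset time ("10.0s", "1500ms", "2m", "1.5h", "45f").
    Illustrative sketch only, not pycaption's implementation.
    """
    offset = re.match(r'^([0-9.]+)(h|ms|m|s|f)$', expr)
    if offset:
        value, metric = float(offset.group(1)), offset.group(2)
        scale = {'h': 3600.0, 'm': 60.0, 's': 1.0,
                 'ms': 0.001, 'f': 1.0 / frame_rate}[metric]
        return value * scale
    hours, minutes, rest = expr.split(':', 2)
    if ':' in rest:                       # "SS:FF" frames form
        seconds, frames = rest.split(':')
        secs = int(seconds) + int(frames) / frame_rate
    else:
        secs = float(rest)
    return int(hours) * 3600 + int(minutes) * 60 + secs
```

With this, begin="10.0s" and begin="00:00:10.000" resolve to the same instant.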
The WebVTT reader strips all WebVTT inline tags (like <i>, <b>, etc.).
Hi,
I'm using the pycaption library in a quite outdated and rigid environment. In short, it's a plugin for a large application; Python 2.7, and the parent application has BeautifulSoup4==4.1.3 pinned and installed.
Is it possible to relax pycaption's requirement from beautifulsoup4<4.5.0,>=4.2.1 to beautifulsoup4<4.5.0,>=4.1.3?
Of course, I'm talking about "0.x" branch.
Other suggestions are much appreciated.
Best regards,
Alexander
When a DFXP file has multiple captions with the exact same timing (same start and same end), some players don't display any captions at all. It should be possible to generate output that merges concurrent captions.
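The merge described above can be sketched in a few lines. Captions are modelled here as plain (start, end, text) tuples in playback order; this is a standalone illustration of the idea, not pycaption code:

```python
from itertools import groupby

def merge_concurrent(captions):
    """Collapse captions that share the exact same (start, end) timing into
    a single caption whose lines are joined with newlines.
    `captions` is a list of (start, end, text) tuples in playback order.
    """
    merged = []
    for timing, group in groupby(captions, key=lambda c: (c[0], c[1])):
        start, end = timing
        merged.append((start, end, '\n'.join(c[2] for c in group)))
    return merged
```

Two captions sharing 0–1s collapse into one two-line caption, so players that choke on concurrent cues get a single cue instead.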
This is more of a question than an issue. I've been trying to write one SRT per language that exists inside of a DFXP. So if the DFXP has 5 languages, I'd like to end up with 5 different SRTs. With the get_languages and get_captions methods it seemed like it would be doable but get_captions is returning a CaptionList that doesn't translate to a CaptionSet. Or at least I can't seem to figure it out. And it seems that a writer only accepts a CaptionSet.
Any thoughts? Sorry if this isn't the right place to post this.
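The shape of the fix is to rebuild a one-language caption set around each per-language caption list before handing it to a writer. The sketch below models the caption set as a plain dict mapping language code to a caption list; the assumption (not confirmed by the source) is that pycaption's CaptionSet wraps the same mapping, so each single-language set could then go to SRTWriter:

```python
def split_languages(caption_set):
    """Split a multi-language caption collection into one single-language
    collection per language. Modelled as a dict {lang: [captions]};
    a standalone sketch, not pycaption API.
    """
    return {lang: {lang: captions} for lang, captions in caption_set.items()}
```

A DFXP with 5 languages would yield 5 one-language sets, each written to its own SRT.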
Although there's no comprehensive SAMI specification, SAMI files in the wild seem to apply the text-align attribute not to a cue's <p> tag but to any of a number of possible tags that may descend from <p> (e.g. span, div, etc.). According to usual HTML/CSS rules this makes no sense, but since most files seemed to use the text-align attribute applied to a single <span> within a <p>, we decided to determine a caption's alignment based on the first text-align value found on any child element of the <p>.
After implementing this solution, however, it turns out that although the caption's positioning is being preserved on DFXP, sometimes it is not preserved in WebVTT output.
I am currently doing a large-scale conversion of .srt to .vtt files. I have been successfully using pycaption 1.0.0 for months, and all of a sudden today one .srt file is just not working.
The code that has worked up to this point looks like this:
with open(tmp_srt_file, 'rU') as srt_file:
    converter = CaptionConverter()
    converter.read(srt_file.read().decode('utf-8'), SRTReader())
    vtt = converter.write(WebVTTWriter())
    return vtt
It is failing on the converter.read() (NOT the decode to utf-8) with CaptionReadNoCaptions(('empty caption file',)).
If I print srt_file.read().decode('utf-8'), it looks like it should look.
Unfortunately, I can't share this specific file as I am under NDA with a client. However, I can comment that as far as I can tell, there are no special characters. It looks like any other .srt file I have worked with. I have certainly seen weirder .srt files that worked.
Is there something I should be looking for, or is this potentially a real issue?
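When a file "looks like it should look" when printed but still fails to parse, the usual culprits are characters that are invisible on screen: a BOM, no-break spaces, zero-width spaces, or stray carriage returns. A small debugging aid (not part of pycaption) can surface them:

```python
def diagnose_invisibles(text):
    """List characters that commonly break caption parsing yet look fine
    when printed: BOMs, no-break spaces, zero-width spaces, and stray
    carriage returns. Returns (index, description) pairs.
    """
    suspects = {
        '\ufeff': 'BOM',
        '\u00a0': 'no-break space',
        '\u200b': 'zero-width space',
        '\r': 'carriage return',
    }
    return [(i, suspects[ch]) for i, ch in enumerate(text) if ch in suspects]
```

Running this over the decoded SRT text would show, for instance, a BOM hiding at position 0 that makes the first cue number unrecognizable.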
DFXP reader assigns different positioning to text and break nodes when they're defined both for a region and a style. This leads to WebVTT output being inconsistent. When there are no line breaks, the caption is aligned according to the style, which is the expected behavior. When there are line breaks, the caption is aligned according to the region setting.
The DFXP reader must clearly be fixed. The WebVTT writer could also be modified though, because even if the CaptionSet is incorrect, the behavior should be consistent.
This sample (notice the BOBY style, and the reference to it), when converted to DFXP with the writers in the extras module, will lose the styling information:
<tt xml:lang="en-us"
xmlns="http://www.w3.org/ns/ttml"
xmlns:tts='http://www.w3.org/ns/ttml#styling'
>
<head>
<layout>
<region xml:id="r0" tts:textAlign="center" tts:displayAlign="after" tts:origin="5% 5%" tts:extent="90% 90%"/>
</layout>
<styling>
<style tts:color="#ffeedd" tts:fontFamily="Arial" tts:fontSize="10pt" tts:textAlign="center" xml:id="BOBY"/>
</styling>
</head>
<body>
<div>
<p region="r0" begin="00:00:01.000" end="00:00:03.000" style="BOBY">
When we think
</p>
</div>
</body>
</tt>
As demonstrated by PR #52, list members in the __all__ module attribute should be strings, not unicode objects.
In addition to issue #71, apparently DFXP can also be cut out vertically. If for example the vertical origin is shifted down (tts:origin="0% 25%"), the vertical alignment is set to bottom (tts:displayAlign="after") and the extent is not specified (and therefore set to its default of 100%), the caption will simply not appear (the text will be positioned vertically at 125% and therefore out of screen). At least that's what happens in the IE implementation, but it is a valid interpretation according to the DFXP specs:
“The rectangular area of a region is explicitly not constrained to be contained within the Root Container Region. In particular, the origin components of a region may be negative, and the extent (width and height) components of a region may be greater than the width and height of the Root Container Region. Whether a presentation processor clips such a region to the Root Container Region is implementation dependent, and not prescribed by this specification.”
When a Layout object attached to a caption/node/etc contains a Padding object with some None values, for example:
<Padding (before: None, after: None, start: "29pt", end: "29pt")>
The DFXP Writer raises a ValueError with the message "The attribute order specified is invalid". CaptionSets with such Layout objects are generated by the SAMIReader. Conversions from SAMI to DFXP, therefore, sometimes fail.
The BaseWriter initializer should take a boolean parameter indicating whether the output should include positioning information or not.
SAMI spans with the CSS text-align property are converted to a DFXP span with the tts:textAlign property. This property, however, only applies to <p> tags in DFXP according to the documentation.
– http://www.w3.org/TR/ttaf1-dfxp/#style-attribute-textAlign
According to the WebVTT specification, "A WebVTT cue text span consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003C LESS-THAN SIGN characters (<)."
Am I incorrect in understanding that unless the span in question is one of a very limited set of elements (class, italics, bold, underline, ruby, voice, language, or timestamp¹), characters such as < and > should be escaped to &lt; and &gt;, respectively? The WebVTT sample used in testing currently does not have these escaped, and the JavaScript WebVTT parser throws an error when the < character is used like this.
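The required escaping is straightforward to apply when serializing cue text. A minimal sketch (ampersand first, so the other entities aren't doubly escaped; escaping > is not strictly required by the spec but is harmless and commonly done):

```python
def escape_cue_text(text):
    """Escape characters that may not appear literally in WebVTT cue text.
    '&' must be replaced first to avoid double-escaping the other entities.
    """
    return (text.replace('&', '&amp;')
                .replace('<', '&lt;')
                .replace('>', '&gt;'))
```

Text like "a < b" then serializes as "a &lt; b", which the JavaScript parser accepts.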
This is what the traceback looks like:
>>> from pycaption import SAMIReader
>>> s = open("example.sami").read()
>>> SAMIReader().read(s)
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == b'\xef\xbb\xbf':
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\x00\x00\xfe\xff':
/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\xff\xfe\x00\x00':
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/shakkhar/subtitles/pycaption/pycaption/sami.py", line 36, in read
sami_soup = BeautifulSoup(content)
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/__init__.py", line 193, in __init__
self.builder.prepare_markup(markup, from_encoding)):
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/builder/_lxml.py", line 99, in prepare_markup
for encoding in detector.encodings:
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py", line 256, in encodings
self.chardet_encoding = chardet_dammit(self.markup)
File "/usr/local/shakkhar/lib/python2.7/site-packages/bs4/dammit.py", line 31, in chardet_dammit
return chardet.detect(s)['encoding']
File "/usr/local/shakkhar/lib/python2.7/site-packages/chardet/__init__.py", line 25, in detect
raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object
BeautifulSoup uses chardet to detect encoding. chardet requires that the input data be a bytes object.
This issue is probably library version / python version / platform specific. Here are the details of my setup:
CentOS release 6.4 (Final)
Python 2.7.5
beautifulsoup4==4.3.2
lxml==3.2.3
chardet==2.2.1
The test file can be obtained from here.
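Given the root cause (chardet expects bytes, but the reader is handed already-decoded text), one workaround is to re-encode before detection runs, or to declare the encoding so detection is skipped entirely. A sketch in Python 3 syntax for brevity; the original issue is on Python 2, where the check would be isinstance(content, unicode):

```python
def markup_for_soup(content, encoding='utf-8'):
    """chardet (called via BeautifulSoup's UnicodeDammit) expects bytes.
    Re-encode already-decoded text before handing it to the parser;
    pass bytes through untouched. Workaround sketch only.
    """
    if isinstance(content, str):          # already decoded text
        return content.encode(encoding)
    return content                        # already bytes: pass through
```

Alternatively, BeautifulSoup can be told the encoding up front (its from_encoding parameter), which avoids the chardet path altogether.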
Alignment is being kept (most likely via STYLE nodes).
The custom writer should ignore any positioning information from the input file and just output the default specified.
Later versions of the WebVTT specification accept an align parameter when defining the "position" cue setting. For example, in order to have a cue box that stretches from the middle to the far right of the screen, instead of having to calculate the position as a function of the computed align like this:
00:09.209 --> 00:12.312 position:75% size:50%
(computed align:middle, position calculated relative to the middle of the cue box)
You can override the reference relative to which the position will be calculated and write:
00:09.209 --> 00:12.312 position:50%,start size:50%
(the computed align is still middle, but the position is calculated relative to the left edge of the cue box because of the ,start parameter)
However, as of this date, only Firefox supports this while Chrome, Safari and all Apple devices don't. Since this is sort of a mere "shortcut" and doesn't add any positioning that cannot be expressed alternatively using the old syntax, it seems to be a reasonable solution to simply write WebVTT files in the old format instead of the new one for the time being.
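Writing the old syntax only requires shifting the position by half the cue size, depending on the anchor. A sketch of the arithmetic (percentages as plain floats; not pycaption's code):

```python
def to_legacy_position(position, size, anchor='middle'):
    """Translate the newer "position:P%,anchor" cue setting into the old
    syntax, where the position percentage always marks the centre of the
    cue box (computed align "middle"). Returns the equivalent old-style
    position percentage.
    """
    if anchor == 'start':   # P marked the left edge of the cue box
        return position + size / 2.0
    if anchor == 'end':     # P marked the right edge of the cue box
        return position - size / 2.0
    return position         # 'middle': P already marks the centre
```

For the example above, to_legacy_position(50, 50, 'start') gives 75.0, i.e. position:50%,start size:50% is equivalent to position:75% size:50%.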
I ran python -m unittest test_srt_conversion and got this.
Maybe it's specific to my BeautifulSoup / lxml / libxml versions? I have
beautifulsoup4==4.3.2
lxml==2.3.2
libxml2 version: 2.7.8.dfsg-5.1ubuntu4.6
Following is the sample WebVTT file I have:
WEBVTT
Id 72
11:01:15.200 --> 11:01:16.201
Some Subtitle - 1
NOTE
{
"message": "some note"
}
Mark 72
12:01:15.200 --> 12:01:16.201
NOTE
{
"message": "some note"
}
It is actually valid if you check here: https://quuz.org/webvtt/
However, I am getting an error:
CaptionReadSyntaxError: CaptionReadSyntaxError(('Cue without content. (line 12)',))
This sami file (notice the style on the <P> tag):
<SAMI>
<Head>
<Style Type="text/css">
<!-- P {margin-left: 10pt; margin-right: 10pt; margin-top: 1pt; margin-bottom: 1pt; font-family: Arial;font-size: 10pt; text-align: Center; font-weight: Normal; background-color: 000000;}.ENUSCC {Name: English; lang: en-US;}-->
</Style>
</Head>
<Body>
<SYNC Start=366>
<P Class=ENUSCC>
<SYNC Start=3833>
<P Class=ENUSCC style="text-align: right"> overridden? AND TODAY, WE'RE GOING TO DIG INTO RESTORING KITCHEN CABINETS
Generates this output DFXP (notice there's nothing specifying alignment on the right):
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head>
<styling>
<style tts:fontFamily="Arial" tts:fontSize="10pt" tts:textAlign="Center" xml:id="p"/>
</styling>
<layout>
<region tts:padding="1pt 10pt 1pt 10pt" xml:id="r0"/>
<region tts:padding="0.14% 1.04% 0.14% 1.04%" xml:id="r1"/>
</layout>
</head>
<body>
<div region="r0" xml:lang="en-US">
<p begin="00:00:03.833" end="00:00:07.833" region="r1" style="p">
overridden? AND TODAY, WE'RE GOING TO DIG INTO RESTORING KITCHEN CABINETS
</p>
</div>
</body>
</tt>
And also this VTT file:
WEBVTT
00:03.833 --> 00:07.833
overridden? AND TODAY, WE'RE GOING TO DIG INTO RESTORING KITCHEN CABINETS
So the alignment to the right is lost.
The fractions of seconds are incorrectly converted. As an example, timings such as '01:02:03.9' (representing hour:minute:second.fraction) are converted to '01:02:03.009'.
In this case fractions are divided by 100, but if the fraction were 2 decimals long (say '.84'), the resulting fraction would be '.084'.
The timing is only calculated properly for fractions with exactly 3 decimals.
And of course, if we have more than 3 decimals specified, the second counter might be affected. For a large enough number of decimals, we could even get the time specification in seconds to go above 60.
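The fix is to treat the fractional part positionally rather than as an integer count: right-pad to the target precision, then truncate any excess. A sketch of that normalization (not pycaption's code), using microseconds as the target unit:

```python
def fraction_to_microseconds(digits):
    """Interpret the fractional part of a timestamp correctly no matter how
    many digits it has: '9' means .9s (900000 µs), '84' means .84s
    (840000 µs); digits beyond microsecond precision are dropped.
    """
    padded = (digits + '000000')[:6]   # right-pad to 6 digits, then truncate
    return int(padded)
```

This handles 1-, 2-, and 3-digit fractions uniformly and prevents long fractions from spilling over into the seconds counter.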
Pycaption released its latest version, 0.5.5, which handles &apos; in DFXP files, but unfortunately this is not handled by the SRTReader, I guess.
Right now the pycaption WebVTT parser passes texts like this:
WEBVTT
1
00:00.000 --> 00:02.000
cue text
id1 id2
00:04.000 --> 00:05.000
Transcribed by Celestials™
According to the WebVTT spec, line 7 shall be interpreted as a cue identifier and must be followed by a cue. Since line 8 is empty, the above snippet is not valid WebVTT.
Lines such as line 7 above shall only pass when prefixed by the text NOTE, as follows:
WEBVTT
1
00:00.000 --> 00:02.000
cue text
NOTE id1 id2
00:04.000 --> 00:05.000
Transcribed by Celestials™
In this case, line 7 shall be interpreted as a comment and ignored.
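The spec's distinction between the two snippets above can be sketched as a small classifier over blank-line-delimited blocks. This is a simplified illustration of the rules, not pycaption's parser:

```python
def classify_block(block):
    """Classify a blank-line-delimited WebVTT block as a NOTE comment, a
    cue (optional identifier line followed by a timing line), or invalid,
    e.g. a bare identifier with no cue attached to it.
    """
    lines = block.strip().split('\n')
    if lines[0].startswith('NOTE'):
        return 'comment'
    timing_index = 0 if '-->' in lines[0] else 1
    if timing_index < len(lines) and '-->' in lines[timing_index]:
        return 'cue'
    return 'invalid'
```

"id1 id2" alone classifies as invalid, while "NOTE id1 id2" is a comment to be ignored, matching the behaviour the spec requires.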
Example:
from pycaption import detect_format, SRTWriter
data = '''1
00:00:00,333 --> 00:00:01,300
2
00:00:01,300 --> 00:00:02,400
DID YOU HAVE A GOOD SUMMER?
3
00:00:02,400 --> 00:00:03,833
I HAD A GREAT SUMMER, ACTUALLY.
'''
reader = detect_format(data)
new_data = SRTWriter().write(reader().read(data))
Crashes:
pycaption.exceptions.CaptionReadNoCaptions: CaptionReadNoCaptions(('empty caption file',))
Removing the extra line, or adding dialog to the blank line works as expected.
Also placing the blank line in the middle of the file will terminate the file early (without an exception)
If this is something you feel should be fixed, let me know and I can look into creating a pull request for this issue.
According to the spec, an empty cue is NOT a syntax error: http://dev.w3.org/html5/webvtt/#dfn-webvtt-cue-text. Emphasis mine.
4.3.2 WebVTT cue text
WebVTT cue text is cue payload that consists of zero or more WebVTT cue components, in any order, each optionally separated from the next by a WebVTT line terminator.
We should get rid of this exception. Also check out the WebVTT validator.
I don't see where it's used in the code. Is this leftover cruft?
This sample SCC
Scenarist_SCC V1.0
00:21:29;23 9420 9452 6161 94f4 97a2 6262 942c 942f
Should produce this WebVTT file:
WEBVTT
... positioning and timing spec...
aa
bb
Instead, the last line (bb) is missing.
Is there a way to only extract the text from caption?
Example:
caps = u'''1
00:00:01,500 --> 00:00:12,345
Small caption'''
reader = SRTReader()
reader.some_method(caps)  # Small caption
This section of SCC, when converted to DFXP, produces <span> tags that close in another <p> tag: invalid XML, basically.
Scenarist_SCC V1.0
00:01:28;09 9420 942f 94ae 9420 9452 97a2 f468 e520 73e5 e3f2 e5f4 7320 efe6 20d0 e5f4 f261 9470 9723 61f2 e520 6162 ef75 f420 f4ef 2062 e520 f2e5 76e5 61ec e564 ae80
00:01:31;10 9420 942f 94ae
00:01:31;18 9420 9454 d570 206e eff7 20ef 6e80 9458 97a1 91ae ce4f d6c1 2c80 9470 97a1 20a2 d0e5 f4f2 61ba 204c ef73 f420 43e9 f479 20ef e620 d3f4 ef6e e5ae a280
00:01:35;18 9420 942f 94ae
00:01:40;25 942c
Presently, creating a text node with the text "foo" goes like this:
node = CaptionNode(CaptionNode.TEXT)
node.content = u'foo'
It should be done like this instead:
node = CaptionNode(CaptionNode.TEXT, u'foo')
Support for Windows?
Following is a webvtt file I have:
Mark 72
11:01:15.200 --> 11:01:16.201
Some Subtitle - 1
NOTE
{
"message": "some note"
}
Id 72
12:01:15.200 --> 12:01:16.201
Some Subtitle - 2
NOTE
{
"message": "some note"
}
And it is valid.
Now cue 1 has the identifier Mark 72 and also a comment. How do I parse/read it?
Some SCC files don't explicitly end the last caption with an [EDM] (erase displayed memory) command (942c), but only with an [EOC] (end of caption: display the caption on the screen) command (942f).
If this is the last line, it means we don't know when to set the end time of the last caption. As a result, it gets set to 00:00:00.
We can add a default of 4 seconds so as not to cause any weird timing problems. We already do this for SAMI, because this problem is intrinsic to SAMI (no end time is specified there).
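The proposed behaviour is a small post-processing step. The sketch below models captions as dicts with 'start'/'end' in microseconds and a missing end time as end == 0, per the description above; it is an illustration of the proposal, not pycaption's implementation:

```python
DEFAULT_DURATION_US = 4 * 1000 * 1000   # proposed 4-second default, as for SAMI

def close_open_ended_caption(captions):
    """Give the last caption a default 4-second duration when no [EDM]
    command ever set its end time (modelled here as end == 0).
    Mutates and returns the caption list.
    """
    if captions and captions[-1]['end'] == 0:
        captions[-1]['end'] = captions[-1]['start'] + DEFAULT_DURATION_US
    return captions
```

A last caption starting at 00:00:01 with no [EDM] would then end at 00:00:05 instead of 00:00:00.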
All caption formats seem to have 3 tests that do exactly the same thing, with the difference that the samples they use are in some cases bytestrings (utf-8 encoded) and in others unicode strings. In any case the content is the same and, most importantly, the input that is finally sent to the reader is already converted to unicode. This seems pretty redundant, and the tests can probably be safely removed.
https://github.com/pbs/pycaption/blob/master/pycaption/scc.py#L755
This is incompatible with python lower than 2.7. Is that intentional?
I've searched the code for lxml but couldn't find any result. Do you use it?
lxml is a strong requirement, as it's not pure Python, and even if it makes parsing faster with BeautifulSoup I think it should be an optional (extras) requirement.
Using test.py with the code below. I'm trying to get the transcript (what I want to use the script for). Am I missing something?
This is running on Windows with Python 2.7.
If I leave out the transcript it works fine.
from pycaption import SRTReader, SAMIWriter, DFXPWriter, transcript
srt_caps = '''1
00:00:09,209 --> 00:00:12,312
This is an example SRT file,
which, while extremely short,
is still a valid SRT file.
'''
converter = CaptionConverter()
converter.read(srt_caps, SRTReader())
print converter.write(SAMIWriter())
print converter.write(DFXPWriter())
print converter.write(pycaption.transcript.TranscriptWriter())
I keep getting the error
Traceback (most recent call last):
File "C:\Python27\pycaption-master\test.py", line 3, in
from pycaption import SRTReader, SAMIWriter, DFXPWriter, transcript
File "C:\Python27\pycaption-master\pycaption\transcript.py", line 7, in
from pycaption import BaseWriter, CaptionNode
ImportError: cannot import name BaseWriter
I think CaptionNode will fail as well.
Are they missing from the build? Is there another requirement?
Here is the build and error.
C:\Python27\pycaption-master>C:\Python27\pycaption-master\setup.py install
running install
running bdist_egg
running egg_info
creating pycaption.egg-info
writing requirements to pycaption.egg-info\requires.txt
writing pycaption.egg-info\PKG-INFO
writing top-level names to pycaption.egg-info\top_level.txt
writing dependency_links to pycaption.egg-info\dependency_links.txt
writing manifest file 'pycaption.egg-info\SOURCES.txt'
reading manifest file 'pycaption.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'pycaption.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_py
creating build
creating build\lib
creating build\lib\pycaption
copying pycaption\base.py -> build\lib\pycaption
copying pycaption\dfxp.py -> build\lib\pycaption
copying pycaption\exceptions.py -> build\lib\pycaption
copying pycaption\sami.py -> build\lib\pycaption
copying pycaption\scc.py -> build\lib\pycaption
copying pycaption\srt.py -> build\lib\pycaption
copying pycaption\transcript.py -> build\lib\pycaption
copying pycaption\webvtt.py -> build\lib\pycaption
copying pycaption\__init__.py -> build\lib\pycaption
creating build\lib\tests
copying tests\mixins.py -> build\lib\tests
copying tests\samples.py -> build\lib\tests
copying tests\test_dfxp.py -> build\lib\tests
copying tests\test_dfxp_conversion.py -> build\lib\tests
copying tests\test_sami.py -> build\lib\tests
copying tests\test_sami_conversion.py -> build\lib\tests
copying tests\test_scc.py -> build\lib\tests
copying tests\test_scc_conversion.py -> build\lib\tests
copying tests\test_srt.py -> build\lib\tests
copying tests\test_srt_conversion.py -> build\lib\tests
copying tests\test_webvtt.py -> build\lib\tests
copying tests\test_webvtt_conversion.py -> build\lib\tests
copying tests\__init__.py -> build\lib\tests
copying pycaption\english.pickle -> build\lib\pycaption
creating build\bdist.win32
creating build\bdist.win32\egg
creating build\bdist.win32\egg\pycaption
copying build\lib\pycaption\base.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\dfxp.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\english.pickle -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\exceptions.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\sami.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\scc.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\srt.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\transcript.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\webvtt.py -> build\bdist.win32\egg\pycaption
copying build\lib\pycaption\__init__.py -> build\bdist.win32\egg\pycaption
creating build\bdist.win32\egg\tests
copying build\lib\tests\mixins.py -> build\bdist.win32\egg\tests
copying build\lib\tests\samples.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_dfxp.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_dfxp_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_sami.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_sami_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_scc.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_scc_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_srt.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_srt_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_webvtt.py -> build\bdist.win32\egg\tests
copying build\lib\tests\test_webvtt_conversion.py -> build\bdist.win32\egg\tests
copying build\lib\tests\__init__.py -> build\bdist.win32\egg\tests
byte-compiling build\bdist.win32\egg\pycaption\base.py to base.pyc
byte-compiling build\bdist.win32\egg\pycaption\dfxp.py to dfxp.pyc
byte-compiling build\bdist.win32\egg\pycaption\exceptions.py to exceptions.pyc
byte-compiling build\bdist.win32\egg\pycaption\sami.py to sami.pyc
byte-compiling build\bdist.win32\egg\pycaption\scc.py to scc.pyc
byte-compiling build\bdist.win32\egg\pycaption\srt.py to srt.pyc
byte-compiling build\bdist.win32\egg\pycaption\transcript.py to transcript.pyc
byte-compiling build\bdist.win32\egg\pycaption\webvtt.py to webvtt.pyc
byte-compiling build\bdist.win32\egg\pycaption\__init__.py to __init__.pyc
byte-compiling build\bdist.win32\egg\tests\mixins.py to mixins.pyc
byte-compiling build\bdist.win32\egg\tests\samples.py to samples.pyc
byte-compiling build\bdist.win32\egg\tests\test_dfxp.py to test_dfxp.pyc
byte-compiling build\bdist.win32\egg\tests\test_dfxp_conversion.py to test_dfxp_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_sami.py to test_sami.pyc
byte-compiling build\bdist.win32\egg\tests\test_sami_conversion.py to test_sami_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_scc.py to test_scc.pyc
byte-compiling build\bdist.win32\egg\tests\test_scc_conversion.py to test_scc_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_srt.py to test_srt.pyc
byte-compiling build\bdist.win32\egg\tests\test_srt_conversion.py to test_srt_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\test_webvtt.py to test_webvtt.pyc
byte-compiling build\bdist.win32\egg\tests\test_webvtt_conversion.py to test_webvtt_conversion.pyc
byte-compiling build\bdist.win32\egg\tests\__init__.py to __init__.pyc
creating build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\PKG-INFO -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\SOURCES.txt -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\dependency_links.txt -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\requires.txt -> build\bdist.win32\egg\EGG-INFO
copying pycaption.egg-info\top_level.txt -> build\bdist.win32\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
pycaption.transcript: module references __file__
creating dist
creating 'dist\pycaption-0.3.6-py2.7.egg' and adding 'build\bdist.win32\egg' to it
removing 'build\bdist.win32\egg' (and everything under it)
Processing pycaption-0.3.6-py2.7.egg
removing 'c:\python27\lib\site-packages\pycaption-0.3.6-py2.7.egg' (and everything under it)
creating c:\python27\lib\site-packages\pycaption-0.3.6-py2.7.egg
Extracting pycaption-0.3.6-py2.7.egg to c:\python27\lib\site-packages
pycaption 0.3.6 is already the active version in easy-install.pth
Installed c:\python27\lib\site-packages\pycaption-0.3.6-py2.7.egg
Processing dependencies for pycaption==0.3.6
Searching for cssutils==1.0
Best match: cssutils 1.0
Processing cssutils-1.0-py2.7.egg
cssutils 1.0 is already the active version in easy-install.pth
Installing csscombine-script.py script to C:\Python27\Scripts
Installing csscombine.exe script to C:\Python27\Scripts
Installing csscombine.exe.manifest script to C:\Python27\Scripts
Installing cssparse-script.py script to C:\Python27\Scripts
Installing cssparse.exe script to C:\Python27\Scripts
Installing cssparse.exe.manifest script to C:\Python27\Scripts
Installing csscapture-script.py script to C:\Python27\Scripts
Installing csscapture.exe script to C:\Python27\Scripts
Installing csscapture.exe.manifest script to C:\Python27\Scripts
Using c:\python27\lib\site-packages\cssutils-1.0-py2.7.egg
Searching for lxml==3.3.5
Best match: lxml 3.3.5
Processing lxml-3.3.5-py2.7-win32.egg
lxml 3.3.5 is already the active version in easy-install.pth
Using c:\python27\lib\site-packages\lxml-3.3.5-py2.7-win32.egg
Searching for beautifulsoup4==4.3.2
Best match: beautifulsoup4 4.3.2
Processing beautifulsoup4-4.3.2-py2.7.egg
beautifulsoup4 4.3.2 is already the active version in easy-install.pth
Using c:\python27\lib\site-packages\beautifulsoup4-4.3.2-py2.7.egg
Finished processing dependencies for pycaption==0.3.6
C:\Python27\pycaption-master>test.py
Traceback (most recent call last):
File "C:\Python27\pycaption-master\test.py", line 3, in
from pycaption import SRTReader, SAMIWriter, DFXPWriter, transcript
File "C:\Python27\pycaption-master\pycaption\transcript.py", line 7, in
from pycaption import BaseWriter, CaptionNode
ImportError: cannot import name BaseWriter
Would be great to have Python 3.3 and 3.4 support. There are not many changes, so I think that's doable.
The SCC Reader should specify default values for the alignment of the text.
This is needed because the writers don't have the proper knowledge of where to place the text (the defaults for the other formats might differ)
test_srt_to_dfxp_conversion and test_webvtt_to_dfxp_conversion fail with AssertionError. The expected value contains <span tts:textalign="right">we have this vision of Einstein</span>, although DFXPTestingMixIn should be removing spans.
<p begin="00:00:09:20" end="00:00:12:7" region="b1">
will convert to start = 90000000 and end = 120000000.
Technically it should be displayed in the 9th second at frame 20, but the approximate conversion from frame number to milliseconds is not working, and therefore the cue will be displayed more than half a second too early.
In the pull request I repaired the approximate frame-number-to-milliseconds conversion. The assumption of 30 fps is still in there; not correct, but reasonably precise.
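The repaired conversion amounts to dividing the frame count by the frame rate instead of reading it as a decimal fraction. A sketch of that arithmetic under the same 30 fps assumption (not the pull request's exact code):

```python
def frames_timestamp_to_microseconds(timestamp, fps=30.0):
    """Convert an 'HH:MM:SS:FF' timestamp to microseconds, converting the
    trailing frame count at the given frame rate. At 30 fps,
    '00:00:09:20' becomes 9s + 20/30s, not 9.2s.
    """
    hours, minutes, seconds, frames = (int(p) for p in timestamp.split(':'))
    total = hours * 3600 + minutes * 60 + seconds + frames / fps
    return int(total * 1000 * 1000)
```

With this, begin="00:00:09:20" lands at roughly 9.667 seconds instead of firing over half a second early.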
I tried to upgrade to the latest pycaption on my machine:
pip install pycaption --upgrade
Requirement already up-to-date: pycaption in /usr/local/lib/python2.7/site-packages/pycaption-0.5.4-py2.7.egg
Requirement already up-to-date: beautifulsoup4>=4.2.1 in /usr/local/lib/python2.7/site-packages/beautifulsoup4-4.4.1-py2.7.egg (from pycaption)
Collecting lxml>=3.2.3 (from pycaption)
Downloading lxml-3.5.0.tar.gz (3.8MB)
100% |████████████████████████████████| 3.8MB 132kB/s
Requirement already up-to-date: cssutils>=0.9.10 in /usr/local/lib/python2.7/site-packages/cssutils-1.0.1-py2.7.egg (from pycaption)
Installing collected packages: lxml
Found existing installation: lxml 3.5.0b1
Uninstalling lxml-3.5.0b1:
Successfully uninstalled lxml-3.5.0b1
Running setup.py install for lxml
Successfully installed lxml-3.5.0
You are using pip version 7.1.2, however version 8.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
It does upgrade lxml, but it did not upgrade the pycaption package itself. See above: it says "Requirement already up-to-date: pycaption in /usr/local/lib/python2.7/site-packages/pycaption-0.5.4-py2.7.egg".
We require your latest change, the workaround for the lack of &apos; support in html.parser (PR #124).
Kindly look into this issue and let us know if you could get around to packaging the latest pycaption changes for pip ASAP. Our internal automated system that uses pycaption requires the latest change. It would be of great help. Thanks in advance!
I wanted a dictionary similar to SUPPORTED_READERS to make it simple for a conversion tool to look up the writer class from a text name (e.g. --output-format=webvtt). This is obviously pretty easy to handle, but it means I have a hard-coded list, which would be nice to avoid:
https://gist.github.com/acdha/c9fd54d4dee67801a09b#file-convert-subtitles-py-L24-L30
On a DFXP to DFXP conversion, the text I'm is converted to I&apos;m instead of I'm as expected.
https://github.com/pbs/pycaption/blob/master/pycaption/sami.py#L415
Why is the code here just assuming that data will be UTF-8 encoded? This creates trouble with unicode strings.
Here is a code sample:
with open("unicode_problem_sample.sami") as f:
s = f.read()
sp = unicode(s, "utf-8")
pcc = SAMIReader().read(sp)
Of course this is an artificial example, but not all my data comes as UTF-8. I decode it and pass it to pycaption as unicode strings.
Here is a sample file:
https://dl.dropboxusercontent.com/u/32117554/unicode_problem_sample.sami
Some input files, though valid according to the specification, may cause captions to be cut out of the screen range in some players. This happens both for DFXP (IE11) and WebVTT files (Firefox), and potentially also SCC. For WebVTT it happens when "position" is specified but not "size". For DFXP it happens when "origin" is specified but not "extent". When one such file is converted, the output (DFXP and WebVTT) preserves the same problem.
This example produces captions that start and end at the same moment:
Scenarist_SCC V1.0
00:01:31;18 9420 9454 6162 9758 97a1 91ae 6261 9170 97a1 e362
00:01:35;18 9420 942f 94ae
00:01:40;25 942c
This is the output (string representation of the captions created):
u'00:01:35.666 --> 00:01:35.666\nab'
u'00:01:35.666 --> 00:01:35.666\nba'
u'00:01:35.666 --> 00:01:40.866\ncb'
ab = 6162
ba = 6261
cb = e362
There should indeed be 3 captions created, but they should all be displayed on screen at the same time (most likely starting at 00:01:35.666 and ending at 00:01:40.866).