node-webvtt's People

Contributors

autotel, baldurh, dependabot[bot], fadomire, goatandsheep, jagdish7908, krummi, kthelgason, osk, slifty

node-webvtt's Issues

Parser doesn't respect timestamps with single digit hours

The regular expression supplied for matching a WebVTT timestamp does not allow for a timestamp that denotes the hour number with a single digit:

Regular Expression

const TIMESTAMP_REGEXP = /([0-9]{2})?:?([0-9]{2}):([0-9]{2}\.[0-9]{3})/;

Example timestamps

59:58.000 --> 59:59.000
<b>(GASPS) Hunter?</b>

1:00:00.000 --> 1:00:01.000
<b>You're fired!</b>

However, according to the WebVTT spec and its description of how to parse a timestamp, a parser should allow for this case:

If string is not exactly two characters in length, or if value1 is greater than 59, let most significant units be hours.

The solution would be to update the regular expression to allow for 1 or 2 digits in the hours slot:

const TIMESTAMP_REGEXP = /([0-9]{1,2})?:?([0-9]{2}):([0-9]{2}\.[0-9]{3})/;
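A minimal sketch of how the proposed expression behaves on the two example timestamps. The helper name `timestampToSeconds` is an assumption made for illustration; it is not the library's own parsing function.

```javascript
// Proposed expression from this issue: hours may be one or two digits.
const PROPOSED_TIMESTAMP_REGEXP = /([0-9]{1,2})?:?([0-9]{2}):([0-9]{2}\.[0-9]{3})/;

// Hypothetical helper for illustration; not the library's own parser code.
function timestampToSeconds(timestamp) {
  const matches = timestamp.match(PROPOSED_TIMESTAMP_REGEXP);
  if (!matches) {
    throw new Error(`Invalid timestamp: ${timestamp}`);
  }
  // The hours group is optional, so it is undefined for "MM:SS.mmm" input.
  const hours = matches[1] ? parseInt(matches[1], 10) : 0;
  const minutes = parseInt(matches[2], 10);
  const seconds = parseFloat(matches[3]);
  return hours * 3600 + minutes * 60 + seconds;
}

console.log(timestampToSeconds('59:58.000'));   // 3598
console.log(timestampToSeconds('1:00:00.000')); // 3600
```

With the single-digit hour accepted, the second cue starts after the first one instead of being rejected.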

my proposals

I have some thoughts about the parser. Can I share them with you?

Here is my file:

WEBVTT - this is coolest format for subtitles ever!!
Thank you for download this subtitles. 
author: aparus
title: my movie title
language: en

NOTE 
Scene 1. Actors:
Fred: https://avatars.com/fred-avatar.png
John: https://avatars.com/john-avatar.png

00:00:00.500 --> 00:00:02.000
<v.loud Fred> The Web is always changing. 

00:00:04.500 --> 00:00:05.678
<v.silent John> Yes. And it is good. 

00:00:07.323 --> 00:00:08.437
<v.loud Fred>  and the way we access it is changing

  1. The comment on the first line is always lost. Maybe we should store it in meta, e.g. as firstLineComment?
  2. A line without ":" in meta appears as properties: {'Thank you for download this subtitles': 'Thank you for download this subtitles. '}. The key is not predefined but variable. It may be better to use something like {plainText: 'Thank you for download this subtitles. '} and join all text without keys there.
  3. What about the voice tag and its styles? Would you like to process them too, as additional fields on the cue: {voice: 'Fred', style: 'loud'}?
  4. Should NOTEs be stored as well? I want to put some additional info there, such as scene info, actor avatars, etc.

If you like any of these, I'll open a pull request once I have implemented them. But I need your opinion first.

Thank you for your attention. Best regards.
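To make proposal 3 concrete, here is a rough sketch of how a voice tag could be split out of a cue's text. `parseVoice` is a hypothetical helper, not an existing node-webvtt function, and it returns the classes as a `styles` array (rather than the single `style` string suggested above) because a voice tag may carry more than one class.

```javascript
// Hypothetical helper, not part of node-webvtt.
// Splits "<v.loud Fred> text" into the speaker, its classes and the remaining text.
function parseVoice(text) {
  const pattern = /^<v((?:\.[^\s.>]+)*)\s+([^>]+)>([\s\S]*?)(?:<\/v>)?$/;
  const match = pattern.exec(text.trim());
  if (!match) {
    // No leading voice tag: return the text unchanged.
    return { voice: null, styles: [], text };
  }
  return {
    voice: match[2].trim(),
    styles: match[1].split('.').filter(Boolean), // ".loud.fast" -> ["loud", "fast"]
    text: match[3].trim(),
  };
}

console.log(parseVoice('<v.loud Fred> The Web is always changing. '));
```

This only handles a single voice tag at the start of the cue text; cues with several speakers would need a real tag tokenizer.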

cue content causing 'ParserError', due to empty lines

I extracted a subtitle track as WebVTT from an MPEG-4 file (from a Contour+2 camera) using ffmpeg. Empty lines in the text of a cue block cause the following exception:

.../node_modules/node-webvtt/lib/parser.js:65
    throw new ParserError(msg);
    ^
Error: Cue identifier needs to be followed by timestamp (cue #1)

Sample data from the WebVTT file:

WEBVTT

00:00.000 --> 00:01.000
FW version:1700 V2.1.29
FW name: ContourPlus2

UPDATE:N
UPDATE_FW:N

    SWITCH_1
1RES:D
1BR:H
1EV:0
1SHRP:3
1AE:C
1CTST:62
1MIC:45
1EXT_MIC:30
1SILENT:1
1LSR:1
1LED:0
1GPS_PWR:0
1GPS_REC:1
1AWB:0

    SWITCH_2
2RES:A
2BR:H
2EV:0
2SHRP:3
2AE:C
2CTST:62
2MIC:45
2EXT_MIC:30
2SILENT:0
2LSR:0
2LED:0
2GPS_PWR:0
2GPS_REC:1
2AWB:0

Is there any way we can get node-webvtt to deal with this? My current thought is to require every cue block to start with an empty line followed by a properly formatted start/end time line, and then to treat the rest as cue content. I haven't read enough of the WebVTT spec to know whether this would cause issues.
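The idea in the last paragraph could look roughly like the sketch below: a new cue only starts at a timing line (or an identifier followed by a timing line) that comes right after a blank line, and everything else, blank lines included, is kept as cue text. `splitCuesLeniently` is a hypothetical function, not node-webvtt's algorithm, and as far as I understand the spec a blank line normally ends a cue, so this is deliberately more lenient than a conforming parser.

```javascript
// Hypothetical lenient splitter; not node-webvtt's own algorithm.
function splitCuesLeniently(input) {
  const lines = input.replace(/\r\n?/g, '\n').split('\n');
  const isBlank = (i) => i < 0 || i >= lines.length || lines[i].trim() === '';
  const isTiming = (i) => i >= 0 && i < lines.length && lines[i].includes('-->');
  const cues = [];
  let current = null;

  for (let i = 0; i < lines.length; i++) {
    const afterBlank = isBlank(i - 1);
    if (afterBlank && !isBlank(i) && !isTiming(i) && isTiming(i + 1)) {
      // Identifier line: the timing line that follows opens the cue.
      current = { identifier: lines[i].trim(), timing: lines[i + 1].trim(), text: [] };
      cues.push(current);
      i += 1; // skip the timing line that was just consumed
    } else if (afterBlank && isTiming(i)) {
      current = { identifier: '', timing: lines[i].trim(), text: [] };
      cues.push(current);
    } else if (current) {
      // Anything else, including blank lines, stays part of the current cue.
      current.text.push(lines[i]);
    }
  }

  return cues.map((cue) => ({
    identifier: cue.identifier,
    timing: cue.timing,
    text: cue.text.join('\n').trim(),
  }));
}
```

The trade-off is that a payload line containing "-->" directly after a blank line would be mistaken for the start of a new cue.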

[Feature Suggestion] WebVTT exporter

I'm open to contributing to this project to add an exporter. My idea is that users can parse the WebVTT file, modify it, then export it back to WebVTT format.

Webvtt parser breaking with X-TIMESTAMP-MAP

Content:

WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:0

1
00:00.300 --> 00:01.889
This is from a recruit, and I've

2
00:01.890 --> 00:03.329
asked them if I can have one of the

3
00:03.330 --> 00:04.849
six consulting conversations, I

4
00:05.010 --> 00:06.509
said, can I do a tech assessment

5
00:06.510 --> 00:08.189
with you? And then they said, no,

6
00:08.340 --> 00:10.230
I do not need a tech assessment.

Error:

[Error: Missing blank line after signature] { error: undefined }
Waiting for the debugger to disconnect...

Error: Error: Missing blank line after signature

Not sure why X-TIMESTAMP-MAP is not supported.
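Judging by the error message, the parser expects a blank line directly after the WEBVTT signature, which is where this header sits. Until headers are accepted, the line can be read separately. The sketch below assumes the `KEY:VALUE,KEY:VALUE` layout shown in the sample; `parseTimestampMap` is a hypothetical helper, not part of node-webvtt.

```javascript
// Hypothetical helper: split "X-TIMESTAMP-MAP=LOCAL:...,MPEGTS:..." into fields.
function parseTimestampMap(line) {
  const prefix = 'X-TIMESTAMP-MAP=';
  if (!line.startsWith(prefix)) {
    return null;
  }
  const fields = {};
  line.slice(prefix.length).split(',').forEach((pair) => {
    // Only the first colon separates key and value; LOCAL's value contains colons.
    const separator = pair.indexOf(':');
    if (separator === -1) return;
    fields[pair.slice(0, separator).trim()] = pair.slice(separator + 1).trim();
  });
  return fields;
}

console.log(parseTimestampMap('X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:0'));
```

Values are kept as strings here; converting MPEGTS to a number is left to the caller.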

timestamp with hour is smaller than timestamp without?

Just tried running against a file with the following entries (truncated):

51:13.387 --> 53:20.177
Chapter 14

53:20.180 --> 56:02.180
Chapter 15

56:02.175 --> 59:16.395
Chapter 16

59:16.403 --> 1:04:13.283
Chapter 17

1:04:13.283 --> 1:06:03.283
Chapter 18

1:06:03.276 --> 1:09:25.646
Chapter 19

1:09:25.645 --> 1:11:04.235
Chapter 20

This causes a failure with the error 'Error: Start timestamp greater than end (cue #15)'. The offending time entry appears to be: 59:16.403 --> 1:04:13.283

A quick inspection suggests that this has something to do with the regex, since adding the following line to parseTimestamp():

console.log(matches[1], matches[2], matches[3])

gives 'undefined' in all cases for matches[1]
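This can be reproduced outside the library. The sketch below assumes the parser still uses the two-digit-hours expression quoted in the first issue on this page:

```javascript
// Expression quoted in the first issue on this page (two-digit hours only).
const TWO_DIGIT_HOURS_REGEXP = /([0-9]{2})?:?([0-9]{2}):([0-9]{2}\.[0-9]{3})/;

const hourMatch = '1:04:13.283'.match(TWO_DIGIT_HOURS_REGEXP);

// The expression is unanchored, so the match starts at the first ":" and the
// leading "1" is skipped instead of being captured as the hours.
console.log(hourMatch.index); // 1
console.log(hourMatch[1]);    // undefined
console.log(hourMatch[2]);    // "04"
console.log(hourMatch[3]);    // "13.283"
```

The end time is therefore read as 04:13.283, which is smaller than the start time 59:16.403, hence the "Start timestamp greater than end" error.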

[Suggestion] Add an option for parsing without <v> tags?

Sometimes vtt has voice tags (<v> ... </v>), which are helpful for trying to stylize the expression of a sound. When you don't want them, however, they're very difficult to remove without adding some other parsing library for xml-like objects. Can node-webvtt offer an option where v-tags are ignored?
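As a stopgap, voice tags can be removed from each cue's text with a regular expression rather than an XML parser. `stripVoiceTags` is a hypothetical helper, not an existing node-webvtt option:

```javascript
// Hypothetical helper: remove <v ...> and </v> tags but leave other markup alone.
function stripVoiceTags(text) {
  // After "v" there must be a class (".loud"), an annotation (" Fred") or ">",
  // so tags that merely start with "v" are not touched.
  return text.replace(/<\/?v(?:[.\s][^>]*)?>/g, '');
}

console.log(stripVoiceTags('<v.loud Fred>Hello</v> <b>there</b>')); // "Hello <b>there</b>"
```

It could be applied to the text of every cue after parsing.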

VTT Subs go out of sync when translate from EN to ES.

@osk I have this problem with node-webvtt

I first parse the English VTT file, translate the text and replace it within the cue object, and then compile the VTT file again.
It works very well, but I have noticed that in the resulting file the start times of the paragraphs are not exactly the same. Near the beginning of the file the difference is small, but it accumulates, and by the end of the file there are differences of up to 1 minute.

I put the original and translated files on a shared drive:

The cue update is done here: https://github.com/bySabi/subs-translate/blob/master/bin/subs-translate#L299

Thanks for this wonderful module.

Library throws error with malformed webvtt file

I had a section in this file as such:

1096
01:45:13.056 --> 01:45:14.390



...mission.

And this returned an error with:

TypeError: Cannot read property 'split' of undefined
    at parseCue (node_modules/node-webvtt/lib/parser.js:142:26)
    at cues.map (node_modules/node-webvtt/lib/parser.js:91:16)
    at Array.map (<anonymous>)
    at parseCues (node_modules/node-webvtt/lib/parser.js:89:6)
    at Object.parse (node_modules/node-webvtt/lib/parser.js:57:28)
    at main (anthony/parse.js:8122:27)
    at Object.<anonymous> (anthony/parse.js:8134:1)
    at Module._compile (module.js:653:30)
    at Object.Module._extensions..js (module.js:664:10)
    at Module.load (module.js:566:32)

It was also difficult to track down where in the WebVTT file the problem was.

The desired behavior is that, even given this malformed WebVTT, the library either parses it or returns an error in the expected format.

Hours timestamp should support more than 2 digits

The WebVTT spec says that timestamps should be:

A WebVTT timestamp consists of the following components, in the given order:
Optionally (required if hours is non-zero):
Two or more ASCII digits, representing the hours as a base ten integer.

Right now any hours longer than 2 digits are truncated to only include the final two digits of the hours count.
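One possible shape for a fix is an anchored expression with an hours group of any length. This is only a sketch: `LONG_HOURS_REGEXP` and `longTimestampToSeconds` are made-up names, and accepting a single hour digit is a leniency assumed here rather than something the quoted syntax allows.

```javascript
// Anchored expression: optional hours of any length, then MM:SS.mmm.
// Accepting a single hour digit is a leniency assumed here, not spec syntax.
const LONG_HOURS_REGEXP = /^(?:([0-9]+):)?([0-9]{2}):([0-9]{2})\.([0-9]{3})$/;

// Hypothetical helper for illustration; not the library's own parser code.
function longTimestampToSeconds(timestamp) {
  const m = LONG_HOURS_REGEXP.exec(timestamp.trim());
  if (!m) {
    throw new Error(`Invalid timestamp: ${timestamp}`);
  }
  const hours = m[1] ? parseInt(m[1], 10) : 0;
  return hours * 3600
    + parseInt(m[2], 10) * 60
    + parseInt(m[3], 10)
    + parseInt(m[4], 10) / 1000;
}

console.log(longTimestampToSeconds('100:00:00.000')); // 360000
```

Because the expression is anchored at both ends, extra leading digits can no longer be dropped silently; input that does not fit is rejected instead.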

Failure if WebVTT has trailing lines

If the .vtt file has more than one trailing line, node-webvtt fails with the following output:

<path to project>/node_modules/node-webvtt/lib/parser.js:75
  if (lines[0].includes('-->')) {

Sample WebVTT file:

WEBVTT

00:00.000 --> 00:01.000
Hello there
how are you

00:01.000 --> 00:02.000
Hello there
how are you

00:02.000 --> 00:03.000
Hello there
how are you

00:03.000 --> 00:04.000
Hello there
how are you


When I examined the values of 'lines' in the parseCue() function, the first empty line produced an empty array, which causes the if (lines[0].includes('-->')) line to fail.

A workaround for now is to trim the text prior to passing it to the parse() function.
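Inside a parser, the same effect could be had by dropping empty blocks when the input is split. The sketch below is an assumption about how such a split might look, not the library's actual code; `splitIntoBlocks` is a hypothetical name.

```javascript
// Hypothetical block splitter that tolerates any number of trailing blank lines.
function splitIntoBlocks(input) {
  return input
    .replace(/\r\n?/g, '\n')        // normalize line endings
    .split(/\n\s*\n/)               // one or more blank lines separate blocks
    .map((block) => block.trim())
    .filter((block) => block.length > 0); // extra blank lines leave no empty block
}

const sampleBlocks = splitIntoBlocks(
  'WEBVTT\n\n00:00.000 --> 00:01.000\nHello there\nhow are you\n\n\n'
);
console.log(sampleBlocks.length); // 2
```

With empty blocks filtered out, no cue parser is ever handed an empty list of lines.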

[Bug] Parser parses empty cues

When parsing inputs with empty cues, e.g. WEBVTT↵↵, parser creates a cue with the following structure:

{
  end: 0,
  identifier: "",
  start: 0,
  styles: "",
  text: ""
}

When trying to compile from the parsed input, I get "Error: Cue malformed: start timestamp greater than end", which is set to go off when end <= start. The default value of end triggers it.

Expected Value

  1. I think if the cue is blank, the parser should not add it to the array at all. I know this will still be a non-reversible input, but at least the meaning is reversible.
  2. I think it should also set the value of cues to [] instead of not giving the parsed content the attribute at all.

Support relaxed parsing

The spec allows rendering files with some errors. A situation in the spec where it happens: "[...] This is clearly a mistake, so a conformance checker will flag it as an error, but it is still useful to render the cues to the user."

There's an example file with some errors that are ignored by renderers: https://github.com/cgiffard/Captionator/blob/master/video/acid.vtt

Trying to parse the above file returns ParserError: Invalid cue timestamp (cue #14) without returning anything else. It would be better if it returned an object with {valid: false, errors: [ParserError('Invalid cue timestamp (cue #14)')], cues: [...]} and only threw an error when the file signature is invalid.

This is important because there are WebVTT files that are authored with errors, and since renderers ignore some errors, those don't get noticed until someone tries to open the files in a strict parser like node-webvtt. Sadly it wasn't noticed sooner: of the files I'm working with, 20% are affected by this issue.
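A rough, self-contained sketch of what such a relaxed mode could look like. It is not based on node-webvtt's internals: `parseRelaxed`, `secondsFrom` and `CUE_TIMING` are hypothetical names, and only the return shape follows the proposal above.

```javascript
// Hypothetical relaxed parser; the return shape follows the proposal above.
const CUE_TIMING = /^((?:\d+:)?\d{2}:\d{2}\.\d{3})[ \t]+-->[ \t]+((?:\d+:)?\d{2}:\d{2}\.\d{3})/;

// "HH:MM:SS.mmm" or "MM:SS.mmm" -> seconds
function secondsFrom(timestamp) {
  return timestamp.split(':').reduce((total, part) => total * 60 + parseFloat(part), 0);
}

function parseRelaxed(input) {
  const blocks = input.replace(/\r\n?/g, '\n').trim().split(/\n\s*\n/);
  if (!/^WEBVTT(?:[ \t\n]|$)/.test(blocks.shift())) {
    throw new Error('Missing WEBVTT signature'); // the only fatal error
  }

  const cues = [];
  const errors = [];
  blocks.forEach((block, index) => {
    const lines = block.split('\n');
    if (lines[0].startsWith('NOTE')) return; // comments are not cues
    // The timing line is either the first line or follows an identifier.
    const timingIndex = lines[0].includes('-->') ? 0 : 1;
    const match = CUE_TIMING.exec(lines[timingIndex] || '');
    if (!match) {
      errors.push(new Error(`Invalid cue timestamp (block #${index + 1})`));
      return; // keep going instead of aborting the whole file
    }
    cues.push({
      identifier: timingIndex === 1 ? lines[0] : '',
      start: secondsFrom(match[1]),
      end: secondsFrom(match[2]),
      text: lines.slice(timingIndex + 1).join('\n'),
    });
  });

  return { valid: errors.length === 0, errors, cues };
}
```

Callers that want strict behavior can still check `valid` and reject the file, while lenient callers get every cue that could be read.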

Subtitles can be compiled in non-chronological order

I don't know if this qualifies as an issue, but I'm reporting it in case it is of any use to you.

It is possible to compile a VTT file with subtitles in non-chronological order. To reproduce this:

let webvtt=require("node-webvtt");
console.log(webvtt.compile({
    "valid":true,
    "cues":[
        {
           "identifier":"",
           "start":30,
           "end":31,
           "text":"This is a subtitle",
           "styles":"align:start line:0%"
        },
       {
          "identifier":"",
          "start":0,
          "end":1,
          "text":"Hello world!",
          "styles":""
       },
       {
          "identifier":"",
          "start":60,
          "end":61,
          "text":"Foo",
          "styles":""
       },
       {
          "identifier":"",
          "start":110,
          "end":111,
          "text":"Bar",
          "styles":""
       }
    ]
 }));

Whether this is a bug or not depends on whether a non-chronological VTT file is admissible.
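Either way, callers can check or fix the order themselves before compiling. A small sketch, assuming cues shaped like the ones above (numeric `start` in seconds); both function names are hypothetical:

```javascript
// True when every cue starts at or after the cue before it.
function isChronological(cues) {
  return cues.every((cue, i) => i === 0 || cues[i - 1].start <= cue.start);
}

// Returns a sorted copy so the caller's array is left untouched.
function sortCuesByStart(cues) {
  return [...cues].sort((a, b) => a.start - b.start);
}

const unordered = [
  { identifier: '', start: 30, end: 31, text: 'This is a subtitle', styles: 'align:start line:0%' },
  { identifier: '', start: 0, end: 1, text: 'Hello world!', styles: '' },
  { identifier: '', start: 60, end: 61, text: 'Foo', styles: '' },
];

console.log(isChronological(unordered));                  // false
console.log(isChronological(sortCuesByStart(unordered))); // true
```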

Does not compile metadata

It seems that we can parse() out the metadata from a string --> json.
But json --> string is missing the metadata block at the start of the file content when using the compile() method.

[Suggestion] TypeScript Conversion / Support

I like this parser and as a TypeScript user am interested in a typed version of this package. Wanted to start a discussion about that process if you are interested, and how I can help in the event it moves forward.

Spaces being added in meta header

Currently we are using video.js and exoplayer for some devices to play our content and are using node-webvtt version "^1.9.4"
Our current .vtt files have a meta header that looks like this:

WEBVTT
X-TIMESTAMP-MAP=MPEGTS:183750,LOCAL:00:00:00.000

However, after running our .vtt file through the compile() method, we are noticing that a space is being added after the colon in the meta header.

WEBVTT
X-TIMESTAMP-MAP=MPEGTS: 183750,LOCAL:00:00:00.000

This is causing video.js and exoplayer to have issues loading captions due to an error with X-TIMESTAMP-MAP and the caption files are not being displayed.

Looking in the compiler.js, I noticed this section of code which deliberately adds a space to the output:

if (input.meta) {
    if (typeof input.meta !== 'object' || Array.isArray(input.meta)) {
      throw new CompilerError('Metadata must be an object');
    }

    Object.entries(input.meta).forEach((i) => {
      if (typeof i[1] !== 'string') {
        throw new CompilerError(`Metadata value for "${i[0]}" must be string`);
      }

      output += `${i[0]}: ${i[1]}\n`;
    });
  }

(https://github.com/osk/node-webvtt/blob/master/lib/compiler.js#L44)
And I am wondering what the reasoning is behind adding that space there in the output section?
Some players handle it correctly but not every player does and by removing that space both video.js and exoplayer were able to render the caption correctly and have it displayed.

Would it be possible to get this space removed in the meta if block?
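For illustration, a variant of the quoted loop with a configurable separator could look like the sketch below. `compileMeta` and its `separator` parameter are hypothetical, and the example key assumes the parser splits the header line at its first colon, which I have not verified.

```javascript
// Hypothetical variant of the metadata loop quoted above, with a configurable separator.
function compileMeta(meta, separator = ':') {
  if (typeof meta !== 'object' || meta === null || Array.isArray(meta)) {
    throw new Error('Metadata must be an object');
  }
  return Object.entries(meta)
    .map(([key, value]) => {
      if (typeof value !== 'string') {
        throw new Error(`Metadata value for "${key}" must be string`);
      }
      return `${key}${separator}${value}\n`; // no space is added unless the caller asks for one
    })
    .join('');
}

// Assumed parse result for the header in this issue (split at the first colon).
console.log(compileMeta({ 'X-TIMESTAMP-MAP=MPEGTS': '183750,LOCAL:00:00:00.000' }));
```

Passing `': '` as the separator would reproduce the current output for callers who rely on it.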

Add support for meta option in segmenter

I see that a meta option has been added to the parse function to support multi line headers.
The segmenter does not have that option and calls parse without passing any option.

[Feature Suggestion] Ignore extra header lines

I'm using this module to ingest data from Google's YouTube Captions API. Unfortunately, the content it generates has extra lines after the opening WEBVTT line, for example:

WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:00.960
[Happy music]

According to MDN, this is not allowed, but it appears there nonetheless. At the moment I'm solving this with a workaround to alter the string before passing it to parse():

const adjustedCaption = caption.replace(/^WEBVTT[\s\S]*?\n\n/, "WEBVTT\n\n");

Without this workaround, I receive an error: Missing blank line after signature. It would be preferable if this module could instead accept an option to ignore trailing signature lines. Looking at the code, this wouldn't have adverse effects on the parsing. Alternatively, these lines could be parsed and added as metadata to the parsed output.

I'd be happy to issue a PR for this if you're comfortable with the approach, or if you have a better suggestion I can look at implementing that too.
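As a sketch of the second alternative (keeping the extra lines as metadata instead of discarding them), something along these lines could work. `splitHeader` is a hypothetical helper, not the module's API:

```javascript
// Hypothetical helper: separate the header block from the cues and
// collect "Key: value" lines that follow the signature.
function splitHeader(input) {
  const normalized = input.replace(/\r\n?/g, '\n');
  const headerEnd = normalized.indexOf('\n\n');
  const header = headerEnd === -1 ? normalized : normalized.slice(0, headerEnd);
  const body = headerEnd === -1 ? '' : normalized.slice(headerEnd + 2);

  const [signature, ...extraLines] = header.split('\n');
  if (!/^WEBVTT(?:[ \t]|$)/.test(signature)) {
    throw new Error('Missing WEBVTT signature');
  }

  const meta = {};
  extraLines.forEach((line) => {
    const separator = line.indexOf(':');
    if (separator === -1) return; // ignore lines that are not "key: value"
    meta[line.slice(0, separator).trim()] = line.slice(separator + 1).trim();
  });
  return { meta, body };
}

const youtubeSample = 'WEBVTT\nKind: captions\nLanguage: en\n\n00:00:00.000 --> 00:00:00.960\n[Happy music]\n';
console.log(splitHeader(youtubeSample).meta); // { Kind: 'captions', Language: 'en' }
```

The `body` that comes back could then be prefixed with "WEBVTT\n\n" and handed to parse(), much like the regex workaround above.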

Having a header or notes cause the parser to fail

Using one of the examples from MDN:

WEBVTT - Translation of that film I like

NOTE
This translation was done by Kyle so that
some friends can watch it with their parents.

1
00:02:15.000 --> 00:02:20.000
- Ta en kopp varmt te.
- Det är inte varmt.

2
00:02:20.000 --> 00:02:25.000
- Har en kopp te.
- Det smakar som te.  

NOTE This last line may not translate well.

3
00:02:25.000 --> 00:02:30.000
- Ta en kopp

That's the parser output:

ParserError: "Cue identifier needs to be followed by timestamp (cue #0)"
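For comparison, a rough sketch of a block classifier that accepts text after the signature and sets NOTE blocks aside before any cue parsing happens. `classifyBlocks` is a hypothetical function, not how node-webvtt is implemented:

```javascript
// Hypothetical classifier: header text and NOTE blocks are separated from cue blocks.
function classifyBlocks(input) {
  const blocks = input.replace(/\r\n?/g, '\n').trim().split(/\n\s*\n/);
  const header = blocks.shift();
  // "WEBVTT" may be followed by a space or tab and free text on the same line.
  if (!/^WEBVTT(?:[ \t].*)?$/.test(header.split('\n')[0])) {
    throw new Error('Missing WEBVTT signature');
  }

  const notes = [];
  const cueBlocks = [];
  blocks.forEach((block) => {
    if (/^NOTE(?:[ \t\n]|$)/.test(block)) {
      notes.push(block.replace(/^NOTE[ \t\n]?/, '')); // keep only the comment text
    } else {
      cueBlocks.push(block);
    }
  });
  return { header, notes, cueBlocks };
}
```

Run on the MDN example above, this would yield two notes and three cue blocks, so the comment blocks never reach the code that expects a timestamp line.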
