erengy / anitomy

Anime video filename parser

License: Mozilla Public License 2.0

C++ 100.00%

anime

anitomy's Introduction

Anitomy

Anitomy is a C++ library for parsing anime video filenames. It's accurate, fast, and simple to use.

Examples

The following filename...

[TaigaSubs]_Toradora!_(2008)_-_01v2_-_Tiger_and_Dragon_[1280x720_H.264_FLAC][1234ABCD].mkv

...is resolved into these elements:

Release group: TaigaSubs
Anime title: Toradora!
Anime year: 2008
Episode number: 01
Release version: 2
Episode title: Tiger and Dragon
Video resolution: 1280x720
Video term: H.264
Audio term: FLAC
File checksum: 1234ABCD

Here's an example code snippet...

#include <iostream>
#include <anitomy/anitomy.h>

int main() {
  anitomy::Anitomy anitomy;
  anitomy.Parse(L"[Ouroboros]_Fullmetal_Alchemist_Brotherhood_-_01.mkv");

  const auto& elements = anitomy.elements();

  // Elements are iterable, where each element is a category-value pair
  for (const auto& element : elements) {
    std::wcout << element.first << '\t' << element.second << '\n';
  }
  std::wcout << '\n';

  // You can access values directly by using get() and get_all() methods
  std::wcout << elements.get(anitomy::kElementAnimeTitle) << L" #" <<
                elements.get(anitomy::kElementEpisodeNumber) << L" by " <<
                elements.get(anitomy::kElementReleaseGroup) << '\n';

  return 0;
}

...which will output:

12      mkv
13      [Ouroboros]_Fullmetal_Alchemist_Brotherhood_-_01
7       01
2       Fullmetal Alchemist Brotherhood
16      Ouroboros

Fullmetal Alchemist Brotherhood #01 by Ouroboros

How does it work?

Suppose that we're working on the following filename:

"Spice_and_Wolf_Ep01_[1080p,BluRay,x264]_-_THORA.mkv"

The filename is first stripped of its extension and split into groups. Groups are determined by the position of brackets:

"Spice_and_Wolf_Ep01_", "1080p,BluRay,x264", "_-_THORA"

Each group is then split into tokens. In our current example, the delimiter for the enclosed group is a comma (,), while the words in other groups are separated by underscores (_):

"Spice", "and", "Wolf", "Ep01", "1080p", "BluRay", "x264", "-", "THORA"

Note that brackets and delimiters are actually stored as tokens. Here, identified tokens are omitted for our convenience.

Once the tokenizer is done, the parser comes into effect. First, all tokens are compared against a set of known patterns and keywords. This process generally leaves us with nothing but the release group, anime title, episode number and episode title:

"Spice", "and", "Wolf", "Ep01", "-"

The next step is to look for the episode number. Each token that contains a number is analyzed. Here, Ep01 is identified because it begins with a known episode prefix:

"Spice", "and", "Wolf", "-"

Finally, remaining tokens are combined to form the anime title, which is Spice and Wolf. The complete list of elements identified by Anitomy is as follows:

Anime title: Spice and Wolf
Episode number: 01
Video resolution: 1080p
Source: BluRay
Video term: x264
Release group: THORA
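
The grouping, tokenizing, and episode-prefix steps described above can be sketched roughly as follows. This is a simplified illustration, not Anitomy's actual implementation: the helper names (SplitIntoGroups, SplitIntoTokens, StripEpisodePrefix) are hypothetical, brackets and delimiters are discarded rather than stored as tokens, and the delimiter is passed in by the caller instead of being detected per group.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any character found in `separators`,
// dropping empty pieces. Separator characters themselves are discarded.
static std::vector<std::wstring> SplitOnAny(const std::wstring& text,
                                            const std::wstring& separators) {
  std::vector<std::wstring> parts;
  std::wstring current;
  for (wchar_t c : text) {
    if (separators.find(c) != std::wstring::npos) {
      if (!current.empty()) parts.push_back(current);
      current.clear();
    } else {
      current += c;
    }
  }
  if (!current.empty()) parts.push_back(current);
  return parts;
}

// Groups are delimited by brackets. Treating both [] and () as brackets
// is an assumption of this sketch.
std::vector<std::wstring> SplitIntoGroups(const std::wstring& name) {
  return SplitOnAny(name, L"[]()");
}

// Tokens are delimited by whatever delimiter the group uses.
std::vector<std::wstring> SplitIntoTokens(const std::wstring& group,
                                          const std::wstring& delimiters) {
  return SplitOnAny(group, delimiters);
}

// If `token` is `prefix` (compared case-insensitively) followed only by
// digits, return the digits. Otherwise return an empty string.
std::wstring StripEpisodePrefix(const std::wstring& token,
                                const std::wstring& prefix) {
  if (token.size() <= prefix.size()) return L"";
  for (std::size_t i = 0; i < prefix.size(); ++i) {
    if (std::towlower(token[i]) != std::towlower(prefix[i])) return L"";
  }
  for (std::size_t i = prefix.size(); i < token.size(); ++i) {
    if (!std::iswdigit(token[i])) return L"";
  }
  return token.substr(prefix.size());
}
```

Calling SplitIntoGroups on the extension-less example filename yields the three groups listed earlier. Splitting the first group on an underscore gives "Spice", "and", "Wolf", "Ep01", and StripEpisodePrefix applied to "Ep01" with the prefix "Ep" returns "01".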

Why should I use it?

Anime video files are commonly named in a format where the anime title is followed by the episode number, and all the technical details are enclosed within brackets. However, fansub groups tend to use their own naming conventions, and the problem is more complicated than it first appears:

Element order is not always the same.
Technical information is not guaranteed to be enclosed.
Brackets and parentheses may be grouping symbols or a part of the anime/episode title.
Space and underscore are not the only delimiters in use.
A single filename may contain multiple delimiters.

There are so many cases to cover that it's simply not possible to parse all filenames solely with regular expressions. Anitomy tries a different approach, and it succeeds: It's able to parse tens of thousands of filenames per second, with great accuracy.

The following projects make use of Anitomy:

See other repositories for related projects (e.g. interfaces, ports, wrappers).

Are there any exceptions?

Yes, unfortunately. Anitomy fails to identify the anime title and episode number on rare occasions, mostly due to bad naming conventions. See the examples below.

Arigatou.Shuffle!.Ep08.[x264.AAC][D6E43829].mkv

Here, Anitomy would report that this file is the 8th episode of Arigatou Shuffle!, where Arigatou is actually the name of the fansub group.

Spice and Wolf 2

Is this the 2nd episode of Spice and Wolf, or a batch release of Spice and Wolf 2? Without a file extension, there's no way to know. It's up to you to consider both cases.

Suggestions to fansub groups

Please consider abiding by these simple rules before deciding on your naming convention:

Don't enclose anime title, episode number and episode title within brackets. Enclose everything else, including the name of your group.
Don't use parentheses to enclose release information; use square brackets instead. Parentheses should only be used if they are a part of the anime/episode title.
Don't use multiple delimiters in a single filename. If possible, stick with either space or underscore.
Use a separator (e.g. a dash) between anime title and episode number. There are anime titles that end with a number, which creates ambiguity.
Indicate the episode interval in batch releases.

License

Anitomy is licensed under Mozilla Public License 2.0.

anitomy's Issues

Consider License Change to LGPL

Hello,
I would like to use this library in a project I have, but I am unable to do so because of the GPL license. The GPL license requires that if I include it in my project, even as a linked library, I must also license my code under GPL. As this is a private project I am unable to publish the code as required.

Would you please consider switching to a license such as LGPL? This permits linking to private projects while still encouraging contribution to the library. A copy of LGPL can be found here.

It appears you don't mind this, as "MAL Updater OSX" and "Hachidori" are both released under the BSD 3-Clause license and are technically in violation of the terms you have granted them.

There has been a lot of great work done on this project around parsing edge cases in titles and I would really like to take advantage of that without having to start from scratch in my own project.

Thanks,
Zak Kristjanson

Anime recognition fails when anime title has a "."

Notable example: the currently airing anime "NieR:Automata Ver1.1a"

recognition fails:
"NieR:Automata Ver1.1a - 01"

related: erengy/taiga#1110

Episode title parsed as release group name

I have a sample formatted in this way (substitutions are surrounded by curly brackets):
[{Category}] -{Romanized Title}- {Original Title} Vol{Volume Number} 第{Episode Number}話「{Episode Title}」 ({Video Codec} {WxH Resolution} {Audio Codec}).{Extension}

In this situation the episode title is parsed as the release group name and displayed as such in Taiga.

I know enclosing titles within brackets goes against your suggestions in the readme, but I have never seen Japanese brackets (「」) used for group names. Perhaps introducing bias for this pattern could be a solution?

Season-Episode in 3 digit semantics is wrong

Season 1, Episode 3, written as 103, is detected as episode 103.

std::towlower / std::towupper not declared in this scope

Getting the following compilation error on my linux box with gcc 5.2.0:

lib/anitomy/anitomy/string.cpp: In function 'wchar_t anitomy::ToLower(wchar_t)':
lib/anitomy/anitomy/string.cpp:73:41: error: 'towlower' was not declared in this scope
          static_cast<wchar_t>(towlower(c));
                                         ^
lib/anitomy/anitomy/string.cpp: In member function 'wchar_t anitomy::ToUpper::operator()(wchar_t) const':
lib/anitomy/anitomy/string.cpp:80:43: error: 'towupper' was not declared in this scope
            static_cast<wchar_t>(towupper(c));
                                           ^

Looks like towupper and towlower are defined in cwctype so I had to include that in string.cpp to get it to compile.
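
A minimal sketch of the fix described above: std::towlower and std::towupper are declared in the standard header <cwctype>, so including it explicitly removes the dependency on other headers pulling it in. The ToLowerCopy helper is a hypothetical name used only for illustration and is not part of anitomy's string.cpp.

```cpp
#include <cwctype>  // declares std::towlower and std::towupper
#include <string>

// Hypothetical helper: return a lowercase copy of `text`.
std::wstring ToLowerCopy(std::wstring text) {
  for (wchar_t& c : text) {
    c = static_cast<wchar_t>(std::towlower(c));
  }
  return text;
}
```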

Add full-width space to delimiters for tokenization

Just a suggestion since I have seen some videos where the title and episode number are separated by a full-width space (　). It looks like currently only half-width spaces and underscores are included. I may write a PR later if I have the time.
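
As a rough illustration of the suggestion, a delimiter check that also accepts the full-width (ideographic) space U+3000 could look like the sketch below. The IsDelimiter function and its delimiter set are assumptions made for this example and do not reflect how anitomy stores its delimiters.

```cpp
#include <string>

// Hypothetical delimiter check: half-width space, underscore, and the
// full-width space U+3000.
bool IsDelimiter(wchar_t c) {
  static const std::wstring kDelimiters = L" _\u3000";
  return kDelimiters.find(c) != std::wstring::npos;
}
```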

Season detection in S# format

Is there any reason for anitomy to parse 2nd Season or Season 2 as the season number but not parse S2 as well? e.g.:

Hayate no Gotoku 2nd Season 24 (Blu-Ray 1080p) [Chihiro]

Is parsed with the title Hayate no Gotoku and season number 2. But...

[SFW]_Queen's_Blade_S2

...is parsed with the title Queen's Blade S2 with no season number. Is this the expected behavior?
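
For illustration only, recognizing a standalone S# token could be sketched with a regular expression as below. MatchSeasonToken is a hypothetical helper, not an anitomy function, and the one-or-two-digit limit is an assumption of this sketch.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return the season number if `token` is exactly
// "S" followed by one or two digits (e.g. "S2", "s02"), else an empty string.
std::wstring MatchSeasonToken(const std::wstring& token) {
  static const std::wregex pattern(L"[Ss](\\d{1,2})");
  std::wsmatch match;
  if (std::regex_match(token, match, pattern)) {
    return match[1].str();
  }
  return L"";
}
```

Because the whole token must match, a combined token such as S01E01 is not treated as a season-only token here.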

Wonder.Woman.2017.720p.10bit.BluRay.6CH.x265.HEVC

Given Wonder.Woman.2017.720p.10bit.BluRay.6CH.x265.HEVC:

AnitomyElements {
  AnimeTitle: 'Wonder Woman 2017',
  FileExtension: 'mkv',
  FileName: 'Wonder.Woman.2017.720p.10bit.BluRay.6CH.sample',
  Source: 'BluRay',
  VideoResolution: '720p',
  VideoTerm: '10bit' }

Is it possible to identify 2017 as the ReleaseYear rather than part of the title?
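
One way a caller might approach this is to test whether a token is a plausible four-digit year before treating it as part of the title. The sketch below is only an illustration: LooksLikeYear is a hypothetical helper, and the 1900 to 2100 range is an arbitrary assumption rather than a value taken from anitomy.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: true if `token` is exactly four digits and falls
// within an assumed plausible range of years.
bool LooksLikeYear(const std::wstring& token) {
  if (token.size() != 4) return false;
  for (wchar_t c : token) {
    if (!std::iswdigit(c)) return false;
  }
  const int year = std::stoi(token);
  return year >= 1900 && year <= 2100;
}
```

Note that a purely numeric check like this cannot tell a year apart from a four-digit episode number or a resolution such as 1080, so its position in the filename would still need to be considered.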

How Do I Use It

I am using Taiga. After I finished Haikyuu season 1 and played the first episode of season 2, it kept reporting that I was playing the first episode of the first season. (I use anichiraku, a Google Drive collection of anime, so I can't edit the file names.)

Anime Title Inconsistent Parsing Given Anime Type

I was running some tests (https://runkit.com/jaliborc/5c13d05e6ba83b0012bfbcf2) and I noticed this issue with the parser: if you look at the last two tests, you see that Piano no Mori 2 (TV) gives the anime_title Piano no Mori, with season 2 and anime_type TV. But Piano no Mori (TV) 2nd Season gives anime_title Piano no Mori (TV), with the anime_type TV remaining the same.

Group tag being parsed as episode number

[0x539] Somali and the Forest Spirit - S01E01 (WEB 1080p Hi10P AAC) [BB7C6531].mkv is being parsed as "Somali and the Forest Spirit - S01E01" episode 539 in Taiga.
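
A possible guard against this case, sketched under the assumption that a token shaped like a hexadecimal literal should not be read as an episode number. LooksLikeHexLiteral is a hypothetical helper and not part of anitomy.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical helper: true if `token` is "0x" or "0X" followed by at
// least one hexadecimal digit and nothing else.
bool LooksLikeHexLiteral(const std::wstring& token) {
  if (token.size() < 3) return false;
  if (token[0] != L'0' || (token[1] != L'x' && token[1] != L'X')) return false;
  for (std::size_t i = 2; i < token.size(); ++i) {
    if (!std::iswxdigit(token[i])) return false;
  }
  return true;
}
```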

New keyword "WEB"

Some anime releases use a different naming pattern with the WEB keyword:
Kubo.Wont.Let.Me.Be.Invisible.S01E12.VOSTFR.1080p.WEB.x264-TsundereRaws-Wawacity.cyou
It seems to mean WEBRIP, I guess.
I don't know if it really matters.

Compilation error: back_inserter is not a member of std

I tried to compile Taiga with Visual Studio 2015 RTM. Compiling anitomy failed with errors at line 182 of tokenizer.cpp:
Error C2039: 'back_inserter': is not a member of 'std'
Error C3861: 'back_inserter': identifier not found
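
For reference, std::back_inserter is declared in the standard header <iterator>, so these errors typically mean that header is not being included, directly or indirectly. The sketch below shows the include alongside a typical use. CopyNonEmpty is a hypothetical function written for this example and is not the code at the reported line.

```cpp
#include <algorithm>
#include <iterator>  // declares std::back_inserter
#include <string>
#include <vector>

// Hypothetical example: copy the non-empty strings into a new vector.
std::vector<std::wstring> CopyNonEmpty(const std::vector<std::wstring>& tokens) {
  std::vector<std::wstring> result;
  std::copy_if(tokens.begin(), tokens.end(), std::back_inserter(result),
               [](const std::wstring& token) { return !token.empty(); });
  return result;
}
```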

Fail to detect anime with version after the episode

The library fails to recognize some anime if the release version is right next to the episode number:
[Judas] Aharen-san wa Hakarenai - S01E06v2.mkv
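
To illustrate the pattern in question, a combined season/episode token with an optional trailing version could be matched as sketched below. The SeasonEpisode struct, the ParseSeasonEpisode function, and the digit limits in the pattern are all assumptions made for this example. They do not describe anitomy's parser.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct SeasonEpisode {
  bool matched = false;
  std::wstring season;
  std::wstring episode;
  std::wstring version;  // empty when no version suffix is present
};

// Hypothetical helper: parse tokens such as "S01E06" or "S01E06v2".
SeasonEpisode ParseSeasonEpisode(const std::wstring& token) {
  static const std::wregex pattern(
      L"[Ss](\\d{1,2})[Ee](\\d{1,3})(?:[Vv](\\d))?");
  SeasonEpisode result;
  std::wsmatch match;
  if (std::regex_match(token, match, pattern)) {
    result.matched = true;
    result.season = match[1].str();
    result.episode = match[2].str();
    result.version = match[3].str();  // empty if the group did not match
  }
  return result;
}
```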

Incorrect parse with multiple episode number elements

For the title

[Kaleido-subs] Blue Archive the Animation - 07 (S01E07) - (WEB 1080p HEVC x265 10-bit E-AC3 2.0) [3B0015AF]

anitomy seems to be exclusively prioritizing the (S01E07) token resulting in the anime title being parsed as "Blue Archive the Animation - 07" in this example.

Test cases in data.json are failing

Hello,

I ran the latest version of this code against the included unit tests and found that a number of them are failing. Just raising awareness in case this is unintentional.

I only coded the test to check the title, so other props (even in successful tests), may or may not be correct.

expected vs actual:

#14 MISMATCH: Juuni Kokki => (Les 12 Royaumes)
#39 MISMATCH: Kiddy Grade 2 => Kiddy Grade
#64 MISMATCH: Keroro => 148
#78 MISMATCH: Aim For The Top! Gunbuster => Aim For The Top! Gunbuster-ep1
#81 MISMATCH: Mobile Suit Gundam Seed Destiny => encoded by SEED
#82 MISMATCH: ?K? => Image
#98 MISMATCH: Golden Time => ?
#101 MISMATCH: Mangaka-san to Assistant-san to the Animation => 02
#103 MISMATCH: Rozen Maiden 3 => Rozen Maiden
#112 MISMATCH: Death Note => 37 [Ruberia] Death Note
#113 MISMATCH: Accel World - EX => Accel World - EX01
#120 MISMATCH: Akuma no Riddle => EvoBot [Watakushi] Akuma no Riddle
#121 MISMATCH:  => 01 - Land of Visible Pain
#124 MISMATCH: The iDOLM@STER 765 Pro to Iu Monogatari => The iDOLM@STER
#129 MISMATCH: Hidamari Sketch x365 => Hidamari Sketch x365 - 04.1
#130 MISMATCH:  => The Boy in the Iceberg
#138 MISMATCH: The Animatrix => The Animatrix 08.A Detective Story
#144 MISMATCH: Memories Off 3.5 => Memories Off
#146 MISMATCH: Byousoku 5 Centimeter => Byousoku

Suggestions/Issues

I'm not sure if you want all of my suggestions in 1 issue or multiple, but here are my suggestions/the things I've noticed:

"Hi10" and "HEVC2" should be added as keywords.
Support for the 00+00 multiple episode pattern (e.g. [HorribleSubs] Momokuri - 09+10 [720p].mkv).
Support for basic Roman numerals in volume patterns (e.g. Haikyuu!! Vol. III). Maybe only support numbers 1-5 if you plan to store them as keywords.
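
The 00+00 suggestion could be sketched as below, purely as an illustration. MatchPlusEpisodeRange is a hypothetical helper, and the limit of one to three digits per number is an assumption.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: if `token` looks like "09+10", return the two
// numbers as strings. Otherwise both strings in the pair are empty.
std::pair<std::wstring, std::wstring> MatchPlusEpisodeRange(
    const std::wstring& token) {
  static const std::wregex pattern(L"(\\d{1,3})\\+(\\d{1,3})");
  std::wsmatch match;
  if (std::regex_match(token, match, pattern)) {
    return {match[1].str(), match[2].str()};
  }
  return {L"", L""};
}
```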

Nice to have:

Support for "&" multiple episode patterns (e.g Dragon_Ball_Z_Movies_8_&10[720p,BluRay,DTS,x264]_-_THORA)
That may be a bit more involved, though. It may require checking whether the previous token (assuming the current token is 8) is plural (Movies/Specials/Episodes), then checking whether the following token is a "connector" token (e.g. "&" or ","), and if so, gathering every numerical token until we hit a known or delimiter token.

-Interesting?
[Infantjedi] Norn9 - Norn + Nonetto - 12 results in : "Norn9 - Norn Nonetto" but that's such an edge case it can be ignored.

Wonderful library btw 👍