anitomy's Introduction

Anitomy

Anitomy is a C++ library for parsing anime video filenames. It's accurate, fast, and simple to use.

Examples

The following filename...

[TaigaSubs]_Toradora!_(2008)_-_01v2_-_Tiger_and_Dragon_[1280x720_H.264_FLAC][1234ABCD].mkv

...is resolved into these elements:

  • Release group: TaigaSubs
  • Anime title: Toradora!
  • Anime year: 2008
  • Episode number: 01
  • Release version: 2
  • Episode title: Tiger and Dragon
  • Video resolution: 1280x720
  • Video term: H.264
  • Audio term: FLAC
  • File checksum: 1234ABCD

Here's an example code snippet...

#include <iostream>
#include <anitomy/anitomy.h>

int main() {
  anitomy::Anitomy anitomy;
  anitomy.Parse(L"[Ouroboros]_Fullmetal_Alchemist_Brotherhood_-_01.mkv");

  const auto& elements = anitomy.elements();

  // Elements are iterable, where each element is a category-value pair
  for (const auto& element : elements) {
    std::wcout << element.first << '\t' << element.second << '\n';
  }
  std::wcout << '\n';

  // You can access values directly by using get() and get_all() methods
  std::wcout << elements.get(anitomy::kElementAnimeTitle) << L" #" <<
                elements.get(anitomy::kElementEpisodeNumber) << L" by " <<
                elements.get(anitomy::kElementReleaseGroup) << '\n';

  return 0;
}

...which will output:

12      mkv
13      [Ouroboros]_Fullmetal_Alchemist_Brotherhood_-_01
7       01
2       Fullmetal Alchemist Brotherhood
16      Ouroboros

Fullmetal Alchemist Brotherhood #01 by Ouroboros

How does it work?

Suppose that we're working on the following filename:

"Spice_and_Wolf_Ep01_[1080p,BluRay,x264]_-_THORA.mkv"

The filename is first stripped of its extension and split into groups. Groups are determined by the position of brackets:

"Spice_and_Wolf_Ep01_", "1080p,BluRay,x264", "_-_THORA"

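The grouping step can be sketched as follows. This is only an illustration, not Anitomy's actual tokenizer: StripExtension and SplitIntoGroups are hypothetical helper names, and only square brackets are treated as grouping symbols here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: drops everything from the last '.' onwards.
std::wstring StripExtension(const std::wstring& filename) {
  const auto pos = filename.rfind(L'.');
  return pos == std::wstring::npos ? filename : filename.substr(0, pos);
}

// Hypothetical helper: splits the name at square brackets and keeps the
// non-empty pieces. The brackets themselves are discarded in this sketch.
std::vector<std::wstring> SplitIntoGroups(const std::wstring& name) {
  std::vector<std::wstring> groups;
  std::wstring current;
  for (const wchar_t c : name) {
    if (c == L'[' || c == L']') {
      if (!current.empty()) groups.push_back(current);
      current.clear();
    } else {
      current += c;
    }
  }
  if (!current.empty()) groups.push_back(current);
  return groups;
}
```

Calling SplitIntoGroups(StripExtension(...)) on the filename above yields the three groups listed.
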
Each group is then split into tokens. In our current example, the delimiter for the enclosed group is a comma (,), while the words in other groups are separated by underscores (_):

"Spice", "and", "Wolf", "Ep01", "1080p", "BluRay", "x264", "-", "THORA"

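A minimal sketch of splitting one group into tokens, assuming the delimiter set for the group is already known. SplitIntoTokens is a hypothetical name, and unlike the real tokenizer this version drops the delimiters instead of storing them as tokens.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a group at any character found in `delimiters`
// and returns the non-empty pieces in order.
std::vector<std::wstring> SplitIntoTokens(const std::wstring& group,
                                          const std::wstring& delimiters) {
  std::vector<std::wstring> tokens;
  std::wstring::size_type start = 0;
  // Skip leading delimiters, then read up to the next delimiter.
  while ((start = group.find_first_not_of(delimiters, start)) !=
         std::wstring::npos) {
    auto end = group.find_first_of(delimiters, start);
    if (end == std::wstring::npos) end = group.size();
    tokens.push_back(group.substr(start, end - start));
    start = end;
  }
  return tokens;
}
```

For example, SplitIntoTokens(L"1080p,BluRay,x264", L",") gives the three technical terms, and SplitIntoTokens(L"_-_THORA", L"_") gives "-" and "THORA".
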
Note that brackets and delimiters are actually stored as tokens. Here, identified tokens are omitted for our convenience.

Once the tokenizer is done, the parser comes into effect. First, all tokens are compared against a set of known patterns and keywords. This process generally leaves us with nothing but the release group, anime title, episode number and episode title:

"Spice", "and", "Wolf", "Ep01", "-"

The next step is to look for the episode number. Each token that contains a number is analyzed. Here, Ep01 is identified because it begins with a known episode prefix:

"Spice", "and", "Wolf", "-"

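The prefix check described above might look like the sketch below. The prefix list and the name MatchEpisodePrefix are assumptions made for illustration; Anitomy's own keyword list and matching rules may differ.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical helper: returns the digits that follow a known episode prefix,
// or an empty string when the token does not match.
std::wstring MatchEpisodePrefix(const std::wstring& token) {
  // Assumed prefix list, longest first so "EPISODE" is tried before "EP".
  static const std::vector<std::wstring> kPrefixes = {L"EPISODE", L"EP", L"E"};

  // Compare case-insensitively by upper-casing a copy of the token.
  std::wstring upper;
  for (const wchar_t c : token)
    upper += static_cast<wchar_t>(std::towupper(c));

  for (const auto& prefix : kPrefixes) {
    if (upper.size() <= prefix.size() ||
        upper.compare(0, prefix.size(), prefix) != 0)
      continue;
    // Everything after the prefix must be a digit.
    const std::wstring rest = token.substr(prefix.size());
    bool all_digits = true;
    for (const wchar_t c : rest) {
      if (!std::iswdigit(c)) {
        all_digits = false;
        break;
      }
    }
    if (all_digits) return rest;
  }
  return std::wstring();
}
```

With this sketch, "Ep01" yields "01", while "Wolf" and "x264" yield an empty string.
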
Finally, remaining tokens are combined to form the anime title, which is Spice and Wolf. The complete list of elements identified by Anitomy is as follows:

  • Anime title: Spice and Wolf
  • Episode number: 01
  • Video resolution: 1080p
  • Source: BluRay
  • Video term: x264
  • Release group: THORA

Why should I use it?

Anime video files are commonly named in a format where the anime title is followed by the episode number, and all the technical details are enclosed within brackets. However, fansub groups tend to use their own naming conventions, and the problem is more complicated than it first appears:

  • Element order is not always the same.
  • Technical information is not guaranteed to be enclosed.
  • Brackets and parentheses may be grouping symbols or a part of the anime/episode title.
  • Space and underscore are not the only delimiters in use.
  • A single filename may contain multiple delimiters.

There are so many cases to cover that it's simply not possible to parse all filenames solely with regular expressions. Anitomy tries a different approach, and it succeeds: It's able to parse tens of thousands of filenames per second, with great accuracy.

The following projects make use of Anitomy:

See other repositories for related projects (e.g. interfaces, ports, wrappers).

Are there any exceptions?

Yes, unfortunately. Anitomy fails to identify the anime title and episode number on rare occasions, mostly due to bad naming conventions. See the examples below.

Arigatou.Shuffle!.Ep08.[x264.AAC][D6E43829].mkv

Here, Anitomy would report that this file is the 8th episode of Arigatou Shuffle!, whereas Arigatou is actually the name of the fansub group.

Spice and Wolf 2

Is this the 2nd episode of Spice and Wolf, or a batch release of Spice and Wolf 2? Without a file extension, there's no way to know. It's up to you to consider both cases.

Suggestions to fansub groups

Please consider abiding by these simple rules before deciding on your naming convention:

  • Don't enclose anime title, episode number and episode title within brackets. Enclose everything else, including the name of your group.
  • Don't use parentheses to enclose release information; use square brackets instead. Parentheses should only be used if they are a part of the anime/episode title.
  • Don't use multiple delimiters in a single filename. If possible, stick with either space or underscore.
  • Use a separator (e.g. a dash) between anime title and episode number. There are anime titles that end with a number, which creates ambiguity.
  • Indicate the episode interval in batch releases.

License

Anitomy is licensed under Mozilla Public License 2.0.

anitomy's People

Contributors

erengy, thaunknown, tophf, xtansia

anitomy's Issues

Consider License Change to LGPL

Hello,
I would like to use this library in a project I have, but I am unable to do so because of the GPL license. The GPL license requires that if I include it in my project, even as a linked library, I must also license my code under GPL. As this is a private project I am unable to publish the code as required.

Would you please consider switching to a license such as LGPL? This permits linking to private projects while still encouraging contribution to the library. A copy of LGPL can be found here.

It appears you don't mind this, as "MAL Updater OSX" and "Hachidori" are both released under the BSD 3-Clause license and are technically in violation of the terms you have granted them.

There has been a lot of great work done on this project around parsing edge cases in titles and I would really like to take advantage of that without having to start from scratch in my own project.

Thanks,
Zak Kristjanson

Episode title parsed as release group name

I have a sample formatted in this way (substitutions are surrounded by curly brackets):
[{Category}] -{Romanized Title}- {Original Title} Vol{Volume Number} 第{Episode Number}話 「{Episode Title}」 ({Video Codec} {WxH Resolution} {Audio Codec}).{Extension}

In this situation the episode title is parsed as the release group name and displayed as such in Taiga.

I know enclosing titles within brackets goes against your suggestions in the readme, but I have never seen Japanese brackets (「」) used for group names. Perhaps introducing bias for this pattern could be a solution?

std::towlower / std::towupper not declared in this scope

Getting the following compilation error on my linux box with gcc 5.2.0:

lib/anitomy/anitomy/string.cpp: In function 'wchar_t anitomy::ToLower(wchar_t)':
lib/anitomy/anitomy/string.cpp:73:41: error: 'towlower' was not declared in this scope
          static_cast<wchar_t>(towlower(c));
                                         ^
lib/anitomy/anitomy/string.cpp: In member function 'wchar_t anitomy::ToUpper::operator()(wchar_t) const':
lib/anitomy/anitomy/string.cpp:80:43: error: 'towupper' was not declared in this scope
            static_cast<wchar_t>(towupper(c));
                                           ^

It looks like towupper and towlower are declared in <cwctype>, so I had to include that header in string.cpp to get it to compile.
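A small self-contained sketch of the fix, assuming nothing about the rest of string.cpp: std::towlower and std::towupper are declared in <cwctype>, so including that header explicitly avoids relying on another header pulling it in. ToLower and ToLowerCopy here are illustrative stand-ins, not the library's code.

```cpp
#include <cwctype>  // declares std::towlower and std::towupper
#include <string>

// Illustrative stand-in for a lower-casing helper.
wchar_t ToLower(wchar_t c) {
  return static_cast<wchar_t>(std::towlower(c));
}

// Lower-cases a copy of the string, one character at a time.
std::wstring ToLowerCopy(std::wstring text) {
  for (auto& c : text) c = ToLower(c);
  return text;
}
```
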

Add full-width space to delimiters for tokenization

Just a suggestion since I have seen some videos where the title and episode number are separated by a full-width space ( ). It looks like currently only half-width spaces and underscores are included. I may write a PR later if I have the time.
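As a rough sketch of the suggestion, the full-width (ideographic) space U+3000 can be treated like any other delimiter. IsDelimiter and Tokenize are hypothetical names, and the delimiter set shown is an assumption rather than the library's actual list.

```cpp
#include <string>
#include <vector>

// Assumed delimiter set: half-width space, underscore, and the
// full-width (ideographic) space U+3000.
bool IsDelimiter(wchar_t c) {
  return c == L' ' || c == L'_' || c == L'\u3000';
}

// Splits text at delimiters and returns the non-empty pieces.
std::vector<std::wstring> Tokenize(const std::wstring& text) {
  std::vector<std::wstring> tokens;
  std::wstring current;
  for (const wchar_t c : text) {
    if (IsDelimiter(c)) {
      if (!current.empty()) tokens.push_back(current);
      current.clear();
    } else {
      current += c;
    }
  }
  if (!current.empty()) tokens.push_back(current);
  return tokens;
}
```
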

Season detection in S# format

Is there any reason for anitomy to parse 2nd Season or Season 2 as the season number but not parse S2 as well? e.g.:

Hayate no Gotoku 2nd Season 24 (Blu-Ray 1080p) [Chihiro]

Is parsed with the title Hayate no Gotoku and season number 2. But...

[SFW]_Queen's_Blade_S2

...is parsed with the title Queen's Blade S2 with no season number. Is this the expected behavior?
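For illustration, recognizing a standalone S# token could be sketched as below. ParseSeasonToken is a hypothetical helper, and limiting the number to one or two digits is an assumption meant to avoid matching ordinary words that start with S.

```cpp
#include <string>

// Hypothetical helper: returns the season number for tokens such as "S2" or
// "s02", or -1 when the token does not look like a season marker.
int ParseSeasonToken(const std::wstring& token) {
  // Assumption: "S" followed by one or two digits.
  if (token.size() < 2 || token.size() > 3) return -1;
  if (token[0] != L'S' && token[0] != L's') return -1;

  int value = 0;
  for (std::wstring::size_type i = 1; i < token.size(); ++i) {
    if (token[i] < L'0' || token[i] > L'9') return -1;
    value = value * 10 + static_cast<int>(token[i] - L'0');
  }
  return value;
}
```

A real implementation would also need to decide whether the token belongs to the title, which this sketch does not attempt.
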

Wonder.Woman.2017.720p.10bit.BluRay.6CH.x265.HEVC

Given Wonder.Woman.2017.720p.10bit.BluRay.6CH.x265.HEVC:

AnitomyElements {
  AnimeTitle: 'Wonder Woman 2017',
  FileExtension: 'mkv',
  FileName: 'Wonder.Woman.2017.720p.10bit.BluRay.6CH.sample',
  Source: 'BluRay',
  VideoResolution: '720p',
  VideoTerm: '10bit' }

Is it possible to identify 2017 as the ReleaseYear rather than part of the title?
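One way to approach this is a plausibility check on four-digit tokens, sketched below. LooksLikeYear is a hypothetical helper and the 1900 to 2099 range is an assumption. Note that a bare number can also be part of a title, so a match is only a hint.

```cpp
#include <string>

// Hypothetical helper: true when the token is exactly four digits and falls
// inside an assumed range of plausible release years.
bool LooksLikeYear(const std::wstring& token) {
  if (token.size() != 4) return false;
  int value = 0;
  for (const wchar_t c : token) {
    if (c < L'0' || c > L'9') return false;
    value = value * 10 + static_cast<int>(c - L'0');
  }
  return value >= 1900 && value <= 2099;  // assumed bounds
}
```
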

How Do I Use It

I am using Taiga. After I finished season 1 of Haikyuu and played the first episode of season 2, it kept reporting that I was playing the first episode of the first season. (I use Anichiraku, a Google Drive of anime, so I can't edit the file names.)

Group tag being parsed as episode number

[0x539] Somali and the Forest Spirit - S01E01 (WEB 1080p Hi10P AAC) [BB7C6531].mkv is being parsed as "Somali and the Forest Spirit - S01E01" episode 539 in Taiga.
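A possible guard against this case is to skip tokens that look like hexadecimal literals before searching for an episode number. The sketch below uses a hypothetical IsHexLiteral helper and is not how Anitomy currently behaves.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: true for tokens such as "0x539", i.e. "0x" or "0X"
// followed by at least one hexadecimal digit and nothing else.
bool IsHexLiteral(const std::wstring& token) {
  if (token.size() < 3 || token[0] != L'0' ||
      (token[1] != L'x' && token[1] != L'X'))
    return false;
  for (std::wstring::size_type i = 2; i < token.size(); ++i) {
    if (!std::iswxdigit(token[i])) return false;
  }
  return true;
}
```
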

New keyword "WEB"

Sometimes anime releases use a different pattern with the WEB keyword:
Kubo.Wont.Let.Me.Be.Invisible.S01E12.VOSTFR.1080p.WEB.x264-TsundereRaws-Wawacity.cyou
It seems to mean WEBRIP, I guess.
I don't know if it really matters.

Compilation error: back_inserter is not a member of std

I tried to compile Taiga with Visual Studio 2015 RTM. Compiling anitomy failed with an error at line 182 of tokenizer.cpp:
Error C2039: 'back_inserter': is not a member of 'std'
Error C3861: 'back_inserter': identifier not found
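std::back_inserter is declared in <iterator>, so a likely fix is to include that header explicitly instead of relying on another header to bring it in. The snippet below is a generic illustration with a hypothetical CopyNonEmpty function, not the code from tokenizer.cpp.

```cpp
#include <algorithm>
#include <iterator>  // declares std::back_inserter
#include <string>
#include <vector>

// Hypothetical example: copies the non-empty tokens into a new vector.
std::vector<std::wstring> CopyNonEmpty(
    const std::vector<std::wstring>& tokens) {
  std::vector<std::wstring> result;
  std::copy_if(tokens.begin(), tokens.end(), std::back_inserter(result),
               [](const std::wstring& token) { return !token.empty(); });
  return result;
}
```
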

Incorrect parse with multiple episode number elements

For the title

[Kaleido-subs] Blue Archive the Animation - 07 (S01E07) - (WEB 1080p HEVC x265 10-bit E-AC3 2.0) [3B0015AF]

anitomy seems to be exclusively prioritizing the (S01E07) token, resulting in the anime title being parsed as "Blue Archive the Animation - 07" in this example.

Test cases in data.json are failing

Hello,

I ran the latest version of this code against the included unit tests and found that a number of them are failing. Just raising awareness in case this is unintentional.

I only coded the test to check the title, so other props (even in successful tests), may or may not be correct.

expected vs actual:

#14 MISMATCH: Juuni Kokki => (Les 12 Royaumes)
#39 MISMATCH: Kiddy Grade 2 => Kiddy Grade
#64 MISMATCH: Keroro => 148
#78 MISMATCH: Aim For The Top! Gunbuster => Aim For The Top! Gunbuster-ep1
#81 MISMATCH: Mobile Suit Gundam Seed Destiny => encoded by SEED
#82 MISMATCH: ?K? => Image
#98 MISMATCH: Golden Time => ?
#101 MISMATCH: Mangaka-san to Assistant-san to the Animation => 02
#103 MISMATCH: Rozen Maiden 3 => Rozen Maiden
#112 MISMATCH: Death Note => 37 [Ruberia] Death Note
#113 MISMATCH: Accel World - EX => Accel World - EX01
#120 MISMATCH: Akuma no Riddle => EvoBot [Watakushi] Akuma no Riddle
#121 MISMATCH:  => 01 - Land of Visible Pain
#124 MISMATCH: The iDOLM@STER 765 Pro to Iu Monogatari => The iDOLM@STER
#129 MISMATCH: Hidamari Sketch x365 => Hidamari Sketch x365 - 04.1
#130 MISMATCH:  => The Boy in the Iceberg
#138 MISMATCH: The Animatrix => The Animatrix 08.A Detective Story
#144 MISMATCH: Memories Off 3.5 => Memories Off
#146 MISMATCH: Byousoku 5 Centimeter => Byousoku

Suggestions/Issues

I'm not sure if you want all of my suggestions in 1 issue or multiple, but here are my suggestions/the things I've noticed:

  • "Hi10" and "HEVC2" should be added as keywords.
  • Support for the 00+00 multiple-episode pattern (e.g. [HorribleSubs] Momokuri - 09+10 [720p].mkv).
  • Support for basic Roman numerals in volume patterns (e.g. Haikyuu!! Vol. III). Maybe only support numbers 1-5 if you plan to store them as keywords.
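The 00+00 pattern from the list above could be recognized along these lines. SplitEpisodePair and IsAllDigits are hypothetical names, and requiring digits on both sides of the plus sign is an assumption meant to leave tokens such as "Norn+Nonetto" untouched.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true when the text is non-empty and all digits.
bool IsAllDigits(const std::wstring& text) {
  if (text.empty()) return false;
  for (const wchar_t c : text) {
    if (c < L'0' || c > L'9') return false;
  }
  return true;
}

// Hypothetical helper: splits "09+10" into {"09", "10"}. Returns an empty
// vector when the token is not two numbers joined by a single '+'.
std::vector<std::wstring> SplitEpisodePair(const std::wstring& token) {
  const auto plus = token.find(L'+');
  if (plus == std::wstring::npos) return {};
  const std::wstring lhs = token.substr(0, plus);
  const std::wstring rhs = token.substr(plus + 1);
  if (!IsAllDigits(lhs) || !IsAllDigits(rhs)) return {};
  return {lhs, rhs};
}
```
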

Nice to have:

  • Support for "&" multiple episode patterns (e.g Dragon_Ball_Z_Movies_8_&10[720p,BluRay,DTS,x264]_-_THORA)
    That may be a bit more involved, though. It may require checking whether the previous token (assuming the current token is 8) is plural (Movies/Specials/Episodes), then checking whether the following token is a "connector" type of token (e.g. "&" or ","), and if so, gathering every numerical token until we hit a known token or a delimiter.

-Interesting?
[Infantjedi] Norn9 - Norn + Nonetto - 12 results in : "Norn9 - Norn Nonetto" but that's such an edge case it can be ignored.

Wonderful library btw 👍
