mtgencode

Utilities to assist in the process of generating Magic the Gathering cards with neural nets. Inspired by this thread on the mtgsalvation forums:

http://www.mtgsalvation.com/forums/creativity/custom-card-creation/612057-generating-magic-cards-using-deep-recurrent-neural

The purpose of this code is mostly to wrangle text between various human and machine readable formats. The original input comes from mtgjson; this is filtered and reduced to one of several input formats intended for neural network training, such as the standard encoded format used in data/output.txt. Any json or encoded data, including output from appropriately trained neural nets, can then be interpreted as cards and decoded to a human readable format, such as a text spoiler, Magic Set Editor 2 set file, or a pretty, portable html file that can be viewed in any browser.

Requirements

I'm running this code on Ubuntu 14.04 with Python 2.7. Unfortunately it does not work with Python 3, though apparently it isn't too hard to use 2to3 to automatically convert it.

For the most part it should work out of the box, though there are a few optional bonus features that will make it much better. See DEPENDENCIES.md.

This code does not have anything to do with neural nets; if you want to generate cards with them, see the tutorial.

Usage

Functionality is provided by two main driver scripts: encode.py and decode.py. Logically, encode.py handles encoding to formats intended to feed into a neural network, while decode.py handles decoding to formats intended to be read by a human.

encode.py

usage: encode.py [-h] [-e {std,named,noname,rfields,old,norarity,vec,custom}]
                 [-r] [--nolinetrans] [--nolabel] [-s] [-v]
                 infile [outfile]

positional arguments:
  infile                encoded card file or json corpus to encode
  outfile               output file, defaults to stdout

optional arguments:
  -h, --help            show this help message and exit
  -e {std,named,noname,rfields,old,norarity,vec,custom}, --encoding {std,named,noname,rfields,old,norarity,vec,custom}
                        encoding format to use
  -r, --randomize       randomize the order of symbols in mana costs
  --nolinetrans         don't reorder lines of card text
  --nolabel             don't label fields
  -s, --stable          don't randomize the order of the cards
  -v, --verbose         verbose output

The supported encodings are:

Argument   Description
std        Standard format
named      Name first
noname     No name field at all
rfields    Randomize the order of the fields, using only the label to distinguish which field is which.
old        Legacy format
norarity   Older legacy format
vec        Produce a content vector for each card; used with word2vec.
custom     Blank format slot, intended to help users add their own formats to the Python source.

decode.py

usage: decode.py [-h] [-e {std,named,noname,rfields,old,norarity,vec,custom}]
                 [-g] [-f] [-c] [-d] [-v] [-mse] [-html]
                 infile [outfile]

positional arguments:
  infile                encoded card file or json corpus to encode
  outfile               output file, defaults to stdout

optional arguments:
  -h, --help            show this help message and exit
  -e {std,named,noname,rfields,old,norarity,vec,custom}, --encoding {std,named,noname,rfields,old,norarity,vec,custom}
                        encoding format to use
  -g, --gatherer        emulate Gatherer visual spoiler
  -f, --forum           use pretty mana encoding for mtgsalvation forum
  -c, --creativity      use CBOW fuzzy matching to check creativity of cards
  -d, --dump            dump out lots of information about invalid cards
  -v, --verbose         verbose output
  -mse, --mse           use Magic Set Editor 2 encoding; will output as .mse-
                        set file
  -html, --html         create a .html file with pretty forum formatting

The default output is a text spoiler which modifies the output of the neural net as little as possible while making it human readable. Specifying the -g option will produce a prettier, Gatherer-inspired text spoiler with heavier-weight transformations applied to the text, such as capitalization. The -f option encodes mana symbols in the format used by the mtgsalvation forum; this is useful if you want to cut and paste your spoiler into a post to share it.

Passing the -mse option will cause decode.py to produce both the hilarious internal MSE text format as well as an actual mse set file, which is really just a renamed zip archive. The -f and -g flags will be respected in the text that is dumped to each card's notes field.

Finally, the -c and -d options will print out additional data about the quality of the cards. Running with -c is extremely slow due to the massive amount of computation involved, though at least we can do it in parallel over all of your processor cores; -d is probably a good idea to use in general unless you're trying to produce pretty output to show off. Using html mode is especially useful with -c as we can link to visual spoilers from magiccards.info.

Examples

To generate the standard encoding in data/output.txt, I run:

./encode.py -v data/AllSets.json data/output.txt

Of course, this requires that you've downloaded the mtgjson corpus to data/AllSets.json, and are running from the root of the repo.

If I wanted to convert that standard output to a Magic Set Editor 2 set, I'd run:

./decode.py -v data/output.txt data/allcards -f -g -d

This will produce a useless text file called data/allcards, and a set file called data/allcards.mse-set that you can open with MSE2. The -f and -g options will cause the text spoiler included in the notes field of each card in the set to be a pretty Gatherer-inspired affair that you could cut and paste onto the mtgsalvation forum. The -d option will dump additional information if any of the cards are invalidly formatted, which probably won't do anything because all existing magic cards are encoded correctly. Specifying the -c option here would be a bad idea; it would probably take several days to run.

Scripts

A bunch of additional data processing functionality is provided by the files in scripts/. Right now there isn't a whole lot, but more tools might be added in the future, to do things such as convert card dumps into .arff files that could be analyzed in Weka.

Currently, scripts/summarize.py will build a bunch of big data mining indices and use them to print out interesting statistics about a dump of cards. If you want to use mtgencode to do your own data analysis, taking a look at it would be a good place to start.

Tutorial

This tutorial will cover how to generate cards from scratch using neural nets.

Set up a Linux environment

If you're already running on Linux, hooray! If not, you have a few options. The easiest is probably to use a virtual machine; the disadvantage of this approach is that it will prevent you from using a graphics card to train the neural net, which speeds things up immensely. For reference, my GTX Titan is about 10x faster than my overclocked 8-core i7-5960X.

The other option is to dual boot your machine (which is what I do) or otherwise acquire a machine that you can run Linux on natively. How exactly you do this is beyond the scope of this tutorial.

If you do decide to go the virtual machine route:

  1. Download some sort of virtual machine software. I recommend VirtualBox.
  2. Download a Linux operating system. I recommend Ubuntu.
  3. Create a virtual machine, and install the operating system on it.

IMPORTANT NOTE: Training neural nets is extremely CPU intensive, and rather memory intensive as well. If you don't want training to take multiple weeks, it's a very good idea to give your virtual machine as many processor cores and as much memory as you can spare, and to monitor system performance with the 'top' command to make sure you aren't swapping, as that will degrade performance immensely.

You should be able to boot up the virtual machine and use whatever operating system you installed. If you're new to Linux, you might want to familiarize yourself with it a little. For my own sanity, I'm going to assume at least basic familiarity. Most of what we'll be doing will be in terminals; if the instructions say to do something and then provide some code in a block quote, it probably means to type that into a terminal, one line at a time.

Set up the neural net code

We're ultimately going to use the code from the mtg-rnn repo; if anything is unclear you can refer to the documentation there as well.

First, we need to install some dependencies. The primary one is Torch, the scientific computing framework the neural net code is written in. Directions are here.

Next, open a terminal and install some additional lua packages:

luarocks install nngraph
luarocks install optim

Now we'll clone the git repo with the neural net code. You'll need git installed, if it isn't:

sudo apt-get install git

Then go to your home directory (or wherever you want to put the repo, it can be anywhere really) and clone it:

cd ~
git clone https://github.com/billzorn/mtg-rnn.git

This should create the folder mtg-rnn, with a bunch of files in it. To check if it works, try:

cd ~/mtg-rnn
th train.lua --help

A large usage message should be printed. If you get an error, then check to make sure Torch is working. As always, Google is your best friend when anything goes wrong.

Set up mtgencode

Go back to your home directory (or wherever) and clone mtgencode as well:

cd ~
git clone https://github.com/billzorn/mtgencode.git

This should create the folder mtgencode, also with a bunch of files in it.

You'll need Python to run it; to get full functionality, consult DEPENDENCIES.md. But, it should work with just Python. To install Python:

sudo apt-get install python

To check if it works:

cd ~/mtgencode
./encode.py --help

Again, you should see a usage message; if you don't, make sure Python is working. mtgencode uses Python 2.7, so if you think your default python is Python 3, you can try:

python2 encode.py --help

instead of running the script directly.

Generating an encoded corpus for training

If you just want to train with the default corpus, you can skip this step, as it already exists in mtg-rnn. Just replace every instance of 'custom_encoding' in the commands below with 'mtgencode-std'.

To generate an encoded corpus, you'll first need to download AllSets.json from mtgjson.com to data/AllSets.json. Then to encode it:

./encode.py -v data/AllSets.json data/custom_encoding.txt

This will create the file data/custom_encoding.txt with your encoding in it. You can add some options to create a different encoding; consult the usage of encode.py.

Now copy this encoded corpus over to mtg-rnn:

cd ~/mtg-rnn
mkdir data/custom_encoding
cp ~/mtgencode/data/custom_encoding.txt data/custom_encoding/input.txt

The input file does have to be named input.txt, though you can name the folder that holds it, under mtg-rnn/data/, whatever you want.

Training a neural net

There are lots of parameters to control training. With a good GPU, I can train a 3-layer, size 512 network in a few hours; on a CPU this will probably take at least a day.

Most networks we use are about that size. I'd recommend avoiding anything much larger, as they don't seem to produce appreciably better results and take longer to train. The only other parameter you really have to change from the defaults is seq_length, which we usually set somewhere from 120-200. If this causes memory issues you can reduce batch_size slightly to compensate.

A sample training command might look like this:

th train.lua -gpuid -1 -rnn_size 256 -num_layers 3 -seq_length 200 -data_dir data/custom_encoding -checkpoint_dir cv/custom_format-256/ -eval_val_every 1000 -seed 7767

This tells the neural network to train using the corpus in data/custom_encoding/, and to output periodic checkpoints to the directory cv/custom_format-256/. The option "-gpuid -1" means to use the CPU, not a GPU (which won't be possible in VirtualBox anyway). The final options, -eval_val_every and -seed, aren't necessary, but I like to specify them. The seed will be set to a fixed 123 if you don't specify one yourself. If you're generating too many checkpoints and filling up your disk, you can increase the number of iterations between saving them by increasing the argument to -eval_val_every.

If all goes well, you should see the neural net code do some stuff and then start training, reporting training loss and batch times as it goes:

1/112100 (epoch 0.000), train_loss = 4.21492900, grad/param norm = 3.1264e+00, time/batch = 4.73s
2/112100 (epoch 0.001), train_loss = 4.29372822, grad/param norm = 8.6741e+00, time/batch = 3.62s
3/112100 (epoch 0.001), train_loss = 4.02817964, grad/param norm = 8.0445e+00, time/batch = 3.57s
...

This process can take a while, so go to sleep or something and come back in the morning. The train_loss should eventually start to decrease and settle around 0.5 or so; if it doesn't, then something is wrong and the neural net will probably produce gibberish.
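
If you'd rather check on the loss with a script than by eye, the log lines above are regular enough to parse. This is a minimal sketch, assuming the log format is exactly what is shown above; parse_log_line is a hypothetical helper, not part of mtg-rnn or mtgencode.

```python
import re

# Matches the start of a training log line, e.g.
# "1/112100 (epoch 0.000), train_loss = 4.21492900, ..."
LOG_RE = re.compile(r'^(\d+)/(\d+) \(epoch ([\d.]+)\), train_loss = ([\d.]+)')


def parse_log_line(line):
    """Return (iteration, total_iterations, epoch, train_loss), or None."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None  # not a training progress line
    return (int(match.group(1)), int(match.group(2)),
            float(match.group(3)), float(match.group(4)))


line = ('1/112100 (epoch 0.000), train_loss = 4.21492900, '
        'grad/param norm = 3.1264e+00, time/batch = 4.73s')
print(parse_log_line(line))
```

Feeding every line of a saved log through this and looking at the last few train_loss values is a quick way to see whether the loss is heading toward the 0.5 range.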

Every N iterations, where N is the argument to -eval_val_every, the neural net will generate a checkpoint in cv/custom_format-256/. They look like this:

lm_lstm_epoch2.23_0.5367.t7

The numbers are important; the first is the epoch, which tells you how many passes the neural network had made over the training data when it saved the checkpoint, and the second is the validation loss of the checkpoint. Validation loss is effectively a measurement of how accurate the checkpoint is at producing text that resembles the encoded format, the lower the better. The two numbers are separated by an underscore, so for the example above, the checkpoint is from epoch 2.23, and it had a validation loss of 0.5367, which isn't great but probably isn't gibberish either.
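
The naming scheme makes it easy to pick the checkpoint with the lowest validation loss automatically. This is a minimal sketch, assuming every checkpoint follows the pattern shown above; parse_checkpoint and best_checkpoint are hypothetical helpers, and the second filename in the example is made up for illustration.

```python
import re

# lm_lstm_epoch<epoch>_<validation loss>.t7
CHECKPOINT_RE = re.compile(r'lm_lstm_epoch(\d+\.\d+)_(\d+\.\d+)\.t7$')


def parse_checkpoint(name):
    """Return (epoch, validation_loss) for a checkpoint name, or None."""
    match = CHECKPOINT_RE.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_checkpoint(names):
    """Return the name with the lowest validation loss, or None."""
    scored = []
    for name in names:
        parsed = parse_checkpoint(name)
        if parsed is not None:
            scored.append((parsed[1], name))
    if not scored:
        return None
    return min(scored)[1]


names = ['lm_lstm_epoch2.23_0.5367.t7',
         'lm_lstm_epoch1.00_0.7000.t7']  # second name is made up
print(parse_checkpoint(names[0]))
print(best_checkpoint(names))
```

In practice you would pass it the output of os.listdir on your checkpoint directory.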

Sampling checkpoints to generate cards

Once you're done training, or you've got enough checkpoints and you're just impatient, you can sample to generate actual cards. If the network is still training, you'll probably want to pause it by typing Control-Z in the terminal; you can resume it later with the command 'fg'. Training will use all available CPU resources all by itself, so trying to sample at the same time is a recipe for slow.

Once you're ready, go to the mtg-rnn repo. A typical sampling command might look like this:

th sample.lua cv/custom_format-256/lm_lstm_epochXX.XX_X.XXXX.t7 -gpuid -1 -temperature 0.9 -length 2000 | tee cards.txt

Replace the Xs in the checkpoint name with the numbers in the name of an actual checkpoint; tab completion is your friend. This command will sample 2000 characters, which is probably something like 20 cards, and both print them to the terminal and write them to a file called cards.txt. The interesting options here are the temperature and the length. Temperature controls how cautious the network is; lower values produce more probable output, while higher values make it wilder and more creative. Somewhere in the range of 0.7-1.0 usually works best. Length is just how many characters to generate. You can also specify a seed with -seed, exactly as for training, which is a particularly good idea if you just generated a few million characters and would like to see something new. The default seed is fixed at 123, again exactly as for training.
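
To get a feel for what temperature does, here is a sketch of the usual technique: divide the log-probabilities by the temperature, then renormalize. I'm assuming sample.lua does something along these lines; this is an illustration in Python, not a port of the actual Lua code, and apply_temperature is a hypothetical name.

```python
import math


def apply_temperature(log_probs, temperature):
    """Rescale log-probabilities by a temperature and renormalize."""
    scaled = [lp / temperature for lp in log_probs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# A toy distribution over three possible next characters.
log_probs = [math.log(p) for p in (0.6, 0.3, 0.1)]
for temperature in (0.5, 1.0, 1.5):
    print(temperature, apply_temperature(log_probs, temperature))
```

Below 1.0 the most likely character gets an even larger share of the probability, which is the cautious behavior; above 1.0 the distribution flattens out and unlikely characters get picked more often.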

You can read the output yourself, but it might be painful, especially if you're using randomly ordered fields.

Postprocessing neural net output with mtgencode

Once you've generated some cards, you can turn them into pretty text spoilers or a set file for MSE2.

Go back to mtgencode, and run something like:

./decode.py -v ~/mtg-rnn/cards.txt cards.pretty.txt -d

This should create a file called cards.pretty.txt with a text spoiler in it that's actually designed for human consumption. Open it in your favorite text editor and enjoy!

The -d option ensures you'll still be able to see anything that went wrong with the cards. You can change the formatting with -f and -g, and produce a set file for MSE2 with -mse. The -c option produces some interesting comparisons to existing cards, but it's slow, so be prepared to wait a long time if you use it on a large dump.

Gory details of the format

Individual cards are separated by two newlines. Multifaced cards (split, flip, etc.) are encoded together, with the castable one first if applicable, and separated by only one newline.
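
As a minimal sketch of that separator rule (not the parser mtgencode actually uses; split_cards is a hypothetical helper), a dump can be broken into cards and faces like this:

```python
def split_cards(dump):
    """Split an encoded dump into cards, each a list of one or more faces."""
    cards = []
    # Two newlines separate cards.
    for chunk in dump.strip('\n').split('\n\n'):
        # A single newline separates the faces of a multifaced card.
        faces = [face for face in chunk.split('\n') if face]
        if faces:
            cards.append(faces)
    return cards


# Placeholder text stands in for real encoded cards.
sample = 'card one\n\nfront face\nback face\n\ncard three\n'
for faces in split_cards(sample):
    print('{} face(s): {}'.format(len(faces), faces))
```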

All decimal numbers are represented in unary, with numbers over 20 special-cased into English. Fun fact: the only numbers over 20 on cards are 25, 30, 40, 50, 100, and 200. The unary representation uses one character to mark the start of the number, and another to count. So 0 is &, 1 is &^, 2 is &^^, 11 is &^^^^^^^^^^^, and so on.
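
A minimal sketch of the unary scheme, using the & and ^ characters from the examples above. The function names are hypothetical, and the special-casing of numbers over 20 into English is deliberately left out.

```python
UNARY_MARKER = '&'   # marks the start of a number
UNARY_COUNTER = '^'  # repeated once per unit


def to_unary(n):
    """Encode a non-negative integer, e.g. 2 -> '&^^'."""
    if n < 0:
        raise ValueError('cannot encode a negative number')
    return UNARY_MARKER + UNARY_COUNTER * n


def from_unary(text):
    """Decode a unary string back into an integer."""
    if not text.startswith(UNARY_MARKER):
        raise ValueError('missing unary marker: ' + repr(text))
    body = text[len(UNARY_MARKER):]
    if body.strip(UNARY_COUNTER):
        raise ValueError('unexpected characters in: ' + repr(text))
    return len(body)


for n in (0, 1, 2, 11):
    print(n, to_unary(n))
```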

Mana costs are specially encoded between braces {}. I use the unary counter to encode the colorless part, and then special two-character symbols for everything else. So, {3}{W}{W} becomes {^^^WWWW}, {U/B}{U/B} becomes {UBUB}, and {X}{X}{X} becomes {XXXXXX}. The details are controlled in lib/utils.py, and handled with the Manacost and Manatext objects in lib/manalib.py.
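
Here is a sketch that reproduces just the three examples above. It assumes a number becomes that many ^ characters, a single-letter symbol is doubled, and a two-letter hybrid drops its slash; anything else raises an error rather than guessing. The real rules live in lib/utils.py and lib/manalib.py and cover more cases, and encode_mana_cost is a hypothetical name.

```python
import re


def encode_mana_cost(cost):
    """Encode a cost such as '{3}{W}{W}' as '{^^^WWWW}'."""
    encoded = []
    for symbol in re.findall(r'\{([^{}]+)\}', cost):
        if symbol.isdigit():
            # Numeric part: unary counter, one ^ per mana.
            encoded.append('^' * int(symbol))
        elif re.match(r'^[A-Z]$', symbol):
            # Single-letter symbol: doubled so every symbol is two characters.
            encoded.append(symbol * 2)
        elif re.match(r'^[A-Z]/[A-Z]$', symbol):
            # Hybrid symbol: the two letters without the slash.
            encoded.append(symbol.replace('/', ''))
        else:
            raise ValueError('unsupported mana symbol: ' + symbol)
    return '{' + ''.join(encoded) + '}'


for cost in ('{3}{W}{W}', '{U/B}{U/B}', '{X}{X}{X}'):
    print(cost, '->', encode_mana_cost(cost))
```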

The name of the card becomes @ in the text. I try to handle all the stupid special cases correctly. For example, Crovax the Cursed is referred to in his text box as simply 'Crovax'. Yuch.

The names of counters are similarly replaced with %, and then a special line of text is added to tell what kind of counter % refers to. Fun fact: there are more than a hundred different kinds used in real cards.

Several ambiguous words are resolved. Most directly, the word 'counter' as in 'counter target spell' is replaced with 'uncast'. This should prevent confusion with +&^/+&^ counters and % counters.

I also reformat cards that choose between multiple things by removing the choice clause itself and instead having a delimited list of options prefixed by a number. If you could choose different numbers of things (one or both, one or more - turns out the latter is valid in all existing cases) then the number is 0, otherwise it's however many things you'd get to choose. So, 'choose one -\= effect x\= effect y' (the \ is a newline) becomes [&^ = effect x = effect y].

Finally, some postprocessing is done to put the lines of a card's ability text into a standardized, canonical form. Lines with multiple keywords are split, and then we put all of the simple keywords first, followed by things like static or activated abilities. A few things always go first (such as equip and enchant) and a few other things always go last (such as kicker and countertype). There are various reasons for doing this transformation, and some proper science could probably come up with a better specific procedure. One of the primary motivations for putting abilities onto individual lines is that it should simplify the process of adding back in reminder text. It should be noted somewhere that the definition of a simple keyword ability vs. some other line of text is that a simple keyword won't contain a period, and we can split a line with multiple of them by looking for commas and semicolons.
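
The "no period means simple keywords" rule can be sketched in a few lines. This is an illustration of the heuristic as described above, not the code mtgencode uses; split_keyword_line is a hypothetical name, and the reordering into canonical order is not shown.

```python
import re


def split_keyword_line(line):
    """Split a line of simple keywords; leave other lines untouched."""
    if '.' in line:
        # Contains a period, so it is treated as ordinary ability text.
        return [line]
    parts = [part.strip() for part in re.split(r'[,;]', line)]
    return [part for part in parts if part]


print(split_keyword_line('flying, first strike'))
print(split_keyword_line('when @ dies, draw a card.'))
print(split_keyword_line('enchant creature, land, or planeswalker'))
```

The last example shows the weakness of the heuristic: a line that has commas but isn't a keyword list gets split too, which is the behavior reported for Imprisoned in the Moon in the issues further down.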

======

Here's an attempt at a list of all the things I do:

  • Aggregate split / flip / rotating / etc. cards by their card number (22a with 22b) and put them together

  • Make all text lowercase, so the symbols for mana and X are distinct

  • Remove all reminder text

  • Put @ in for the name of the card

  • Encode the mana costs, and the tap and untap symbols

  • Convert decimal numbers to unary

  • Simplify the syntax of dashes, so that - is only used as a minus sign, and ~ is used elsewhere

  • Make sure that where X is the variable X, it's uppercase

  • Change the names of all counters to % and add a line to identify what kind of counter % refers to

  • Move the equip cost of equipment to the beginning of the text so that it's closer to the type

  • Rename 'counter' in the context of 'counter target spell' to 'uncast'

  • Put choices into [&^ = effect x = effect y] format

  • Replace actual newline characters with \ so that we can use those to separate cards

  • Clean all the unicode junk like accents and unicode minus signs out of the text so there are fewer characters

  • Split composite text lines (e.g. "flying, first strike" -> "flying\first strike") and put the lines into canonical order

mtgencode's People

Contributors

billzorn, pak90, reimannsum, sabrecat


mtgencode's Issues

Imprisoned in the Moon encodes its valid targets for enchanting on separate lines

The line "Enchant creature, land, or planeswalker" gets encoded as "enchant creature\land\or planeswalker". It appears it was assumed that commas only occur in lists of keyword abilities and not in the set of valid aura targets.
Not sure what kind of disambiguation we'd need to resolve this, other than maybe looking for card types before substitution.

HTML and Gatherer flags

When using the HTML and Gatherer flags, mana_text is not handled properly, giving <img class="Mana-T"> instead of <img class="mana-T">

formatting magic

I cannot seem to understand how decode.py can print names of cards when format() only parses cardname within an if gatherer clause.

Edit: somehow failed to recognize the else for that if gatherer clause

Verbose output of invalid cards breaks decoding

My bad:

Using the -v switch with decode.py throws an error UnboundLocalError: local variable 'json_cardname' referenced before assignment. Caused by 5e09c65. Probably just needs a different variable name in the decode handling (works OK on encode from JSON).

Encoding chokes on π and ∞ characters

Running encode.py over the current AllSets.json throws errors of the form

UnicodeEncodeError: 'ascii' codec can't encode character [code] in position 27: ordinal not in range(128)

It appears to be hung up on the values "π" and "∞" found in Unstable cards "Just Desserts" and "Infinity Elemental", respectively.

Followed tutorial, but only recognizes one card

we took input.txt and only set it to planeswalkers, but when trying to start training, it says it can only recognize one card and there's no reason to train it. even when trying the default file, it wouldn't work. there were errors along the way, but even with using all of the fixes that worked for my machine, it still doesn't work on my friend's. what should we do? i attached the planeswalkers file.
new1.txt

License/Etiquette question

I'm forking this to adapt to the Pokemon TCG - I think the card structure there is different enough to potentially have some quite interesting results. I'm not quite sure how the etiquette and licensing works for this situation, though - do I rename my fork now or only once I've actually made the changes? How do I write the license to account for the fact that the structure of the code is yours but many details will be mine? That sort of thing. I would prefer to defer to @billzorn in this situation, as the initial maker of the code.

issue with manatext and transforms

In Python 3, Manatext is throwing AttributeErrors because it has no replace() function. My hunch is that it is working in Python 2 because you are using old-style classes. Any possibility of making it new-style-class compatible?

Encoding doesn't like Vehicles

With a bit of hacking, I got encode.py to output this list of cards it considered invalid:

aethersphere harvester
aradara express
ballista charger
bomat bazaar barge
consulate dreadnought
cultivator's caravan
daredevil dragster
demolition stomper
fleetwheel cruiser
heart of kiran
irontread crusher
mobile garrison
ovalchase dragster
peacewalker colossus
renegade freighter
sky skiff
skysovereign, consul flagship
smuggler's copter
untethered express

I.e., vehicles. There must be something the AllSets corpus does with vehicles that mtgencode doesn't quite grasp.

Fields remain all lowercase in default and forum modes

When decoding to Gatherer or MSE format, encoded card text is returned to title / sentence case, as appropriate. However, when using the default format or specifying forum format, everything is lowercase. This makes the process of copying and pasting card text to the forum cumbersome, as the user must either manually clean up every card they want to show off, or leave it in less readable lowercase form.

Is there a specific reason we don't perform the capitalization steps in these export modes? If not, I can expand the scope of #24 to cover this.

Faulty capitalization when exporting to Magic Set Editor

When creating an MSE set, card names are capitalized rather carelessly.

  • Any "'s" is capitalized. Example: card name "allanda's rupture" becomes "Allanda'S Rupture".
  • "of" and "the" are capitalized mid-phrase. Example: card name "anvul of the fire" becomes "Anvul Of The Fire".

Provide an option to dump data on invalid cards during encode

When encoding, there's summary output of how many cards were invalid or failed to parse, but no way to unfold more information on what trouble was encountered. Running encode.py with #12 in place says it encountered 19 invalid cards, and I'm curious what else is getting missed, but there's no simple way to find out!
