Use them to do great things. Please share the results!
Usage
# export Jan 2016 Standard games to lichess_db_2016-01.pgn
sbt "runMain lichess.Games 2016-01"
# export Jan 2016 Standard games to custom_file.pgn
sbt "runMain lichess.Games 2016-01 custom_file.pgn"
# export Jan 2016 Atomic games to custom_file.pgn
sbt "runMain lichess.Games 2016-01 custom_file.pgn atomic"
not so much to improve compression speed, but to speed up decompression:
Files that are compressed with pbzip2 are broken up into pieces and each individual piece is compressed. This is how pbzip2 runs faster on multiple CPUs since the pieces can be compressed simultaneously. The final .bz2 file may be slightly larger than if it was compressed with the regular bzip2 program due to this file splitting (usually less than 0.2% larger). Files that are compressed with pbzip2 will also gain considerable speedup when decompressed using pbzip2.
Files that were compressed using bzip2 will not see speedup since bzip2 packages the data into a single chunk that cannot be split between processors.
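The piece-wise layout described above can be illustrated with Python's standard bz2 module. This is only a sketch of the idea, not pbzip2's actual implementation: the helper name compress_in_pieces and the piece size are assumptions made for the example. Each piece is compressed as an independent bzip2 stream and the streams are concatenated, which is why the pieces could in principle be handled by separate CPUs.

```python
import bz2


def compress_in_pieces(data: bytes, piece_size: int) -> bytes:
    """Compress each piece as its own bzip2 stream and concatenate them.

    Hypothetical helper for illustration; pbzip2 itself is a C++ program.
    """
    pieces = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    # Every piece is independent, so these calls could run in parallel.
    return b"".join(bz2.compress(piece) for piece in pieces)


data = b"1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 " * 20_000
single_stream = bz2.compress(data)
multi_stream = compress_in_pieces(data, piece_size=100_000)

# bz2.decompress accepts concatenated streams, so both forms round-trip.
restored = bz2.decompress(multi_stream)
print(len(single_stream), len(multi_stream), restored == data)
```

The multi-stream output is still a valid .bz2 file for regular tools; only a decompressor that knows about the piece boundaries can exploit them for parallel decompression.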
Notice that the PGN claims that the evaluation at move 6 is mate in 3: 6... g6 $4 { [%eval #3] }. On the other hand, the data on lichess looks just fine. There are other, less noticeable differences: for example, the first move is evaluated as 0.17 in the PGN and as 0 on the site. Almost every move has differences that can't be due to rounding errors.
Two fields that would add useful information to the puzzle database are ECO and opening.
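To compare the PGN evaluations with the site systematically, the [%eval ...] comments can be pulled out of the movetext. The following is a minimal sketch, assuming the two forms seen above (a pawn-unit score such as 0.17 and a mate score such as #3); the function name parse_evals is made up for this example, and the values shown on the site would have to be collected separately.

```python
import re

# Matches "[%eval 0.17]", "[%eval -1.5]", "[%eval #3]" and "[%eval #-2]".
EVAL_RE = re.compile(r"\[%eval\s+(#-?\d+|-?\d+(?:\.\d+)?)\]")


def parse_evals(movetext):
    """Return ("mate", n) or ("cp", centipawns) for each eval comment, in order."""
    evals = []
    for raw in EVAL_RE.findall(movetext):
        if raw.startswith("#"):
            evals.append(("mate", int(raw[1:])))
        else:
            # Pawn units in the PGN are converted to integer centipawns.
            evals.append(("cp", round(float(raw) * 100)))
    return evals


print(parse_evals("6... g6 $4 { [%eval #3] }"))
```

With both sequences in the same units, a difference of more than one centipawn per move could not be explained by rounding alone.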
Although my interest is in a sense selfish (it would save me the time of downloading the games), I think it might be of interest to more people.
I got the idea from: Scott blog: Lichess puzzles, by ECO
Hi there,
I have found that the Inaccuracy, Mistake, and Blunder annotations are incorrect for the 2024-04 standard-rated data file in the Lichess Database.
The problem starts after the 5.Qxd4 move, which is correctly marked as an inaccuracy. However, all the following moves are also marked as inaccuracies even though they are not. The same happens after the 10.a3 move, which is a mistake, but all following moves are incorrectly marked as mistakes until the end of the game.
I found this issue in the 2024-04 dataset, but I'm not sure whether it occurs in any of the older files.
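One rough way to look for the same pattern in other files is to measure the longest run of consecutive moves that carry the same annotation glyph. The sketch below assumes the annotations appear as SAN suffixes (?!, ?, ??) in the movetext; the helper names and the sample movetext are invented for illustration and are not taken from the affected game.

```python
import re
from itertools import groupby

GLYPH_RE = re.compile(r"(\?\?|\?!|!\?|!!|\?|!)$")
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def move_glyphs(movetext):
    """Return the annotation glyph ("" if none) for every move, in order."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop { ... } comments
    text = re.sub(r"\d+\.+", " ", text)         # drop move numbers like "5." or "5..."
    glyphs = []
    for token in text.split():
        if token in RESULTS:
            continue
        match = GLYPH_RE.search(token)
        glyphs.append(match.group(1) if match else "")
    return glyphs


def longest_run(glyphs):
    """Return (glyph, length) of the longest run of one repeated, non-empty glyph."""
    best = ("", 0)
    for glyph, group in groupby(glyphs):
        length = len(list(group))
        if glyph and length > best[1]:
            best = (glyph, length)
    return best


# Invented movetext: every move from 3. d4 onward carries "?!".
sample = "1. e4 e5 2. Nf3 Nc6 3. d4?! exd4?! 4. Nxd4?! Bc5?! 1-0"
print(longest_run(move_glyphs(sample)))
```

A game where the run extends from some move all the way to the last move would be a candidate for the problem described here.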
Regarding the databases on https://database.lichess.org/, most files for different months were generated long after the months were over, which meant that abandoned games had long been removed from the server.
Starting from the recent July 2021 database however, the PGN actually contains games which were cancelled/abandoned near the end of the month.
One such example game in the July 2021 database is given below, which was played/abandoned on July 30th. Note that the corresponding link https://lichess.org/zPDH02kW is now long gone, but presumably when the static export was generated the game still existed on the server.
To fix the database and exclude such games (as was also done for all months prior to July 2021), maybe one could generate the static database later, when the aborted games have been removed from the server, or the export could manually filter out these aborted games.
Regarding the databases on https://database.lichess.org/, most files for different months were generated long after the months were over, which meant that correspondence games started in that month had long finished.
However, with new databases now being generated shortly after the end of the month, the PGN databases now actually contain correspondence games which were still in progress and therefore had many moves missing.
An example of such a game from the July 2021 database: a half-way finished correspondence game which started some time in July, but finished some time in August/September after 60 moves (119 plies), as can be seen at https://lichess.org/TY9oxOqR :
For correspondence games, it probably makes sense to make separate databases for them, and export them in batches based on the dates the games finished rather than when they started. Otherwise there will always be lots of these unfinished games in these databases, and the full games will not appear in any subsequent databases either. (Or one would have to wait several months before generating the export, as correspondence games might still be in progress.)
So, maybe the nicest solution: separate correspondence games from the main "standard chess" database, and batch that separate database according to the dates the games finished, rather than started.
The file lichess_db_chess960_rated_2023-11.pgn.zst contains a whole series of invalid starting FEN positions. For example, the 89569th game contains the following FEN header:
[FEN "rkrnnbbq/pppppppp/8/8/8/8/PPPPPPPP/RKRNNBBQ w HEhe - 0 1"]
Chess 960 start positions usually have the castling availability specified as "KQkq" and not using the files where the rooks are, which is odd. But in this case the specification "HEhe" is entirely invalid, because the rooks are on files A and C. On the website that same game (https://lichess.org/osfJ5OM2) shows no castling availability at all in the initial position, which is also wrong.
The file has many other start positions with invalid initial castling availability; some are even just "-".
I didn't find this error in any other files from January 2023 to May 2024 for Chess 960.
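The check I did by hand can be sketched as follows: every file letter in the castling field must name a file that actually holds a rook of that colour on its back rank. This is a simplified sketch that only looks at rook files (it skips "KQkq" and "-" and does not check the king's position); the function names are made up for the example.

```python
def rook_files(back_rank, rook_char):
    """Return the files (a-h) on which rook_char stands in one FEN rank."""
    files, col = [], 0
    for ch in back_rank:
        if ch.isdigit():
            col += int(ch)  # a digit is a run of empty squares
        else:
            if ch == rook_char:
                files.append("abcdefgh"[col])
            col += 1
    return files


def invalid_castling_letters(fen):
    """Return the castling letters that do not point at a rook of that colour."""
    placement, _side, castling = fen.split()[:3]
    ranks = placement.split("/")
    black_rooks = rook_files(ranks[0], "r")  # rank 8 comes first in FEN
    white_rooks = rook_files(ranks[7], "R")  # rank 1 comes last
    bad = []
    for ch in castling:
        if ch == "-" or ch in "KQkq":
            continue
        rooks = white_rooks if ch.isupper() else black_rooks
        if ch.lower() not in rooks:
            bad.append(ch)
    return bad


fen = "rkrnnbbq/pppppppp/8/8/8/8/PPPPPPPP/RKRNNBBQ w HEhe - 0 1"
print(invalid_castling_letters(fen))
```

For this position the rooks are on the a and c files, so a file-based castling field would be expected to use the letters A and C (and a and c for Black) rather than H and E.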
We received a couple of questions regarding the license of the evaluations, because it only says "Lichess games and puzzles are released under the Creative Commons CC0 license" - could we mention the license of the evaluations / include them in the sentence on https://database.lichess.org?
The lichess puzzle db at https://database.lichess.org/#puzzles does not follow the specified format. Instead of encoding puzzles with the move number as in the example:
Hi, thank you for your work on the lichess database and for the recently added Stockfish evaluations. It represents a lot of CPU time and therefore makes for a very valuable dataset.
I believe it would be even more useful if we knew the Stockfish version that generated these evals: 100M nodes do not mean the same thing if it was Stockfish 16 NNUE or Stockfish 11 HCE.
Is this something you could consider adding in the future?
Sometimes the "opening" tag in the database PGN specifies a different opening than what the opening explorer shows. In these cases, the opening explorer has been correct in the examples I have checked.
In the opening explorer, it is correctly classified as "Bongcloud Attack" from move 2 onward. In the database file lichess_db_standard_rated_2020-09.pgn.bz2, though, the PGN appears as follows, with the tag [Opening "King's Pawn Game"]:
This game ought to be classified as "Bongcloud Attack" in the database.
In fact, "Bongcloud Attack" does not appear as an opening even once in the file. There seem to be other openings affected by this too. For example, "Fried Fox Defense" seems consistently misclassified as "Barnes Defense" in the database.
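To cross-check Opening tags against the moves, a longest-prefix lookup is one possible approach. The sketch below is not how lichess classifies openings; the four-entry table and the move sequences in it are my own illustrative assumptions, not data taken from the database or from the opening explorer.

```python
# Illustrative table only: move sequences are assumed, not taken from lichess.
OPENINGS = {
    ("e4", "e5"): "King's Pawn Game",
    ("e4", "e5", "Ke2"): "Bongcloud Attack",
    ("e4", "f6"): "Barnes Defense",
    ("e4", "f6", "d4", "Kf7"): "Fried Fox Defense",
}


def classify(moves):
    """Return the name attached to the longest known prefix of the game."""
    best = None
    for length in range(1, len(moves) + 1):
        name = OPENINGS.get(tuple(moves[:length]))
        if name is not None:
            best = name  # a longer prefix overrides a shorter one
    return best


print(classify(["e4", "e5", "Ke2", "Ke7"]))
print(classify(["e4", "f6", "d4", "Kf7", "Nf3"]))
```

Under these assumptions, a game whose tag matches only a shorter prefix than the longest one available would show up as a mismatch between the tag and the lookup result.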
In Glicko2, calculating a new player's rating, rating deviation (RD), and volatility after a match requires all three metrics: rating, RD, and volatility. I've been trying to simulate some new matches with the chess puzzle dataset. The absence of volatility data can lead to inaccurate calculations of the new puzzle ratings and RD. I read the lila code only to discover that the default volatility is 0.09 and the maximum is 0.1. Is there any way to access this data?
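To show where the volatility enters, here is a minimal sketch of a Glicko-2 update written from Glickman's published description of the algorithm. It is not lila's implementation: the function name, the system constant tau=0.5, and the tolerance are assumptions chosen for the example, and lila may use different constants and limits.

```python
import math

SCALE = 173.7178  # converts between the Glicko and Glicko-2 scales


def glicko2_update(rating, rd, vol, results, tau=0.5, eps=1e-6):
    """results: list of (opponent_rating, opponent_rd, score), score in {0, 0.5, 1}.

    Returns the new (rating, rd, volatility). tau is an assumed system constant.
    """
    mu, phi = (rating - 1500) / SCALE, rd / SCALE
    v_inv, diff = 0.0, 0.0
    for opp_rating, opp_rd, score in results:
        mu_j, phi_j = (opp_rating - 1500) / SCALE, opp_rd / SCALE
        g = 1 / math.sqrt(1 + 3 * phi_j ** 2 / math.pi ** 2)
        expected = 1 / (1 + math.exp(-g * (mu - mu_j)))
        v_inv += g ** 2 * expected * (1 - expected)
        diff += g * (score - expected)
    v = 1 / v_inv
    delta = v * diff

    # The new volatility is the root of f, found iteratively.
    a = math.log(vol ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        return num / (2 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    lo = a
    if delta ** 2 > phi ** 2 + v:
        hi = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        hi = a - k * tau
    f_lo, f_hi = f(lo), f(hi)
    while abs(hi - lo) > eps:
        mid = lo + (lo - hi) * f_lo / (f_hi - f_lo)
        f_mid = f(mid)
        if f_mid * f_hi <= 0:
            lo, f_lo = hi, f_hi
        else:
            f_lo /= 2
        hi, f_hi = mid, f_mid
    new_vol = math.exp(lo / 2)

    phi_star = math.sqrt(phi ** 2 + new_vol ** 2)
    new_phi = 1 / math.sqrt(1 / phi_star ** 2 + 1 / v)
    new_mu = mu + new_phi ** 2 * diff
    return SCALE * new_mu + 1500, SCALE * new_phi, new_vol


# Same results, two different assumed starting volatilities.
games = [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]
print(glicko2_update(1500, 200, 0.06, games))
print(glicko2_update(1500, 200, 0.09, games))
```

The two calls differ only in the starting volatility, which is exactly the value missing from the puzzle dataset, and they produce different new ratings and RDs.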