Research: The Pokemon Cry compression algorithm and how they are stored

ipatix · Mar 19, 2014

The cry compression algorithm

Introduction:
So, hi everyone. I have been doing a little research lately on cries and the way they are compressed in the ROM, since I wanted to do cry hacking en masse and save as much space as possible.
I couldn't find any information about that, and as far as I know existing programs like PokeCry and Advanced Cry use basic 8 bit encoding without actual compression, so I started the research.

Because many people don't know how waveforms are stored in memory, I will explain:
First there is the so-called uncompressed data, which is usually stored in a format called "PCM" (Pulse Code Modulation).

(source: http://en.wikipedia.org/wiki/Pulse-code_modulation)

So PCM encoded data is made out of values of a specific integer type (e.g. 8 bits, 16 bits, signed or unsigned).
When we record PCM data we "sample" the value of a waveform at a fixed time interval. On CDs this is usually 44100 Hz with a 16 bit resolution in stereo. This provides enough quality that a human won't hear any quantisation noise. It also takes up quite a bit of space (on a CD this is 176.4 kByte per second), which would be too much to store on an AGB cartridge. So we can reduce the sampling rate and our sample resolution to save space at the cost of quality (Pokemon games produce sound at a rate of 13379 Hz with 8 bit resolution). Because this would still be too much to store the cries of >300 Pokemon, the sampling rate of the cries is reduced even further to 10512 Hz. To save more space on top of that, Game Freak developed a compression algorithm (I haven't found it being used in any other GBA game even though they use the same sound engine), which I want to explain here.
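To make the storage numbers above concrete, here is a tiny sketch of the arithmetic. The function name is made up for illustration, and the CD figure assumes two channels (stereo), which is what gives 176.4 kByte per second.

```c
#include <stdint.h>

/* Raw PCM data rate: samples per second * channels * bytes per sample.
 * Hypothetical helper, only meant to illustrate the numbers in the text. */
static uint32_t pcm_bytes_per_second(uint32_t rate_hz, uint32_t channels,
                                     uint32_t bits_per_sample)
{
    return rate_hz * channels * (bits_per_sample / 8u);
}
```

With these inputs, `pcm_bytes_per_second(44100, 2, 16)` gives 176400 bytes per second for CD audio, while a mono 8 bit cry at 10512 Hz needs 10512 bytes per second before any compression.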

Location of Cries:
In all GBA Pokemon games there are the so-called "crytables" which store the location of the data. I personally don't know how they are accessed by the game engine, because I don't know how the game deals with Pokemon IDs and which table entry will be accessed, but this isn't my topic today anyway. As far as I know, some people who made cry hacking programs know how this works, and maybe they are able to explain it.

However, to find the crytable, run your hex editor and search for the following hex string:
"20 3C 00 00 XX XX XX XX FF 00 FF 00".
This is basically a crytable entry (length = 0xC bytes). They work similarly to the voicegroup entries used by music. I don't know what the 0x20 does, but apparently it is only used for cries. In this case 0x3C shouldn't do anything, but in some cases it might change the pitch shifting, so it should always be 0x3C. XX XX XX XX is the little endian pointer to the sample data, which I will explain later. FF 00 FF 00 defines the shape (envelope) of the cry. Changing it doesn't really make sense for cries. With these values the cry turns on and off immediately when playback is started or stopped.
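If you would rather script the search than use a hex editor, the pattern can be matched with the four pointer bytes treated as wildcards. This is only a sketch: `find_cry_entry` is a made-up name, and it assumes the whole ROM has already been loaded into a byte buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first 12 byte entry that looks like
 * 20 3C 00 00 ?? ?? ?? ?? FF 00 FF 00 at or after `start`,
 * or -1 if there is none. The four ?? bytes (the pointer) are not checked. */
static long find_cry_entry(const uint8_t *rom, size_t rom_size, size_t start)
{
    static const uint8_t head[4] = { 0x20, 0x3C, 0x00, 0x00 };
    static const uint8_t tail[4] = { 0xFF, 0x00, 0xFF, 0x00 };

    for (size_t off = start; off + 12 <= rom_size; off++) {
        if (memcmp(rom + off, head, 4) == 0 &&
            memcmp(rom + off + 8, tail, 4) == 0)
            return (long)off;
    }
    return -1;
}
```

Since a crytable is a run of such entries, calling the function again with `start` set to the previous hit plus 12 should walk through consecutive entries; a hit that is not followed by further matches is probably a false positive.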

The Sample Data:
The first 16 bytes are the header.
The structure is the following:

2 bytes: (16 bit value) 0x0000 = 8 bit signed PCM sampledata [OR] 0x0001 = compressed sampledata
2 bytes: (16 bit value) 0x0000 = one shot sample (default) [OR] 0x4000 = looped waveform (only used for musical instruments so they can play as long as you want them to play)
4 bytes: (32 bit value) default samplingrate * 1024; a cry with a rate of 10512 Hz results in 0xA44000 (remember, the values are stored in little endian format)
4 bytes: (32 bit value) loop start position (only used for musical instruments, not important for Cries)
4 bytes: (32 bit value) length of the sampledata in samples (for uncompressed sample => 1 byte per sample)
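The header layout above can be read with a few little endian helpers. This is a minimal sketch; the struct and function names are my own and not taken from the game or the SDK.

```c
#include <stdint.h>

/* The 16 byte header in front of the wave data, as listed above. */
typedef struct {
    uint16_t type;        /* 0x0000 = 8 bit signed PCM, 0x0001 = compressed */
    uint16_t loop_flags;  /* 0x0000 = one shot, 0x4000 = looped */
    uint32_t rate_x1024;  /* default sampling rate in Hz * 1024 */
    uint32_t loop_start;  /* loop start position (not important for cries) */
    uint32_t num_samples; /* length of the sampledata in samples */
} SampleHeader;

static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* `p` must point at the first of the 16 header bytes. */
static SampleHeader parse_sample_header(const uint8_t *p)
{
    SampleHeader h;
    h.type        = read_le16(p + 0);
    h.loop_flags  = read_le16(p + 2);
    h.rate_x1024  = read_le32(p + 4);
    h.loop_start  = read_le32(p + 8);
    h.num_samples = read_le32(p + 12);
    return h;
}

/* Undo the "* 1024" scaling to get the rate in Hz. */
static uint32_t sample_rate_hz(const SampleHeader *h)
{
    return h->rate_x1024 / 1024u;
}
```

For the 10512 Hz example, the rate field is stored as the bytes 00 40 A4 00, which reads back as 0xA44000 = 10764288 = 10512 * 1024.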

After that the actual wave data follows. For uncompressed sampledata this will be N samples in 8 bit signed PCM.
For the compressed wave data this is a little bit different, and this is where we get to the actually interesting part where I did the research...

The compression algorithm:
Since the GBA has to decode the cries in realtime with little CPU time, a simple but fast decoding algorithm is required.
This type of compression has nothing to do with the compression types that are used for graphics (LZ77, Huffman, etc.). It is based on the concept of the so-called DPCM algorithm (Differential Pulse Code Modulation).
Although DPCM is not identical to this one, it might be helpful to check it out before you continue reading: http://en.wikipedia.org/wiki/DPCM

So let's see how it works:
The compressed data is split into blocks with a size of 1 byte + 0x20 bytes.
The first byte is always normal 8 bit signed PCM data. The next 0x20 bytes aren't PCM though. Each of them is split into two 4-bit values, and the very first 4-bit value of the block (the most significant half of the first of these bytes) isn't used.

Each 4-bit value is used to calculate the next sample based on the previous sample.
The first sample of a block is the raw 8 bit signed value. To get the second sample, the engine skips the most significant 4-bit value of the first data byte and puts its least significant 4-bit value into a differential lookup table, which returns a difference value that is added to (or, for the negative table entries, subtracted from) the previous sample. For every following data byte the engine first looks up the most significant 4-bit value to calculate the next sample and then the least significant 4-bit value to calculate the one after that. This is repeated until the end of the block is reached. At the start of the next block the previously decoded value is reset to the first byte of that block (remember, the first byte of a block is a raw signed 8 bit PCM sample), which is also output as a sample.

But how does the engine know when to stop decoding blocks?
Well, as mentioned in the sampledata header overview, a sample amount value is stored. In this case the value isn't 1 byte per sample because we don't use regular 8 bit PCM. The sample amount is the overall number of samples that get decoded.
Remember, each block is 0x21 bytes big. The first byte is used as a sample, and the remaining 0x20 bytes are split into 4-bit values, which would give 0x40 samples, but since the first 4-bit value is skipped it's only 0x3F. Together with the first raw sample this results in exactly 0x40 samples per 0x21 bytes.
I wasn't sure whether the sample amount has to be aligned with the number of blocks, i.e. whether the sampledata can end before the end of a block is reached. EDIT: Yes, it does work, I checked it. The code will decode those bytes, which are either unused or used for something else, but won't play them.
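The block arithmetic above also tells you how much ROM space a compressed cry needs. The sketch below uses made-up function names and assumes the last block is always stored in full (rounded up to 0x21 bytes), which is the safe choice when inserting data even though the decoder stops early.

```c
#include <stdint.h>

/* Each block holds 0x40 samples in 0x21 bytes. */
static uint32_t cry_block_count(uint32_t num_samples)
{
    return (num_samples + 63u) / 64u; /* round up to whole blocks */
}

/* Size of the compressed wave data in bytes, without the 16 byte header. */
static uint32_t cry_compressed_size(uint32_t num_samples)
{
    return cry_block_count(num_samples) * 33u;
}
```

A cry with 6391 samples would take 100 blocks, i.e. 3300 bytes instead of 6391 bytes of plain 8 bit PCM, so the compression saves a bit less than half of the space.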

If you read the text carefully and still don't understand the algorithm, you can check the C styled pseudo code example I made for decoding the cry:

Code:

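#include <stdint.h>

/* Differential lookup table (0x00..0x31 and 0xC0..0xFF read as signed bytes):
   indices 0-7 raise the level, indices 8-15 lower it. */
static const int8_t lookup_table[16] = {
    0, 1, 4, 9, 16, 25, 36, 49, -64, -49, -36, -25, -16, -9, -4, -1
};

/* input_data:   the bytes that follow the 16 byte header
   sample_count: the sample amount given by the header
   output:       receives sample_count decoded 8 bit signed samples */
void decode_cry(const uint8_t *input_data, uint32_t sample_count, int8_t *output)
{
    int8_t pcm_level = 0;
    uint32_t current_sample = 0;
    int block_align = 0; /* data bytes left in the current block */

    for (uint32_t i = 0; current_sample < sample_count; i++)
    {
        if (block_align == 0)
        {
            /* first byte of a block: raw 8 bit signed PCM sample */
            pcm_level = (int8_t) input_data[i];
            output[current_sample++] = pcm_level;
            block_align = 0x20;
            continue;
        }
        if (block_align < 0x20)
        {
            /* high nibble; skipped for the first data byte of a block */
            pcm_level = (int8_t) (pcm_level + lookup_table[input_data[i] >> 4]);
            output[current_sample++] = pcm_level;
            if (current_sample >= sample_count) break;
        }

        /* low nibble */
        pcm_level = (int8_t) (pcm_level + lookup_table[input_data[i] & 0xF]);
        output[current_sample++] = pcm_level;

        block_align--;
    }
}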

Conclusion:
So yeah, that's it. I hope you found this a little interesting to read and/or helpful.
Since I'm not a native English speaker, I hope it wasn't too difficult to read or understand. Please tell me if I made any major grammar mistakes in my little write-up so I can fix them and make it easier to understand :)

PS: If you guys are interested in the ASM that does the decoding, feel free to ask for it. I might put the decoding routine in here and explain how it works. I also plan on making an encoder for the cries, but this is still in planning.

Kawaii Shoujo Duskull · Mar 22, 2014

This is pretty interesting.
I don't think there's a way I could use this, but maybe somebody could use it if they decide to make a new cry inserter/editor program (we probably need that sort of thing lol).

Nice job. ^^

esperance · Mar 22, 2014

Wow. I love you right now. Literally. <3
This is awesome. And you did such a nice job explaining it!

These are the kind of things we need posted on here more often, especially for something so useful!

GoGoJJTech · Mar 26, 2014

The 0x20 is to play backwards I think. For directsound, the format is as you said, 00 3c 00 00 XX XX XX XX YY YY YY YY where XX is the pointer and YY are the adsr values. The first byte determines HOW it plays. 00 is normal, 10 is backwards, 20 is this, and 30 is backwards. I haven't really gone over it since the games don't.

AmineX · Apr 1, 2014

Like MIDI songs (MID2AGB) and AIF samples (AIF2AGB), cries seem to have their own converter too:

.section .rodata
.align 2
.global cry_273

cry_273:
.short 0x1
.short 0x0
.int 10764288
.int 0
.int 6391

.byte 0xD2,0xD6,0xDF,0xE8,0xF1,0xF5,0xF4,0x0D,0x16,0x15
.byte 0x14,0x0B,0xF2,0x02,0x0B,0x0F,0x0B,0x0A,0xCA,0xE3
.byte 0x14,0x10,0x0F,0x0F,0xFF,0xFB,0x04,0x35,0x66,0x56
.byte 0x25,0xE5,0x16,0x3A,0x31,0x28,0x18,0xE7,0x00,0x10
........

ipatix · Apr 2, 2014

Yeah, the problem is that the cries are something "Game Freak" special and the converter doesn't come with the Nintendo AGB SDK. So unless someone leaked it on the net, we're pretty helpless.

However, I tried to write my own encoder. If you want to, you can try it out. I didn't do much exception handling for "bad" WAV files or program crashes, so don't expect it to work straight away, and I don't give any warranty that it works properly (it was a quick and dirty program).

It doesn't seem to be very accurate, since the cry sounds a little different ingame than in the WAV.
Usage is "wav2bdpcm.exe input.wav output.bin"
The .bin file will have a header, so just copy the file to your ROM and change the cry pointer to your particular location.
Please only use mono files. Samplerate doesn't matter. 16 bit and 8 bit WAV should work, but I only tested 16 bit ones.

*Download - no warranty*:
https://dl.dropboxusercontent.com/u/28573353/Programs/wav2bdpcm.exe
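For readers who want to see what an encoder for this block format could look like, here is a minimal greedy sketch. It is not the source of wav2bdpcm; the function names are made up, and it assumes the input is already mono 8 bit signed PCM at the target rate. For every sample it simply picks the table entry that lands closest to the target without leaving the signed 8 bit range, and it pads an incomplete last block with "no change" nibbles.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static const int8_t kDelta[16] = {
    0, 1, 4, 9, 16, 25, 36, 49, -64, -49, -36, -25, -16, -9, -4, -1
};

/* Picks the table index that brings `prev` closest to `target` while
 * staying inside -128..127, and stores the resulting level in *decoded. */
static uint8_t best_nibble(int prev, int target, int *decoded)
{
    uint8_t best = 0;                  /* index 0 = no change, always valid */
    int best_err = abs(target - prev);
    *decoded = prev;
    for (uint8_t n = 1; n < 16; n++) {
        int cand = prev + kDelta[n];
        if (cand < -128 || cand > 127)
            continue;                  /* would wrap around in the decoder */
        int err = abs(target - cand);
        if (err < best_err) {
            best_err = err;
            best = n;
            *decoded = cand;
        }
    }
    return best;
}

/* Encodes `count` samples into 0x21 byte blocks. `out` must have room for
 * ((count + 63) / 64) * 33 bytes. Returns the number of bytes written. */
static size_t cry_encode(const int8_t *pcm, size_t count, uint8_t *out)
{
    size_t written = 0;
    for (size_t base = 0; base < count; base += 64) {
        int level = pcm[base];               /* what the decoder will hold */
        out[written++] = (uint8_t)pcm[base]; /* raw first sample of the block */
        for (size_t b = 1; b <= 32; b++) {
            size_t hi = base + 2 * b - 2;    /* sample coded in the high nibble */
            size_t lo = base + 2 * b - 1;    /* sample coded in the low nibble */
            uint8_t byte = 0;
            if (b > 1) {                     /* first high nibble stays unused */
                int target = (hi < count) ? pcm[hi] : level;
                byte = (uint8_t)(best_nibble(level, target, &level) << 4);
            }
            int target = (lo < count) ? pcm[lo] : level;
            byte |= best_nibble(level, target, &level);
            out[written++] = byte;
        }
    }
    return written;
}
```

Because `level` tracks the value the decoder will actually reconstruct rather than the source sample, the error of one step is corrected in the next one instead of piling up. Fast jumps larger than the biggest table entries still take several samples to catch up, which may be one reason why an encoded cry sounds a little different from the original WAV.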

AmineX · Apr 2, 2014

The latest version of the AGB SDK library I have is v3.0, and there is no cry converter in it! Is there any newer version?
Thanks a lot :) I have something that I hope will be useful:

I found this in the openpoké demo source code:

You can download them here:
http://helmet.kafuka.org/openpoke/openpoke_oldsource.rar

ipatix · Nov 4, 2014

So after a long time I finally corrected the cry decompressor pseudo code. If anyone of you made work based on the old one please update it. The code I posted previously was WRONG! (and also the description).

MajinBlueDragon · Nov 15, 2014

Hi! Sound dev here, I don't know if it matters at this point but I can shed some light on the format the "cry table" is in. This is actually what is known as a vgroup, and like you pointed out is essentially the same as how instrument samples are stored. The format you listed: 20 3C 00 00 XX XX XX XX FF 00 FF 00 <-- can be easily broken down, as it is standard across all instruments. I have data on all possible variations of this (it differs slightly based on that first value), but to keep this post brief I'll simply explain it as it relates to the cries.

20 = Instrument type; 20 being compressed sample in this case. The first byte is NOT "how it plays", but rather what kind of instrument it is. This byte will also affect how the rest of the values are read, though it's mostly standardized across the majority of types with a few obvious exceptions (such as with GameBoy sounds)...I'm pretty sure there are modified versions of the m4a engine used by Camelot which have slightly different functionality for some type parameters too.

3C = Base playback note, 3C is Middle C

00 = always empty, in my experience

00 = this second 00 is special. It is used as either the pan value for individual drums, or the sweep value for GameBoy sounds on channel 2. In this instance, it does nothing. When unused it is always 00, though it will have no effect on any unrelated instrument types.

XX XX XX XX = obviously, the sample pointer!

FF 00 FF 00 = this is the default envelope. An envelope is something that affects the sound of a sample when played back, representing Attack, Decay, Sustain and Release respectively. At these default values the sample starts at full volume right away, sustains at full level and stops immediately on release, essentially playing your sample back "as-is".
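Put together, the 12 byte entry broken down above could be parsed like this. It is a sketch with names of my own choosing. Two things in it are assumptions that are not stated in this thread: the envelope bytes are taken in the usual ADSR order (attack, decay, sustain, release), and the pointer is turned into a file offset by subtracting 0x08000000, the address where the GBA maps the cartridge ROM.

```c
#include <stdint.h>

/* One 12 byte table entry, e.g. 20 3C 00 00 XX XX XX XX FF 00 FF 00. */
typedef struct {
    uint8_t  type;       /* instrument type, 0x20 for compressed cries */
    uint8_t  key;        /* 0x3C in the cry tables */
    uint8_t  unused;     /* always 00 according to the breakdown above */
    uint8_t  pan_sweep;  /* drum pan / GameBoy sweep, 00 for cries */
    uint32_t sample_ptr; /* little endian pointer to the sample header */
    uint8_t  attack;     /* envelope bytes, assumed ADSR order */
    uint8_t  decay;
    uint8_t  sustain;
    uint8_t  release;
} VoiceEntry;

/* `p` must point at the first of the 12 entry bytes. */
static VoiceEntry parse_voice_entry(const uint8_t *p)
{
    VoiceEntry e;
    e.type       = p[0];
    e.key        = p[1];
    e.unused     = p[2];
    e.pan_sweep  = p[3];
    e.sample_ptr = (uint32_t)p[4] | ((uint32_t)p[5] << 8) |
                   ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
    e.attack     = p[8];
    e.decay      = p[9];
    e.sustain    = p[10];
    e.release    = p[11];
    return e;
}

/* Converts a GBA ROM address into an offset inside the ROM file,
 * or returns -1 if the pointer is outside the cartridge address range. */
static long rom_file_offset(uint32_t gba_ptr)
{
    if (gba_ptr < 0x08000000u || gba_ptr >= 0x0A000000u)
        return -1;
    return (long)(gba_ptr - 0x08000000u);
}
```

For example, an entry whose pointer bytes are 78 56 34 08 points to address 0x08345678, which is offset 0x345678 in the ROM file.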

this format is essentially going to be the same for your instruments too, though that Type value will be different as well as some parameters depending on what that value is. I suppose that could be a thread all its own some time :x

ipatix · Nov 15, 2014

Well, you're mostly right. For the first byte, though, it has mixed functionality:
It differs between the different playback methods and the output device at the same time. Bit 0-3 seem to control the output device and bit 0-7 control the sample's sound method.
So for Nintendo's default mixer it will only trigger on bit 7 being set (which disables resampling). Bit 5-6 are iirc only used in Pokemon's modded m4a engine. Bit 6 will enable the cry decoder while bit 5 will reverse the sample playback (both at the cost of additional processing time!).
For Camelot's mod bit 6 is "kinda" used. The engine usually copies this byte to the virtual channels, and in case one of Camelot's exclusive synth instruments is used it will set bit 6 in the runtime variables. I haven't tried setting it in Camelot's original games, but since I partially ported Camelot's mixer algorithm for Pokemon Sovereign of the Skies to improve quality and reduce processing time, I found out that setting it could eventually result in garbled sound: if the game detects a synth instrument it checks whether bit 6 is set, and if it isn't, it runs a short init procedure which would be skipped otherwise.
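As a small illustration of the two Pokemon-specific flags, here is how the type byte could be tested. The mask values are inferred from the type bytes quoted in this thread (0x20 for the compressed cry entries, 0x10 and 0x30 for the entries that play backwards), not from the engine source, so treat them and the names as assumptions. They would match "bit 6" and "bit 5" above if the bits are counted starting from 1.

```c
#include <stdint.h>

/* Assumed masks, inferred from the 0x10 / 0x20 / 0x30 type bytes. */
#define VOICE_FLAG_REVERSE    0x10u /* play the sample backwards */
#define VOICE_FLAG_COMPRESSED 0x20u /* run the cry decoder */

static int voice_is_compressed(uint8_t type)
{
    return (type & VOICE_FLAG_COMPRESSED) != 0;
}

static int voice_is_reversed(uint8_t type)
{
    return (type & VOICE_FLAG_REVERSE) != 0;
}
```

Under this reading, a type of 0x30 is simply both flags at once: a compressed cry played in reverse.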

Other than that, 0x3C (well, that's its most usual value) is not the base playback note!!! It's only used in drum tables, where it determines the actual key to play back, but it doesn't do anything for regular instruments. I've tried it multiple times in the past and it won't change the base note. Final Fantasy actually sometimes has instruments where this value is always 0x0, and this is why the legacy Sappy always produced horribly ugly sounds: it thought this was the base note even though it wasn't.

MajinBlueDragon · Nov 16, 2014

if you encode an AIFF file with a base note set to something other than middle C I believe it will change it, at least in the stock M4A engine. I know Fire Emblem 6 does something like this, iirc. For sure that is set somewhere anyway...! I am aware that it's used for the drum note though :) Pokemon uses it for its tom drum samples, etc... I've done a little work in the industry as a sound developer so I have experience with the rudimentary M4A engine! I never knew that about the Camelot mod though, I was very much wondering how the synth instruments were handled (as they are only a few bytes long per sample, and obviously won't play when ported to another game normally)....very interesting, thank you for that! I was always hoping I could use some of those crazy synth sounds one day in other project so that will be helpful information.

ipatix · Nov 16, 2014

Interesting to hear. But do you know how to set another base note with aiff2agb?
I personally don't know if AIFF has tags for these kinds of things, nor do I know how to set them. Since AIFF is a pretty abandoned format, I personally use Audacity to export the raw PCM data and write the header with the pitch and loops manually.

Camelot's mod is actually pretty funny, and I think it was well worth combining parts of it with Game Freak's mod. They use techniques like self-modifying code and very basic synthesizers (PWM wave, sawtooth and some kind of pulse shifted triangular wave) which execute very fast and provide high quality. Their code seems to be optimized for higher sampling rates, which is actually pretty cool, but it is usually pretty hard to hack into other games since it uses a 16 bit work buffer, which requires quite a lot of memory and is usually hard to allocate in commercial games.
But Camelot didn't only change the mixer; they also messed around with a lot of other things I haven't researched yet, and there still seems to be a lot of potential functionality. One interesting thing is probably their own reverb algorithm, but they also seem to have two reserved virtual channels for the textbox sound effects in Golden Sun.
I might research all of that some time, but I just don't get to it these days due to time issues.

MajinBlueDragon · Nov 16, 2014

I use SoundForge for my sample creation, which I believe Nintendo also suggests at some point in their documentation (which is hilarious, because it's a Sony program...) It changes slightly between versions, but in most of them I believe you access this menu by right-clicking the timeline above the graphical representation of the waveform itself and selecting "Edit Sample" ... the parameter "MIDI Unity Note" is what you're after here :) it's metadata like the sample loop-point, but M4a definitely does read it and use it to determine the relative base playback pitch. In this case, what Soundforge says is "C5" defaults to that value of 3C. You generally would not edit this parameter unless your sample was at a note other than C, of course, or else your MIDI notes would have to be transposed accordingly (see also: Golden Sun and its incorrect octaves)

...it won't let me post an image yet because I don't have 15 posts, but here you go: dropbox.com/s/tf0jwenc1t1lasu/basenotesf.png

Ah man...so the synth engine in Golden Sun really does just use basic waveforms...that probably explains why those samples are like, 10 bytes each, insane. I need to figure out how to port specifically that part of the engine out, it's the one thing I've always been really interested in from it. That high quality mixer is pretty cool too, though, I was always wondering how the hell they managed things with the sample rates as insanely high as they were.

ipatix · Nov 17, 2014

Well, I managed to merge Game Freak's and Camelot's mixers quite a while ago. Here is the assembly, but replacing the old mixer requires a few tricks with memory management since the code is quite a bit bigger.
https://dl.dropboxusercontent.com/u/28573353/temp/main_2_advanced.s

I think it's the equivalent to SoundRAM in the mp2k SDK (at least that's what the object file disassembly tells me) and needs to be replaced accordingly.

In the bottom of the file there is a sine based fm synthesizer I tried to program once but (even though it worked) it didn't produce amazing sounds so I disabled the code in my config section.
Maybe you're interested in the code, so feel free to check it out. I documented quite a lot of it, but there are a lot of things people other than myself probably won't understand, AND I didn't understand all parts of the code when I originally rewrote it; I figured them out later but didn't complete the documentation.

MajinBlueDragon · Dec 2, 2014

On the subject of cries--I was curious if you could shed some light on something for me. How are the cries actually mapped to the Pokemon themselves in the gen 3 games? Like...I get there's obviously the m4a-format pointer table and the such, but is there a parameter somewhere within the Pokemon themselves? I know there's at least three or so separate tables for cries (since each m4a table can only be a maximum of 0x5F4 bytes) too.

I'm actually trying to figure out how the cries in Pokemon Pinball RS are mapped (which use the exact same M4A tables, with missing Kanto pokemon blanked out) and knowing how it works in the Gen 3 games may make this a bit more clear for me D:

ipatix · Dec 2, 2014

Oh it's just terrible how Gamefreak did the mapping of the cries to the tables.

For Gen 1 + 2 the entry number is calculated as the internal ID - 1 (look at the XSE poke ID definition file). The 25 "empty" ID slots are hardwired by ASM to the Unown table entry, and all IDs after those 25 slots (Gen 3) are put through a lookup table to get the actual entry number. I have no clue why they made it that complicated (for no real reason). I once looked up all the offsets for German Emerald to modify the ASM and to repoint and extend the cry table, but I don't think they are useful for you.
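In C, the mapping described above would look roughly like the sketch below. The ID constants are my assumptions about the internal numbering (Celebi = 251 as the last Gen 2 ID, 277 as the first ID after the 25 empty slots, Unown = 201), the function name is made up, and the contents of the Gen 3 lookup table are not reproduced here, so the caller has to supply them.

```c
#include <stdint.h>

/* Assumed internal IDs, see the note above. */
#define LAST_GEN2_ID   251u /* Celebi */
#define FIRST_GEN3_ID  277u /* first ID after the 25 empty slots (252..276) */
#define UNOWN_ID       201u

/* Maps an internal Pokemon ID (1 or higher) to a cry entry number.
 * `gen3_table` is the game's lookup table for the Gen 3 IDs; index 0 of
 * the table belongs to FIRST_GEN3_ID. */
static uint16_t cry_entry_for_id(uint16_t id, const uint16_t *gen3_table)
{
    if (id <= LAST_GEN2_ID)
        return (uint16_t)(id - 1u);       /* Gen 1 + 2: entry = ID - 1 */
    if (id < FIRST_GEN3_ID)
        return (uint16_t)(UNOWN_ID - 1u); /* empty slots use Unown's entry */
    return gen3_table[id - FIRST_GEN3_ID];
}
```

So ID 1 maps to entry 0, Celebi to entry 250, every empty slot to entry 200 (Unown), and everything from 277 upwards to whatever the lookup table says.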

AtecainCorp. · Dec 2, 2014

Remember that after the last normal cry pointer the same table begins again, but this time with 30 3c 00 00 xx xx xx xx 00 00 00 00, without the 20 at the beginning. The effect was funny: 30 makes all these cries play backwards.

HidoranBlaze · Dec 2, 2014

Ksiazek Bartlomiej said:
Remember that after the last normal cry pointer the same table begins again, but this time with 30 3c 00 00 xx xx xx xx 00 00 00 00, without the 20 at the beginning. The effect was funny: 30 makes all these cries play backwards.

That's the secondary cry table, which is used for stuff like Growl and whatnot.

MajinBlueDragon · Dec 2, 2014

...wow, that's exactly as hacky as I imagined it. So, it is essentially mapped via the ID number in ASM with the exception of the lookup table, haha. Any idea where that lookup table might be located? This is actually pretty interesting, in a "what on earth were they thinking" sort of way.

EDIT: wow, I figured it out in Pinball. They apparently ditched GameFreak's mapping system and basically just used the hex value corresponding with that cry's position in the table. Far less hacky and easier to manage than ever, hah.
