The cry compression algorithm
Introduction:
So, Hi everyone. I was doing a little research lately about cries and the way they are compressed in the ROM since I wanted to do cry hacking 'o' mass and I wanted to save as much space as possible.
I couldn't find any information about that and as far as I know existing programs like PokeCry and Advanced Cry use basic 8 bit encoding without actual compression and so I started the research.
So because many people don't know how waveforms are stored in memory I will explain:
First the is the so called uncompressed data which is usually stored in a format called "PCM" (Pulse Code Modulation)
(source: https://en.wikipedia.org/wiki/Pulse-code_modulation)
So PCM encoded data is made out of a specific integer type (e.g. 8 bits, 16 bits, signed or unsigned).
When we record PCM data we will "sample" a value of a waveform with a fixed time interval. On CDs this is usually 44100 Hz with a 16 bit resolution. This provides enough quality that the human won't hear any quantisation noise. This will takes up quite a bit of space (on a CD this will be 176,4 kByte per second) which would be too much to store on a AGB cartridge. So we can reduce the samplingrate and our sample resolution to save space at the lack of quality (Pokemon games produce sound at a rate of 13379 Hz with 8 bit resolution). Because this will still be too much to store cries of >300 Pokemons the samplingrate is even more degraded to a samplingrate of 10512 Hz. To save more space on top of that Game Freak developed a compression algorithm (I didn't found this being used in any other GBA game even though they use the same sound engine) which I want to explain here.
Location of Cries:
In all GBA Pokemon games there are the so called "Crytables" which store the location of the data. I personally don't know how they are accessed by the game engine because I don't know how they game deals with Pokemon IDs and which table entry will be accessed but this isn't my topic today anyway. As far as I know some people who made Cry hacking programs know how this works and maybe they might be able to explain how this works.
However to find the crytable run your hexeditor and search for the following hex string:
"20 3C 00 00 XX XX XX XX FF 00 FF 00".
This is basically a crytable entry (length = 0xC bytes). They work similar like voicegroup entrys used by music. I don't know what the 0x20 does but apperantly it is only used for Cries and I don't know what it does. In this case 0x3C shouldn't do anything but in some cases it might be able to change the pitch shifting so these should be 0x3C always. XX XX XX XX is the little endian pointer to the Sample Data I will explain later. FF 00 FF 00 define the shape of the cry. Changing it doesn't really make sense for Cries. This option will make the cries immediantley turn on and turn off when the Cry playback is getting started or stopped.
The Sample Data:
The first 16 bytes are the header.
The structure is the following:
For the compressed wave data this is a a little bit different and this is where we get to the actual interesting part where I did the research...
The compression algorithm:
Since the GBA has to decode the Cries in realtime with low CPU time a simple but fast decoding algorithm is required to do so.
This type of compression has nothing to do with other compression types that is used for graphics (LZ77, Huffman, etc.). It is based on a concept of the so called DPCM algorthim (Differential Pulse Code Modulation).
Although DPCM is not identical to this one it might be helpful to check this out before continue reading: https://en.wikipedia.org/wiki/DPCM
So let's see how it works:
The compressed data is split into blocks with a size of 1 byte + 0x20 bytes.
The first byte will always be like normal 8 bit signed PCM data. The next 0x20 bytes aren't PCM though. The next 0x20 bytes are split into 2 4-bit values each except the first 4 bit value (which isn't used).
Each 4 bit value is used to calculate the next sample based on the previous sample.
When the engine wants to get the second sample and the third one (remeber, our first one was a 8 bit signed value) the engine will put the most significant (except for the first one which is skipped) 4-bit value into a differntial-lookup-table and will return a differencial value that is either added or subtracted from the previous sample. After that the least significant 4-bit value will follow and will be subtracted/added to the previous sample as well to calculate the third sample. The same process is done with the 2 4-bit values over and over again until the end of the block is reached. After the end of one block the previous decoded value will get reset to the first byte of the block (remeber, the first byte of a block is a raw signed 8 bit PCM sample) and will be used as sample as well.
But how does the engine know when to stop decoding blocks?
Well as mentioned in the sampledata header overview a sample amount value is stored. In this case this value isn't 1 byte per 1 sample because we don't use regular 8 bit PCM code. The sample amount value will be the amount of overall samples that have been calculated.
Remeber, each block is 0x21 bytes big. The first byte is used as sample and the rest of the 0x20 bytes is split into 4 bit values which results 0x40 samples but since it skips the first 4 bit value it's only 0x3F. So Considering the first sample and the first 4 bit value is skipped it results 0x40 samples exactly per 0x21 bytes.
I'm not sure if the sample amount has to be aligned with the actual amount of blocks. So it might be possible the sampledata can end before the end of a block is reached. I can't confirm this yet though. EDIT: Yes it does work, I checked it. The code will decode those bytes that are either unused or used for something else but won't play them.
If you read the text carefully and still didn't understood the algorithm you can check a C styled pseudo code example I made for decoding the cry:
Conclusion:
So yeah that's it and I hope you found this a little interesting to read and/or helpful.
Since I'm no native English speaker I hope it wasn't too difficult to read or understand. Please tell me if I have some major grammer mistakes in my litlle information to help you understanding it further :)
PS: If you guys are interested in the ASM stuff that does the decoding feel free to ask for it. I might put the decoding routine in here and explain how it works. I also plan on making an encoder for the Cries but this still in planning.
So, Hi everyone. I was doing a little research lately about cries and the way they are compressed in the ROM since I wanted to do cry hacking 'o' mass and I wanted to save as much space as possible.
I couldn't find any information about that and as far as I know existing programs like PokeCry and Advanced Cry use basic 8 bit encoding without actual compression and so I started the research.
So because many people don't know how waveforms are stored in memory I will explain:
First the is the so called uncompressed data which is usually stored in a format called "PCM" (Pulse Code Modulation)
![[PokeCommunity.com] The Pokemon Cry compression algorithm and how they are stored [PokeCommunity.com] The Pokemon Cry compression algorithm and how they are stored](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Pcm.svg/250px-Pcm.svg.png)
(source: https://en.wikipedia.org/wiki/Pulse-code_modulation)
So PCM encoded data is made out of a specific integer type (e.g. 8 bits, 16 bits, signed or unsigned).
When we record PCM data we will "sample" a value of a waveform with a fixed time interval. On CDs this is usually 44100 Hz with a 16 bit resolution. This provides enough quality that the human won't hear any quantisation noise. This will takes up quite a bit of space (on a CD this will be 176,4 kByte per second) which would be too much to store on a AGB cartridge. So we can reduce the samplingrate and our sample resolution to save space at the lack of quality (Pokemon games produce sound at a rate of 13379 Hz with 8 bit resolution). Because this will still be too much to store cries of >300 Pokemons the samplingrate is even more degraded to a samplingrate of 10512 Hz. To save more space on top of that Game Freak developed a compression algorithm (I didn't found this being used in any other GBA game even though they use the same sound engine) which I want to explain here.
Location of Cries:
In all GBA Pokemon games there are the so called "Crytables" which store the location of the data. I personally don't know how they are accessed by the game engine because I don't know how they game deals with Pokemon IDs and which table entry will be accessed but this isn't my topic today anyway. As far as I know some people who made Cry hacking programs know how this works and maybe they might be able to explain how this works.
However to find the crytable run your hexeditor and search for the following hex string:
"20 3C 00 00 XX XX XX XX FF 00 FF 00".
This is basically a crytable entry (length = 0xC bytes). They work similar like voicegroup entrys used by music. I don't know what the 0x20 does but apperantly it is only used for Cries and I don't know what it does. In this case 0x3C shouldn't do anything but in some cases it might be able to change the pitch shifting so these should be 0x3C always. XX XX XX XX is the little endian pointer to the Sample Data I will explain later. FF 00 FF 00 define the shape of the cry. Changing it doesn't really make sense for Cries. This option will make the cries immediantley turn on and turn off when the Cry playback is getting started or stopped.
The Sample Data:
The first 16 bytes are the header.
The structure is the following:
- 2 bytes: (16 bit value) 0x0000 = 8 bit signed PCM sampledata [OR] 0x0001 = compressed sampledata
- 2 bytes: (16 bits value) 0x0000 = one shot sample (default) [OR] 0x4000 = looped waveform (only used for musical instruments so they can play as long as you want them to play)
- 4 bytes: (32 bit value) default samplingrate * 1024; a Cry with a rate of 10512 Hz would result 0xA44000 (remember, they values are stored in the little endian format)
- 4 bytes: (32 bit value) loop start position (only used for musical instruments, not important for Cries)
- 4 bytes: (32 bit value) length of the sampledata in samples (for uncompressed sample => 1 byte per sample)
For the compressed wave data this is a a little bit different and this is where we get to the actual interesting part where I did the research...
The compression algorithm:
Since the GBA has to decode the Cries in realtime with low CPU time a simple but fast decoding algorithm is required to do so.
This type of compression has nothing to do with other compression types that is used for graphics (LZ77, Huffman, etc.). It is based on a concept of the so called DPCM algorthim (Differential Pulse Code Modulation).
Although DPCM is not identical to this one it might be helpful to check this out before continue reading: https://en.wikipedia.org/wiki/DPCM
So let's see how it works:
The compressed data is split into blocks with a size of 1 byte + 0x20 bytes.
The first byte will always be like normal 8 bit signed PCM data. The next 0x20 bytes aren't PCM though. The next 0x20 bytes are split into 2 4-bit values each except the first 4 bit value (which isn't used).
Each 4 bit value is used to calculate the next sample based on the previous sample.
When the engine wants to get the second sample and the third one (remeber, our first one was a 8 bit signed value) the engine will put the most significant (except for the first one which is skipped) 4-bit value into a differntial-lookup-table and will return a differencial value that is either added or subtracted from the previous sample. After that the least significant 4-bit value will follow and will be subtracted/added to the previous sample as well to calculate the third sample. The same process is done with the 2 4-bit values over and over again until the end of the block is reached. After the end of one block the previous decoded value will get reset to the first byte of the block (remeber, the first byte of a block is a raw signed 8 bit PCM sample) and will be used as sample as well.
But how does the engine know when to stop decoding blocks?
Well as mentioned in the sampledata header overview a sample amount value is stored. In this case this value isn't 1 byte per 1 sample because we don't use regular 8 bit PCM code. The sample amount value will be the amount of overall samples that have been calculated.
Remeber, each block is 0x21 bytes big. The first byte is used as sample and the rest of the 0x20 bytes is split into 4 bit values which results 0x40 samples but since it skips the first 4 bit value it's only 0x3F. So Considering the first sample and the first 4 bit value is skipped it results 0x40 samples exactly per 0x21 bytes.
I'm not sure if the sample amount has to be aligned with the actual amount of blocks. So it might be possible the sampledata can end before the end of a block is reached. I can't confirm this yet though. EDIT: Yes it does work, I checked it. The code will decode those bytes that are either unused or used for something else but won't play them.
If you read the text carefully and still didn't understood the algorithm you can check a C styled pseudo code example I made for decoding the cry:
Code:
char lookup_table[] = { 0x0, 0x1, 0x4, 0x9, 0x10, 0x19, 0x24, 0x31 , 0xC0, 0xCF, 0xDC, 0xE7, 0xF0, 0xF7, 0xFC, 0xFF };
unsigned char input_data[] = { /* input data here */ };
char PCM_LEVEL = 0;
int SAMPLE_COUNT = ?; // ? = value of the sample amount given by the header
int CURRENT_SAMPLE = 0;
int BLOCK_ALIGN = 0;
for (int i = 0; true; i++)
{
if (BLOCK_ALIGN == 0)
{
PCM_LEVEL = (char) input_data[i];
someOutputFunction(PCM_LEVEL);
BLOCK_ALIGN = 0x20;
continue;
}
if (BLOCK_ALIGN < 0x20)
{
PCM_LEVEL += lookup_table[input_data[i] >> 4];
someOutputFunction(PCM_LEVEL);
}
PCM_LEVEL += lookup_table[input_data[i] & 0xF];
someOutputFunction(PCM_LEVEL);
CURRENT_SAMPLE += 2;
if (CURRENT_SAMPLE >= SAMPLE_COUNT) break;
BLOCK_ALIGN--;
}
So yeah that's it and I hope you found this a little interesting to read and/or helpful.
Since I'm no native English speaker I hope it wasn't too difficult to read or understand. Please tell me if I have some major grammer mistakes in my litlle information to help you understanding it further :)
PS: If you guys are interested in the ASM stuff that does the decoding feel free to ask for it. I might put the decoding routine in here and explain how it works. I also plan on making an encoder for the Cries but this still in planning.
Last edited: