About.
Several new Rom-Hackers have trouble understanding things such as bits, bytes and offsets. The goal and intention of this thread is to provide a reference and explanation that makes this easier to understand. Here are a few things you need to know and understand in order to use this thread functionally:
- Basic Algebra
- Will to learn and understand
- A positive attitude
If you feel something should be added to this thread, don't hesitate to say so, and I will do my best to add it to the first post in a reasonable time.
Number Bases.
A number base is the number of distinct characters a single digit can take. For example, base 10 has 10 possible characters per digit: 0,1,2,3,4,5,6,7,8,9. Base 2 has 2 possible values per digit: 0,1. And so on. The base is usually assumed, but can be notated with a subscript value.
For example: 101 - You would normally assume base 10 ( which is the number system most people grow up with nowadays )
But if you put a subscript 2 next to it, 101_{2}, it becomes equivalent to 5_{10}.
Here is a list of common number systems:
- Decimal - base 10.
- Binary - base 2.
- Octal - base 8.
- Hexadecimal (or hex) - base 16.
*Hex numbers are usually notated with a '0x', '&h', or '$' preceding the value, as opposed to using a subscript number to notate the base.
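If you want to check a conversion like 101_{2} = 5_{10} yourself, here is a minimal C sketch. The helper name parse_in_base is made up for this example; it simply hands the work to the standard strtol function, which understands bases 2 through 36.

```c
#include <stdlib.h>

/* Hypothetical helper: read `digits` as a number written in `base`. */
long parse_in_base(const char *digits, int base)
{
    /* strtol accepts bases 2..36; in base 16 it also accepts a "0x" prefix. */
    return strtol(digits, NULL, base);
}
```

For example, parse_in_base("101", 2) gives 5, while parse_in_base("101", 10) gives 101.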
Back to Top.
Bits.
A bit is the smallest possible storage unit in modern computing, and uses the binary number system ( base 2 ). This means that it has two possible values: 0 and 1. Let's say you want to know the highest possible value, y, held in x number of bits.
y = 2^{x}-1
Using that we can figure out that...
2 bits has a maximum value of: 3.
3 bits has a maximum value of: 7.
And so forth.
Now you can adapt that function to fit other number systems.
y = n^{x}-1
Where n is the base ( base 2 - binary, base 10 - decimal, base 8 - octal ) and x is the number of digits.
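The formula y = n^{x}-1 can be sketched in C like this. The function name max_value is my own invention for illustration, and the sketch assumes the result fits in 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper: largest value that fits in `digits` digits of base `base`,
   i.e. y = n^x - 1. Assumes the result fits in 64 bits. */
uint64_t max_value(uint64_t base, unsigned digits)
{
    uint64_t power = 1;
    for (unsigned i = 0; i < digits; i++) {
        power *= base;   /* builds n^x one digit at a time */
    }
    return power - 1;
}
```

So max_value(2, 3) is 7 ( three bits ), and max_value(10, 3) is 999 ( three decimal digits ).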
Back to Top.
Bytes.
A byte consists of 8 bits, and has a maximum value of 0xFF ( 255 )
In programming, a byte (or char, if you must ) is either 'signed' or 'unsigned'. A signed byte sacrifices the most significant bit as a 'negative' flag. The most significant bit is the bit with the highest place-value ( furthest away from the bit with value 1_{10} ).
For the sake of simplicity, know that in the rest of this document I will only notate the base of non-decimal numbers, and the most significant bit ( Bit of significance, or BOS ) will be considered to be on the left 'side' of a number ( (left)100001000110(right) )
The sacrificing of the BOS means that a signed byte only has 7 bits to store the actual number ( y = 2^{7}-1 = 127 ), which effectively cuts the maximum value in half. Unsigned bytes have no such limitation; however, they cannot hold negative numbers.
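Here is a small C sketch of how the same 8 bits read as a signed byte. One detail worth knowing: the GBA's ARM processor ( like practically all modern hardware ) stores signed numbers in two's complement, so the top bit is not a plain 'minus sign'. A signed byte therefore runs from -128 to 127. The function name as_signed is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: interpret 8 raw bits as a two's-complement signed byte. */
int as_signed(uint8_t bits)
{
    /* If the most significant bit (0x80) is set, the value is negative. */
    return (bits & 0x80) ? (int)bits - 256 : (int)bits;
}
```

With this, 0x7F reads as 127, 0x80 reads as -128, and 0xFF reads as -1, whereas the unsigned readings are 127, 128 and 255.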
Back to Top.
Shorts.
Note that GBA Thumb instructions ( with 1 exception that I am aware of, long branch with link ) are 16 bits in size.
Shorts ( Or half-words ) are 16 bits long ( Max Val: 0xFFFF ). The same information regarding signed-ness applies to shorts, as bytes.
Back to Top.
Words.
Note that GBA ARM instructions and GBA registers are 32 bits in size. Also note that on some platforms ( x86 and the Windows API, for example ), a WORD is 16 bits and a DWORD ( double word ) is 32 bits. The GBA's ARM processor does not follow that convention.
Words are 32 bits long ( Max Val: 0xFFFFFFFF ). The same information regarding signed-ness applies to words, as to bytes and shorts.
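In C, the three unit sizes can be written with the fixed-width types from stdint.h. The short names u8, u16 and u32 below are a common convention in GBA code, but here they are just aliases I am assuming for illustration.

```c
#include <stdint.h>

typedef uint8_t  u8;   /* byte:              8 bits */
typedef uint16_t u16;  /* short / half-word: 16 bits */
typedef uint32_t u32;  /* word:              32 bits */

/* Largest unsigned value each unit can hold (all bits set). */
u8  max_byte(void)  { return UINT8_MAX;  }  /* 0xFF       */
u16 max_short(void) { return UINT16_MAX; }  /* 0xFFFF     */
u32 max_word(void)  { return UINT32_MAX; }  /* 0xFFFFFFFF */
```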
Back to Top.
Pointers.
I borrowed the house-address metaphor from C++ for Dummies, 5th edition.
In the world of programming, there exists a thing known as 'variables'. Variables are a programmer's way of storing and holding data. As a programmer, you need more than 16 variables, which means you can't just put your variables in your registers. Instead, the variables are stored in memory ( usually the RAM ). Now you have the variables in memory, now what? If you want to work with them, you have to know where the variable is in memory.
Think of a city. A city has many houses, apartments, etc. A city also has a mail-man. Mail-men carry letters that belong to houses. Each letter contains the address of the house it belongs to. Think of the city as your memory, containing all the houses and apartments ( variables ). The mail contains the address ( pointer ) of a house ( variable ) so that the mail-man ( processor ) can get to the house ( variable ) and deliver the mail ( use the data ). In ROM-hacking, an address is commonly referred to as an offset ( the two are equivalent in actuality, but some people hesitate to make the connection ).
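The mail-man metaphor might look something like this in C. The function names deliver and peek are invented for this sketch; the point is only that the functions receive an address and follow it to reach the variable.

```c
#include <stdint.h>

/* Hypothetical helper: store `new_value` in the variable living at `address`. */
void deliver(uint32_t *address, uint32_t new_value)
{
    *address = new_value;   /* follow the address to the "house" and drop off the mail */
}

/* Hypothetical helper: look at what is currently stored at `address`. */
uint32_t peek(const uint32_t *address)
{
    return *address;
}
```

If house is a uint32_t variable, then &house is its address, and deliver(&house, 42) changes house itself to 42.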
Back to Top.
Arrays.
A c-style string is an array of chars ( bytes ), and the end of the string is notated by a null-byte ( 0 )
Back to our City metaphor. Houses aren't just randomly dispersed in the city ( usually ). They have neighborhoods. Each house is in a nice row, evenly spaced out, and identical, but the internals of the house can vary. Think of an array as a neighborhood. It contains many houses ( variables ), and each variable can hold its own value.
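As a rough sketch of the neighborhood idea in C: each element sits right after the previous one, and you reach it by its index. Both function names below are made up for this example; string_length does by hand what the standard strlen does.

```c
#include <stddef.h>

/* Hypothetical helper: count the chars before the terminating null byte. */
size_t string_length(const char *text)
{
    size_t count = 0;
    while (text[count] != 0) {   /* the null byte ( 0 ) marks the end */
        count++;
    }
    return count;
}

/* Hypothetical helper: visit every "house" in the row and add up the values. */
unsigned sum_bytes(const unsigned char *houses, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; i++) {
        total += houses[i];      /* houses[i] is the i-th house in the row */
    }
    return total;
}
```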
Back to Top.
Structures.
You'll notice that I explain things using C and C++ terms quite often. I do apologize to those who do not program in those languages, but try to bear with me.
In several processors and architectures, registers are generally 32 bits. The processor can only work with registers. So what happens if your variable is larger than 32 bits? What happens is a struct. Consider this: a file header has a File-Signature ( provides information about the file type e.g. what version, ensures the correct file type, etc ) followed by a WORD ( 32 bit integer ). We'll assume that the signature is also 32 bits. 32+32 = 64. This means that our FileHeader variable cannot fit inside a register. So what do we do? We take the variable's pointer and use pointer arithmetic. The first part of our variable ( signature ) is 32 bits ( 4 bytes ). So we add 4 to our pointer, and it now points to the WORD contained in the header. You can now work with that WORD.
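The FileHeader example could be sketched in C as follows. The field names and the read_word helper are assumptions for illustration, not taken from any real file format. The helper does the "add 4 to the pointer" step by hand; normally you would just write header->word and let the compiler do the same arithmetic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header: a 32-bit signature followed by a 32-bit word. */
struct FileHeader {
    uint32_t signature;
    uint32_t word;
};

/* Hypothetical helper: fetch the word using explicit pointer arithmetic. */
uint32_t read_word(const struct FileHeader *header)
{
    const unsigned char *base = (const unsigned char *)header;
    uint32_t result;
    /* offsetof(struct FileHeader, word) is 4: the signature's 4 bytes come first. */
    memcpy(&result, base + offsetof(struct FileHeader, word), sizeof result);
    return result;
}
```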
Back to Top.
BitWise Operators.
A byte is 8 bits, and has a maximum value of 0xFF. A little shortcut for BitWise operations: a byte is two hexadecimal digits, and 4 bits belong to each digit. EG: 1111_{2} is equal to 0x0F. 1111 1111_{2} is equal to 0xFF. So if you learn how to count to 0xF in binary, you should be good to go, and doing BitWise operations, as well as converting between number bases, inside your head should be a breeze.
Bit wise operators are just that. They do things to bits: move bits, reverse bits, set bits, unset bits, etc. Bit shifting does not apply to a single bit. Instead, it applies to a group of bits ( Bytes, Shorts, Words, etc ). To BitShift (BS) a unit, you need to know two things: the number of bits to shift, and the direction of the shift.
If you bit shift towards the BOS ( left ), the numerical value of the unit will increase. The opposite is also true.
BS-ing to the left:
X << N =(exact, as long as no set bits are shifted out of the unit) X * 2^{N}
BS-ing to the right:
X >> N =(rounded down) X / 2^{N}
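A minimal C sketch of both shifts on a 32-bit unsigned word. The function names are made up, and the shift count is assumed to be less than 32 ( larger counts are not defined in C ).

```c
#include <stdint.h>

/* Hypothetical helper: x * 2^n, provided no set bits fall off the top of the word. */
uint32_t shift_left(uint32_t x, unsigned n)
{
    return x << n;
}

/* Hypothetical helper: x / 2^n, rounded down. */
uint32_t shift_right(uint32_t x, unsigned n)
{
    return x >> n;
}
```

For example, shift_left(3, 4) is 48 ( 3 * 16 ) and shift_right(13, 2) is 3 ( 13 / 4 = 3.25, rounded down ). Shifting 0x80000000 left by 1 gives 0, because the only set bit was pushed out of the word.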
AND operator:
AND-ing involves two corresponding bits of two units. IF both of the bits are set ( == 1 ), then the resulting bit is also set ( X = A AND B; X = result, A = Unit 1, B = Unit 2 ). Otherwise, the resulting bit is 0. Unfortunately, I don't know a way to represent this operation with algebra, I'm sorry. In programming ( save for ASM ), the AND operator is represented with the '&' character.
OR operator:
OR-ing also uses two corresponding bits of two units. IF either bit A OR bit B is set, then the resulting bit is also set. The only way to get 0 from this operator is for both bits to be 0. OR-ing is represented with the pipe ( '|' ) character.
XOR ( eXclusive OR operator )
XOR is a bit more complicated than the previous operators, and is somewhat representable in math. 1 XOR 1 = 0. 1 XOR 0 = 1. If BOTH bits are 1, the result is 0. If exactly one bit is 1, the result is 1. If BOTH bits are 0, the result is 0. XOR-ing the result with either input gives you back the other input:
X XOR Y = C;
C XOR X = Y;
Y XOR C = X;
XOR-ing is represented with a '^' character.
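Here is a small C sketch of AND, OR and XOR applied to whole bytes ( every pair of corresponding bits is processed at once ). The function names are invented for this example.

```c
#include <stdint.h>

/* Each function applies the operator to all 8 pairs of corresponding bits. */
uint8_t bit_and(uint8_t a, uint8_t b) { return (uint8_t)(a & b); }
uint8_t bit_or (uint8_t a, uint8_t b) { return (uint8_t)(a | b); }
uint8_t bit_xor(uint8_t a, uint8_t b) { return (uint8_t)(a ^ b); }
```

Taking A = 0xCC ( 1100 1100 ) and B = 0xAA ( 1010 1010 ): A AND B is 0x88 ( 1000 1000 ), A OR B is 0xEE ( 1110 1110 ), and A XOR B is 0x66 ( 0110 0110 ). XOR-ing 0x66 with 0xCC again gives back 0xAA.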
NOT operator:
NOT-ing a bit is simply reversing it. EG if a bit is set, it becomes unset. If a bit is not set, it becomes set. Typically applied to whole units, but is applicable to a single bit. In C and C++, bitwise NOT is represented by a tilde ( '~' [ a C++ destructor reference ] ), while the exclamation point ( '!' ) is the logical NOT used in the Logic Operators section.
In C++, you signify a destructor with a tilde ( '~' ) followed by the corresponding class name. So in a sense, you're saying NOT X. EG: make X NOT exist. Very clever C++. Very clever.
Back to Top.
Logic Operators.
Without logic, computers would be redundant, at best ( see what I did there? )
Fortunately for us, computer logic is easy to understand. There are a few basic operators you need to know.
X == Y - returns true if X = Y
X <= Y - returns true if X is less than, or equal to Y
X >= Y - returns true if X is greater than, or equal to Y
X < Y - returns true if X is less than Y
X > Y - returns true if X is greater than Y
X != Y - returns true if X is NOT equal to Y
X - returns true if X is NOT 0
!X - returns true if X IS 0.
Take the return value, and IF it is TRUE, then do this. In thumb-ASM, this is what that would look like:
cmp rn,ry @ compares register N with register Y and sets the appropriate processor flags ( look them up in GBATEK )
beq label @ if ( rn == ry ) goto label
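For comparison, here is roughly the same idea sketched in C. All the function names are made up for illustration. In C, a comparison 'returns' 1 for true and 0 for false, and an if statement takes the branch whenever the value is not 0.

```c
/* Hypothetical helper: the == operator yields 1 (true) or 0 (false). */
int is_equal(int x, int y)
{
    return x == y;
}

/* Hypothetical helper: !x is 1 when x IS 0, and 0 otherwise. */
int is_zero(int x)
{
    return !x;
}

/* Hypothetical helper: the C counterpart of cmp + beq. */
int pick(int rn, int ry, int if_equal, int otherwise)
{
    if (rn == ry) {        /* cmp rn,ry followed by beq */
        return if_equal;   /* branch taken */
    }
    return otherwise;      /* fell through */
}
```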
Back to Top.
Byte Endianness.
This is what pointers look like.
If you have a pointer to address 0xABCDEF, the bytes in the hex editor read EF CD AB. HOWEVER, for the most part, the pointers that most ROM hackers deal with are pointers into the ROM area, which on the GBA is either 0x08NNNNNN or 0x09NNNNNN. SO, when you see a pointer with '08' or '09' appended to it, that's what that means. A pointer to the ROM offset 0xABCDEF ( address 0x08ABCDEF ) looks like EF CD AB 08 in a hex editor.
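To turn the four bytes you see in a hex editor back into a pointer, you can rebuild the value byte by byte. This is only a sketch: the function names are made up, and rom_offset assumes the pointer really does point into the 0x08NNNNNN / 0x09NNNNNN ROM area.

```c
#include <stdint.h>

/* Hypothetical helper: build a 32-bit pointer from 4 bytes in hex-editor order. */
uint32_t read_pointer(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);   /* the last byte is the most significant */
}

/* Hypothetical helper: convert a ROM-area pointer to an offset into the ROM file. */
uint32_t rom_offset(uint32_t pointer)
{
    return pointer - 0x08000000u;        /* assumes pointer >= 0x08000000 */
}
```

Feeding in the bytes EF CD AB 08 gives the pointer 0x08ABCDEF, and rom_offset turns that into the offset 0xABCDEF.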
Byte endianness refers to what order the bytes are in, in a WORD or DWORD. You write numbers like so: 1234, with the most significant digit first. This is known as "Big Endian". In Big Endian, the byte containing the BOS ( of the DWORD itself ) comes first, all the way on the left. eg ( the BOS is the leftmost bit ):
10101010 10101010 10101010 10100101
The alternative is "Little Endian", and the bytes are in an opposite order.
The best way for me to explain this is by example.
Big Endian: 0x(12 34 56 78)
Little Endian: 0x(78 56 34 12)
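Converting between the two orders is just a matter of reversing the four bytes. A minimal sketch in C, with a made-up function name:

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit word. */
uint32_t swap_bytes(uint32_t value)
{
    return ((value & 0x000000FFu) << 24)   /* lowest byte moves to the top */
         | ((value & 0x0000FF00u) << 8)
         | ((value & 0x00FF0000u) >> 8)
         | ((value & 0xFF000000u) >> 24);  /* highest byte moves to the bottom */
}
```

swap_bytes(0x12345678) gives 0x78563412, and swapping a second time gives back the original value.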
Those who know, correct me if I'm wrong, but I believe this is the reasoning behind this madness.
This seems a little pointless ( and with modern technology, it kind of is ), but in the past processors were slow, and the difference between processing 1 byte and 2 bytes may have been significant. Say you have a word, and you want only the lower 16 bits of it ( 0x12345678 is what you have. You want 0x5678 ). What you would do is:
u32 value = 0x12345678;
u16* pVal = (u16*)&value; //u16* is syntax to define a pointer of u16 type; the cast makes the conversion explicit.
If you look at *pVal ( what pVal points to ) you will get: 0x5678. Why? Because you took the address of a u32 ( 0x12345678 ), and it is stored in memory like so: 78 56 34 12. If you process the pointer as a short, you read the first two bytes ( 78 56 ), which gives you 0x5678 ( the 16 bits are also "flipped" ). No shifting or extra address math is needed to get the lower half.
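One caveat if you try this on a modern compiler: reading a u32 through a u16 pointer can run afoul of C's aliasing rules, so a safer way to run the same experiment is to copy the bytes with memcpy. The sketch below does that. The function names are made up, and u16 / u32 are aliases I am assuming for the fixed-width stdint.h types.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t u16;
typedef uint32_t u32;

/* Hypothetical helper: read the two bytes at the lowest address of `value` as a u16. */
u16 first_half(u32 value)
{
    u16 half;
    memcpy(&half, &value, sizeof half);   /* copies the first two bytes in memory */
    return half;
}

/* Hypothetical helper: returns 1 on a little-endian machine, 0 on a big-endian one. */
int is_little_endian(void)
{
    u32 one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);              /* look at the byte stored first */
    return first == 1;
}
```

On a little-endian machine ( the GBA, or a typical PC ), first_half(0x12345678) comes out as 0x5678. On a big-endian machine the same call would give 0x1234.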
Back to Top.