100 Languages Speedrun: Episode 40: x86-64 Assembly
We write in all kinds of programming languages, but in the end, an actual CPU needs to run the code.
A CPU sees a list of numbers. It takes a few of those numbers at a time, interprets them as an instruction to run, then goes to the next one. At least that's the general idea; CPUs are now far more complex than that. To keep the post approachable I'll do my best to ignore all that extra complexity.
We'll be using x86-64 assembly. x86 is the instruction set originally introduced by the 16-bit Intel 8086 processor, which then got extended to 32-bit and finally to 64-bit, picking up all sorts of extensions along the way. x86-64 just means x86 in 64-bit mode.
Currently there are only two CPU architectures that matter. x86-64 is completely dominant with PCs and laptops. ARM is completely dominant with phones, tablets, and other smart devices. Generally x86-64 offers better performance, while ARM offers better power efficiency. Other kinds of CPUs see a lot less use.
x86-64 assembly for OSX, Linux, and Windows works the same as far as calculations go, but each operating system has its own way of being asked to print data and so on. Code for this episode will run on OSX.
How to run assembly
The names "machine code" and "assembly" are often used interchangeably, as it's generally clear from the context, but "machine code" refers to the numbers the CPU sees, while "assembly" refers to the human-readable text that is turned into those numbers in a pretty much one-to-one way.
Enough introduction, let's brew install nasm and write some code:
global start
section .text
start:
; Tell operating system to exit with code 7
mov rax, 0x2000001 ; B8 01 00 00 02
mov rdi, 7 ; BF 07 00 00 00
syscall ; 0F 05
section .data
And compile it:
$ nasm -f macho64 exit.asm
$ ld -static -o exit exit.o
$ ./exit
$ echo $?
7
This is as simple an assembly program as it gets.
Step by step:
- I put in comments the numbers this code gets turned into
- global start and start: are there to tell the operating system where we want to start our program.
- section .text starts the executable code, that doesn't really have anything to do with text, but the name stuck
- section .data starts the data section, which in our case is empty
- rax, rdi etc. are 64-bit registers, which are used to store numbers - CPU has a bunch of those registers, which it can access super fast, if you need more data, you need to put it in "memory", which is a lot slower
- mov rax, 0x2000001 means rax = 0x2000001
- mov rdi, 7 means rdi = 7
- syscall means to call operating system to do something - arguments are passed in registers - on OSX system call 0x2000001 is for exit, and rdi contains exit code - which is generally used to tell the parent process if we succeeded or not, we can see it in shell with $?
- due to its complicated history OSX has multiple sets of system calls, 0x0200.... are where all the usual ones go. On Linux the interface is very similar, mainly the system call numbers are different.
- for register names r prefix means 64bit, e prefix means 32bit, without any means 16bit, and there are special names for 8bit. There's also a lot of special registers for floating point numbers, doing multiple operations at once, and so on. CPUs are really complicated, but we'll be sticking to simple stuff.
- first we use nasm to compile one assembly file to "object file", then we use ld to gather a bunch of object files into a final executable. This two step process is fairly common with many compiled languages.
- nasm -f macho64 just means OSX 64-bit.
- ld -static means we don't want to link to any libraries and we'll do everything on our own - so no printf, atoi, malloc or such, just raw assembly and syscalls.
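The byte comments in the listing can be reproduced by hand. Here is a minimal sketch in Python (not assembly, so it runs anywhere), assuming the short "mov r32, imm32" form: opcode 0xB8 plus the register number, followed by the 32-bit value in little-endian order. The names encode_mov_imm32, encode_exit_program and REGISTER_NUMBERS are made up for this illustration and cover only the two registers used above.

```python
import struct

# Register numbers used by the "mov r32, imm32" encoding (only the two we need here).
REGISTER_NUMBERS = {"eax": 0, "edi": 7}


def encode_mov_imm32(register, value):
    """Encode `mov <register>, <value>` as opcode 0xB8+reg, then 4 little-endian bytes."""
    opcode = 0xB8 + REGISTER_NUMBERS[register]
    return bytes([opcode]) + struct.pack("<I", value)


def encode_exit_program(exit_code):
    """Bytes for: mov eax, 0x2000001 / mov edi, exit_code / syscall."""
    syscall = bytes([0x0F, 0x05])
    return (
        encode_mov_imm32("eax", 0x2000001)
        + encode_mov_imm32("edi", exit_code)
        + syscall
    )


if __name__ == "__main__":
    # Prints: b8 01 00 00 02 bf 07 00 00 00 0f 05
    print(encode_exit_program(7).hex(" "))
```

The value 0x2000001 shows up as 01 00 00 02 because the lowest byte is stored first.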
Disassembly
We can do it in reverse, and use objdump to turn binary data back into readable text:
$ objdump -x86-asm-syntax=intel -d exit.o
exit.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
0000000000000000 start:
0: b8 01 00 00 02 mov eax, 33554433
5: bf 07 00 00 00 mov edi, 7
a: 0f 05 syscall
You should already have objdump installed. We need to pass it -x86-asm-syntax=intel to specify that we want the Intel / NASM syntax. For stupid historical reasons there are multiple different syntaxes, and GNU tooling defaults to a non-standard one (AT&T syntax), with a lot of sigils and with the order of arguments backwards:
$ objdump -d exit.o
exit.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
0000000000000000 start:
0: b8 01 00 00 02 movl $33554433, %eax
5: bf 07 00 00 00 movl $7, %edi
a: 0f 05 syscall
Oh wait, why are they referencing eax, not rax? eax is the bottom 32 bits of the 64-bit register rax. When writing to eax, the top 32 bits are all automatically set to 0, so the shorter mov eax form leaves rax with exactly the same value.
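To make the register naming concrete, here is a small Python model of what a write to each name does to the full 64-bit rax. It is only a sketch: the function names are invented for this example, and the behaviour of the 16-bit and 8-bit names (they keep the upper bits instead of clearing them) is a general x86-64 rule that goes slightly beyond what the paragraph above states.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def write_eax(rax, value):
    """Writing the 32-bit name replaces the low 32 bits and zeroes the top 32."""
    return value & 0xFFFFFFFF


def write_ax(rax, value):
    """Writing the 16-bit name replaces the low 16 bits and keeps the rest."""
    return (rax & (MASK_64 ^ 0xFFFF)) | (value & 0xFFFF)


def write_al(rax, value):
    """Writing the low 8-bit name replaces the low 8 bits and keeps the rest."""
    return (rax & (MASK_64 ^ 0xFF)) | (value & 0xFF)


if __name__ == "__main__":
    old = 0x1122334455667788
    print(hex(write_eax(old, 7)))  # 0x7
    print(hex(write_ax(old, 7)))   # 0x1122334455660007
    print(hex(write_al(old, 7)))   # 0x1122334455667707
```

Only the 32-bit write wipes the old contents, which is why the assembler can safely swap mov rax for mov eax when the value fits in 32 bits.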
Hello, World!
All right, let's write a simple program that prints "Hello, World!" to the screen.
For that we'll need a second system call, write, that takes three arguments: file descriptor, memory address, and amount of data we're writing. Every program starts with 0 as "standard input", 1 as "standard output", and 2 as "standard error", and any extra files or Internet connections we open get extra file descriptor numbers.
global start
section .text
start:
; write(1, "Hello, World!\n", 14)
mov rax, 0x2000004
mov rdi, 1
mov rsi, hello
mov rdx, 14
syscall
; exit(0)
mov rax, 0x2000001
mov rdi, 0
syscall
section .data
hello:
db "Hello, World!", 10
You probably won't be too surprised by the result:
$ nasm -f macho64 hello.asm
$ ld -static -o hello hello.o
$ ./hello
Hello, World!
Step by step:
- in .data section we add a label hello, and put "Hello, World!", 10 there. 10 is just the code for \n, as nasm doesn't process escape codes in double-quoted strings.
- we calculated the length of it manually to be 14 - nasm can definitely help with that, and we'll get to how
- mov rsi, hello doesn't move the string to rsi - rsi only contains numbers, it cannot contain strings - it just means to assign to rsi the memory address where hello is.
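The same write call can be tried from a higher-level language. A minimal sketch, assuming Python's os.write, which takes a file descriptor and a bytes object and hands them to the operating system; the memory address and the length are taken from the bytes object, so only the file descriptor is passed explicitly. MESSAGE and write_hello are names made up for this example.

```python
import os

MESSAGE = b"Hello, World!\n"  # 13 visible characters plus the newline byte (10)


def write_hello(fd):
    """Ask the operating system to write MESSAGE to the given file descriptor."""
    return os.write(fd, MESSAGE)


if __name__ == "__main__":
    write_hello(1)  # 1 is standard output
```

The return value is the number of bytes written, which for this short message is the same 14 the assembly version puts in rdx.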
Loop
Let's write a simple loop that prints Hello, World! five times.
global start
section .text
start:
; rbx = 5
; start the loop
mov rbx, 5
jmp loop_check
loop_iteration:
; inside loop body
; write(1, "Hello, World!\n", 14)
mov rax, 0x2000004
mov rdi, 1
mov rsi, hello
mov rdx, hello_len
syscall
; rbx -= 1
dec rbx
loop_check:
; check if we want to run the loop or not
; if (rbx != 0) goto loop_iteration
cmp rbx, 0
jne loop_iteration
; we're outside the loop now
; exit(0)
mov rax, 0x2000001
mov rdi, 0
syscall
section .data
hello:
db "Hello, World!", 10
hello_len: equ $ - hello
It does what we expect:
$ ./loop
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
What's going on here:
- putting hello_len: equ $ - hello just after hello defines a constant, that says "how far are we from start of hello". This is just a number and isn't stored into memory. This way we can have nasm do string lengths and such for us
- we store iteration counter in rbx, starting from 5 and ending the loop when it hits 0
- we can either put loop body first, or loop condition first
- jmp loop_check simply jumps to loop_check
- cmp rbx, 0 compares if rbx is > 0, = 0, or < 0 and sets some CPU flags
- after cmp runs, we can do jne loop_iteration which means jump if not equal - it checks the flags set by the last instruction that set them
- this isn't the "optimal" way to do this
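The split between "cmp sets flags" and "the jump reads flags" can be modelled in a few lines of Python. This is a sketch under simplifying assumptions: only three of the CPU flags are modelled, inputs are assumed to fit in signed 64 bits, and the names cmp_flags, jne_taken, jl_taken and countdown are invented for this example. jl ("jump if less") is included because the later examples use it.

```python
MASK_64 = (1 << 64) - 1


def cmp_flags(a, b):
    """Model `cmp a, b` for signed 64-bit values: compute a - b, keep only the flags."""
    exact = a - b
    wrapped = exact & MASK_64  # what a 64-bit register would hold
    return {
        "zero": wrapped == 0,  # ZF: the result was 0
        "sign": bool(wrapped >> 63),  # SF: top bit of the result
        "overflow": not -(1 << 63) <= exact < (1 << 63),  # OF: result did not fit
    }


def jne_taken(flags):
    """jne jumps when the zero flag is clear."""
    return not flags["zero"]


def jl_taken(flags):
    """jl jumps when the sign and overflow flags differ."""
    return flags["sign"] != flags["overflow"]


def countdown(start):
    """Same shape as the loop above: check first, then body, then decrement."""
    iterations = 0
    rbx = start
    while jne_taken(cmp_flags(rbx, 0)):
        iterations += 1  # loop body
        rbx -= 1  # dec rbx
    return iterations
```

Calling countdown(5) runs the body five times, and countdown(0) never enters it, because the jump to the check happens before the first iteration.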
Print numbers
If we linked with the C standard library, we could use printf, but that's not what we're going for.
Instead let's write our own print_number function:
global start
section .text
; number to use goes into rax
print_number:
; we'll build the string to print backwards
; so 1234 will be built step by step as
; "\n" "4\n" "34\n" "234\n" "1234\n"
mov rbx, buffer_last_byte
mov [rbx], byte 10
print_number_loop:
; make space for another character
dec rbx
; div is more complicated: it uses 2 registers for input, and 2 for output
; input is always rdx:rax
; rdx = input % 10
; rax = input / 10
mov rdx, 0
mov rdi, 10
div rdi
; we add 48 to turn numbers 0-9 to ascii codes for digits 48-57
; then store it in a string
add rdx, 48
mov [rbx], dl
; if rax is 0, we're done, otherwise continue
cmp rax, 0
jne print_number_loop
; time to tell operating system what we want to print
; we know how many bytes to print by how far rbx moved from end of the buffer
; write(1, rbx, buffer_after-rbx)
mov rax, 0x2000004
mov rdi, 1
mov rsi, rbx
mov rdx, buffer_after
sub rdx, rsi
syscall
; return to caller
ret
start:
; call print_number(12345678)
; it saves return address on stack
; when ret is called we return to continue this code
mov rax, 12345678
call print_number
; exit(0)
mov rax, 0x2000001
mov rdi, 0
syscall
section .data
buffer:
times 32 db " " ; 32 bytes: room for 20 digits and a newline
buffer_last_byte: equ $ - 1
buffer_after: equ $
Hopefully comments in the code explain enough. It works just fine:
$ ./print_number
12345678
Oh wait, what happens if we didn't leave it enough memory, and the number is too big? The program keeps writing digits before the start of the buffer, overwriting whatever is stored there, or crashes. And hackers can take advantage of that, and take over your computer. Better hope you did it right.
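For comparison, here is the same digit-by-digit algorithm as a Python sketch, with the bounds check that the assembly version leaves out. The function name format_number and the 32-byte default buffer size are choices made for this example only.

```python
def format_number(number, buffer_size=32):
    """Build the decimal digits of a non-negative integer backwards in a fixed buffer."""
    buffer = bytearray(b" " * buffer_size)
    position = buffer_size - 1  # buffer_last_byte
    buffer[position] = 10  # "\n"
    while True:
        position -= 1  # dec rbx
        if position < 0:
            # The assembly version has no such check and would write outside the buffer.
            raise OverflowError("number does not fit in the buffer")
        number, digit = divmod(number, 10)  # div: quotient in rax, remainder in rdx
        buffer[position] = 48 + digit  # ASCII code of the digit
        if number == 0:
            break
    return bytes(buffer[position:])  # the bytes between rbx and buffer_after


if __name__ == "__main__":
    print(format_number(12345678))  # b'12345678\n'
```

With buffer_size=4 the call format_number(1234) raises OverflowError, which is the situation the assembly code silently gets wrong.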
Print numbers with loop
Now that we have our number printing function, we can call it in a loop:
global start
section .text
print_number:
mov rbx, buffer_last_byte
mov [rbx], byte 10
print_number_loop:
dec rbx
mov rdx, 0
mov rdi, 10
div rdi
add rdx, 48
mov [rbx], dl
cmp rax, 0
jne print_number_loop
mov rax, 0x2000004
mov rdi, 1
mov rsi, rbx
mov rdx, buffer_after
sub rdx, rsi
syscall
ret
start:
; r12 = 0
mov r12, 0
; do {
; r12 = r12 + 1; print_number(r12);
; } while (r12 < 10);
loop:
inc r12
mov rax, r12
call print_number
cmp r12, 10
jl loop
; exit(0)
mov rax, 0x2000001
mov rdi, 0
syscall
section .data
buffer:
times 32 db " " ; 32 bytes: room for 20 digits and a newline
buffer_last_byte: equ $ - 1
buffer_after: equ $
Which prints the numbers:
$ ./print_loop
1
2
3
4
5
6
7
8
9
10
You might already be noticing a small problem. If the CPU only has a small number of registers, and every function needs to use some to do its thing, how do we decide which function gets to use which register? There are many very complicated conventions for this. Some registers are "callee-saved", so the called function should save them on the stack and restore them before returning, if it wants to use them. Others are "caller-saved", so the caller can expect them to get overwritten, and if it wants them preserved, the caller needs to save them on the stack. syscall does that too, and some registers (rcx and r11 on x86-64) won't be preserved by the syscall. This usually works well enough.
For now our programs are small enough we can completely ignore the issue.
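As a rough illustration of the callee-saved idea, here is a Python sketch that treats registers as a dictionary and the stack as a list. The register list is the callee-saved set of the System V AMD64 calling convention used on OSX and Linux (minus the stack pointer). The names call_preserving and sloppy_print_number are hypothetical, and the second function is only a stand-in for the print_number above, which overwrites rbx without saving it.

```python
CALLEE_SAVED = ("rbx", "rbp", "r12", "r13", "r14", "r15")


def call_preserving(registers, function):
    """Run `function`, then restore the callee-saved registers it may have changed."""
    stack = [registers[name] for name in CALLEE_SAVED]  # push each register
    function(registers)
    for name in reversed(CALLEE_SAVED):  # pop in reverse order
        registers[name] = stack.pop()


def sloppy_print_number(registers):
    """Stand-in for print_number: it overwrites rbx, rax, rdx, rdi and rsi."""
    for name in ("rbx", "rax", "rdx", "rdi", "rsi"):
        registers[name] = 0
```

After call_preserving the caller finds rbx and r12 unchanged, while rax, rdx, rdi and rsi may hold anything. In real assembly the push and pop instructions would sit at the start and end of the called function itself.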
FizzBuzz
And now that we have it, we can build the FizzBuzz:
global start
section .text
print_number:
mov rbx, buffer_last_byte
mov [rbx], byte 10
print_number_loop:
dec rbx
mov rdx, 0
mov rdi, 10
div rdi
add rdx, 48
mov [rbx], dl
cmp rax, 0
jne print_number_loop
mov rax, 0x2000004
mov rdi, 1
mov rsi, rbx
mov rdx, buffer_after
sub rdx, rsi
syscall
ret
start:
; r12 = 0
mov r12, 0
loop:
; r12 += 1
inc r12
; if (r12 % 3 == 0) go to divides_by_3
mov rdx, 0
mov rax, r12
mov rdi, 3
div rdi
cmp rdx, 0
je divides_by_3
; if (r12 % 5 == 0) go to divides_only_by_5
mov rdx, 0
mov rax, r12
mov rdi, 5
div rdi
cmp rdx, 0
je divides_only_by_5
does_not_divide_by_3_or_5:
mov rax, r12
call print_number
jmp loop_continue
divides_only_by_5:
mov rax, 0x2000004
mov rdi, 1
mov rsi, buzz
mov rdx, buzz_len
syscall
jmp loop_continue
divides_by_3:
; if (r12 % 5 == 0) go to divides_by_3_and_5
mov rdx, 0
mov rax, r12
mov rdi, 5
div rdi
cmp rdx, 0
je divides_by_3_and_5
divides_only_3:
mov rax, 0x2000004
mov rdi, 1
mov rsi, fizz
mov rdx, fizz_len
syscall
jmp loop_continue
divides_by_3_and_5:
mov rax, 0x2000004
mov rdi, 1
mov rsi, fizzbuzz
mov rdx, fizzbuzz_len
syscall
jmp loop_continue
loop_continue:
cmp r12, iterations
jl loop
; exit(0)
mov rax, 0x2000001
mov rdi, 0
syscall
section .data
buffer:
times 32 db " " ; 32 bytes: room for 20 digits and a newline
buffer_last_byte: equ $ - 1
buffer_after: equ $
fizz:
db "Fizz", 10
fizz_len: equ $ - fizz
buzz:
db "Buzz", 10
buzz_len: equ $ - buzz
fizzbuzz:
db "FizzBuzz", 10
fizzbuzz_len: equ $ - fizzbuzz
iterations: equ 100
Which does exactly what we want.
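The branch structure is easier to see without the register shuffling. A Python sketch that follows the same decision order as the assembly, with the matching jump noted next to each test; fizzbuzz_line and fizzbuzz are names made up for this example.

```python
def fizzbuzz_line(number):
    """Follow the same branches as the assembly: test 3 first, then 5 on either path."""
    if number % 3 == 0:  # je divides_by_3
        if number % 5 == 0:  # je divides_by_3_and_5
            return "FizzBuzz"
        return "Fizz"  # divides_only_3
    if number % 5 == 0:  # je divides_only_by_5
        return "Buzz"
    return str(number)  # does_not_divide_by_3_or_5


def fizzbuzz(iterations=100):
    """Lines for 1..iterations, matching the inc r12 / cmp r12, iterations / jl loop."""
    return [fizzbuzz_line(number) for number in range(1, iterations + 1)]


if __name__ == "__main__":
    print("\n".join(fizzbuzz()))
```

Each Python if corresponds to one div followed by cmp rdx, 0 and a conditional jump in the assembly version.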
These examples are all extremely unoptimized just to keep things simple. I also feel like I barely scratched the surface, but this episode is already nearly the longest so far, so I'll just end here. I might do a followup episode after the series is over.
Should you use Assembly?
It's good fun as an esoteric language, and it's useful to have some assembly basics if you enjoy CTFs and other hacking challenges.
Assembly has a few real world uses:
- compilers need to write some assembly somehow, so if you're writing a compiler, you might want some familiarity with it. It's definitely possible to create a programming language without it - nowadays most compiled languages just target LLVM or JVM JIT or whatnot, and have the VM deal with all the assembly issues, and interpreted ones just use C, but traditionally compilers were turning source code into assembly
- programs sometimes use assembly for some extremely performance sensitive code, like crypto, or decoding video etc. - but that's less and less common, and it takes an insane amount of effort to beat what compilers do; almost always this time is better spent on careful benchmarking and giving the compiler hints on how to optimize things better - assembly is really no magic here
- if you want to write binary exploits, you'll generally need some assembly
As for writing real programs in it, it would be completely insane.
I can't verify it, but rumor has it that in Japan, where programmers work 100 hours a week on salaries lower than what a Walmart cashier makes in the US, people still commonly wrote whole programs in assembly well into this century. So I guess if you have slave labor available, that's a thing you could do. Otherwise, don't bother.
Code
All code examples for the series will be in this repository.