The C Program Journey

Ever wondered what happens behind the scenes when you compile a C program? If you have, you’re in the right place, since this post demystifies everything that happens along the way. As it turns out, the journey of a C program from the human-readable source file to the final executable has four stages.

Before we delve deep into each of those stages, just for the sake of context, let’s quickly go through the typical process of compiling a C program. We are using gcc and the GNU toolchain, the de facto compiler and build system for C in Linux.

Here’s our sample C program named hello.c:

#include <stdio.h>

#define STRING "hello world\n"

int main(void)
{

	// printing by substituting the macro
	printf(STRING);

	return 0;
}

Next we compile the above source code using gcc like so:

$ gcc hello.c -o hello


Note here that hello.c is the source file to be compiled by gcc. The -o (oh) option tells gcc what to name the resulting executable file (hello). If the -o option is omitted, gcc names the executable file “a.out” by default.

Lastly, we execute the compiled file hello to see the result, which in this case just prints “hello world” on the screen.

$ ./hello
hello world

What just happened was a transformation from source code to executable through four specific stages: pre-processing, compilation, assembly and linking. They are summarized in the following diagram.

[Figure: Journey of a C program]

By default, gcc takes care of all the four stages one after the other to produce the executable. We can instruct gcc to do only what we want by specifying the right command-line switch. Let’s examine each of the four stages in detail.

Stage #1 : Pre-processing

The first stage is pre-processing, during which the following actions take place:

  1. Macros are substituted
  2. Comments are stripped off
  3. Header files are expanded

The pre-processor accepts the .c file and, unless an output file is specified with the -o switch, writes its output to stdout. Here we save the output in hello.i:

$ gcc -E hello.c -o hello.i

Let’s examine the hello.i file.

# 1 "hello.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "hello.c"
# 1 "/usr/include/stdio.h" 1 3 4
...
...
...
# 873 "/usr/include/stdio.h" 3 4
extern FILE *popen (const char *__command, const char *__modes) ;

extern int pclose (FILE *__stream);

extern char *ctermid (char *__s) __attribute__ ((__nothrow__ , __leaf__));

# 913 "/usr/include/stdio.h" 3 4
extern void flockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
extern int ftrylockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__)) ;
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));

# 943 "/usr/include/stdio.h" 3 4

# 2 "hello.c" 2

int main(void)
{

 printf("hello world");

 return 0;
}

Three things can be noticed from the large pre-processed file.

First, the macro STRING has been substituted with its character string value. Second, the comment we wrote above the printf statement has been stripped off. Third, the header file stdio.h has been expanded into hundreds of lines of code. This shows that the contents of the header file are literally inserted into our source file.

If we search for printf, we’ll get the following:

extern int printf (const char *__restrict __format, ...);

The keyword ‘extern’ tells us that the function printf() is only declared here, not defined. Its definition is external to this file. We will see later how gcc finds the definition of printf().

Stage #2 : Compilation

The second stage is compilation, in which the GNU C compiler accepts the pre-processed hello.i file and outputs the compiled file named hello.s. Note that gcc recognizes pre-processed C source by the “.i” extension of the input file.

$ gcc -S hello.i

Viewing the hello.s file reveals that the C statements have been replaced with assembly language directives and instructions.

	.file	"hello.c"
	.section	.rodata
.LC0:
	.string	"hello world"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	$.LC0, %edi
	movl	$0, %eax
	call	printf
	movl	$0, %eax
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (Debian 4.9.2-10) 4.9.2"
	.section	.note.GNU-stack,"",@progbits

Stage #3 : Assembly

The third stage is assembly, in which the compiled output is passed to the assembler. The assembler expects its input file’s extension to be “.s” and produces an intermediate file with the extension “.o”.

$ gcc -c hello.s

The compiled file hello.s from the previous stage is nothing but a set of assembly instructions and directives to be interpreted by the assembler. gcc internally calls the GNU Assembler (as) to translate the assembly-level instructions in the compiled file into machine-level code. This machine code is also known as object code. You can also call as directly to process hello.s, instead of going through gcc, like so:

$ as hello.s -o hello.o

At this stage only the existing code is converted into machine language; calls to functions such as printf() are not yet resolved.

Since the output of this stage is a machine-level file (hello.o), its content is not human-readable. If we still try to open hello.o and view it, we’ll see something that is mostly unreadable.

[Figure: ELF object file]

The only thing we can make out by looking at the hello.o file is the string ELF. ELF stands for Executable and Linkable Format. It is the format used for the machine-level object files and executables that gcc produces on Linux. Prior to this, a format known as a.out was used. ELF is a more sophisticated format than a.out.

Note that if you compile your code without specifying the name of the output file, the output file produced is still named “a.out”, even though its format is now ELF. The default executable file name has nothing to do with the format of the machine code; the name a.out is simply kept for historical reasons.

Stage #4 : Linking

This is the last stage, in which the linker combines the object files and performs some housekeeping to produce ready-to-run machine-level code. Calling gcc without any of the -E, -S or -c switches links all the object files to produce the final executable.

$ gcc hello.o -o hello

As discussed earlier, up to this stage gcc doesn’t know about the definition of functions like printf(). Until the linker knows exactly where all of these functions are implemented, the object code simply contains a placeholder for each such function call. It is at this stage that the reference to printf() is resolved against the C library, so that the call ends up at the actual address of printf().

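gcc internally makes use of the GNU linker ld to achieve this task. In principle, you can call ld directly on the object files like so:

$ ld hello.o -o hello

In practice this bare invocation usually fails with an undefined reference to printf, because gcc also passes ld the C library and the startup object files. Running gcc -v hello.o -o hello prints the full linker command that gcc actually uses.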

The linker also does some extra work: it adds standard startup and exit code to our program. For example, there is standard code for setting up the running environment, such as passing command-line arguments and environment variables to the program. Similarly, there is standard code that hands the program’s return value back to the system.

The above tasks of the linker can be verified by a small experiment. We already know that the linker converts the .o file (hello.o) into an executable file (hello). If we compare the sizes of hello.o and hello, we’ll see the difference.

$ size hello.o hello

Through the size command we get a rough idea of how much the output grows from an object file to an executable file. This is largely because of the extra standard code the linker adds to our program.

That’s all there is to it: this is what happens when you compile a C program. Isn’t this beautiful?

Pro Tip: You can pass the -save-temps switch to gcc to keep all the intermediate files in one command.

$ gcc -save-temps hello.c -o hello

The C Inception

It is no surprise to say that the C language shook the computer world. The impact of C should not be underestimated, because it completely changed the perspective from which programming was approached. The birth of the C language was a direct result of the need for a structured, efficient, high-level language capable of replacing assembly code when writing systems programs. On the whole, the C language was designed by programmers, for programmers.

The Mastermind Behind the Invention

Dennis Ritchie

So who is the visionary behind such a versatile programming language? It was Dennis Ritchie, who is often named as one of the greatest and most influential minds in computing. No wonder professionals in the field regard him as a trailblazer.

Born in Bronxville, New York, Ritchie grew up in New Jersey; later, he attended Harvard University and graduated with a degree in physics in 1963. Ritchie’s father was a switching systems engineer at Bell Labs, and it was there that Ritchie spent the better part of his career. His vision was to create a computer language that served two purposes: matching the intellectual caliber of programmers and giving them the freedom to build what they wanted.

In the late 1960s, Ritchie and his fellow programmer Ken Thompson set out to create an operating system that could run on the minicomputers that were then emerging. The result of this undertaking was UNIX. Though the first incarnation of UNIX was versatile, it was considered clunky, mainly because it was written in assembly language, which tied it to the limitations of the particular machine it ran on. Ritchie and Thompson remedied this problem by rewriting UNIX in Ritchie’s own C language by 1973.

Series of Incidents That Led to the Creation of the C Language

The Early Programming Languages

In the early days there was CPL, the Combined Programming Language; it was developed with the goal of supporting high-level, machine-independent programming while still allowing the programmer to control the behavior of each individual bit of information. However, CPL had a major drawback: it was too large to be used in many applications. Later, in 1967, BCPL, or Basic CPL, was developed as a scaled-down version of CPL that retained its basic features. Ken Thompson took this further by developing the B language, which was a scaled-down version of BCPL.

The Multics Project

In the 1960s, Ritchie and several other professionals at Bell Labs (AT&T) worked on a project called Multics. The vision of the project was to build an operating system for a large computer that could be used by perhaps a thousand users. In 1969, Bell Labs pulled out of the project, apparently because it could not produce an economically useful system. Therefore, the employees who had been involved, notably Dennis Ritchie and Ken Thompson, had to look for another project to work on.

Development of the UNIX

Brian W. Kernighan

Eventually, Thompson started working on a new file system. He wrote a version of it in assembler for the DEC PDP-7. This file system was also used for the popular game Space Travel. Soon, improvements and extensions were added to the new file system, applying a lot of knowledge from the Multics project. In a short time, a complete system was created. Brian W. Kernighan named this system UNIX, the name being a sarcastic reference to Multics. The new system was, however, still written in assembly code.

What led to the creation of the B language?

UNIX had an interpreter for the programming language B, besides FORTRAN and an assembler. The B language was developed by Ken Thompson in 1969-70. Before that, code was written in assembly, and programmers had to write pages of code to perform a specific task. With a high-level language like B, it became possible to perform the same task in just a few lines of code. B was therefore used for further development of the UNIX system, since code could be produced faster in B than in assembly.

The Ultimate Development of C Language

One major drawback of the B language was that it had no data types; everything was expressed in machine words. Another shortcoming was that B did not provide ‘structures’. It was these constraints that induced Dennis Ritchie to develop a new language. Between 1971 and 1973, Ritchie developed the C language, keeping most of B’s syntax and making various additions such as data types. Many of its ideas and principles were taken not only from B but also from B’s ancestors, BCPL and CPL. The C language offered a mix of high-level functionality and the low-level features required to program an operating system.

The Greatness of C Language

The power and flexibility of the C language developed by Ritchie soon became apparent. The UNIX operating system, which was originally written in assembly language, was soon rewritten in the new C language; the only thing retained was the assembly code required to bootstrap the C code.

Because the language was so powerful and supple, its use quickly spread beyond Bell Labs to many universities and colleges, mainly because of its close connection to UNIX and the availability of C compilers. In the late 1970s, the language began to displace well-known languages of the time such as ALGOL and PL/I. Programmers all over the world started using it to write all sorts of programs.

For this work, Ritchie, along with Thompson, received the Turing Award in 1983 and the National Medal of Technology in 1999.

The First Book on the Language C

Meanwhile, towards the end of the 1970s, Ritchie and his collaborator Kernighan published a rather slim yet remarkably helpful reference guide to the C language, “The C Programming Language, 1st edition”, which is still held in high regard for its inspiration and practicality. Although the book was written by both Kernighan and Ritchie, Kernighan has said that he had no part in the development of the C language itself; it was entirely Ritchie’s work and all the credit goes to him.

ANSI C and ISO C

For quite some years, the book written by Kernighan and Ritchie served as the de facto standard for the C language. However, various organizations started using their own versions of C, each with slight differences. This posed a serious problem for system developers. To solve it, ANSI, the American National Standards Institute, formed a committee in 1983 to establish a standard definition of the C language. The committee approved a version of C in 1989, known as ANSI C. With a few exceptions, almost every modern C compiler adheres to this standard. In 1990, ANSI C was adopted by ISO, the International Organization for Standardization. Though the correct term should, of course, be ISO C, many people still call it ANSI C.

In later days, UNIX and C both inspired countless descendants such as Linux, C++, the Mac operating system and iOS.

C language: A Language for Programmers

The C language is still widely used today; it is regularly ranked among the most popular programming languages in the world. The development of C is thought to have marked the beginning of the modern era of computer languages. By successfully reconciling the conflicting requirements that had troubled earlier computer languages, C emerged as an efficient, powerful and structured language that was relatively easy to learn. Another crucial aspect of the language was that it was a programmer’s language: it was designed, developed and implemented by working programmers, reflecting their passion for programming. The features of C were tested, honed, thought about and rethought by those who actually used the language. The outcome was a computer language that programmers liked to use.

The C language quickly attracted a large following with an amazing zeal for it, and it was widely and rapidly accepted in the programmer community. Thanks to Dennis Ritchie, C is a language created by and for programmers.