ID3 Algorithm Implementation in C

This tutorial is based on the ID3 explanation shown here, so in the source code you will find the same dataset.

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

In file main.c you can see how the dataset must be declared for this implementation:

char *dataset[] =
{
	"SUNNY",  	"HOT",    	"HIGH",    	"WEAK",		"NO",
	"SUNNY",	"HOT",    	"HIGH",    	"STRONG",	"NO",
	"OVERCAST",	"HOT",    	"HIGH",   	"WEAK",		"YES",
	"RAIN",		"MILD",   	"HIGH",   	"WEAK",		"YES",
	"RAIN",		"COOL",		"NORMAL",  	"WEAK",		"YES",
	"RAIN",		"COOL",  	"NORMAL",  	"STRONG",	"NO",
	"OVERCAST",	"COOL",  	"NORMAL",  	"STRONG",	"YES",
	"SUNNY", 	"MILD",  	"HIGH",    	"WEAK",		"NO",
	"SUNNY", 	"COOL",  	"NORMAL",  	"WEAK",		"YES",
	"RAIN",		"MILD",  	"NORMAL",  	"WEAK",		"YES",
	"SUNNY", 	"MILD",  	"NORMAL",  	"STRONG",	"YES",
	"OVERCAST",	"MILD",  	"HIGH",    	"STRONG",	"YES",
	"OVERCAST",	"HOT", 		"NORMAL",  	"WEAK",		"YES",
	"RAIN",  	"MILD",  	"HIGH",		"STRONG",	"NO",
};

The function you have to call to create the decision tree is id3tree_create(), as you can see in file main.c. This function takes the essential parameters describing the dataset.

The first thing id3tree_create() does is translate the string pointers of the dataset argument into an equivalent table of integer values; comparing values is faster than comparing strings, and simpler to handle during development. Following the dataset sample above, a table like this

Sunny Hot High Weak No
Overcast Cool High Strong No
... ... ... ... ...

is translated to something like

1 2 3 4 5
6 7 3 8 5
... ... ... ... ...

The core of decision-tree creation is the create_leaves() function, which recursively creates new nodes according to the samples being analyzed. First of all, create_leaves() calculates the entropy of the dataset samples indexed by the values in the samples field of the node struct.
If the entropy is 1.0, the samples are evenly split between the classes and no rule can be derived yet.
If the entropy is 0.0, the samples are completely classified.
If the entropy is between 0.0 and 1.0, the function searches for the attribute with the greatest gain and creates a new node for each of that attribute's values.
The formula

Entropy(S) = Σ −p(I) · log2 p(I)

is computed by the calc_entropy_set() function, and the formula

Gain(S, A) = Entropy(S) − Σv ((|Sv| / |S|) · Entropy(Sv))

is computed by the calc_attrib_gain() function. Each node contains these data types:

struct node_t
{
	long			winvalue;
	long			tot_attrib;
	long			*avail_attrib;
	long			tot_samples;
	long			*samples;
	long			tot_nodes;
	struct node_t		*nodes;
};

A short description: winvalue is the attribute value assigned to that node, used during rule extraction; tot_attrib and avail_attrib are used in the entropy calculation of the samples pointed to by samples; tot_nodes and nodes hold information about the child nodes. When create_leaves() returns, you have the complete decision tree.


By now the greatest part of the work is done! We have all the information needed to extract the rules; feel free to navigate the tree as you want.
We can now extract the rules:

Class YES
IF outlook = SUNNY AND humidity = NORMAL
IF outlook = OVERCAST
IF outlook = RAIN AND wind = WEAK

Class NO
IF outlook = SUNNY AND humidity = HIGH
IF outlook = RAIN AND wind = STRONG

NOTE! These sources are an experimental version; many optimizations can still be made, and memory allocations should be checked.

12/01/2010 - I've fixed some little bugs in the pruning of leaves; the new big dataset now works.

You can download source code from here.
Enjoy ;)