ID3 algorithm – Implementation in C


This project is a C implementation of brilliant algorithm invented by Ross Quinlan. This source generates a decision tree from a dataset and extract rules from a categorized dataset.


DAYOUTLOOKTEMPERATUREHUMIDITYWINDPLAY BALL
D1SunnyHotHighWeakNo
D2SunnyHotHighStrongNo
D3OvercastHotHighWeakYes
D4RainMildHighWeakYes
D5RainCoolNormalWeakYes
D6RainCoolNormalStrongNo
D7OvercastCoolNormalStrongYes
D8SunnyMildHighWeakNo
D9SunnyCoolNormalWeakYes
D10RainMildNormalWeakYes
D11SunnyMildNormalStrongYes
D12OvercastMildHighStrongYes
D13OvercastHotNormalWeakYes
D14RainMildHighStrongNo

In file main.c you can see how dataset must be declared for this implementation


char *dataset[] = { “SUNNY”, “HOT”, “HIGH”, “WEAK”, “NO”, “SUNNY”, “HOT”, “HIGH”, “STRONG”, “NO”, “OVERCAST”, “HOT”, “HIGH”, “WEAK”, “YES”, “RAIN”, “MILD”, “HIGH”, “WEAK”, “YES”, “RAIN”, “COOL”, “NORMAL”, “WEAK”, “YES”, “RAIN”, “COOL”, “NORMAL”, “STRONG”, “NO”, “OVERCAST”, “COOL”, “NORMAL”, “STRONG”, “YES”, “SUNNY”, “MILD”, “HIGH”, “WEAK”, “NO”, “SUNNY”, “COOL”, “NORMAL”, “WEAK”, “YES”, “RAIN”, “MILD”, “NORMAL”, “WEAK”, “YES”, “SUNNY”, “MILD”, “NORMAL”, “STRONG”, “YES”, “OVERCAST”, “MILD”, “HIGH”, “STRONG”, “YES”, “OVERCAST”, “HOT”, “NORMAL”, “WEAK”, “YES”, “RAIN”, “MILD”, “HIGH”, “STRONG”, “NO”, NULL };



The function you have to call to create the decision tree is id3tree_create() as you can see in file main.c.
This function requires essential parameters about dataset

  • pointer to dataset
  • dataset columns (attributes columns + class column)
  • dataset rows
  • label for attribute 1 (only for rules explanation)
  • label for attribute 2 (only for rules explanation)
  • NULL (Attention! Don’t miss this null terminator)




So the first thing happens in id3tree_create() function is to translate string pointer of dataset argument
in a equivalent table of values; value compare is faster than string compare, and is more simple to treat in developing.
Following the above dataset sample we have a table like this

SunnyHotHighWeakNo
OvercastCoolHighStrongNo

translated to something like

12345
67385





Decision tree creation core is the create_leaves() function, that recursevely create new nodes according to the analized samples.
First of all create_leaves() function calculates the entropy set of dataset samples indexed by values in samples parameter of node struct.

If entropy set has value 1.0 data is totally random and has no rules.

If entropy set has value 0.0 data contained in samples has been totally classified.

If 0.0 < entropy set < 1.0 function search for greatest gain attribute and create new nodes for all kind of values of attribute.

The formula

Entropy(S) = S -p(I) log2 p(I)



is solved by calc_entropy_set() function and

Gain(S, A) = Entropy(S) – S ((|Sv| / |S|) * Entropy(Sv))



formula is solved by calc_attrib_gain() function.
Nodes contains this data types


struct node_t { long winvalue; long tot_attrib; long *avail_attrib; long tot_samples; long *samples; long tot_nodes; struct node_t *nodes; };



A short description: winvalue is the attribute value assigned to that node, it must be used in rules extraction,
tot_attrib and avail_attrib are used in entropy calculation of samples pointed by samples;tot_nodes and nodes contains info about leaf nodes.
At the end of create_leaves() function you get a tree like this





where

  • -1 is the root node
  • 0 is Sunny value for attribute Outlook
  • 6 is Overcast value for attribute Outlook
  • 8 is Rain value for attribute Outlook
  • 2 is High value for attribute Humidity
  • 11 is Normal value for attribute Humidity
  • 3 is Weak value for attribute Wind
  • 5 is Strong value for attribute Wind
  • 4 is value for Class NO
  • 7 is value for Class YES




By now greatest part of work has done! We have all information to extract rules, feel free to navigate the tree as you want.

We can now extract the rules


Class YES
IF outlook = SUNNY AND humidity = NORMAL IF outlook = OVERCAST IF outlook = RAIN AND Wind = WEAK Class NO
IF outlook = SUNNY AND humidity = HIGH IF outlook = RAIN AND Wind = STRONG



NOTE! This sources are an experimental version, many many optimizations can be done and memory allocations checked.

You can find source code here:

https://github.com/dannyb79/ID3

Happy coding!

Daniele Brunello