a number of note-worthy updates

2016-07-07 23:00:21 -06:00 · 2016-07-07 23:00:21 -06:00 · 9cd58eb5c5
parent 034ba2af59
commit 9cd58eb5c5
5 changed files with 94 additions and 44 deletions
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Karoo GP is an evolutionary algorithm, a Genetic Programming application suite w

 Karoo GP provides a transparent interface to the inner workings of Genetic Programming. As a teaching tool, it enables instructors to showcase, step-by-step, how an evolutionary algorithm arrives to its solution. As a hands-on learning tool, Karoo GP supports rapid, repeatable experimentation with a simple, no-programming-required interface. Included with Karoo GP are two executables: an intuitive Text-based User Interface with built-in, real-world test cases, and a fully scriptable, single-line configuration which provides SciKit Learn-like functionality.

-The Quick Start Tutorial (PDF) offers system requirements, a crash-course in Genetic Programming, and use of Karoo GP for both the novice and advanced user.
+The Quick Start Tutorial (PDF) offers system requirements, a crash-course in Genetic Programming, and use of Karoo GP for both the novice and advanced user. Included are a suite of data manipulation tools and example datasets.

 Feedback is welcomed.

--- a/RELEASE_NOTES.txt
+++ b/RELEASE_NOTES.txt
@@ -1,23 +1,59 @@
-2015 11/04
+2015 11/04 - version 0.9.1.0

 Initial development of Karoo GP began in February 2015, on a Python-based evolutionary algorithm for an MSc research project at the University of Cape Town (UCT) / African Institute for Mathematical Sciences (AIMS) and the Square Kilometer Array (SKA). The myriad debug statements evolved into the user interface while the classic Machine Learning test cases became the built-in example runs.

-In the end, Karoo GP became a flexible, easy-to-use platform for Genetic Programming.
+Over six months of development, the code base grew to become a flexible, easy-to-use platform for Genetic Programming.

-It has been thoroughly tested on a 40-core server at the Square Kilometer Array offices in Cape Town, South Africa, where for one month it worked on 10,000 row datasets for up to 50 hours without a single crash. It is proved as a fully functional, multi-core workhorse.
+Karoo GP has been thoroughly tested on a 40-core server at the Square Kilometer Array offices in Cape Town, South Africa, where for one month it chewed through 10,000-row datasets for up to 50 hours at a time without incident. It has proven to be a fully functional, multi-core workhorse.

-With all development to date conducted locally, this version 0.9 marks the first release to GitHub.
+With all development to date conducted locally, this version 0.9 marks the first release to github.

-This initial GitHub release is private, shared with select collaborators only. Please do not distribute any part of the code until it is made public.
+This initial github release is private, shared with select collaborators only. Please do not distribute any part of the code until it is made public.

 Thank you! --kai



-2015 12/23
+2015 12/23 - version 0.9.1.1

-Discovered that when loading external datasets, Karoo was yet extracting variables (terminals) from the data in the files/ directory, according to the selected kernel.
+It was discovered that when loading external datasets, Karoo was still extracting variables (terminals) from the data in the files/ directory, according to the selected kernel.

 This is now fixed.

 Happy holidays! --kai
+
+
+
+2016 07/07 - version 0.9.1.2
+
+In preparation for public launch of Karoo GP, a number of updates are complete or underway.
+
+The Quick Start Tutorial is being fully revised. A number of corrections were made, but more importantly, new content has been added on the preparation of datasets and the use of the Karoo GP Tools for that purpose. The genetic operator descriptions now feature visuals and revised text, as do many other sections.
+
+In the karoo_gp/tools/ directory, all scripts have been updated. Two of them now offer automated scaling and a user interface, neither of which was present in the original versions, as follows:
+
+	karoo_data_sort.py (formerly karoo_features_sort.py)
+	This script now engages the user with a query for the number of class labels and the number of data points (rows) for the new, randomly generated subset of the parent dataset. This script is designed to be used prior to karoo_normalise.py.
+
+	karoo_normalise.py
+	This script now auto-scales to any number of columns and rows (within the limit of your computer's capability), and features a text-based user interface. This script is designed to be used following karoo_data_sort.py.
+
+	karoo_multiclassifier.py
+	This script functions as before, but with a minor bug fixed in which the final class was mislabeled.
+
+	karoo_iris_plot.py
+	This script functions as before, but with improved in-script documentation and cleaner code.
+
+
+In development now are a number of updates and improvements to the base_class such that Karoo GP will more readily conform to the GP standards, as follows:
+
+1) Karoo GP currently produces only 1 offspring for each parent, where it should produce 2.
+
+2) The tree generation method "Ramped Half/Half" is in its current form only a 50/50 split of Full and Grow methods without a graduated scaling.
+
+3) Karoo GP currently engages a bloat inhibitor, that is, an upper limit on tree depth which is maintained through all modes of mutation and crossover. This will become a user-defined switch such that it can be active or inactive, enabling growth of trees beyond the original, user-defined limit.
+
+4) Karoo GP will be made to launch as a single, command-line function with all required parameters included, SciKit Learn style.
+
+
+Stay tuned for more updates, soon! --kai
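Item 2 above notes that "Ramped Half/Half" is currently a plain 50/50 split of Full and Grow. As a rough illustration of what a graduated ramp could look like, the sketch below spreads a population evenly across a range of depths and alternates Full and Grow on each pass through that range. The function name, its parameters, and the (method, depth) output are assumptions made for this example and are not part of Karoo GP; the function only plans the population and builds no trees.

```python
def ramped_half_and_half_plan(pop_size, depth_min, depth_max):
    """Return a list of (method, depth) pairs, one per tree to be built.

    Hypothetical helper: depths cycle from depth_min to depth_max, and the
    method alternates between 'full' and 'grow' on each pass through the ramp.
    """
    depths = list(range(depth_min, depth_max + 1))
    plan = []
    for i in range(pop_size):
        depth = depths[i % len(depths)]       # step along the depth ramp
        ramp_pass = i // len(depths)          # how many full ramps are done
        method = 'full' if ramp_pass % 2 == 0 else 'grow'
        plan.append((method, depth))
    return plan


if __name__ == '__main__':
    for method, depth in ramped_half_and_half_plan(12, 2, 4):
        print('%s tree, depth %d' % (method, depth))
```

With 12 trees and depths 2 through 4, each depth receives four trees, two Full and two Grow. When the population size is not a multiple of twice the number of depths, the split is only approximately even.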
--- a/karoo_gp_base_class.py
+++ b/karoo_gp_base_class.py
@@ -1,8 +1,15 @@
 # Karoo GP Base Class
-# Define the Karoo GP methods and global variables
+# Define the methods and global variables used by Karoo GP
 # by Kai Staats, MSc UCT / AIMS
 # Much thanks to Emmanuel Dufourq and Arun Kumar for their support, guidance, and free psychotherapy sessions
-# version 0.9.1.1
+# version 0.9.1.2
+
+'''
+A NOTE TO THE NEWBIE, EXPERT, AND BRAVE
+Even if you are highly experienced in Genetic Programming, it is recommended that you review the 'Karoo Quick Start' before running 
+this application. While your computer will not burst into flames nor will the sun collapse into a black hole if you do not, you will 
+likely find more enjoyment of this particular flavour of GP with a little understanding of its intent and design.
+'''

 import csv
 import os
@@ -104,7 +111,7 @@ class Base_GP(object):
 		self.algo_sym = 0 # temp store the sympified polynomial-- CONSIDER MAKING THIS VARIABLE LOCAL
 		self.fittest_dict = {} # temp store all Trees which share the best fitness score
 		self.gene_pool = [] # temp store all Tree IDs for use by Tournament
-		self.core_count = pp.get_number_of_cores()
+		self.core_count = pp.get_number_of_cores() # pprocess
 		
 		return
 		
@@ -215,43 +222,43 @@ class Base_GP(object):
 		func_dict = {'a':'files/functions_ABS.csv', 'b':'files/functions_BOOL.csv', 'c':'files/functions_CLASSIFY.csv', 'm':'files/functions_MATCH.csv', 'p':'files/functions_PLAY.csv'}
 		fitt_dict = {'a':'min', 'b':'max', 'c':'max', 'm':'max', 'p':''}
 		
-		if len(sys.argv) == 1: # load data in the files/ directory
-			data_x = np.loadtxt(data_dict[self.kernel], skiprows = 1, delimiter = ',', dtype = float); data_x = data_x[:,0:-1] # skip right-most column
+		if len(sys.argv) == 1: # load data in the karoo_gp/files/ directory
+			data_x = np.loadtxt(data_dict[self.kernel], skiprows = 1, delimiter = ',', dtype = float); data_x = data_x[:,0:-1] # load all but the right-most column
 			data_y = np.loadtxt(data_dict[self.kernel], skiprows = 1, usecols = (-1,), delimiter = ',', dtype = float) # load only right-most column (class labels)
 			
 			header = open(data_dict[self.kernel],'r')
-			self.terminals = header.readline().split(','); self.terminals[-1] = self.terminals[-1].replace('\n','')
+			self.terminals = header.readline().split(','); self.terminals[-1] = self.terminals[-1].replace('\n','') # load the variables across the top of the .csv
 			
-			self.functions = np.loadtxt(func_dict[self.kernel], delimiter=',', skiprows=1, dtype = str)
+			self.functions = np.loadtxt(func_dict[self.kernel], delimiter=',', skiprows=1, dtype = str) # load the user defined functions (operators)
 			self.fitness_type = fitt_dict[self.kernel]
 			
 		elif len(sys.argv) == 2: # load an external data file
 			print '\n\t\033[36m You have opted to load an alternative dataset:', sys.argv[1], '\033[0;0m'
-			data_x = np.loadtxt(sys.argv[1], skiprows = 1, delimiter = ',', dtype = float); data_x = data_x[:,0:-1]
-			data_y = np.loadtxt(sys.argv[1], skiprows = 1, usecols = (-1,), delimiter = ',', dtype = float)
+			data_x = np.loadtxt(sys.argv[1], skiprows = 1, delimiter = ',', dtype = float); data_x = data_x[:,0:-1] # load all but the right-most column
+			data_y = np.loadtxt(sys.argv[1], skiprows = 1, usecols = (-1,), delimiter = ',', dtype = float) # load only right-most column (class labels)
 			
 			header = open(sys.argv[1],'r')
-			self.terminals = header.readline().split(','); self.terminals[-1] = self.terminals[-1].replace('\n','')
+			self.terminals = header.readline().split(','); self.terminals[-1] = self.terminals[-1].replace('\n','') # load the variables across the top of the .csv
 			
-			self.functions = np.loadtxt(func_dict[self.kernel], delimiter=',', skiprows=1, dtype = str)
+			self.functions = np.loadtxt(func_dict[self.kernel], delimiter=',', skiprows=1, dtype = str) # load the user defined functions (operators)
 			self.fitness_type = fitt_dict[self.kernel]
 			
 		else: print '\n\t\033[31mERROR! You have assigned too many command line arguments at launch. Try again ...\033[0;0m'; sys.exit()
 		
 		
-		### 2) from the dataset, prepare terminals, TRAINING, and TEST data ###
+		### 2) from the dataset, generate TRAINING and TEST data ###
 		if len(data_x) < 11: # for small datasets we will not split them into TRAINING and TEST components
 			data_train = np.c_[data_x, data_y]
 			data_test = np.c_[data_x, data_y]
 			
-		else: # if larger than 10, we run the data through the SciKit Learn random split
-			x_train, x_test, y_train, y_test = skcv.train_test_split(data_x, data_y, test_size = 0.2)
+		else: # if larger than 10, we run the data through SciKit Learn's 'random split' function
+			x_train, x_test, y_train, y_test = skcv.train_test_split(data_x, data_y, test_size = 0.2) # 80/20 TRAIN/TEST split
 			data_x, data_y = [], [] # clear from memory
 			
-			data_train = np.c_[x_train, y_train] # recombine the features with the solutions
+			data_train = np.c_[x_train, y_train] # recombine each row of data with its associated label (right column)
 			x_train, y_train = [], [] # clear from memory
 			
-			data_test = np.c_[x_test, y_test] # recombine the features with the solutions
+			data_test = np.c_[x_test, y_test] # recombine each row of data with its associated label (right column)
 			x_test, y_test = [], [] # clear from memory
 			
 		self.data_train_cols = len(data_train[0,:])
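Step 2 in the hunk above applies a simple rule: datasets of 10 rows or fewer are used whole for both TRAINING and TEST, and larger datasets are shuffled and split 80/20. The sketch below restates that rule with the standard library only. It is a stand-in for illustration, not the SciKit Learn call used in the source; split_rows and its seed parameter are hypothetical, and the rounding of the test-set size may differ from SciKit Learn's.

```python
import random


def split_rows(rows, test_size=0.2, seed=None):
    """Return (train, test) lists following the small-dataset rule above.

    Hypothetical helper: with fewer than 11 rows, both sets are the full
    dataset; otherwise the rows are shuffled and split by test_size.
    """
    if len(rows) < 11:
        return list(rows), list(rows)

    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == '__main__':
    train, test = split_rows(list(range(20)), seed=1)
    print('%d training rows, %d test rows' % (len(train), len(test)))
```

Passing a seed makes the shuffle repeatable, which can help when comparing runs; the source as shown does not set one.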
@@ -269,9 +276,9 @@ class Base_GP(object):
 				data_train_dict.update( {self.terminals[col]:data_train[row,col]} ) # to be unpacked in 'fx_fitness_eval'
 				
 			self.data_train_dict_array = np.append(self.data_train_dict_array, data_train_dict.copy())
-		
-		data_train = [] # clear from memory
 			
+		data_train = [] # clear from memory
+		
 		
 		### 4) copy TEST data into an array (rows) of dictionaries (columns) ###
 		data_test_dict = {}
@@ -282,9 +289,9 @@ class Base_GP(object):
 				data_test_dict.update( {self.terminals[col]:data_test[row,col]} ) # to be unpacked in 'fx_fitness_eval'
 				
 			self.data_test_dict_array = np.append(self.data_test_dict_array, data_test_dict.copy())
-		
-		data_test = [] # clear from memory
 			
+		data_test = [] # clear from memory
+		
 		
 		### 5) initialise all .csv files ###
 		self.filename = {} # a dictionary to hold .csv filenames
@@ -298,11 +305,11 @@ class Base_GP(object):
 		target.close()
 		
 		self.filename.update( {'f':'files/population_f.csv'} )
-		target = open(self.filename['f'], 'w') # initialise the .csv file for the final population (used to test)
+		target = open(self.filename['f'], 'w') # initialise the .csv file for the final population (test)
 		target.close()
 		
 		self.filename.update( {'s':'files/population_s.csv'} )
-		# do NOT initialise this .csv file, as it is retained for loading a previous run
+		# do NOT initialise this .csv file, as it is retained for loading a previous run (recover)
 		
 		return
 		
@@ -310,7 +317,8 @@ class Base_GP(object):
 	def fx_karoo_data_recover(self, population):
 	
 		'''
-		[need to write]
+		This method is used to load a saved population of trees into the current population. As invoked through the
+		(pause) menu, this loads population_s to replace population_a.
 		'''
 		
 		with open(population, 'rb') as csv_file:
@@ -348,8 +356,8 @@ class Base_GP(object):
 		
 		'''
 		As used by the method 'fx_karoo_gp', this method constructs the initial population based upon the user-defined 
-		Tree type and quantity. As "ramped half/half" is an industry standard, it was hard-coded into this method. But 
-		the ratio of Full to Grow Trees may be easily modified, below.
+		Tree type and quantity. "Ramped half/half" is currently not ramped, rather split 50/50 Full/Grow. This will be
+		updated with the next version of Karoo GP.
 		'''
 		
 		if self.display == 'i' or self.display == 'g':
@@ -1426,7 +1434,7 @@ class Base_GP(object):
 		'''
 		This multiclass classifier compares each row of a given Tree to the known solution, comparing estimated values 
 		(labels) generated by Karoo GP against the correct labels. This method is able to work with any number of class 
-		labels, from 2 to n. The first label bin includes -inf. The last label bin includes +inf. Those in between are 
+		labels, from 2 to n. The left-most bin includes -inf. The right-most bin includes +inf. Those in between are 
 		by default confined to the spacing of 1.0 each, as defined by:
 		
 			(solution - 1) < result <= solution
@@ -1435,8 +1443,6 @@ class Base_GP(object):
 		origin. At the time of this writing, an odd number of class labels will generate an extra bin on the positive 
 		side of origin as it has not yet been determined the effect of enabling the middle bin to include both a 
 		negative and positive space.
-		
-		Commented in the code is another 
 		'''
 		
 		# tested 2015 10/18
@@ -1448,11 +1454,11 @@ class Base_GP(object):
 			fitness = 1
 			if self.display == 'i': print '\t\033[36m data row', row, 'yields class label:\033[1m', int(solution), 'as', result, '<=', int(0 - skew), '\033[0;0m'
 			
-		elif solution == self.class_labels - 1 and result > (solution - 1) - skew: # check for last class
+		elif solution == self.class_labels - 1 and result > solution - 1 - skew: # check for last class
 			fitness = 1
-			if self.display == 'i': print '\t\033[36m data row', row, 'yields class label:\033[1m', int(solution), 'as', result, '>', int(solution - skew), '\033[0;0m'
+			if self.display == 'i': print '\t\033[36m data row', row, 'yields class label:\033[1m', int(solution), 'as', result, '>', int(solution - 1 - skew), '\033[0;0m'
 			
-		elif (solution - 1) - skew < result <= solution - skew: # check for class bins between first and last
+		elif solution - 1 - skew < result <= solution - skew: # check for class bins between first and last
 			fitness = 1
 			if self.display == 'i': print '\t\033[36m data row', row, 'yields class label:\033[1m', int(solution), 'as', int(solution - 1 - skew), '<', result, '<=', int(solution - skew), '\033[0;0m'
 			
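The bin logic in the hunk above can be read in isolation as a small function: the first class is open toward -inf, the last class is open toward +inf, and every class in between owns a bin of width 1.0. The sketch below mirrors the two conditions shown in the diff. The condition for the first class is inferred from its print statement, since the 'if' line itself falls outside the hunk. The value of skew is also defined outside this excerpt, so it is passed in as an argument rather than computed; classify_hit is a hypothetical name.

```python
def classify_hit(result, solution, class_labels, skew):
    """Return 1 if 'result' falls in the bin for label 'solution', else 0.

    Hypothetical stand-alone version of the checks in the hunk above;
    'skew' shifts the bins and is supplied by the caller.
    """
    if solution == 0 and result <= 0 - skew:
        return 1    # first class: bin is open toward -inf (inferred condition)
    elif solution == class_labels - 1 and result > solution - 1 - skew:
        return 1    # last class: bin is open toward +inf
    elif solution - 1 - skew < result <= solution - skew:
        return 1    # interior class: bin of width 1.0
    return 0


if __name__ == '__main__':
    # three class labels with skew supplied as 0, purely for illustration
    for result, solution in [(-5.0, 0), (0.5, 1), (7.3, 2), (1.5, 1)]:
        print('%s %s %s' % (result, solution, classify_hit(result, solution, 3, 0)))
```

Compared with the removed line, the last-class print in the new code reports the same boundary, solution - 1 - skew, that the condition actually tests.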
@@ -1861,7 +1867,7 @@ class Base_GP(object):
 		branch_top = np.random.randint(2, len(tree[3])) # randomly select a non-root node
 		branch_eval = self.fx_eval_id(tree, branch_top) # generate tuple of 'branch_top' and subsequent nodes
 		branch_symp = sp.sympify(branch_eval) # convert string into something useful
-		branch = np.append(branch, branch_symp)
+		branch = np.append(branch, branch_symp) # append list to array
 		
 		branch = np.sort(branch) # sort nodes in branch for Crossover Reproduction.
 		
@@ -2406,7 +2412,7 @@ class Base_GP(object):
 		
 		return
 		
-	
+		
 	def fx_test_normalize(self, array):
 	
 		'''
@@ -2423,7 +2429,8 @@ class Base_GP(object):
 		array_max = np.max(array)
 		
 		for col in range(1, len(array) + 1):
-			norm = float((array[col - 1] - array_min) / (array_max - array_min))			
+			norm = float((array[col - 1] - array_min) / (array_max - array_min))
+			norm = round(norm, 4) # round to 4 decimal places
 			array_norm = np.append(array_norm, norm)
 			
 		return array_norm
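The change to fx_test_normalize above is min-max scaling followed by rounding to four decimal places. As a minimal sketch of the same arithmetic in plain Python, assuming a flat list of numbers: the function name normalise and the guard for a zero range are additions for this example and do not appear in the source.

```python
def normalise(values, places=4):
    """Scale values to the range 0..1 and round each to 'places' decimals.

    Hypothetical helper; the zero-range guard is an assumption added here
    to avoid dividing by zero when all values are equal.
    """
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        return [0.0 for _ in values]
    return [round((v - lowest) / float(span), places) for v in values]


if __name__ == '__main__':
    print(normalise([2, 4, 6, 10]))
    print(normalise([1, 2, 4]))
```

Rounding trades a little precision for shorter, more readable values; for the input [1, 2, 4], the middle value 1/3 becomes 0.3333.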
--- a/karoo_gp_main.py
+++ b/karoo_gp_main.py
@@ -2,7 +2,7 @@
 # Use Genetic Programming for Classification and Symbolic Regression
 # by Kai Staats, MSc UCT / AIMS
 # Much thanks to Emmanuel Dufourq and Arun Kumar for their support, guidance, and free psychotherapy sessions
-# version 0.9.1.1
+# version 0.9.1.2

 '''
 A NOTE TO THE NEWBIE, EXPERT, AND BRAVE
--- a/karoo_gp_server.py
+++ b/karoo_gp_server.py
@@ -2,7 +2,14 @@
 # Use Genetic Programming for Classification and Symbolic Regression
 # by Kai Staats, MSc UCT / AIMS
 # Much thanks to Emmanuel Dufourq and Arun Kumar for their support, guidance, and free psychotherapy sessions
-# version 0.9.1.1
+# version 0.9.1.2
+
+'''
+A NOTE TO THE NEWBIE, EXPERT, AND BRAVE
+Even if you are highly experienced in Genetic Programming, it is recommended that you review the 'Karoo Quick Start' before running 
+this application. While your computer will not burst into flames nor will the sun collapse into a black hole if you do not, you will 
+likely find more enjoyment of this particular flavour of GP with a little understanding of its intent and design.
+'''

 import sys # sys.path.append('modules/') # add the directory 'modules' to the current path
 import karoo_gp_base_class; gp = karoo_gp_base_class.Base_GP()