SMILE documentation

Introduction

The SMILE package implements the Linear Genetic Programming (LGP) algorithm in Python, with a scikit-learn style API. LGP is a genetic programming paradigm that represents each program as a linear sequence of instructions. A population of diverse candidate models is initialized randomly and gradually improves its prediction accuracy over a number of generations, training on a randomly sampled training set. After evolution, the model with the highest fitness score (i.e. accuracy on the randomly sampled training set) is returned as the output.

The linear_genetic_programming package keeps the familiar scikit-learn fit/predict API and works with existing scikit-learn modules (e.g. grid search).
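To make the idea of a linear sequence of instructions concrete, here is a minimal sketch of how such a program could be interpreted. The instruction layout (operation, destination, source1, source2), the register ordering and the helper name run_program are assumptions made for illustration; they are not taken from the package source.

```python
import operator

# Hypothetical instruction format: (operation, destination, source1, source2),
# where the last three fields are register indices.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def run_program(program, registers):
    """Execute the instructions one after another on a copy of the registers."""
    regs = list(registers)
    for op, dest, src1, src2 in program:
        regs[dest] = OPS[op](regs[src1], regs[src2])
    return regs


# r0 is a calculation register; r1 and r2 hold two feature values.
registers = [0.0, 3.0, 4.0]
program = [
    ("*", 0, 1, 1),  # r0 = r1 * r1
    ("+", 0, 0, 2),  # r0 = r0 + r2
]
result = run_program(program, registers)[0]
```

Evolution then amounts to changing the instruction list (inserting, deleting or altering instructions) and keeping the variants whose output classifies the training samples best.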

LGP API

class linear_genetic_programming.lgp_classifier.LGPClassifier(numberOfInput, numberOfOperation=5, numberOfVariable=4, numberOfConstant=9, max_prog_ini_length=30, min_prog_ini_length=10, maxProgLength=300, minProgLength=10, pCrossover=0.75, pConst=0.5, pInsert=0.5, pRegmut=0.6, pMacro=0.75, pMicro=0.5, tournamentSize=2, maxGeneration=200, fitnessThreshold=1.0, populationSize=1000, showGenerationStat=True, isRandomSampling=True, constInitRange=(1, 11, 1), randomState=None, testingAccuracy=-1, validationScores=None, names=None)

Linear Genetic Programming algorithm with a scikit-learn inspired API.
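As a rough sketch of how the constructor arguments relate to a dataset, the helper below derives numberOfInput and numberOfVariable from the number of features. It follows the guidance in the parameter descriptions that the number of calculation registers should be at least half the number of features. The helper name suggest_lgp_kwargs and the choice to keep the default of 4 as a lower bound are assumptions for illustration, not part of the package.

```python
import math


def suggest_lgp_kwargs(n_features, n_constants=9, default_variables=4):
    """Build LGPClassifier keyword arguments from the number of features."""
    # Use at least half as many calculation registers as there are features,
    # and never fewer than the documented default (an assumption of this sketch).
    n_variables = max(default_variables, math.ceil(n_features / 2))
    return {
        "numberOfInput": n_features,  # X.shape[1] for a 2-D array X
        "numberOfVariable": n_variables,
        "numberOfConstant": n_constants,
    }


kwargs = suggest_lgp_kwargs(n_features=13)
# Expected length of the register_ attribute, per its documented shape.
register_size = sum(kwargs.values())
```

The resulting dictionary could then be passed on as LGPClassifier(**kwargs).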

Parameters
numberOfInput: integer, required

Number of features. It can be obtained using X.shape[1].

numberOfOperation: integer, optional (default=5)

The operation set consists of arithmetic operators (+, -, *, /, ^) and branches (if less, if more).

numberOfVariable: integer, optional (default=4)

Number of additional registers used to aid the calculations performed by a program. It should be at least half the number of features.

numberOfConstant: integer, optional (default=9)

Number of constant registers. Constants are stored in write-protected registers, which are initialized only once, at the beginning, with values from constInitRange.

max_prog_ini_length: integer, optional (default=30)

Maximum program length at initialization.

min_prog_ini_length: integer, optional (default=10)

Minimum program length at initialization.

maxProgLength: integer, optional (default=300)

Maximum program length allowed during evolution.

minProgLength: integer, optional (default=10)

Minimum program length required during evolution.

pCrossover: float, optional, (default=0.75)

Probability of exchanging the genetic information of two parent programs

pConst: float, optional (default=0.5)

Probability that a register is a constant when an instruction is initialized. It is also used in micromutation as the probability that a register mutates into a constant.

pInsert: float, optional (default=0.5)

Probability of insertion during macromutation. An insertion adds a random instruction to the program.

pRegmut: float, optional (default=0.6)

Probability of register mutation during micromutation. It mutates either register1, register2 or the return register.

pMacro: float, optional (default=0.75)

Probability of macromutation. Macromutation operates at the level of the program: it adds or deletes an instruction and therefore affects program size.

pMicro: float, optional (default=0.5)

Probability of micromutation. Micromutation operates at the level of instruction components (micro level) and manipulates registers, operators and constants.

tournamentSize: integer, optional (default=2)

The size of tournament selection, i.e. the number of programs that compete to become part of the next generation.

maxGeneration: integer, optional (default=200)

The number of generations to evolve.

fitnessThreshold: float, optional, (default=1.0)

When not using random sampling, terminate the evolution if threshold is met. When using random sampling, fitnessThreshold has no effect.

populationSize: integer, optional, (default=1000)

Size of population

showGenerationStat: boolean, optional (default=True)

If True, print statistics for each generation. Set to False to save time, since some of the average statistics are time-consuming to compute.

isRandomSampling: boolean, optional (default=True)

Train the genetic algorithm on a randomly sampled dataset (sampled without replacement).

constInitRange: tuple (start, stop, step), optional (default=(1, 11, 1))

Range used to initialize the constant set. The interval is half-open: [start, stop).

randomState: int, optional (default=None)

Controls the randomness of the algorithm.

testingAccuracy: int, optional (default=-1)

Used to save the testing set accuracy score.

validationScores: dict, optional (default=None)

Used to hold validation metrics during a run.

names: list, optional (default=None)

Feature names of the dataset.

Attributes
register_: array of shape (numberOfInput + numberOfVariable + numberOfConstant, )

Register stores the calculation variables, feature values and constants.

bestProg_: class Program

A list of Instructions used for classification calculation

bestEffProg_:

Best program with structural and semantic introns removed

bestProFitness_: float

Training set accuracy score of the best program

bestProgStr_: str

String representation of the best program

bestEffProgStr_: str

String representation of the best program with introns removed

populationAvg_: float

Average fitness of the final generation
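The difference between bestProg_ and bestEffProg_ is easier to see with an example of intron removal. The sketch below covers structural introns only, using the usual backward pass over the program: an instruction is kept if it writes a register that a later kept instruction (or the output) still reads. The tuple format, the choice of register 0 as output and the omission of branches are assumptions of this sketch; the package may implement the step differently.

```python
def remove_structural_introns(program, output_register=0):
    """Keep only the instructions that can influence output_register.

    Each instruction is a (destination, source1, source2) tuple of register
    indices; branches are left out to keep the sketch short.
    """
    needed = {output_register}
    effective = []
    for dest, src1, src2 in reversed(program):
        if dest in needed:
            effective.append((dest, src1, src2))
            # The old value of dest is overwritten here, so only the
            # sources matter for earlier instructions.
            needed.discard(dest)
            needed.update((src1, src2))
    effective.reverse()
    return effective


program = [
    (1, 2, 3),  # r1 = f(r2, r3)  effective: r1 is read by the last instruction
    (4, 2, 2),  # r4 = f(r2, r2)  intron: r4 is never read afterwards
    (0, 1, 3),  # r0 = f(r1, r3)  effective: writes the output register
]
effective = remove_structural_introns(program)
```

Semantic introns (instructions that execute but do not change the result, such as adding zero) need a separate analysis and are not handled here.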

Methods

fit(self, X, y)

Fit the Genetic Program according to X, y.

get_params(self[, deep])

Get parameters for this estimator.

load_model([fname, mode])

Load an LGP object from a pickle file.

load_model_directly(pickle_file_input)

Load an LGP object from a byte stream, such as a pickle file handled by a website.

predict(self, X)

Predict using the best fit genetic model.

predict_proba(self, X)

Probability estimates.

save_model(self[, fname, mode])

Save the current object into a pickle file.

score(self, X, y[, sample_weight])

Return the mean accuracy on the given test data and labels.

set_params(self, **params)

Set the parameters of this estimator.

fit(self, X, y)

Fit the Genetic Program according to X, y.

Parameters
X: array-like, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape = [n_samples]

Target values.

Returns
self: best program for classification

Returns self.
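When isRandomSampling is True, fitting trains on a random subset of the data drawn without replacement. A minimal sketch of such a draw is shown below. The helper name sample_training_subset and the sample_size argument are hypothetical; this page does not say how large a sample the package draws or how often it resamples.

```python
import random


def sample_training_subset(X, y, sample_size, rng=random):
    """Draw sample_size distinct rows of X together with their labels."""
    # random.sample never repeats an index, i.e. it samples without replacement.
    idx = rng.sample(range(len(X)), sample_size)
    return [X[i] for i in idx], [y[i] for i in idx]


X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = [0, 0, 1, 1, 1]
X_sub, y_sub = sample_training_subset(X, y, sample_size=3, rng=random.Random(42))
```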

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

classmethod load_model(fname='lgp.pkl', mode='rb')

Load an LGP object from a pickle file. The file is assumed to be in the same directory.

Parameters
fname: string (default='lgp.pkl')

Name of the pickle file to load

Returns
lgp: LGPClassifier generator

generator

classmethod load_model_directly(pickle_file_input)

Load an LGP object from a byte stream, such as a pickle file handled by a website.

Parameters
pickle_file_input: byte stream

BytesIO input

Returns
lgp: LGPClassifier generator

generator
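Since pickle_file_input is described as a BytesIO byte stream, loading presumably comes down to unpickling from an in-memory file object. The sketch below shows that pattern with the standard library only. A plain dict stands in for a fitted classifier, and nothing here calls the package itself.

```python
import io
import pickle

# A plain dict stands in for a fitted classifier.
model = {"populationSize": 1000, "maxGeneration": 200}

# Simulate a received file: pickled bytes wrapped in an in-memory stream.
stream = io.BytesIO(pickle.dumps(model))

# pickle.load accepts any binary file-like object, including BytesIO.
restored = pickle.load(stream)
```

As with any pickle data, only load streams that come from a trusted source, because unpickling can execute arbitrary code.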

predict(self, X)

Predict using the best fit genetic model.

Parameters
X: array-like or sparse matrix, shape (n_samples, n_features)

Samples.

Returns
C: array, shape (n_samples,)

Returns predicted values.

predict_proba(self, X)

Probability estimates. The returned estimates for all classes are ordered by the label of classes.

Parameters
X: array-like of shape (n_samples, n_features)

Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

Returns
T: array-like of shape (n_samples, n_classes)

Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
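Because the columns follow the order of self.classes_, a probability row can be turned back into a label by looking up the column with the highest value. The sketch below uses plain lists as stand-ins for classes_ and for the output of predict_proba; the helper name and the example values are made up for illustration.

```python
def labels_from_proba(proba, classes):
    """Return, for each row, the class whose column has the highest probability."""
    labels = []
    for row in proba:
        best_col = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_col])
    return labels


classes = ["benign", "malignant"]  # stands in for self.classes_
proba = [[0.9, 0.1], [0.3, 0.7]]   # stands in for predict_proba(X)
labels = labels_from_proba(proba, classes)
```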

save_model(self, fname='lgp.pkl', mode='ab')

Save the current object into a pickle file. The file is assumed to be in the same directory.

Parameters
fname: string (default='lgp.pkl')

file name of the output

Returns
True:

if successfully saved
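Note that the default mode is 'ab' (append, binary). Assuming save_model passes this mode to open() and pickles the object into the file, saving twice under the same name leaves two pickled objects back to back, and a single load returns the oldest one. The sketch below demonstrates this behaviour of the standard pickle module with stand-in dicts; it does not call the package.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lgp.pkl")

# Two saves in append mode leave two pickled objects back to back in the file.
for model in ({"run": 1}, {"run": 2}):
    with open(path, "ab") as f:
        pickle.dump(model, f)

with open(path, "rb") as f:
    first = pickle.load(f)   # a single load returns the oldest object
    second = pickle.load(f)  # a second load on the same handle returns the next
```

If that assumption holds, passing mode='wb' or removing the old file first would keep only the latest model.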

score(self, X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X: array-like of shape (n_samples, n_features)

Test samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight: array-like of shape (n_samples,), default=None

Sample weights.

Returns
score: float

Mean accuracy of self.predict(X) with respect to y.
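For single-label targets, the score can be reproduced by hand as the (optionally weighted) fraction of correct predictions. The following is a plain-Python sketch of that calculation, not the package's implementation; the helper name mean_accuracy is hypothetical.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that match the true labels."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
unweighted = mean_accuracy(y_true, y_pred)  # 3 of 4 correct
weighted = mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1])  # 3 of 5
```

Giving the misclassified sample a weight of 2 lowers the score, since that sample now counts twice in the denominator.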

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: object

Estimator instance.
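The <component>__<parameter> convention can be illustrated by splitting each key at the first double underscore. The sketch below mimics that grouping in plain Python. The step name lgp is a hypothetical pipeline step holding an LGPClassifier, and verbose is a placeholder for a parameter of the outer object.

```python
def split_nested_params(params):
    """Group 'component__parameter' keys by component name."""
    own, nested = {}, {}
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            # Parameter of a nested component, e.g. a pipeline step.
            nested.setdefault(name, {})[sub_key] = value
        else:
            # Parameter of the outer estimator itself.
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": True, "lgp__populationSize": 500, "lgp__pCrossover": 0.8}
)
```

The same key format is what a grid search expects when the classifier is wrapped in a pipeline, for example a grid keyed by lgp__populationSize.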
