IEEE Congress on Evolutionary Computation

Sendai International Center, Sendai, Japan, 25-28 May, 2015

Optimization of Big Data 2015 Competition

BigOpt2015

http://www.husseinabbass.net/BigOpt.html

*National University of Singapore, Department of Electrical and Computer Engineering, 4 Engineering Drive, 117583, Singapore.

**The University of New South Wales, School of Engineering and Information Technology, Canberra, ACT 2600, Australia.

Evolutionary optimization techniques have been very successful in solving multi-modal optimization problems. To date, many competitions have been proposed on evolutionary optimization for both single and multiobjective problems. The community has responded very well to these competitions, which have encouraged a variety of new, highly competitive algorithms.

In the age of Big Data, there is an urgent need to take evolutionary optimization techniques to the next level to solve problems with hundreds, thousands, and even millions of variables. This competition takes a first step towards achieving this objective.

A special session on Big Optimization is organized in conjunction with this competition. Details on the special session can be found at http://www.husseinabbass.net/BigOptSessionCec2015.html

A Big Data problem is not just about size. The frequency and nature of changes in the underlying trends in the data, heterogeneity of data sources, level of noise, and other factors come into play to define the true complexity of a Big Data problem.

One important factor to consider is that solving big problems requires solving some small ones; that is, to manage 1TB of data per day, we need to be able to manage roughly 12MB of data per second. While 12MB does not sound large, it is the "per second" that makes the problem harder; that is, the time constraint imposed on the size.

Classically, data gets accumulated in infrastructures managed by distributed file systems (DFS) such as the Hadoop DFS (HDFS). When the data contains timeseries, it is sometimes more efficient to perform a certain level of data processing at the source before storing the data.

This competition is an example of this situation. In 2015, we will start with a very small scale representing around 20KB of data per second if the data is accessed in its binary form and 0.5MB per second if the data is accessed in its text form. This scale requires solving an optimization problem with close to 5000 variables every second. We will not impose the time constraint on entries to the competition. However, we will use the computational time (or an estimate of it) as a criterion to score entries.

In future years, we will increase the size of the dataset to bring it closer to the real-world Big Data challenges for this problem domain.

Evolutionary optimization is a major research area of evolutionary computation. The primary objective of this competition is to push the boundaries of evolutionary optimization beyond the small-scale problems that have been used as benchmark problems in the literature so far.

Evolutionary optimization has successfully solved very large scale "simple" problems such as the one-max function, with a million variables. In global and/or multiobjective optimization problems, however, this scale is beyond the capabilities of existing evolutionary optimization algorithms.

This competition is only the starting point, focusing in 2015 on up to 5000 variables. The competition will evolve over time to problems with tens of thousands of variables once the community is ready for this scale. The aim of this competition is to create next-generation evolutionary optimization algorithms with competitive performance on both quality of solutions and convergence speed.

The series of problems used in this competition are both single-objective and multi-objective nonlinear optimization problems. Each optimization problem is based on a number of interdependent time series forming a dataset. The number of time series together with the length of each time series defines the number of variables in the problem. For this competition, each time series is of length 256; thus, each optimization problem has a multiple of 256 variables. For example, a dataset with 4 timeseries defines a 1024-variable optimization problem. In this year's competition, the aim is to solve problems with 1024, 3072 and 4864 variables.

Assume the matrix *X* is of dimension *N × M*, where *N* is the number of inter-dependent timeseries and *M* is the length of each timeseries. Let *S* be a matrix of *N × M*, with *N* independent timeseries of length *M*, such that, given *A*, an *N × N* linear transformation matrix,

*X = A · S*

The problem is to decompose the matrix *S* into two matrices, *S1* and *S2*, with the same dimensionality as *S*; that is, *S = S1 + S2* and *X = A · S1 + A · S2*. Let *C* be the matrix of Pearson correlation coefficients between *X* and *A · S1*; that is,

*C = covar(X, A · S1) / √(var(X) × var(A · S1))*

The objective is to maximize the diagonal elements of *C*, while
minimizing the off-diagonal elements towards zero. At the same time, the
distance between *S* and *S1* should be as small as possible; that is,
*S1* needs to be as similar as possible to *S*.
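The correlation term above can be sketched with NumPy. This is a minimal illustration under the definitions in this section; the function name and the row-wise layout (one timeseries per row) are our own assumptions, not part of the competition code:

```python
import numpy as np

def correlation_matrix(X, A, S1):
    """Pearson correlations between the rows of X and the rows of A @ S1.

    C[i, j] = covar(X_i, (A @ S1)_j) / sqrt(var(X_i) * var((A @ S1)_j))
    """
    Y = A @ S1                               # reconstructed mixtures, N x M
    Xc = X - X.mean(axis=1, keepdims=True)   # centre each timeseries
    Yc = Y - Y.mean(axis=1, keepdims=True)
    cov = Xc @ Yc.T / X.shape[1]             # N x N covariance matrix
    return cov / np.outer(Xc.std(axis=1), Yc.std(axis=1))
```

A perfect decomposition drives *C* towards the identity matrix: diagonal entries of 1 and off-diagonal entries of 0.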

This definition generates two formulations of the problem: one as a single-objective problem and the other as a multiobjective problem.

Given *X*, *A*, and *S*, find *S1* which optimizes the
following two objective functions:

*Minimize*
f_1 = (1 / (N × M)) ∑_{i,j} (S_{ij} − S1_{ij})²

*Minimize*
f_2 = (1 / (N² − N)) ∑_{i, j ≠ i} C_{ij}² + (1 / N) ∑_{i} (1 − C_{ii})²

Given *X*, *A*, and *S*, find *S1* which optimizes the
following objective function:

*Minimize*
f_{1} + f_{2}

For the sake of this exercise, all variables in the optimization problem are
limited to −8 ≤ S1_{ij} ≤ 8.
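Under the definitions above, the two objectives can be sketched in NumPy as follows (an illustrative implementation; the function names are ours, and *C* is assumed to be the *N × N* correlation matrix defined earlier):

```python
import numpy as np

def f1(S, S1):
    """Mean squared distance between the candidate S1 and the sources S."""
    N, M = S.shape
    return np.sum((S - S1) ** 2) / (N * M)

def f2(C):
    """Deviation of the correlation matrix C from the identity:
    off-diagonal terms pushed to 0, diagonal terms pushed to 1."""
    N = C.shape[0]
    off_diag = np.sum(C ** 2) - np.sum(np.diag(C) ** 2)  # sum of C_ij^2, j != i
    diag = np.sum((1.0 - np.diag(C)) ** 2)
    return off_diag / (N * N - N) + diag / N
```

The single-objective formulation simply minimizes the sum f_1 + f_2.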

It is important to emphasize that understanding the context of this problem is not necessary to solve it. The previous abstract definition should be sufficient for many scientists in optimization theory and evolutionary optimization to solve the problem.

For those interested in understanding the technical basis for the data generation method and the above mathematical formulation, this section will provide such details. However, to emphasize the point again, researchers joining the competition do not need to read these domain-specific technical details.

The context of this problem is EEG analysis. A technical description of the data generation method, the multi-objective formulation above, and the blind source separation problem in signal processing for cleaning EEG data are all described in the following two references, in addition to the supplementary document provided below.

- Abbass H.A. (2014) Calibrating Independent Component Analysis with Laplacian Reference for Real-time EEG Artifact Removal, 21st International Conference on Neural Information Processing, L.C. Kiong et al. (Eds.), LNCS 8836, pp. 68–75, 2014, Springer.
- Goh S.K., Abbass H.A., and Tan, K.C. (2014) Artifact Removal from EEG using a Multi-Objective Independent Component Analysis Model, 21st International Conference on Neural Information Processing, C.K. Loo et al. (Eds.), LNCS 8834, pp. 570–577, 2014, Springer.
- A supplementary document is provided here.

There are three datasets used for the competition, with 4, 12, and 19 timeseries. Each timeseries corresponds to a synthetic electrode sensing EEG data and is of length 256, resembling a 256 Hz sampling rate. The three datasets are named D4, D12, and D19, respectively. The three datasets are repeated by adding a noise component to each sample and are renamed D4N, D12N, and D19N. Therefore, we have 6 single-objective optimization problems and another 6 multi-objective optimization problems.

| Noise Level | 4 Channels | 12 Channels | 19 Channels |
| --- | --- | --- | --- |
| 0 | Matrix X [file], Matrix A [file], Matrix S [file] | Matrix X [file], Matrix A [file], Matrix S [file] | Matrix X [file], Matrix A [file], Matrix S [file] |
| 0.1 | Matrix X [file], Matrix A [file], Matrix S [file] | Matrix X [file], Matrix A [file], Matrix S [file] | Matrix X [file], Matrix A [file], Matrix S [file] |

Entries should provide the following files:

- The source code using the naming convention EntryName_Source_FileID. For example, Goh_Source_Objective.c is an entry by Goh, and it is the Objective.c file of the source code. Initial entries may submit results without source code. However, for the winners of the competition, the source code MUST be provided for independent validation.
- The output for each dataset with each seed in a separate file with the
following naming convention: EntryName_ProblemID_Sing/Mult_Seed. For example,
Goh_D4N_Sing_1000 is an entry by Goh, for dataset D4N, using a single-objective
formulation and seed 1000.
- In the case of single objective optimization, this output will include the final population. If the method does not use elitism, the best solution found so far should be added to this file.
- In the case of multiobjective optimization, this file should contain the final Pareto set for this seed.

- A summary file with the following naming convention:
EntryName_ProblemID_Sing/Mult_Sum. For example, Goh_D4N_Sing_Sum is a
summary of the entry by Goh, for dataset D4N, using a single objective
formulation.
- In the case of single objective optimization, this output should include the top five best solutions found for each seed.
- In the case of multiobjective optimization, this file should contain the five samples (as described below) from the Pareto set found for each of the 30 seeds.

Three evaluation criteria are used to evaluate entries to the competition: quality of solutions, speed, and stability as measured by variance of solution qualities with different initial seeds.

The procedures for evaluating single-objective entries into the competition will rely on the following evaluation criteria to rank entries in the order they are presented below.

- Quality: the overall best solution found over 30 runs will be used as a
measure of quality. The smaller this value is, the better. In case of
problems with noise, evaluation will be done using the data without noise.
If there is a tie on quality, speed will be used as follows:
- Speed: convergence speed will be calculated in terms of number of
objective function calculations (number of complete scans of each
dataset). The smaller this value is, the better. If there is a tie on
speed, stability will be used as follows:
- Stability: stability is assessed using the variance of the set of best solutions found over 30 runs. The smaller this value is, the better.

The procedures for evaluating multi-objective entries into the competition will rely on the following evaluation criteria to rank entries in the order they are presented below.

- Quality: five solutions will be sampled uniformly from the Pareto set,
including the two extreme solutions. The average distance between these
five solutions and the corresponding nearest solutions found in the baseline
will be used as a measure of quality. Only those solutions found by an
entry that dominate the corresponding solution in the baseline will be
considered in this metric. For those solutions found by an entry that are
dominated by the corresponding solution in the baseline, a fixed distance
of value -1000 will be used. The larger this distance, the better. In case
of problems with noise, evaluation will be done using the data without
noise. If there is a tie on quality, speed will be used as follows:
- Speed: convergence speed will be calculated in terms of the total
number of objective function calculations (number of complete scans of
each dataset) in all 30 runs. The smaller this value is, the better. If
there is a tie on speed, stability will be used as follows:
- Stability: stability is assessed using the variance of the set of best solutions found over 30 runs. The smaller this value is, the better.

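The multiobjective quality criterion above can be sketched in NumPy as follows. This is our own reading of the rules (in particular, samples that do not dominate their nearest baseline solution are treated as penalised), so take it as an illustration rather than the official scoring code:

```python
import numpy as np

PENALTY = -1000.0  # fixed distance for samples that do not dominate the baseline

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimisation)."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def quality_score(entry_front, baseline_front):
    """Sample five solutions uniformly from the entry's Pareto front
    (including both extremes), find the nearest baseline solution to each,
    and average the distances, substituting PENALTY for non-dominating samples."""
    entry = np.asarray(sorted(map(tuple, entry_front)), dtype=float)  # sort by f1
    idx = np.linspace(0, len(entry) - 1, num=5).round().astype(int)
    base = np.asarray(baseline_front, dtype=float)
    score = 0.0
    for s in entry[idx]:
        nearest = base[np.argmin(np.linalg.norm(base - s, axis=1))]
        score += np.linalg.norm(nearest - s) if dominates(s, nearest) else PENALTY
    return score / len(idx)
```

The larger the score, the better; a front entirely dominated by the baseline scores -1000.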

A baseline is provided for each dataset using NSGA-II for the multiobjective problem and Differential Evolution for the single objective problem.

The baseline is calculated using NSGA-II. In the case of the single objective, NSGA-II was used as-is, with the second objective set to zero. The NSGA-II code we used is credited to http://dces.essex.ac.uk/staff/zhang/webofmoead.htm.

Instructions for obtaining the results for the baselines are as follows:

- Download MOEAD from http://dces.essex.ac.uk/staff/qzhang/code/MOEADDE/MOEA-D-DE.rar
- Correct some minor syntax errors using the hints given by the Visual Studio IDE.
- Create a new folder DATAin in the main folder.
- Add the data files into DATAin.
- Download the new loaddata_bigopt.h [last updated 15 Oct 2014] into the common folder of the MOEAD folder.
- Download the new main_moea.cpp [last updated 15 Oct 2014] and objective.h [last updated 21 Oct 2014].
- In global.h, modify the upper and lower bounds of the decision variables to lowBound = -8, uppBound = 8.
- Build and execute the program.
- A MATLAB function to evaluate the score based on quality of solutions is provided: download Scoreboard.zip [last updated 9 Nov 2014], extract it, and put your Pareto file and the scaling file in the Scoreboard folder.

Use Scoreboard.m to evaluate your score. If scaling was used in the optimization, remember to scale the objectives back to their original values.

Score = ScoreBoard(filename,problemtype,scalefile,MultiOrSingle)

e.g., Score = ScoreBoard('YourPF.txt','D4','YourScale.txt','M') % for MOP

Score = ScoreBoard('YourSO.txt','D4','YourScale.txt','S') % for SOP

YourScale should contain, in order: f1min, f1max, f2min, f2max.

A score will be given. If you are working on multiobjective optimization, a Pareto front compared against the baseline will also be shown.

The baseline results will be available below from October 30.

| Noise Level | 4 Channels | 12 Channels | 19 Channels |
| --- | --- | --- | --- |
| 0 | Matrix S1 [file] (last updated 1 Dec 2014) | Matrix S1 [file] (last updated 1 Dec 2014) | Matrix S1 [file] (last updated 1 Dec 2014) |
| 0.1 | Matrix S1 [file] (last updated 1 Dec 2014) | Matrix S1 [file] (last updated 1 Dec 2014) | Matrix S1 [file] (last updated 1 Dec 2014) |

- Competition Announced: 15th October 2014
- Deadline for receiving entries: 19th December 2014
- Initial Assessment and Validation of Entries Period: 19th December 2014 to 31st January 2015
- Deadline for receiving papers associated with entries for the special session: This is the same as the CEC deadline, currently, 19th December 2014
- Announcement of Winner and Runner-up: During the Conference

Inquiries should be directed to Sim-Kuan Goh at simkuan@nus.edu.sg

We anticipate 6-12 participants in this competition.

National University of Singapore

Sim Kuan Goh received the B.Eng. degree in Electrical Engineering from the National University of Singapore in 2013. He is currently a PhD candidate in Control, Intelligent Systems & Robotics, Department of Electrical and Computer Engineering, National University of Singapore. His research interests include multiobjective optimization, evolutionary computation, and brain-computer interfaces.

University of New South Wales

Hussein Abbass is a full Professor with the University of New South Wales (UNSW-Australia), Canberra Campus, Australia. In 2014, he is spending his sabbatical visiting the Department of Electrical and Computer Engineering, National University of Singapore. Prof. Abbass is a fellow of the UK Operational Research Society and a fellow of the Australian Computer Society. He is an Associate Editor of six international journals, including the IEEE Transactions on Evolutionary Computation, and the IEEE Computational Intelligence Magazine. He has been serving as the Chair of the Emerging Technologies Technical Committee of the IEEE Computational Intelligence Society (IEEE-CIS) for 2 years and has served on many different committees within IEEE-CIS. Prof. Abbass is currently a College Member of the Australian Research Council (ARC) Engineering, Mathematics, and Information Cluster. He was a member of the Research Evaluation Committee of Excellence of Research Australia in 2010. He spent his sabbatical in 2005 at the University of Illinois – Urbana Champaign, and was a UNSW John-Yu Fellow at Imperial College London in 2003. He published 200+ refereed papers. His current research interest is in computational red teaming and integrating human brain data with advanced analytics and automation.

National University of Singapore

Kay Chen Tan is currently an Associate Professor in the Department of Electrical and Computer Engineering, National University of Singapore. He has published over 100 journal papers and over 100 papers in conference proceedings, and has co-authored 5 books. Dr Tan has been an Invited Keynote/Plenary speaker for over 40 international conferences. He is the General Co-Chair for the 2016 IEEE World Congress on Computational Intelligence to be held in Vancouver, Canada. Dr Tan is an elected member of AdCom (2014-2016) and was an IEEE Distinguished Lecturer of the IEEE Computational Intelligence Society from 2011 to 2013. Dr Tan is the Editor-in-Chief of IEEE Transactions on Evolutionary Computation beginning in 2015. He was the Editor-in-Chief of IEEE Computational Intelligence Magazine from 2010 to 2013. He currently serves as an Associate Editor / Editorial Board member of over 20 international journals, such as IEEE Transactions on Cybernetics, IEEE Transactions on Computational Intelligence and AI in Games, and Evolutionary Computation (MIT Press). Dr Tan is a Fellow of IEEE. He is the awardee of the 2012 IEEE Computational Intelligence Society (CIS) Outstanding Early Career Award and the recipient of the Recognition Award (2008) from the International Network for Engineering Education & Research (iNEER).