IEEE Congress on Evolutionary Computation

Sendai International Center, Sendai, Japan, 25-28 May, 2015

Optimization of Big Data 2015 Competition



Sim Kuan Goh*, Hussein A. Abbass**, and Kay Chen Tan*

*National University of Singapore, Department of Electrical and Computer Engineering, 4 Engineering Drive, 117583, Singapore.

**The University of New South Wales, School of Engineering and Information Technology, Canberra, ACT 2600, Australia.


Evolutionary optimization techniques have been very successful in solving multi-modal optimization problems. To date, many competitions have been organized on evolutionary optimization for both single- and multi-objective problems. The community has responded very well to these competitions, which have encouraged the development of a variety of highly competitive new algorithms.

In the age of Big Data, there is an urgent need to take evolutionary optimization techniques to the next level, to solve problems with hundreds, thousands, and even millions of variables. This competition takes the first steps towards achieving this objective.

A special session on Big Optimization is organized in conjunction with this competition. Details on the special session can be found at

The Big Data Problem

A Big Data problem is not just about size. The frequency and nature of changes in the underlying trends in the data, heterogeneity of data sources, level of noise, and other factors come into play to define the true complexity of a Big Data problem.

One important factor to consider is that solving Big problems requires solving some small ones; that is, to manage 1TB of data per day, we need to be able to manage roughly 12MB of data per second. While 12MB does not sound large, it is the "per second" that makes the problem harder; that is, the time constraint imposed on the size.
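As a quick arithmetic check of that rate (using decimal units and assuming a sustained, uniform stream):

```python
# Sanity check: 1 TB/day expressed as a per-second rate (decimal units).
bytes_per_day = 1e12                      # 1 TB
seconds_per_day = 24 * 60 * 60            # 86,400 s
rate_mb_per_s = bytes_per_day / seconds_per_day / 1e6
print(round(rate_mb_per_s, 1))            # about 11.6 MB/s, i.e. roughly 12 MB/s
```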

Classically, data gets accumulated in infrastructures managed by distributed file systems (DFS) such as the Hadoop DFS (HDFS). When the data contains time series, it is sometimes more efficient to do a certain level of data processing at the source before storing the data.

This competition is an example of this situation. In 2015, we will start with a very small scale representing around 20KB of data per second if the data is accessed in its binary form and 0.5MB per second if the data is accessed in its text form. This scale requires solving an optimization problem with close to 5000 variables every second. We will not impose the time constraint on entries to the competition. However, we will use the computational time (or an estimate of it) as a criterion to score entries.

In future years, we will increase the size of the dataset to bring it closer to the real-world Big Data challenges for this problem domain.

Expected Impact on Evolutionary Computation

Evolutionary optimization is a major research area of evolutionary computation. The primary objective of this competition is to push the boundaries of evolutionary optimization beyond the small-scale problems that have been used as benchmark problems in the literature so far.

Evolutionary optimization has successfully solved very large scale "simple" problems such as the one-max function, with a million variables. In global and/or multiobjective optimization problems, however, this scale is beyond the capabilities of existing evolutionary optimization algorithms.

This competition is only the starting point, focusing in 2015 on problems with up to 5000 variables. The competition will evolve over time towards problems with tens of thousands of variables, once the community is ready for this scale. The aim of this competition is to create next-generation evolutionary optimization algorithms with competitive performance in both quality of solutions and convergence speed.

The Problem - Context-Independent Abstract Definition

The series of problems used in this competition comprises both single-objective and multi-objective nonlinear optimization problems. Each optimization problem is based on a number of interdependent time series forming a dataset. The number of time series together with the length of each time series defines the number of variables in the problem. For this competition, each time series is of length 256; thus, each optimization problem has a multiple of 256 variables. For example, a dataset with 4 time series defines a 1024-variable optimization problem. In this year's competition, the aim is to solve problems with 1024, 3072, and 4864 variables.

Assume the matrix X is of dimension N × M, where N is the number of inter-dependent time series and M is the length of each time series. Let S be an N × M matrix of N independent time series of length M, such that, given A, an N × N linear transformation matrix,

X = A · S

The problem is to decompose the matrix S into two matrices, S1 and S2, with the same dimensionality as S; that is, S = S1 + S2 and X = A · S1 + A · S2. Let C be the Pearson correlation coefficient between X and A · S1; that is,

C = cov(X, A · S1) / sqrt( var(X) · var(A · S1) )

where cov(·) is the covariance matrix and var(·) is the variance.
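As an illustration, the matrix C can be computed with NumPy as follows (a minimal sketch; the function name and interface are ours, not part of the competition kit):

```python
import numpy as np

def correlation_matrix(X, A, S1):
    """Pearson correlation between the rows of X and the rows of A @ S1."""
    Y = A @ S1                                   # reconstructed mixtures, N x M
    Xc = X - X.mean(axis=1, keepdims=True)       # centre each time series
    Yc = Y - Y.mean(axis=1, keepdims=True)
    cov = Xc @ Yc.T / X.shape[1]                 # N x N covariance matrix
    std_x = np.sqrt((Xc ** 2).mean(axis=1))      # per-row standard deviations
    std_y = np.sqrt((Yc ** 2).mean(axis=1))
    return cov / np.outer(std_x, std_y)
```

Note that if S1 = S, then A @ S1 = X, and the diagonal of C is exactly 1.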

The objective is to maximize the diagonal elements of C while driving its off-diagonal elements to zero. At the same time, the distance between S and S1 should be as small as possible; that is, S1 needs to be as similar as possible to S.

This definition generates two formulations for the problem, one as a single objective, while the other as a multiobjective problem.

The Multi-objective optimization problem is defined as follows:

Given, X, A, and S, find S1 which optimizes the following two objective functions:

Minimize f1 = (1 / (N · M)) Σ_{i,j} (S_ij − S1_ij)²

Minimize f2 = (1 / (N² − N)) Σ_{i, j ≠ i} C_ij² + (1 / N) Σ_i (1 − C_ii)²

The single objective formulation is defined as follows:

Given, X, A, and S, find S1 which optimizes the following objective function:

Minimize f1 + f2
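The two objectives can be sketched in NumPy as follows (C is the correlation matrix defined earlier; the function is our illustration, not reference code from the organizers):

```python
import numpy as np

def objectives(S, S1, C):
    """f1: mean squared distance between S and S1; f2: correlation penalty."""
    N, M = S.shape
    f1 = ((S - S1) ** 2).sum() / (N * M)
    diag = np.diag(C)
    off_diag_sq = (C ** 2).sum() - (diag ** 2).sum()   # sum over i and j != i
    f2 = off_diag_sq / (N ** 2 - N) + ((1.0 - diag) ** 2).sum() / N
    return f1, f2
```

A perfect decomposition (S1 = S with C equal to the identity) gives f1 = f2 = 0.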

Box Constraints

For the sake of this exercise, all variables in the optimization problem are bounded: −8 ≤ S1_ij ≤ 8.
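One simple way to keep candidate solutions inside this box after a variation step is clipping (only one of several possible repair strategies; reflection or resampling are alternatives):

```python
import numpy as np

LOW, HIGH = -8.0, 8.0

# After a mutation/crossover step, project the candidate S1 back into the box.
S1 = np.random.default_rng(42).normal(scale=10.0, size=(4, 256))
S1 = np.clip(S1, LOW, HIGH)
```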

The Problem - Context-Dependent Definition

It is important to emphasize that understanding the context of this problem is not necessary to solve it. The previous abstract definition should be sufficient for many scientists in optimization theory and evolutionary optimization to solve the problem.

For those interested in understanding the technical basis for the data generation method and the above mathematical formulation, this section will provide such details. However, to emphasize the point again, researchers joining the competition do not need to read these domain-specific technical details.

The context of this problem is EEG analysis. A technical description of the data generation method, the multi-objective formulation above, and the blind source separation problem in signal processing for cleaning EEG data are all given in the following two references, in addition to the supplementary documents provided below


There are three datasets used for the competition, with 4, 12, and 19 time series. Each time series corresponds to a synthetic electrode sensing EEG data and is of length 256, matching a sampling rate of 256Hz. The three datasets are named D4, D12, and D19, respectively. The three datasets are repeated with a noise component added to each sample, and are renamed D4N, D12N, and D19N. Therefore, we have 6 single-objective optimization problems and another 6 multi-objective optimization problems.

Without noise:

  D4 (4 channels): Matrix X [file], Matrix A [file], Matrix S [file]

  D12 (12 channels): Matrix X [file], Matrix A [file], Matrix S [file]

  D19 (19 channels): Matrix X [file], Matrix A [file], Matrix S [file]

With noise:

  D4N (4 channels): Matrix X [file], Matrix A [file], Matrix S [file]

  D12N (12 channels): Matrix X [file], Matrix A [file], Matrix S [file]

  D19N (19 channels): Matrix X [file], Matrix A [file], Matrix S [file]


Entries should provide the following files:

Evaluation Procedures

Three evaluation criteria are used to evaluate entries to the competition: quality of solutions, speed, and stability as measured by variance of solution qualities with different initial seeds.
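As an illustration of the stability criterion, one could summarize runs with different initial seeds by the mean and sample variance of the final objective values (the exact scoring formula is not specified here; the values below are hypothetical):

```python
import statistics

# Hypothetical final single-objective values from five runs, one per random seed.
final_values = [0.031, 0.029, 0.035, 0.030, 0.033]

mean_quality = statistics.mean(final_values)      # average solution quality
stability = statistics.variance(final_values)     # sample variance across seeds
```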

Single Objective Evaluation Procedures

The procedures for evaluating single-objective entries into the competition will rely on the following evaluation criteria to rank entries in the order they are presented below.

Multi Objective Evaluation Procedures

The procedures for evaluating multi-objective entries into the competition will rely on the following evaluation criteria to rank entries in the order they are presented below.

Established Baselines

A baseline is provided for each dataset using NSGA-II for the multiobjective problem and Differential Evolution for the single objective problem.

The baseline is calculated using NSGA-II. In the single-objective case, NSGA-II was used as-is with the second objective set to zero. The NSGA-II code we used is accredited to

Instructions for obtaining the results for the baselines are as follows:

  1. Download MOEAD from
  2. Correct some minor syntax errors using the hints given by the Visual Studio IDE.
  3. Create a new folder DATAin in the main folder.
  4. Add the data files into DATAin.
  5. Download the new loaddata_bigopt.h [last updated 15 Oct 2014] to the common folder in MOEAD folder.
  6. Download the new main_moea.cpp [last updated 15 Oct 2014], objective.h [last updated 21 Oct 2014].
  7. Modify in global.h the upper bound and lower bound of the decision variable to lowBound = -8, uppBound = 8.
  8. Build and Execute the program.
  9. A MATLAB function to evaluate the score based on quality of solutions is provided as follows:

    Download [last updated 9 Nov 2014]. Extract it. Put your Pareto file and the scaling file in the Scoreboard folder.

    Use Scoreboard.m to evaluate your score. If scaling was used in the optimization, please remember to scale the objectives back to their original values.

    Score = ScoreBoard(filename,problemtype,scalefile,MultiOrSingle)

    e.g., Score = ScoreBoard('YourPF.txt','D4','YourScale.txt','M') % for MOP

    Score = ScoreBoard('YourSO.txt','D4','YourScale.txt','S') % for SOP

    YourScale should list the bounds in the order: f1min, f1max, f2min, f2max.

    A score will be given. If you are working on multiobjective optimization, a Pareto front compared against the baseline will also be shown.
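If the objectives were min-max scaled during optimization, mapping a scaled value back to its original range can be done as follows (a sketch with hypothetical bounds, given in the order the scale file uses):

```python
def unscale(f_scaled, f_min, f_max):
    """Invert min-max scaling: map a value in [0, 1] back to [f_min, f_max]."""
    return f_min + f_scaled * (f_max - f_min)

# Hypothetical bounds, in the scale-file order: f1min, f1max, f2min, f2max
f1min, f1max, f2min, f2max = 0.0, 2.5, 0.0, 1.2

f1_original = unscale(0.4, f1min, f1max)   # 1.0
f2_original = unscale(0.5, f2min, f2max)   # 0.6
```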

The baseline results will be available below from October 30.

Without noise:

  D4: Matrix S1 [file] (last updated 1 Dec 2014)

  D12: Matrix S1 [file] (last updated 1 Dec 2014)

  D19: Matrix S1 [file] (last updated 1 Dec 2014)

With noise:

  D4N: Matrix S1 [file] (last updated 1 Dec 2014)

  D12N: Matrix S1 [file] (last updated 1 Dec 2014)

  D19N: Matrix S1 [file] (last updated 1 Dec 2014)



Inquiries should be directed to Sim-Kuan Goh at

Anticipated Number of Participants

We anticipate 6-12 participants in this competition.

Biography of the Main Team Members

Sim-Kuan Goh

Department of Electrical and Computer Engineering
National University of Singapore
Sim Kuan  Photo

Sim Kuan Goh received the B.Eng. degree in Electrical Engineering from the National University of Singapore in 2013. He is currently a PhD candidate in Control, Intelligent Systems & Robotics at the Department of Electrical and Computer Engineering, National University of Singapore. His research interests include multiobjective optimization, evolutionary computation, and brain-computer interfaces.

Prof. Hussein A. Abbass

School of Engineering and Information Technology
University of New South Wales
Hussein Abbass Photo

Hussein Abbass is a full Professor with the University of New South Wales (UNSW-Australia), Canberra Campus, Australia. In 2014, he is spending his sabbatical visiting the Department of Electrical and Computer Engineering, National University of Singapore. Prof. Abbass is a fellow of the UK Operational Research Society and a fellow of the Australian Computer Society. He is an Associate Editor of six international journals, including the IEEE Transactions on Evolutionary Computation and the IEEE Computational Intelligence Magazine. He has been serving as the Chair of the Emerging Technologies Technical Committee of the IEEE Computational Intelligence Society (IEEE-CIS) for 2 years and has served on many different committees within IEEE-CIS. Prof. Abbass is currently a College Member of the Australian Research Council (ARC) Engineering, Mathematics, and Information Cluster. He was a member of the Research Evaluation Committee of Excellence in Research for Australia in 2010. He spent his sabbatical in 2005 at the University of Illinois at Urbana-Champaign, and was a UNSW John Yu Fellow at Imperial College London in 2003. He has published 200+ refereed papers. His current research interest is in computational red teaming and integrating human brain data with advanced analytics and automation.

Prof. Kay Chen Tan

Department of Electrical and Computer Engineering
National University of Singapore
KC Tan Photo

Kay Chen Tan is currently an Associate Professor in the Department of Electrical and Computer Engineering, National University of Singapore. He has published over 100 journal papers and over 100 papers in conference proceedings, and has co-authored 5 books. Dr Tan has been an invited keynote/plenary speaker for over 40 international conferences. He is the General Co-Chair for the 2016 IEEE World Congress on Computational Intelligence to be held in Vancouver, Canada. Dr Tan is an elected member of AdCom (2014-2016) and was an IEEE Distinguished Lecturer of the IEEE Computational Intelligence Society from 2011 to 2013. Dr Tan is the Editor-in-Chief of the IEEE Transactions on Evolutionary Computation beginning in 2015. He was the Editor-in-Chief of the IEEE Computational Intelligence Magazine from 2010 to 2013. He currently serves as an Associate Editor / Editorial Board member of over 20 international journals, such as the IEEE Transactions on Cybernetics, the IEEE Transactions on Computational Intelligence and AI in Games, and Evolutionary Computation (MIT Press). Dr Tan is a Fellow of the IEEE. He is the awardee of the 2012 IEEE Computational Intelligence Society (CIS) Outstanding Early Career Award and the recipient of the Recognition Award (2008) from the International Network for Engineering Education & Research (iNEER).