My publications include peer-reviewed journal and proceedings papers, and are listed in the following databases:
Committee member of the "International Workshop on Energy-aware High Performance Heterogeneous Architectures and Accelerators" (WEHA 2015), organized as part of the "International Conference on High Performance Computing & Simulation" (HPCS 2015), July 20-24, 2015, Amsterdam, The Netherlands.

Teaching Activity
I have lectured the following courses at the University of Ferrara since 2002:
- [2002 - 2010]: Computability and Complexity, 76 hours, 12 CFU;
- [2002 - 2012]: Algorithms and Data Structures, 108 hours, 12 CFU;
- [2003 - 2007]: Applications for Distributed Systems, 28 hours, 3 CFU;
- [2010 - today]: Operating Systems, 108 hours, 12 CFU;
Scientific Activity
My research interests are in the area of the design of massively parallel architectures and in the optimization of scientific applications, mainly in the field of computational physics. Numerical simulations have become over the years the main -- and sometimes the only possible -- method to investigate complex systems whose behavior cannot be predicted through analytical methods. Relevant examples in the area of theoretical physics are Lattice Quantum Chromodynamics (LQCD) to study the interactions of the elementary particles of matter, the Lattice Boltzmann Method (LBM) to study the behavior of fluids, and the Monte Carlo simulation of Spin Glass systems to study disordered magnetic systems.

My scientific activity can be divided into two main parts:
- in the first, I mainly focused on the design of massively parallel architectures optimized for scientific applications. In the framework of this activity I have been one of the founders of the APE, Janus and QPACE projects;
- in the second, I have mainly focused on the investigation of the performance of recent multi- and many-core processors; the main goal is to study how to implement and optimize scientific applications so that a large fraction of the peak performance of the processor is exploited.
In the following I give more details about the activity carried out.

APE Project (1997-2004)
The APE project was developed by INFN in collaboration with the DESY institute in Germany and the University of Paris-Sud in France. The project designed and implemented several generations of massively parallel systems optimized for numerical simulations of Lattice Quantum Chromodynamics (LQCD). I joined the APE group in 1997 during the design of the APEmille system, and later I was one of the founders of the apeNEXT project, the successor of the APEmille system. In the framework of the apeNEXT project I played a key role in the design, test and deployment of the system.

Within the framework of the activities of the APE projects:
- I have been involved in the VLSI design and test of the processors. I also developed a test framework to validate the design of the chips of the systems;
- I have coordinated the development of the code-optimizer for the APE systems. The code-optimizer, called shaker, plays a key role in the efficiency of the applications. It translates the assembly instructions generated by the high-level compiler into VLIW instructions. It also schedules the VLIW instructions in order to maximize the occupation of the devices and to minimize the execution time. As a last step, it maps virtual registers onto the physical registers of the processor;
- I have coordinated the development of the operating system called CAOS (Cool Ape Operating System). CAOS allows the user to interface with the machine, run programs and monitor their execution. It also implements the run-time support for the input-output operations of the programs running on the machine;
- I have been involved in the implementation and optimization of relevant application kernels; I frequently collaborated with physics groups in Germany, France and Italy to develop LQCD and LBM applications.
Relevant publications for this activity are:

Janus Project (2005-2008)
In 2005 I was one of the founders of the Janus project, aimed at the design and deployment of a massively parallel system optimized for Monte Carlo simulations of Spin Glass systems. Spin Glass applications are relevant in the areas of both condensed matter and optimization.
Commodity systems are not able to meet the computing requirements of Spin Glass simulations, mainly because all the operations process 1-bit spin variables. Moreover, the processing of a single spin requires loading 7 single-bit variables from memory, requiring a memory bandwidth much higher than that available on commodity processors.
The Janus system is a heterogeneous parallel system based on 16 boards. Each board includes 16 FPGA-based processors, called scientific processors (SP), and 1 I/O processor called IOP. The simulation algorithms are implemented in firmware using a hardware description language such as VHDL. This allows around one thousand update engines to be integrated on a single FPGA, each having a private memory and able to process one spin.
Within the Janus project I have been involved in the design of the architecture, and I led the implementation of the I/O system connecting the Janus system to the host PC.
A Janus system of 256 processors was installed in 2008 at the BIFI institute in Spain. The system delivers a peak performance of around 75 Tera-ops, with a performance-per-watt ratio of 7.5 Giga-ops/Watt.
Janus has given a great contribution to the investigation of condensed matter and has made it possible to simulate spin-glass systems for 1 second of physical time.
Relevant publications for this activity are:
QPACE Project (2007-2009)
Starting from 2005 I have focused on the investigation of the performance of processor architectures in meeting the computing requirements of an application. As a first example I considered an LQCD application. The results of this analysis were published in: The potential of on-chip multiprocessing for QCD machines.

This analysis has shown that for a given application it is possible to define a balance equation which defines the optimal partition of the chip area between functional units and storage units. It can also be used to estimate the performance of processor architectures for a given application. The balance equation has been applied to the IBM Cell processor, the first commodity multi-core processor, released in 2005. It has shown that the Cell processor, if programmed in an appropriate way, can meet the computational requirements of LQCD applications. This analysis led to the QPACE project, aimed at realizing a massively parallel machine optimized for LQCD applications. QPACE is a 3D grid of nodes based on the PowerXCell 8i processor, the version of the Cell processor supporting the execution of double-precision operations in hardware at full speed.
In the framework of the QPACE project I have coordinated the design and implementation of the network processor. It is implemented on an FPGA device; on one side it interfaces with the Cell processor, and on the other side it is connected to six 1 GB/s bi-directional links. The communication protocol is implemented in firmware and does not need support from the operating system. Data is transmitted in packets of 128 bytes together with a CRC code. On the receive side the CRC is computed on the received data and checked against the CRC code attached to the packet. A corresponding feedback is sent back to the sender. The measured bit-error rate is less than 10^-14, and the measured latency is less than 0.5 micro-seconds.
Two large prototypes of 8192 computing cores each were installed in August 2009, each delivering a peak performance of 200 TFlops in double precision with a power dissipation of 280 kWatts.
In November 2009 and June 2010 the QPACE system was ranked as the system with the highest Flops/Watt ratio; see the Green500 lists of Nov. 2009 and Jun. 2010.
Relevant publications for this activity are:
Performance Assessment of Multi- and Many-Core Processors (2009 - today)
Recently I have focused on the assessment of the performance of newly developed processors for scientific applications. The goal of this activity is to investigate how efficiently multi- and many-core commodity processors support the execution of scientific applications, and to establish efficient programming methodologies. So far I have considered two scientific applications relevant in the field of theoretical physics: the simulation of fluids based on Lattice Boltzmann methods, and the simulation of Spin Glass systems based on Monte Carlo methods. I am investigating several architectures, starting from the IBM Cell Broadband Engine, the multi-core processors of Intel, and NVIDIA GP-GPU cards.

Within the framework of this activity I have coordinated the implementation and optimization of a D2Q37 fluid-dynamic code which is currently used for physics production. I first implemented the code on a cluster based on multi-core processor architectures such as Nehalem and Sandy Bridge, and then on a cluster of GP-GPU nodes using NVIDIA Tesla C2050 boards. Both implementations exploit a large fraction of the peak performance, running at around 40-50% efficiency and scaling up to tens of nodes.
I am a founder and the national coordinator of the Computing on Knights Architecture (COKA) project funded by INFN. The COKA project started in 2012, and it is focused on the investigation of the performance of the recently developed Intel Many Integrated Core (MIC) architecture. The goal of the project is to investigate how to structure scientific applications for MIC processors and how to exploit a large fraction of their peak performance. We are considering relevant applications in the fields of both theoretical and experimental physics.
Relevant publications for this activity are:
Other Activities
- in 2006 and 2007 I lectured at the master school of micro-electronics at the University of Padova. The main topics of the course covered the architecture of high-performance processors and, in detail, hardware supports such as: dynamic instruction scheduling, Tomasulo and scoreboard scheduling, the reorder buffer, branch predictors, and dynamic register renaming.
- in 2004 I was involved in the design of the Amchip3 processor. The Amchip3 performs in hardware the pattern-matching of tracks resulting from particle collisions at the CDF experiment installed at the Fermilab national laboratory in the US (see A VLSI processor for fast track finding based on content addressable memories).
Other Information
- official member of the board of the Mathematics and Informatics PhD of University of Ferrara;
- supervisor of 6 bachelor, 5 master, and 2 PhD theses in Informatics;
- Nov 2003: member of the organizing committee of the conference Non Perturbative Problems and Computational Physics: The Next Five Years, 27-28 November 2003, Ferrara (Italy);
- 2005 - today: reviewer for the Elsevier journal Parallel Computing: Systems & Applications;
- 2006-2007: member of the HPC in Europe Taskforce (HET), set up to define a European framework in the area of High Performance Computing;
- May 2008: member of the program committee of the ACM International Conference on Computing Frontiers 2008, May 5-7, 2008, Ischia (Italy);
- 2009-2011: local coordinator of the Hadron Physics 2 project, work-package 22, 7th Framework Programme of the European Commission;
- Jun-Jul 2009: scientific visitor at JSC institute in Juelich (Germany);
- Jul-Aug 2010: scientific visitor at DESY institute in Zeuthen-Berlin (Germany);
- Jan 2012 - today: local coordinator of the Hadron Physics 3 project, work-package 10, 7th Framework Programme of the European Commission;
- Jan 2012 - today: national coordinator of INFN COKA project.
- Jun 2012: member of the program committee of the 26th ACM International Conference on Supercomputing (ICS), June 25-29, 2012, San Servolo - Venezia (Italy);
- Jun 2012: co-organizer and co-chair of the workshop Future HPC systems: the Challenges of Power-Constrained Performance, June 25, 2012, San Servolo - Venezia (Italy);
- Jul 2012: external reviewer for the 24th International Symposium on Computer Architecture and High Performance Computing.