Birth Date: July 11, 1968
Birth Place: Erice (TP)
Nationality: Italian
Email: schifano at fe.infn.it
ResearcherID: http://www.researcherid.com/rid/C-3555-2012

Curriculum

  • [1994:] Degree in Computer Science at University of Pisa (score: 105/110), title of the thesis: Parallelism and FP Programming.
  • [Mar 1995 - Mar 1996:] CNR fellowship at IEI institute in Pisa;
  • [Mar 1996 - Mar 1997:] fellow associate at CNR IEI institute in Pisa;
  • [Feb 1997 - Feb 1999:] PQE2000 fellowship at INFN institute in Pisa;
  • [Mar 1999 - Mar 2004:] research associate at INFN in Ferrara (ex art. 23);
  • [Apr 2004 - Dec 2005:] research associate at INFN in Ferrara (art. 2222, tecnologo/ricercatore);
  • [Jan 2006 - Oct 2006:] research associate at INFN in Ferrara (ex art. 23, primo tecnologo/ricercatore);
  • [Nov 2006 - today:] research associate (ricercatore) at University of Ferrara.

Publications

My publications include peer-reviewed journal and proceedings papers, and are listed in the following databases:

Presentations given at international conferences (peer-reviewed)

  1. Early experience on using Knights Landing processors for Lattice Boltzmann applications.
    "12th International Conference on Parallel Processing and Applied Mathematics", September 10-13 2017, Lublin, Poland
  2. Experience on vectorizing Lattice Boltzmann kernels for multi- and many-core architectures.
    "11th International Conference on Parallel Processing and Applied Mathematics", September 6-9 2015, Krakow, Poland
  3. Optimizing Communications in multi-GPU Lattice Boltzmann Simulations.
    "International Conference on High Performance Computing and Simulation", July 20-24 2015, Amsterdam, The Netherlands
  4. Using Accelerator to Speed-Up Scientific and Engineering Codes: Perspective and Problems.
    "6th Conference on Computational Methods in Marine Engineering", June 15-17, 2015, Rome, Italy
  5. A portable OpenCL Lattice Boltzmann code for multi- and many-core processor architectures.
    "International Conference on Computational Science" (ICCS), June 10-12 2014, Cairns, Australia
  6. Benchmarking GPUs with a Parallel Lattice-Boltzmann Code.
    "25th Int. Symp. on Computer Architecture and High Performance Computing" (SBAC-PAD), October 23-26 2013, Porto de Galinhas, Brazil
  7. Computing on Knights and Kepler architectures.
    "20th International Conference on Computing in High Energy and Nuclear Physics" (CHEP), October 14-18 2013, Amsterdam, The Netherlands
  8. An optimized Lattice Boltzmann code for BlueGene/Q.
    "10th International Conference on Parallel Processing and Applied Mathematics" (PPAM), September 8-11 2013, Warsaw, Poland
  9. Exploiting parallelism in many-core architectures: a test case based on Lattice Boltzmann Models.
    Conference on Computational Physics (CCP), October 14-18, 2012, Kobe, Japan
  10. Performance Impact of AVX Instructions on a D2Q37 Lattice Boltzmann Scheme,
    24th International Conference on Parallel Computational Fluid Dynamics (PARCFD), May 21-25, 2012, Atlanta, GA, USA
  11. Implementation and Optimization of a Thermal Lattice Boltzmann Algorithm on a multi-GPU cluster,
    Innovative Parallel Computing 2012 (INPAR), May 13-14, 2012, San Jose, CA, USA
  12. A multi-GPU implementation of a D2Q37 Lattice Boltzmann Code,
    9th International Conference on Parallel Processing and Applied Mathematics (PPAM), September 11-14, 2011, Torun, Poland
  13. Optimization of Multi-Phase Compressible Lattice Boltzmann Codes on Massively Parallel Multi-Core Systems,
    International Conference on Computational Science (ICCS), June 1-3, 2011, Singapore
  14. Lattice Boltzmann Method Simulations on Massively Parallel Multi-core Architectures,
    High Performance Computing Symposium (HPC), April 3-6, 2011, Boston, Massachusetts, USA
  15. Monte Carlo Simulations of Spin Systems on Multi-core Processors,
    Para 2010: State of the Art in Scientific and Parallel Computing (PARA), June 6-9, 2010, Reykjavik, Iceland
  16. Monte Carlo Simulations of Spin Glass on the Cell Broadband Engine,
    8th International Conference on Parallel Processing and Applied Mathematics (PPAM), September 13-16 2009, Wroclaw, Poland

Invited Talks

  1. Parallel Approaches To Lattice Boltzmann Methods.
    "Introductory School on Parallel Programming and Parallel Architecture for High-Performance Computing", October 10th, 2016, ICTP Trieste, Italy
  2. Early experience on running GPU-based Lattice Boltzmann simulations on POWER8 systems.
    "PADC Opening Workshop", October 12-13, 2015, Juelich, Germany
  3. Benchmarking GPU architectures with Lattice Boltzmann simulations.
    "NVIDIA Application Lab Workshop", July 8-9, 2013, Juelich, Germany
  4. LBM on multi- and many-core architectures.
    "PRACE Summer School Enabling Applications on Intel MIC based Parallel Architectures", July 8-11, 2013, Casalecchio di Reno, Bologna, Italy
  5. Multi- and many-core computing for Physics applications.
    "X Seminar on Nuclear, Subnuclear and Applied Physics", June 2-8, 2013, Alghero, Italy
  6. Implementation and Optimization of a D2Q37 Lattice Boltzmann.
    "NVIDIA Application Lab Kick-off Workshop", Sep. 19-20, 2012, Juelich, Germany
  7. The QPACE Project,
    Scalperf Workshop 2009, September 20-24 2009, Bertinoro, Italy
  8. Implementation of the QPACE Torus Network,
    eQPACE workshop, February 9-10, 2009, JSC Juelich, Germany
  9. Computing Systems for Theoretical Physics,
    Young Investigators Symposium, October 13-15, 2008, Oak Ridge National Laboratory, USA
  10. Trends in Computing for Theoretical Physics,
    Scalperf Workshop 2008, September 07-12, 2008, Bertinoro, Italy
  11. Monte Carlo simulations in statistical physics: Janus,
    26th IFAE conference, March 2008, Bologna, Italy
  12. Lattice QCD on Cell,
    Scalperf workshop 2007, September 02-06 2007, Bertinoro, Italy
  13. Computing for Lattice QCD: apeNEXT,
    HPTC 2004 Conference, 19-22 September 2004, Chateau de Maffliers, Paris, France
  14. CAOS: The APEmille Operating System,
    Third German Perl Workshop, February 28 - March 2, 2001, Saint Augustin, Bonn, Germany
  15. European Tflops Project for Lattice Quantum Chromodynamics,
    10th ORAP Forum and 28th SPEEDUP Workshop, 5-6 October 2000, CERN, Geneva, Switzerland

Committee Member and Organization of Workshops

  • Committee member of: "8th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS17)", organized as part of the ACM/IEEE Supercomputing 2017 (SC17) conference, Nov. 13, 2017, Denver, CO (USA)
  • Organizer of: ParCo 2017: "Mini-Symposium on Energy Aware Scientific Computing on low power and heterogeneous architectures", organized as part of the "International Conference on Parallel Computing" (ParCo 2017), September 12-15, 2017, Bologna, Italy
  • Committee member of: "The Seventh International Conference on Advanced Communications and Computation (INFOCOMP17)", June 25-29, 2017, Venice (Italy)
  • Committee member of: "The International Workshop on OpenPOWER for HPC (IWOPH'17)", June 22, 2017, Frankfurt (Germany)
  • Committee member of: "The International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS16)", organized as part of the ACM/IEEE Supercomputing 2016 (SC16) conference, Nov. 13, 2016, Salt Lake City, UT (USA)
  • Organizer of: "The International Workshop on Energy-aware high performance Heterogeneous Architectures and Accelerators" (WEHA 2016), as part of "The International Conference on High Performance Computing & Simulation" (HPCS 2016), July 18-22, 2016, Innsbruck, Austria
  • Organizer of: "Distributed Computing Architectures And Environmental Science Applications Workshop", June 6-10 2016, Ferrara, Italy
  • Organizer and chair of: "Computing on Low-Power Architectures (COLA)", February 25-26 2016, Ferrara, Italy.
  • Committee member of: "6th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15)", as part of "ACM/IEEE Supercomputing 2015 (SC15)", Nov. 15, 2015, Austin, TX (USA)
  • Committee member of: "International Workshop on Energy-aware high performance Heterogeneous Architectures and Accelerators" (WEHA 2015), organized as part of the "International Conference on High Performance Computing & Simulation (HPCS 2015)", July 20-24, 2015, Amsterdam, The Netherlands.

    Teaching Activity

    I have taught the following courses at the University of Ferrara since 2002:
    • [2002 - 2010]: Computability and Complexity, 76 hours, 12 CFU;
    • [2002 - 2012]: Algorithms and Data Structures, 108 hours, 12 CFU;
    • [2003 - 2007]: Applications for Distributed Systems, 28 hours, 3 CFU;
    • [2010 - today]: Operating Systems, 108 hours, 12 CFU;

    Scientific Activity

    My research interests are in the design of massively parallel architectures and in the optimization of scientific applications, mainly in the field of computational physics. Over the years, numerical simulations have become the main, and sometimes the only possible, investigation method for studying complex systems whose behavior cannot be predicted through analytical methods. Relevant examples in the area of theoretical physics are Lattice Quantum Chromodynamics (LQCD), used to study the interactions of the elementary particles of matter; Lattice Boltzmann Methods (LBM), used to study the behavior of fluids; and Monte Carlo simulations of Spin Glass systems, used to study disordered systems such as ferromagnetic materials.

    My scientific activity can be divided into two main parts:

    • in the first, I have mainly focused on the design of massively parallel architectures optimized for scientific applications. In the framework of this activity I have been one of the founders of the APE, JANUS and QPACE projects;

    • in the second, I have mainly focused on investigating the performance of recent multi- and many-core processors; the main goal is to study how to implement and optimize scientific applications so that a large fraction of the processor's peak performance is exploited.
    In the following I give more details about the activity carried out.

    APE Project (1997-2004)

    The APE project was developed by INFN in collaboration with the DESY institute in Germany and the University of Paris-Sud in France. The project designed and implemented several generations of massively parallel systems optimized for numerical simulations of Lattice Quantum Chromodynamics (LQCD). I joined the APE group in 1997, during the design of the APEmille system, and I was later one of the founders of the apeNEXT project, the successor of the APEmille system. In the framework of the apeNEXT project I played a key role in the design, test and deployment of the system.

    Within the framework of the activities of the APE projects:

    • I have been involved in the VLSI design and test of the processors. I also developed a test framework to validate the design of the chips of the systems;
    • I have coordinated the development of the code optimizer for the APE systems. The code optimizer, called shaker, plays a key role in the efficiency of the applications. It translates the assembly instructions generated by the high-level compiler into VLIW instructions. It also schedules the VLIW instructions in order to maximize the occupation of the devices and to minimize the execution time. As a last step, it maps virtual registers onto the physical registers of the processor;
    • I have coordinated the development of the operating system, called CAOS (Cool Ape Operating System). CAOS allows the user to interface with the machine, run programs and monitor their execution. It also implements the run-time support for the input-output operations of the program running on the machine;
    • I have been involved in the implementation and optimization of relevant application kernels; I frequently collaborated with physics groups in Germany, France and Italy to develop LQCD and LBM applications.
    Relevant publications for this activity are:

    Janus Project (2005-2008)

    In 2005 I was one of the founders of the Janus project, whose goal was to design and deploy a massively parallel system optimized for Monte Carlo simulations of Spin Glass systems.

    Spin Glass applications are relevant in the areas of both condensed matter and optimization.

    Commodity systems are not able to meet the computing requirements of Spin Glass simulations, mainly because all the operations process 1-bit spin variables. Moreover, processing a single spin requires loading 7 single-bit variables from memory, which demands a memory bandwidth much higher than that available on commodity processors.

    The Janus system is a heterogeneous parallel system based on 16 boards. Each board includes 16 FPGA-based processors called scientific processors (SP), and 1 IO processor called IOP. The simulation algorithms are implemented in firmware using a hardware description language such as VHDL. This makes it possible to integrate on a single FPGA around one thousand update engines, each having a private memory and able to process one spin.

    Within the Janus project I have been involved in the design of the architecture, and I led the implementation of the IO system that connects the Janus system to the host PC.

    A Janus system of 256 processors was installed in 2008 at the BIFI institute in Spain. The system delivers a peak performance of around 75 Tera-ops, with a performance-per-watt ratio of 7.5 Giga-ops/Watt.

    Janus has made a major contribution to the investigation of condensed matter, and has made it possible to simulate spin-glass systems over a time span of 1 second.

    Relevant publications for this activity are:

    QPACE Project (2007-2009)

    Starting from 2005 I have focused on investigating how well processor architectures meet the computing requirements of a given application. As a first example I considered an LQCD application. The results of this analysis have been published in: The potential of on-chip multiprocessing for QCD machines.

    This analysis has shown that, for a given application, it is possible to define a balance equation which gives the optimal partition of the chip area between functional units and storage units. It can also be used to estimate the performance of processor architectures for a given application. The balance equation has been applied to the IBM Cell processor, the first commodity multi-core processor released in 2005. It has shown that the Cell processor, if programmed in an appropriate way, can meet the computational requirements of LQCD applications. This analysis led to the QPACE project, whose goal was to build a massively parallel machine optimized for LQCD applications. QPACE is a 3D grid of nodes based on the PowerXCell 8i processor, the version of the Cell processor that supports full-speed execution of double-precision operations in hardware.

    In the framework of the QPACE project I have coordinated the design and implementation of the network processor. It is implemented on an FPGA device; on one side it interfaces to the Cell processor, and on the other side it is connected to six 1 GB/s bi-directional links. The communication protocol has been implemented in firmware, and does not need support from the operating system. Data is transmitted in packets of 128 Bytes together with a CRC code. On the receive side the CRC is computed on the received data and checked against the CRC code attached to the packet. A corresponding feedback is sent back to the sender. The measured bit error rate is less than 10^-14, and the measured latency is less than 0.5 microseconds.
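
    The packet check described above can be illustrated with a minimal software sketch. It is not the QPACE firmware: the 32-bit CRC from Python's zlib module, the 4-byte trailer layout, the "ACK"/"NACK" feedback strings and all function names are assumptions made for illustration; only the 128-byte payload size comes from the text.

```python
import zlib
from typing import Optional, Tuple

PAYLOAD_SIZE = 128  # bytes per packet, as stated in the text
CRC_SIZE = 4        # assumption: a 32-bit CRC appended as a trailer


def make_packet(payload: bytes) -> bytes:
    """Sender side: append the CRC of the payload to the payload."""
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError("payload must be exactly 128 bytes")
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(CRC_SIZE, "big")


def receive_packet(packet: bytes) -> Tuple[str, Optional[bytes]]:
    """Receiver side: recompute the CRC and compare it with the one received.

    Returns the feedback for the sender ("ACK" or "NACK", hypothetical names)
    and the payload when the check succeeds.
    """
    payload = packet[:PAYLOAD_SIZE]
    received_crc = int.from_bytes(packet[PAYLOAD_SIZE:], "big")
    if zlib.crc32(payload) == received_crc:
        return "ACK", payload
    return "NACK", None


def flip_bit(packet: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit transmission error on the link."""
    data = bytearray(packet)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


if __name__ == "__main__":
    packet = make_packet(bytes(range(PAYLOAD_SIZE)))
    print(receive_packet(packet)[0])               # intact packet
    print(receive_packet(flip_bit(packet, 5))[0])  # corrupted packet
```

    In this sketch a "NACK" would prompt the sender to retransmit the packet; how the real link handles the feedback is not described here.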

    Two large prototypes of 8192 computing cores each were installed in August 2009, each delivering a peak performance of 200 Tflops in double precision, with a power dissipation of 280 kW.

    In November 2009 and June 2010 the QPACE system was recognized as the system with the highest Flops/Watt ratio; see the Green500 lists of November 2009 and June 2010.

    Relevant publications for this activity are:

    Performance Assessment of Multi- and Many-Core Processors (2009 - today)

    Recently I have focused on assessing the performance of recently developed processors for scientific applications. The goal of this activity is to investigate how efficiently multi- and many-core commodity processors support the execution of scientific applications, and to establish an efficient programming methodology. So far I have considered two scientific applications relevant in the field of theoretical physics: the simulation of fluids based on Lattice Boltzmann methods, and the simulation of Spin Glass systems based on Monte Carlo methods. I am investigating several architectures, starting from the IBM Cell Broadband Engine, the Intel multi-core processors and the NVIDIA GP-GPU cards.

    Within the framework of this activity I have coordinated the implementation and optimization of a D2Q37 fluid-dynamics code which is currently used for physics production. I first implemented the code on a cluster based on multi-core processor architectures such as Nehalem and Sandy Bridge, and later on a cluster of GP-GPU nodes using NVIDIA Tesla C2050 boards. In both implementations the code exploits a large fraction of the peak performance: it runs at around 40-50% efficiency and scales up to tens of nodes.

    I am a founder and the national coordinator of the Computing on Knights Architecture (COKA) project funded by INFN. The COKA project started in 2012, and it is focused on investigating the performance of the recently developed Intel Many Integrated Core (MIC) architectures. The goal of the project is to investigate how to structure scientific applications for MIC processors and how to exploit a large fraction of their peak performance. We are considering relevant applications in the fields of both theoretical and experimental physics.

    Relevant publications for this activity are:

    Other Activities

    • in 2006 and 2007 I lectured at the master school of micro-electronics at the University of Padova. The main topics of the course covered the architecture of high-performance processors, and in detail hardware supports such as: dynamic instruction scheduling, Tomasulo and Scoreboard scheduling, reorder buffer, branch predictors, dynamic register renaming.

    • in 2004 I was involved in the design of the Amchip3 processor. The Amchip3 performs in hardware the pattern-matching of tracks resulting from particle collisions at the CDF experiment installed at the Fermilab national laboratory in the US (see A VLSI processor for fast track finding based on content addressable memories).

    Other Information

    • official member of the board of the Mathematics and Informatics PhD of the University of Ferrara;
    • supervisor of 6 bachelor, 5 master, and 2 PhD theses in Informatics;
    • Nov 2003: member of the organizing committee of the conference Non Perturbative Problems and Computational Physics: The Next Five Years, 27-28 November 2003, Ferrara (Italy);
    • 2005 - today: reviewer for the Elsevier journal Parallel Computing Systems and Applications;
    • 2006-2007: member of the European committee HPC Europe Taskforce (HET) to define a European framework in the area of High Performance Computing;
    • May 2008: member of the program committee of the ACM International Conference on Computing Frontiers 2008, May 5-7, 2008, Ischia (Italy);
    • 2009-2011: local coordinator of the Hadron Physics 2 project, work-package 22, 7th Framework Programme of the European Community;
    • Jun-Jul 2009: scientific visitor at the JSC institute in Juelich (Germany);
    • Jul-Aug 2010: scientific visitor at the DESY institute in Zeuthen-Berlin (Germany);
    • Jan 2012 - today: local coordinator of the Hadron Physics 3 project, work-package 10, 7th Framework Programme of the European Community;
    • Jan 2012 - today: national coordinator of the INFN COKA project.
    • Jun 2012: member of the program committee of the 26th ACM International Conference on Supercomputing (ICS), June 25-29, 2012, San Servolo - Venice (Italy);
    • Jun 2012: co-organizer and co-chair of the Future HPC systems: the Challenges of Power-Constrained Performance workshop, June 25, 2012, San Servolo - Venice (Italy).
    • Jul 2012: external reviewer for the 24th International Symposium on Computer Architecture and High Performance Computing conference.