title: Optimising the data-collection time of a large-scale data-acquisition system
creator: Colombo, Tommaso
subject: 004 Data processing, Computer science
description: Data-acquisition systems are a fundamental component of modern scientific experiments. Large-scale experiments, particularly in the field of particle physics, comprise millions of sensors and produce petabytes of data per day. Their data-acquisition systems digitise, collect, filter, and store experimental signals for later analysis. The performance and reliability of these systems are critical to the operation of the experiment: insufficient performance and failures result in the loss of valuable scientific data.

By its very nature, data acquisition is a synchronous many-to-one operation: every time a phenomenon is observed by the experiment, data from its various sensors must be assembled into a single coherent dataset. This characteristic yields a particularly challenging traffic pattern for computer networks dedicated to data acquisition. If no corrective measures are taken, this pattern, known as incast, results in significant underutilisation of the network resources, with a direct impact on a data-acquisition system's throughput.

This thesis presents effective and feasible approaches to maximising network utilisation in data-acquisition systems, avoiding the incast problem without sacrificing throughput. Rather than using abstract models, it focuses on an existing large-scale experiment, used as a case study: the ATLAS detector at the Large Hadron Collider.

First, the impact of incast on data-acquisition performance is characterised through a series of measurements performed on the actual data-acquisition system of the ATLAS experiment. As the amount of data sent synchronously by multiple sources to the same destination grows past the size of the network buffers, throughput falls. A simple but effective mitigation is proposed and tested: at the application layer, the data-collection receivers can limit the number of senders they simultaneously collect data from. This solution recovers a large part of the throughput lost to incast, but introduces some performance losses of its own.

Further investigations are enabled by the development of a complete packet-level model of the ATLAS data-acquisition network in an event-based simulation framework. By comparing real-world measurements with simulation results, the model is shown to be accurate enough to be used for studying the incast phenomenon in a data-acquisition system.

Leveraging the simulation model, various optimisations are analysed. The focus is kept on practical software changes that can realistically be deployed on otherwise unmodified existing systems. Receiver-side traffic shaping, incast- and traffic-shaping-aware work-scheduling policies, tuning of TCP's timeouts, and centralised scheduling of network packet injection are evaluated individually and in combination. Used together, the first three techniques yield a very significant increase in the system's throughput, bringing it within 10% of the ideal maximum performance, even under a high network traffic load.
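As a rough illustration of the receiver-side mitigation summarised in the abstract — a data collector limiting how many senders it reads from at once — the following Python sketch caps the number of in-flight fragment requests with a semaphore, sizing the cap from the ratio of buffer size to fragment size. The buffer and fragment sizes, the semaphore-based credit scheme, and all names are hypothetical illustrations, not taken from the ATLAS TDAQ software.

```python
import asyncio

# Hypothetical parameters, chosen only for illustration.
SWITCH_BUFFER_BYTES = 4 * 1024 * 1024   # assumed shared buffer of the switch
FRAGMENT_BYTES = 256 * 1024             # assumed size of one event fragment

# Incast appears once the data requested simultaneously exceeds the buffer,
# so cap the number of concurrent senders accordingly.
MAX_CONCURRENT_SENDERS = max(1, SWITCH_BUFFER_BYTES // FRAGMENT_BYTES)

async def fetch_fragment(sender_id: int, sem: asyncio.Semaphore) -> bytes:
    """Request one event fragment from a sender, holding one 'credit'."""
    async with sem:  # at most MAX_CONCURRENT_SENDERS requests in flight
        # Placeholder for the real request/receive exchange over TCP;
        # the sleep stands in for a transfer at roughly 1 GB/s per sender.
        await asyncio.sleep(FRAGMENT_BYTES / 1e9)
        return bytes(FRAGMENT_BYTES)

async def collect_event(num_senders: int) -> bytes:
    """Assemble one event by collecting fragments from all senders."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_SENDERS)
    fragments = await asyncio.gather(
        *(fetch_fragment(s, sem) for s in range(num_senders))
    )
    return b"".join(fragments)

if __name__ == "__main__":
    event = asyncio.run(collect_event(num_senders=100))
    print(f"collected {len(event)} bytes, "
          f"at most {MAX_CONCURRENT_SENDERS} requests in flight")
```

Because the cap is enforced purely at the application layer of the receiver, a scheme like this could in principle be deployed without changes to senders, switches, or the operating system, which is the deployability constraint the abstract emphasises.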
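The receiver-side traffic shaping mentioned among the evaluated optimisations can likewise be sketched, under the same caveats, as a token-bucket pacer that spaces out fragment requests so the aggregate reply rate stays below the receiver's link capacity. The rate, burst size, and class name below are illustrative assumptions, not values or code from the thesis.

```python
import time

class TokenBucketPacer:
    """Illustrative token-bucket shaper: a request for `nbytes` of data
    consumes `nbytes` tokens; tokens refill at the shaping rate."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.capacity = burst_bytes     # maximum burst allowance
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def wait_for(self, nbytes: int) -> None:
        """Block until `nbytes` tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

# Example: shape requests to 8 Gbit/s with a 1 MiB burst allowance.
pacer = TokenBucketPacer(rate_bps=8e9, burst_bytes=1 << 20)
for _ in range(4):
    pacer.wait_for(256 * 1024)  # one fragment-sized request
    # ... issue the actual fragment request to the next sender here ...
```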
date: 2018
type: Dissertation
type: info:eu-repo/semantics/doctoralThesis
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/24682/1/dissertation.pdf
identifier: DOI:10.11588/heidok.00024682
identifier: urn:nbn:de:bsz:16-heidok-246829
identifier: Colombo, Tommaso (2018) Optimising the data-collection time of a large-scale data-acquisition system. [Dissertation]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/24682/
rights: info:eu-repo/semantics/openAccess
rights: http://archiv.ub.uni-heidelberg.de/volltextserver/help/license_urhg.html
language: eng