Dataflow monitoring in LHCb
Paper in conference proceedings, 2011
The LHCb data-flow starts with the collection of event-fragments from more than 300 read-out boards at a rate of 1 MHz. These data are moved through a large switching network consisting of more than 50 routers to an event-filter farm of up to 1500 servers. Accepted events are sent through a dedicated network to storage collection nodes, which concatenate them into files and transfer them to mass-storage. Under nominal conditions more than 30 million packets enter and leave the network every second. Precise monitoring of this data-flow, down to individual packet counters, is essential to trace rare but systematic sources of data-loss. We have developed a comprehensive monitoring framework that allows the data-flow to be verified at every level, using a variety of standard tools and protocols such as sFlow and SNMP as well as custom software based on the LHCb Experiment Control System framework. This paper starts from an analysis of the data-flow and the hardware and software layers involved. From this analysis it derives the architecture of the monitoring system and finally presents its implementation.
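To illustrate the counter-based consistency check described above, the following minimal sketch compares cumulative packet counters sampled at successive stages of the data path to localize where packets disappear. Stage names and counter values are hypothetical; the real system obtains such counters via sFlow, SNMP and the LHCb Experiment Control System rather than from a hard-coded table.

```python
"""Minimal sketch: locate packet loss by comparing per-stage counters.

Hypothetical stage names and values for illustration only; in practice the
counters would be read from switches and servers via SNMP/sFlow.
"""

# Cumulative packet counters sampled at the same instant at each stage
# of the data path, ordered from source to sink.
counters = {
    "readout_boards": 30_000_000,   # packets sent by the read-out boards
    "core_routers":   30_000_000,   # packets forwarded by the switching network
    "filter_farm":    29_999_997,   # packets received by the event-filter farm
    "storage_nodes":  29_999_997,   # packets written by the storage collection nodes
}


def find_loss(stage_counters: dict) -> list:
    """Return (upstream, downstream, lost) for every hop where packets vanish."""
    stages = list(stage_counters)
    losses = []
    for upstream, downstream in zip(stages, stages[1:]):
        diff = stage_counters[upstream] - stage_counters[downstream]
        if diff > 0:
            losses.append((upstream, downstream, diff))
    return losses


if __name__ == "__main__":
    for upstream, downstream, lost in find_loss(counters):
        print(f"{lost} packets lost between {upstream} and {downstream}")
```

With the example values above, the sketch reports three packets lost between the switching network and the event-filter farm; this is the kind of small, systematic discrepancy the monitoring framework is designed to expose.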