Sunday, January 27, 2013

Diagnostic Tools for InfiniBand Devices

There are a few diagnostic tools for examining InfiniBand devices:
  1. ibv_devinfo (Query RDMA devices)
  2. ibstat (Query basic status of InfiniBand device(s))
  3. ibstatus (Query basic status of InfiniBand device(s))

ibv_devinfo (Query RDMA devices) 
Prints information about RDMA devices available for use from userspace.
# ibv_devinfo

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.10.2322
        node_guid:                      0002:c903:0045:1280
        sys_image_guid:                 0002:c903:0045:1283
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       IBM0FD0140019
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             IB

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             IB
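
As a rough sketch of how this output can be consumed in a script (my own illustration, not part of OFED), the snippet below pulls the port states out of ibv_devinfo-style text. A sample matching the output above is embedded in a heredoc so the script is self-contained:

```shell
#!/bin/bash
# Sketch: extract port number and state from ibv_devinfo-style output.
# The sample below mirrors the transcript in this post; on a live system
# you would pipe `ibv_devinfo` itself instead of sample_output.
sample_output() {
cat <<'EOF'
hca_id: mlx4_0
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                port:   2
                        state:                  PORT_DOWN (1)
EOF
}

# Print "port <n>: <state>" for each port.
sample_output | awk '
    $1 == "port:"  { port = $2 }
    $1 == "state:" { print "port " port ": " $2 }'
```

On a live system, replace sample_output with ibv_devinfo to spot down ports at a glance.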

ibstat (Query basic status of InfiniBand device(s))

ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

It is similar to the ibstatus utility but is implemented as a binary rather than a script. It has options to list CAs and/or ports, and it displays more information than ibstatus.

# ibstat

CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.2322
        Hardware version: 0
        Node GUID: 0x0002c90300451280
        System image GUID: 0x0002c90300451283
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x0002c90300451281
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0x0002c90300451282
                Link layer: InfiniBand


ibstatus - (Query basic status of InfiniBand device(s))

ibstatus is a script which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

# ibstatus

Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0045:1281
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0045:1282
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Friday, January 25, 2013

Specifying Maximum and Minimum Walltime in Torque

Specifying maximum and minimum walltime, either per queue or globally in the Torque configuration, is important; otherwise you may have users who request walltime without restraint.

At the Torque server level, you can set the maximum walltime:
# qmgr -c "set server resources_max.walltime = 720:00:00"

At the queue level, you can set the maximum walltime:
# qmgr -c "set queue dqueue resources_max.walltime = 720:00:00"


For the minimum walltime, it seems that you can set it at the queue level, at least for Torque 2.5.x:
# qmgr -c "set queue dqueue resources_min.walltime = 00:00:10"
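
Putting the pieces together, here is a minimal sketch of the whole configuration (it assumes Torque's qmgr is available and a queue named dqueue, as in this post; adjust names and limits to your site):

```shell
# Cap walltime at the server and queue level, and set a queue minimum
# so that trivially short or unset requests are rejected.
qmgr -c "set server resources_max.walltime = 720:00:00"
qmgr -c "set queue dqueue resources_max.walltime = 720:00:00"
qmgr -c "set queue dqueue resources_min.walltime = 00:00:10"

# Verify what was actually set:
qmgr -c "print queue dqueue"
```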

Thursday, January 24, 2013

Great YouTube Video on how to find the functions that cause a crash in a Linux application

A tutorial by Gilad Ben-Yossef showing how to find the functions that caused a crash in a Linux application, if all you know is the address of the crash, using the common ldd and nm tools. See
Gilad Ben-Yossef on using ldd and nm (YouTube)

Wednesday, January 23, 2013

Modifying the walltime of a running job in Torque

The qalter command in Torque is a useful tool for modifying the attributes of the job or jobs specified by job_identifier on the command line.

One of the more common uses of qalter is to modify the walltime of a running job in Torque. For example:

# qalter jobid -l walltime=5:00:00
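
As an illustration, here is a small hypothetical helper (my own, not part of Torque) that adds a number of hours to an existing HH:MM:SS walltime string, which you can then feed to qalter; the job ID 1234 below is made up:

```shell
#!/bin/bash
# Sketch: add_hours HH:MM:SS N -> new HH:MM:SS walltime string.
# Hypothetical helper, not a Torque command; it only does the arithmetic,
# so you can review the resulting qalter invocation before running it.
add_hours() {
    local wt=$1 extra=$2 h m s
    IFS=: read -r h m s <<< "$wt"
    # 10# forces base-10 so leading zeros are not read as octal.
    printf '%d:%02d:%02d\n' "$((10#$h + extra))" "$((10#$m))" "$((10#$s))"
}

# Example: extend a 5-hour walltime by 3 hours for (made-up) job 1234.
echo "qalter 1234 -l walltime=$(add_hours 5:00:00 3)"
```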

For more information on qalter, see qalter from Adaptive Computing

Tuesday, January 22, 2013

Quick method for estimating walltime

For Torque / OpenPBS or any other scheduler, walltime is an important parameter that allows the scheduler to determine how long a job will take. You can make a quick rough estimate using the time command:

# time -p mpirun -np 16 --host node1,node2 hello_world_mpi

real 4.31
user 0.04
sys 0.01

Use the "real" value (4.31 seconds here) as the basis for the estimated walltime. Since this is only an estimate, you may want to specify a higher value for the walltime:

$ qsub -l walltime=5:00 -l nodes=1:ppn=8 openmpi.sh -v file=hello_world
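
Note that time -p reports the real time in seconds, so 4.31 above means roughly four and a third seconds. A minimal sketch of turning that figure into a padded HH:MM:SS walltime (the helper name and the 25% safety margin are my own choices, not a Torque convention):

```shell
#!/bin/bash
# Sketch: convert the "real" figure from `time -p` (elapsed seconds)
# into a padded HH:MM:SS walltime string.
estimate_walltime() {
    local real_s=$1
    awk -v s="$real_s" 'BEGIN {
        s = s * 1.25               # add a 25% safety margin...
        s = int(s) + (s > int(s))  # ...then round up to whole seconds
        printf "%d:%02d:%02d\n", s / 3600, (s % 3600) / 60, s % 60
    }'
}

estimate_walltime 4.31    # -> 0:00:06
```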



Friday, January 18, 2013

Disabling and enabling interactive mode on Torque

Disabling interactive mode for a selected queue in Torque is very simple to implement; you just fire a command:

# qmgr -c 'set queue queue_name disallowed_types = interactive'

To remove this attribute, use the qmgr unset command:

# qmgr -c 'unset queue queue_name disallowed_types'


For more information, see  Using qmgr to remove a queue attribute

Thursday, January 17, 2013

OFED Performance Micro-Benchmark Latency Test

The Open Fabrics Enterprise Distribution (OFED) provides a collection of simple performance micro-benchmarks written over uverbs. These notes are taken from the OFED Performance Tests README:
  1. The benchmark uses the CPU cycle counter to get time stamps without a context switch.
  2. The benchmark measures round-trip time but reports half of that as one-way latency. This means that it may not be sufficiently accurate for asymmetrical configurations.
  3. Min/Median/Max results are reported.
    The Median (rather than the average) is less sensitive to extreme scores.
    Typically, the Max value is the first value measured.
  4. Larger samples only help marginally. The default (1000) is very satisfactory. Note that an array of cycles_t (typically an unsigned long) is allocated once to collect samples and again to store the difference between them. Really big sample sizes (e.g., 1 million) might expose other problems with the program.
On the Server Side
# ib_write_lat -a

On the Client Side
# ib_write_lat -a Server_IP_address

For more information, do take a look at OFED Performance Micro-Benchmark Latency Test

Wednesday, January 16, 2013

Enabling CPU Scaling for Dual socket.

My earlier blog entry noted that shutting down the cpuspeed daemon somehow only impacted the first socket; the second socket was still running at a reduced speed. There is a write-up on this obtained from Experts-Exchange, "Avoiding CPU speed scaling in modern Linux distributions. Running CPU at full speed Tips."

First, follow the steps written at Disabling CPU speed scaling in CentOS 5.

On CentOS 5, you may want to use the following script, placed at /etc/init.d/cpuperf.

#! /bin/bash
#
# cpuperf sets cpu governor
#
# chkconfig: 2345 10 90
#
# description: Set the CPU Frequency Scaling governor to "performance"
#
### BEGIN INIT INFO
# Provides: $ondemand
### END INIT INFO

PATH=/sbin:/usr/sbin:/bin:/usr/bin

case "$1" in
    start)
        for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
        do
                [ -f $CPUFREQ ] || continue
                echo -n performance > $CPUFREQ
        done
        ;;
    restart|reload|force-reload)
        echo "Error: argument '$1' not supported" >&2
        exit 3
        ;;
    stop)
        ;;
    *)
        echo "Usage: $0 start|stop" >&2
        exit 3
        ;;
esac                                    

Run the following command and you will see that the CPUs are all running at the performance level.
# grep -E '^model name|^cpu MHz' /proc/cpuinfo

Tuesday, January 15, 2013

Understanding the Infiniband Subnet

Intel has published an easy-to-understand article on InfiniBand subnetting, "Understanding the InfiniBand Subnet Manager"

From the article:

The InfiniBand subnet manager (OpenSM) assigns Local IDentifiers (LIDs) to each port connected to the InfiniBand fabric, and develops a routing table based off of the assigned LIDs. 
....
....
A typical InfiniBand installation using the OFED package will run the OpenSM subnet manager at system start up after the OpenIB drivers are loaded. This automatic OpenSM is resident in memory, and sweeps the InfiniBand fabric approximately every 5 seconds for new InfiniBand adapters to add to the subnet routing tables. This usage will be sufficient for most installations, and can be controlled using the following commands:

/etc/init.d/opensmd start
/etc/init.d/opensmd stop
/etc/init.d/opensmd restart
/etc/init.d/opensmd status


For more information, read the article.

Monday, January 14, 2013

Software Updates for critical vulnerability in Java 7 is available

Oracle has released an emergency software update today to fix a vulnerability exploited by a zero-day Trojan horse called Mal/JavaJar-B, which has already been identified attacking Windows, Linux and Unix systems and is being distributed in the "Blackhole" and "NuclearPack" exploit kits, making it far more convenient for attackers. More information is available in the Update Release announcement for the software update.

More Information on Java 7 Vulnerability:
  1. Oracle Java 7 Security Manager Bypass Vulnerability

Thursday, January 10, 2013

Killing an SSH User Shell Session

As an administrator, you will notice that some of your users have idle as well as active SSH sessions. An idle SSH session could be due to a hung connection. So the question is how to remove an individual session without killing the active and genuine sessions.

First things first, run

# w 

You may get some information like this
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT 

.....
user1   pts/31   :24.0            08Oct12 22days  0.05s  0.05s -bash
user2   pts/24   :30              02Jan13  2days  0.66s  0.66s -bash
user3   pts/55   :17              12Nov12 59days  0.01s  0.01s -bash
.....

To get the process ID (PID) of the idle session, run the command (note that ps aux, without the dash, avoids the procps bad-syntax warning)
# ps aux | grep 'pts/31'

root     27552  0.0  0.0  61172   776 pts/1    S+   00:41   0:00 grep pts/31
546      30050  0.0  0.0  64188  1516 pts/31   Ss+   2012   0:00 -bash

Kill the Process
# kill -9 30050 

The idle ssh session has been removed. You can verify with the command "w"
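
A small sketch (my own, not a standard tool) of pulling that PID out of ps output automatically; sample lines matching the transcript above are embedded so the snippet stands alone:

```shell
#!/bin/bash
# Sketch: find the PID of the shell attached to a given tty in
# `ps aux`-style output. The sample mirrors this post; on a live system
# you would pipe `ps aux` instead of sample_ps.
sample_ps() {
cat <<'EOF'
546      30050  0.0  0.0  64188  1516 pts/31   Ss+   2012   0:00 -bash
547      30051  0.0  0.0  64188  1516 pts/24   Ss+   2013   0:00 -bash
EOF
}

# pid_for_tty TTY: print the PID whose TTY column (7th field) matches.
pid_for_tty() {
    sample_ps | awk -v tty="$1" '$7 == tty { print $2 }'
}

pid_for_tty pts/31    # -> 30050
```

On a live system, ps -o pid= -t pts/31 performs the same lookup directly.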

Wednesday, January 9, 2013

Disabling CPU speed scaling in CentOS 5

If you are using CentOS with the cpuspeed daemon running, you may notice that somehow you do not get the full CPU frequency specified for the CPU model.

To check whether there is a discrepancy, fire the command
# grep -E '^model name|^cpu MHz' /proc/cpuinfo

model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000

You can stop CPU throttling by disabling the cpuspeed daemon
# service cpuspeed stop

But if you do a closer check, you will realise that on a dual-socket system, disabling the daemon seems to affect only one CPU socket. Notice the figures below.

model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 2501.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 2501.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 2501.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 2501.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 2501.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 2501.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
model name      :        Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
cpu MHz         : 1200.000
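
A quick sketch (my own illustration) that tallies how many cores sit at each frequency, which makes a half-throttled dual-socket box obvious at a glance; sample lines matching the output above are embedded for self-containment:

```shell
#!/bin/bash
# Sketch: count cores per reported frequency in /proc/cpuinfo-style
# output. More than one distinct frequency on an idle-governor-disabled
# box suggests one socket is still being throttled. On a real system,
# pipe /proc/cpuinfo instead of sample_cpuinfo.
sample_cpuinfo() {
cat <<'EOF'
cpu MHz         : 2501.000
cpu MHz         : 2501.000
cpu MHz         : 1200.000
cpu MHz         : 1200.000
EOF
}

sample_cpuinfo | awk -F': ' '/^cpu MHz/ { count[$2]++ }
    END { for (f in count) print count[f], "core(s) at", f, "MHz" }'
```

On a real system: grep '^cpu MHz' /proc/cpuinfo piped into the same awk one-liner.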

Monday, January 7, 2013

Errors when doing IB testing with ib_write_lat

I was doing an IB test using the perftest package, which has simple tests for benchmarking IB bandwidth and latency. The two simple tests are ib_write_bw and ib_write_lat.

On the Server side, I launched
# ib_write_lat -a

------------------------------------------------------------------
                    RDMA_Write Latency Test
 Number of qps   : 1
 Connection type : RC
 Mtu             : 2048B
 Link type       : IB
 Max inline data : 400B
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
------------------------------------------------------------------
 local address: LID 0x03 QPN 0x0065 PSN 0xba11f4 RKey 0x003900 VAddr 0x002ab9bad9600

On the Client Side,
# ib_write_lat -a 192.168.5.1

The errors were as follows:
Conflicting CPU frequency values detected: 1200.000000 != 2501.000000
 2       1000          inf            inf          inf
Conflicting CPU frequency values detected: 1200.000000 != 2501.000000
 4       1000          inf            inf          inf
Conflicting CPU frequency values detected: 1200.000000 != 2501.000000
 8       1000          inf            inf          inf
Conflicting CPU frequency values detected: 1200.000000 != 2501.000000
 16      1000          inf            inf          inf
Conflicting CPU frequency values detected: 1200.000000 != 2501.000000
 32      1000          inf            inf          inf

To work around the issue, use the "-F" option when running the tests; this flag ignores the "Conflicting CPU frequency" check. The error messages will still appear, but with "-F" you will at least also see the results.

A better solution is to disable the cpuspeed daemon if you are on CentOS. For more information, see my blog entry Disabling CPU speed scaling in CentOS 5.

Friday, January 4, 2013

Performance Tuning of Latency-Sensitive Workloads in VMware vSphere VMs

This article, Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs (pdf), covers and recommends tuning at different layers of an application's environment for latency-sensitive workloads. A word of caution from the article: the recommended tweaks may hurt the performance of workloads that are tolerant of higher latency.

Wednesday, January 2, 2013

Multiprotocol Performance Test of VMware ESX 3.5 on NetApp Storage Systems

This paper, "Performance Report: Multiprotocol Performance Test of VMware® ESX 3.5 on NetApp Storage Systems" by NetApp and VMware, was written in 2008 and is an interesting comparison of three protocols: FC, iSCSI and NFS. The maximum variation between the protocols is about 9%. With the advent of the new VMware vSphere, 10GbE with TOE, and so on, the performance gap may have narrowed further since 2008.

The summary of the pros and cons mentioned in the article is available at Multiprotocol Performance Test of VMware ESX 3.5 on NetApp Storage Systems.