Wednesday, June 27, 2012

Adding and Specifying Compute Resources at Torque

This blog entry is a follow-up to Installing Torque 2.5 on CentOS 6 with xCAT tool.

After installing Torque on the Head Node and Compute Node, the next thing to do is to configure the Torque Server. In this blog entry, I will focus on configuring the compute resources at the Torque Server.

Step 1: Adding Nodes to the Torque Server
# qmgr -c "create node node01"

Step 2: Configure automatic detection of the nodes' CPU count. Setting auto_node_np to True overwrites the value of np set in $TORQUEHOME/server_priv/nodes
# qmgr -c "set server auto_node_np = True"

Step 3: Start pbs_mom on the compute nodes. The Torque server will detect the nodes automatically.
# service pbs_mom start
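
With more than a handful of compute nodes, Step 1 gets repetitive. The following is a minimal sketch, not part of the original setup, that prints one qmgr command per node so the list can be reviewed before anything is run. The helper name print_create_node_cmds and the node names are made up for illustration.

```shell
# Print (do not run) one "create node" command per node name given.
# print_create_node_cmds is a hypothetical helper, not part of Torque.
print_create_node_cmds() {
    local node
    for node in "$@"; do
        printf 'qmgr -c "create node %s"\n' "$node"
    done
}

# Example: node01..node03 are placeholder names.
print_create_node_cmds node01 node02 node03
```

Once the output looks right, it can be piped to sh on the head node, for example: print_create_node_cmds node01 node02 | sh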

Tuesday, June 26, 2012

BUG soft lockup - CPU#3 stuck for 10s! on CentOS 5

I noticed that under intense CPU load on CentOS 5.4, I got this error: BUG soft lockup - CPU#3 stuck for 10s!

This particular bug has been mentioned at https://bugzilla.redhat.com/show_bug.cgi?id=484590

Suggested solutions include upgrading the kernel and changing the softlockup_thresh and softlockup_panic parameters at /proc/sys/kernel/softlockup_thresh and /proc/sys/kernel/softlockup_panic

I increased /proc/sys/kernel/softlockup_thresh from 10 to 30. The original figure may be too aggressive.

# echo 30 > /proc/sys/kernel/softlockup_thresh

Check that /proc/sys/kernel/softlockup_panic is 0
# cat /proc/sys/kernel/softlockup_panic

Add this line to /etc/sysctl.conf (takes effect on next reboot):
kernel.softlockup_thresh=30
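
Editing /etc/sysctl.conf by hand is fine for one machine. For several, a small helper that replaces an existing line or appends a new one avoids duplicate entries. This is only a sketch: set_sysctl_line is a hypothetical function written for this post, and the demo works on a scratch file rather than the real /etc/sysctl.conf.

```shell
# set_sysctl_line FILE KEY VALUE
# Replace an existing "KEY=..." line in FILE, or append one if it is missing.
# set_sysctl_line is a hypothetical helper, not a standard tool.
set_sysctl_line() {
    local file="$1" key="$2" value="$3" pattern tmp
    # Escape dots so that "kernel.x" is matched literally.
    pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
    if grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        tmp="$(mktemp)"
        sed "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key}=${value}|" "$file" > "$tmp"
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demonstrate on a scratch file rather than the real /etc/sysctl.conf.
demo="$(mktemp)"
printf 'kernel.softlockup_thresh=10\n' > "$demo"
set_sysctl_line "$demo" kernel.softlockup_thresh 30
cat "$demo"
rm -f "$demo"
```

After changing the real file, running sysctl -p as root loads the settings from /etc/sysctl.conf without waiting for a reboot.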

For more information, see
  1. Why do I see "cpu soft lockup" messages in Red Hat Enterprise Linux on a Unisys E7600 or NEC 5800 Express with 96 cores?

Monday, June 25, 2012

Ganglia Node unable to update Gmetad Node

I was using CentOS 5.5 for the Ganglia Head Node and CentOS 5.7 for the Ganglia node. I followed the blog entry Installing and configuring Ganglia on CentOS 5.x, but somehow the Ganglia Head Node did not register the compute node, and there was no error when I used the command

# service gmond restart

No error showed up even when I turned on Ganglia debugging, as described in Gmond dead but subsys locked for ganglia monitoring daemon.

The only hint of an issue came from running gmond directly from /usr/sbin in debug mode:

# /usr/sbin/gmond --debug=1
You will get this output
Unable to create tcp_accept_channel. Exiting

Solution

I then realised that the Ganglia packages on the Head Node and the Compute Node were not the same version. My Ganglia Head Node is v3.0.7, but my Compute Node is v3.1.7, so I had to downgrade my Ganglia Node to v3.0.7.

First remove the newer version
# yum remove ganglia-gmond

Install the older version
# yum install ganglia-gmond-3.0.7
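
A quick version comparison before starting gmond would have caught this earlier. The sketch below is an illustration only: same_minor_series is a hypothetical helper, and treating major.minor as the compatibility boundary is an assumption drawn from the 3.0.7 versus 3.1.7 mismatch above. The package name ganglia-gmetad in the comment is also an assumption about how the head node was installed.

```shell
# same_minor_series A B
# Succeeds (exit 0) when two dotted versions share major.minor, e.g. 3.0.x.
# Hypothetical helper; the major.minor rule is an assumption.
same_minor_series() {
    [ "${1%.*}" = "${2%.*}" ]
}

head_version="3.0.7"   # e.g. from: rpm -q --qf '%{VERSION}\n' ganglia-gmetad
node_version="3.1.7"   # e.g. from: rpm -q --qf '%{VERSION}\n' ganglia-gmond
if same_minor_series "$head_version" "$node_version"; then
    echo "versions match"
else
    echo "version mismatch: head $head_version, node $node_version"
fi
```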

Sunday, June 24, 2012

Using netstat to diagnose network

Netstat is a good "swiss army knife" for looking deeper into the workings of Linux networking.

I thought I would quickly put down some netstat commands which I often use and find helpful in resolving networking issues.

1. Checking of networking interfaces
# netstat -i

Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0 95453656      0      0      0   177764      0      0      0 BMRU
lo        16436   0       70      0      0      0       70      0      0      0 LRU

 

2. Show Kernel Routing Table Information
# netstat -r 

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
  1.1.57.28     *               255.255.255.128 U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth0
default             1.1.1.125   0.0.0.0         UG        0 0          0 eth0

3. Show all sockets, listening and non-listening (-a), with the process (-p) associated with each socket. Use -l instead of -a to show listening sockets only.
# netstat -ap

....
unix  2      [ ACC ]     STREAM     LISTENING     21138  2474/gnome-session  @/tmp/.ICE-unix/2474
unix  2      [ ACC ]     STREAM     LISTENING     23166  2674/pulseaudio     /tmp/.esd-0/socket
....

3a. To show a specific process, e.g. ssh
# netstat -ap |grep ssh

tcp        0      0 *:ssh                       *:*                         LISTEN      1771/sshd
tcp        0     52 1.1.57.28:ssh            172.21.4.129:50591          ESTABLISHED 7837/sshd
tcp        0      0 *:ssh                       *:*                         LISTEN      1771/sshd
unix       2      [ ACC ]     STREAM     LISTENING     21646  2464/gnome-keyring- /tmp/keyring-i1zxcd/socket.ssh
unix       2      [ ]         DGRAM                    8766783 7837/sshd

4. View operational statistics for network protocol
# netstat -s

Ip:
    12311840 total packets received
    1801583 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    10510256 incoming packets delivered
    174002 requests sent out
Icmp:
    300 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 18
        echo requests: 282
    555 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 125
        echo request: 148
        echo replies: 282
...
A good resource can be found at
  1.  UNIX / Linux: 10 Netstat Command Examples
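
The counters from netstat -i are easy to scan by eye on one machine, but a small filter helps when checking many. This is a sketch only: flag_iface_errors is a hypothetical helper, it assumes the 12-column layout shown in item 1 (including the Met column), and the eth1 line in the sample input is made up.

```shell
# flag_iface_errors: read "netstat -i" output on stdin and print the name of
# every interface whose error, drop, or overrun counters are non-zero.
# Assumes the 12-column layout with the "Met" column.
flag_iface_errors() {
    awk 'NF == 12 && $2 ~ /^[0-9]+$/ {
        # $5-$7 are RX-ERR, RX-DRP, RX-OVR; $9-$11 are TX-ERR, TX-DRP, TX-OVR.
        if ($5 + $6 + $7 + $9 + $10 + $11 > 0) print $1
    }'
}

# Sample input: eth0 is taken from the output in item 1, eth1 is made up.
flag_iface_errors <<'EOF'
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0 95453656      0      0      0   177764      0      0      0 BMRU
eth1       1500   0     1200      3      0      0      900      0      2      0 BMRU
EOF
```

On a live system the same function can be fed directly: netstat -i | flag_iface_errors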

Saturday, June 23, 2012

Adding and removing nodes on Torque in real-time

To add new nodes to Torque in real-time, the command is quite simple:

# qmgr -c "create node node"


To delete nodes, see Deleting Queue and Nodes in Torque in real-time

Tuesday, June 19, 2012

Installing Torque 2.5 on CentOS 6 with xCAT tool

The details of the writeup can be found at my other blog, Installing Torque 2.5 on CentOS 6 with xCAT tool. Basically, the write-up shows
  1. How torque 2.5.9 can be configured and compiled from source, and
  2. How torque can be configured as a service and pushed to the clients

Saturday, June 16, 2012

Preventing PackageKit from locking yum

For more information on PackageKit, do take a look at their site.

PackageKit can cause a yum lock by launching yumBackend.py, and as a result you may not be able to do a yum install. Sometimes PackageKit is not the culprit: applications that use it to update could be buggy, and PackageKit is left holding the "smoking gun". If you read Red Hat Bugzilla – Bug 748790, you will get some sense of the issue.

As for me, I just want yum to myself so that I can control my own updates. To do the same, you can configure PackageKit not to use yum as its backend:

# vim /etc/PackageKit/PackageKit.conf

DefaultBackend=nobackend

Friday, June 15, 2012

Cannot get device settings No such device.


I had a strange problem with a network card. When I ran the command "ethtool eth0", I got this error instead.

Settings for eth0:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
No data available

Strange, because I have an ifcfg-eth0 file in the /etc/sysconfig/network-scripts directory.

But if I take a closer look at the kernel messages by using dmesg:
# dmesg |grep eth0

I see that the interface has been renamed:
bnx2 0000:04:00.0: eth0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) 
PCI-X 64-bit 133MHz found at mem c8000000, IRQ 18, node addr 00:14:5e:fd:6d:76
udev: renamed network interface eth0 to eth3

This issue is similar to "Device eth0 does not seem to be present" on cloned CentOS VM.
You have to go to /etc/udev/rules.d/70-persistent-net.rules and take a look

# PCI device 0x0000:0x0001 (bnx2)

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", 
ATTR{address}=="00:00:00:00:00:01", ATTR{type}=="1", 
KERNEL=="eth*", NAME="eth3"

(1a) One method is to simply rename your original ifcfg-eth0 file

# mv /etc/sysconfig/network-scripts/ifcfg-eth0 \
     /etc/sysconfig/network-scripts/ifcfg-eth3

(1b) Change the DEVICE parameter inside /etc/sysconfig/network-scripts/ifcfg-eth3
DEVICE="eth3"
...
...


(2) Another method is to clean up all the SUBSYSTEM entries and reboot. Hopefully, eth0 is then mapped back to /etc/sysconfig/network-scripts/ifcfg-eth0

(3) Or you could try modifying /etc/udev/rules.d/70-persistent-net.rules and rebooting. Change the NAME value of the rule from "eth3" to "eth0", that is, from the first rule below to the second.

# PCI device 0x0000:0x0001 (bnx2)

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", 
ATTR{address}=="00:00:00:00:00:01", ATTR{type}=="1", 
KERNEL=="eth*", NAME="eth3"

# PCI device 0x0000:0x0001 (bnx2)

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", 
ATTR{address}=="00:00:00:00:00:01", ATTR{type}=="1", 
KERNEL=="eth*", NAME="eth0"

Once you are done, reload the udev configuration. See Updating the udev configuration on CentOS
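
Before renaming anything, it helps to confirm which name udev has bound to a given MAC address. The following is a rough sketch: iface_for_mac is a hypothetical helper, and it assumes each rule sits on a single line, as udev writes them in 70-persistent-net.rules. The demo uses a scratch file with the placeholder MAC address from the rule shown above.

```shell
# iface_for_mac RULES_FILE MAC
# Print the NAME= value of the rule whose ATTR{address} matches MAC.
# Hypothetical helper; assumes one rule per line.
iface_for_mac() {
    local rules="$1" mac="$2"
    # -F matches the string literally, -i ignores the case of the hex digits.
    grep -iF "ATTR{address}==\"${mac}\"" "$rules" \
        | sed -n 's/.*NAME="\([^"]*\)".*/\1/p'
}

# Scratch file standing in for /etc/udev/rules.d/70-persistent-net.rules.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
# PCI device 0x0000:0x0001 (bnx2)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:00:00:00:00:01", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
EOF
iface_for_mac "$demo" 00:00:00:00:00:01
rm -f "$demo"
```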

Thursday, June 14, 2012

News - Beware of DNS Changer Malware

Information on DNS Changer Malware

  1. For information on the DNS Changer Malware, look at the DNS Changer Working Group (DCWG) website
  2. Check whether your machines are affected at http://www.dns-ok.us/
  3. How to disinfect the DNS Changer Malware: http://www.dcwg.org/fix/
  4. Operation Ghost Click and additional information from the FBI website.
  5. According to FBI website, the clean DNS servers will be turned off on July 9, 2012, and computers still impacted by DNSChanger may lose Internet connectivity at that time


Tuesday, June 12, 2012

Using sshpass to automate install ssh-copy-id to remote machines with CentOS 6.2

This is a follow-up to the writeup Tools to automate ssh-copy-id to remote servers. The server OS used is CentOS 6.2. If you are automating scripts, you may have to modify the default SSH settings first and then write a simple bash script using sshpass to push the ssh-copy-id.

You may want to look at the
Automate pushing of ssh-copy-id to multiple servers from LinuxCluster

Tools to automate ssh-copy-id to remote servers

You can write your own scripts, or if you prefer to use an open-source tool, you can use sshpass from sourceforge.

According to the sshpass manpage

sshpass is a utility designed for running ssh using the mode referred to as "keyboard-interactive" password authentication, but in non-interactive mode.

ssh uses direct TTY access to make sure that the password is indeed issued by an interactive keyboard user. Sshpass runs ssh in a dedicated tty, fooling it into thinking it is getting the password from an interactive user.

The command to run is specified after sshpass' own options. Typically it will be "ssh" with arguments, but it can just as well be any other command. The password prompt used by ssh is, however, currently hard-coded into sshpass.

Common example of use is

# sshpass -f password.txt ssh-copy-id user@remoteserver
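
To push the key to several machines, the same command can be generated once per host. The sketch below only prints the commands so they can be checked first. print_copy_id_cmds is a hypothetical helper, and the host names are placeholders.

```shell
# print_copy_id_cmds PASSWORD_FILE USER HOST...
# Print (do not run) one sshpass/ssh-copy-id command per host.
# Hypothetical helper; password.txt, user and the hosts are placeholders.
print_copy_id_cmds() {
    local pwfile="$1" user="$2" host
    shift 2
    for host in "$@"; do
        printf 'sshpass -f %s ssh-copy-id %s@%s\n' "$pwfile" "$user" "$host"
    done
}

print_copy_id_cmds password.txt user node01 node02
```

Since the password sits in a plain file, it is worth restricting it with chmod 600 password.txt and removing it once the keys have been copied.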


Saturday, June 9, 2012

Article: Compiling R with MKL

I stumbled on the article Compiling 64-bit R 2.10.1 with MKL in Linux. Although the R version is quite dated, the technique of compiling will not vary too much.

Thursday, June 7, 2012

World Map of High-throughput Sequencers


This is an interesting graphical world map of high-throughput sequencers at http://omicsmaps.com/


Dependency issues when installing xCAT 2.7 on CentOS 6

If you are using yum to install xCAT 2.7, you will need the .repo files, placed in /etc/yum.repos.d/

# wget http://sourceforge.net/projects/xcat/files/yum/stable/xcat-core/xCAT-core.repo
# wget http://sourceforge.net/projects/xcat/files/yum/xcat-dep/rh6/x86_64/xCAT-dep.repo

Do a yum check-update
# yum check-update

Do a yum install of xCAT, i.e.

# yum install xCAT

You might get the error
Error: Package: xCAT-2.7.2-snap201205230215.x86_64 (xcat-2-core)
           Requires: elilo-xcat
Error: Package: xCAT-2.7.2-snap201205230215.x86_64 (xcat-2-core)
           Requires: xCAT-genesis-x86_64

To rectify these errors, download the missing packages from http://sourceforge.net/projects/xcat/files/yum/xcat-dep/rh6/x86_64/ and do an rpm install

# rpm -Uvh xCAT-genesis-x86_64-2.7.......
# rpm -Uvh elilo-xcat-3.14-4.noarch.rpm

Finally do a yum install xCAT and you should be able to install without issue.

Monday, June 4, 2012

Unable to find kernel source for CentOS 5

Sometimes, when installing an application that requires the kernel source on CentOS 5, the application is not able to find it even after a yum install of kernel-devel. One reason, among a whole host of possibilities, is that the path to the source is incorrect.

In my case, on CentOS 5.4:
If you go to /lib/modules/{kernel-version} (in my case 2.6.18-164.el5), you will see that the build link has a red background, which means the path it points to is wrong. You have to correct the soft link.

# cd /lib/modules/2.6.18-164.el5
# ln -sfn /usr/src/kernels/2.6.18-164.9.1.el5-x86_64 build

You will notice that /lib/modules/{kernel-version}/source is now linked nicely to "build"
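
The red background in the ls output indicates a symlink whose target does not exist, and the same check can be scripted. This is a small sketch: is_dangling_link is a hypothetical helper, and the demo uses a scratch directory standing in for /lib/modules/{kernel-version}.

```shell
# is_dangling_link PATH
# Succeeds when PATH is a symlink whose target does not exist.
# Hypothetical helper written for this post.
is_dangling_link() {
    [ -L "$1" ] && [ ! -e "$1" ]
}

# Scratch directory standing in for /lib/modules/{kernel-version}.
demo="$(mktemp -d)"
ln -s "$demo/no-such-kernel-source" "$demo/build"
if is_dangling_link "$demo/build"; then
    echo "build is a broken link"
fi
rm -rf "$demo"
```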