Wednesday, November 30, 2011

pbs_mom LOG_ERROR sys_copy, command /usr/bin/scp -rpB

One of my parallel jobs failed, and this error appeared in the log file on my compute nodes.

pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB...............failed with status=1, 
giving up after 4 attempts

My SSH public/private key authentication is working without a hitch. Similarly, my /etc/hosts and firewall are as I expected. But I realised that my /etc/resolv.conf and /etc/sysconfig/network were incorrect. I got a hint of this possibility when I was reading this forum thread: http://www.mail-archive.com/mauiusers@supercluster.org/msg00998.html . After a quick amendment everything seems OK, at least for now. I will write again if this solution turns out to be incorrect. :)
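The two files mentioned above can be inspected quickly on each compute node. The following is a minimal sketch, not the exact check used here; the function names sysconfig_hostname and resolv_nameservers are made up for illustration, and the default paths assume a CentOS-style layout.

```shell
# Print the HOSTNAME= value from a sysconfig-style file
# (defaults to /etc/sysconfig/network, as on CentOS 5).
sysconfig_hostname() {
    file="${1:-/etc/sysconfig/network}"
    # Keep only the last HOSTNAME= line, then strip optional quotes.
    sed -n 's/^HOSTNAME=//p' "$file" | tail -n 1 | tr -d "\"'"
}

# Print the nameserver addresses listed in a resolv.conf-style file.
resolv_nameservers() {
    file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" { print $2 }' "$file"
}
```

Comparing the output of sysconfig_hostname with the output of the hostname command, and checking that every address printed by resolv_nameservers is reachable, is one way to spot the kind of mismatch described above.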

Tuesday, November 29, 2011

Companies Comparison for Storage in the Gartner Ranking

For information on companies selling storage boxes, do look at the article:
Ball-gazer casts magic runes to heal HP's credibility. Swallowing startups pushes up Gartner ranking.
NetApp and EMC are the leaders with NetApp ahead of EMC on the vision scale.


Installing Pylith using Pylith Installer


PyLith is a finite element code for the solution of dynamic and quasi-static tectonic deformation problems.

This entry focuses only on compiling PyLith with the installer. Most, if not all, of the information comes from the INSTALLER files that you get after you untar the software.

For more information, see Installing Pylith using Pylith Installer



Thursday, November 24, 2011

Cannot find -llapack when doing /usr/bin/ld on CentOS 5

I encountered an error when one of our researchers compiled a Fortran program which requires BLAS and LAPACK.

$ g77 test.f  -L/usr/lib64/ -llapack -lblas
/usr/bin/ld: cannot find -llapack
collect2: ld returned 1 exit status

I was quite puzzled, as I had installed both lapack and blas, and only lapack seemed to be having issues.

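To check whether you have the libraries, you can use the command
$ ldconfig -p | grep lapack
 libscalapack.so.1 (libc6,x86-64) => /usr/lib64/libscalapack.so.1
 liblapack.so.3 (libc6,x86-64) => /usr/lib64/liblapack.so.3
So it is not an issue of missing lapack libraries. The library is there, but only under the versioned name liblapack.so.3. When given -llapack, the linker looks for a file named liblapack.so or liblapack.a. According to the ld man page: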

"On systems which support shared libraries, ld may also search for libraries with extensions other than ".a". Specifically, on ELF and SunOS systems, ld will search a directory for a library with an extension of ".so" before searching for one with an extension of ".a". By convention, a ".so" extension indicates a shared library.

The linker will search an archive only once, at the location where it is specified on the command line. If the archive defines a symbol which was undefined in some object which appeared before the archive on the command line, the linker will include the appropriate file(s) from the archive. However, an undefined symbol in an object appearing later on the command line will not cause the linker to search the archive again."


So a quick soft link solved the problem
$ ln -s /usr/lib64/liblapack.so.3 /usr/lib64/liblapack.so
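To see in advance whether -l<name> can be satisfied from a given directory, the lookup quoted above can be imitated with a few lines of shell. This is only a sketch of the naming rule (lib<name>.so first, then lib<name>.a) for a single directory; linkable_lib is a hypothetical helper name, and the real linker also consults its other search paths.

```shell
# Report which file the linker would pick for -l<name> in one directory.
# Follows the order from the ld man page: lib<name>.so, then lib<name>.a.
linkable_lib() {
    dir="$1"
    name="$2"
    for ext in so a; do
        if [ -e "$dir/lib$name.$ext" ]; then
            echo "$dir/lib$name.$ext"
            return 0
        fi
    done
    # Only versioned files such as lib<name>.so.3 (or nothing) were found.
    echo "no lib$name.so or lib$name.a in $dir" >&2
    return 1
}
```

For example, "linkable_lib /usr/lib64 lapack" fails before the soft link is created and prints /usr/lib64/liblapack.so afterwards.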



Wednesday, November 23, 2011

Unspecified GSS failure from SSH causes slow login

When I SSH into one of my servers, I encounter the following error. Eventually, after waiting about 15-20 seconds, I am able to connect. This is far too long for a LAN-based machine.

$ ssh -v ip_of_remote_server

.....
debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195

debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195

debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195
.....


I was quite puzzled, as I was using the IP address of the server to ssh and had already set "UseDNS = no" in /etc/ssh/sshd_config (see Resolving Slow SSH Login). In addition, I was using SSH public/private key authentication (see Auto SSH Login without Password).


But the resolution for this issue was easier than I thought. I just needed to ensure that /etc/hosts contains both the server I am connecting from and the server I am connecting to, and the login became very quick.

If you are using DNS instead of /etc/hosts, do check your DNS settings in /etc/resolv.conf
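A quick way to confirm that both machines are listed is to search the hosts file for each name or address as a whole field. The sketch below assumes a standard hosts-format file; hosts_has is a hypothetical helper name, and the host name in the usage note is a placeholder.

```shell
# Succeed if a hosts-format file has a non-comment entry whose address or
# one of whose names exactly equals the given key.
hosts_has() {
    file="$1"
    key="$2"
    awk -v key="$key" '
        { sub(/#.*/, "") }    # ignore comments
        { for (i = 1; i <= NF; i++) if ($i == key) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

For example, running hosts_has /etc/hosts my_client_hostname || echo "client missing from /etc/hosts" on the server being logged into shows whether the client is known there.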

Other SSH issues you may want to read about:
  1. SSH Error : Permission denied (publickey,gssapi-with-mic,password)
  2. Resolving Slow SSH Login

Tuesday, November 22, 2011

List of Intel Xeon and AMD Microprocessors with pricing

The Wikipedia listing of Intel Xeon microprocessors, with pricing in USD, is very useful for price comparison and budgeting. See Wikipedia List of Intel Xeon Microprocessors

Similarly, the Wikipedia listing of AMD microprocessors is very informative, but sadly it has no price listing. See
List of AMD Opteron microprocessors

Monday, November 21, 2011

Brief overview of Valgrind usage

This write-up covers some very basic commands, but I will try to list some other tutorials and reading material to make up for the lack of detail. I'm assuming that you have compiled Valgrind as described in Compiling Valgrind on CentOS 5.

One of the most commonly used commands in Valgrind is
# valgrind --tool=memcheck --leak-check=full ./my_program

Commonly-used Options

  1. --leak-check=<no|summary|yes|full> [default: summary]
     When enabled, search for memory leaks when the client program finishes. If set to summary, it says how many leaks occurred. If set to full or yes, it also gives details of each individual leak.
  2. --show-reachable=<yes|no> [default: no]
     When disabled, the memory leak detector only shows "definitely lost" and "possibly lost" blocks. When enabled, the leak detector also shows "reachable" and "indirectly lost" blocks. (In other words, it shows all blocks, except suppressed ones.)

For more details on Valgrind options and usage, see
  1. Valgrind Manual - 4.3 Memcheck Command Options
  2. Using Valgrind to Find Memory Leaks and Invalid Memory Use
  3. Using Valgrind to debug memory leaks
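When Memcheck runs as part of a script, it helps to write the report to a file with Valgrind's --log-file option and pull out the "definitely lost" figure from the leak summary. The following is a rough sketch; valgrind_definitely_lost is a hypothetical helper name, and it assumes the usual "==PID==    definitely lost: N bytes in M blocks" summary line.

```shell
# Print the number of "definitely lost" bytes recorded in a Valgrind log file.
# Thousands separators (e.g. 1,024) are removed so the result is a plain number.
valgrind_definitely_lost() {
    sed -n 's/^==[0-9]*== *definitely lost: \([0-9,]*\) bytes.*/\1/p' "$1" | tr -d ','
}
```

A typical use would be to run valgrind --tool=memcheck --leak-check=full --log-file=vg.log ./my_program and then call valgrind_definitely_lost vg.log. If no leak summary is present in the log, the function prints nothing.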

Wednesday, November 16, 2011

Removing a node from the Ganglia Web Frontend

According to the Ganglia_Readme, there is no easy way to remove a single dead node from the list in the Ganglia web front-end. To flush dead nodes from the record, add the following lines to /etc/gmond.conf and restart the gmetad and gmond processes.

globals { 
host_dmax = 3600 
}

The hosts will be removed from host tables when they haven't been heard from in 3600 seconds. See "man gmond.conf" for details.

Sunday, November 13, 2011

Compiling Valgrind on CentOS 5

Valgrind tools can automatically detect many memory management and threading bugs, and profile your programs in detail. It runs on the following platforms: X86/Linux, AMD64/Linux, ARM/Linux, PPC32/Linux, PPC64/Linux, S390X/Linux, ARM/Android (2.3.x), X86/Darwin and AMD64/Darwin (Mac OS X 10.6 and 10.7).

According to Valgrind, a number of useful tools are supplied as standard.
  1. Memcheck is a memory error detector. It helps you make your programs, particularly those written in C and C++, more correct.
  2. Cachegrind is a cache and branch-prediction profiler. It helps you make your programs run faster.
  3. Callgrind is a call-graph generating cache profiler. It has some overlap with Cachegrind, but also gathers some information that Cachegrind does not.
  4. Helgrind is a thread error detector. It helps you make your multi-threaded programs more correct.
  5. DRD is also a thread error detector. It is similar to Helgrind but uses different analysis techniques and so may find different problems.
  6. Massif is a heap profiler. It helps you make your programs use less memory.
  7. DHAT is a different kind of heap profiler. It helps you understand issues of block lifetimes, block utilisation, and layout inefficiencies.
  8. SGcheck is an experimental tool that can detect overruns of stack and global arrays. Its functionality is complementary to that of Memcheck: SGcheck finds problems that Memcheck can't, and vice versa.
  9. BBV is an experimental SimPoint basic block vector generator. It is useful to people doing computer architecture research and development.

Compilation of Valgrind

Compilation is very straightforward.
# tar -xvjpf valgrind-3.7.0.tar.bz2
# cd valgrind-3.7.0
# ./configure --prefix=/usr/local/valgrind-3.7.0
# make && make install

Testing Valgrind

# /usr/local/valgrind-3.7.0/bin/valgrind ls -l
Either this works, or it bombs out with some complaint.

For more information, see Compiling Valgrind on CentOS 5

Saturday, November 12, 2011

Compiling adaptive Poisson-Boltzmann Solver (APBS) on CentOS 5


Adaptive Poisson-Boltzmann Solver (APBS) is a software package for modeling biomolecular solvation through solution of the Poisson-Boltzmann equation (PBE), one of the most popular continuum models for describing electrostatic interactions between molecular solutes in salty, aqueous media......

Installation is very simple. Many pre-built binaries are available, and you can use them directly. Do note that the latest binaries (apbs-1.3) require glibc 2.7 or greater. If you are using CentOS 5, which ships glibc 2.5, you may want to use the apbs-1.21 binaries or older.
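Before downloading a binary, you can check the glibc version of the machine against the requirement. The following is a small sketch: the helper names version_ge and glibc_version are made up for illustration, the comparison is done in awk because the older sort shipped with CentOS 5 may not support version sorting, and glibc_version relies on getconf GNU_LIBC_VERSION, which only works on glibc-based systems.

```shell
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components count as 0, so 2.7 equals 2.7.0.
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Print the glibc version of this machine, e.g. "2.5".
glibc_version() {
    getconf GNU_LIBC_VERSION | awk '{ print $2 }'
}
```

For example, if version_ge "$(glibc_version)" 2.7 succeeds, the apbs-1.3 binaries should at least meet the glibc requirement; otherwise an older binary or a build from source is the safer choice.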

For details, see Compiling adaptive Poisson-Boltzmann Solver (APBS) on CentOS 5 on Linux Cluster

Tuesday, November 8, 2011

Using strace as a troubleshooting tool

Taken from Using strace as a troubleshooting tool (linuxcluster.wordpress.com).

Strace, when run in conjunction with a program, outputs all the calls made to the kernel by the program.

One quick way to find out what is going on in your program is to do
$ strace -c ./my_hello_world_program
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
74.80    0.002998        1499         2           wait4
21.91    0.000878           4       221           read
0.95    0.000038           0       237         2 mmap
0.77    0.000031          10         3         1 mkdir
0.67    0.000027           0       566       361 open
0.35    0.000014           0        81           mprotect
0.30    0.000012           0        62        37 stat
0.25    0.000010           0       225           close
0.00    0.000000           0        37         1 write
0.00    0.000000           0       132           fstat
0.00    0.000000           0         8           poll
0.00    0.000000           0         2           lseek
0.00    0.000000           0       120           munmap
0.00    0.000000           0        15           brk
0.00    0.000000           0        16           rt_sigaction
................

................

------ ----------- ----------- --------- --------- ----------------
100.00    0.004008                  1990       411 total

If you wish to do a full trace, just run strace on the program. If there is an error, you can easily find it in the output.
$ strace ./my_hello_world_program
............

............

open("/tmp/openmpi-sessions-root@starfruit-h00.cluster.spms.ntu.edu.sg_0/25979/1/0",
O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory)
munmap(0x2b46e05ef000, 2111200)         = 0
munmap(0x2b46dffe5000, 2102312)         = 0
munmap(0x2b46dfdde000, 2123264)         = 0
munmap(0x2b46e103f000, 2106960)         = 0
munmap(0x2b46e1242000, 2104560)         = 0
munmap(0x2b46e269d000, 2114912)         = 0
munmap(0x2b46e41c9000, 2145008)         = 0
munmap(0x2b46e43d5000, 2162608)         = 0

If you wish to write the output of strace to a file instead, use the -o argument
$ strace -o strace_output_file ./my_hello_world_program
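Once the trace is in a file, the failed calls can be summarised per error code instead of being read line by line. The sketch below assumes the usual strace line format, where a failed call ends in "= -1 ERRNO (description)"; strace_errors is a hypothetical helper name.

```shell
# Count failed system calls per errno name in a file written by "strace -o".
# Output: one "<count> <ERRNO>" line per error, most frequent first.
strace_errors() {
    awk '/ = -1 E[A-Z0-9]+/ {
        for (i = 2; i < NF; i++)
            if ($(i - 1) == "=" && $i == "-1") { count[$(i + 1)]++; break }
    }
    END { for (e in count) print count[e], e }' "$1" | sort -rn
}
```

For example, strace_errors strace_output_file might show that most failures are ENOENT, which is often harmless library path probing, so that the less common errors stand out.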

If you wish to trace only file, process or network related system calls, you can use "-e trace=file", "-e trace=process" or "-e trace=network". You can also name individual system calls:
$ strace -e trace=open,close,read,write ./my_hello_world_program
$ strace -e trace=stat,chmod,unlink ./my_hello_world_program

Further Information:
  1. Solutions for tracing UNIX applications (IBM DeveloperWorks)
  2. strace - A very powerful troubleshooting tool for all Linux users (linuxhelp.blogspot.com)
  3. Ten commands every linux developer should know (Linux Journal)

Friday, November 4, 2011

A*Star Computational Resource Centre Software Listings

A*STAR, the main Government-funded research organisation in Singapore, has a highly effective A*STAR Computational Resource Centre (A*CRC) which provides high performance computational (HPC) resources to the entire A*STAR research community.

They have an interesting software listing which includes:
  1. Biology and Bioinformatics
  2. Chemistry and Molecular Modeling
  3. Physics and Material Science 
  4. Mathematical, Statistical and Other Utilities
  5. Software Development
  6. System Software

Thursday, November 3, 2011

NetApp posts world-record SPEC SFS2008 NFS benchmark result

NetApp achieved over 1.5 million SPEC SFS2008 NFS operations per second with a 24-node cluster based on FAS6240 boxes running ONTAP 8 in Cluster Mode.

For more information, see NetApp posts world-record SPEC SFS2008 NFS benchmark result

Wednesday, November 2, 2011

Basic Overview and use of NMON on CentOS 5



nmon for Linux – Nigel’s performance Monitor for Linux is a wonderful Swiss Army Knife for performance information. You can display multiple screens in the same window and get information on CPU, Memory, NFS, Network, Disks, Resource, kernel etc.


For more information, do look at Basic Overview and use of NMON on CentOS 5 from Linux Cluster

Tuesday, November 1, 2011

Installing ALPS 2.0 from source on CentOS 5

What is ALPS Project?

The ALPS project (Algorithms and Libraries for Physics Simulations) is an open source effort aiming at providing high-end simulation codes for strongly correlated quantum mechanical systems as well as C++ libraries for simplifying the development of such code. ALPS strives to increase software reuse in the physics community.

Good information on installing ALPS can be found on ALPS Wiki's Download and install ALPS for Ubuntu 9.10, Ubuntu 10.04, Ubuntu 10.10, Debian and MacOS

Installing ALPS with Boost

# wget http://alps.comp-phys.org/static/software/releases/alps-2.0.2-r5790-src-with-boost.tar.gz
You will need either gfortran or the Intel Fortran Compiler. If you are installing with gfortran
# yum install gcc-c++ gcc-gfortran
If you want to use the evaluation tools, you will need to install a newer version of Python than the provided 2.4. You can install from source or use an unofficial repository for binary RPMs. This is not required if you just want to run your compiled simulations (c++ applications), but make sure you still have python headers (specify -DALPS_BUILD_PYTHON=OFF when invoking cmake):
# yum install python-devel
BLAS/LAPACK is necessary. Make sure you have the EPEL repository ready. For more information, see Red Hat Enterprise Linux / CentOS Linux Enable EPEL (Extra Packages for Enterprise Linux) Repository
# yum install blas-devel lapack-devel
CMake 2.8.0 and HDF5 1.8 need to be installed. ALPS comes with wonderful scripts that help to compile CMake 2.8 and HDF5 1.8 on CentOS 5
$ $HOME/src/alps2/script/cmake.sh $HOME/opt $HOME/tmp
$ $HOME/src/alps2/script/hdf5.sh $HOME/opt $HOME/tmp

Build ALPS

Create a build directory (anywhere you have write access) and execute cmake giving the path to the alps and to the boost directory:
# cmake -D Boost_ROOT_DIR:PATH=/path/to/boost/directory /path/to/alps/directory
For example, if the ALPS directory is /root/alps-2.0.2
# cmake -D Boost_ROOT_DIR:PATH=/root/alps-2.0.2/boost /root/alps-2.0.2/alps
To install in another directory, set the variable CMAKE_INSTALL_PREFIX
# cmake -DCMAKE_INSTALL_PREFIX=/path/to/install/directory /path/to/alps/directory
For example:
# cmake -DCMAKE_INSTALL_PREFIX=/usr/local/alps-2.0.2 /root/alps-2.0.2/alps
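The two options shown above can also be given in a single cmake invocation. The sketch below only prints the combined command so that it can be reviewed before running; alps_cmake_cmd is a hypothetical helper name, and it assumes the tarball was unpacked into one directory containing the alps and boost subdirectories, as in the examples above.

```shell
# Print a cmake command that sets both the Boost location and the
# install prefix for an unpacked ALPS source tree.
alps_cmake_cmd() {
    src_root="$1"   # unpacked tarball, e.g. /root/alps-2.0.2
    prefix="$2"     # install prefix, e.g. /usr/local/alps-2.0.2
    printf 'cmake -D Boost_ROOT_DIR:PATH=%s/boost -DCMAKE_INSTALL_PREFIX=%s %s/alps\n' \
        "$src_root" "$prefix" "$src_root"
}
```

Calling alps_cmake_cmd /root/alps-2.0.2 /usr/local/alps-2.0.2 prints the command, which can then be pasted into a shell opened in the build directory. Paths containing spaces would need extra quoting.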

Build and test ALPS

$ make -j 8
$ make test
$ make install
* The HDF5 1.8 binaries and libraries are useful not only for compiling ALPS but also for other applications that require HDF5 1.8. You may want to consider moving them to the /usr/local/ directories.