Scripting on Supercomputers

Monday, March 28, 2011

monads in R: sapply and foreach

Monads are a powerful way of structuring functional programs. They are used in functional languages like Haskell or F# to define control flow (handling concurrency, continuations, side effects such as input/output, or exceptions). Basically a monad is defined by a type and two functions, unit and bind. Alternatively one can define a monad using two other functions together with a data type, namely join and fmap. In the following I want to show some analogies between fmap and join and the functions sapply and c in R.
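For concreteness, the unit/bind pair can also be written down for R lists directly. This is my own minimal sketch (the names unit and bind follow the text above; the implementation is an illustrative assumption):

```r
unit <- function(a) list(a)                       # wrap a value in a list
bind <- function(x, f) do.call(c, lapply(x, f))   # map f over x, then flatten one level

bind(list(1, 2), function(a) list(a, a * 10))     # list(1, 10, 2, 20)
```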

fmap is defined as a (higher-order) function that fulfills the following relations (in Haskell notation, where . denotes function composition):

fmap id = id
fmap (f . g) = (fmap f) . (fmap g)

the definition in R is then:
fmap <- function(f) function(x) sapply(x, f)

for the join function the following law has to hold (again in Haskell):

join . fmap join = join . join

this translates to the following code identities in R (read === as "is the same as"):
c(fmap(c)(x)) === c(sapply(x,c)) === c(x) === x

Therefore c and sapply define the list monad in R. Can we go one step further and define a notation like the "do" notation in Haskell in R? I think this already exists: it is the foreach package.

foreach(x=a) %do% f(x)

is the exact translation of the corresponding monadic "do" notation (again in Haskell):

fmap f x = do
    r <- x
    return (f r)

join x = do
    a <- x
    a

and in R:
fmap(f)(x) === foreach(r=x) %do% f(r) === sapply(x,f)
join(x) === foreach(a=x) %do% a === c(x)
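These identities are easy to check in a live session. A small sketch, assuming the foreach package is installed (x and f are arbitrary examples of mine):

```r
library(foreach)

x <- 1:5
f <- function(v) v^2

a <- foreach(r = x) %do% f(r)   # the "do"-notation form; foreach returns a list
b <- sapply(x, f)               # the fmap form; sapply simplifies to a vector
identical(unlist(a), b)         # TRUE: same result up to the list wrapper
```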

Now what does this mean for the several foreach backends (doMC, doMPI, doRedis, doSNOW)?
They are all implementations of monads in R. So it would be very interesting to port the other monadic types from functional programming languages to R.

doRedis: redis as dispatcher for parallel R

Redis is an open source, advanced key-value store. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, sets and sorted sets. Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl...

B. W. Lewis has developed a parallel extension for the foreach package that allows a cluster of workers to obtain workloads from a redis server.

This is the Redis binding for R (rredis):

  • redisConnect() # connect to redis store
  • redisSet('x',runif(5)) # store a value
  • redisGet('x') # retrieve value from store
  • redisClose() # close connection
  • redisAuth(pwd) # simple authentication
  • redisConnect()
  • redisLPush('x',1) # push numbers into list
  • redisLPush('x',2)
  • redisLPush('x',3)
  • redisLRange('x',0,2) # retrieve list
Using this Redis interface for R, the fine R library doRedis allows Redis to act as a dispatcher for parallel R commands on a cluster. The usage is fairly easy and closely resembles the usage of the doSNOW clustering library:
  • first start a redis server on one of the machines as the cluster master
  • then connect as many R workers as you want to the redis master
  • finally start an R interpreter, connect to the redis master, and submit the parallel computation job using the foreach package
The workload is then distributed to the workers (which can reside on machines other than the master) and the results are gathered back into the interactive R interpreter. Here is an example of how to use the package:

start the redis server:

./redis-server

start as many workers as you like

echo "require('doRedis');redisWorker('jobs')" | R --no-save -q &

start an R interpreter and connect to the redis server:

library(doRedis)
registerDoRedis('jobs')
foreach(j=1:1000, .combine=sum) %dopar% sum(runif(10000000))

If you want to minimize the communication with the redis server, use setChunkSize to send out larger chunks of tasks to each R worker.

Monday, June 28, 2010

R is the superglue language

We all know that scripting languages are often used to glue Unix commands together.
But the ultimate glue language is R. Using R you can glue shared object libraries together in a seamless way that makes Perl or Python look pale. Here is a set of slides on how you can use it together with all the R goodies:


Using R you can glue together C/C++/Objective-C, FORTRAN, OpenGL, CUDA, the web, MPI, threads and you name it...
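As a small taste of that glue, here is a minimal sketch of calling a C routine from R via the .C interface (the file name square.c and the routine are my own example; assumes R is installed together with a C toolchain):

```r
# write a tiny C source file from within R
writeLines(c(
  '#include <R.h>',
  'void square(double *x, int *n) {',
  '    for (int i = 0; i < *n; i++) x[i] = x[i] * x[i];',
  '}'
), "square.c")

system("R CMD SHLIB square.c")                     # compile to a shared object
dyn.load(paste0("square", .Platform$dynlib.ext))   # load it into the session

res <- .C("square", x = as.double(1:3), n = 3L)    # call the C routine
res$x                                              # 1 4 9
```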

Installing bioperl without root access

In order to give our users maximum freedom, we do not install Perl modules system-wide; instead, modules have to be installed on a per-user basis, unless a module is of such importance and wide use among our users that it justifies a system-wide installation. Here is a tutorial on how to install Perl modules yourself without root access:

1) start cpan and run the configuration; the defaults are OK.
set the download sites (e.g. 11 9 4)
this takes a while -> quit

2) create local perl module directory:
> mkdir ~/myperl

3) start cpan and set the following options:

> cpan
cpan> o conf makepl_arg "LIB=~/myperl/lib \
INSTALLMAN1DIR=~/myperl/man/man1 \
INSTALLMAN3DIR=~/myperl/man/man3 \
INSTALLSCRIPT=~/myperl/bin \
INSTALLBIN=~/myperl/bin"

cpan> o conf mbuildpl_arg "--lib=~/myperl/lib \
--installman1dir=~/myperl/man/man1 \
--installman3dir=~/myperl/man/man3 \
--installscript=~/myperl/bin \
--installbin=~/myperl/bin"

cpan> o conf mbuild_install_arg "--install_path lib=~/myperl"
cpan> o conf prerequisites_policy automatically
cpan> o conf commit
cpan> quit

then you can install whatever modules you like in your local directory.
For example, for bioperl you would need:

cpan> d /bioperl/
CPAN: Storable loaded ok
Going to read /home/bosborne/.cpan/Metadata
Database was generated on Mon, 20 Nov 2006 05:24:36 GMT

....

Distribution B/BI/BIRNEY/bioperl-1.2.tar.gz
Distribution B/BI/BIRNEY/bioperl-1.4.tar.gz
Distribution C/CJ/CJFIELDS/BioPerl-1.6.0.tar.gz

Now install:

cpan> force install C/CJ/CJFIELDS/BioPerl-1.6.0.tar.gz

In case the download is slow, edit the file ~/.cpan/CPAN/MyConfig.pm and insert the following line
into $CPAN::Config:

'dontload_hash' => { "Net::FTP" => 1, "LWP" =>1 },

In case something goes wrong, you can delete ~/.cpan and start over.

Very helpful is the Perl shell, which you can obtain by installing:
cpan> install Psh
cpan> install IO::String

It is then available under ~/myperl/bin/psh.
For example, try out:
psh% use Bio::Perl;
psh% $seq_object = get_sequence('genbank',"ROA1_HUMAN");
psh% write_sequence(">roa1.fasta",'fasta',$seq_object);

HPC and Visualisation

LRZ is giving a course on "Visualisation of Large Data Sets on Supercomputers". Here are some of the slides:


Monday, May 31, 2010

Intel announces accelerator card

Today at 12:00 at the International Supercomputing Conference ISC2010 here in Hamburg, Intel announced a new accelerator card featuring 32 cores and up to 2 GB RAM, which can run 128 threads via hyperthreading. Looks like the accelerator wars have begun. They showed an LU decomposition test running on the card and claimed it to be the fastest decomposition at the moment, at more than 500 GFlop/s. Pretty impressive. I attended the keynote by Kirk B. Skaugen, Vice President, Intel Architecture Group & General Manager, Data Center Group, Intel, USA, who demoed the accelerator card and presented a complete software stack including C and FORTRAN compilers.

Thursday, May 27, 2010

R on the SGI Altix 4700


Here at LRZ we have a large SGI Altix 4700 supercomputer. Lately we have compiled an MPI-enabled version of R for this beast. The machine sports 9728 cores connected via a ccNUMA interconnect and maxes out at 62 TFlop/s. With R we were able to use a whopping 4000 cores on the machine and still get decent performance.

MPI programming with R is a breeze. You can use the foreach package from Revolution Analytics: just transfer your serial code to the supercomputer, replace all %do% with %dopar%, submit your jobs, and you are done.
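The %do%-to-%dopar% switch can be tried on any workstation before moving to the supercomputer. A minimal sketch using the doSNOW backend with two local socket workers as a stand-in for doMPI (backend and worker count are my own choice):

```r
library(foreach)
library(doSNOW)

cl <- makeCluster(2, type = "SOCK")   # on the Altix this would be an MPI cluster via doMPI
registerDoSNOW(cl)

serial   <- foreach(i = 1:4, .combine = c) %do%    i^2
parallel <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)

identical(serial, parallel)   # TRUE: only the backend changed
```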

Here is a short tutorial on how to start programming on supercomputers with R.