Condor Master-Worker

From Ben's Writing

Jump to: navigation, search

Contents

Introduction

This is a basic tutorial for getting up and running with MW. If you are on the University of Lethbridge campus, using a CS account, then you can skip the installation part, and go directly to the notes on the examples.

This material is based on a tutorial for the The 3rd International Summer School on Grid Computing 2005.

Installation

Download a copy of the MW source from the MW site:

$ wget http://www.cs.wisc.edu/condor/mw/mw-0.3.0.tgz

Unpack the source to some temporary location:

$ tar zxvpf mw-0.3.0.tgz

Now configure and build MW. Make sure your environment is properly configured and you can find the Condor binaries.

$ which condor_version
/local/raid1/condor/binaries/Linux-i686/bin/condor_version
$ echo $CONDOR_CONFIG
/home/condor/condor_config
$ ./configure --prefix=/local/raid1/condor/binaries/mw-0.3.0-`uname`-`uname -p` \
              --with-condor=/local/raid1/condor/binaries/`uname`-`uname -p` \
              --without-pvm --without-mwfile --without-chirp --with-mpi
checking for g++... g++
checking for C++ compiler default output file name... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
...
$ make
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make all) ; done
make[1]: Entering directory `/tmp/mw-src/src'
/usr/lib/ccache/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" \
 -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\"  \
 -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 \
 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 \
 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 \
 -DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1 -DHAVE_SYS_TIME_H=1 \
 -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1 -DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 \
 -DHAVE_GETHOSTNAME=1 -DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1 \
 -DHAVE_DYNAMIC_CAST= -DCONDOR_DIR=\"/local/raid1/condor/binaries/Linux-i686\" \
 -DUSE_CHIRP=1 -DUSE_POLL=1  -I. -I. -IRMComm -IMW-File -IMW-CondorPVM \
 -IMW-Socket -IMWControlTasks -g -O2 -Wall -c MW.C
...

Installing the resulting binaries is simples; note, however, you do not need to do this if you are just working with the examples shipped with MW. The above will build the examples for you and will leave them ready for use in the examples directory.

$ make install
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make install) ; done
make[1]: Entering directory `/tmp/mw-src/src'
/bin/sh ../mkinstalldirs /local/raid1/condor/binaries/mw-0.3.0-Linux-i686/lib
/usr/bin/install -c -m 644 libMW.a /local/raid1/condor/binaries/mw-0.3.0-Linux-i686/lib/libMW.a
...
$ cd /local/raid1/condor/binaries
$ ln -sf mw-0.3.0-`uname`-`uname -p` mw-`uname`-`uname -p`

That should be all that you need to do to get MW installed. If there are any errors please refer to the MW home page, or speak directly with the Condor team.

MW Examples

MW ships with several examples. These examples can be found in the examples directory:

$ cd /tmp/mw/examples
$ ls
fib/       Makefile     matmul/     newskel/   skel/
knapsack/  Makefile.in  newmatmul/  n-queens/

Submitting MW Jobs

The submit file for the matmul example is submit_socket.

Look at submit_socket:

# Now we're in the scheduler universe 
Universe   = Scheduler

# The name of our exeutable 
Executable = mastermatmul_socket

# Assume a max image size of 4 Megabytes. 
Image_Size     = 4 Meg
+MemoryRequirements = 4

# This goes into stdin for the master. 
Input      = in_master.socket

# Set the output of this job to go to out_master 
Output     = out_master.socket

# Set the stderr of this job to go to out_worker.  It is named 
# out_worker because the output of the workers is directed to stderr 
Error      = out_worker.socket

# Keep a log in case of problems. 
Log        = work.log

# Send us an email if anything happens
notify_user = invalid-user@cs.uleth.ca

Queue

Notice two things about this submit file:

  1. Change the notify_user line to be correct for you.
  2. This is a Scheduler Universe job. It's a job that runs on the submit computer as soon as you submit it. You get all the benefits of Condor (reliability, logging, etc.) with a job that executes locally. We use it for DAGMan and MW: it is a job that submits other jobs and watches over them. In this case, it will be master, which will spawn the other workers (as jobs) and will send them their tasks.

Before you submit the job, ensure the executables have the correct permissions:

$ chmod 775 mastermatmul* workermatmul*

Now submit the job and watch it run:

$ condor_submit submit_socket 
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.

$ condor_q
-- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  1.0   condor          2/10 19:34   0+00:00:23 R  0   4.2  mastermatmul_socke 

1 jobs; 0 idle, 1 running, 0 held

...

$ condor_q
-- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  1.0   condor          2/10 19:34   0+00:00:23 R  0   4.2  mastermatmul_socke 
  2.0   condor          2/10 19:34   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  2.1   condor          2/10 19:34   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  2.2   condor          2/10 19:34   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  2.3   condor          2/10 19:34   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  2.4   condor          2/10 19:34   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  2.5   condor          2/10 19:34   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)

7 jobs; 6 idle, 1 running, 0 held
 
...

$ condor_q
-- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

We saw the master submit six workers. Several of them started to run, and they did all of the work. Look at out_master.socket to see the result of the run:

$ cat out_master.socket
19:34:08.435 MWDriver is pid 27140.
19:34:08.435 Starting from the beginning.
19:34:08.435 argc=1, argv[0]=condor_scheduniv_exec.1.0
19:34:08.435 Adding executable workermatmul_socket for ((OPSYS=="LINUX")&&(ARCH=="INTEL"))
19:34:08.435 MWRMComm::set_num_arch_classes to 1
19:34:08.435 MWRMComm::set_arch_class_attributes for arch class 0 to ((OPSYS=="LINUX")&&(ARCH=="INTEL"))
19:34:08.435 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL.exe
19:34:08.687 In MWFileRC::init_beginning_workers()
19:34:08.687 Good to go.
19:34:08.687 In hostaddlogic
...

If you look at the output carefully, you'll notice that only one worker did all of the tasks. That is because the time to do the tasks in this simple case was really short.

Creating new MW Applications

As we saw above, MW ships with several examples. One of these examples can be used to create skeleton MW applications. Look at the examples/newskel directory:

$ cd /tmp/mw/examples/newskel
$ ls
configure       in_master.pvm       README          Task_MYAPP.h
configure.in    in_master.socket    rep_MYAPP.pl    WorkerMain_MYAPP.C
Driver_MYAPP.C  Makefile.in         submit_pvm      Worker_MYAPP.C
Driver_MYAPP.h  MasterMain_MYAPP.C  submit_socket   Worker_MYAPP.h
in_master.indp  new_app             Task_MYAPP.C

We are most interested in the new_app script. The script takes on argument, the new application's name (test_app). The script will create a directory ../test_app, create template files in that directory, configure and build the simple example for you.

Unfortunately, there is a bug in the code generation script which ends up creating non-functional Makefiles if you use an installed version of MW vs. a version you have built yourself (i.e. any include line that refers to the src directory will silently fail). Apply the following patch to fix the problem:

--- Makefile.in-	2010-02-10 21:10:25.000000000 −0700
+++ Makefile.in	2010-02-11 19:19:20.000000000 −0700
@@ -66,12 +66,12 @@
  MW_LIBDIR = $(MW_LIBDIR_DEBUG)
  # MW-File doesn't work with Insure, so will not compile *_file if DEBUG_BUILD
  PROGRAMS = master_MYAPP_socket worker_MYAPP_socket $(INDEPENDENT_PROGS)
-  INCLUDES = -I$(MWDIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-CondorPVM \
+  INCLUDES = -I$(MW_DIR)/include -I$(MW_DIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-CondorPVM \
		-I$(INDEPENDENT_INCLS) $(MEASURE_DEFN)
else
  PROGRAMS = \
		master_MYAPP_socket worker_MYAPP_socket $(INDEPENDENT_PROGS)
-  INCLUDES = -I$(MW_DIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-File \
+  INCLUDES = -I$(MW_DIR)/include -I$(MW_DIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-File \
		-I$(MW_DIR)/src/RMComm/MW-CondorPVM -I$(INDEPENDENT_INCLS) $(MEASURE_DEFN)
endif

To generate a set of files for your application, run:

$ ./new_app test_app
(1) Creating directory ../test_app
(2) Making new template files in ../test_app
(3) Now you must run configure with appropriate options, then make

Since the new application is created at the same level as 'newskel', which is usually at mw/examples/app_name, and the default MW install path (../..) should, in general, work; however, since we are going to be running our application on the CS pool, we need to specify our local MW path in configure:

$ cd ../test_app	
$ autoconf ; configure --with-MW=/local/raid1/condor/binaries/mw-`uname`-`uname -p` \
                       --with-condor=/local/raid1/condor/binaries/`uname`-`uname -p` \
                       --without-pvm --without-chirp
checking for g++... g++
checking for C++ compiler default output file name... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
...

Now, compiling the new application is easy:

$ make
/usr/lib/ccache/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" \
 -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" \
 -DUSE_CHIRP=1 -I. \
 -I/local/raid1/condor/binaries/mw-Linux-i686/src \
 -I/local/raid1/condor/binaries/mw-Linux-i686/src/MWControlTasks \
 -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm \
 -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm/MW-File \
 -I/local/raid1/condor/binaries/mw-Linux-i686/include \
 -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm/MW-CondorPVM \
 -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm/MW-Independent \
 -g -O2 -Wall -c Driver_test_app.C
...

We can use the default submit file for the skeleton application:

Universe = Scheduler
Executable = master_test_app_socket
Image_Size = 4
+MemoryRequirements = 4
Input = in_master.socket
Output = out_master.socket
Error = out_worker.socket
Log = test_app.log
Requirements = (Arch == "INTEL" && OPSYS=="LINUX")

# Only start out with one machine; add more ourselves
# The two numbers are the min and max number of hosts to get before startup.
Machine_count = 1..1

Queue

Again, before you submit the job, ensure the executables have the correct permissions:

$ chmod 775 master_test_app_* worker_test_app_*

Now we can finally submit our new application to Condor:

$ condor_submit submit_socket
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 7.

...
 
$ condor_q
-- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  7.0   condor          2/10 22:04   0+00:00:07 R  0   0.0  master_test_app_so

1 jobs; 0 idle, 1 running, 0 held

...

$ condor_q
-- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  7.0   condor          2/10 22:04   0+00:00:07 R  0   0.0  master_test_app_so
  8.0   condor          2/10 22:04   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  8.1   condor          2/10 22:04   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  8.2   condor          2/10 22:04   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)
  8.3   condor          2/10 22:04   0+00:00:00 I  0   0.0  mw_exec0.$$(Opsys)

5 jobs; 4 idle, 1 running, 0 held

...

$ condor_q
-- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

We saw the master submit four workers. Each did it's own part of the work, which is to pick out the largest number from a bunch of integers given in the input file. Look at out_master.socket to see the result of the run:

$ cat out_master.socket
22:04:28.938 The master is starting.
22:04:28.938 MWDriver is pid 7480.
22:04:28.939 Starting from the beginning.
22:04:28.939 argc=1, argv[0]=condor_scheduniv_exec.7.0
22:04:28.939 Enter Driver_test_app::get_userinfo
22:04:28.939 arg 0: condor_scheduniv_exec.7.0
22:04:28.939 Set the arch class to 1.
22:04:28.939  worker_test_app_socket
22:04:28.939 tempnum_executables = 0
22:04:28.939 Making a link from worker_test_app_socket to mw_exec0.LINUX.INTEL.exe
...

From this skeleton a new MW application can be created to serve any function.

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox