Condor Master-Worker
From Ben's Writing
Contents |
Introduction
This is a basic tutorial for getting up and running with MW. If you are on the University of Lethbridge campus, using a CS account, then you can skip the installation part, and go directly to the notes on the examples.
This material is based on a tutorial for the The 3rd International Summer School on Grid Computing 2005.
Installation
Download a copy of the MW source from the MW site:
$ wget http://www.cs.wisc.edu/condor/mw/mw-0.3.0.tgz
Unpack the source to some temporary location:
$ tar zxvpf mw-0.3.0.tgz
Now configure and build MW. Make sure your environment is properly configured and you can find the Condor binaries.
$ which condor_version /local/raid1/condor/binaries/Linux-i686/bin/condor_version
$ echo $CONDOR_CONFIG /home/condor/condor_config
$ ./configure --prefix=/local/raid1/condor/binaries/mw-0.3.0-`uname`-`uname -p` \
--with-condor=/local/raid1/condor/binaries/`uname`-`uname -p` \
--without-pvm --without-mwfile --without-chirp --with-mpi
checking for g++... g++
checking for C++ compiler default output file name... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
...
$ make [ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make all) ; done make[1]: Entering directory `/tmp/mw-src/src' /usr/lib/ccache/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" \ -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" \ -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 \ -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 \ -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 \ -DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1 -DHAVE_SYS_TIME_H=1 \ -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1 -DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 \ -DHAVE_GETHOSTNAME=1 -DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1 \ -DHAVE_DYNAMIC_CAST= -DCONDOR_DIR=\"/local/raid1/condor/binaries/Linux-i686\" \ -DUSE_CHIRP=1 -DUSE_POLL=1 -I. -I. -IRMComm -IMW-File -IMW-CondorPVM \ -IMW-Socket -IMWControlTasks -g -O2 -Wall -c MW.C ...
Installing the resulting binaries is simples; note, however, you do not need to do this if you are just working with the examples shipped with MW. The above will build the examples for you and will leave them ready for use in the examples directory.
$ make install [ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make install) ; done make[1]: Entering directory `/tmp/mw-src/src' /bin/sh ../mkinstalldirs /local/raid1/condor/binaries/mw-0.3.0-Linux-i686/lib /usr/bin/install -c -m 644 libMW.a /local/raid1/condor/binaries/mw-0.3.0-Linux-i686/lib/libMW.a ...
$ cd /local/raid1/condor/binaries $ ln -sf mw-0.3.0-`uname`-`uname -p` mw-`uname`-`uname -p`
That should be all that you need to do to get MW installed. If there are any errors please refer to the MW home page, or speak directly with the Condor team.
MW Examples
MW ships with several examples. These examples can be found in the examples directory:
$ cd /tmp/mw/examples $ ls fib/ Makefile matmul/ newskel/ skel/ knapsack/ Makefile.in newmatmul/ n-queens/
- fib: calcluate fibonacci numbers
- knapsack: incomlete examples. Solve the knapsack problem with branch and bound.
- matmul: multiply two matrices
- n-queens: Find a chessboard with N queens on it so that no two queens attack each other.
- newmatmul: Ignore this bad example of matrix multiplication.
- newskel: An empty shell for an MW application
- skel: An empty shell for an MW application
Submitting MW Jobs
The submit file for the matmul example is submit_socket.
Look at submit_socket:
# Now we're in the scheduler universe Universe = Scheduler # The name of our exeutable Executable = mastermatmul_socket # Assume a max image size of 4 Megabytes. Image_Size = 4 Meg +MemoryRequirements = 4 # This goes into stdin for the master. Input = in_master.socket # Set the output of this job to go to out_master Output = out_master.socket # Set the stderr of this job to go to out_worker. It is named # out_worker because the output of the workers is directed to stderr Error = out_worker.socket # Keep a log in case of problems. Log = work.log # Send us an email if anything happens notify_user = invalid-user@cs.uleth.ca Queue
Notice two things about this submit file:
- Change the notify_user line to be correct for you.
- This is a Scheduler Universe job. It's a job that runs on the submit computer as soon as you submit it. You get all the benefits of Condor (reliability, logging, etc.) with a job that executes locally. We use it for DAGMan and MW: it is a job that submits other jobs and watches over them. In this case, it will be master, which will spawn the other workers (as jobs) and will send them their tasks.
Before you submit the job, ensure the executables have the correct permissions:
$ chmod 775 mastermatmul* workermatmul*
Now submit the job and watch it run:
$ condor_submit submit_socket Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 1. $ condor_q -- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 condor 2/10 19:34 0+00:00:23 R 0 4.2 mastermatmul_socke 1 jobs; 0 idle, 1 running, 0 held ... $ condor_q -- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 condor 2/10 19:34 0+00:00:23 R 0 4.2 mastermatmul_socke 2.0 condor 2/10 19:34 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 2.1 condor 2/10 19:34 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 2.2 condor 2/10 19:34 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 2.3 condor 2/10 19:34 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 2.4 condor 2/10 19:34 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 2.5 condor 2/10 19:34 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 7 jobs; 6 idle, 1 running, 0 held ... $ condor_q -- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held
We saw the master submit six workers. Several of them started to run, and they did all of the work. Look at out_master.socket to see the result of the run:
$ cat out_master.socket 19:34:08.435 MWDriver is pid 27140. 19:34:08.435 Starting from the beginning. 19:34:08.435 argc=1, argv[0]=condor_scheduniv_exec.1.0 19:34:08.435 Adding executable workermatmul_socket for ((OPSYS=="LINUX")&&(ARCH=="INTEL")) 19:34:08.435 MWRMComm::set_num_arch_classes to 1 19:34:08.435 MWRMComm::set_arch_class_attributes for arch class 0 to ((OPSYS=="LINUX")&&(ARCH=="INTEL")) 19:34:08.435 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL.exe 19:34:08.687 In MWFileRC::init_beginning_workers() 19:34:08.687 Good to go. 19:34:08.687 In hostaddlogic ...
If you look at the output carefully, you'll notice that only one worker did all of the tasks. That is because the time to do the tasks in this simple case was really short.
Creating new MW Applications
As we saw above, MW ships with several examples. One of these examples can be used to create skeleton MW applications. Look at the examples/newskel directory:
$ cd /tmp/mw/examples/newskel $ ls configure in_master.pvm README Task_MYAPP.h configure.in in_master.socket rep_MYAPP.pl WorkerMain_MYAPP.C Driver_MYAPP.C Makefile.in submit_pvm Worker_MYAPP.C Driver_MYAPP.h MasterMain_MYAPP.C submit_socket Worker_MYAPP.h in_master.indp new_app Task_MYAPP.C
We are most interested in the new_app script. The script takes on argument, the new application's name (test_app). The script will create a directory ../test_app, create template files in that directory, configure and build the simple example for you.
Unfortunately, there is a bug in the code generation script which ends up creating non-functional Makefiles if you use an installed version of MW vs. a version you have built yourself (i.e. any include line that refers to the src directory will silently fail). Apply the following patch to fix the problem:
--- Makefile.in- 2010-02-10 21:10:25.000000000 −0700 +++ Makefile.in 2010-02-11 19:19:20.000000000 −0700 @@ -66,12 +66,12 @@ MW_LIBDIR = $(MW_LIBDIR_DEBUG) # MW-File doesn't work with Insure, so will not compile *_file if DEBUG_BUILD PROGRAMS = master_MYAPP_socket worker_MYAPP_socket $(INDEPENDENT_PROGS) - INCLUDES = -I$(MWDIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-CondorPVM \ + INCLUDES = -I$(MW_DIR)/include -I$(MW_DIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-CondorPVM \ -I$(INDEPENDENT_INCLS) $(MEASURE_DEFN) else PROGRAMS = \ master_MYAPP_socket worker_MYAPP_socket $(INDEPENDENT_PROGS) - INCLUDES = -I$(MW_DIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-File \ + INCLUDES = -I$(MW_DIR)/include -I$(MW_DIR)/src -I$(MW_DIR)/src/MWControlTasks -I$(MW_DIR)/src/RMComm -I$(MW_DIR)/src/RMComm/MW-File \ -I$(MW_DIR)/src/RMComm/MW-CondorPVM -I$(INDEPENDENT_INCLS) $(MEASURE_DEFN) endif
To generate a set of files for your application, run:
$ ./new_app test_app (1) Creating directory ../test_app (2) Making new template files in ../test_app (3) Now you must run configure with appropriate options, then make
Since the new application is created at the same level as 'newskel', which is usually at mw/examples/app_name, and the default MW install path (../..) should, in general, work; however, since we are going to be running our application on the CS pool, we need to specify our local MW path in configure:
$ cd ../test_app
$ autoconf ; configure --with-MW=/local/raid1/condor/binaries/mw-`uname`-`uname -p` \
--with-condor=/local/raid1/condor/binaries/`uname`-`uname -p` \
--without-pvm --without-chirp
checking for g++... g++
checking for C++ compiler default output file name... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
...
Now, compiling the new application is easy:
$ make /usr/lib/ccache/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" \ -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" \ -DUSE_CHIRP=1 -I. \ -I/local/raid1/condor/binaries/mw-Linux-i686/src \ -I/local/raid1/condor/binaries/mw-Linux-i686/src/MWControlTasks \ -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm \ -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm/MW-File \ -I/local/raid1/condor/binaries/mw-Linux-i686/include \ -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm/MW-CondorPVM \ -I/local/raid1/condor/binaries/mw-Linux-i686/src/RMComm/MW-Independent \ -g -O2 -Wall -c Driver_test_app.C ...
We can use the default submit file for the skeleton application:
Universe = Scheduler Executable = master_test_app_socket Image_Size = 4 +MemoryRequirements = 4 Input = in_master.socket Output = out_master.socket Error = out_worker.socket Log = test_app.log Requirements = (Arch == "INTEL" && OPSYS=="LINUX") # Only start out with one machine; add more ourselves # The two numbers are the min and max number of hosts to get before startup. Machine_count = 1..1 Queue
Again, before you submit the job, ensure the executables have the correct permissions:
$ chmod 775 master_test_app_* worker_test_app_*
Now we can finally submit our new application to Condor:
$ condor_submit submit_socket Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 7. ... $ condor_q -- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 7.0 condor 2/10 22:04 0+00:00:07 R 0 0.0 master_test_app_so 1 jobs; 0 idle, 1 running, 0 held ... $ condor_q -- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 7.0 condor 2/10 22:04 0+00:00:07 R 0 0.0 master_test_app_so 8.0 condor 2/10 22:04 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 8.1 condor 2/10 22:04 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 8.2 condor 2/10 22:04 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 8.3 condor 2/10 22:04 0+00:00:00 I 0 0.0 mw_exec0.$$(Opsys) 5 jobs; 4 idle, 1 running, 0 held ... $ condor_q -- Submitter: tenor.cs.uleth.ca : <142.66.140.38:52276> : tenor.cs.uleth.ca ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held
We saw the master submit four workers. Each did it's own part of the work, which is to pick out the largest number from a bunch of integers given in the input file. Look at out_master.socket to see the result of the run:
$ cat out_master.socket 22:04:28.938 The master is starting. 22:04:28.938 MWDriver is pid 7480. 22:04:28.939 Starting from the beginning. 22:04:28.939 argc=1, argv[0]=condor_scheduniv_exec.7.0 22:04:28.939 Enter Driver_test_app::get_userinfo 22:04:28.939 arg 0: condor_scheduniv_exec.7.0 22:04:28.939 Set the arch class to 1. 22:04:28.939 worker_test_app_socket 22:04:28.939 tempnum_executables = 0 22:04:28.939 Making a link from worker_test_app_socket to mw_exec0.LINUX.INTEL.exe ...
From this skeleton a new MW application can be created to serve any function.