CCTools Hacker's Guide
Introduction
This "hacker's guide" is meant to orient new students and external developers
to the layout of the CCTools source code. It is not meant to be a complete listing of all
the quirks of every line of code -- for that you must go to the code.
However, it should provide a roadmap that gives some idea of where to
start, how the pieces fit together, and what new code should look like.
Below, a map of the software shows most of the modules, and contains
hyperlinks to a short blurb about each module and its relationship to
its peers. Use the map to understand the overall layout, and then
proceed to the code. If you add a new module or change the behavior
of an existing module significantly, please update the hacker's guide.
The package is written almost entirely in C. If you aren't an expert
in C, you should get the Kernighan and Ritchie book, and become an expert.
The use of C++ is strongly discouraged for new code.
Although C++ has many interesting features from a software
engineering perspective, it presents enormous challenges for interoperability.
Libraries built with one C++ compiler are generally not usable with code
built by any other C++ compiler. In addition, the sheer complexity of
the language nearly guarantees that valid code will not even compile on
multiple platforms. Note that parts of Parrot were written in
C++, but we are slowly converting this back to C.
Of course, this does not mean that we cannot use object-oriented concepts!
Anything that you could accomplish in C++, you can also accomplish in C,
albeit with slightly different syntax. Instead of declaring a C++ class,
create a C structure and functions that manipulate it. C can even achieve
greater encapsulation than C++ by using factories and hiding structure definitions
from the caller. More on this below.
Basic Structure
The software is divided into several fairly large packages.
Note that within each package, a make does what you might expect.
However, to build all packages, you must do a make from the top level.
If you make changes in two distinct packages, you must do a make clean; all
from the top level. The packages are:
chirp - A distributed storage and file system, consisting of
a server, a set of libraries to access the server, and simple tools that
employ the libraries. Chirp is most easily used by Parrot:
parrot - A tool for attaching new operating-system-like features
to existing applications from user-level. Parrot makes systems like HTTP,
FTP, and Chirp appear as if they were ordinary filesystems visible from
the root directory. Parrot currently only runs on Linux/x86.
dttools - A collection of data structures and other basic
software tools used by the rest of the packages.
ftp_lite - A lightweight FTP library that employs the Globus
toolkit for authentication, but is simple enough to be linked in and used
by other tools.
ftsh - The fault tolerant shell, a novel language for performing
administrative tasks in a distributed, fault-prone environment.
maptools - A toolkit for transforming internet addresses into
latitude and longitude. Useful for constructing maps of distributed systems.
Style and Conventions
For an example of general coding style, look at the hash_table module
in the dttools directory. This module would be a good example to follow
when developing new code. Here are some general precepts to follow:
Expose as few details as possible.
Carefully distinguish between public and private members of a module.
Do not simply throw all prototypes and structure definitions
into the header file -- this leads to disaster in large projects.
Instead, identify the smallest set of functions and interfaces that the
outside world needs to know about, and place those in the header file.
Notice, for example, that hash_table.h exposes only a few key functions,
while all structure definitions are kept private in hash_table.c.
Protect the namespace. Every module needs to exist in a namespace
so that multiple modules don't accidentally define the same name for either functions
or variables. Public items should be prefixed with the module name (hash_table)
while private items may have any reasonable name, provided that they are declared
static or otherwise hidden from view. For example, notice that
struct entry is used within hash_table.c but not visible outside.
In a similar manner, the actual functions used for hashing strings (jenkins_hash)
are declared static and are invisible to other modules.
Use factories to create private structures. If you need to pass around
a pointer to a structure, but you don't want to reveal its contents, then create
a factory function that returns a pointer to the structure. It is not necessary
to place the structure definition in the header file, because callers may pass
around a pointer without knowing the contents of the structure. For example,
hash_table_create creates a struct hash_table and returns
a pointer to the caller. It is not necessary for the caller to actually see
the definition of struct hash_table, because the pointer will only be
used to pass the structure back to other hash_table functions.
Keep it simple. Keep modules and functions short and sweet.
(There are some exceptions to this: for example, the switch statements
found in chirp_server and pfs_dispatch are enormous
simply because they have many cases!) The corollary is Divide and conquer:
If you need to implement something complex, attempt to divide it into multiple layers.
For example, the hash_cache module provides a hash table with automatic
data expiration. It is implemented by creating a hash_table and then storing
in it a structure that contains both the data items and the expiration times.
Another example of this is found in Chirp: the chirp_client module implements
the Chirp protocol, but has the possibility of errors. chirp_reli implements
reliability on top of chirp_client. Both of the modules are relatively straightforward,
because they each do one thing well. If we mixed it all into one module, it would be a mess!
CVS is for Versioning, C is for Code. The repository records who modified
what code, when they modified it, what was added, what was removed, and so forth.
Anytime we need to go back and look at how and why something was changed, CVS is available.
Conversely, the C code should not be used for this purpose: if we documented who wrote
the code and what the old code was, it would quickly become unreadable. The C code should
be lean and mean and simply reflect how things currently work. Get rid of any uncalled,
unused, or commented-out code.
Use All Your Pixels. How many people on planet Earth use a VT100 to edit and write code?
That's right: none! Don't go out of your way to wrap code to 80 columns, just make your
editor window bigger or rely on visual line wrapping.
By keeping related things on the same line,
it becomes much easier to use simple tools such as grep to analyze the code. For example, if all function arguments are on the same line, it is very easy to grep funcname */* and see all
calls to funcname with the necessary context. When calls are wrapped across several lines, each instance
must be investigated manually with the editor. Of course, there are reasonable exceptions:
it makes sense to break a 500 character expression into multiple lines of ANDed sub-expressions.
However, in general, don't go around wrapping every 82-column line.
Match Styles. There is no universal "best" coding style. Everyone has their
own favorite tab size and so forth. However, it *is* important to match styles within
the same module, otherwise the code quickly becomes unreadable. If the previous author
used a different tab size, then grit your teeth and use the same size. Use tools such
as indent or change your editor settings to help yourself remember.
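The first three precepts above (expose few details, protect the namespace, use factories) can be sketched together in one small module. This is only an illustration, not code from dttools: the int_stack module and every function name in it are hypothetical. In a real module, the first part would live in int_stack.h and the rest in int_stack.c.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- would live in int_stack.h: the public interface only ---- */
struct int_stack;   /* opaque: callers never see the fields */
struct int_stack *int_stack_create(void);
void int_stack_delete(struct int_stack *s);
int int_stack_push(struct int_stack *s, int value);
int int_stack_pop(struct int_stack *s, int *value);
int int_stack_size(struct int_stack *s);

/* ---- would live in int_stack.c: private details ---- */
struct int_stack {
    int *items;
    int count;
    int capacity;
};

/* Private helper: static, so the short name cannot clash with other modules. */
static int grow(struct int_stack *s)
{
    int newcap = s->capacity ? s->capacity * 2 : 8;
    int *items = realloc(s->items, newcap * sizeof(*items));
    if(!items) return 0;
    s->items = items;
    s->capacity = newcap;
    return 1;
}

/* Factory: the only way for a caller to obtain a struct int_stack. */
struct int_stack *int_stack_create(void)
{
    return calloc(1, sizeof(struct int_stack));
}

void int_stack_delete(struct int_stack *s)
{
    if(!s) return;
    free(s->items);
    free(s);
}

/* Returns 1 on success, 0 if memory could not be allocated. */
int int_stack_push(struct int_stack *s, int value)
{
    assert(s);
    if(s->count == s->capacity && !grow(s)) return 0;
    s->items[s->count++] = value;
    return 1;
}

/* Returns 1 and stores the top item in *value, or 0 if the stack is empty. */
int int_stack_pop(struct int_stack *s, int *value)
{
    assert(s && value);
    if(s->count == 0) return 0;
    *value = s->items[--s->count];
    return 1;
}

int int_stack_size(struct int_stack *s)
{
    assert(s);
    return s->count;
}
```

A caller holds only a struct int_stack pointer obtained from int_stack_create and passes it back to the int_stack_ functions. Any attempt to touch the fields from outside int_stack.c fails to compile, because the structure is incomplete there.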
Software Map
This map shows the most important modules in the package and their primary relationships.
Note that it does not list every last module; some have very limited use.
Nor does it list all relationships between modules; some, like debug, are used by every other module.
However, it does show the primary flow of control in the most common use cases of the software.
Click on a box for a short blurb about the purpose of the module.
Note on rebuilding this map: To edit and rebuild this map, use xfig
to edit map.fig, and place hyperlinks in the comments of boxes using the
"edit" tool. Then, do a make in the hackers directory.
Modules
pfs_main
- Contains the main loop for Parrot that traces children, traps
their system calls, and then invokes pfs_dispatch in order to
handle each system call. Invokes pfs_poll to see when sleeping
processes should wake up and pfs_process in order to create and
destroy child processes.
(back to map)
pfs_dispatch
- When called by pfs_main, decodes system calls attempted by child
processes. This can be quite complicated and may involve copying
data directly in and out of the child, or forcing the child to
perform certain system calls. Once the call is decoded, it invokes
pfs_sys in the portable layer to implement the system call.
(back to map)
pfs_poll
- Handles the sleep and wakeup facility necessary for dealing with
multiple processes. When a process needs to block on a file descriptor,
it tells pfs_poll to watch a particular fd and wake up a given pid
when it is ready. pfs_main calls pfs_poll periodically to check for
wakeups and perform them.
(back to map)
pfs_process
- Stores the basic structures for managing multiple processes.
Tracks the state of each process, and process-specific variables
such as the open file table, current state, parent process, and so forth.
(back to map)
pfs_sys
- The interface between the portable and non-portable parts of Parrot.
pfs_sys is a clean set of functions called to service file accesses.
The calls are not implemented here, but each call is logged on the
debug stream and retried in case of temporary failure. Calls pfs_table
to actually implement the functions.
(back to map)
pfs_table
- Implements the open file table for one process. Keeps track of all
open files, file location pointers, duplicated files, and so forth.
When a file is opened, pfs_table identifies what service (i.e. filesystem)
should handle the open, and directs the call there.
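As a rough sketch of the routing step, a service can be chosen from the first component of the path. The function name route_path and the list of prefixes are assumptions made for illustration (the prefixes are simply the driver names listed in this guide); the real pfs_table may decide differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical service names, taken from the drivers listed in this guide. */
static const char *known_services[] = { "http", "ftp", "chirp", "multi", NULL };

/*
Return the name of the service that should handle path.
"/chirp/host/file" is handled by "chirp"; any path whose first
component is not a known service falls through to "local".
*/
const char *route_path(const char *path)
{
    assert(path);
    if(path[0] != '/') return "local";
    const char *start = path + 1;
    size_t len = strcspn(start, "/");   /* length of the first component */
    for(int i = 0; known_services[i]; i++) {
        if(strlen(known_services[i]) == len && !strncmp(start, known_services[i], len)) {
            return known_services[i];
        }
    }
    return "local";
}
```

Comparing the whole first component, rather than just a prefix, keeps a local directory such as /chirpy from being mistaken for the chirp service.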
(back to map)
pfs_service
- An abstract interface that defines what services a filesystem must provide.
Instances of pfs_service represent all sorts of remote filesystems.
The default implementations of pfs_service do nothing but return "not implemented".
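One way to express this kind of abstract interface in C is a table of function pointers whose unfilled slots fall back to a default that fails with "not implemented". The sketch below is not the actual pfs_service definition: the structure, the two operations, and the toy driver are all hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical operation table; a real service would have many more calls. */
struct service_ops {
    int (*open)(const char *path);
    int (*unlink)(const char *path);
};

/* Defaults: every operation fails with "not implemented". */
static int default_open(const char *path)   { (void) path; errno = ENOSYS; return -1; }
static int default_unlink(const char *path) { (void) path; errno = ENOSYS; return -1; }

/* Fill in any operation the driver left NULL with its default. */
void service_ops_init(struct service_ops *ops)
{
    assert(ops);
    if(!ops->open) ops->open = default_open;
    if(!ops->unlink) ops->unlink = default_unlink;
}

/* A toy read-only driver: open pretends to succeed, unlink is inherited. */
static int readonly_open(const char *path) { (void) path; return 0; }

struct service_ops readonly_service = { .open = readonly_open };
```

After service_ops_init(&readonly_service), a call to readonly_service.open succeeds, while readonly_service.unlink returns -1 with errno set to ENOSYS, much as an HTTP-like driver would refuse operations it cannot support.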
(back to map)
pfs_service_local
- Implements access to local filesystems through Parrot.
This is fairly simple. A pfs_open() becomes an open(),
a pfs_pread() becomes a read(), and so forth. Note that when
identity boxing is enabled, pfs_service_local will perform
access control by looking up Chirp-like ACLs, therefore it
also depends on chirp_acl.
(back to map)
pfs_service_http
- Implements access to HTTP servers. Of course, HTTP is not a fully featured
filesystem, so this driver cannot support directory listings, metadata lookups,
and so forth. It only supports sequential reading of files, and emulates stat
operations by simply checking for file existence.
(back to map)
pfs_service_ftp
- Implements access to FTP servers by invoking functions in the ftp_lite library.
Note that FTP does not always correspond closely to Unix operations, so several
FTP operations are frequently needed to determine sufficient detail. For example, this
driver must invoke both a remote SIZE and CDIR in order to determine whether a
remote name is a directory or a file.
(back to map)
pfs_service_chirp
- This very thin driver simply maps operations from PFS UNIX methods into the
corresponding methods in chirp_global. Note that this driver allows the caller
to see all known Chirp servers in the top level /chirp directory.
(back to map)
pfs_service_multi
- This very thin driver simply maps operations from PFS UNIX methods into the
corresponding methods in chirp_multi. Note that this driver allows the caller
to see all known Chirp servers in the top level /chirp directory.
(back to map)
chirp_global
- Implements a global view of all Chirp servers as if in a single file system.
Uses catalog_query to build a list of all servers, and then uses chirp_reli
to access those servers reliably.
(back to map)
chirp_multi
- Creates a large filesystem spread across multiple Chirp servers.
One server is used to store the directory structure, which contains pointers
to files on other servers. Uses chirp_reli to reliably access each server.
(back to map)
chirp_reli
- Implements a reliability layer on top of chirp_client. A cache of connections
allows the caller to avoid opening and closing connections. When a connection is
lost, chirp_reli is responsible for re-creating the connection and opening
files as needed.
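The idea of a reliability layer can be sketched with a simulated connection. Nothing below is the real chirp_reli or chirp_client interface: struct conn, conn_op, conn_reconnect, and reli_op are hypothetical stand-ins, and the sketch leaves out the re-opening of files.

```c
#include <assert.h>
#include <errno.h>

/* A stand-in for an unreliable client connection (for illustration only). */
struct conn {
    int connected;      /* nonzero while the connection is usable */
    int failures_left;  /* simulated: how many more operations will drop the connection */
    int reconnects;     /* how many times the reliable layer reconnected */
};

/* Low layer: perform one operation, or drop the connection and fail.
   value is assumed to be non-negative so that -1 can signal failure. */
static int conn_op(struct conn *c, int value)
{
    if(!c->connected) { errno = ENOTCONN; return -1; }
    if(c->failures_left > 0) {
        c->failures_left--;
        c->connected = 0;
        errno = ECONNRESET;
        return -1;
    }
    return value * 2;
}

/* Low layer: re-create the connection (always succeeds in this simulation). */
static int conn_reconnect(struct conn *c)
{
    c->connected = 1;
    c->reconnects++;
    return 1;
}

/* High layer: retry the operation, reconnecting whenever the connection is lost. */
int reli_op(struct conn *c, int value, int max_attempts)
{
    assert(c && max_attempts > 0);
    for(int attempt = 0; attempt < max_attempts; attempt++) {
        if(!c->connected && !conn_reconnect(c)) continue;
        int result = conn_op(c, value);
        if(result >= 0) return result;
        if(errno != ECONNRESET && errno != ENOTCONN) return -1;   /* permanent error */
    }
    errno = ETIMEDOUT;
    return -1;
}
```

Each layer stays short because it does one thing: conn_op knows nothing about retries, and reli_op knows nothing about how an operation is carried out.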
(back to map)
chirp_client
- Implements an RPC interface to the Chirp protocol. Callers must open connections
manually, and then within those connections, make I/O requests to remote servers.
If the connection is lost, the caller must discard the connection object and create another.
Uses the link module to make TCP connections to chirp_servers.
Uses the auth module to authenticate those connections.
(back to map)
chirp_server
- Provides file service to remote users. The main loop forks a new process for each
incoming client, authenticates with the auth module, and then services Chirp requests.
As each request is decoded, permissions are checked using the chirp_acl module, and
then the action is carried out. An experimental facility allows programs to be
executed through the Chirp interface: this is handled by chirp_exec.
(back to map)
chirp_acl
- Implements access control lists on each directory. As chirp_server decodes a Chirp
command, chirp_acl determines whether the action is allowed, according to the ACL
files in each directory. Where there is no ACL, the server treats the caller as
the Unix user nobody.
(back to map)
chirp_tool
- A command-line interface that allows the user to interact with
a single Chirp server at a time, much like an FTP client.
This tool can set and get ACLs, which Parrot is unable to do.
Relies on chirp_client to make connections.
(back to map)
chirp_status
- A command-line tool that queries the catalog server via
http and displays a list of known Chirp servers.
(back to map)
link
- Link is an abstraction over TCP and sockets. It allows the caller to do the expected
things such as make and break connections, read and write data. However, it offers
three distinct benefits. 1 - All operations have clean and explicit timeouts.
2 - Small reads are buffered, to allow for efficient protocol construction.
3 - Connections created and lost are hooked into the debugging system.
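To make the first benefit concrete, here is a minimal sketch of a read with an explicit timeout, built on the POSIX poll() call. The function read_with_timeout is a hypothetical name and is not part of the link interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
Read up to len bytes from fd, waiting at most timeout_ms milliseconds
for data to arrive.  Returns the number of bytes read, 0 on end of file,
or -1 with errno set (ETIMEDOUT if the timeout expired).
*/
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    assert(buf);
    struct pollfd p = { .fd = fd, .events = POLLIN };
    for(;;) {
        int ready = poll(&p, 1, timeout_ms);
        if(ready > 0) return read(fd, buf, len);
        if(ready == 0) { errno = ETIMEDOUT; return -1; }
        if(errno != EINTR) return -1;
        /* Interrupted by a signal: wait again.  For brevity the full timeout
           is restarted; a careful version would subtract the time already spent. */
    }
}
```

Because the timeout is an argument to every call, a caller can never block forever on a silent peer by accident.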
(back to map)
datagram
- Datagram is an abstraction over UDP and sockets. It allows the caller to do
non-blocking send and receive of free form datagrams, with an optional timeout.
(back to map)
nvpair
- nvpair is an abstraction around data structures that are sets of name-value pairs.
It can be thought of as a subset of XML or ClassAds, without requiring the immense
libraries and portability problems posed by external software. nvpairs can be output
in a variety of formats compatible with other software. Used extensively by catalog_query
and catalog_server.
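A set of name-value pairs can be sketched in a few lines. This is not the nvpair implementation: the pairs_ names, the fixed capacity, and the one-pair-per-line text output are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PAIRS_MAX 16
#define PAIRS_LEN 64

/* A small set of name-value pairs (hypothetical layout, fixed capacity). */
struct pairs {
    char name[PAIRS_MAX][PAIRS_LEN];
    char value[PAIRS_MAX][PAIRS_LEN];
    int count;
};

/* Set name to value, replacing any earlier value.
   Returns 0 if the set is full or a string is too long to store. */
int pairs_set(struct pairs *p, const char *name, const char *value)
{
    assert(p && name && value);
    if(strlen(name) >= PAIRS_LEN || strlen(value) >= PAIRS_LEN) return 0;
    int i;
    for(i = 0; i < p->count; i++) {
        if(!strcmp(p->name[i], name)) break;
    }
    if(i == PAIRS_MAX) return 0;
    strcpy(p->name[i], name);
    strcpy(p->value[i], value);
    if(i == p->count) p->count++;
    return 1;
}

/* Return the value stored under name, or NULL if there is none. */
const char *pairs_get(const struct pairs *p, const char *name)
{
    assert(p && name);
    for(int i = 0; i < p->count; i++) {
        if(!strcmp(p->name[i], name)) return p->value[i];
    }
    return NULL;
}

/* Write the set as "name value" lines into buf, stopping before any
   line that would not fit.  Returns the number of characters written. */
int pairs_print(const struct pairs *p, char *buf, size_t size)
{
    assert(p && buf && size > 0);
    size_t used = 0;
    buf[0] = 0;
    for(int i = 0; i < p->count; i++) {
        int n = snprintf(buf + used, size - used, "%s %s\n", p->name[i], p->value[i]);
        if(n < 0 || (size_t) n >= size - used) {
            buf[used] = 0;   /* erase the partial line */
            break;
        }
        used += (size_t) n;
    }
    return (int) used;
}
```

Other output formats would be further print functions over the same structure, which is what keeps such a module independent of large external libraries.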
(back to map)
debug
- The debugging module is used by nearly every module in the system.
A single function, debug, allows various modules to record printf-like strings.
Each main program chooses what debugging flags it wishes to enable, allowing
for fine-grained information about each module to be enabled and disabled at will.
Messages may be sent to the console, to a rotating file, or by UDP to a debug server
on the network.
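A flag-filtered, printf-like logging function might look roughly like the sketch below. The dbg_ names and the DBG_ flags are hypothetical and are not the interface of the debug module; output goes only to a stdio stream here, with no file rotation or UDP.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical flag names: one bit per subsystem. */
#define DBG_NET   (1u << 0)
#define DBG_CACHE (1u << 1)
#define DBG_AUTH  (1u << 2)

static unsigned dbg_enabled = 0;   /* which subsystems currently log */
static FILE *dbg_stream = NULL;    /* NULL means stderr */

void dbg_flags_set(unsigned flags)   { dbg_enabled |= flags; }
void dbg_flags_clear(unsigned flags) { dbg_enabled &= ~flags; }
void dbg_stream_set(FILE *stream)    { dbg_stream = stream; }

/* Log a printf-style message if any of the given flags is enabled.
   Returns 1 if the message was written and 0 if it was filtered out. */
int dbg(unsigned flags, const char *fmt, ...)
{
    assert(fmt);
    if(!(flags & dbg_enabled)) return 0;
    FILE *out = dbg_stream ? dbg_stream : stderr;
    va_list args;
    va_start(args, fmt);
    vfprintf(out, fmt, args);
    va_end(args);
    fputc('\n', out);
    return 1;
}
```

A main program would call dbg_flags_set(DBG_NET) once at startup, after which a call such as dbg(DBG_NET, "connected to %s:%d", host, port) anywhere in the code is printed, while messages tagged only with other flags are dropped cheaply.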
(back to map)
catalog_query
- Implements queries on the catalog server, returning the results as a series
of nvpair objects that can be examined for further detail. Used by the chirp_global
module to build a list of all servers. Makes use of the http_query and nvpair
modules to implement the query.
(back to map)
http_query
- Implements simple HTTP/1.0 requests on HTTP servers.
Only a very few features are supported. Object redirections are handled silently
by generating a new query internally. Connections are not cached. Uses the link
module for network access. Used by pfs_service_http and catalog_query.
(back to map)
catalog_server
- Implements the global catalog of chirp_servers. Each server sends periodic updates
to the catalog via UDP datagrams. Queries to the catalog are made via HTTP queries.
Data may be returned in a variety of formats. Records are stored in a hash_cache
and discarded if a server is not heard from in 30 minutes.
(back to map)
hash_cache
- Implements a cache of arbitrary objects hashed by a string name and given an expiration time.
Objects that stay in past the expiration time are automatically deleted and ignored
during a lookup pass. Uses hash_table in order to implement the hash_cache.
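The layering can be sketched as an upper layer that stores small (value, expiration) wrappers in a lower-level table. Everything here is hypothetical: the tiny linear table stands in for a real hash table, the cache_ names are invented, and the current time is passed in as an argument so that the behavior is easy to follow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* ---- lower layer: a tiny string-keyed table (stand-in for a real hash table) ---- */
#define SLOTS 32
struct table { char *key[SLOTS]; void *value[SLOTS]; };

static int table_find(struct table *t, const char *key)
{
    for(int i = 0; i < SLOTS; i++) {
        if(t->key[i] && !strcmp(t->key[i], key)) return i;
    }
    return -1;
}

static int table_insert(struct table *t, const char *key, void *value)
{
    if(table_find(t, key) >= 0) return 0;
    for(int i = 0; i < SLOTS; i++) {
        if(!t->key[i]) {
            t->key[i] = malloc(strlen(key) + 1);
            if(!t->key[i]) return 0;
            strcpy(t->key[i], key);
            t->value[i] = value;
            return 1;
        }
    }
    return 0;
}

static void *table_remove(struct table *t, const char *key)
{
    int i = table_find(t, key);
    if(i < 0) return NULL;
    void *value = t->value[i];
    free(t->key[i]);
    t->key[i] = NULL;
    return value;
}

/* ---- upper layer: the cache stores (value, expiration) wrappers in the table ---- */
struct cache_entry { void *value; time_t expires; };

/* Store value under key for lifetime seconds.  Returns 0 on failure. */
int cache_insert(struct table *t, const char *key, void *value, time_t now, int lifetime)
{
    struct cache_entry *e = malloc(sizeof(*e));
    if(!e) return 0;
    e->value = value;
    e->expires = now + lifetime;
    if(table_insert(t, key, e)) return 1;
    free(e);
    return 0;
}

/* Return the value for key, or NULL if it is absent or has expired.
   An expired wrapper is discarded; the cached object is left to the caller. */
void *cache_lookup(struct table *t, const char *key, time_t now)
{
    int i = table_find(t, key);
    if(i < 0) return NULL;
    struct cache_entry *e = t->value[i];
    if(e->expires > now) return e->value;
    free(table_remove(t, key));
    return NULL;
}
```

The cache functions never look inside the table, so the lower layer could be swapped for a real hash table without touching the expiration logic.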
(back to map)
hash_table
- Implements a hash table of arbitrary objects indexed by string name.
Objects may be anything that can be cast to a void pointer.
The caller is responsible for allocating and de-allocating the objects.
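For readers who have not built one before, here is a minimal chained hash table with string keys and void pointer values. It is a sketch, not the dttools code: the mini_table_ names and the fixed bucket count are hypothetical, and a simple multiplicative string hash is swapped in for brevity in place of the Jenkins hash mentioned earlier.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 64

struct entry { char *key; void *value; struct entry *next; };
struct mini_table { struct entry *buckets[BUCKETS]; int size; };

/* Simple multiplicative string hash (djb2 style), used here only for brevity. */
static unsigned hash_string(const char *s)
{
    unsigned h = 5381;
    while(*s) h = h * 33 + (unsigned char) *s++;
    return h;
}

struct mini_table *mini_table_create(void)
{
    return calloc(1, sizeof(struct mini_table));
}

/* Insert key -> value.  Returns 0 if the key is already present or memory runs out. */
int mini_table_insert(struct mini_table *t, const char *key, void *value)
{
    assert(t && key);
    unsigned b = hash_string(key) % BUCKETS;
    for(struct entry *e = t->buckets[b]; e; e = e->next) {
        if(!strcmp(e->key, key)) return 0;
    }
    struct entry *e = malloc(sizeof(*e));
    if(!e) return 0;
    e->key = malloc(strlen(key) + 1);
    if(!e->key) { free(e); return 0; }
    strcpy(e->key, key);
    e->value = value;
    e->next = t->buckets[b];
    t->buckets[b] = e;
    t->size++;
    return 1;
}

void *mini_table_lookup(struct mini_table *t, const char *key)
{
    assert(t && key);
    for(struct entry *e = t->buckets[hash_string(key) % BUCKETS]; e; e = e->next) {
        if(!strcmp(e->key, key)) return e->value;
    }
    return NULL;
}

/* Remove key and return its value; the table never frees the value itself. */
void *mini_table_remove(struct mini_table *t, const char *key)
{
    assert(t && key);
    struct entry **p = &t->buckets[hash_string(key) % BUCKETS];
    for(; *p; p = &(*p)->next) {
        if(!strcmp((*p)->key, key)) {
            struct entry *e = *p;
            void *value = e->value;
            *p = e->next;
            free(e->key);
            free(e);
            t->size--;
            return value;
        }
    }
    return NULL;
}

/* Free the table and its copies of the keys, but not the stored objects. */
void mini_table_delete(struct mini_table *t)
{
    if(!t) return;
    for(int b = 0; b < BUCKETS; b++) {
        while(t->buckets[b]) {
            struct entry *e = t->buckets[b];
            t->buckets[b] = e->next;
            free(e->key);
            free(e);
        }
    }
    free(t);
}
```

Note how ownership is split: the table copies and frees the keys, while the objects behind the void pointers remain entirely the caller's responsibility, as described above.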
(back to map)
auth
- Implements an authentication negotiation mechanism. Both clients and servers must
call auth_register at startup to indicate which authentication mechanism they wish to use,
as well as the preferred order.
(One may call auth_register_all to register all known types in a fixed order.)
Upon establishing a connection, clients call auth_assert to assert their identity,
while servers call auth_accept. An acceptable mechanism is negotiated, and then the
underlying module actually attempts to authenticate.
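The negotiation step can be pictured as picking the first mechanism in the client's preference order that the server has also registered. This is a simplified sketch under that assumption; the function negotiate is hypothetical and the real exchange between auth_assert and auth_accept may proceed differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
Pick the first mechanism in the client's preference order that the
server has also registered.  Both lists are NULL-terminated arrays of
names.  Returns the agreed name, or NULL if there is nothing in common.
*/
const char *negotiate(const char *const *client, const char *const *server)
{
    assert(client && server);
    for(size_t i = 0; client[i]; i++) {
        for(size_t j = 0; server[j]; j++) {
            if(!strcmp(client[i], server[j])) return client[i];
        }
    }
    return NULL;
}
```

For example, a client that prefers kerberos, then unix, then hostname, talking to a server that registered only hostname and unix, would settle on unix.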
(back to map)
auth_hostname
- The simplest authentication module simply identifies the calling client by performing
a reverse DNS lookup on its IP address. If successful, the client is then known by
a name such as hostname:fred.cse.nd.edu
(back to map)
auth_unix
- This module can identify local users. The client is challenged to touch a file in a local file system.
If the client is able to do so, the server infers the client's identity from the owner of the created file.
The server then knows the client to be that Unix user. Only makes sense when authenticating to a server
on the same system. If authentication succeeds, the client is known by a name such as unix:fred
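The challenge can be sketched with two small functions, one for each side. The names challenge_answer and challenge_verify are hypothetical, and the sketch assumes the server has already chosen a fresh, unpredictable path and sent it to the client.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Client side: answer the challenge by creating the named file.
   O_EXCL makes the call fail if the file already exists. */
int challenge_answer(const char *path)
{
    assert(path);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if(fd < 0) return 0;
    close(fd);
    return 1;
}

/* Server side: see who owns the file, then remove it.
   Returns 1 and stores the owner in *uid, or 0 if the challenge was not met. */
int challenge_verify(const char *path, uid_t *uid)
{
    assert(path && uid);
    struct stat info;
    int ok = (lstat(path, &info) == 0 && S_ISREG(info.st_mode));
    if(ok) *uid = info.st_uid;
    unlink(path);
    return ok;
}
```

The server never trusts anything the client says about itself: the identity comes from the file's owner as reported by the local filesystem, which is why the scheme only makes sense when both sides share that filesystem.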
(back to map)
auth_kerberos
- Implements authentication using the Kerberos system. Note that the server must be running as root
in order to access the local Kerberos credentials. If authentication succeeds, the client is known
by a name such as kerberos:fred@nd.edu
(back to map)
auth_globus
- Implements authentication using the Globus Security Infrastructure. The client must have a Globus
certificate, and have generated a proxy with grid-proxy-init. The server must also have Globus credentials.
as well. If authentication succeeds, the client is known by a name such as /O=NotreDame/CN=Fred
(back to map)