
Some useful commands

Documentation -- User settings -- Shutting up programs -- Text files -- Searching for files -- Packages of files -- Files from other systems -- Comparing and patching files -- A calculator -- Image processing

Documentation

The most important, and most difficult, thing in the lack-of-information age is to obtain information. (I know, they call it the information age, but since every device designed to obtain and store information requires tons of handbooks which describe how to use it I'm not sure whether the ratio of available to total information isn't actually decreasing. An obvious antidote is to produce devices that do what they want and treat the user as a well-trained extension, an approach practised by the dominating software company of our days.)

The central help feature of UNIX systems is the manual pages (man pages for short). Enter "man man" on the command line of a terminal or XTerm window, and you will get a manual of the manual system. Manual pages exist for programs, C library functions and other things, and you can add your own. One difficulty for such an all-encompassing help facility is that there may be different kinds of things with the same name, for instance a C function and a program. The solution adopted for man is to divide man pages into different sections. The sections correspond to the following types of pages: (This really ought to be in the man page of man but in some distributions it still isn't there.)

1 Executable programs or shell commands
2 System calls (functions provided by the kernel)
3 Library calls (functions within system libraries)
4 Special files (usually found in /dev)
5 File formats and conventions eg /etc/passwd
6 Games
7 Macro packages and conventions eg man(7), groff(7).
8 System administration commands (usually only for root)
9 Kernel routines [Non standard]

So if you want the documentation of a program but get the description of an obscure C routine (or vice versa), specify the section by typing

man 1 whatever
When a cross-reference is given on a man page, the section will be given in parentheses after the name, eg regex(7). All the same, the section number must be passed to the man program in the syntax above, eg "man 7 regex".

All man pages contain a short description of their topic. You can search these descriptions for keywords with man -k keyword or apropos keyword. However, these descriptions are typically very short so you are likely to get a lot of garbage and/or miss what you are looking for. If you know a program which does something similar (or the reverse), start from its manual page and look at the cross-references, and cross-references on the cross-referenced pages.

There are other documentation systems than man: The commands provided by the shell (bash) are described by the help command (but more extensively in bash's man page).

Lastly, various innovators have invented help systems that are supposed to be better, brighter, easier to use,... than the man pages, but I have fared quite well so far while hardly ever using them. It is helpful to know that documentation for some software packages is located in the directory /usr/share/doc/ (a fact that is documented nowhere and appears to be a closely guarded secret) and that one should never use the info program for viewing info pages but start emacs and type Ctrl-h i m and the name of the program one wants documentation about. Alternatively, you can redefine the info command as a function which starts emacs and makes it display an info document:

function info() { emacs --eval '(info "'$1'")'; }

Basic user settings

When you have opened an XTerm (do I use XTerms? I live in them!), you will be dumped on the command line of your shell. This will usually be bash since it is the default in most Linux distributions, but there is nothing God-given about that. (In fact nothing much is God-given in UNIX systems, unlike the one from the software company whose name flatly contradicts its size and the colouration of whose products flatly contradicts all good taste.) You can change it with chsh. If you are nostalgic for C-64 times, enter your favourite BASIC interpreter. (Honestly, you might have to put it into /etc/shells first.)

Another personal setting is the mask of access permissions for your newly created files, called "umask" and changed with the program of the same name. Since it is stored nowhere, you have to put the umask command in your .bashrc (if you use bash) so it is set again in your next session. If you want to restrict access to your files only occasionally, it is better to use chmod.
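
For instance (one possible setting, not a recommendation), the following line in your .bashrc makes your newly created files accessible to nobody but yourself:

umask 077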

Other settings it can sometimes be useful to know of are the ones changed with ulimit. You can limit the size of core files (files created as post-mortem information about crashed programs). I once knew someone who had repeated problems with large core files on a rather slow workstation, but I didn't know about ulimit myself then (sorry, Cristian, bad timing). umask and ulimit are commands provided by the bash shell, so I cannot guarantee your favourite BASIC interpreter has them ;).
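
For instance, the following command forbids the creation of core files for the current session:

ulimit -c 0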

Oh, yes, and of course there's passwd. But everyone knows about that.

How to keep programs from bothering you

While it is nice to have a say in what programs do, having to confirm every one of your commands can be annoying, especially when you have to do the same thing many times in a row. Fortunately UNIX systems offer an antidote (unlike - well, you guessed). There is a program called yes which will answer all questions in the affirmative. Using pipes, you can feed its output (consisting of an arbitrary number of lines containing just the letter "y") into any program asking stupid questions, like this:

yes | askstupidquestions
Now your program might require an input different from just "y", for instance the complete word "yes" or some other letter. yes is very cooperative in this respect. It will output any string you pass to it in the command line. I use this feature whenever I cannot remember how to make the latex program ignore errors (see also here). Since latex continues when the user presses return, I make yes output empty lines:
yes "" | latex test.tex

Fine so far. But what if the program concerned requires a different input each time, or if you want to affirm certain questions and say no to others? yes won't help you here, but all is not lost. If you want to do the same task a hundred times, it still pays to automate it. You do this by writing the required input into a file and using cat to feed it into the program you want to (ab)use. cat normally just outputs a text file on a terminal. With the help of a pipe, its output is used as input for another program:

cat input.txt | dosomething
So all that remains to be done is to remember the few hundred questions the program will ask you, and the right responses... There is an easier way than that. A program called tee echoes its input to its output and in addition writes it to a file. So you can execute your task once interactively, typing
tee input.txt | dosomething
and then use cat as described above. tee will write your input into the file input.txt as well as pass it on to the program dosomething.

There is one final piece of knowledge you ought to have to stop being bothered by your own machine: how to discard output. This is slightly complicated by the fact that a program has two output channels available: one for errors and one for ordinary output. UNIX shells (honestly, I can vouch only for bash) allow you to redirect them separately or together. If you want to discard a program's output, you have to redirect both of them into the UNIX equivalent of a black hole, the null device. I have occasionally done this with latex:

yes "" | latex test.tex &> /dev/null

Processing text files

Lots of things under UNIX are done with the help of, by or to plain text files. There is a largish number of nifty programs for processing them. First, you may want to read a text file. cat will print it out on your terminal without stopping, while more and less will stop at the end of every page. In some Linux distributions, less is configured to delete all its output after finishing so that the user has the same text on the terminal as before; if you want to read a file while typing the next instruction, use cat for small files.

less has a wealth of functions: you can search for regular expressions in the forward direction by typing "/" and in the backward direction with "?". You go to the next match with "n" and back to the previous with "N". By typing a number and "g", you go to a specific line in the file. You can display several files and jump to and fro between them with ":n" (next) and ":p" (previous). "h" displays a brief help.

A useful utility when writing abstracts of articles is wc (no shit!). It counts the words, characters and lines of a piece of text. If your abstract is part of a larger text file, select it, start wc without arguments, paste the abstract into the terminal and type Ctrl-d. wc will then print the size of the pasted piece of text. wc is one of a group of small and useful text processing utilities developed by the GNU project, some more of which will be described in the following. If you are looking for a program that performs a certain operation on text files, type "info wc" and then "u" to go up to the menu of all these utilities. Very probably there is one which does what you want.

Now for formatting. What if the lines in a text file are too long to fit into the terminal? With modern versions of less, you usually have to press the right and left arrow keys to see the right or left half of it, and the alternative - forcibly broken lines - looks ugly. The name of the program helping you is derived from the word format: it is fmt. It breaks the lines in the document anew, taking into account the width of your terminal. Empty lines are taken as delimiters of paragraphs, indentation is retained.
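
For instance, the following reflows a (hypothetical) file into lines of at most 72 characters:

fmt -w 72 longlines.txt | less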

expand is the tool for expanding tabulator characters. Use it after obtaining a text file with unusual tabulator positions. Both fmt and expand are useful for bringing text files into a form easy to read in your standard-width terminal. Of course, you could load the file into an editor and convert it with the appropriate editor command, but the strength of these simple stand-alone programs is that they can be automated. Shell scripts can call fmt and expand, but nothing can click on menu items for you.
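
For instance, the following replaces tabs set at every fourth column by the corresponding number of spaces:

expand -t 4 file.txt > file-spaces.txt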

You can also build tables. paste will paste two files containing columns of a table together. column works similarly but adapts the number of columns so that the table fits on your terminal. Since entries from different files may end up in the same column, this is unsuitable for formatting, but probably useful for other things. pr is a program which separates text into pages, with an optional header. It can also print the text in several columns or display several files next to each other.
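
For instance, assuming two files names.txt and points.txt each containing one column, the first of the following commands joins them into a two-column table, and the second displays two files side by side:

paste names.txt points.txt
pr -m file1.txt file2.txt | less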

Taking a table apart is not so easy. You can use colrm to remove a range of one-character-wide columns. If you want to extract a certain column of a table, however, this is not very helpful. To my knowledge there is no simple tool for that. But there is a very powerful one that can easily be made to think down to that level. It is called awk. Without knowing much about this programming language for text processing (for that's what it is), you can still do many useful things. You get the second column of a file formatted as a table like this:

awk '{print $2}' table-file
The inverted commas should always be there, no matter what you do with awk. They keep the shell from interpreting the braces and dollar signs which have special (and quite similar) meaning in both the shell and awk language. If you put another number after the dollar sign, a different column will be printed. You can extract several columns in this way. If you want to get a nice formatted table again afterwards, you have to insert tab characters:
awk '{print $2 "\t"  $1}' table-file
will exchange the first two columns of the table (and delete the others). In order to obtain a file containing the result, you have to redirect awk's output to a file different from the source file.

awk can do more sophisticated things. It does not have to act on all rows of the table, and it can handle "tables" with entries separated by something other than tabs. I used this once when I had to find the name of a user in a shell script which only knew the numerical user id. The program id unfortunately supports only the reverse direction. But all users are listed in the file /etc/passwd, which contains their user name, numerical id, real-life name and other stuff (but not the password, nowadays :)). The following two commands print the user id and real name corresponding to the numerical user id stored in the shell variable NUM_UID:

awk -F : '$3=="'$NUM_UID'" {print $1}' /etc/passwd
awk -F : '$3=="'$NUM_UID'" {print $5}' /etc/passwd
The "-F" option tells awk that the field separator, ie the character(s) separating table columns, is the colon. The next expression before "/etc/passwd" is one argument which tells it what to do. The inverted commas are interrupted around "$NUM_UID" to allow the shell to interpret this expression and put in the value of the variable. They can be removed if you give the numerical id explicitly. The rest is straightforward: awk looks only at the lines in which the third item is the right numerical user id. (There's only one such line since user ids are unique.) It then prints the first or fifth entry of that line, which contain the user name and personal name of the user, respectively.

No summary of UNIX tools for text processing can be complete without sed ("stream editor"). I use it frequently; I'm going to show you how to use it to help add up the number of points students accumulated in exercises over the term. I stored the records in a plain text file which contained in each line the name of the student followed by a colon, then some spaces to set the numbers apart, and then the points for each exercise separated by commas. A dash indicated that the respective student had not even tried to solve a problem - equivalent to zero points but a different situation. I had already automated some similar text-processing tasks, so I chose this format without worrying overmuch about how I'd later do the sums - when you have this sort of information in a text file under UNIX, a way to convert it to valid input for a calculator can usually be found. Here is the simple version:

sed -e 's/^.*: *//' -e 'y/,-/+0/' point-file | bc -l
bc is a calculator running in a terminal receiving input from the keyboard or from whatever is piped into it. (The -l option sets a mode.) Now let's look at the sed command. It contains two -e options, each followed by an instruction for sed. The first is a search-and-replace command. The expression between the first two slashes is replaced by the one between the second and the third. The expression to be searched for in this case is the beginning of the line (^) followed by an arbitrary number of arbitrary characters (.*), followed by a colon and an arbitrary number of spaces. It is replaced by nothing. This removes the students' names which the calculator would not know what to do with.

The second command ("y") also has two arguments enclosed in slashes. It replaces each occurrence of the first character of the first argument by the first of the second and so on. (This is just what is called a substitution cipher in cryptography.) The commas between the points are replaced by plus signs and the dashes by zeros. The result of the two operations is a string of numbers separated by plus signs - a long sum which the calculator can process.

Much more remains to be said about sed. The slashes in the s and y commands can be replaced by any other character. This makes it possible to have slashes in the expressions. If you want to replace every occurrence with the s command, you have to put a "g" after the last slash (or other delimiter). If you want to use sed a lot, you should not only read its man page, but also its info documentation with the command "info sed". If you intend to use the more complex commands for inserting text, it is helpful to know that the error message "Extra characters after command" actually means that you forgot the backslash after the command character. The manual pages of ed(1) and regex(7) may also help. The latter briefly explains the format of regular expressions suitable for searching. For the curious (or for the hard core?), here is the advanced version of the above command in the form of a shell script which prints out the name, total points and percentage of maximum points for each student. It is a bit messy ;), it uses three fifos and paste.

Searching files and directories

The most important command for searching a directory tree for files with specific properties is find. The one thing you have to remember about it is that its first argument is the directory to search, and only after it come the options detailing what to search for. find will print out all files which match the criteria including their path. If it outputs nothing, that means no files were found. It can search according to a multitude of criteria which can be found in its man page. The most important of them is no doubt -name, or better -iname. Both search for files with a specific name; the latter ignores the case of the letters. You may give wildcards in the name following this option but you have to escape them by putting a backslash in front of them, or the shell will try to expand them. For example

find . -iname dsigdt\*
will search the current directory (".") and its subdirectories for files that start with "dsigdt", "Dsigdt", "DSIGDT" or similar.

Another useful search criterion is -mtime. It searches for files which were modified a certain number of days ago. By preceding its numerical parameter with "+" or "-" you make the given time a lower or upper limit.

find . -mtime -2
will list all files that were modified within the last 48 hours.

The option -mmin is the same except it counts in minutes not days. This is useful when a program creates a lot of unasked-for files in the current directory. They can be hard to tell apart from the ones that were already there before, but find . -mmin -5 tells you which are new. One fairly frequent situation where this comes in useful is after unpacking an archive with tar which did not contain a top-level directory comprising all the files (as it should). One can create a directory and move all files that were just unpacked into it with the following two commands:

mkdir unpackdir
mv `find . -type f -cmin -5` unpackdir
The option -cmin works on the time the file's status was last changed, rather than the file modification time. This is necessary here because file modification times are preserved by the process of packaging and unpacking and will therefore be old.

Similar to these options is -newer which searches for files which are newer than a given reference file. It can be used to find out which (how many) files have been changed since the last backup (see below).

What if you want more information about the files found than their name and path? find offers a quite general interface to other programs. The option -exec is followed by a command which will be executed for each of the files found. For instance, if you want a detailed listing of the files found, you could type

find . -iname dsigdt\* -exec ls -l '{}' \;
The string "{}" is a placeholder for the file name; it has to be escaped with inverted commas to prevent the shell from interpreting it. The command line has to be concluded with a semicolon, which again has to be escaped. The semicolon is necessary whenever you use -exec. Everything between "-exec" and ";" is part of the command line to be executed. Note that shell aliases are not available for these commands, so you could not use "ll" in this example. Likewise, built-in shell commands are unavailable. Here's one last example for find. It creates an archive of all the files which were modified since the last backup stored in last_backup.tgz.
find . -newer last_backup.tgz -a -type f -exec tar -rf incr.tar '{}' \;
Here the command to be executed is the archiving program tar. Its -r option makes it add a file to an archive. The find option -type specifies the file type as ordinary files (not directories).

You can use find without any options to get an overview of the files in a directory tree (eg "find ."). On the other hand, if you want to apply a specific operation to every file in a directory tree, it is often unnecessary to use find, since many programs can process directories recursively. Usually the option for recursion is -R; sometimes -r can also be used (rm), possibly with a slight difference in meaning (cp).

Applying find to large directory trees takes a long time. For searching through your whole hard disk, on most Linux systems there is an alternative: locate. It works not by searching directories when the user wants to look for something, but by regularly building up a database of all files on your system. It is particularly useful for searching for configuration files if you have an idea what their name might be. You can use wildcard characters as with the -name option of find. Unlike find, locate implies a "*" before and after the name if no wildcard is given explicitly.
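
For example,

locate fstab

lists every file on the system whose path contains "fstab".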

Now what if you can't remember a file's name but know that it is a text file which contains a specific phrase? The answer is the program grep. Its name is often interpreted as general regular expression parser, but in fact it is derived from a command in the ur-editor ed, "g/re/p". It means: general (ie for every line in the file), search for a regular expression, and print the line if you found one. That's what grep does: print all lines containing a certain phrase. The option -i makes it ignore case. Phrases containing spaces should be enclosed in quotes, and ones that contain characters interpreted by the shell in inverted commas. For instance,

grep -i key_alt *.cc
would be a way to search for all C++ files in the current directory which use the Alt key in some way. By the way, grep is also very useful for filtering the output of other programs. You could type
ps -ax | grep -i math
to list all processes whose command has "math" in its name.

But let's get back to searching files. grep can be made to print just the name of the file where a match was found (-l) or to search recursively all files in a directory tree (-r). The latter takes considerably longer than a find, so it should be used only as a last resort. If you know the type of file where the phrase might occur, you can use grep together with find:

find . -name \*.cc -exec grep -i -H key_alt '{}' \;
will list the name of all C++ files in the current directory tree which contain the phrase "key_alt" in upper or lower case letters and output the matching line for each file. (The -H option makes grep output the filename if a match was found; otherwise it would not do that here since it does not know several files are being searched - its command line contains just one at a time.)

The programs presented here can be used in still more sophisticated ways: find can combine criteria with logical operators, and grep can search for a regular expression. You have to read their respective manual pages to get a list of all their options.

Unpacking packages and packaging

THE packaging tool under UNIX is tar. Unusually, its main options (ie those which tell tar what to do) traditionally have no minus sign in front, eg "tar c..." (but modern Linux systems are tolerant in that respect). The most important ones for casual use are c (create archive), t (list archive contents) and x (extract from an archive). tar always acts recursively on directories, ie if you tell it to archive a directory, it will include all files in all subdirectories.

tar stands for "tape archive", and by default tar writes to or reads from the tape drive device, /dev/rmt0. Since that is not usually what one wants (who has got a tape drive these days?), one should always give the option -f <file>, which redirects tar's output to a file. (In fact, on Linux systems without a tape drive, the corresponding device does not exist, with the consequence that tar outputs to the standard output channel. This makes it possible to use pipes or redirection.)

So what to do if you want to archive a directory for backup? The following four commands are equivalent:

tar c -f mydir.tar mydir
tar cf mydir.tar mydir
tar -c -f mydir.tar mydir
tar -cf mydir.tar mydir
One can also pack several directories and files into the same archive (but subdirectories do not have to be given separately, they are always included):
tar cf my.tar dir1 file1 dir2/subdir1/file2 dir2/subdir2 ...
If the archive file is in one of the directories being archived (as can happen when one archives one's complete home directory), tar skips it automatically.

A listing of the contents of the archive can be obtained in the following way:

tar tf my.tar
tar tf my.tar | less
Since the list of archived files can be very long, it is preferable to paginate the output with less as shown in the second line. For getting more information than just the names of archived files and directories, there is the option -v ("verbose"):
tar t -v -f my.tar | less
tar tvf my.tar | less
tar -t -v -f my.tar | less
tar -tvf my.tar | less
All these commands list the archived files in a similar way to "ls -l", with size, modification date, owner, group and access permissions.

To unpack a complete tar archive, just type

tar xf my.tar
(From now on, only the shortest form of the options will be given. x -f, -xf and -x -f work just as well.) Directories will be created as needed. Existing files will be overwritten unless the option -k (keep existing files) is given. To extract just one or some files or directories from the archive, their name(s) have to be given as the final arguments, eg
tar xf my.tar file1 dir2/subdir2
The file names have to include the full path given in the listing.

When making backups, it usually makes sense to compress the data. It doesn't take much time to compress an archive, and anyway one does not access backups very often. The most widely used compression program on UNIX systems is gzip. It compresses the files given as its arguments, writes the result to files which have ".gz" appended to their name and deletes the original files. For example:

gzip longtext.ps
After executing this command there will be the file longtext.ps.gz instead of the original one, and it will hopefully be smaller. A "gzipped" file is decompressed with gunzip:
gunzip longtext.ps.gz
This is the converse of compressing: The decompressed file is written to longtext.ps, and the compressed file is removed. The compressed file has to have the suffix .gz, or gunzip will refuse to work on it. (That said, some file types have abbreviated suffixes when they are gzipped, such as .tgz for .tar.gz. gunzip accepts those, too, and substitutes the original suffix, .tar.) A further program belonging to gzip and gunzip is zcat. It is the cat of gzipped files; it writes the decompressed file(s) to standard output.
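
For instance, a (hypothetical) compressed text file can be read without creating a decompressed copy on disk:

zcat README.gz | less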

Two other compression programs are usually available on up-to-date Linux systems: bzip2 and compress. bzip2 is rather new and provides better compression than gzip. It is used analogously to gzip and creates files with the suffix .bz2. The analogue to zcat is bzcat. compress (and uncompress) is so old it is compatible with programs used under the longest virus in the world ;).

To create a compressed archive, one could first make an archive with tar and then compress it with gzip. However, the option -z of tar does both in one go: It creates an archive compressed with gzip. The same option can be given when listing or unpacking an archive.

tar czf my.tgz dir1 dir2/subdir1 dir3/file1 ...
tar tzf my.tgz
tar xzf my.tgz
The option -Z does the same with the compress program, and -j (de)compresses with bzip2. The latter exists only on fairly recent versions of tar, so read its manual page. There is a generalisation of this: The long option --use-compress-program <program> makes tar put the archive through the given program. The program must interpret the option -d as an instruction to decompress rather than compress. This allows you to use bzip2 with older versions of tar or to use your own home-made compression or encryption program.
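
For instance, the following should create a bzip2-compressed archive even with a tar too old to know -j (assuming bzip2 is installed):

tar --use-compress-program bzip2 -cf my.tar.bz2 dir1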

One other packaging program one comes into contact with occasionally under Linux is rpm. In fact it is the package manager of several Linux distributions, that is, it is the program that ensures all pieces of software a certain program needs are installed before it is installed itself. Usually it is called by easy-to-use front-ends like yast under SuSE. But in order to install additional packages which do not come with your distribution or to get information about installed programs, it helps to know how to use it. A package contained in a package file is installed as follows:

rpm -i package-file.rpm
If rpm complains that other packages should be installed first because they are needed by your package, you might consider using the option --nodeps to ignore this and find out for yourself whether they are really needed. Another useful option is --noscripts which inhibits the execution of install scripts before or after installation of the package. These scripts frequently mess things up if the package you install does not come from exactly the same version of your distribution as other installed packages.

An installed package is removed with the option -e (erase):

rpm -e package
Note that the package name is not the name of the package file. It does not contain the suffix .rpm and the name of the architecture (processor) for which it was compiled. If you gave the name correctly and rpm complains that the package is not installed even though you installed it five minutes ago, try rpm -i --justdb .... This should update the rpm database (--justdb = just database) to show that the package is installed. The converse - rpm -e --justdb ... - can be used to convince rpm that a package is not installed. Such discrepancies between rpm's database and reality happen frequently when the installation does not proceed smoothly. The options --noscripts and --nodeps also work on deinstallation. --nodeps allows you to remove packages even though others depend on them - use with care.

If you do not install packages yourself, or if you work on a computer administrated by someone else, rpm still has its uses. The main option -q (query) allows you to obtain information on installed packages. "rpm -q -a" prints a list of all installed packages. The following two commands print information about a package and list the files contained in it, respectively:

rpm -q -i package
rpm -q -l package
Since you as an ordinary user usually don't know the package name corresponding to a certain program you might want information about, it is fortunate you can specify the package in another way: Giving the option -f <file> queries the package to which the file belongs. You have to specify the whole path of the file. If you want information about a program, you can obtain the path with the program which. Using backticks (see my page about the shell bash) to use the result of which as a command line argument, the rpm command looks like this: (Multiple options can be bundled as shown in the second line.)
rpm -q -i -f `which gzip`
rpm -qif `which gzip`
Using -l instead of -i, one gets a list of files belonging to the same package as a program, including documentation files (if any). If there is no manual page for a certain program, this sometimes helps to find out where its documentation is.

There are also rpm packages containing source files. These do their best to stop users from actually getting their hands on the source code. The intended way to extract files from them seems to be as follows:
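
rpm -i xy.src.rpm
cp -r /usr/src/packages/SOURCES ~/xy-sources

(The exact directory under /usr/src to which the sources are unpacked depends on the distribution; it may be called packages, redhat or similar.)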

Repeat this process for every source package you want to unpack. NEVER install two different source RPMs at the same time. Since they are installed to the same directory, usually without creating a top-level subdirectory, you don't know which source files belong to which package. Considering all this could be done with one tar command for a tar archive, the moral is clear: don't download source RPMs unless you really have to.

In fact, there is a more flexible way to extract RPM packages. It is also (to my knowledge) the only way to install files from an RPM package in arbitrary locations. RPM packages contain, after their header, standard archives. The header and archive components are obtained, respectively, with rpm2header and rpm2cpio. The cpio archive can then be unpacked to a directory in your home directory and installed wherever you want. To unpack an RPM package, do the following in an otherwise empty directory (as an ordinary, non-root user, to be on the safe side):

rpm2cpio xy.rpm > xy.cpio
cpio -i -d -I xy.cpio
The options given to cpio mean, in order: extract archive, create directories as necessary, and extract from archive xy.cpio. The root of the directory tree in the package is the current directory, ie executables will be installed in ./usr/bin, libraries in ./usr/lib etc.

The aforementioned "non-standard" way of unpacking RPMs is also a necessity when an installed package messes up rpm in such a way that it is impossible to uninstall it. This is something that should not but in fact does sometimes happen. (For the curious, I experienced this when upgrading glibc-2.2.4-6mdk to glibc-2.3.2-14mdk. It was not the upgrading of existing libraries which did the damage but the installation of additional ones contained in the newer package. Since rpm is statically linked (as it should be), it is hard to understand how this can be, but there we are.) To recover from such a dilemma, unpack the previously installed package with cpio and copy the files to their locations by hand. Then delete the files which are only part of the newer package.

Another thing that will occasionally get messed up is the RPM database containing the package dependencies. If rpm claims a package is installed when you try to install it and claims it isn't when you try to erase it, or when nothing works any more, recreate it with rpm --rebuilddb. To be on the safe side, save the database directory with tar czf /tmp/oldrpmdb.tgz /var/lib/rpm before you try to rebuild it (substitute the right rpm database directory for your system, which you can find with rpm -ql rpm).

As an observant reader will have realised, rpm is hazardous to use in most cases except when installing packages from the same release of the same distribution (which is its prime purpose). It is usually better to download sources and compile them on your own system to ensure they get on well with the rest of your system.

Having written at length about rpm, I should also say a little bit about the Debian package manager, dpkg. The prominence of rpm on this page is simply due to the fact that until recently I dealt exclusively with it. dpkg's commands for installing and removing packages are:

dpkg -i package-file.deb
dpkg -r package

Note that as with rpm, the installation command takes the package file name as its argument, while the remove command needs the package name. The search features of dpkg are more versatile than those of rpm. Both files and packages can be searched for if you know only part of the name.

dpkg -S phrase

searches for packages containing files with names containing "phrase" and lists all suitable package and file names.

dpkg -l '*phrase*'

lists all packages with names containing "phrase". The following commands display the package description and the list of files contained in a package, respectively.

dpkg -p package
dpkg -L package

Listing all files contained in the same package as an executable is less easy with dpkg than with rpm. For one thing, the package search and file list output cannot be done with the same command. Besides, dpkg -S outputs both the package name and the file name. The following one-liner prints all files in the same package as the gzip command:

dpkg -L $(dpkg -S $(which gzip) | sed -e 's/:.*$//')

Like rpm packages, Debian packages are also based on standard archives. In their case, two types of archives are wrapped within each other. ar is used for the actual package file and can be used to list and extract its contents:

ar -tv package-file.deb
ar -x package-file.deb

This top-level package contains a small file with a version number and two tar archives. One contains the package description, MD5 checksums and install/uninstall scripts, the other the files to be installed.
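
For instance, the files to be installed typically sit in an archive called data.tar.gz, which can be listed after extraction in the usual way:

tar tzf data.tar.gz | less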

Reading non-indigenous file formats

Even though UNIX systems are probably the largest group of operating systems conforming to a common standard, they do not run on a majority of machines, not least because they were developed originally for systems too large and expensive to be used privately. As a consequence, other operating systems are much more widespread, including the longest virus in the world and a system purporting to make things especially simple by using one-button mice ;). Even though these systems use standard file formats for many purposes, e-mails from users of such computers sometimes contain file formats peculiar to those operating systems. Enthusiastic users sometimes forget not everybody has the same system.

One of these non-standard formats is the Mac binhex format. These are ASCII files which usually contain the message "This file must be converted with BinHex 4.0" (or similar) in the first line. Such a file can be decoded under Linux with the program hexbin (provided it is installed; if it isn't there, look through your distribution's packages to locate it; the package may be called "macutils"). To obtain only the data content of the encoded file(s), you have to run it with the option -d. The other parts are presumably useful only on a Mac.

hexbin -d binhexfile
The result will be written to "binhexfile.data".

If you decode text files with hexbin, you will get files with lots of carriage return (CR) characters but no line feeds (LF). This is because on the Mac CR is the character signalling the end of a line. (Under DOS/Windows, the end of the line is indicated by a carriage return and a line feed, so you also have to deal with carriage returns after downloading text files from a Windows system.) hexbin is supposed to be able to exchange CR and LF automatically (option -u), but in my experience it doesn't work (yet? It is beta code). So something else is needed. There are two dedicated small programs for just that conversion, dos2unix and unix2dos. However, I have to admit I have no experience of them since they are not included in my distribution. There are a number of programs on the web which do the same thing, but as we will see, standard text tools are perfectly sufficient.

The small program tr transliterates (ie replaces) characters. The sed command s can also be used for our purposes. The command lines which translate Apple / Mac OS 9 newline characters to UNIX are the following:

tr '\r' '\n'  < file.mac  > file.unix
sed -e 's/\r/\n/g'  < file.mac  > file.unix

Transforming from DOS/Windows to UNIX and vice versa using sed is done the following way:

sed -e 's/\r$//'  < file.dos  > file.unix
sed -e 's/$/\r/'  < file.unix  > file.dos

If you don't like all this shell hackery, and if you use KDE, you can also use khexedit. It is a graphical editor for binary files and has a nice search-and-replace feature.

Another format with which one is occasionally confronted is the Microsoft Word format. It can be converted to HTML with the program mswordview. Unfortunately it has no manual page and no other documentation I know of, but it prints out a list of options when called without arguments, and the long options are fairly self-explanatory. Its basic usage is simple:

mswordview file.doc

This command creates the file "file.html" which contains an HTML rendering of the Word document. An interesting alternative is catdoc which saves you the trouble of using an HTML viewer because it produces plain text output right away. You can view a .doc file by typing:

catdoc file.doc | less

Besides mswordview and catdoc, there is Open Office, a near-exact clone of a piece of billionaire junkware from which most formats incompatible with UNIX systems originate. There is no need to say anything about it except that it can be started with the command ooffice. Like its role model, it already knows for a fact what you want to do and cannot under any circumstances be convinced otherwise. Every time you are asked whether you are sure, click on "Yes". If you want to save time, uninstall it.

Diff and patch

The program diff serves for listing the differences between two text files, usually an older and a newer version of the same file. Its basic syntax is:

diff file1 file2
The output contains a list of the parts in which the files differ. Before each chunk the range of line numbers in the first file is printed, followed by a letter and the range of lines in the second file. The sizes of the ranges differ if something was inserted or deleted or if part of a file was replaced by something longer or shorter. The letter indicates what happened to the second file relative to the first: a means something was added, c stands for "changed" and d for "deleted". After the line numbers, the differing portions of text are printed. Lines from the first file are always preceded by a "<", lines from the second file by a ">". (So if you want to know which file is newer, it is important to remember the order in which you put the files in diff's command line.)
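
As a hypothetical example, if the third line of the first file read "colour" and was changed to "color" in the second, diff would output:

3c3
< colour
---
> color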

With the help of input redirection, you can make diff compare the output of two programs (see the section about input/output redirection for a simple example). diff can also compare binary files, but it will only output a message if they differ. If diff outputs nothing, that means that the files are equal.

The output format described above is just one of several diff can be made to use. See diff's manual page for details. One very important format is the context output format. I will not describe it here since it is not usually necessary to read it yourself. It is important because a different program called patch can use context diffs to update files to a newer version. That is why, if you send contributions to an open source project, you should always use context diffs. They work as follows:

diff -c oldfile newfile > file.diff
patch oldfile file.diff
The -c option makes diff output context diffs; the operator ">" redirects its output to the file "file.diff" (see input/output redirection). The file "file.diff" (which is customarily called a "patch") can easily be sent by mail since it is usually small, even if the modified file is very large. The recipient has the unmodified version of the file which (s)he can change with patch.

diff and patch can also be used to update whole directory trees. I have used this with my PhD thesis which I wrote mostly at home but of which I also kept a copy at my institute. It is necessary to keep a copy of the unmodified directory tree next to the one one works on. You obtain the differences between all text files in two directory trees in the following way:

diff -rNc -x \*.log -x \*.aux -x \*.toc -x \*.ps olddir newdir > dir.diff
The option -r tells diff to operate recursively on a directory tree. -c means context diffs as we had before. -N is a useful option which makes diff treat new files as empty in the old directory and removed files as empty in the new directory. Since patch removes files which are empty as a result of the patch, this makes for proper updating of removed and new files. Otherwise diff would ignore new and removed files. The various options "-x ..." exclude files with specific extensions from the comparison. The above example excludes all types of files that are created automatically by latex. (.dvi files are excluded automatically since they are binary files.)

If you modify just a few files of a large directory tree (for instance an open-source project), you usually do not want to keep an unchanged copy of the whole directory tree. The script gendiff helps you there, if it is available on your system. It searches for files with a specified file name extension and compares only them with the corresponding file without the extension, using diff. To be able to use gendiff, you save a copy of the original of the file you want to modify to a file called filename.orig. To find the differences, run

gendiff directory .orig
gendiff uses diff's unified output format, which can also be handled by patch. In case your system doesn't have gendiff, here is a copy. It contains some stuff about RPM directories which I think you can delete; that is just because it is apparently used by the package manager RPM.

Now how to apply the patch? patch is supposed to be very good at extracting the right file name from a diff file and even tolerant towards wrong line numbers; however, in my experience it is not easy to make it work. The method which I have used successfully so far is the following:

cd olddir
patch -p1 < dir.diff
First descend into the directory you want to patch, then execute patch with the option -p1 which makes it strip the top-level directory from all file names in the patch. Since the directories compared with diff had different names, there are two different top-level directory names in the patch. In my experience, without the -p1 option patch automatically picks the wrong one and complains it can't find the file to patch.

Using diff and patch to keep copies of a directory tree up to date requires some discipline. You have to apply all patches to both the remote copy and the "reference" copy at home you use to compare your working directory to. If you forget to apply one patch and try to apply the following one, patch will become hopelessly confused and quite likely garble your files. There are programs which check automatically that patches are applied in order, like cvs. But it requires setting up a "master" copy of the directory tree called the repository and is intended for large projects. If you are advanced enough to want to use cvs, you will no doubt be able to understand its manual page.

A calculator

Of course there are "pocket calculators" under UNIX, notably xcalc, and kcalc under Linux with the KDE desktop. If you do each calculation only once, they are the tools to use. But if you want to do a complex calculation several times with changed parameters, you'll want something more efficient.

For this purpose, there is a different program, called bc. It is text-based and offers many of the features of a shell - command history, emacs-compatible key sequences for moving, deleting and pasting, and not least the possibility to write scripts for frequent calculations. The sophisticated editing features are due to the readline library for text input which is also used by the shell bash.

For floating-point arithmetic, one should call bc with the option -l. This sets the number of digits after the decimal point to the maximum (default: none) and defines functions from a math library (s(x) sine, c(x) cosine, a(x) arctangent, e(x) exponential, l(x) logarithm and j(n,x) J Bessel functions). Besides these, one can use the function sqrt(x) which is predefined even without -l. Calculations are done by entering a mathematical expression and pressing Return. The previous input (and earlier ones) can be fetched with the Up arrow key and edited before executing it anew. Several expressions can be entered in one line separated by a semicolon. The result of the previous calculation can be reused and is denoted by a single dot.

The operators +, -, * and / can be used in the standard way. In addition there is % for modulo (remainder of a division) and ^ for integer powers of a number. Variables are created by assigning a value to them with =. The C pre/post-increment/decrement operators can be used as well as the abbreviation += for variable=variable+... (and similarly for other operators). There are also control structures like if, while and for commands; they are described in the manual page of bc.
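
For instance, the following line assigns a value to a (hypothetical) variable, raises it to the tenth power and then divides the result by four:

x=2; x^10; . / 4

bc prints 1024 and then 256.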

If you want to print out the result of just one calculation and avoid the start-up message, you can pipe a command into bc. This is done with the help of the program echo which echoes its command line arguments. For instance,

echo "4*a(1)" | bc -l
outputs the number 4*arctan(1)=pi to quite a lot of digits.

The most powerful way of using bc is with "here documents". They are scripts for interpreters used within shell scripts which allow the command-line arguments of the shell script to be used as parameters. (See my page about the shell bash for more information about here documents.) For instance, the following script converts a length given in feet and inches into metres.

#!/bin/sh

if [ $# -ne 2 ]; then
  echo "usage: amlen <f> <i> converts the length f feet i inches to metres"
  exit 0
fi

bc -l <<EOF
scale=3
$1*0.3048+$2*0.0254
EOF
The first part of the script only prints a usage message if it is called with the wrong number of arguments. Then comes the script for bc. The expressions "$1" and "$2" expand to the first and second command line argument of the script, ie the number of feet and inches, respectively. The calculation itself is simple. The line "scale=3" sets the number of digits after the decimal point to three.

Finally, one can create lookup-tables of a function with the help of bc and some advanced UNIX wizardry. The plan is as follows: First we create an infinity of empty lines with "yes """. Then we remove all except a finite number of lines at the top with "head -<number>". Then these (empty) lines are numbered with "nl -b a". The output now contains successive numbers starting from 1. To convert each of these lines into an expression (the desired function), we use awk. In awk, the variable $1 contains the line number, and we have to put our expression around it. The number will usually have to be rescaled to give the right interval. The resulting expression is then given to bc to execute. If we put all this together in a pipe, it looks like this:

yes "" | head -16 | nl -b a | awk '{ print "s(2*4*a(1)*(" $1 "-1)/16)"; }' | bc -l
This line outputs a sine lookup table with arguments at multiples of pi/8. "4*a(1)" was used to compute pi.

Image processing

Even though the main fare of UNIX systems is plain text files, programs exist for the automatic processing of graphics files. The most powerful of them all, the scripting capability of the GIMP, is as yet beyond my experience, which forces me to summarily ignore it until some future version of this page. A very useful, if less powerful, program is convert. It serves for converting images between different file formats, including many used on non-UNIX systems. Its main purpose seems to be to convert pixel-based formats; it does not work very well when converting to vector formats like Postscript. Its basic syntax is:

convert image.tiff image.gif
The destination file format is determined from the file name extension. The quality for the conversion to JPEG is determined with the option "-quality <number>". The number has to be in the range from 0 to 100; the default is 75. convert can also draw into images, apply special effects and create new images, for instance by tiling smaller ones. For these more complicated features, see its manual page.
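
For instance, the following converts a (hypothetical) TIFF image to JPEG at somewhat better than default quality:

convert -quality 90 image.tiff image.jpg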

For combining two images, use composite. It can superimpose images in many ways, including arithmetical operations. This offers a way to check how similar two pictures of equal size are:

composite -compose subtract image1.png image2.png - | gzip | wc -c
composite with -compose subtract subtracts the colour values of the two images, resulting in small values for pixels of similar colours. gzip compresses the result, and wc -c outputs the number of bytes of the compressed data. Since compression works better the smaller the range of colour values is, the more similar the original images are, the smaller that number will be.

One can also use composite for creating special effects: For instance, adding an image to itself can create psychedelic, brightly coloured images.
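
For instance, the following adds a (hypothetical) image to itself, assuming your version of composite knows the plus operator:

composite -compose plus image.png image.png psychedelic.png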

