Usage of `osm2pgsql`

Syntax -- Functionality -- Operation -- Projections -- Options -- Examples -- Database -- Authors -- Development -- References

This documentation page was cobbled together from the various locations where documentation for osm2pgsql has been hidden to make it hard to find: The OpenStreetmap Wiki, source readme file, help message and manual page (referenced below). I have also added my own experiences where possible.

Command line syntax

osm2pgsql [options] planet.osm

osm2pgsql [options] planet.osm.{gz,bz2,pbf}

osm2pgsql [options] file1.osm file2.osm file3.osm

The Wiki suggests running decompression and database import in parallel on multi-core machines, like this:

bunzip -c file.osm.bz2 | osm2pgsql [options] /dev/stdin

Functionality

osm2pgsql serves to import packed OpenStreetmap geodata into a PostgreSQL / PostGIS database. This live database can then be used by rendering programs or the Nominatim geocoder. OSM planet snapshots can be downloaded from here. Geodata files for various countries and regions ("extracts") are available here in the especially small PBF (ProtoBufBinary) format, and from other sources (see reference list in the Wiki).

osm2pgsql currently supports two different database schemas that can be selected with its -o option:

A database schema that is optimized for ease of rendering by Mapnik.
A database schema that is optimized for geocoding with Nominatim, emphasizing the spatially hierarchical organizations of objects.

Both schemas were specifically optimized for the purpose they were intended for and they may therefore be less suitable for other general purpose processing. Nevertheless, the rendering schema might be useful for other purposes as well, and has been used for a variety of additionally purposes.

Feature list:

Converts OSM files to a PostgreSQL database
Conversion of OSM tags to database columns is configurable in the style file
Able to read .gz, .bz2, .pbf and .o5m files directly
Can apply diffs to keep the database up to data
Supports the choice of output projection
Configurable table names
Gazetteer back-end for Nominatim
Support for hstore field type to store the complete set of tags in one database field if desired

If you are completely new to osm2pgsql, you may prefer to follow a tutorial rather than read the docs. Tile server setup tutorials are linked here, a tutorial for creating map tiles for offline use is here.

The most important options determining its mode of operation are:

-s, --slim: Store temporary tables in database. This allows incremental updates using diff files also available at OSM data servers, and reduces memory usage at a cost in disk space and import time. This mode of operation is recommended. See this Wiki page on how to keep a database up to date.
--flat-nodes file: Store temporary tables for --slim outside database in file. Reduces disk space considerably. It is still only recommended for whole-planet databases because the organisation of the file is optimised for that case.
-k, --hstore: Hstore is a data type in a PostgreSQL database that can itself store multiple key-value pairs. With this option, osm2pgsql stores non-standard tags in an additional hstore-type column, while standard tags are still stored in a column of their own as usual. This option requires that the hstore extension has been enabled in the database (see below).

Steps of operation

osm2pgsql performs the following actions:

osm2pgsql connects to database and creates the following 4 tables when used with the default output back-end (pgsql): planet_osm_point, planet_osm_line, planet_osm_roads and planet_osm_polygon. The default prefix "planet_osm" can be changed with the --prefix option. If you are using --slim mode, it will create the following additional 3 tables: planet_osm_nodes, planet_osm_ways and planet_osm_rels.
Runs a parser on the input file and processes the nodes, ways and relations.
If a node has a tag declared in the style file then it is added to planet_osm_point. If it has no such tag then the position is noted, but not added to the database.
Ways are read in and converted into WKT geometries by using the positions of the nodes read in earlier. If the tags on the way are listed in the style file then the way will be written into the line or roads tables.
If the way has one or more tags marked as "polygon" and forms a closed ring then it will be added to the planet_osm_polygon table.
The relations are parsed. Osm2pgsql has special handling for a limited number of types: multipolygon, route and boundary. The code will build the appropriate geometries by referencing the members and outputting these into the database.
Indexes are added to speed up the queries by Mapnik.

Geographic projections

The downloadable OpenStreetmap data contain the geographical latitude and longitude of nodes (which are points out of which all else is composed). When importing the data into a database, osm2pgsql already performs a map projection. Presumably this serves to prevent duplication of the projection computation later on, as most node coordinates will be needed multiple times when rendering different zoom levels of the same region. The projection used by default is the Pseudo-Mercator projection, so called because it is slightly inconsistent and therefore can distort angles very slightly. It is universally used by web map services and also called Web Mercator and spherical Mercator. It can also be explicitly selected with the -m option.

Two other projections can be selected with simple command-line switches. The first, -l, is not an actual projection, but stores the latitude and longitude themselves. This may be helpful if you want to query the database yourself using the geocoordinates. The other projection with a dedicated command-line option, WGS84 Mercator (-M), is considered obsolete. (The WGS84, in case you wondered, stands for today's standard ellipsoid used to approximate the earth's surface.)

In addition, osm2pgsql can create the database using any projection known to the proj program that converts plain-text coordinate files between projections. (It is an indirect dependency of osm2pgsql, so it has to be installed.) The proj package contains a list of projections standardised by the EPSG in /usr/share/proj/epsg. The osm2pgsql option -E allows to choose any of them by number. The projections with dedicated options can also be represented as EPSG projections:

Projection	osm2pgsql option	EPSG index	Spatial Reference System (SRS) string for proj (and Mapnik and others using it)
Latitude / longitude	`-l`	4326	`+proj=longlat +datum=WGS84 +no_defs`
Pseudo-Mercator	`-m` (default)	3857 (900913, 3785)	`+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +no_defs +over`
WGS84 Mercator	`-M`	3395	`+proj=merc +datum=WGS84 +k=1.0 +units=m +over +no_defs`

In order to do manual conversions with proj, you need not enter the whole SRS string. Because proj already has the list of EPSG projections, you can use the option +init=epsg:index instead.

The default OpenStreetmap rendering style for Mapnik assumes the default (Pseudo-Mercator) projection, so you probably want to stick with that unless you have special needs.

All options

osm2pgsql follows the usual GNU command line syntax, with long options starting with two dashes (`-').

Mode of operation options

-a|--append: Add the OSM file into the database without removing existing data. This can reputedly lead to problems if objects in the new data are already in the database, such as when a region is extended by an adjacent one. The referenced post hints that this can be fixed by removing a uniqueness constraint of the database.
-c|--create: Remove existing data from the database. This is the default if --append is not specified.
--flat-nodes /path/to/nodes.cache: The flat-nodes mode is a separate method to store slim mode node information on disk. Instead of storing this information in the main PostgreSQL database, this mode creates its own separate custom database to store the information. As this custom database has application level knowledge about the data to store and is not general purpose, it can store the data much more efficiently. Storing the node information for the full planet requires about 100 GB in PostgreSQL, the same data is stored in only ~20 GB (Oct 2013) using the flat-nodes mode. This can also increase the speed of applying diff files. This option activates the flat-nodes mode and specifies the location of the database file. It is a single large > 16GB file. This mode is only recommended for full planet imports as it doesn't work well with small extracts. The default is disabled.
-s|--slim: Store temporary data in the database. Without this mode, all temporary data is stored in RAM and if you do not have enough the import will not work successfully. With slim mode, you should be able to import the data even on a system with limited RAM, although if you do not have enough RAM to cache at least all of the nodes, the time to import the data will likely be greatly increased. You must use slim mode for planet imports on 32-bit systems, since there are too many nodes to store in RAM otherwise. This switch is also required for later updates with --append.
--drop: Drop the slim mode tables from the database once the import is complete. This can greatly reduce the size of the database, as the slim mode tables typically are the same size, if not slightly bigger than the main tables. It does not, however, reduce the maximum spike of disk usage during import. It can furthermore increase the import speed, as no indices need to be created for the slim mode tables, which (depending on hardware) can nearly halve import time. Slim mode tables however have to be persistent if you want to be able to update your database, as these tables are needed for diff processing.
-r|--input-reader format: Select input format reader. Does not normally need to be set, as the type of input file is recognised automatically (by extension or by content?). Available choices are libxml2 and primitive for XML files, o5m for o5m formatted files and pbf for protobuf encoded files (OSM Wiki, library docs). Not all choices may be available on all systems, as they can be disabled at compile time.
-S|--style /path/to/style: Location of the osm2pgsql style file. This specifies which tags from the data get imported into database columns and which tags are dropped. Defaults to /usr/share/osm2pgsql/default.style.
-o|--output backend: Specifies the output back-end or database schema to use. Currently osm2pgsql supports pgsql, gazetteer and null. pgsql is the default output back-end that creats a PostGIS database schema for rendering with Mapnik. gazetteer is creates a PostGIS database schema optimized for geocoding by Nominatim. null does not write any output and is only useful for testing.
-u|--utf8-sanitize: Obsolete option to repair bad UTF-8 input data from before August 2007. Adds about 10% overhead.

Projection options

-l|--latlong: Store data in degrees of latitude and longitude. This is incompatible with the standard Mapnik style. But it could allow creating maps in custom projections, including at or near the poles, where the Mercator projection does not extend. It might also be an advantage if you want to query the resulting database manually or with custom programs based on latitude and longitude.
-m|--merc: Store data in Pseudo-Mercator projection. This is a variant of the Mercator projection that is slightly inconsistent mathematically and as a consequence distorts angles slightly but is in wide use on the web because it is easy to calculate. See here for a discussion. This is the default.
-M|--oldmerc: Store data in WGS84 World Mercator projection. This is considered deprecated.
-E|--proj num: Store data in EPSG projection EPSG:num. See above.

Database access options

-d|--database name: The name of the PostgreSQL database to connect to (default: gis).
-p|--prefix prefix_string: Prefix for table names (default: planet_osm).
-U|--username name: PostgreSQL database user name.
-W|--password: Force password prompt. Otherwise specify password in PGPASS environment variable.
-H|--host hostname: Database server hostname or UNIX-domain socket location. Seems to default to local host.
-P|--port num: Database server port. Defaults to 5432, PostgreSQL's default port.

Options related to an additional hstore-type column for tags

These option requires that the hstore extension has been enabled in the database (see below).

-k|--hstore: Add tags without column to an additional hstore (key/value) column in PostgreSQL tables.
-j|--hstore-all: Add all tags to an additional hstore (key/value) column in PostgreSQL tables.
-z|--hstore-column key_name: Add an additional hstore (key/value) column containing all tags that start with the specified string, e.g. --hstore-column "name:" will produce an extra hstore column that contains all name:xx tags
--hstore-match-only: Only keep objects that have a value in one of the columns (normal action with --hstore is to keep all objects).
--hstore-add-index: Create indices for the hstore column during import.

Performance options

-C|--cache num: Only for slim mode: Use up to num MB of RAM for caching nodes. Giving osm2pgsql sufficient cache to store all imported nodes typically greatly increases the speed of the import. Each cached node requires 8 bytes of cache, plus about 10% to 30% overhead. For a current OSM full planet import with its ~ 1.9 billion nodes, a good value would be 17000 if you have enough RAM. If you don't have enough RAM, it is likely beneficial to give osm2pgsql close to the full available amount of RAM. Without slim mode, there is also a "way" cache which takes up about as much space as the node cache does, but is not charged against -C. [Remark by VS: This may be code for osm2pgsql using more than twice as much RAM as given in the -C option, which seems to be confirmed by experience.] num defaults to 800.
--cache-strategy strategy: There are a number of different modes in which osm2pgsql can organize its node cache in RAM. These are optimized for different assumptions of the data and the hardware resources available. Currently available strategies are dense, chunk, sparse and optimized. dense assumes that the node IDs are densely packed, i.e. only a few IDs in a range are missing. For extracts this is usually not the case, making the cache very inefficient and wasteful of RAM. chunk is optimised for non-contiguous memory allocation and is the default on 32-bit systems. sparse assumes node IDs in the data are not densely packed, greatly increasing caching efficiency in these cases. If node IDs are densely packed, like in the full planet, this strategy has a higher overhead for indexing the cache. optimized uses both dense and sparse strategies for different ranges of the ID space. On a block by block basis it tries to determine if it is more effective to store the block of IDs in sparse or dense mode. This may use twice as much virtual memory, but no more physical memory. It is the default on 64-bit machines.
--number-processes num: Specifies the number of parallel processes used for certain operations. If disks are fast enough e.g. if you have an SSD, then this can greatly increase speed of the "going over pending ways" and "going over pending relations" stages on a multi-core server. [Remark by VS: It is not clear (and of course not documented) how much using more processors affects RAM usage, so you may have to reduce the parameter of the -C option.]
-I|--disable-parallel-indexing: By default osm2pgsql initiates the index building on all tables in parallel to increase performance. This can be disadvantageous on slow disks, or if you don't have enough RAM for PostgreSQL to perform up to 7 parallel index building processes (e.g. because maintenance_work_mem is set high).
-i|--tablespace-index tablespacename: Store all indices in a separate PostgreSQL tablespace named by this parameter. This allows to e.g. store the indices on faster storage like SSDs. The following options allow more fine-grained control:
--tablespace-main-data tablespacename: Store the data tables (non slim) in the given tablespace.
--tablespace-main-index tablespacename: Store the indices of the main tables (non slim) in the given tablespace.
--tablespace-slim-data tablespacename: Store the slim mode tables in the given tablespace.
--tablespace-slim-index tablespacename: Store the indices of the slim mode tables in the given tablespace.
--unlogged: Use PostgreSQL's unlogged tables for storing data. This requires PostgreSQL 9.1 or above. Data written to unlogged tables is not written to PostgreSQL's write-ahead log, which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown.

Informational options

-h|--help: Help information. With -v, displays complete option list.
-v|--verbose: Verbose output.

Advanced options

-b|--bbox minlon,minlat,maxlon,maxlat: Apply a bounding box filter on the imported data, e.g. --bbox -0.5,51.25,0.5,51.75. The general-purpose converter Osmosis is said to be more efficient at extracting partial datasets than this option. In my experience, this option cuts off ways at the boundary and removes relations that are not completely within the region, so it may be advantageous to choose a box a bit larger than the map region you want to render.
-e|--expire-tiles [min_zoom-]max-zoom: Create a tile expiry list.
-o|--expire-output /path/to/expire.list: Output file name for expired tiles list.
-G|--multi-geometry: Generate multi-geometry features in PostgreSQL tables. Normally osm2pgsql splits multi-part geometries into separate database rows per part. A single OSM id can therefore have several rows. With this option, PostgreSQL instead generates multi-geometry features in the PostgreSQL tables. Multi-geometry objects are a PostGIS feature representing collections of geometrical objects that can represent OSM relations combining multiple boundaries to an area (e.g. with holes).
--tag-transform-script /path/to/script: Specify a lua script to handle tag filtering and normalisation. The script contains callback functions for nodes, ways and relations, which each take a set of tags and returns a transformed, filtered set of tags which are then written to the database.
-x|--extra-attributes: Include attributes for each object in the database. This includes the user name and user ID of the last contributor and the time stamp and version of the latest revision. Note: this option also requires additional entries in your style file.
-K|--keep-coastlines: Keep coastline data rather than filtering it out. By default "natural=coastline" tagged data will be discarded based on the assumption that post-processed Coastline Checker shape files will be used.
--exclude-invalid-polygon: OpenStreetMap data is defined in terms of nodes, ways and relations and not in terms of actual geometric features. osm2pgsql therefore tries to build postgis geometries out of this data representation. However not all ways and relations correspond to valid PostGIS geometries (e.g. self intersecting polygons). By default osm2pgsql tries to automatically fix these geometries using ST_Buffer(0) around the invalid polygons. With this option, invalid polygons are instead simply dropped from the database.

Usage examples

Basic usage:

osm2pgsql -s -U postgres -d nameofdatabase /file/path/toosm/fileorpbf/name.osm

In the example above, postgres is the user of the database, the user enabled slim mode (generally recommended). The above sample is feasible for someone looking to export a city's OSM data into a PostGIS database. For more advanced and larger datasets, read the Optimization section.

Setting up the database

Before running osm2pgsql, you have to create a PostgreSQL user and a database with the PostGIS functions enabled. This requires access as the database administrator, normally the postgres user.

PostgreSQL 9.1 and PostGIS 2.0 or later are strongly suggested for databases in production. It is generally best to run the latest released versions if possible. PostgreSQL 8.4 and PostGIS 1.5 will work but are substantially slower. Additionally, PostGIS 2.0 contains enhancements that increase reliability as well as add new features that style sheet authors can use.

The default name for this database is gis but this may be changed by using osm2pgsql's --database option.

If the <username> matches the unix user id running the import and rendering then this allows the PostgreSQL 'ident sameuser' authentication to be used which avoids the need to enter a password when accessing the database. This is setup by default on many Unix installs but does not work on Windows (due to the lack of unix sockets).

Some example commands are given below but you may also want to look at this wiki page.

sudo -u postgres createuser <username>
sudo -u postgres createdb -E UTF8 -O <username> <dbname>

To add the PostGIS extensions, you will have to run an SQL script that comes with PostGIS. The version numbers in the path will vary.

sudo -u postgres psql -d <dbname> -f /usr/share/postgresql/contrib/postgis-2.1/postgis.sql

Next we need to give the <username> access to update the postgis meta-data tables:

sudo -u postgres psql -d <dbname> -c "ALTER TABLE spatial_ref_sys OWNER TO <username>;"

The 900913 spatial reference system is not normally included with PostGIS. To add it you should run the SQL script that comes with osm2pgsql:

sudo -u postgres psql -d <dbname> -U <username> -f /usr/share/osm2pgsql/900913.sql

If you want to use hstore support then you will also need to enable the PostgreSQL hstore extension:

sudo -u postgres psql -d <dbname> -c "CREATE EXTENSION hstore;"

On PostgreSQL versions before 9.1 you need to install the hstore-new extension instead by running an SQL script:

sudo -u postgres psql -d <dbname> -f /usr/share/postgresql/8.4/contrib/hstore-new.sql

Optimising database configuration for osm2pgsql imports

For an efficient operation of PostgreSQL you will need to tune the config parameters of PostgreSQL from its default values. These are set in the config file, which is located in the directory storing the database data, by default /var/lib/postgres/data/postgresql.conf.

The values you need to set will depend on the hardware you have available, but you will likely need to increase the values for the following parameters: shared_buffers, checkpoint_segments, work_mem, maintenance_work_mem, and effective_cache_size. Suggestions for values can be found in the section "Configure the PostGIS database" on this page.