Understanding the Generation Process

We describe next the main steps in the generation of the BerlinMOD scenario. The generator uses multiple parameters that can be set to customize the generation process. We explain in detail these parameters in the section called “Tuning the Generator Parameters”. It is worth noting that the procedures explained in this section have been slightly simplified with respect to the actual procedures by removing ancillary details concerning the generation of tracing messages at various verbosity levels.

We start by creating a first set of tables for containing the generated data as follows.

CREATE TABLE Vehicle(id int PRIMARY KEY, home bigint NOT NULL, work bigint NOT NULL,
  noNeighbours int);
CREATE TABLE Destinations(vehicle int, source bigint, target bigint,
  PRIMARY KEY (vehicle, source, target));
CREATE TABLE Licences(vehicle int PRIMARY KEY, licence text, type text, model text);
CREATE TABLE Neighbourhood(vehicle int, seq int, node bigint NOT NULL,
  PRIMARY KEY (vehicle, seq));

-- Get the number of nodes
SELECT COUNT(*) INTO noNodes FROM Nodes;

FOR i IN 1..noVehicles LOOP
-- Fill the Vehicles table
IF nodeChoice = 'Network Based' THEN
  homeNode = random_int(1, noNodes);
  workNode = random_int(1, noNodes);
ELSE
  homeNode = berlinmod_selectHomeNode();
  workNode = berlinmod_selectWorkNode();
END IF;
IF homeNode IS NULL OR workNode IS NULL THEN
  RAISE EXCEPTION '    The home and the work nodes cannot be NULL';
END IF;
INSERT INTO Vehicle VALUES (i, homeNode, workNode);

-- Fill the Destinations table
INSERT INTO Destinations(vehicle, source, target) VALUES
  (i, homeNode, workNode), (i, workNode, homeNode);

-- Fill the Licences table
licence = berlinmod_createLicence(i);
type = berlinmod_vehicleType();
model = berlinmod_vehicleModel();
INSERT INTO Licences VALUES (i, licence, type, model);

-- Fill the Neighbourhood table
INSERT INTO Neighbourhood
WITH Temp AS (
  SELECT i AS vehicle, N2.id AS node
  FROM Nodes N1, Nodes N2
  WHERE N1.id = homeNode AND N1.id <> N2.id AND
    ST_DWithin(N1.geom, N2.geom, P_NEIGHBOURHOOD_RADIUS) )
SELECT i, ROW_NUMBER() OVER () as seq, node
FROM Temp;
END LOOP;

CREATE UNIQUE INDEX Vehicle_id_idx ON Vehicle USING BTREE(id);
CREATE UNIQUE INDEX Neighbourhood_pkey_idx ON Neighbourhood USING BTREE(vehicle, seq);

UPDATE Vehicle V
SET noNeighbours = (SELECT COUNT(*) FROM Neighbourhood N WHERE N.vehicle = V.id);

We start by storing in the Vehicles table the home and the work node of each vehicle. Depending on the value of the variable nodeChoice, we chose these nodes either with a uniform distribution among all nodes in the network or we call specific functions that take into account population and employment statistics in the area covered by the generation. We then keep track in the Destinations table of the two trips to and from work and we store in the Licences table information describing the vehicle. Finally, we compute in the Neighbourhood table the set of nodes that are within a given distance of the home node of every vehicle. This distance is stated by the parameter P_NEIGHBOURHOOD_RADIUS, which is set by default to 3 Km.

We create now auxiliary tables containing benchmarking data. The number of rows these tables is determined by the parameter P_SAMPLE_SIZE, which is set by default to 100. These tables are used by the BerlinMOD benchmark to assess the performance of various types of queries.

CREATE TABLE QueryPoints(id int PRIMARY KEY, geom geometry(Point));
INSERT INTO QueryPoints
WITH Temp AS (
  SELECT id, random_int(1, noNodes) AS node
  FROM generate_series(1, P_SAMPLE_SIZE) id )
SELECT T.id, N.geom
FROM Temp T, Nodes N
WHERE T.node = N.id;

CREATE TABLE QueryRegions(id int PRIMARY KEY, geom geometry(Polygon));
INSERT INTO QueryRegions
WITH Temp AS (
  SELECT id, random_int(1, noNodes) AS node
  FROM generate_series(1, P_SAMPLE_SIZE) id )
SELECT T.id, ST_Buffer(N.geom, random_int(1, 997) + 3.0, random_int(0, 25)) AS geom
FROM Temp T, Nodes N
WHERE T.node = N.id;

CREATE TABLE QueryInstants(id int PRIMARY KEY, instant timestamptz);
INSERT INTO QueryInstants
SELECT id, startDay + (random() * noDays) * interval '1 day' AS instant
FROM generate_series(1, P_SAMPLE_SIZE) id;

CREATE TABLE QueryPeriods(id int PRIMARY KEY, period period);
INSERT INTO QueryPeriods
WITH Instants AS (
  SELECT id, startDay + (random() * noDays) * interval '1 day' AS instant
  FROM generate_series(1, P_SAMPLE_SIZE) id )
SELECT id, Period(instant, instant + abs(random_gauss()) * interval '1 day',
  true, true) AS period
FROM Instants;

We generate now the leisure trips. There is at most one leisure trip in the evening of a week day and at most two leisure trips each day of the weekend, one in the morning and another one in the afternoon. Each leisure trip is composed of 1 to 3 destinations. The leisure trip starts and ends at the home node and visits successively these destinations. In our implementation, the various subtrips from a source to a destination node of a leisure trip are encoded independently, contrary to what is done in Secondo where a leisure trip is encoded as a single trip and stops are added between successive destinations.

CREATE TABLE LeisureTrip(vehicle int, day date, tripNo int, seq int, source bigint,
  target bigint, PRIMARY KEY (vehicle, day, tripNo, seq));
-- Loop for every vehicle
FOR i IN 1..noVehicles LOOP
-- Get home node and number of neighbour nodes
SELECT home, noNeighbours INTO homeNode, noNeigh
FROM Vehicle V WHERE V.id = i;
day = startDay;
-- Loop for every generation day
FOR j IN 1..noDays LOOP
  weekday = date_part('dow', day);
  -- Generate leisure trips (if any)
  -- 1: Monday, 5: Friday
  IF weekday BETWEEN 1 AND 5 THEN
    noLeisTrips = 1;
  ELSE
    noLeisTrips = 2;
  END IF;
  -- Loop for every leisure trip in a day (1 or 2)
  FOR k IN 1..noLeisTrips LOOP
    -- Generate a leisure trip with a 40% probability
    IF random() <= 0.4 THEN
      -- Select a number of destinations between 1 and 3
      IF random() < 0.8 THEN
        noDest = 1;
      ELSIF random() < 0.5 THEN
        noDest = 2;
      ELSE
        noDest = 3;
      END IF;
      sourceNode = homeNode;
      FOR m IN 1..noDest + 1 LOOP
        IF m <= noDest THEN
          targetNode = berlinmod_selectDestNode(i, noNeigh, noNodes);
        ELSE
          targetNode = homeNode;
        END IF;
        IF targetNode IS NULL THEN
          RAISE EXCEPTION '    Destination node cannot be NULL';
        END IF;
        INSERT INTO LeisureTrip VALUES
          (i, day, k, m, sourceNode, targetNode);
        INSERT INTO Destinations(vehicle, source, target) VALUES
          (i, sourceNode, targetNode) ON CONFLICT DO NOTHING;
        sourceNode = targetNode;
      END LOOP;
    END IF;
  END LOOP;
  day = day + 1 * interval '1 day';
END LOOP;
END LOOP;

CREATE INDEX Destinations_vehicle_idx ON Destinations USING BTREE(vehicle);

For each vehicle and each day, we determine the number of potential leisure trips depending on whether it is a week or weekend day. A leisure trip is generated with a probability of 40% and is composed of 1 to 3 destinations. These destinations are chosen so that 80% of the destinations are from the neighbourhood of the vehicle and 20% are from the complete graph. The information about the composition of the leisure trips is then added to the LeisureTrip and Destinations tables.

We then call pgRouting to generate the path for each source and destination nodes in the Destinations table.

CREATE TABLE Paths(
  -- This attribute is needed for partitioning the table for big scale factors
  vehicle int,
  -- The following attributes are generated by pgRouting
  start_vid bigint, end_vid bigint, seq int, node bigint, edge bigint,
  -- The following attributes are filled from the Edges table
  geom geometry NOT NULL, speed float NOT NULL, category int NOT NULL,
  PRIMARY KEY (vehicle, start_vid, end_vid, seq));

-- Select query sent to pgRouting
IF pathMode = 'Fastest Path' THEN
query1_pgr = 'SELECT id, source, target, cost_s AS cost,'
  'reverse_cost_s as reverse_cost FROM edges';
ELSE
query1_pgr = 'SELECT id, source, target, length_m AS cost,'
  'length_m * sign(reverse_cost_s) as reverse_cost FROM edges';
END IF;
-- Get the total number of paths and number of calls to pgRouting
SELECT COUNT(*) INTO noPaths FROM (SELECT DISTINCT source, target FROM Destinations) AS T;
noCalls = ceiling(noPaths / P_PGROUTING_BATCH_SIZE::float);

FOR i IN 1..noCalls LOOP
query2_pgr = format('SELECT DISTINCT source, target FROM Destinations '
  'ORDER BY source, target LIMIT %s OFFSET %s',
  P_PGROUTING_BATCH_SIZE, (i - 1) * P_PGROUTING_BATCH_SIZE);
INSERT INTO Paths(vehicle, start_vid, end_vid, seq, node, edge, geom, speed, category)
WITH Temp AS (
  SELECT start_vid, end_vid, path_seq, node, edge
  FROM pgr_dijkstra(query1_pgr, query2_pgr, true)
  WHERE edge > 0 )
SELECT D.vehicle, start_vid, end_vid, path_seq, node, edge,
  -- adjusting direction of the edge traversed
  CASE
    WHEN T.node = E.source THEN E.geom
    ELSE ST_Reverse(E.geom)
  END AS geom, E.maxspeed_forward AS speed,
  berlinmod_roadCategory(E.tag_id) AS category
FROM Destinations D, Temp T, Edges E
WHERE D.source = T.start_vid AND D.target = T.end_vid AND E.id = T.edge;
END LOOP;

CREATE INDEX Paths_vehicle_start_vid_end_vid_idx ON Paths USING
BTREE(vehicle, start_vid, end_vid);

The variable pathMode determines whether pgRouting computes either the fastest or the shortest path from a source to a destination node. Then, we determine the number of calls to pgRouting. Indeed, depending on the available memory of the computer, there is a limit in the number of paths to be computed by pgRouting in a single call. The paths are stored in the Paths table. In addition to the columns generated by pgRouting, we add the geometry (adjusting the direction if necessary), the maximum speed, and the category of the edge. The BerlinMOD data generator considers three road categories: side road, main road, and freeway. The OSM road types are mapped to one of these categories in the function berlinmod_roadCategory.

We are now ready to generate the trips.

DROP TYPE IF EXISTS step CASCADE;
CREATE TYPE step as (linestring geometry, maxspeed float, category int);

CREATE FUNCTION berlinmod_createTrips(noVehicles int, noDays int, startDay date,
  disturbData boolean)
RETURNS void LANGUAGE plpgsql STRICT AS $$
DECLARE
  /* Declaration of variables and parameters ... */
BEGIN
  DROP TABLE IF EXISTS Trips;
  CREATE TABLE Trips(vehicle int, day date, seq int, source bigint, target bigint,
    trip tgeompoint, trajectory geometry, PRIMARY KEY (vehicle, day, seq));
  -- Loop for each vehicle
  FOR i IN 1..noVehicles LOOP
    -- Get home -> work and work -> home paths
    SELECT home, work INTO homeNode, workNode
    FROM Vehicle V WHERE V.id = i;
    SELECT array_agg((geom, speed, category)::step ORDER BY seq) INTO homework
    FROM Paths WHERE vehicle = i AND start_vid = homeNode AND end_vid = workNode;
    SELECT array_agg((geom, speed, category)::step ORDER BY seq) INTO workhome
    FROM Paths WHERE vehicle = i AND start_vid = workNode AND end_vid = homeNode;
    d = startDay;
    -- Loop for each generation day
    FOR j IN 1..noDays LOOP
      weekday = date_part('dow', d);
      -- 1: Monday, 5: Friday
      IF weekday BETWEEN 1 AND 5 THEN
        -- Crete trips home -> work and work -> home
        t = d + time '08:00:00' + CreatePauseN(120);
        createTrip(homework, t, disturbData);
        INSERT INTO Trips VALUES (i, d, 1, homeNode, workNode, trip, trajectory(trip));
        t = d + time '16:00:00' + CreatePauseN(120);
        trip = createTrip(workhome, t, disturbData);
        INSERT INTO Trips VALUES (i, d, 2, workNode, homeNode, trip, trajectory(trip));
        tripSeq = 2;
      END IF;
      -- Get the number of leisure trips
      SELECT COUNT(DISTINCT tripNo) INTO noLeisTrip
      FROM LeisureTrip L
      WHERE L.vehicle = i AND L.day = d;
      -- Loop for each leisure trip (0, 1, or 2)
      FOR k IN 1..noLeisTrip LOOP
        IF weekday BETWEEN 1 AND 5 THEN
          t = d + time '20:00:00' + CreatePauseN(90);
          leisNo = 1;
        ELSE
          -- Determine whether it is a morning/afternoon (1/2) trip
          IF noLeisTrip = 2 THEN
            leisNo = k;
          ELSE
            SELECT tripNo INTO leisNo FROM LeisureTrip L
            WHERE L.vehicle = i AND L.day = d LIMIT 1;
          END IF;
          -- Determine the start time
          IF leisNo = 1 THEN
            t = d + time '09:00:00' + CreatePauseN(120);
          ELSE
            t = d + time '17:00:00' + CreatePauseN(120);
        END IF;
        END IF;
        -- Get the number of subtrips (number of destinations + 1)
        SELECT count(*) INTO noSubtrips
        FROM LeisureTrip L
        WHERE L.vehicle = i AND L.tripNo = leisNo AND L.day = d;
        FOR m IN 1..noSubtrips LOOP
          -- Get the source and destination nodes of the subtrip
          SELECT source, target INTO sourceNode, targetNode
          FROM LeisureTrip L
          WHERE L.vehicle = i AND L.day = d AND L.tripNo = leisNo AND L.seq = m;
          -- Get the path
          SELECT array_agg((geom, speed, category)::step ORDER BY seq) INTO path
          FROM Paths P
          WHERE vehicle = i AND start_vid = sourceNode AND end_vid = targetNode;
          trip = createTrip(path, t, disturbData);
          tripSeq = tripSeq + 1;
          INSERT INTO Trips VALUES
            (i, d, tripSeq, sourceNode, targetNode, trip, trajectory(trip));
          -- Add a delay time in [0, 120] min using a bounded Gaussian distribution
          t = endTimestamp(trip) + createPause();
        END LOOP;
      END LOOP;
      d = d + 1 * interval '1 day';
    END LOOP;
  END LOOP;
  RETURN;
END; $$

We create a type step which is a record composed of the geometry, the maximum speed, and the category of an edge. The procedure loops for each vehicle and each day and calls the procedure createTrip for creating the trips. If the day is a weekday, we generate the trips from home to work and from work to home starting, respectively, at 8 am and 4 pm plus a random non-zero duration of 120 minutes using a uniform distribution. We then generate the leisure trips. During the week days, the possible evening leisure trip starts at 8 pm plus a random random non-zero duration of 90 minutes, while during the weekend days, the two possible morning and afternoon trips start, respectively, at 9 am and 5 pm plus a random non-zero duration of 120 minutes. Between the multiple destinations of a leisure trip we add a delay time of maximum 120 minutes using a bounded Gaussian distribution.

Finally, we explain the procedure that create a trip.

CREATE OR REPLACE FUNCTION createTrip(edges step[], startTime timestamptz,
  disturbData boolean)
RETURNS tgeompoint LANGUAGE plpgsql STRICT AS $$
DECLARE
  /* Declaration of variables and parameters ... */
BEGIN
  srid = ST_SRID((edges[1]).linestring);
  p1 = ST_PointN((edges[1]).linestring, 1); x1 = ST_X(p1); y1 = ST_Y(p1);
  curPos = p1; t = startTime;
  instants[1] = tgeompoint_inst(p1, t);
  curSpeed = 0; l = 2; noEdges = array_length(edges, 1);
  -- Loop for every edge
  FOR i IN 1..noEdges LOOP
    -- Get the information about the current edge
    linestring = (edges[i]).linestring; maxSpeedEdge = (edges[i]).maxSpeed;
    category = (edges[i]).category;
    -- Determine the number of segments
    SELECT array_agg(geom ORDER BY path) INTO points
    FROM ST_DumpPoints(linestring);
    noSegs = array_length(points, 1) - 1;
    -- Loop for every segment
    FOR j IN 1..noSegs LOOP
      p2 = points[j + 1]; x2 = ST_X(p2); y2 = ST_Y(p2);
      -- If there is a segment ahead in the current edge compute the angle of the turn
      -- and the maximum speed at the turn depending on this angle
      IF j < noSegs THEN
        p3 = points[j + 2];
        alpha = degrees(ST_Angle(p1, p2, p3));
        IF abs(mod(alpha::numeric, 360.0)) < P_EPSILON THEN
          maxSpeedTurn = maxSpeedEdge;
        ELSE
          maxSpeedTurn = mod(abs(alpha - 180.0)::numeric, 180.0) / 180.0 * maxSpeedEdge;
        END IF;
      END IF;
      -- Determine the number of fractions
      segLength = ST_Distance(p1, p2);
      IF segLength < P_EPSILON THEN
        RAISE EXCEPTION 'Segment % of edge % has zero length', j, i;
      END IF;
      fraction = P_EVENT_LENGTH / segLength;
      noFracs = ceiling(segLength / P_EVENT_LENGTH);
      -- Loop for every fraction
      k = 1;
      WHILE k < noFracs LOOP
        -- If the current speed is zero, apply an acceleration event
        IF curSpeed <= P_EPSILON_SPEED THEN
          -- If we are not approaching a turn
          IF k < noFracs THEN
            curSpeed = least(P_EVENT_ACC, maxSpeedEdge);
          ELSE
            curSpeed = least(P_EVENT_ACC, maxSpeedTurn);
          END IF;
        ELSE
          -- If the current speed is not zero, apply a deceleration or a stop event
          -- with a probability proportional to the maximun speed
          IF random() <= P_EVENT_C / maxSpeedEdge THEN
            IF random() <= P_EVENT_P THEN
              -- Apply a stop event
              curSpeed = 0.0;
            ELSE
              -- Apply a deceleration event
              curSpeed = curSpeed * random_binomial(20, 0.5) / 20.0;
            END IF;
          ELSE
            -- Otherwise, apply an acceleration event
            IF k = noFracs AND j < noSegs THEN
              maxSpeed = maxSpeedTurn;
            ELSE
              maxSpeed = maxSpeedEdge;
            END IF;
            curSpeed = least(curSpeed + P_EVENT_ACC, maxSpeed);
          END IF;
        END IF;
        -- If speed is zero add a wait time
        IF curSpeed < P_EPSILON_SPEED THEN
          waitTime = random_exp(P_DEST_EXPMU);
          IF waitTime < P_EPSILON THEN
            waitTime = P_DEST_EXPMU;
          END IF;
          t = t + waitTime * interval '1 sec';
        ELSE
          -- Otherwise, move current position towards the end of the segment
          IF k < noFracs THEN
            x = x1 + ((x2 - x1) * fraction * k);
            y = y1 + ((y2 - y1) * fraction * k);
            IF disturbData THEN
              dx = (2 * P_GPS_STEPMAXERR * rand()) - P_GPS_STEPMAXERR;
              dy = (2 * P_GPS_STEPMAXERR * rand()) - P_GPS_STEPMAXERR;
              errx = errx + dx; erry = erry + dy;
              IF errx > P_GPS_TOTALMAXERR THEN
                errx = P_GPS_TOTALMAXERR;
              END IF;
              IF errx < - 1 * P_GPS_TOTALMAXERR THEN
                errx = -1 * P_GPS_TOTALMAXERR;
              END IF;
              IF erry > P_GPS_TOTALMAXERR THEN
                erry = P_GPS_TOTALMAXERR;
              END IF;
              IF erry < -1 * P_GPS_TOTALMAXERR THEN
                erry = -1 * P_GPS_TOTALMAXERR;
              END IF;
              x = x + dx; y = y + dy;
            END IF;
            curPos = ST_SetSRID(ST_Point(x, y), srid);
            curDist = P_EVENT_LENGTH;
          ELSE
            curPos = p2;
            curDist = segLength - (segLength * fraction * (k - 1));
          END IF;
          travelTime = (curDist / (curSpeed / 3.6));
          IF travelTime < P_EPSILON THEN
            travelTime = P_DEST_EXPMU;
          END IF;
          t = t + travelTime * interval '1 sec';
          k = k + 1;
        END IF;
        instants[l] = tgeompoint_inst(curPos, t);
        l = l + 1;
      END LOOP;
      p1 = p2; x1 = x2; y1 = y2;
    END LOOP;
    -- If we are not already in a stop, apply a stop event with a probability
    -- depending on the category of the current edge and the next one (if any)
    IF curSpeed > P_EPSILON_SPEED AND i < noEdges THEN
      nextCategory = (edges[i + 1]).category;
      IF random() <= P_DEST_STOPPROB[category][nextCategory] THEN
        curSpeed = 0;
        waitTime = random_exp(P_DEST_EXPMU);
        IF waitTime < P_EPSILON THEN
          waitTime = P_DEST_EXPMU;
        END IF;
        t = t + waitTime * interval '1 sec';
        instants[l] = tgeompoint_inst(curPos, t);
        l = l + 1;
      END IF;
    END IF;
  END LOOP;
  RETURN tgeompoint_seq(instants, true, true, true);
END; $$

The procedure receives as first argument a path from a source to a destination node, which is an array of triples composed of the geometry, the maximum speed, and the category of an edge of the path. The other arguments are the timestamp at which the trip starts and a Boolean value determining whether the points composed the trip are disturbed to simulate GPS errors. The output of the function is a temporal geometry point following this path. The procedure loops for each edge of the path and determines the number of segments of the edge, where a segment is a straight line defined by two consecutive points. For each segment, we determine the angle between the current segment and the next one (if any) to compute the maximum speed at the turn. This is determined by multiplying the maximum speed of the segment by a factor proportional to the angle so that the factor is 1.00 at both 0° and 360° and is 0.0 at 180°. Examples of values of degrees and the associated factor are given next.

0: 1.00, 5: 0.97, 45: 0.75, 90: 0.50, 135: 0.25, 175: 0.03
180: 0.00, 185: 0.03, 225: 0.25, 270: 0.50, 315: 0.75, 355: 0.97, 360: 0.00

Each segment is divided in fractions of length P_EVENT_LENGTH, which is by default 5 meters. We then loop for each fraction and choose to add one event that can be an acceleration, a deceleration, or a stop event. If the speed of the vehicle is zero, only an accelation event can happen. For this, we increase the current speed with the value of P_EVENT_ACC, which is by default 12 Km/h, and verify that the speed is not greater than the maximum speed of either the edge or the next turn for the last fraction. Otherwise, if the current speed is not zero, we apply a deceleration or a stop event with a probability proportional to the maximum speed of the edge, otherwise we apply an acceleration event. After applying the event, if the speed is zero we add a waiting time with a random exponential distribution with mean P_DEST_EXPMU, which is by default 1 second. Otherwise, we move the current position towards the end of the segment and, depending on the variable disturbData, we disturbe the new position to simulate GPS errors. The timestamp at which the vehicle reaches the new position is determined by dividing the distance traversed by the current speed. Finally, at the end of each segment, if the current speed is not zero, we add a stop event depending on the categories of the current segment and the next one. This is determined by a transition matrix given by the parameter P_DEST_STOPPROB.