Architecture

http://mozilla-push-service.readthedocs.io/en/latest/assets/push_architecture.svg

Overview

For Autopush, we will focus on the part of the above diagram inside the Autopush square.

Autopush consists of two types of server daemons:

autopush (connection node)

Run a connection node. These handle large numbers of user agents (Firefox) using the WebSocket protocol.

autoendpoint (endpoint node)

Run an endpoint node. These provide a WebPush HTTP API for Application Servers to HTTP POST messages to endpoints.

To have a running Push Service for Firefox, both of these server daemons must be running and communicating with the same DynamoDB tables. Either a local DynamoDB or AWS DynamoDB can be used.

Endpoint nodes handle all Notification POST requests, looking up the UAID in DynamoDB to see which Push server (connection node) it is connected to. The Endpoint nodes then attempt delivery to the appropriate connection node. If the UAID is not online, the message may be stored in DynamoDB in the appropriate message table.
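
The delivery flow described above can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not Autopush's actual code: the `router` and `messages` dictionaries stand in for the DynamoDB tables, and `deliver_to_node` is a hypothetical callable representing the request an endpoint node makes to a connection node.

```python
from typing import Callable, Dict, List

# Stand-ins for the DynamoDB router and message tables (assumptions).
Router = Dict[str, dict]           # uaid -> router record
Messages = Dict[str, List[bytes]]  # uaid -> stored payloads


def route_notification(
    uaid: str,
    payload: bytes,
    router: Router,
    messages: Messages,
    deliver_to_node: Callable[[str, str, bytes], bool],
) -> str:
    """Return "delivered", "stored", or "unknown" for one Notification POST."""
    record = router.get(uaid)
    if record is None:
        return "unknown"  # no router record for this UAID

    node_id = record.get("node_id")
    if node_id and deliver_to_node(node_id, uaid, payload):
        return "delivered"  # client was online; storage was bypassed

    # Client offline (or delivery failed): keep the message for later.
    messages.setdefault(uaid, []).append(payload)
    return "stored"
```

A real endpoint node would also validate the request and apply the TTL; those details are left out to keep the routing decision visible.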

Push connection nodes accept WebSocket connections (this can easily be HTTP/2 for WebPush), and deliver notifications to connected clients. They check DynamoDB for missed notifications as necessary.

Many more Push servers will be needed to handle client connections, while Endpoint nodes can be added as needed for notification throughput.

Cryptography

The HTTP endpoint URLs generated by the connection nodes contain encrypted information: the UAID and Subscription to send the message to. This means that both the connection nodes and the endpoint nodes must be supplied with the same CRYPTO_KEY.

See make_endpoint() for the endpoint URL generator.

If you are only running Autopush locally, you can skip to Running Autopush as later topics in this document apply only to developing or production scale deployments of Autopush.

DynamoDB Tables

Autopush uses a single router table and multiple messages tables, one for each month of the year. On startup, Autopush will create the router table and a message table for the prior month and the current month of the year.

For more information on DynamoDB tables, see http://docs.aws.amazon.com/amazondynamodb/latest/gettingstartedguide/Welcome.html

Router Table Schema

The router table stores metadata for a given UAID as well as which month table should be used for clients with a router_type of webpush.

For Bridging, additional bridge-specific data may be stored in the router record for a UAID.

uaid          partition key - UAID
router_type   Router Type
node_id       Hostname of the connection node the client is connected to.
connected_at  Precise time (in milliseconds) the client connected to the node.
last_connect  global secondary index - year-month-hour that the client has last connected.
curmonth      Message table name to use for storing WebPush messages.

Autopush uses an optimistic deletion policy for node_id to avoid delete calls when not needed. During a delivery attempt, the endpoint node checks the node_id on record for the corresponding UAID. If it discovers that the client is not connected to that node, it clears the node_id record for that UAID in the router table.
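
The optimistic clearing of node_id can be sketched as a conditional update. This is an illustration only: the `router` dictionary stands in for the router table, and using both `node_id` and `connected_at` as the condition is an assumption made here to model "the record has not changed since the endpoint read it".

```python
from typing import Dict


def clear_node_id(
    router: Dict[str, dict], uaid: str, node_id: str, connected_at: int
) -> bool:
    """Clear node_id only if the record still matches what the endpoint read.

    Mimics a conditional write: if the client has reconnected since the
    endpoint read the record, the condition fails and nothing is changed.
    """
    record = router.get(uaid)
    if record is None:
        return False
    if record.get("node_id") != node_id or record.get("connected_at") != connected_at:
        return False  # client reconnected elsewhere; leave the record alone
    del record["node_id"]
    return True
```

A `False` result is the interesting case: it signals that the client has connected again in the meantime, so the caller should re-read the router record rather than assume the client is offline.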

The last_connect field has a global secondary index on it so that maintenance scripts can locate and purge stale client records and messages.

Clients with a router_type of webpush drain stored messages from the message table named in curmonth after completing their initial handshake. If the curmonth entry is not the current month, the connection node updates it after stored message retrieval so that new messages are stored in the latest message table.

Message Table Schema

The message table stores messages for users while they’re offline or unable to get immediate message delivery.

uaid           partition key - UAID
chidmessageid  sort key - CHID + Message-ID.
chids          Set of CHID that are valid for a given user. This entry is only present in the item when chidmessageid is a space.
data           Payload of the message, provided in the Notification body.
headers        HTTP headers for the Notification.
ttl            Time-To-Live for the Notification.
timestamp      Time (in seconds) that the message was saved.
updateid       UUID generated when the message is stored to track if the message is updated between a client reading it and attempting to delete it.

The subscribed channels are stored as chids in a record whose chidmessageid is set to a single blank space. Before storing or delivering a Notification, a lookup is done against these chids.
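
To make the key layout concrete, here is a small sketch of the channel list record and the lookup against it. The dictionary is a stand-in for one monthly message table, the helper names are hypothetical, and the ":" separator used to build chidmessageid is an assumption for illustration; the source only says the sort key is "CHID + Message-ID".

```python
from typing import Dict, Tuple

# (uaid, chidmessageid) -> item; a stand-in for one monthly message table.
MessageTable = Dict[Tuple[str, str], dict]

CHANNEL_LIST_KEY = " "  # the blank-space sort key holding the chids set


def register_channel(table: MessageTable, uaid: str, chid: str) -> None:
    """Add a CHID to the user's channel list record, creating it if needed."""
    item = table.setdefault((uaid, CHANNEL_LIST_KEY), {"chids": set()})
    item["chids"].add(chid)


def is_valid_channel(table: MessageTable, uaid: str, chid: str) -> bool:
    """Check a CHID against the channel list before storing or delivering."""
    item = table.get((uaid, CHANNEL_LIST_KEY))
    return item is not None and chid in item["chids"]


def sort_key(chid: str, message_id: str) -> str:
    """Build a chidmessageid value (the ":" separator is an assumption)."""
    return f"{chid}:{message_id}"
```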

Message Table Rotation

To avoid costly table scans, autopush uses a rotating message and router table. Clients that haven’t connected in 30-60 days will have their router and message table entries dropped and need to re-register.

Tables are suffixed with the year/month they are meant for, e.g.:

messages-2015-02

Tables must be created and have their read/write units properly allocated by a separate process in advance of the month switch-over, as autopush nodes will assume the tables already exist. Scripts are provided that can be run weekly to ensure that all necessary tables are present and that sufficiently old tables are dropped.

Within a few days of the new month, the load on the prior month’s table will fall as clients transition to the new table. The read/write units on the prior month may then be lowered.
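
As an illustration of the naming scheme, the following sketch computes table names for a given date. The function names and the "messages" default prefix are hypothetical, and the choice to prepare the prior, current, and next month is an assumption about what a maintenance script might do, not a description of the provided scripts.

```python
import datetime
from typing import List


def message_table_name(day: datetime.date, prefix: str = "messages") -> str:
    """Return the table name for the month containing `day`."""
    return f"{prefix}-{day.year:04d}-{day.month:02d}"


def shift_month(day: datetime.date, delta: int) -> datetime.date:
    """Return the first day of the month `delta` months away from `day`."""
    index = day.year * 12 + (day.month - 1) + delta
    return datetime.date(index // 12, index % 12 + 1, 1)


def tables_to_ensure(today: datetime.date) -> List[str]:
    """List the tables that should exist: prior, current, and next month.

    Preparing the next month ahead of time is an assumption made here so
    that the table exists before the switch-over.
    """
    return [message_table_name(shift_month(today, d)) for d in (-1, 0, 1)]
```

For example, `message_table_name(datetime.date(2015, 2, 14))` yields `messages-2015-02`, matching the name shown above.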

Message Table Interaction Rules

Due to the complexity of having notifications spread across two tables, several rules are used to avoid losing messages during the month transition.

The logic for connection nodes is more complex, since only the connection node knows when the client connects, and how many messages it has read through.

The router table uses the curmonth field to indicate the last month the client has read notifications through. This is independent of the last_connect since it is possible for a client to connect, fail to read its notifications, then reconnect. This field is updated for a new month when the client connects after it has ack’d all the notifications out of the last month.

To avoid issues with time synchronization, the node the client is connected to acts as the source of truth for when the month has flipped over. Clients are only moved to the new table on connect, and only after reading/acking all the notifications for the prior month.

Rules for Endpoints

  1. Check the router table to see the current_month the client is on.

  2. Read the channel list entry from the appropriate month message table to see if it’s a valid channel.

    If it’s valid, move to step 3.

  3. Store the notification in the current month’s table if valid. (Note that this step does not copy the blank entry of valid channels)
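
The three endpoint steps above can be sketched as follows. This is a hedged illustration using in-memory dictionaries in place of DynamoDB: `tables` maps a table name to a message table, the router field is called `current_month` as in the rules above, and the ":" separator in the sort key is an assumption.

```python
from typing import Dict, Tuple

# One monthly message table: (uaid, chidmessageid) -> item.
MessageTable = Dict[Tuple[str, str], dict]


def store_notification(
    router: Dict[str, dict],
    tables: Dict[str, MessageTable],
    uaid: str,
    chid: str,
    message_id: str,
    payload: bytes,
) -> bool:
    """Store one notification; return False if the channel is not valid."""
    # Step 1: find the month table this client is currently on.
    table = tables[router[uaid]["current_month"]]

    # Step 2: the channel list lives under the blank-space sort key.
    channel_entry = table.get((uaid, " "), {})
    if chid not in channel_entry.get("chids", set()):
        return False

    # Step 3: store only the notification itself; the channel list entry
    # is not copied here. The ":" separator is an assumption.
    table[(uaid, f"{chid}:{message_id}")] = {"data": payload}
    return True
```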

Rules for Connection Nodes

After Identification:

  1. Check to see if the current_month matches the current month. If it does, proceed normally using the current month’s message table.

    If the connection node month does not match the stored current_month in the client’s router table entry, proceed to step 2.

  2. Read notifications from prior month and send to client.

    Once all ACKs are received for all the notifications for that month proceed to step 3.

  3. Copy the blank message entry of valid channels to the new month message table.

  4. Update the router table for the current_month.

During switchover, only after the router table update are new commands from the client accepted.
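
The switchover steps for connection nodes can be sketched as below. This is a simplified model under stated assumptions: dictionaries stand in for the tables, `send_and_await_acks` is a hypothetical callable that returns only once the client has ACKed every notification passed to it, and deleting acknowledged messages from the old table is an assumption about cleanup.

```python
from typing import Callable, Dict, List, Tuple

MessageTable = Dict[Tuple[str, str], dict]


def switch_month(
    router: Dict[str, dict],
    tables: Dict[str, MessageTable],
    uaid: str,
    this_month: str,
    send_and_await_acks: Callable[[List[dict]], None],
) -> bool:
    """Move a client to this month's table; return True if a switch happened."""
    record = router[uaid]
    stored_month = record["current_month"]

    # Step 1: nothing to do if the client is already on the current month.
    if stored_month == this_month:
        return False
    old_table, new_table = tables[stored_month], tables[this_month]

    # Step 2: send the prior month's notifications and wait for all ACKs.
    pending = [key for key in old_table if key[0] == uaid and key[1] != " "]
    send_and_await_acks([old_table[key] for key in pending])
    for key in pending:
        del old_table[key]  # assumed cleanup once acknowledged

    # Step 3: copy the blank-key channel list entry to the new month's table.
    channel_entry = old_table.get((uaid, " "))
    if channel_entry is not None:
        new_table[(uaid, " ")] = {"chids": set(channel_entry["chids"])}

    # Step 4: only now point the router record at the new month.
    record["current_month"] = this_month
    return True
```

Updating the router record last means that a crash or disconnect before step 4 leaves `current_month` untouched, so the next connect starts the procedure again from step 1.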

Handling of Edge Cases:

  • Connection node gets more notifications during step 3, enough to buffer, such that the endpoint starts storing them in the previous current_month. In this case the connection node will check the old table, then the new table, to ensure it doesn’t lose messages during the switch.
  • Connection node dies, or client disconnects during step 3/4. Not a problem as the reconnect will pick it up at the right spot.

Push Characteristics

  • When the Push server has sent a client a notification, no further notifications will be accepted for delivery (except in one edge case). In this state, the Push server will reply to the Endpoint with a 503 to indicate it cannot currently deliver the notification. Once the Push server has received ACKs for all sent notifications, new notifications can flow again, and a check of storage will be done if the Push server had to reply with a 503. The Endpoint will put the Notification in storage in this case.
  • (Edge Case) Multiple notifications can be sent at once, if a notification comes in during a Storage check, but before it has completed.
  • If a connected client is able to accept a notification, then the Endpoint will deliver the message to the client completely bypassing Storage. This Notification will be referred to as a Direct Notification vs. a Stored Notification.
  • Provisioned Write Throughput for the Router table determines how many connections per second can be accepted across the entire cluster.
  • Provisioned Read Throughput for the Router table and Provisioned Write Throughput for the Storage table determine the maximum possible notifications per second that can be handled. In theory, notification throughput can be higher than Provisioned Write Throughput on the Storage table, as connected clients will frequently not require using Storage at all. Reads to the Router table are still needed for every notification, whether Storage is hit or not.
  • Provisioned Read Throughput for the Storage table is an important factor in maximum notification throughput, as many slow clients may require frequent Storage checks.
  • If a client is reconnecting, their Router record will be old. Router records have the node_id cleared optimistically by Endpoints when the Endpoint discovers it cannot deliver the notification to the Push node on file. If the conditional delete fails, it implies that the client has managed to connect somewhere again during this period. It’s entirely possible that the client has reconnected and checked storage before the Endpoint stored the Notification. As a result, the Endpoint must read the Router table again and attempt to tell the node_id for that client to check storage. Further action isn’t required, since any more reconnects in this period will have seen the stored notification.
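
The first characteristic above (no new notifications are accepted while earlier ones are unacknowledged) can be modelled with a small state holder. This is a sketch only: the class and method names are hypothetical, the 200 status for an accepted notification is an assumption, and the edge case of a notification arriving during a Storage check is not modelled.

```python
class ClientConnection:
    """Tracks un-ACKed notifications for one connected client (illustrative)."""

    def __init__(self) -> None:
        self.unacked = set()
        self.needs_storage_check = False

    def accept(self, message_id: str) -> int:
        """Return the status the Push server gives the Endpoint."""
        if self.unacked:
            # Still waiting on ACKs: refuse, and remember that the Endpoint
            # will have put this notification in storage.
            self.needs_storage_check = True
            return 503
        self.unacked.add(message_id)
        return 200  # assumed success status

    def ack(self, message_id: str) -> bool:
        """Record an ACK; return True if storage should now be checked."""
        self.unacked.discard(message_id)
        if not self.unacked and self.needs_storage_check:
            self.needs_storage_check = False
            return True
        return False
```

For instance, accepting `m1` and then offering `m2` before the ACK gives a 503 for `m2`; the ACK for `m1` then returns `True`, signalling that a storage check is due.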

Push Endpoint Length

The Endpoint URL may seem excessively long. This may seem needless and confusing, since the URL consists of the unique User Agent Identifier (UAID) and the Subscription Channel Identifier (CHID). Both of these are version 4 Universally Unique Identifiers (UUIDs), meaning that an endpoint contains 256 bits of identifier data (2 * 128 bits), of which 244 bits are random. When used in string format, these UUIDs are always in lower case, dashed format (e.g. “01234567-0123-abcd-0123-0123456789ab”).
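
The size of the identifier pair can be checked with Python's standard uuid module. The helper names below are hypothetical, and concatenating the raw bytes is only one plausible way to combine the two identifiers; it is not claimed to be what make_endpoint() does.

```python
import uuid
from typing import Tuple


def pack_ids(uaid: uuid.UUID, chid: uuid.UUID) -> bytes:
    """Concatenate the two 16-byte UUIDs into a 32-byte (256-bit) value."""
    return uaid.bytes + chid.bytes


def unpack_ids(raw: bytes) -> Tuple[uuid.UUID, uuid.UUID]:
    """Split a 32-byte value back into the (UAID, CHID) pair."""
    if len(raw) != 32:
        raise ValueError("expected exactly 32 bytes")
    return uuid.UUID(bytes=raw[:16]), uuid.UUID(bytes=raw[16:])
```

Calling `str()` on a `uuid.UUID` produces the lower case, dashed form described above.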

Unfortunately, since the endpoint contains an identifier that can be easily traced back to a specific device, and therefore a specific user, there is the risk that a user might inadvertently disclose personal information via their metadata. To prevent this, the server obscures the UAID and CHID pair to prevent casual determination.

As an example, it is possible for a user to get a Push endpoint for two different accounts from the same User Agent. If the UAID were disclosed, then a site may be able to associate a single user with both of those accounts. In addition, storing the UAID and CHID in the URL makes operating the server more efficient.

Naturally, we’re always looking at ways to improve and reduce the length of the URL. Because the format and length may change, it’s important to store the entire endpoint URL rather than trying to shorten or otherwise optimize it.