Local Database

Google Safe Browsing v5 expects the client to maintain a local database, except when the client chooses the No-Storage Real-Time Mode. It is up to the client the format and storage of this local database. The contents of this local database can conceptually be thought of as a folder containing various lists as files, and the contents of these files are SHA256 hashes, or their corresponding prefixes with four byte hash prefix being the most commonly used hash length.

Available Lists

UPDATE: The following table consists of newly introduced lists, which follows a naming convention where the name contains a suffix that signifies the length of the hash you should expect in the list. The hashList method initially supported multiple hash lengths, with an option for the client to request for a particular hash length by specifying a query-param desiredHashLength. We are replacing this with a more cleaner API surface in order to support single hash length, instead of multiple. Hash lists with the same threat type but different hash length will be a separately named list that's qualified with a suffix that indicates the hash length. With this change we will be deprecating the desiredHashLength field in both hashList.get and the hashLists.batchGet methods, as there will only be a single hash length supported for a given list. An update to the hashList.hashListMetadata fields to reflect this change can be expected.

The following lists are recommended for use in v5alpha1:

List Name Corresponding v4 ThreatType Enum Description
gc-32b None This list is a Global Cache list. It is a special list only used in the Real-Time mode of operation.
se-4b SOCIAL_ENGINEERING This list contains threats of the SOCIAL_ENGINEERING threat type.
mw-4b MALWARE This list contains threats of the MALWARE threat type for desktop platforms.
uws-4b UNWANTED_SOFTWARE This list contains threats of the UNWANTED_SOFTWARE threat type for desktop platforms.
uwsa-4b UNWANTED_SOFTWARE This list contains threats of the UNWANTED_SOFTWARE threat type for Android platforms.
pha-4b POTENTIALLY_HARMFUL_APPLICATION This list contains threats of the POTENTIALLY_HARMFUL_APPLICATION threat type for Android platforms.

Additional lists will become available at a later date, at which time the above table will be expanded.

Database Updates

The client will regularly call the hashList.get method or the hashLists.batchGet method to update the database. Since the typical client will want to update multiple lists at a time, it is recommended to use hashLists.batchGet method.

Lists are identified by their distinct names. The names are short ASCII strings a few characters long.

After the API has been GA'd, the list names will never be renamed. Furthermore, once a list has appeared, it will never be removed (if the list is no longer useful, it will become empty but will continue to exist). Therefore, it is appropriate to hard code these names in the Google Safe Browsing client code.

Both the hashList.get method and the hashLists.batchGet method support incremental updates. Using incremental updates saves bandwidth and improves performance. Incremental updates work by delivering a delta between client's version of the list and the latest version of the list. (If a client is newly deployed and does not have any versions available, a full update is available.) The incremental update contains removal indices and additions. The client is first expected to remove the entries at the specified indices from its local database, and then apply the additions.

Finally, to prevent corruption, the client should check the stored data against the checksum provided by the server. Whenever the checksum does not match, the client should perform a full update.

Decoding the List Content

Decoding Hashes and Hash Prefixes

All lists are delivered using a special encoding to reduce size. This encoding works by recognizing that Google Safe Browsing lists contain, conceptually, a set of hashes or hash prefixes, which are statistically indistinguishable from random integers. If we were to sort these integers and take their adjacent difference, such adjacent difference is expected to be "small" in a sense. Golomb-Rice encoding then exploits this smallness.

Suppose that three host-suffix path-prefix expressions, namely a.example.com/, b.example.com/, and y.example.com/, are to be transmitted using 4-byte hash prefixes. Further suppose that the Rice parameter, denoted by k, is chosen to be

  1. The server would start by calculating the full hash for these strings, which are, respectively:
291bc5421f1cd54d99afcc55d166e2b9fe42447025895bf09dd41b2110a687dc  a.example.com/
1d32c5084a360e58f1b87109637a6810acad97a861a7769e8f1841410d2a960c  b.example.com/
f7a502e56e8b01c6dc242b35122683c9d25d07fb1f532d9853eb0ef3ff334f03  y.example.com/

The server then forms 4-byte hash prefixes for each of the above, which is the first 4 bytes of the 32-byte full hash, interpreted as big-endian 32-bit integers. The big endianness refers to the fact that the first byte of the full hash becomes the most significant byte of the 32-bit integer. This step results in the integers 0x291bc542, 0x1d32c508, and 0xf7a502e5.

It is necessary for the server to sort these three hash prefixes lexicographically (equivalent to numerical sorting in big endian), and the result of the sorting is 0x1d32c508, 0x291bc542, 0xf7a502e5. The first hash prefix is stored unchanged in the first_value field.

The server then calculates the two adjacent differences, which are 0xbe9003a and 0xce893da3 respectively. Given that k is chosen to be 30, the server splits these two numbers into the quotient parts and remainder parts that are 2 and 30 bits long respectively. For the first number, the quotient part is zero and the remainder is 0xbe9003a; for the second number, the quotient part is 3 because the most significant two bits are 11 in binary and the remainder is 0xe893da3. For a given quotient q it is encoded into (1 << q) - 1 using exactly 1 + q bits; the remainder is encoded directly using k bits. The quotient part of the first number is encoded as 0, and the remainder part is in binary 001011111010010000000000111010; the quotient part of the second number is encoded as 0111, and the remainder part is 001110100010010011110110100011.

When these numbers are formed into a byte string, little endian is used. Conceptually it may be easier to imagine a long bitstring being formed starting from the least significant bits: we take the quotient part of the first number and prepend the remainder part of the first number; we then further prepend the quotient part of the second number and prepend the remainder part. This should result in the following large number (linebreaks and comments added for clarity):

001110100010010011110110100011 # Second number, remainder part
0111 # Second number, quotient part
001011111010010000000000111010 # First number, remainder part
0 # First number, quotient part

Written in a single line this would be

00111010001001001111011010001101110010111110100100000000001110100

Obviously this number far exceeds the 8 bits available in a single byte. The little endian encoding then takes the least significant 8 bits in that number, and outputs it as the first byte which is 01110100. For clarity, we can group the above bitstring into groups of eight starting from the least significant bits:

0 01110100 01001001 11101101 00011011 10010111 11010010 00000000 01110100

The little endian encoding then takes each byte from the right and puts that into a bytestring:

01110100
00000000
11010010
10010111
00011011
11101101
01001001
01110100
00000000

It can be seen that since we conceptually prepend new parts to the large number on the left (i.e. adding more significant bits) but we encode from the right (i.e. the least significant bits), the encoding and decoding can be performed incrementally.

This finally results in

additions_four_bytes {
  first_value: 489866504
  rice_parameter: 30
  entries_count: 2
  encoded_data: "t\000\322\227\033\355It\000"
}

The client simply follows the above steps in reverse to decode the hash prefixes.

Decoding Removal Indices

Removal indices are encoded using the exact same technique as above using 32-bit integers.

Update Frequency

The client should inspect the server's returned value in the field minimum_wait_duration and use that to schedule the next update of the database. This value is possibly zero (the field minimum_wait_duration is completely missing), in which case the client SHOULD immediately perform another update.