Character Encoding Issues Caused by Enabling Gzip in Nginx

Problem: To optimize the backend architecture, an Nginx scheduling server was placed in front of the business server layer. Both the business servers and the scheduling server had gzip and caching enabled. Afterwards, a small number of mobile devices displayed garbled characters, and the garbled responses caused program errors that at first seemed random. After the scheduling server’s cache was disabled, all of the problems disappeared.

Initially, the suspicion was that improper gzip settings had corrupted the frontend cache, because the Nginx documentation mentions proxy cache corruption in the gzip_http_version section: “When HTTP version 1.0 is used, the Vary: Accept-Encoding header is not set. As this can lead to proxy cache corruption, consider adding it with add_header.”

Because the gzip configuration directives were poorly understood (the documentation is rather sloppy), the true cause went undetermined for a long time. Only after studying these directives in depth and experimenting repeatedly did the cause become clear.

gzip_http_version enables or disables gzip compression depending on the HTTP version of the request. (Compression is enabled if the request’s HTTP version is greater than or equal to the configured value.)

In the nginx gzip code, it was found that nginx determines whether to enable compression for a response based on several conditions:

/* http/modules/ngx_http_gzip_filter_module.c: 261 */
if (!r->gzip_tested) {
    if (ngx_http_gzip_ok(r) != NGX_OK) {
        return ngx_http_next_header_filter(r);
    }    

} else if (!r->gzip_ok) {
    return ngx_http_next_header_filter(r);
}

/* http/ngx_http_core_module.c: 1915 */
if (r->http_version < clcf->gzip_http_version) {
    return NGX_DECLINED;
} 

As the code above shows, if a request’s HTTP version is lower than the value set by gzip_http_version, nginx does not compress the response to that request.
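
The version gate can be modeled in a few lines. This is only a sketch in Python, not nginx code: the function name is hypothetical, and it assumes nginx’s internal encoding of HTTP versions as major * 1000 + minor (so HTTP/1.0 is 1000 and HTTP/1.1 is 1001) and a default gzip_http_version of 1.1.

```python
# Sketch of the gzip_http_version gate (a model, not nginx source).
# Assumption: versions are encoded as major * 1000 + minor.
HTTP_10 = 1000
HTTP_11 = 1001


def version_allows_gzip(request_version: int, gzip_http_version: int = HTTP_11) -> bool:
    """Return True when the request version is not below the configured minimum."""
    return request_version >= gzip_http_version
```

With the default setting, an HTTP/1.0 request is never compressed, while setting gzip_http_version to 1.0 lets it through.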

Vary: Accept-Encoding
gzip_vary directive:

Enable response header of “Vary: Accept-Encoding”.

So what exactly is the role of Vary: Accept-Encoding?

Looking at the source code, this directive affects little of the logic during request and response processing. It only determines whether Vary: Accept-Encoding is added in the final header filter.

/* ngx_http_header_filter_module.c: 397 */
#if (NGX_HTTP_GZIP)
    if (r->gzip_vary) {
        if (clcf->gzip_vary) {
            len += sizeof("Vary: Accept-Encoding" CRLF) - 1;

        } else {
            r->gzip_vary = 0;
        }   
    }   
#endif

From RFC 2616:

An HTTP/1.1 server SHOULD include a Vary header field with any cacheable response that is subject to server-driven negotiation. Doing so allows a cache to properly interpret future requests on that resource and informs the user agent about the presence of negotiation on that resource. […] A Vary field value consisting of a list of field-names signals that the representation selected for the response is based on a selection algorithm which considers ONLY the listed request-header field values in selecting the most appropriate representation. A cache MAY assume that the same selection will be made for future requests with the same values for the listed field names, for the duration of time for which the response is fresh.

From Stack Overflow:

In other words, Vary: Accept-Encoding tells the browser that two cacheable responses of the same resource will be the same even if the Accept-Encoding request is different (“varies”).

GET /js/somefile.js HTTP/1.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Vary: Accept-Encoding
Content-Encoding: gzip

This means that you’ll get the same script, no matter if you request compression or not.

From Stack Overflow:

It informs the behavior of the server with respect to caching the representation of the requested resource. If a new request for a previously cached resource is received, it will be served from the cache unless the Accept-Encoding header of the new request is different from the previously cached representation, at which point the request will be treated as a new request and will not be served from cache.

If you’re serving a compressed file from cache and the client doesn’t accept your compression mechanism they’ll get a page of junk, so it’s necessary.

If Gzipped version is in cache and a client does not accept GZIP, they’ll be served gobbledegook.

From Stack Overflow:

It is allowing the cache to serve up different cached versions of the page depending on whether or not the browser requests GZIP encoding or not. The Vary header instructs the cache to store a different version of the page if there is any variation in the indicated header.

As things stand, there will be one (possibly compressed) copy of the page in cache. Say it is the compressed version: if somebody requested the resource but does not support gzip encoding, they’ll be served the wrong content.
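
The effect of Vary on a cache can be pictured as a change in the cache key. The following Python sketch is an illustration only (the function name and the key layout are assumptions, not taken from any particular cache implementation): without Vary the key is the URL alone, and with Vary the listed request headers become part of the key.

```python
def cache_key(url, request_headers, vary=None):
    """Build a cache key from the URL plus the request headers named in Vary."""
    names = [n.strip().lower() for n in (vary or "").split(",") if n.strip()]
    headers = {k.lower(): v for k, v in request_headers.items()}
    # Each header listed in Vary contributes its request value to the key.
    return (url,) + tuple((n, headers.get(n, "")) for n in sorted(names))
```

Two clients that differ only in Accept-Encoding share one key when the response carries no Vary header, and get separate keys when it carries Vary: Accept-Encoding.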

Squid’s Handling of Vary: Accept-Encoding

Taobao Core System Team Blog: Regarding Squid requests to the origin server with Vary headers

The origin server’s response header does not include “Vary: Accept-Encoding”

Regardless of whether the client request header contains “Accept-Encoding: gzip, deflate”, Squid only caches one copy of the object.

If the first request that results in a Squid MISS includes “Accept-Encoding: gzip, deflate” in its header, Squid’s request to the origin server returns a gzip-compressed object, and Squid caches this compressed object. From then on, every client request receives this compressed object, regardless of whether its header includes “Accept-Encoding: gzip, deflate”.

If the first request that results in a Squid MISS does not include “Accept-Encoding: gzip, deflate” in its header, Squid’s request to the origin server returns an uncompressed object, and Squid caches this uncompressed object. From then on, every client request receives this uncompressed object, regardless of whether its header includes “Accept-Encoding: gzip, deflate”.

The origin server returns a response header with “Vary: Accept-Encoding”.

Squid will cache multiple copies of the object based on the “Accept-Encoding” value in the client request.

Squid requests the corresponding data (compressed or uncompressed) from the origin server based on the “Accept-Encoding” value in the client request, returns the response to the client, and stores a separate copy of the object in its local cache for each distinct “Accept-Encoding” value.

Conclusion: Under normal circumstances, to support “Vary: Accept-Encoding”, nginx needs to cache uncompressed data objects locally to handle requests with “Accept-Encoding: gzip, deflate” (nginx only supports the gzip compression algorithm) and requests without this header.

However, in the configuration where the problem described at the beginning occurred, both the scheduling server and the backend server had compression enabled. If the first request handled by the scheduling server does not include “Accept-Encoding”, the scheduling server requests uncompressed data from the application server and caches it. Subsequent requests are then served normally.

If the first request does include “Accept-Encoding”, the scheduling server retrieves compressed data from the application server and caches it locally. From then on, every client receives compressed data, regardless of whether it sent “Accept-Encoding”. In this case, all response headers resemble the following:

HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Mon, 16 Jul 2012 18:31:47 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.3.13
Expires: Mon, 16 Jul 2012 19:31:47 GMT
Cache-Control: max-age=3600
Content-Encoding: gzip

Browsers and SDKs that can handle gzip have no problem with this compressed data. However, a browser or SDK without gzip support (which is why it does not send “Accept-Encoding: gzip, deflate” in the first place) receives what looks like garbled text.
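
The failure can be reproduced in miniature. The Python sketch below is a toy model, not the real setup: origin and scheduler are hypothetical stand-ins for the application server and the scheduling server, and the cache is assumed to be keyed by URL only, as in the scenario described above.

```python
import gzip

PAGE = "<p>hello</p>"


def origin(accept_encoding):
    """Hypothetical application server: compresses only when the client asks for gzip."""
    body = PAGE.encode("utf-8")
    if "gzip" in accept_encoding:
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    return {}, body


cache = {}  # keyed by URL only, so Accept-Encoding is ignored


def scheduler(url, accept_encoding=""):
    """Hypothetical caching proxy that forwards Accept-Encoding on a cache miss."""
    if url not in cache:
        cache[url] = origin(accept_encoding)
    return cache[url]


# The first client supports gzip and fills the cache with a compressed body.
scheduler("/index", "gzip, deflate")
# The second client does not support gzip but still receives the compressed bytes.
headers, body = scheduler("/index")
garbled = body.decode("utf-8", errors="replace")
```

A client that decodes body as text ends up with garbled, which is not the original page; only a client that gunzips the body first recovers it.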

Solution

Objective: Ensure the scheduling server only caches uncompressed data.

  • Disable compression on the application server.
  • Add “Via: xxx” or “Accept-Encoding: ""” to the proxy request headers on the scheduling server.

If the value is an empty string, the header will not be sent to the upstream. For example, this setting can be used to disable gzip compression on the upstream.

proxy_set_header Accept-Encoding "";
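
The rule quoted above (an empty value means the header is not sent upstream) can be modeled as follows. This is a simplified Python sketch with a hypothetical function name; it treats header names as exact strings and does not reproduce nginx’s actual header handling.

```python
def upstream_headers(client_headers, overrides):
    """Apply proxy_set_header-style overrides; an empty value drops the header."""
    result = dict(client_headers)
    for name, value in overrides.items():
        if value == "":
            result.pop(name, None)   # empty string: do not send this header upstream
        else:
            result[name] = value     # otherwise: set or replace the header
    return result
```

Dropping Accept-Encoding means the application server never sees a request for gzip, and adding a Via header marks the request as coming from a proxy.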

Additional Question

Q. If the scheduling server has cached the compressed object and also has compression enabled, will a client that sends “Accept-Encoding: gzip, deflate” receive data that has been compressed twice?

A. To answer this question, we need to go back to the code in the gzip module that decides whether to compress. The earlier section on gzip_http_version did not quote the full set of conditions that the gzip filter checks.

if (!conf->enable
    || (r->headers_out.status != NGX_HTTP_OK
        && r->headers_out.status != NGX_HTTP_FORBIDDEN
        && r->headers_out.status != NGX_HTTP_NOT_FOUND)
    || (r->headers_out.content_encoding
        && r->headers_out.content_encoding->value.len)
    || (r->headers_out.content_length_n != -1
        && r->headers_out.content_length_n < conf->min_length)
    || ngx_http_test_content_type(r, &conf->types) == NULL
    || r->header_only)
{
    return ngx_http_next_header_filter(r);
}

r->gzip_vary = 1;

if (!r->gzip_tested) {
    if (ngx_http_gzip_ok(r) != NGX_OK) {
        return ngx_http_next_header_filter(r);
    }

} else if (!r->gzip_ok) {
    return ngx_http_next_header_filter(r);
}

When using cached data to respond to client requests, nginx reads all the response headers from the upstream cache file into the ngx_http_request_t::headers_out structure. Since the upstream returned compressed data, the headers are certain to contain “Content-Encoding: gzip”. Therefore, in the code above, the r->headers_out.content_encoding condition is true, and nginx skips the gzip module’s processing, so the data is not compressed a second time.
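
The early-exit condition quoted above can be restated as a small Python model. The function and parameter names are hypothetical, and the content type test is reduced to a set lookup, which is a simplification of ngx_http_test_content_type.

```python
def gzip_filter_applies(enabled, status, content_encoding, content_length,
                        min_length, content_type, types, header_only=False):
    """Mirror the early-exit condition quoted above; True means gzip may proceed."""
    if not enabled:
        return False
    if status not in (200, 403, 404):
        return False
    if content_encoding:  # e.g. "gzip" restored from the cache file
        return False
    if content_length != -1 and content_length < min_length:
        return False
    if content_type not in types:  # simplified content type test
        return False
    if header_only:
        return False
    return True
```

A cached response that already carries Content-Encoding: gzip fails the third check, so the filter passes it on unchanged.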

Q. Why does Via prevent the application server from compressing the response? A. Via is the header field that the gzip_proxied directive keys on. The documentation for that directive is ambiguous (or is my English too poor?).

gzip_proxied off | expired | no-cache | no-store | private | no_last_modified | no_etag | auth | any

It allows or disallows the compression of the response of for the proxy request in the dependence on the request and the response. The fact that, request proxy, is determined on the basis of line “Via” in the headers of request.

Its default value is off. This is also the setting I use in the scheduling server and the business server.

Another explanation

As long as client request was identified as one came from proxy server (“Via” header present) nginx is able to disable or enable it’s own gzip depending on various conditions. These conditions are controlled via gzip_proxied directive.

This explanation is very clear: requests that carry a “Via” header are treated by nginx as requests coming from another proxy. Whether to enable gzip for such a request, and under what conditions, is controlled by gzip_proxied. The default is not to compress the response to any proxied request.

Looking back at the official documentation, describing it as “proxy request” is more appropriate.

So, what does the code for controlling compression using “Via” look like?

/* ngx_http_core_module.c:1919, ngx_http_gzip_ok */
if (r->headers_in.via == NULL) {
    goto ok;
}

p = clcf->gzip_proxied;

if (p & NGX_HTTP_GZIP_PROXIED_OFF) {
    return NGX_DECLINED; 
}

if (p & NGX_HTTP_GZIP_PROXIED_ANY) {
    goto ok;
}
/* ... */

ngx_http_request_t::headers_in stores the header fields of the request, and ngx_http_request_t::headers_out stores those of the response. If the “Via” field is present in the request, nginx consults the gzip_proxied configuration to decide whether the response to this request should be compressed.
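
The three branches shown in the excerpt can be sketched in Python. The flag values below are placeholders chosen for illustration, not nginx’s constants, and the branches elided in the excerpt are deliberately left unmodeled.

```python
PROXIED_OFF = 0x1  # hypothetical flag values, not nginx's constants
PROXIED_ANY = 0x2


def via_check(via, gzip_proxied=PROXIED_OFF):
    """Model the part of ngx_http_gzip_ok shown above that looks at Via."""
    if via is None:
        return "ok"  # not a proxied request, gzip_proxied is not consulted
    if gzip_proxied & PROXIED_OFF:
        return "declined"
    if gzip_proxied & PROXIED_ANY:
        return "ok"
    return "depends on the remaining gzip_proxied flags"
```

With the default of off, any request that carries a Via header is answered uncompressed, which is why adding Via on the scheduling server keeps the application server from compressing.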

Q. Which Android SDKs do not support gzip decompression? A. ???

Q. What is the difference between the inactive value in proxy_cache_path and the time value in proxy_cache_valid? A. The proxy_cache_valid directive specifies how long a response is considered valid (and is returned without any request to the backend). After this time the response is considered “stale” and is either not returned or still returned, depending on the proxy_cache_use_stale setting.

The inactive argument of proxy_cache_path specifies how long a response is kept in the cache after its last use. Note that even stale responses count as recently used if there are requests for them.
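
The two timers can be contrasted with a small model. This Python sketch uses hypothetical names and plain seconds, and the behavior exactly at the boundaries is an assumption rather than something taken from nginx.

```python
def cache_state(now, stored_at, last_used_at, valid, inactive):
    """Classify a cache entry using the two timers described above."""
    if now - last_used_at > inactive:
        return "removed"  # unused for longer than `inactive`
    if now - stored_at < valid:
        return "fresh"    # still within proxy_cache_valid
    return "stale"        # kept in the cache, but past its validity
```

An entry that is requested often stays in the cache well past its validity period and simply becomes stale, while an entry that nobody requests is removed once the inactive time has passed.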

Q. What is the difference between Cache-Control and Expires? A. Expires is defined by HTTP/1.0, and Cache-Control is defined by HTTP/1.1. max-age is just a straight integer number of seconds, while Expires has a somewhat complex date format. Even small errors in generating the Expires value can cause downstream caches to misinterpret it. It happens more often than you think.

Cache-Control was introduced in HTTP/1.1 and offers more options than Expires. They can be used to accomplish the same thing but the data value for Expires is an HTTP date whereas Cache-Control max-age lets you specify a relative amount of time so you could specify “X hours after the page was requested”.

To sum up though, Expires is recommended for static resources like images and Cache-Control when you need more control over how caching is done.
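
The response headers shown earlier carry both forms of the same lifetime, which makes a convenient check. The Python sketch below parses the Date and Expires values from that sample and compares the result with its max-age; the variable names are only illustrative.

```python
from email.utils import parsedate_to_datetime

# Values copied from the sample response headers shown earlier.
date = parsedate_to_datetime("Mon, 16 Jul 2012 18:31:47 GMT")
expires = parsedate_to_datetime("Mon, 16 Jul 2012 19:31:47 GMT")

# Expires is an absolute date, so the lifetime is the difference from Date.
lifetime_from_expires = int((expires - date).total_seconds())
# max-age already is a relative number of seconds.
lifetime_from_max_age = int("max-age=3600".split("=", 1)[1])
```

Both come out as the same one-hour lifetime, but the Expires form needs two correctly formatted dates to get there.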

Q. What does proxy_ignore_headers do, and how is it used? A. Cache-related headers from the upstream take priority over the proxy_cache_valid value. In particular, the order is:

  • X-Accel-Expires
  • Expires/Cache-Control
  • proxy_cache_valid

The order in which your backend returns HTTP headers changes cache behavior. You may ignore the headers using

proxy_ignore_headers X-Accel-Expires Expires Cache-Control;

proxy_ignore_headers determines whether nginx uses the upstream headers or proxy_cache_valid to decide how long to cache the response.
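
The priority order can be written down as a lookup. This Python sketch is a deliberately simplified model with hypothetical names: lifetimes are passed in as seconds, only Cache-Control max-age stands in for the Expires/Cache-Control tier, and the header-order effects mentioned above are not modeled.

```python
def cache_lifetime(x_accel_expires=None, upstream_max_age=None,
                   proxy_cache_valid=None, ignore=()):
    """Pick the cache lifetime in seconds following the priority listed above."""
    if x_accel_expires is not None and "X-Accel-Expires" not in ignore:
        return x_accel_expires
    if upstream_max_age is not None and "Cache-Control" not in ignore:
        return upstream_max_age
    # Nothing usable came from the upstream: fall back to the local setting.
    return proxy_cache_valid
```

Listing all three headers in ignore leaves proxy_cache_valid as the only source of the lifetime.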

Separate from that, you can use proxy_hide_header to tell nginx not to send some headers that came from upstream, to the client.

Separate from that (mostly), you can use expires to tell nginx how to set the Expires and Cache-Control headers in the response to the client. The “(mostly)” is there because nginx sends only a single Expires header, so if you use expires to set one, the one from upstream will not reach the client, even if it is not listed in proxy_hide_header.

To explicitly set Cache-Control/Expires headers, use the expires directive.

Q. What does expires do? A. It controls whether the response should be marked with an expiry time, and if so, what time that is.

  • off prevents changes to the Expires and Cache-Control headers.
  • epoch sets the Expires header to 1 January, 1970 00:00:01 GMT.
  • max sets the Expires header to 31 December 2037 23:55:55 GMT, and the Cache-Control max-age to 10 years.
  • A time without an @ prefix specifies an expiry time relative to either the response time (if the time is not preceded with “modified”) or the file’s modification time (when “modified” is present). A negative time can be specified, which sets the Cache-Control header to no-cache.
  • Times written with an @ prefix represent an absolute time-of-day expiry, written in either the form Hh or Hh:Mm, where H ranges from 0 to 24, and M ranges from 0 to 59.

A non-negative time or time-of-day sets the Cache-Control header to max-age=#, where # is the appropriate time in seconds.
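
For the relative-time case, the resulting headers can be sketched in Python. The function name is hypothetical, now is a Unix timestamp supplied by the caller, and emitting an Expires value in the past for negative times is an assumption (the text above only specifies the Cache-Control value for that case).

```python
from email.utils import formatdate


def expires_headers(seconds, now):
    """Headers for a relative `expires` time; `now` is a Unix timestamp."""
    headers = {"Expires": formatdate(now + seconds, usegmt=True)}
    if seconds < 0:
        headers["Cache-Control"] = "no-cache"
    else:
        headers["Cache-Control"] = "max-age=%d" % seconds
    return headers
```

For example, expires_headers(3600, now) describes a response that may be cached for one hour from now.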

Note: expires works only for 200, 204, 301, 302, and 304 responses.

Q. Is it truly effective to use different cache times for the scheduling server and the business server? A. The answer can be found above.

misc

  • Expires/Cache-Control controls whether the browser retrieves data directly from its cache or resends a request to the server. Cache-Control offers more control than Expires.
  • Last-Modified/If-Modified-Since and ETag/If-None-Match determine whether a file has been modified after the browser sends a request to the server. If the file hasn’t been modified, the server returns a 304 status. If it has been modified, the server will resend the data to the browser.
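
The revalidation step in the second point can be sketched as a server-side decision. This Python model uses a hypothetical function name and compares validators as exact strings, which is a simplification (real servers compare dates and handle lists of entity tags).

```python
def conditional_status(request_headers, etag=None, last_modified=None):
    """Return 304 when the validator sent by the client still matches, else 200."""
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # When an entity tag is sent, it alone decides the outcome.
        return 304 if inm == etag else 200
    ims = request_headers.get("If-Modified-Since")
    if ims is not None and ims == last_modified:
        return 304
    return 200
```

A 304 tells the browser to reuse its cached copy, while a 200 is followed by the full response body.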
