Getting info from Google Analytics seemed like a pretty straightforward job. Authenticate with OAuth, connect to the Data Feed, ask for JSON, parse, done. As luck would have it, the library we rely on heavily here for networking tasks was giving errors. So I started scratching my head, playing with the Google OAuth Playground, trying different things, and it all seemed to be working. But my library was definitely giving me errors.
At this moment you are probably asking: “but what error is it?”. The answer is: an error signalling that the connection to the server has been closed and the library is not sure if all data has been received. Basically it was saying that Google was disconnecting me in the middle of getting the data. Now I was curious.
My usual way of approaching such issues is firing up Wireshark and looking at the bytes being sent on the wire to see exactly what is happening. The only problem was that the Google Analytics Data Feed uses SSL, so I couldn’t actually see the HTTP requests going back and forth. Trying HTTP instead of HTTPS turned out to be a dead end: Google was answering with a 302 redirect to the HTTPS host. And here is where socat enters the scene.
socat is a multipurpose relay tool that’s roughly similar to netcat. It allows you to connect two endpoints and set up a bidirectional transfer between them. I won’t go into details: the man page is exhaustive, and a command line is worth a thousand words. Here is what I used:
socat TCP-LISTEN:10001,reuseaddr=1,fork OPENSSL-CONNECT:www.google.com:443,verify=0,reuseaddr=1
This makes socat open a local TCP socket on port 10001, forking a new process for each connection, and connect it to an SSL-over-TCP socket that gets connected to www.google.com on port 443. The verify=0 option turns off certificate verification, and the reuseaddr parameters set SO_REUSEADDR on the socket, so you can retry immediately after you Ctrl+C socat (very useful). With that in place I fired up Wireshark, asked it to dump traffic from the loopback interface, changed the Google URL in my application to http://localhost:10001 (plain HTTP, since socat now handles the SSL part), and started it again. And it responded with a 302. But this time I saw the bytes in Wireshark and could see that it was because I was sending the wrong Host header (namely “localhost:10001”), breaking Google’s virtual hosting and triggering a redirect to the default site (www.google.com).
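To make the Host header issue concrete, here is a minimal Python sketch of talking to the relay by hand. It is not the code from my application: the function names are made up for illustration, the path "/" is a placeholder, and it assumes the socat command above is running on localhost:10001. The point is only that the TCP connection goes to the relay while the Host header names the real server.

```python
import socket


def build_request(path, host_header):
    """Build a minimal HTTP/1.1 GET request with an explicit Host header."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_header}",   # the real server's name, not the relay's
        "Connection: close",
        "",                       # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_via_relay(path, host_header="www.google.com",
                    relay=("localhost", 10001)):
    """Send the request in plain text to the local relay and read until EOF.

    Assumes socat is listening on `relay` and wrapping the traffic in SSL.
    """
    with socket.create_connection(relay) as sock:
        sock.sendall(build_request(path, host_header))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_via_relay("/") with the relay running should return the raw response, status line and headers included, which is the same conversation Wireshark shows on the loopback interface.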
After kicking my app into submission to force it to send a “www.google.com” Host header, I got a 401 Unauthorized. Another round of head scratching ensued, and after it I realized that with OAuth you sign the request URL, which for me looked like “http://localhost:10001/analytics/…”, but Google was expecting the signed URL to be “http://www.google.com/analytics/…”. After hacking the OAuth library to sign a different URL than the one I was using to issue the request, I was finally able to reproduce my original problem, but this time with the entire HTTP conversation captured in Wireshark.
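Why the relay URL breaks the signature is easier to see in code. The sketch below builds an OAuth 1.0 signature base string and signs it with HMAC-SHA1. Treat it as an illustration under assumptions: I am not showing the OAuth library we actually use, HMAC-SHA1 is just one of the signature methods OAuth 1.0 defines, the helper names are invented, the URLs are shortened placeholders, and query strings in the URL are not handled.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode the way OAuth 1.0 expects (only unreserved chars stay)."""
    return quote(str(value), safe="~")


def base_string(method, url, params):
    """Signature base string: METHOD & encoded URL & encoded sorted params."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign(method, url, params, consumer_secret, token_secret=""):
    """HMAC-SHA1 signature over the base string, base64-encoded."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode("ascii")
    message = base_string(method, url, params).encode("ascii")
    digest = hmac.new(key, message, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder values, for illustration only.
params = {"oauth_consumer_key": "example-key", "oauth_nonce": "abc123"}
relay_sig = sign("GET", "http://localhost:10001/analytics/", params, "secret")
real_sig = sign("GET", "http://www.google.com/analytics/", params, "secret")
print(relay_sig != real_sig)  # the URL is part of what gets signed
```

Since the server recomputes the signature from the URL it believes it is serving, a request signed with the relay URL cannot match, which is consistent with the 401 I got.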
And that was all I needed to confirm that the problem was the Google server setting neither Content-Length nor Transfer-Encoding headers on its responses. Without a Content-Length header, when the server disconnects you, there is no way to be sure whether you’ve received all of the content or the connection was broken in the middle. The library we’re using reacted by raising an error warning about a potential loss of data, and it had every right to do so. On the other hand, what I saw in Wireshark seemed to be the entire content, so it seems that Google is simply neglecting to set Content-Length in their responses. For extra fun, the OAuth playground returns responses with Transfer-Encoding: chunked, which obviates the need for Content-Length, but the real service apparently does not (or perhaps only does it for larger responses).
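The three cases can be told apart from the response headers alone. Here is a small Python sketch of that decision; the function name and the return labels are my own invention, and it deliberately ignores special cases such as responses to HEAD requests or 204 and 304 statuses, which never carry a body.

```python
def body_framing(headers):
    """Classify how the end of a response body is signalled.

    `headers` is an iterable of (name, value) pairs. Returns one of
    "chunked", "content-length", or "close-delimited".
    """
    fields = {name.strip().lower(): value.strip() for name, value in headers}

    # Transfer-Encoding wins over Content-Length when chunked is the
    # final coding applied to the body.
    codings = [c.strip().lower()
               for c in fields.get("transfer-encoding", "").split(",")
               if c.strip()]
    if codings and codings[-1] == "chunked":
        return "chunked"

    if "content-length" in fields:
        return "content-length"

    # Neither header: the body simply ends when the server closes the
    # connection, so truncation cannot be told apart from completion.
    return "close-delimited"
```

The responses I captured fall into the "close-delimited" branch, which is exactly the case where a client cannot verify that it got everything.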
After that it was just a matter of ignoring that error when using Google Data Services and assuming that all content has been transferred even if the response was missing Content-Length. But it might have been a much longer and more painful debugging session if it weren’t for socat and Wireshark.
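In spirit, the workaround looks something like the Python sketch below. Everything in it is hypothetical: the exception class stands in for whatever your networking library raises, the host set is an assumption, and `fetch` is any callable that performs the request. The idea is to accept an unverifiable body only from hosts known to omit length headers and to keep the error everywhere else.

```python
class UnverifiedLength(Exception):
    """Hypothetical error: connection closed and no length was declared."""

    def __init__(self, partial_body):
        super().__init__("connection closed; body length not verifiable")
        self.partial_body = partial_body


# Hosts we have seen omit Content-Length (assumption for illustration).
TOLERATED_HOSTS = {"www.google.com"}


def accept_unverified(host, fetch):
    """Run fetch(); tolerate a close-delimited body only for known hosts."""
    try:
        return fetch()
    except UnverifiedLength as err:
        if host in TOLERATED_HOSTS:
            # Assume the body is complete, as observed on the wire.
            return err.partial_body
        raise
```

Keeping the allow-list narrow means a genuinely truncated response from any other server still surfaces as an error.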
And finally, a question to all of you. Can you guess which library we are using for networking? I believe I gave enough hints already, but I’ll drop a last one: it’s quirky, but it’s awesome.