GuerrillaBrowser presents:
The Web Technology Crash Course



"The Web Technology Crash Course" (WebTech.htm) describes Internet operation using a series of illustrations.  Here is a more in-depth discussion of them—sort of like the "audio commentary" feature on your DVDs.



HTTP protocol stack — protocol, message unit, layer, addressing


The word "protocol" means the same in data communications that it does in international diplomacy: a set of rules governing how parties will interact.  On a computer network, the parties are just any computers (called "hosts") that are attached to the network.

Network protocols are layered up into something called a "stack".  The idea is that each protocol can be more specialized, and that protocols use the services of protocols in the layer below and provide services to protocols in the layer above.  The messages exchanged in most protocols consist of some header information followed by the actual payload.

Suppose you put a letter into an envelope and sent it down to the company mail room.  Imagine that envelope being put inside another one to make the trip to the local post office, and yet another envelope to go to Istanbul.  Protocol layering is a little like that.  The payload of an Ethernet "frame" on your local network might be all or part of an IP "datagram", and its payload might be a TCP "segment", and its payload might be all or part of an HTTP "request" or "response".
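
The envelope idea can be sketched in a few lines of Python.  This is only a toy: the bracketed header strings are made-up placeholders, not real Ethernet, IP or TCP header layouts, and the wrap() helper is hypothetical.

```python
# Toy illustration of protocol layering: each layer puts its own header
# in front of whatever the layer above handed it (its "payload").
# The header contents are invented placeholders, not real formats.

def wrap(header: bytes, payload: bytes) -> bytes:
    """Return one message unit: header followed by payload."""
    return header + payload

http_request = b"GET /index.html HTTP/1.1\r\nHost: 192.168.0.100\r\n\r\n"
tcp_segment = wrap(b"[TCP port 80]", http_request)
ip_datagram = wrap(b"[IP 192.168.0.100]", tcp_segment)
ethernet_frame = wrap(b"[ETH]", ip_datagram)

print(ethernet_frame)
```

Reading the printed frame from left to right is like opening the envelopes from the outside in.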

Physical networks use different "link" layer protocols.  At your home or office, it might be Ethernet or wireless LAN or something like that.  The telecommunications companies might be using some other protocols like ISDN, Frame Relay, ATM or whatever.  The job of the Internet Protocol (IP) is to make all those heterogeneous separate networks appear to be one homogeneous grand internetwork.  Every host attached to the Internet must have a unique IP address (like 69.147.114.210).

TCP (Transmission Control Protocol) is the TCP/IP suite's connection-oriented "transport" layer protocol.  It ensures reliable delivery by making sure that chopped-up messages get reassembled in the right order, duplicated pieces get dropped, and missing pieces get retransmitted.  It uses IP, but in addition to an IP address it also needs a "port" number (like 80) to allow multiple processes on the same host to use it simultaneously.

HTTP uses TCP, so it needs an IP address and port number for each end of the connection.  You don't need to know all your buddies' phone numbers—so long as they all know yours, you can still talk.  The "server" lives at a well-known address and port, and the "client" (that's you) gets whatever IP address your ISP assigned you and a temporary port number.  Additionally, on the server end you'll need some "path" information to identify the specific resource you want.

The other practical issue concerning the protocol stack is that if the TCP connection drops (which is not too rare on the busy Internet), it'll have to be reestablished before any more data can be exchanged with an HTTP server.



Sample DNS lookup — domain name and IP address


It isn't usually a server's IP address (like 69.147.114.210) that is well-known, but its "domain name" (like www.yahoo.com).  You can't establish a TCP connection using a domain name, but the Domain Name System (DNS) maintains a sort of directory that can map domain names to IP addresses.  Web browsers will automatically do a DNS lookup given a domain name, but will generally let you enter the IP address in the first place if that's all you have.
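
Roughly what a browser does with the host you type can be sketched like this.  The function name to_ip_address() is made up for illustration, and the sketch only asks for an IPv4 address; a real browser does considerably more (caching, IPv6 and so on).

```python
import ipaddress
import socket

def to_ip_address(host: str) -> str:
    """Return host unchanged if it is already an IP address,
    otherwise ask the system resolver (DNS) for an IPv4 address."""
    try:
        ipaddress.ip_address(host)
        return host                      # already an address, no lookup
    except ValueError:
        # Not an address literal, so treat it as a domain name.
        # socket.gaierror is raised if the lookup fails.
        return socket.gethostbyname(host)

print(to_ip_address("69.147.114.210"))   # no DNS lookup needed here
```

Passing a domain name instead goes out to DNS, so the answer depends on your network and on the moment you ask.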

Occasionally, a DNS lookup will fail, usually because the domain name in question is no longer registered (the site may have gone away for good).



URL format and example


Technicians have replaced the term "URL" (Uniform Resource Locator) with "URI" (Uniform Resource Identifier), but for HTTP they amount to the same thing so you can consider them interchangeable.  URLs are the key to the Web, since they're about the only means you have of getting a particular resource from a web server.

The "host" part of a URL is usually in the form of a domain name (like www.google.com), and gets converted to an IP address by doing a DNS lookup.  If no "port" number is given, HTTP's preassigned port number 80 will be used, but if you were told a different one you'll want to supply it in the URL since there may be nothing listening on port 80 at that server.  If you omit the port number, omit the ":" character also.

The "path" portion of the URL begins at the leftmost "/" after the host name (and port, if any), and extends to the leftmost "?" character (if any).  Two paths that differ between the leftmost "/" and rightmost "/" generally refer to different resources on that server, but two paths that differ only after the rightmost "/" may or may not refer to the same resource—it just depends on the particular server.
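
Here is a small sketch of pulling a URL apart with Python's standard urllib.parse module.  The url_parts() helper and the "/search" path in the example are assumptions made for illustration, and the port default of 80 assumes a plain "http" URL.

```python
from urllib.parse import urlsplit

def url_parts(url: str) -> dict:
    """Split an http URL into the pieces discussed above."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 80,    # no port given: assume HTTP's port 80
        "path": parts.path or "/",   # an empty path means the root "/"
        "query": parts.query,        # everything after the leftmost "?"
    }

example = url_parts("http://www.google.com/search?hl=en&q=dog+grooming&btnG=Search")
print(example)
```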

The "query" part of the URL (if present) is available as input to software on the server and is usually used to govern what the server sends back.  It can be almost anything, but often it's a set of "key=value" terms joined together by ampersands (like hl=en&q=dog+grooming&btnG=Search).
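
When the query does follow the common "key=value" convention, it can be taken apart with the standard library.  This sketch uses the sample query from the text; a server is free to interpret its query string some other way.

```python
from urllib.parse import parse_qsl

query = "hl=en&q=dog+grooming&btnG=Search"

# parse_qsl splits on "&" and "=", and turns "+" back into a space.
terms = parse_qsl(query)
for key, value in terms:
    print(key, "=", value)
```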

Some characters (like spaces) are a problem in URLs, because they have special meaning to HTTP or in some other situation arising during transport.  Spaces can be replaced with "+" (in the query part) or escaped using the "%" character (like "%20", where 20 is the hexadecimal character code for space).  Programmers frequently want to use some normal character (like "%" or "&") to have a special meaning, and then wind up having to escape the character (like "%25" to mean a percent sign or "&amp;" to mean an ampersand) when they want its regular meaning!
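
A short sketch of URL escaping using Python's urllib.parse.  The sample strings are made up; the point is only to show the "%" plus two hex digits pattern.

```python
from urllib.parse import quote, quote_plus, unquote

escaped_space = quote("dog grooming")       # space becomes %20
plus_space = quote_plus("dog grooming")     # space becomes +
escaped_percent = quote("100%")             # a literal % becomes %25
escaped_amp = quote("cats&dogs")            # a literal & becomes %26
round_trip = unquote(escaped_percent)       # and back again

print(escaped_space, plus_space, escaped_percent, escaped_amp, round_trip)
```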



Sample HTTP response and request


If you were talking to your buddy on a two-way radio, you might say "How's the weather there in Michigan, over" where the "over" is a sort of flag to signal that you're temporarily done.  Programmers frequently need to know when the end of something like a data stream has been reached, and have several techniques to do it.  One way is to send a prearranged amount of data, another is to send a length value ahead followed by that amount of data, and another is to send whatever amount of data followed by a special flag value to signal its end.

HTTP messages have a header portion followed by the payload.  The headers are just lines of text ending with a carriage return (CR), line feed (LF) character sequence (like Windows Notepad inserts every time you hit the "Enter" key).  The body of the message (if present) might be text (like in HTML) or binary data (like a JPEG image) or whatever.  Because there's no set number of headers, HTTP signals the end of the headers with an "empty" line—no spaces, tabs or anything, just a CR-LF pair by itself.  (In the DeChunk program, CR-LF shows up as a blue "MJ", for Ctrl+M followed by Ctrl+J).
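
Finding that empty line is the first job of any program reading an HTTP message.  The sketch below is a minimal version; the sample message is invented (it is not the one in the illustration) and split_message() is a hypothetical helper.

```python
def split_message(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first CR-LF CR-LF is the empty line that ends the headers.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 13\r\n"
       b"\r\n"
       b"<html></html>")
start_line, headers, body = split_message(raw)
print(start_line, headers, body)
```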

The message you send to a server is called a "request" and the message it sends back is called a "response".  An HTTP request has a "Request-Line", followed by some header lines, followed by an empty line.  With the "GET" method there is usually no message body, because the preceding lines completely describe what you want the server to do (other methods, like "POST", do carry one).  The Request-Line contains the "method" (like "GET"), a path identifying the requested resource, and the protocol version number (like "HTTP/1.1").

The sample HTTP response shown in the illustration was returned by a (dummy) web server when the client requested "http://192.168.0.100/index.html".  It contains a "Status-Line", followed by some header lines, followed by an empty line, followed by some HTML.  The Status-Line contains the protocol version, and a "status code" with text ("200 OK" in this case).  The HTML is a tiny little web page containing one hyperlink and some JavaScript code.
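
Taking a Status-Line apart is simple enough to sketch.  The helper name parse_status_line() is made up for illustration.

```python
def parse_status_line(line: str):
    """Split a Status-Line into (version, status code, reason text)."""
    parts = line.split(" ", 2)           # the reason text may contain spaces
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason

print(parse_status_line("HTTP/1.1 200 OK"))
print(parse_status_line("HTTP/1.1 404 Not Found"))
```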

Clicking on the underlined "test resource" text in the browser produced the sample HTTP request in the illustration.  Notice that the "tail" of the hyperlink (http://192.168.0.100/index.html) appears in the "Referer:" header, and the "head" of the hyperlink ("/a/test.html", same host) appears in the Request-Line as the resource being requested.  Hyperlinks go from "tail" to "head" (sounds backwards, doesn't it?).
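
A request like that one can be assembled by hand.  The sketch below uses the URLs from the sample, but it only builds the Request-Line plus "Host:" and "Referer:" headers; the real request in the illustration carries more headers than this, and build_request() is a hypothetical helper.

```python
def build_request(method: str, path: str, host: str, extra_headers=None) -> bytes:
    """Assemble a bodiless HTTP/1.1 request as raw bytes."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends with CR-LF; one more CR-LF makes the empty line.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

request = build_request(
    "GET", "/a/test.html", "192.168.0.100",
    {"Referer": "http://192.168.0.100/index.html"},
)
print(request.decode("ascii"))
```

Since the "Referer:" value is just text you supply, nothing stops you from putting any URL you like there.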

Notice that the "<script>" element in the sample response is sort of floating out in front of the HTML (when it would normally occur within the "<HEAD>" element, like "<TITLE>" does).  Conventional web browsers have been confronted with so much weirdness like that over the years that they've been adapted to cope with just about anything (HTML "tag soup") rather than make themselves look bad.  Of course, since standards don't cover a thing like that, different browsers will cope differently.

Another interesting detail is that the server tried to feed the browser several "cookies", both in "Set-Cookie:" headers and also in the JavaScript.  The browser (IE 6.0) just pasted them together and sent them back in a single "Cookie:" header line in the subsequent request.
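
The "pasting together" can be imitated with Python's http.cookies module.  The cookie names and values below are invented, and a real browser also checks each cookie's domain, path and expiry before sending it back, which this sketch ignores.

```python
from http.cookies import SimpleCookie

# Values of two hypothetical "Set-Cookie:" headers from a response.
set_cookie_values = ["id=abc123; Path=/", "theme=dark"]

jar = SimpleCookie()
for value in set_cookie_values:
    jar.load(value)                      # attributes like Path are kept aside

# Only the name=value pairs go back to the server, joined by "; ".
cookie_header = "Cookie: " + "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print(cookie_header)
```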

HTTP is called a "stateless" protocol, meaning that it gives web servers no way to remember what any particular client was doing from one request to the next.  Obviously, a server knows your IP address (unless you're using a proxy server), but those tended to get reassigned a lot back when most people had dialup, so cookies were created as an add-on to HTTP to give servers some way to track you (great, huh?).

Cookies are a major nuisance from a privacy standpoint.  Web pages often contain tons of links that your conventional browser activates automatically, passing cookies back and forth and making it easy for some snooper types to track your movements.  Fortunately, most websites will operate without them, but some use them to track what's in your shopping cart or the fact that you're logged in or whatever, so you're stuck with that when using those sites.

GuerrillaBrowser is designed as a flexible, low-level surfing tool, so it allows you to play with cookies manually if you feel the need to do that, but it has no involvement whatever with your conventional browser's cookie cache (or history, or anything else).



Some relevant HTTP status codes


HTTP version 1.1 defines 40 or more status codes, most of which you'll never see.  They've been collected into 5 groups, like the illustration shows.  What you'll discover is that a lot of web servers are not all that formal about how they use these codes.
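
The group is just the first digit of the code.  In this sketch the group labels follow the wording of the HTTP/1.1 specification, which may not match the illustration's labels exactly, and status_group() is a made-up helper.

```python
STATUS_GROUPS = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_group(code: int) -> str:
    """Return the group a status code belongs to, by its first digit."""
    return STATUS_GROUPS.get(code // 100, "Unknown")

for code in (200, 302, 404, 504):
    print(code, status_group(code))
```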

The "200" ("OK") code is supposed to mean that your request was successfully serviced.  It by no means guarantees that you actually got what you thought you were asking for.  It would seem that often what appeared to be on offer never actually existed—sort of a "bait-and-switch" scenario.

The 300-series codes are supposed to mean that the URL in your request is temporarily or permanently inactive, and that the "Location:" header in the server's response contains a substitute.  Sometimes the redirected resource will appear tacked onto the same response message, but most often you're expected to make another request (which your conventional browser will probably do without asking you).  That's not something you'll always want, since there's a lot of "bait-and-switch" stuff going on with the redirection status codes also.

The 400-series codes might be the result of some garbage characters in your request URL, which you may be able to spot and fix.  The "404" ("Not Found") code occurs so often you'll wonder how so many URLs got published for apparently nonexistent resources.

There probably isn't much you can do about a "500" ("Internal Server Error") on those rare occasions when you see one.  A "504" ("Gateway Time-out") might be the result of Internet congestion, and could be worth a retry.



Some relevant HTTP headers — requests and responses


HTTP version 1.1 defines 47 or more headers, the majority of which have little or no bearing on guerrilla surfing.

On the request side, the "Host:" header is only interesting because HTTP/1.1 requires it.  It contains the domain name (or possibly IP address) of the host containing the requested resource.  The "User-Agent:" header is meant to identify your browser (see the next topic).

The "Referer:" header's value is available to software on the server, allowing it to know what URL linked to the requested URL.  Servers sometimes use that information to keep other websites from "hotlinking" to their images or whatever.  If you've ever tried entering something like "http://xyz.com/1.jpg" in your conventional browser's address bar, you may have noticed that it can't always retrieve it.  That's probably because (unlike GuerrillaBrowser) it gives you no way to "spoof" a Referer.

"Cookie:" headers are used to return cookies to the servers that fed them to you, and "Set-Cookie:" headers are used to do the feeding (see the discussion in the sample HTTP request/response topic).

On the response side, the "Location:" header is used in conjunction with redirection as described in the prior topic.  The "Content-Location:" header is supposed to provide a more authoritative URL than the one you supplied in your request.  It's important because it replaces your request URL as the "base URL" for any relative (partial) URLs contained in the HTML accompanying the response.
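
Resolving a relative URL against a base URL can be sketched with urllib.parse.urljoin.  The resolve() helper and the example paths are assumptions for illustration, and the "Content-Location:" handling simply follows the HTTP/1.1 behavior described above.

```python
from urllib.parse import urljoin

def resolve(request_url: str, relative_url: str, content_location=None) -> str:
    """Turn a relative URL from the HTML into an absolute one."""
    base = request_url
    if content_location:
        # Content-Location (itself possibly relative) replaces the base URL.
        base = urljoin(request_url, content_location)
    return urljoin(base, relative_url)

print(resolve("http://192.168.0.100/index.html", "/a/test.html"))
print(resolve("http://192.168.0.100/", "pics/1.jpg", "/en/index.html"))
```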

The "Content-Type:" header is supposed to contain the media type of the message's payload, but isn't super-reliable (see the topic on "magic numbers" and media types).  The "Content-Length:" header is one way for your browser to know how long the body of the message is, when the server can determine that ahead of time.  Otherwise, the server will send the response in "chunks" that are each preceded by their length (useful for dynamically-created content).  The use of chunking will be announced by the "Transfer-Encoding:" header.
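
Undoing the chunking is mostly bookkeeping.  Here is a minimal sketch; decode_chunked() is a made-up name (it is not the DeChunk utility), the sample body is invented, and any trailer headers after the last chunk are simply ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked message body into the original payload."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal; anything after ";" is an extension.
        size = int(body[pos:line_end].split(b";")[0].strip(), 16)
        pos = line_end + 2
        if size == 0:              # a zero-length chunk marks the end
            break
        out += body[pos:pos + size]
        pos += size + 2            # skip the data and its trailing CR-LF
    return bytes(out)

chunked = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(chunked))
```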

Sometimes the message body will be compressed, using "gzip" compression, in which case the response will contain the "Content-Encoding:" header.  In rare cases, gzip compression will be applied to already-compressed media types (like JPEG), which makes no sense at all.
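
Reversing the compression takes one call to Python's gzip module.  In this sketch decode_body() is a hypothetical helper and the "response body" is manufactured locally just to have something to decompress.

```python
import gzip

def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    """Undo gzip compression if the Content-Encoding header says so."""
    if content_encoding.strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    return body                    # no encoding announced: leave it alone

original = b"<html>hello</html>" * 20
compressed = gzip.compress(original)

print(len(original), len(compressed))
print(decode_body(compressed, "gzip") == original)
```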

HTTP/1.1 allows "persistent" connections, which help reduce connection-related overhead for a long series of requests.  A "Connection: Keep-Alive" header means you're hooked up, while a "Connection: close" header means you ain't.



Sample "User-Agent:" header values


The illustration shows some sample "User-Agent:" header values for several popular browsers.  Your big, conventional browser likely sends one of these headers with every HTTP request, allowing the server to discover something about not only your browser but your operating system as well—useful for keeping statistics, tailoring HTML, or trying to force you to upgrade on their schedule (very annoying).

You can "spoof" your browser type by changing this header's value.  With GuerrillaBrowser, all you have to do is edit $ReqHdrs.txt (located in the GuerrillaBrowser "home" folder) using Notepad, then reload the file by clicking the "OK" button in the GuerrillaBrowser Options dialog.



Common file formats — typical header (bytes 1-12), filename extension, media type


Text files (like HTML, JavaScript, Cascading Style Sheets, etc.) generally start right at their first byte.  Binary files (like JPEG images, MPEG video or whatever) usually have some control information at the beginning, followed by the bytes that make up the actual image, video or whatever.  Often these file headers will contain some signature bytes (called "magic numbers") to help keep applications from trying to decode some format they can't understand.

The illustration shows bytes 1-12 of several common formats as they appear in the DeChunk utility, in hexadecimal (base 16) notation with their ASCII translation on top.

Computers use binary (base 2) numbers because two digits ("0" and "1") are easy to represent using voltages or magnets.  Most people are familiar with decimal (base 10) numbers, but programmers use hexadecimal a lot because it maps directly to the binary that computers understand.  An 8-bit byte can store unsigned integers in the range of 0-255, but in hex this would be 00-FF (the letters "A" through "F" represent digits ten through fifteen).
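
If your calculator won't do the conversions, Python will.  A few lines showing the same byte value in each base:

```python
value = 255

as_hex = format(value, "02X")        # two hex digits: "FF"
as_binary = format(value, "08b")     # eight bits: "11111111"
back_to_decimal = int("FF", 16)      # hex text back to a number: 255
space_code = int("20", 16)           # hex 20 (the space character) is 32

print(as_hex, as_binary, back_to_decimal, space_code)
```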

It's a nuisance to convert back and forth between decimal and hex unless your calculator can do it for you, but often it's enough just to recognize a pattern.  JPEG files begin with the byte pattern FF-D8-FF, for example, and GIF files begin with 47-49-46-38 (which are the ASCII values for "GIF8").

The table also shows common filename extensions and "media types" (sometimes called "MIME types") for these formats.  Media types take the form "type/subtype", so the type of a JPEG image is "image" and its subtype is "jpeg".  Media types appear in the "Content-Type:" HTTP header, and also show up in HTML on occasion.

Most of the time a file's media type, extension and header signature match, but occasionally formats get mislabeled.  If a file's extension says ".jpg" but its signature says "GIF8", you can bet that it's a GIF image, not a JPEG.  The DeChunk utility is provided with GuerrillaBrowser to allow you to easily check any file's signature.  The file "256.ASC" contains byte values 0-255 so that you can see what every character value looks like under DeChunk.
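
Checking a signature in code looks something like the sketch below.  It knows only the two signatures mentioned above (JPEG and GIF), and the name sniff_media_type() is made up; a real table would be much longer.

```python
# (signature bytes, media type) for the two formats named in the text.
SIGNATURES = [
    (b"\xFF\xD8\xFF", "image/jpeg"),
    (b"GIF8", "image/gif"),
]

def sniff_media_type(first_bytes: bytes):
    """Guess a media type from a file's leading bytes, or return None."""
    for signature, media_type in SIGNATURES:
        if first_bytes.startswith(signature):
            return media_type
    return None

# A file named "photo.jpg" whose first bytes are 47-49-46-38-39-61.
print(sniff_media_type(bytes.fromhex("474946383961")))
```

For a file on disk, reading the first dozen bytes in binary mode and passing them to the function is enough.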



The two basic HTML document types — standard and frameset


HTML is an "application" of something called SGML (Standard Generalized Markup Language).  You can think of SGML as a sort of blueprint or template—plug in a "Document Type Definition" (DTD) and it spits out a specific markup language.  The DTD contains a bunch of definitions that tell a program like your browser how to interpret an HTML document.

HTML began its development in the early 1990s at the hands of some WWW pioneers (Tim Berners-Lee, Dave Raggett and others).  Then it was hijacked for a while by browser vendors like Netscape and Microsoft who wanted to add lots of features to it.  Now it's been taken over by a standards group called W3C (the World Wide Web Consortium).  So, it's a bit of a mess, which may explain why HTML authoring tools apparently have such a hard time producing well-formed HTML documents.

The basic idea is that HTML documents contain "content" (text, images, whatever) intended for the human viewer, along with "markup" designed to help programs determine the document's overall structure.  Most of the markup takes the form of "tags" ("start-tags" and "end-tags") which are used to delimit HTML "elements".  These elements all have names and follow certain rules as set forth in the DTD.

HTML elements can contain other ("nested") HTML elements, and for any particular document they arrange into a hierarchy beginning with the root "<HTML>" element.  So, the "<HTML>" element can contain "<HEAD>" and "<BODY>" elements, and the "<HEAD>" element can contain "<META>" and "<TITLE>" elements—that kind of thing.

HTML documents come in 2 basic flavors as shown in the illustration.  In regular documents, most of the content occurs more or less inline between the "<BODY>" start-tag and "</BODY>" end-tag.  With "frameset" documents, the content tends to be sucked in from external sources identified by "<FRAME>" elements.  Either type can contain "<IFRAME>" (inline frame) elements, which also cause content to get sucked in from an external source.

In fact, HTML has quite a few elements designed for "embedding" external documents, programs or whatever into the current document.  Your big, conventional browser will generally download all of those things without asking or telling.  It has no earthly way of knowing whether you actually want any or all of them.  Neither does GuerrillaBrowser, which is why it has a rule that it only tries to get what you specifically tell it to.



Some popular browsers — browser, layout engine, script engine


There are many web browsers available, but a tiny handful dominate Web usage.  Conventional browsers are huge, complex (and expensive) programs, with much code devoted to rendering websites.  Once upon a time, there was a market for browser software so that development costs could be recovered.  (Ask the former employees and shareholders of Netscape Communications what happened with that.)

So, the apparent variety of browsers is sort of an illusion, since many of them resort to one of a few available rendering "engines" (software).  That's a little like putting different-shaped fiberglass bodies on the same VW chassis.  The layout engines and script engines used by the big 4 browsers are shown in the illustration.

The Web is all about presentation.  Website authors are looking to put on a good show, and your big, conventional browser provides all kinds of facilities for helping them do it.  Of course, malware authors also want to put on a show, one you might not be all that eager to experience.  The history of the development of what you might call the "standard" browsing model has been littered with many successful exploitations of flaws in concept or execution.

Fortunately, most websites are produced by legitimate businesses or other organizations that have no interest in attacking your computer.  For the rest, or for sites that just want to give you a hard time, GuerrillaBrowser was designed to put you at less of a disadvantage, but it doesn't even try to support the standard browsing model.

So, for the full Web "experience" on trusted sites, you should use your conventional browser.  On untrusted sites, you now have a choice.  You can use GuerrillaBrowser to scope them out for danger signs.  Or you can use GuerrillaBrowser to pick them apart.  Or, you can roll the dice...



HTML markup handling — type of markup, example, result


The contents of an HTML document can belong to any of several classes which are handled differently, as indicated in the illustration.  Basically, your browser's layout engine applies whatever formatting information it finds in a document to the rest of its content to determine how to arrange everything on your screen.

Markup that begins with "<?" ("processing instructions") or "<!" ("markup declarations") isn't supposed to be rendered by your browser.  That includes comments (which begin with the characters "<!--").  HTML authors use them to include notes, and sometimes to make document sections "disappear" without actually having to delete them.  Sometimes a comment's closing "-->" characters are missing, leaving a program no reliable way to tell where the comment ends.  You may be able to work out what the program can't, or you may just want to re-include removed content; either way, you can simply edit the HTML to put the closing "-->" characters wherever you want them.

SGML "entities" are often used in HTML.  Their "definitions" ("<!ENTITY whatever ...>") appear in the DTD and "references" to them ("&whatever;") appear in the document.  They provide a way to insert problematic characters into the document.  Numeric references ("&#number;" or "&#xhexnumber;") are also used, where the "number" is just the desired character's code.  Web servers will probably not understand URLs with something like "&amp;" in them, so references have to be translated before they're passed on.
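
Translating references back into plain characters is a one-liner with Python's html module.  The URL fragment below is invented for illustration.

```python
from html import unescape

# A URL as it might appear inside an HREF attribute in the HTML source.
in_html = "a.cgi?x=1&amp;y=2"
for_server = unescape(in_html)       # what should actually be requested

print(for_server)
print(unescape("&#38;"), unescape("&#x26;"), unescape("&lt;BR&gt;"))
```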

Style sheets and HTML presentational elements like "<FONT>" (font change) are of interest to website designers, but not much use to guerrilla surfers.  Script is very interesting, mainly because it's potentially dangerous.  Script and flaws in the script engine itself have been used as the basis for many attacks.  Script is also used to create content on-the-fly, which is a royal pain because there's no good way to review the content without executing the code (and taking your chances).

HTML looks like a mess if you aren't accustomed to it, but after a while it gets to be like "Neo" looking at the Matrix—you can look past the clutter and sort of see what's going on behind it.  Guerrilla surfing is usually a matter of hunting out some key things and ignoring the rest.



HTML element format and example


The illustration shows the general format of an HTML element.  The names of the element and attribute(s) (if any) are not case sensitive, but in most real-world HTML they appear in lower case.  They refer back to the names given in the element ("<!ELEMENT whatever ...>") and attribute ("<!ATTLIST whatever ...>") declarations in the DTD.

Sometimes the end-tag or even the start-tag is omitted.  The end-tag is forbidden for elements that have no content, like "<BR>" (line break).  Another example is the "<IMG>" element, where all the action happens in the attribute list (particularly the "SRC=" attribute).

Attribute values are another common place where HTML gets mangled, with missing end quotes or whatever.



HTML elements grouped by function


All of HTML is interesting to someone, but the main things of interest to guerrilla surfers are hyperlinks (because they contain URLs of resources you may want) and script (potentially bad news, but occasionally informative).

Hyperlinks have 2 "anchors" (a "tail" and a "head").  The "<A>" (anchor) element identifies the tail, and contains a reference (the "HREF=" attribute) to the head somewhere out on the Internet.  Your conventional browser will generally wait until you click on an "<A>" link to activate it (download the resource), but other elements create hyperlinks that will be activated automatically—something you may not always want to happen.
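
Sorting a page's URLs into "waits for a click" and "fetched automatically" can be sketched with Python's html.parser.  The LinkLister class, the short list of automatic elements and the sample page are all assumptions for illustration; HTML has more link-bearing elements than the four listed here.

```python
from html.parser import HTMLParser

# Elements whose URL a conventional browser typically fetches without a
# click, mapped to the attribute holding the URL (not a complete list).
AUTO_ATTRS = {"img": "src", "script": "src", "frame": "src", "iframe": "src"}

class LinkLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.clickable = []    # heads of <A> hyperlinks
        self.automatic = []    # URLs activated without asking

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # tag and attribute names arrive lower-cased
        if tag == "a" and attrs.get("href"):
            self.clickable.append(attrs["href"])
        elif tag in AUTO_ATTRS and attrs.get(AUTO_ATTRS[tag]):
            self.automatic.append(attrs[AUTO_ATTRS[tag]])

page = ('<HTML><BODY><A HREF="/a/test.html">test resource</A>'
        '<IMG SRC="pics/1.jpg"></BODY></HTML>')
lister = LinkLister()
lister.feed(page)
print(lister.clickable, lister.automatic)
```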