"The Web Technology Crash Course" (WebTech.htm) describes
Internet operation using a series of illustrations. Here is a more
in-depth discussion of them, sort of like the "audio commentary" feature
on your DVDs.
The word "protocol" means the same in data communications that it does in international diplomacy: a set of rules governing how parties will interact. On a computer network, the parties are just any computers (called "hosts") that are attached to the network.
Network protocols are layered up into something called a "stack". The idea is that each protocol can be more specialized, and that protocols use the services of protocols in the layer below and provide services to protocols in the layer above. The messages exchanged in most protocols consist of some header information followed by the actual payload.
Suppose you put a letter into an envelope and sent it down to the company mail room. Imagine that envelope being put inside another one to make the trip to the local post office, and yet another envelope to go to Istanbul. Protocol layering is a little like that. The payload of an Ethernet "frame" on your local network might be all or part of an IP "datagram", and its payload might be a TCP "segment", and its payload might be all or part of an HTTP "request" or "response".
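Here is a toy sketch of the envelope idea, assuming made-up placeholder headers (the bracketed strings are not real Ethernet, IP or TCP header formats, and www.example.com is just a stand-in host). Each layer simply puts its own header in front of whatever the layer above handed it.

```python
# Toy illustration of protocol layering. Each layer treats what it is
# given as an opaque payload and prefixes its own header.
# The bracketed header strings are placeholders, not real header formats.

def wrap(header: bytes, payload: bytes) -> bytes:
    """Put a payload inside one more 'envelope'."""
    return header + payload

http_request = b"GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
tcp_segment = wrap(b"[TCP header]", http_request)
ip_datagram = wrap(b"[IP header]", tcp_segment)
ethernet_frame = wrap(b"[Ethernet header]", ip_datagram)

print(ethernet_frame)
```

The innermost payload (the HTTP request) rides unchanged at the tail end of the outermost frame.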
Physical networks use different "link" layer protocols. At your home or
office, it might be Ethernet or wireless LAN or something like that. The
telecommunications companies might be using some other protocols like ISDN,
Frame Relay, ATM or whatever. The job of the Internet Protocol (IP) is to
make all those heterogeneous separate networks appear to be one homogeneous
grand internetwork. Every host attached to the Internet must have a unique
IP address (like 69.147.114.210).
TCP (Transmission Control Protocol) is the TCP/IP suite's connection-oriented
"transport" layer protocol. It ensures reliable delivery by making sure that
chopped-up messages get reassembled in the right order, duplicated pieces get
dropped, and missing pieces get retransmitted. It uses IP, but in addition
to an IP address it also needs a "port" number (like 80) to allow multiple
processes on the same host to use it simultaneously.
HTTP uses TCP, so it needs an IP address and port number for each end of the connection. You don't need to know all your buddies' phone numbers; so long as they all know yours, you can still talk. The "server" lives at a well-known address and port, and the "client" (that's you) gets whatever IP address your ISP assigned you and a temporary port number. Additionally, on the server end you'll need some "path" information to identify the specific resource you want.
The other practical issue concerning the protocol stack is that if the TCP connection drops (which is not too rare on the busy Internet), it'll have to be reestablished before any more data can be exchanged with an HTTP server.
It isn't usually a server's IP address (like 69.147.114.210) that is
well-known, but its "domain name" (like www.yahoo.com). You can't establish
a TCP connection using a domain name, but the Domain Name System (DNS)
maintains a sort of directory that can map domain names to IP addresses. Web
browsers will automatically do a DNS lookup given a domain name, but will
generally let you enter the IP address in the first place if that's all you
have.
Occasionally, a DNS lookup will fail, usually because the server in question has gone away (possibly for good).
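The lookup step can be sketched roughly like this in Python. The function names are made up for illustration; socket.gethostbyname() is the standard library call that does the actual DNS lookup, and it raises an error (typically socket.gaierror) when the lookup fails.

```python
import ipaddress
import socket

def is_ip_address(host: str) -> bool:
    """True if 'host' is already a literal IP address like 69.147.114.210."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def resolve(host: str) -> str:
    """Return an IP address for 'host', doing a DNS lookup only if needed."""
    if is_ip_address(host):
        return host  # nothing to look up
    # Needs a working network connection; fails if the name can't be resolved.
    return socket.gethostbyname(host)
```

Calling resolve("69.147.114.210") returns the address unchanged without touching the network, while resolve("www.yahoo.com") would go out and ask DNS.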
Technicians have replaced the term "URL" (Uniform Resource Locator) with "URI" (Uniform Resource Identifier), but for HTTP they amount to the same thing so you can consider them interchangeable. URLs are the key to the Web, since they're about the only means you have of getting a particular resource from a web server.
The "host" part of a URL is usually in the form of a domain name (like
www.google.com), and gets converted to an IP address by doing a DNS lookup.
If no "port" number is given, HTTP's preassigned port number 80 will be used,
but if you were told a different one you'll want to supply it in the URL
since there may be nothing listening on port 80 at that server. If you omit
the port number, omit the ":" character also.
The "path" portion of the URL begins at the leftmost "/" after the host name
(and port, if any), and extends to the leftmost "?" character (if any). Two
paths that differ between the leftmost "/" and rightmost "/" generally refer
to different resources on that server, but two paths that differ only after
the rightmost "/" may or may not refer to the same resource; it just
depends on the particular server.
The "query" part of the URL (if present) is available as input to software on
the server and is usually used to govern what the server sends back. It can
be almost anything, but often it's a set of "key=value" terms joined together
by ampersands (like hl=en&q=dog+grooming&btnG=Search).
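To see the host, port, path and query parts pulled apart, here is a small sketch using Python's standard urllib.parse module. The URL itself is an assumption made up for the example (in particular the port 8080 and the "/search" path); only the query string comes from the text above.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.google.com:8080/search?hl=en&q=dog+grooming&btnG=Search"
parts = urlsplit(url)

host = parts.hostname            # domain name, to be looked up via DNS
port = parts.port or 80          # fall back to HTTP's preassigned port
path = parts.path                # from the first "/" up to the "?"
query = parse_qs(parts.query)    # "key=value" terms split at the ampersands

# The same URL without an explicit port (and without the ":").
default_port = urlsplit("http://www.google.com/search").port or 80
```

Note that parse_qs() also turns the "+" in "dog+grooming" back into a space.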
Some characters (like spaces) are a problem in URLs, because they have
special meaning to HTTP or in some other situation arising during transport.
Spaces can be replaced with "+" or escaped using the "%" character (like
"%20", where 20 is the hexadecimal character code for space). Programmers
frequently want to use some normal character (like "%" or "&") to have a
special meaning, and then wind up having to escape the character (like
"%25" to mean a percent sign or "%26" to mean an ampersand) when they want
its regular meaning!
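Python's urllib.parse module has functions for this kind of escaping, so here is a short sketch of what it looks like in practice. The sample strings are made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Two ways of getting rid of a space.
percent_style = quote("dog grooming")       # space becomes %20
plus_style = quote_plus("dog grooming")     # space becomes +

# Escaping characters that would otherwise have a special meaning.
# safe="" tells quote() to escape everything except letters, digits
# and a few harmless punctuation marks.
escaped = quote("50% & up", safe="")

# Going the other way.
restored = unquote(escaped)
restored_plus = unquote_plus(plus_style)
```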
If you were talking to your buddy on a two-way radio, you might say "How's the weather there in Michigan, over" where the "over" is a sort of flag to signal that you're temporarily done. Programmers frequently need to know when the end of something like a data stream has been reached, and have several techniques to do it. One way is to send a prearranged amount of data, another is to send a length value ahead followed by that amount of data, and another is to send whatever amount of data followed by a special flag value to signal its end.
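The three techniques can be sketched as follows. This is only an illustration: the function names, the 8-byte fixed size, the 4-byte length prefix and the zero byte used as the "over" flag are all assumptions picked for the example, not anything HTTP itself uses.

```python
import struct

FIXED_SIZE = 8  # technique 1: both sides agree on the size ahead of time

def read_fixed(stream: bytes):
    """Take exactly FIXED_SIZE bytes; the rest belongs to the next message."""
    return stream[:FIXED_SIZE], stream[FIXED_SIZE:]

def frame_with_length(payload: bytes) -> bytes:
    """Technique 2: send a 4-byte length value ahead of the data."""
    return struct.pack("!I", len(payload)) + payload

def read_length_framed(stream: bytes):
    """Read the length value, then take that many bytes."""
    (length,) = struct.unpack("!I", stream[:4])
    return stream[4:4 + length], stream[4 + length:]

def read_until_flag(stream: bytes, flag: bytes = b"\x00"):
    """Technique 3: take everything up to a special flag value."""
    payload, _, rest = stream.partition(flag)
    return payload, rest
```

Each reader returns the message plus whatever bytes are left over, so the next message can be read from the remainder.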
HTTP messages have a header portion followed by the payload. The headers are
just lines of text ending with a carriage return (CR), line feed (LF)
character sequence (like Windows Notepad inserts every time you hit the
"Enter" key). The body of the message (if present) might be text (like in
HTML) or binary data (like a JPEG image) or whatever. Because there's no set
number of headers, HTTP signals the end of the headers with an "empty" line:
no spaces, tabs or anything, just a CR-LF pair by itself. (In the DeChunk
program, CR-LF shows up as a blue "MJ", for Ctrl+M followed by Ctrl+J).
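Finding that empty line is easy to do in code. Here is a minimal sketch (the function name and the sample message are made up) that splits a raw HTTP message into its first line, its headers and its body by looking for two CR-LF pairs in a row.

```python
def split_message(raw: bytes):
    """Split a raw HTTP message at the empty line that ends the headers."""
    head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no empty line found; headers are incomplete")
    first_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = []
    for line in header_lines:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return first_line, headers, body

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 13\r\n"
       b"\r\n"
       b"<HTML></HTML>")
first_line, headers, body = split_message(raw)
```

The body is left as raw bytes, since it might be text or binary data.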
The message you send to a server is called a "request" and the message it
sends back is called a "response". An HTTP request has a "Request-Line",
followed by some header lines, followed by an empty line. There is not
usually a message body, because the preceding lines completely describe what
you want the server to do. The Request-Line contains the "method" (like
"GET"), a path identifying the requested resource, and the protocol version
number (like "HTTP/1.1").
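Putting a request together is mostly a matter of joining lines with CR-LF. The sketch below assumes a made-up helper name, host and header values; it only builds the bytes and doesn't send anything.

```python
def build_get_request(host: str, path: str = "/", extra_headers=None) -> bytes:
    """Assemble a bodyless HTTP/1.1 GET request as raw bytes."""
    lines = [f"GET {path} HTTP/1.1",   # Request-Line: method, path, version
             f"Host: {host}"]          # required by HTTP/1.1
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends with CR-LF, and one empty line ends the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

request = build_get_request("www.example.com", "/index.html",
                            {"User-Agent": "ExampleBrowser/1.0"})
```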
The sample HTTP response shown in the illustration was returned by a (dummy)
web server when the client requested "http://192.168.0.100/index.html". It
contains a "Status-Line", followed by some header lines, followed by an empty
line, followed by some HTML. The Status-Line contains the protocol version,
and a "status code" with text ("200 OK" in this case). The HTML is a tiny
little web page containing one hyperlink and some JavaScript code.
Clicking on the underlined "test resource" text in the browser produced the
sample HTTP request in the illustration. Notice that the "tail" of the
hyperlink (http://192.168.0.100/index.html) appears in the "Referer:" header,
and the "head" of the hyperlink ("/a/test.html", same host) appears in the
Request-Line as the resource being requested. Hyperlinks go from "tail" to
"head" (sounds backwards, doesn't it?).
Notice that the "<script>" element in the sample response is sort of floating
out in front of the HTML (when it would normally occur within the "<HEAD>"
element, like "<TITLE>" does). Conventional web browsers have been
confronted with so much weirdness like that over the years that they've been
adapted to cope with just about anything (HTML "tag soup") rather than make
themselves look bad. Of course, since standards don't cover a thing like
that, different browsers will cope differently.
Another interesting detail is that the server tried to feed the browser
several "cookies", both in "Set-Cookie:" headers and also in the JavaScript.
The browser (IE 6.0) just pasted them together and sent them back in a single
"Cookie:" header line in the subsequent request.
HTTP is called a "stateless" protocol, meaning that it gives web servers no way to remember what any particular client was doing from one request to the next. Obviously, a server knows your IP address (unless you're using a proxy server), but those tended to get reassigned a lot back when most people had dialup, so cookies were created as an add-on to HTTP to give servers some way to track you (great, huh?).
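As a rough sketch of the pasting-together that a browser does, the function below takes the values of several "Set-Cookie:" headers and builds a single "Cookie:" header line from them. It is deliberately simplified: the cookie names and values are made up, and a real browser also pays attention to each cookie's path, domain and expiration before deciding whether to send it back.

```python
def cookie_header(set_cookie_values):
    """Build one 'Cookie:' header line from several 'Set-Cookie:' values."""
    # Only the leading name=value pair is returned to the server; the
    # attributes after the first ";" are instructions for the browser.
    pairs = [value.split(";", 1)[0].strip() for value in set_cookie_values]
    return "Cookie: " + "; ".join(pairs)

line = cookie_header([
    "session=abc123; path=/",
    "theme=dark; expires=Wed, 09 Jun 2021 10:18:14 GMT",
])
```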
Cookies are a major nuisance from a privacy standpoint. Web pages often contain tons of links that your conventional browser activates automatically, passing cookies back and forth and making it easy for some snooper types to track your movements. Fortunately, most websites will operate without them, but some use them to track what's in your shopping cart or the fact that you're logged in or whatever, so you're stuck with that when using those sites.
GuerrillaBrowser is designed as a flexible, low-level surfing tool, so it allows you to play with cookies manually if you feel the need to do that, but it has no involvement whatever with your conventional browser's cookie cache (or history, or anything else).
HTTP version 1.1 defines 40 or more status codes, most of which you'll never see. They've been collected into 5 groups, like the illustration shows. What you'll discover is that a lot of web servers are not all that formal about how they use these codes.
The "200" ("OK") code is supposed to mean that your request was successfully
serviced. It by no means guarantees that you actually got what you thought
you were asking for. It would seem that often what appeared to be on offer
never actually existed, sort of a "bait-and-switch" scenario.
The 300s codes are supposed to mean that the URL in your request is
temporarily or permanently inactive, and that the "Location:" header in the
server's response contains a substitute. Sometimes the redirected resource
will appear tacked onto the same response message, but most often you're
expected to make another request (which your conventional browser will
probably do without asking you). That's not something you'll always want;
there's a lot of "bait-and-switch" stuff going on with the redirection status
codes also.
The 400s codes might be the result of some garbage characters in your
request URL, which you may be able to spot and fix. The "404" ("Not Found")
code occurs so often you'll wonder how so many URLs got published for
apparently nonexistent resources.
There probably isn't much you can do about a "500" ("Internal Server Error")
on those rare occasions when you see one. A "504" ("Gateway Time-out") might
be the result of Internet congestion, and could be worth a retry.
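Since the first digit of the status code tells you which group it belongs to, sorting a response into its group takes very little code. Here is a sketch (the function name is made up; the group names are the ones used in the HTTP/1.1 specification).

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def parse_status_line(line: str):
    """Split a Status-Line into version, code, text and status group."""
    version, code, *reason = line.split(" ", 2)
    code = int(code)
    group = STATUS_CLASSES.get(code // 100, "Unknown")
    return version, code, (reason[0] if reason else ""), group

ok = parse_status_line("HTTP/1.1 200 OK")
timeout = parse_status_line("HTTP/1.1 504 Gateway Time-out")
```

Keep in mind that this only tells you what the server claimed, which (as noted above) isn't always what actually happened.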
HTTP version 1.1 defines 47 or more headers, the majority of which have little or no bearing on guerrilla surfing.
On the request side, the "Host:" header is only interesting because HTTP/1.1
requires it. It contains the domain name (or possibly IP address) of the
host containing the requested resource. The "User-Agent:" header is meant to
identify your browser (see the next topic).
The "Referer:" header's value is available to software on the server,
allowing it to know what URL linked to the requested URL. Servers sometimes
use that information to keep other websites from "hotlinking" to their images
or whatever. If you've ever tried entering something like
"http://xyz.com/1.jpg" in your conventional browser's address bar, you may
have noticed that it can't always retrieve it. That's probably because
(unlike GuerrillaBrowser) it gives you no way to "spoof" a Referer.
"Cookie:" headers are used to return cookies to the servers that fed them
to you, and "Set-Cookie:" headers are used to do the feeding (see the
discussion in the sample HTTP request/response topic).
On the response side, the "Location:" header is used in conjunction with
redirection as described in the prior topic. The "Content-Location:" header
is supposed to provide a more authoritative URL than the one you supplied in
your request. It's important because it replaces your request URL as the
"base URL" for any relative (partial) URLs contained in the HTML accompanying
the response.
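The base URL business can be sketched with urljoin() from Python's standard urllib.parse module. The function names and the xyz.com URLs below are made up, and the headers are assumed to be in a plain dictionary keyed by the exact header name.

```python
from urllib.parse import urljoin

def base_url(request_url: str, response_headers: dict) -> str:
    """Pick the base URL: Content-Location if present, else the request URL."""
    content_location = response_headers.get("Content-Location")
    if content_location:
        # Content-Location may itself be relative to the request URL.
        return urljoin(request_url, content_location)
    return request_url

def resolve_link(request_url: str, response_headers: dict, link: str) -> str:
    """Turn a relative (partial) URL from the HTML into a full URL."""
    return urljoin(base_url(request_url, response_headers), link)

plain = resolve_link("http://192.168.0.100/index.html", {}, "/a/test.html")
moved = resolve_link("http://xyz.com/latest",
                     {"Content-Location": "/archive/2008/page.html"},
                     "1.jpg")
```

In the second case the relative URL "1.jpg" ends up under "/archive/2008/" rather than at the top level, which is exactly why the header matters.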
The "Content-Type:" header is supposed to contain the media type of the
message's payload, but isn't super-reliable (see the topic on "magic numbers"
and media types). The "Content-Length:" header is one way for your browser
to know how long the body of the message is, when the server can determine
that ahead of time. Otherwise, the server will send the response in "chunks"
that are each preceded by their length (useful for dynamically-created
content). The use of chunking will be announced by the "Transfer-Encoding:"
header.
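Here is a minimal sketch of undoing the chunking, assuming the whole message body has already been received. It isn't the code of the DeChunk utility, just an illustration of the format: each chunk is a hexadecimal length on a line by itself, then that many bytes of data, then a CR-LF, with a zero-length chunk at the end.

```python
def dechunk(body: bytes) -> bytes:
    """Reassemble a body sent with 'Transfer-Encoding: chunked'."""
    out = bytearray()
    pos = 0
    while True:
        end_of_line = body.index(b"\r\n", pos)
        # The size is hexadecimal; anything after a ";" is an optional
        # chunk extension, which this sketch ignores.
        size_field = body[pos:end_of_line].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # the zero-length chunk marks the end
        start = end_of_line + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the data and the CR-LF after it
    return bytes(out)

text = dechunk(b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n")
```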
Sometimes the message body will be compressed, using "gzip" compression, in
which case the response will contain the "Content-Encoding:" header. In rare
cases, gzip compression will be applied to already-compressed media types
(like JPEG), which makes no sense at all.
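Undoing gzip compression is a one-liner with Python's standard gzip module. The sketch below uses a made-up function name and assumes the headers are in a plain dictionary keyed by the exact header name.

```python
import gzip

def decode_body(body: bytes, headers: dict) -> bytes:
    """Decompress the message body if the server said it was gzipped."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    return body  # not compressed (or some encoding this sketch ignores)

html = b"<HTML><BODY>Hello</BODY></HTML>"
compressed = gzip.compress(html)
restored = decode_body(compressed, {"Content-Encoding": "gzip"})
```

Gzipped data has its own signature, by the way: it always starts with the bytes 1F-8B.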
HTTP/1.1 allows "persistent" connections, which help reduce
connection-related overhead for a long series of requests. A
"Connection: Keep-Alive" header means you're hooked up, while a
"Connection: close" header means you ain't.
The illustration shows some sample "User-Agent:" header values for several
popular browsers. Your big, conventional browser likely sends one of these
headers with every HTTP request, allowing the server to discover something
about not only your browser but your operating system as well, which is
useful for keeping statistics, tailoring HTML, or trying to force you to
upgrade on their schedule (very annoying).
You can "spoof" your browser type by changing this header's value. With
GuerrillaBrowser, all you have to do is edit $ReqHdrs.txt (located in the
GuerrillaBrowser "home" folder) using Notepad, then reload the file by
clicking the "OK" button in the GuerrillaBrowser Options dialog.
Text files (like HTML, JavaScript, Cascading Style Sheets, etc.) generally start right at their first byte. Binary files (like JPEG images, MPEG video or whatever) usually have some control information at the beginning, followed by the bytes giving their detailed description. Often these file headers will contain some signature bytes (called "magic numbers") to help keep applications from trying to decode some format they can't understand.
The illustration shows bytes 1-12 of several common formats as they appear in the DeChunk utility, in hexadecimal (base 16) notation with their ASCII translation on top.
Computers use binary (base 2) numbers because two digits ("0" and "1") are easy to represent using voltages or magnets. Most people are familiar with decimal (base 10) numbers, but programmers use hexadecimal a lot because it maps directly to the binary that computers understand. An 8-bit byte can store unsigned integers in the range of 0-255, but in hex this would be 00-FF (the letters "A" through "F" represent digits ten through fifteen).
It's a nuisance to convert back and forth between decimal and hex unless your calculator can do it for you, but often it's enough just to recognize a pattern. JPEG files begin with the byte pattern FF-D8-FF, for example, and GIF files begin with 47-49-46-38 (which are the ASCII values for "GIF8").
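If you have Python handy, it will do the conversions for you. A few examples, using the values mentioned above:

```python
# Hex to decimal and back again.
largest_byte = int("FF", 16)          # hex string to integer
as_hex = format(255, "02X")           # integer to two-digit hex string

# Text to hex bytes and back again.
gif_signature = b"GIF8".hex("-").upper()
jpeg_signature = bytes.fromhex("FF D8 FF")
letters = bytes.fromhex("47494638").decode("ascii")
```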
The table also shows common filename extensions and "media types" (sometimes
called "MIME types") for these formats. Media types take the form
"type/subtype", so the type of a JPEG image is "image" and its subtype is
"jpeg". Media types appear in the "Content-Type:" HTTP header, and also show
up in HTML on occasion.
Most of the time a file's media type, extension and header signature match,
but occasionally formats get mislabeled. If a file's extension says ".jpg"
but its signature says "GIF8", you can bet that it's a GIF image, not a JPEG.
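Checking a file's label against its signature can be sketched like this. The function names and the file name are made up, and the tables hold only the two signatures mentioned above; a real checker would know many more formats.

```python
import os

# Signature bytes (magic numbers) and the media type they indicate.
SIGNATURES = [
    (b"\xFF\xD8\xFF", "image/jpeg"),
    (b"GIF8", "image/gif"),
]

# What the filename extension claims.
EXTENSIONS = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".gif": "image/gif"}

def sniff(data: bytes):
    """Return the media type suggested by the first few bytes, or None."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    return None

def check_label(filename: str, data: bytes):
    """Return (claimed type, actual type) so a mismatch is easy to spot."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension), sniff(data)

claimed, actual = check_label("kitten.jpg", b"GIF89a" + b"\x00" * 6)
```

When the two disagree, believe the signature.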
The DeChunk utility is provided with GuerrillaBrowser to allow you to easily
check any file's signature. The file "256.ASC" contains byte values 0-255 so
that you can see what every character value looks like under DeChunk.
HTML is an "application" of something called SGML (Standard Generalized Markup Language). You can think of SGML as a sort of blueprint or template: plug in a "Document Type Definition" (DTD) and it spits out a specific markup language. The DTD contains a bunch of definitions that tell a program like your browser how to interpret an HTML document.
HTML began its development in the early 1990s at the hands of some WWW pioneers (Tim Berners-Lee, Dave Raggett and others). Then it was hijacked for a while by browser vendors like Netscape and Microsoft who wanted to add lots of features to it. Now it's been taken over by a standards group called W3C (the World Wide Web Consortium). So, it's a bit of a mess, which may explain why HTML authoring tools apparently have such a hard time producing well-formed HTML documents.
The basic idea is that HTML documents contain "content" (text, images, whatever) intended for the human viewer, along with "markup" designed to help programs determine the document's overall structure. Most of the markup takes the form of "tags" ("start-tags" and "end-tags") which are used to delimit HTML "elements". These elements all have names and follow certain rules as set forth in the DTD.
HTML elements can contain other ("nested") HTML elements, and for any
particular document they arrange into a hierarchy beginning with the root
"<HTML>" element. So, the "<HTML>" element can contain "<HEAD>" and "<BODY>"
elements, and the "<HEAD>" element can contain "<META>" and "<TITLE>"
elements, that kind of thing.
HTML documents come in 2 basic flavors as shown in the illustration. In
regular documents, most of the content occurs more or less inline between the
"<BODY>" start-tag and "</BODY>" end-tag. With "frameset" documents, the
content tends to be sucked in from external sources identified by "<FRAME>"
elements. Either type can contain "<IFRAME>" (inline frame) elements, which
also cause content to get sucked in from an external source.
In fact, HTML has quite a few elements designed for "embedding" external documents, programs or whatever into the current document. Your big, conventional browser will generally download all of those things without asking or telling. It has no earthly way of knowing whether you actually want any or all of them. Neither does GuerrillaBrowser, which is why it has a rule that it only tries to get what you specifically tell it to.
There are many web browsers available, but a tiny handful dominate Web usage. Conventional browsers are huge, complex (and expensive) programs, with much code devoted to rendering websites. Once upon a time, there was a market for browser software so that development costs could be recovered. (Ask the former employees and shareholders of Netscape Communications what happened with that.)
So, the apparent variety of browsers is sort of an illusion, since many of them resort to one of a few available rendering "engines" (software). That's a little like putting different-shaped fiberglass bodies on the same VW chassis. The layout engines and script engines used by the big 4 browsers are shown in the illustration.
The Web is all about presentation. Website authors are looking to put on a good show, and your big, conventional browser provides all kinds of facilities for helping them do it. Of course, malware authors also want to put on a show, one you might not be all that eager to experience. The history of the development of what you might call the "standard" browsing model has been littered with many successful exploitations of flaws in concept or execution.
Fortunately, most websites are produced by legitimate businesses or other organizations that have no interest in attacking your computer. For the rest, or for sites that just want to give you a hard time, GuerrillaBrowser was designed to put you at less of a disadvantage, but it doesn't even try to support the standard browsing model.
So, for the full Web "experience" on trusted sites, you should use your conventional browser. On untrusted sites, you now have a choice. You can use GuerrillaBrowser to scope them out for danger signs. Or you can use GuerrillaBrowser to pick them apart. Or, you can roll the dice...
The contents of an HTML document can belong to any of several classes which are handled differently, as indicated in the illustration. Basically, your browser's layout engine applies whatever formatting information it finds in a document to the rest of its content to determine how to arrange everything on your screen.
Markup that begins with "<?" ("processing instructions") or "<!" ("markup
declarations") isn't supposed to be rendered by your browser. That includes
comments (which begin with the characters "<!--"). HTML authors use them to
include notes, and sometimes to make document sections "disappear" without
actually having to delete them. Sometimes a comment's closing "--"
characters are missing, leaving a program no reliable way to tell where the
comment ends. You may be a lot smarter, or just want to re-include removed
content, and can simply edit the HTML to put the closing "-->" characters
wherever you want them.
SGML "entities" are often used in HTML. Their "definitions" ("<!ENTITY
whatever ...>") appear in the DTD and "references" to them ("&whatever;")
appear in the document. They provide a way to insert problematic characters
into the document. Numeric references ("&#number;" or "&#xhexnumber;") are
also used, where the "number" is just the desired character's code. Web
servers will probably not understand URLs with something like "&amp;" in
them, so references have to be translated before they're passed on.
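Python's standard html module can do that translation. A short sketch, using a made-up URL as it might appear inside an HTML attribute:

```python
import html

# A URL as it might be written in an HTML document, with the
# ampersand given as an entity reference.
in_document = "http://xyz.com/search?hl=en&amp;q=dog+grooming"

# What should actually be sent to the server.
for_server = html.unescape(in_document)

# Numeric references (decimal and hexadecimal) work the same way.
decimal_ref = html.unescape("&#38;")
hex_ref = html.unescape("&#x26;")
```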
Style sheets and HTML presentational elements like "<FONT>" (font change)
are of interest to website designers, but not much use to guerrilla surfers.
Script is very interesting, mainly because it's potentially dangerous.
Script and flaws in the script engine itself have been used as the basis for
many attacks. Script is also used to create content on-the-fly, which is a
royal pain because there's no good way to review the content without
executing the code (and taking your chances).
HTML looks like a mess if you aren't accustomed to it, but after a while it gets to be like "Neo" looking at the Matrix: you can look past the clutter and sort of see what's going on behind it. Guerrilla surfing is usually a matter of hunting out some key things and ignoring the rest.
The illustration shows the general format of an HTML element. The names of
the element and attribute(s) (if any) are not case sensitive, but in most
real-world HTML they appear in lower case. They refer back to the names
given in the element ("<!ELEMENT whatever ...>") and attribute ("<!ATTLIST
whatever ...>") declarations in the DTD.
Sometimes the end-tag or even the start-tag is omitted. The end-tag is
forbidden for elements that have no content, like "<BR>" (line break).
Another example is the "<IMG>" element, where all the action happens in the
attribute list (particularly the "SRC=" attribute).
Attribute values are another common place where HTML gets mangled, with missing end quotes or whatever.
All of HTML is interesting to someone, but the main things of interest to guerrilla surfers are hyperlinks (because they contain URLs of resources you may want) and script (potentially bad news, but occasionally informative).
Hyperlinks have 2 "anchors" (a "tail" and a "head"). The "<A>" (anchor)
element identifies the tail, and contains a reference (the "HREF=" attribute)
to the head somewhere out on the Internet. Your conventional browser will
generally wait until you click on an "<A>" link to activate it (download the
resource), but other elements create hyperlinks that will be activated
automatically, something you may not always want to happen.
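Hunting out the hyperlinks in a document can be sketched with Python's standard html.parser module. This is not how GuerrillaBrowser does it; the class name is made up, and the list of "automatic" elements is just a small sample picked for the example.

```python
from html.parser import HTMLParser

class LinkLister(HTMLParser):
    """Collect URLs, keeping click-to-activate links apart from the rest."""

    # A few elements that a conventional browser fetches on its own.
    AUTOMATIC = {"img", "frame", "iframe", "script"}

    def __init__(self):
        super().__init__()
        self.clicked = []     # HREF values of <A> elements
        self.automatic = []   # SRC values of automatically loaded elements

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names in lower case.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.clicked.append(attributes["href"])
        elif tag in self.AUTOMATIC and attributes.get("src"):
            self.automatic.append(attributes["src"])

lister = LinkLister()
lister.feed('<HTML><BODY><A HREF="/a/test.html">test resource</A>'
            '<IMG SRC="http://xyz.com/1.jpg"></BODY></HTML>')
```

After feeding in the document, lister.clicked holds the links you would have to click on, and lister.automatic holds the ones a conventional browser would go and get by itself.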