GuerrillaBrowser User Guide



Notice:  You should use GuerrillaBrowser and any materials you retrieve from the web in accordance with all applicable (copyright, etc.) law.  Misbehavior could mean banishment from some websites, loss of Internet service, or even legal action.  The Internet is a communal resource, so play nice!


Overview


What GuerrillaBrowser Isn't

GuerrillaBrowser isn't an image editor, text editor, or file manager.  Odds are you already have programs that do those jobs—use them.

Likewise, GuerrillaBrowser isn't a conventional browser, nor does it dream of becoming one when it grows up.  The two main reasons why GuerrillaBrowser doesn't do many of the things you may take for granted with your conventional browser are: (1) some of them are risky on a non-secure network like the Internet, and (2) trying to be all things to all people tends to produce an unwieldy, overbloated mess.

GuerrillaBrowser intentionally contains NO script engine.  It doesn't even render HTML in the normal manner.  As a result, web pages designed with conventional browsers in mind may not display everything you'll want to see in GuerrillaBrowser's streamlined user interface.  A text editor should allow you to safely view HTML and even script if you can learn to look past the (markup) clutter.  (For an alternative, see the "detoxed" HTML feature.)

GuerrillaBrowser also contains no support for secure connections, meaning it won't retrieve URLs which begin with the https: protocol scheme.  Face it—if you can't safely use your conventional browser to do online banking, you probably need a better bank even worse than you need a better browser!

Less Is More

GuerrillaBrowser is simply intended to give you an inexpensive way to protect yourself on the dangerous Internet, by giving you direct control over your interaction with web servers.  It can't force web servers to do anything (any more than your conventional browser can), but it provides few if any of the mechanisms commonly used by servers to hijack your PC.  If you've ever had your PC trashed by hackers, you'll know what that's worth.


1. The System


Package Contents

GuerrillaBrowser comes packaged in a ZIP file.  To install it, simply save that file in whatever folder you like, and then unzip it (e.g. by telling Windows Explorer to "Extract All...").  You should see the following components:

    GB.exe        -  GuerrillaBrowser program
    GB.ini        -  GuerrillaBrowser configuration file
    GBReadMe.htm  -  GuerrillaBrowser User Guide (this file)
    $ReqHdrs.txt  -  sample HTTP request template (see Smart/Verbatim Mode)
    GV.exe        -  beta version of GuerrillaViewer companion program
    DeChunk.exe   -  utility program for stripping HTTP responses

GuerrillaBrowser doesn't modify the Windows Registry during installation and can be "uninstalled" at any time by just deleting the above files.

GuerrillaBrowser Operation

GuerrillaBrowser is designed only to download files from Internet servers.  Unlike your conventional browser, it is not designed to provide any kind of platform to web servers to execute any tasks—good or bad—on your PC.

URLs (web addresses) are the key to the Internet.  GuerrillaBrowser tries to download ("grab") whatever URLs you give it (individually or in a list).  It can also parse ("scrub") downloaded HTML for more URLs, sort them into lists, and produce a document called a "map" that associates these URLs with their descriptive text and thumbnail images in a regularized, streamlined way.  The GuerrillaBrowser user interface displays these Map files much like your conventional browser displays HTML and other documents.

GuerrillaBrowser requests only the URLs you give it, meaning it doesn't automatically download "frames" or other embedded objects or follow URL redirects from servers.  Sometimes these other files are something you actually want, sometimes not, but with GuerrillaBrowser the choice to retrieve these is left up to you.

GuerrillaBrowser has a Batch Mode that processes a list of URLs, or even a list of lists, meaning it can download literally hundreds of files at one go.  It can't increase your Internet bandwidth, but it can reduce the amount of flailing around otherwise needed to retrieve these files, which can minimize the amount of time you have to spend connected to the dangerous Internet.

The GB Cache

GuerrillaBrowser initially stores all downloaded files in its cache, which is simply a folder that you specify.  Thereafter, you can move them about as you please.  This is to protect you from having important files elsewhere on your computer maliciously overwritten (a tactic popular with hackers).

The GB Cache is not a "cache" in the HTTP sense (i.e. a place to retrieve content without having to go to the original server), just a place to store downloaded files and work files.  GuerrillaBrowser always goes to the server when you tell it to Grab files, and never goes to the server when you tell it to open a Map file.  It makes no effort to track whether files are "fresh" or locally available or other HTTP caching issues.

Each instance of GuerrillaBrowser you run can access only one GB Cache at a time, but you can create as many GB Caches as you like.  Each GB Cache is subdivided into 100 folders, numbered 00 through 99.  You enter this "cache index" into the Cache Index spinner on the toolbar to tell GuerrillaBrowser which subfolder you want to operate on.

GuerrillaBrowser tracks usage of each GB Cache in a file called $Cache.txt, located in the main folder for that cache.  You "open" a GB Cache (and automatically close the prior one) by entering the path to one of these $Cache.txt files in the Open Cache... dialog.  If the $Cache.txt file doesn't exist there, GuerrillaBrowser creates an empty GB Cache at that path.

GuerrillaBrowser reads the $Cache.txt file into memory, and uses that in-memory copy to track any changes you make to that cache (by means of the Grab HTML command).  To update the disk copy, use the Write Cache command, or open a different GB Cache, or quit GuerrillaBrowser.  If GuerrillaBrowser hasn't recorded any changes to the in-memory copy, it won't bother to rewrite the disk copy.

You should consider the GB Cache to be a work area.  Some operations delete files in the current GB Cache, including downloaded files, so you'll want to back up any important files soon after you get them.  (See also Novice Mode.)

GB Work Files

GuerrillaBrowser creates a number of work files to keep track of what's going on in the GB Cache and the various Cache Index subfolders.  Their filenames begin with "$" to help separate them from downloaded files, and their filename extensions generally reflect their type.  Most of them are plain US-ASCII text files, so that you can exercise some aftermarket control over GuerrillaBrowser's operation (of the kind not usually available with bloatware).

A typical GB Cache might contain files like the following (where the ".\" below represents the current GB Cache path):

    .\$Cache.txt            -  GB Cache Index/URL record file
    .\$ErrLog.txt           -  error log file
    .\$Batch.txt            -  list of Cache Indexes (see Batch Mode)
    .\$URLs.txt             -  list of URLs (see Autonumber command)
    .\scratch\              -  any subfolder you create
    .\05\$0.raw             -  raw HTTP response from web server
    .\05\$2.htm             -  "dechunked" HTML created from $0.raw
    .\05\$3U.txt            -  preliminary list of URLs from $2.htm
    .\05\$4.map             -  GB document file for Cache Index 05
    .\05\$5T.txt            -  thumbs download/alias list
    .\05\$6P.txt            -  pictures download list
    .\05\$7M.txt            -  movies download list
    .\05\$8O.txt            -  other media types URL list
    .\05\$9L.txt            -  arbitrary download list (see Save List)
    .\05\tmp\               -  any subfolder you create
    .\05\05.htm             -  "detoxed" HTML created from $2.htm
    .\05\05_files\0001.css  -  downloaded stylesheet
    .\05\05_files\0002.gif  -  downloaded image
    .\05\05_files\0003.jpg  -  downloaded thumbnail image (like sample1.jpg)
    .\05\sample1.mpg        -  downloaded movie file

GuerrillaBrowser divides web servers into two classes: link servers and content servers.  (The division is purely conceptual—many servers will perform both roles.)  HTML from content servers is expected to contain links to downloadable content—images, multimedia, etc.  HTML from link servers is expected to contain links to content servers and other link servers.

If the $0.raw file in Cache Index 05 above came from a content server, most of the links it contains would likely show up in the $6P.txt and/or $7M.txt lists—assuming GuerrillaBrowser recognizes their media types.  If $0.raw came from a link server, most or all of its links would probably go to the $8O.txt list.  They could be assigned to other Cache Indexes and used to retrieve additional $0.raw files (analogous to clicking on a link to another web page in your conventional browser).
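
As a rough illustration of this sorting, here is a minimal Python sketch.  The extension sets and the function name classify_url are assumptions made for the example; this guide does not list which media types GuerrillaBrowser actually recognizes, and thumbs (which go to $5T.txt) are left out.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical extension sets: assumptions for illustration only.
PICTURE_EXTS = {".jpg", ".jpeg", ".gif", ".png"}
MOVIE_EXTS = {".mpg", ".mpeg", ".avi", ".wmv"}

def classify_url(url):
    """Return the name of the list an URL would plausibly be sorted into."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    if ext in PICTURE_EXTS:
        return "$6P.txt"   # pictures download list
    if ext in MOVIE_EXTS:
        return "$7M.txt"   # movies download list
    return "$8O.txt"       # other media types (e.g. links to more HTML)
```

With these assumed sets, a link like http://xyz.com/a/sample1.mpg lands in the movies list, while http://xyz.com/Z/index.html falls through to the "other" list, as you'd expect for a link server.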

When you tell GuerrillaBrowser to Grab HTML for Cache Index 05 (or whatever), it first deletes all files in the .\05\ and .\05\05_files\ subfolders.  The in-memory copy of the $Cache.txt record is also updated with the URL you gave for Index 05.  No files in folders that you may create (like .\scratch\ and .\05\tmp\ in the example above) are disturbed.  GuerrillaBrowser then saves the response exactly as received from the server in the $0.raw file.

The $0.raw file usually contains HTTP headers followed by a possibly-encoded message body.  When you tell GuerrillaBrowser to Scrub HTML for Index 05 (or see Auto Scrub), it removes the headers, "chunked" Transfer-Encoding, and the "deflate" (RFC 1951) compression commonly used in the "gzip" Content-Encoding, producing the cleaned-up $2.htm file.
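
If you're curious what that clean-up involves, the following Python sketch performs the same three steps on a raw response held in memory.  It is not GuerrillaBrowser's own code: the function names are made up for the example, and it leans on Python's standard gzip module rather than a hand-written "deflate" decoder.

```python
import gzip

def split_headers(raw):
    """Split a raw HTTP response into (header_text, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1"), body

def dechunk(body):
    """Undo "chunked" Transfer-Encoding: <hex size>CRLF<data>CRLF, repeated."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol].split(b";")[0], 16)  # ignore chunk extensions
        if size == 0:
            break                      # a zero-size chunk ends the body
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2         # skip the CRLF that closes the chunk
    return bytes(out)

def scrub(raw):
    """Return the decoded message body of a raw HTTP response."""
    head, body = split_headers(raw)
    headers = {}
    for line in head.split("\r\n")[1:]:    # [1:] skips the status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip().lower()
    if "chunked" in headers.get("transfer-encoding", ""):
        body = dechunk(body)
    if headers.get("content-encoding") == "gzip":
        body = gzip.decompress(body)
    return body
```

Note the order: the chunking is removed first and the compression second, because the server compresses the content before it chops it into chunks.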

Grab Thumbs uses the $5T.txt list to download and rename thumbnail images and stylesheets (automatically given aliased names because conflicting filenames like http://xyz.com/a/1.jpg and http://xyz.com/b/1.jpg are not uncommon).  Open Map uses $4.map and $5T.txt to display the URLs found in $2.htm, along with associated information.  You can use the Save List command to store selected URLs from the Map in the $9L.txt list.

Grab Pics, Grab Movies, and Grab List use the $6P.txt, $7M.txt, and $9L.txt lists respectively to get the URLs they contain from the relevant servers.  If GuerrillaBrowser detects filename conflicts while saving these items (not very common), it generates random names (see $ErrLog.txt).  Otherwise it saves those downloaded resources using the same filenames found in the URLs themselves (like http://xyz.com/a/sample1.mpg in the example above).
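
The naming rule can be sketched in a few lines of Python.  The function name local_name, the 8-hex-digit random name, and the handling of URLs that contain no filename are all assumptions for illustration; this guide doesn't document how GuerrillaBrowser builds its random names.

```python
import posixpath
import secrets
from urllib.parse import urlsplit

def local_name(url, taken):
    """Pick a filename for a download, avoiding names already in `taken`."""
    name = posixpath.basename(urlsplit(url).path)
    if name and name not in taken:
        return name                     # normal case: reuse the URL's filename
    # Conflict (or no filename in the URL): fall back to a random name that
    # keeps the extension.  The 8-hex-digit form is an assumption.
    ext = posixpath.splitext(name)[1]
    while True:
        candidate = secrets.token_hex(4) + ext
        if candidate not in taken:
            return candidate
```

So http://xyz.com/a/sample1.mpg is saved as sample1.mpg the first time, and a second sample1.mpg from another folder on the server gets a random name ending in .mpg instead of overwriting the first.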

GuerrillaBrowser only creates folders as needed, and you can use, reuse, or delete Cache Indexes in any order you like.  It does not automatically remove any folders except for the .\nn\nn_files\ folders as described for the Grab HTML command above.  GuerrillaBrowser has an option to automatically delete intermediate work files ($1.gz through $3U.txt) created during the Scrub HTML operation.


2. The User Interface


Layout

Because vertical space is often at a premium on typical computer monitors, GuerrillaBrowser tries to conserve it by dispensing with the customary status bar and menu bar (it would like to lose the horizontal scrollbar if it could!), and by using an abbreviated toolbar.  Status information shows up in the window title bar, the Results View in the main window, the Document View in the main window when no Map file is currently open, and possibly also in the $ErrLog.txt file and in popup messages where appropriate.  The main menu is accessed by right-clicking the mouse anywhere on the main window, and is subdivided into File, Action, Edit, and Options sections.

The GuerrillaBrowser user interface consists of a title bar, toolbar, and main window.  The toolbar contains 3 buttons pertaining to the document list, a spinner for entering the current Cache Index, an edit box for displaying or entering (Single-Index Mode only) its associated URL, a button for enabling/disabling Batch Mode (disables/enables the URL edit box), a Stop button, and 5 buttons for downloading files.

The toolbar buttons all have corresponding hot keys ("accelerators").  Apart from that, the (single) keyboard is shared between the spinner, edit box, and main window in the usual manner (i.e. by cycling among them using Tab or Shift+Tab).  The blinking keyboard cursor ("caret") indicates the currently active "child" window, but whether it shows up in the main window depends on whether a document is open and the selection point is scrolled into view.

Additionally, otherwise-wasted keystrokes cause automatic cycling of the "input focus" caret.  PageUp, PageDown, Ctrl+Home and Ctrl+End navigation keys cause the focus to switch from the toolbar to the main window.  Up and Down arrows entered in the URL edit box switch the focus to the Cache Index spinner, and number keys typed while the focus is in the main window do the same.  However, a disabled edit box (Batch Mode) or spinner (no $Cache.txt file open!) is ineligible to receive the input focus.

The main window can be toggled between the Results View and Document View by means of the Shift+F6 key.  The Document View can be used to display a series of Map files whose Cache Indexes are specified using the spinner prior to selecting the Open Map command.  The Open Batch command (Shift+F2) gets its list of Indexes from the current $Batch.txt file.  GuerrillaBrowser remembers which documents you opened in its session history navigator, operated by the Back and Forward commands (much like your conventional browser).  However, you can Close an individual document, causing it to be dropped from the session history (and leaving a temporary gap—Back/Forward causes normal session history traversal to resume).  The Results View shows a blank window when "closed", while the Document View contains a short status display.

The Cache Index spinner is a read-only control; it cannot paste text from the Windows clipboard (but can copy text to it).  It's operated by the number or Up/Down arrow keys on the keyboard, or by left-clicking the spinner's arrow buttons with the mouse.  The spinner is used to specify: (a) which document to open, (b) which Index to perform a Grab/Scrub on (Single-Index Mode only), or (c) where to start Autonumbering.  The URL edit box is a normal single-line Windows edit control (i.e. multi-line text pasted from the clipboard will only retain the first line).

The main window supports disjoint multiple selection (item-based, i.e. whole line), which is performed pretty conventionally.  Selections can be extended by using the Shift key or just dragging the mouse.  The Shift+F8 key toggles between Add Mode and Replace Mode (indicated by a change in the blinking caret's color).  It also toggles the default button between "Find All" and "Find Next" in the Find dialog.  Add Mode allows disjoint selection, while Replace Mode does not.  The Scroll Lock key causes the keyboard navigation keys to scroll the document without affecting any current selections.

The real power of making URL selections in the current document comes with the Save List command, which writes them to the $9L.txt list for that Cache Index, allowing you to pick and choose URLs for later downloading with the Grab List command.  Selections can also be copied to the clipboard (but not pasted from it—GuerrillaBrowser isn't a text editor!).

Summary of Commands

The various commands recognized by GuerrillaBrowser, and whether they can be accessed via the toolbar, popup (main) menu, or accelerator keystroke, are shown in the following chart:

    toolbar     menu  hotkey     command
    ----------  ----  ---------  ---------------------------------------
                      Alt+Home   [Go to beginning of document list]
    '<' button        Alt+Left   "Back" [to prior document]
    '>' button        Alt+Right  "Forward" [to next document]
                      Alt+End    [Go to end of document list]
    'O' button   Y    Ctrl+O     "Open Map"
                 Y    Ctrl+S     "Save List"
                 Y    F4         "Close" [Map or Results View]
                      Shift+F2   [Open every Map in current $Batch.txt file]
                      Shift+F4   [Close All (empty the document list)]
                      Shift+F6   [Toggle Document/Results View]
                      Shift+F8   [Toggle Add/Replace Mode]
    spinner                      [Choose a Cache Index]
    edit box                     [Assign an URL to current Cache Index]
    'B' button        Ctrl+B     "Batch Mode" [Single-Index Mode toggle]
    'X' button        Esc        "Stop" [Grab/Scrub command]
                 Y               "Autonumber"
    'H' button   Y    Ctrl+H     "Grab HTML"
                 Y    Ctrl+R     "Scrub HTML"
                 Y               "Thumb Scout"
    'T' button   Y    Ctrl+T     "Grab Thumbs"
    'P' button   Y    Ctrl+P     "Grab Pics"
    'M' button   Y    Ctrl+M     "Grab Movies"
    'L' button   Y    Ctrl+L     "Grab List"
                 Y    Ctrl+C     "Copy"
                 Y               "Find..."
                      F3         [Find next matching line]
                 Y               "Open Cache..."
                 Y    Ctrl+W     "Write Cache" [to disk]
                 Y               "Auto Scrub"
                 Y               "Options..."

Document View

As already described, there is no dialog box to Open a Map document.  The Open Map command simply loads the .\nn\$4.map file (assuming it's been created), where nn is the 2-digit Cache Index number currently displayed in the spinner, and adds it to the session history document list at its current position, marking the new end of the list.  The session history list operates pretty conventionally (as explained above in the "Layout" section), but there is an issue related to GuerrillaBrowser's memory management strategy.

Unlike a lot of bloatware, GuerrillaBrowser has been created with performance in mind, particularly in its use of your PC's memory.  (Windows theoretically implements "virtual memory"—paging RAM to/from disk—but if you've ever tried using the Paint program to load a really big bitmap on a machine with limited RAM, you may have left the room crying!)

HTML documents can be hundreds of KB in size themselves, and also "embed" hundreds of thumbnail images.  To conserve memory, GuerrillaBrowser only retains one document in memory at a time (the one currently viewed), but keeps track of the lines (URLs) you have selected in all the documents currently in the session history list.  Also, because resizing hundreds of thumbnails can take several seconds, GuerrillaBrowser tries to improve list navigation performance by caching those.

Meanwhile, there's nothing to stop you from deciding to replace the URL in cache index 27 (or whatever) with a different one (using the Grab HTML command), or just modifying the .\27\$4.map file with a text editor.  GuerrillaBrowser tries to track changes it makes to documents in its session history list, but you should Close document 27 if you plan to modify its $4.map or $5T.txt file from outside the program and don't want the Document View's selections and/or thumbs to be out-of-date.  Reopening the Map thereafter will show you the current data.

Results View

The Results View (or Document View) can be manually selected by using the Shift+F6 toggle key.  Also, where the Document View is automatically selected by the Open Map command, the Results View is automatically selected for any of the commands (Grab/Scrub/Autonumber) on the Action submenu, and remains selected until the command completes or is aborted (by the Stop command).

Each URL (only 1 in Single-Index Mode, up to 100 in Batch Mode) is displayed in gray when the operation begins, and changes to black when it completes successfully or red when it is unsuccessful.  The same mouse or keyboard selection and copy-to-clipboard rules (Add Mode, etc.) apply to the Results View as to the Document View.

Batch Mode

You can toggle between Batch Mode and Single-Index Mode by using the 'B' button on the toolbar.  When the button is down (Batch Mode active), the URL edit box is disabled and its text grayed (assuming the current Cache Index is used).  This is because in Batch Mode, if an URL is needed (as it is only for the Grab HTML command), it will be taken from the $Batch.txt file.

In Single-Index Mode ('B' button is up), the URL edit box can be used to input an URL prior to giving the Grab HTML command.  This is the only command that can modify the in-memory copy of $Cache.txt, by associating a new URL with a particular Cache Index.  The other Grab commands and Scrub command use the Cache Index only, along with the URL already assigned during Grab HTML.

GuerrillaBrowser accepts only "absolute" URLs (i.e. URLs that begin with the http: scheme) like http://xyz.com/Z/index.html, not "relative" URLs like Z/index.html.  It has no "search" function.  Entering nothing (not even spaces) in the URL edit box and choosing the Grab HTML command frees up that Cache Index, and deletes all the files in the .\nn\ and .\nn\nn_files\ subfolders (unless you rename those folders and thereby "hide" them).  The files do not go to the "recycle bin" or anything like that: they're gone!

The $Batch.txt file is just a means of entering several Cache Indexes (and possibly URLs) at one go, where the toolbar can only specify one at a time.  The format of the $Batch.txt file is identical to the $Cache.txt file, where each line contains 2 decimal digits, 1 space, and an optional absolute URL beginning at column 4, like so:

    85 http://xyz.com/Z/index.html
    03 http://www.GuerrillaBrowser.com/

Cache Indexes can occur in any order you like, but duplicate Indexes in the same $Batch.txt file will cause the earlier results to be replaced by the later ones.  (Duplicate Indexes in the $Cache.txt file are treated as an error.)  The URLs are only required for the Grab HTML command, and are ignored (if present) for the others.  However, a missing URL in a Grab HTML command has the same effect in Batch Mode as in Single-Index Mode, i.e. it requests Cache Index deletion, meaning you can delete a lot of files in a short time!  You should always use the Grab HTML command with care, particularly in Batch Mode (see also Novice Mode).
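
If you generate or check $Batch.txt files with a script, the line format is simple enough to parse by hand.  The Python sketch below is one way to do it; the function name parse_batch is invented for the example, and skipping blank lines is an assumption (this guide doesn't say how GuerrillaBrowser treats them).

```python
def parse_batch(text):
    """Parse $Batch.txt-style text into {index: url}; url is "" when absent."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue                    # assumption: blank lines are ignored
        index = line[:2]
        if len(index) != 2 or not (index.isascii() and index.isdigit()):
            raise ValueError(f"bad Cache Index in line: {line!r}")
        if len(line) > 2 and line[2] != " ":
            raise ValueError(f"expected a space in column 3: {line!r}")
        url = line[3:].strip()          # the optional URL starts at column 4
        entries[index] = url            # a later duplicate replaces an earlier one
    return entries
```

An entry whose URL comes back as an empty string is exactly the case to watch for before a Grab HTML, since it requests deletion of that Cache Index.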

The Action Commands

The commands on the Action submenu all apply to both Single-Index Mode and Batch Mode, with 2 exceptions: the Thumb Scout command (see next section), and the Autonumber command.  The purpose of Autonumber is to automate production of a usable $Batch.txt file, by assigning unused Cache Indexes to a list of URLs you supply.  First set the Cache Index spinner to the Index where you want the assignments to begin.  Then copy your list of URLs to a file named $URLs.txt in the main folder for the current GB Cache (the one containing $Cache.txt, $Batch.txt, etc.).  Choosing the Autonumber command will create a new $Batch.txt file (overwriting the previous one).

GuerrillaBrowser considers a Cache Index "unused" if its associated URL is missing in the in-memory copy of $Cache.txt (which you can check by moving the spinner to the desired Index and seeing if an URL appears in the edit box).  It does not consider whether you already have files stored in the .\nn\ subfolder for that Index.  Autonumbering skips past any Indexes already assigned an URL, and stops when the Index reaches 99 (i.e. it does not "wrap around" and continue at 00).
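
The assignment rule just described can be modeled in Python as follows.  This is only a sketch of the rule, not GuerrillaBrowser's code: the function name autonumber is invented, and what happens to URLs left over when Index 99 is reached is not stated in this guide, so the sketch simply drops them.

```python
def autonumber(urls, start, assigned):
    """Assign unused Cache Indexes to urls, beginning at Index `start`.

    `assigned` is the set of Indexes (as integers) that already have an URL
    in the in-memory copy of $Cache.txt.  Returns $Batch.txt-style lines.
    """
    lines = []
    index = start
    for url in urls:
        while index <= 99 and index in assigned:
            index += 1                  # skip Indexes already in use
        if index > 99:
            break                       # no wrap-around to 00
        lines.append(f"{index:02d} {url}")
        index += 1
    return lines
```

For example, starting at Index 5 with Index 6 already assigned, two URLs receive Indexes 05 and 07.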

Whether the other commands operate in Single-Index Mode (using the spinner and edit box) or Batch Mode (using the $Batch.txt file) depends on the state of the 'B' button on the toolbar.  As a reminder, when in Batch Mode a message box appears at the beginning to give you a chance to abort the command.  You can also quit at any time by hitting the Stop command ('X' button).  You're given the opportunity to abort altogether, or just skip the current Cache Index and continue (handy if you're in the middle of a long Batch and some web server starts giving you a hard time).

You can't Scrub HTML without first having used the Grab HTML command, because Scrubbing starts with the $0.raw file that Grab HTML produces.  In turn, Scrub HTML creates the $5T.txt, $6P.txt and $7M.txt lists used by the Grab Thumbs, Grab Pics, and Grab Movies commands respectively.  Grab List uses the $9L.txt list produced by the Save List command.  You could also create any of these files with a text editor—$6P.txt, $7M.txt, $8O.txt, $9L.txt and $URLs.txt all share the same format.

The $5T.txt list prepends a 4-digit "alias" to the front of each URL it contains, which acts as a translation table for the Document View. A line like 0003http://xyz.com/11/Q/sample1.jpg tells GuerrillaBrowser (and the GuerrillaViewer companion program) that the thumb associated with this URL can be found at .\nn\nn_files\0003.jpg.  You'll need to keep both the $4.map and $5T.txt files to see the Document View for a given Cache Index that has any thumbs.
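
Should you want to read a $5T.txt list from a script of your own, the translation looks roughly like this in Python.  The function name thumb_path is made up, and taking the file extension from the URL is an assumption based on the examples shown in this guide (0001.css, 0003.jpg).

```python
import posixpath
from urllib.parse import urlsplit

def thumb_path(line, cache_index):
    """Split a $5T.txt line into (url, local path of the aliased file)."""
    alias, url = line[:4], line[4:]
    if len(alias) != 4 or not alias.isdigit():
        raise ValueError(f"missing 4-digit alias: {line!r}")
    # Assumption: the local file keeps the extension found in the URL.
    ext = posixpath.splitext(urlsplit(url).path)[1]
    return url, f".\\{cache_index}\\{cache_index}_files\\{alias}{ext}"
```

Given the line 0003http://xyz.com/11/Q/sample1.jpg and Cache Index 05, this returns the original URL together with the path .\05\05_files\0003.jpg.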

The Thumb Scout Command

The Thumb Scout feature (available in Single-Index Mode only) is provided as a way to "synthesize" some thumbnail images for many link servers that lack them, by going straight to the content servers identified in those links.  There are a number of issues surrounding this command that you'll want to be aware of.

The biggest issue is that the Thumb Scout command violates GuerrillaBrowser's cardinal rule that it only requests the specific URLs you tell it to.  For example, when using the Grab Thumbs command, GuerrillaBrowser will request the URLs it finds in the $5T.txt list for that Cache Index, and only those URLs.  You can see what they are prior to giving the command by viewing the $5T.txt file.

The Thumb Scout command, however, is really just a macro for the Grab HTML, Scrub HTML, and Grab Thumbs commands.  That is, it works its way down the existing $4.map file for the given Cache Index, and executes those 3 commands for each eligible-looking link that doesn't already have an associated thumb.  It grows the existing $4.map and $5T.txt files, but it also downloads those thumbs as it goes because it typically needs the "Referer" link in order to get them.

When done, the old $4.map and $5T.txt files will be renamed to $4.bak and $5T.bak and the new files will contain the added thumbs.  GuerrillaBrowser will also try to grab thumbs listed in the $5T.bak file if they haven't already been downloaded.  Thereafter, both Grab Thumbs and Thumb Scout are disabled for that Cache Index (by the presence of the $4.bak file), since to reuse them would be to risk losing what you already got.

GuerrillaBrowser has to borrow the given Cache Index's folder as a work area, so it creates the subfolder .\nn\TS\ to store any files that might get stepped on.  It will normally clean up after itself on completion, but if it encounters a problem or you decide to cancel out of the Thumb Scout operation, you'll have to look in the temporary subfolder to get the files you started with.

Link servers that host thumbnail images usually have them set up for fast retrieval, and having to hunt around on a bunch of content servers for thumbs is nowhere near as efficient, so expect Thumb Scout to take longer than Grab Thumbs would.  For instance, if it takes Grab Thumbs half a minute to download 100 thumbs, it may take Thumb Scout several minutes to do the same.  If the $8O.txt list is 1000 lines long (contains that many URLs), then Thumb Scout may potentially find nearly that many additional thumbs, so that's enough time for a good, long coffee break.

Some link servers will be poor candidates for the Thumb Scout feature, and GuerrillaBrowser has no way to know which ones those are, but you may.  Some website authors baffle the heck out of their content, so that you have to click through several pages of fluff to get to the meat.  GuerrillaBrowser doesn't do automatic redirection, both for security reasons and because it's so often used just to jerk surfers around, so Thumb Scout will not add anything in situations like that.

If you're using GuerrillaBrowser to load $4.map lists over in the GuerrillaViewer companion program (by clicking on one of the images in GuerrillaBrowser's thumbs margin), GuerrillaViewer will pick up the new $4.map produced by the Thumb Scout command the next time you do that.  Otherwise, GuerrillaViewer has no way to know when its list files have been changed out from under it, so you'll need to use its Open List... command to manually reload a modified image list.

Error Messages

Computer program errors can be divided into 2 groups: those that are not rare (like file doesn't exist, server is busy, etc.), and those that are rare (like Windows can't provide a Device Context to refresh the display).  GuerrillaBrowser was given copious error-handling during its development, and it has been retained because that's better than having the program crash and leaving you, the user, wondering where it went.

Most of the error conditions you're likely to actually encounter have been provided with messages in what passes for English in the computerese world of Internet applications.  The remainder, should you ever happen to see any, may appear rather cryptic.  If you've had much experience using Windows, you've probably already developed strategies for coping with errant software—enough said.

Popup messages, however, are not all that suitable for something like Batch Mode because they tend to interrupt the flow, so GuerrillaBrowser also provides the $ErrLog.txt file (in each GB Cache's main folder) so that you can try to reconstruct how things went with each web server you accessed.  It contains the date & time, followed by a program location code & error code, followed by the Cache Index and URL that had a problem.  Examples of some code values you might see (and their meanings) would be:

    01=00000000  -  GB replaced URL you gave with one returned by server
    02=00000200  -  "200 OK" response from server (but something's funny)
    02=00000301  -  "301 Moved Permanently" response (redirect)
    02=00000302  -  "302 Found" response (redirect)
    02=00000307  -  "307 Temporary Redirect" response
    02=00000400  -  "400 Bad Request" response
    02=00000403  -  "403 Forbidden" response
    02=00000404  -  "404 Not Found" response
    02=00000504  -  "504 Gateway Timeout" response
    02=0000Cx60  -  response headers appear to be absent or mangled
    04=00000000  -  Scrubber ran out of memory while removing dup URLs
    05=00000000  -  invalid URL encountered during GB Cache open
    06=00000000  -  invalid URL encountered during GB Cache update
    07=00000000  -  server didn't return any data
    08=xxxxxxxx  -  download filename changed to "xxxxxxxx" to avoid dup
    09=0000C000  -  can't get an IP address for this server
    09=0000F2F7  -  server dropped the connection (10054:WSAECONNRESET)
    09=0000F9F0  -  server refused the connection (10061:WSAECONNREFUSED)
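
If you want to sift a long $ErrLog.txt with a script, the code values above can be picked apart as in the Python sketch below.  It handles only the "location=code" token, since the exact layout of the rest of a log line isn't spelled out here.  Reading location 02 values as decimal HTTP status codes is an inference from the table above, not a documented rule, and the function name describe_code is invented.

```python
import re

# Two-digit location code, "=", then an 8-character value.
CODE_RE = re.compile(r"^(\d{2})=(\S{8})$")

def describe_code(token):
    """Give a short description of a code token such as "02=00000404"."""
    match = CODE_RE.match(token)
    if not match:
        raise ValueError(f"not a location=code token: {token!r}")
    location, value = match.groups()
    # Inference from the table: location 02 carries the HTTP status in decimal.
    if location == "02" and value.isdigit():
        return f"HTTP status {int(value)}"
    return f"location {location}, code {value}"
```

Anything the sketch can't interpret is passed through unchanged, so you can still look it up in the table by eye.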


3. GuerrillaBrowser Options


The Auto Scrub Option

GuerrillaBrowser's operation is controlled by a combination of: (a) user interface elements (keystrokes, toolbar buttons, etc.), (b) work file contents, and (c) options settings.  The user interface has already been discussed, as have some of the details regarding how GuerrillaBrowser interacts with its work files—more follow.  Most of the options are set using the Options... dialog, with the exception of Cache Path specification (Open Cache... on the popup menu) and the Auto Scrub option (also on the popup menu).

GuerrillaBrowser does not "dechunk" a server response on-the-fly, so to speak.  The vast majority of HTTP messages containing multimedia types are already compressed in their native formats, and are of known length, and so require no special encoding for HTTP transmission.  If the HTTP headers look okay, GuerrillaBrowser simply strips them off and stores the message body to disk in a form suitable for your programs that use that media type.  If not, it stores the response exactly as received (replacing the 3rd character of the filename extension with "!" as a sign), and leaves it up to you to decide what you want to do next (see The DeChunk Utility).

HTML will normally be retrieved via the Grab HTML command, and is stored exactly as received under the name $0.raw in the given Cache Index's subfolder.  Whether you'll want to Scrub it or do anything at all depends on having actually gotten what you asked for.  As you may know from using your conventional browser, web servers don't always deliver what they seem to have promised (the infamous "bait-and-switch").

The URL stored for each Cache Index in $Cache.txt has to do triple-duty: (1) as the Request-URI in an HTTP GET when you Grab HTML, (2) as the Referer: when you Grab anything else (see Smart Mode), and (3) as the "base URL" used to resolve any relative URLs (like ../../home.html) when you Scrub HTML.  As mentioned, GuerrillaBrowser won't automatically download redirected URLs, but it will place the substitute URL in its in-memory copy of $Cache.txt in case you decide to Scrub that $0.raw file.
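
If you're curious what "resolving" a relative URL against a base URL amounts to, here is a minimal sketch in Python.  It is not GuerrillaBrowser's own code; the base URL and the resolve() helper are made up for illustration, and the work is done by the standard library's urljoin function.

```python
from urllib.parse import urljoin

# Hypothetical base URL, standing in for the one stored for a Cache Index.
base_url = "http://xyz.com/a/b/c/page.html"

def resolve(relative_url, base=base_url):
    """Resolve a relative URL against the base; absolute URLs pass through."""
    return urljoin(base, relative_url)

print(resolve("../../home.html"))             # climbs two folders up from /a/b/c/
print(resolve("pics/1.jpg"))                  # stays inside /a/b/c/
print(resolve("http://other.example/x.gif"))  # already absolute, left unchanged
```

If the base URL is wrong (say, the pre-redirect one), every relative link resolves to the wrong place, which is why the substitute URL matters when you Scrub.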

Setting the Auto Scrub option on the menu causes each Grab HTML to be immediately followed by a Scrub HTML.  Whether that's handy depends mainly on the likelihood that you got what you wanted.  With your conventional browser, the substituted page may just appear in its window (Hey...what's this?!).  With GuerrillaBrowser, you'll have to inspect the $0.raw file.  Generally, a small file size (less than 1KB) indicates a redirect—maybe you got stiffed, or maybe that resource genuinely moved to another location.
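
Since $0.raw holds the response exactly as received, its first few lines will tell you whether you were redirected.  The following Python sketch shows one way to check; the inspect_raw_response() helper and the sample bytes are hypothetical, and it assumes the stored data begins with an ordinary HTTP status line and headers.

```python
def inspect_raw_response(raw):
    """Return (status_code, location) from the bytes of a stored HTTP response."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)            # e.g. ["HTTP/1.1", "302", "Found"]
    status = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
    location = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            location = value.strip()
    return status, location

# Made-up sample of what a redirect might look like on disk.
sample = (b"HTTP/1.1 302 Found\r\n"
          b"Location: http://xyz.com/moved/index.html\r\n"
          b"Content-Length: 0\r\n\r\n")

status, location = inspect_raw_response(sample)
if status is not None and 300 <= status < 400:
    print("redirected to", location)
```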

In Single-Index Mode, if the redirected URL (which replaces the URL you gave in the URL edit box) looks okay, you only have to Grab HTML again to retrieve it.  In Batch Mode, the $Batch.txt file will still contain the Indexes that came back okay mixed in with the ones that didn't, and so will need to be edited in order to do a follow-up.  You may find Single-Index Mode better suited to troublesome servers and Batch Mode to the more reliable ones.

Smart Mode

GuerrillaBrowser can use a template file called $ReqHdrs.txt (stored in the same folder as GB.exe and GB.ini) to govern what it sends to an HTTP server during a request.  For any Grab operation with Smart Mode enabled, GuerrillaBrowser uses what it read from $ReqHdrs.txt as a pattern and adds or changes a few key values to customize the request: the Request-URI in the Request-Line (1st line), the Host: header (allows servers to share an IP address), and the Referer: header.  Other headers (User-Agent:, Cookie:, etc.) appearing in $ReqHdrs.txt will be sent unchanged.
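
The template idea can be sketched in a few lines of Python.  This is only an illustration of the general technique, not GuerrillaBrowser's actual code: the template text, the header values, and the build_request() helper are all assumptions, and the real $ReqHdrs.txt may be laid out differently.

```python
from urllib.parse import urlsplit

# Hypothetical template, standing in for the contents of $ReqHdrs.txt.
TEMPLATE = (
    "GET / HTTP/1.1\r\n"
    "Host: placeholder\r\n"
    "User-Agent: ExampleAgent/1.0\r\n"
    "Referer: placeholder\r\n"
    "Connection: close\r\n"
    "\r\n"
)

def build_request(template, url, referer=None):
    """Fill in the Request-URI, Host: and Referer:; pass other lines through."""
    parts = urlsplit(url)
    request_uri = parts.path or "/"
    if parts.query:
        request_uri += "?" + parts.query
    lines = template.split("\r\n")
    method, _, version = lines[0].split(" ", 2)
    out = [f"{method} {request_uri} {version}"]
    for line in lines[1:]:
        name = line.partition(":")[0].strip().lower()
        if name == "host":
            out.append("Host: " + parts.netloc)
        elif name == "referer":
            if referer:                      # no referer given: drop the header
                out.append("Referer: " + referer)
        else:
            out.append(line)                 # User-Agent:, Cookie:, etc. unchanged
    return "\r\n".join(out)

print(build_request(TEMPLATE, "http://xyz.com/pics/1.jpg",
                    referer="http://xyz.com/index.html"))
```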

The Referer: header has sometimes been described as a risk to privacy, but the argument is a little hard to follow.  As a practical matter, it is often used as a sort of password to prevent "hotlinking" (i.e. stealing bandwidth by using files stored on somebody else's server), so to request any resources without it in those situations is to come up empty-handed.

(As an example of GuerrillaBrowser's aftermarket workarounds, the use of Referer: headers can be suppressed in at least two ways.  One would be to store a file like $6P.txt in an unused Cache Index's subfolder and use Grab Pics, so that the Index's associated URL is missing and therefore so is the Referer.  Another would be using Grab HTML (which doesn't use its own URL as a Referer!) to retrieve an URL like http://xyz.com/1.jpg and converting the resulting $0.raw to a JPEG using the DeChunk Utility.  You get the idea.)

Similar to the situation with the $Cache.txt file, GuerrillaBrowser uses its in-memory copy of $ReqHdrs.txt unless you tell it to reload the file.  You do that by either quitting and restarting the Browser, or by using the Options... dialog to choose Smart Mode (or verify that it's already chosen) and clicking OK rather than Cancel.

Verbatim Mode

Smart Mode has a few limitations: it will only do an HTTP GET (not ordinarily a problem), line lengths (and therefore URLs) are limited to around 500 characters (ditto), and so on.  To disable Smart Mode in the Options... dialog is to enable Verbatim Mode.  In Verbatim Mode, the contents of $ReqHdrs.txt are sent byte-for-byte to the server, and its byte-for-byte response is stored in the current Cache Index's subfolder as $0.raw.

Also, unlike in Smart Mode, the $ReqHdrs.txt file is re-read for every Grab.  What this means is that Verbatim Mode only supports Single-Index Mode (there's no practical way to edit $ReqHdrs.txt in-between Indexes in Batch Mode!) and the Grab HTML command (same reason—don't worry, GuerrillaBrowser has no way of knowing whether the data you're sending to the server asks for HTML or whatever).

In Verbatim Mode, you will still need to enter an URL in the edit box, but GuerrillaBrowser only makes use of the "authority" portion (host name and port number, like "xyz.com:80" in the URL http://xyz.com:80/index.html) and ignores the rest (but you still have to enter the "http://" at the beginning!).  GuerrillaBrowser makes a TCP connection to that server at that port (defaults to HTTP's port 80 if absent), sends the contents of $ReqHdrs.txt and saves the response to disk.
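
In outline, the exchange described above looks something like the following Python sketch.  The authority() and send_verbatim() helpers are hypothetical names, and this is only a rough model of the behavior, not GuerrillaBrowser's implementation.  Note that it reads until the server closes the connection, so the request you send should ask for that (e.g. with a Connection: close header) or the read will simply wait until it times out.

```python
import socket
from urllib.parse import urlsplit

def authority(url):
    """Return (host, port) from a URL, defaulting to HTTP's port 80."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError("expected a URL beginning with http://")
    return parts.hostname, parts.port or 80

def send_verbatim(url, request_bytes, timeout=30.0):
    """Connect to the URL's host and port, send the bytes, return the reply."""
    host, port = authority(url)
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request_bytes)          # sent byte-for-byte, no editing
        while True:
            data = sock.recv(4096)
            if not data:                     # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Everything after the authority is ignored:
print(authority("http://xyz.com:80/index.html"))

# Example use (needs a network connection, so it is left commented out):
# request = open("$ReqHdrs.txt", "rb").read()
# open("$0.raw", "wb").write(send_verbatim("http://xyz.com/", request))
```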

Verbatim Mode gives you a way to send URLs of arbitrary length, perform HTTP HEAD or POST operations, use a proxy server, and do other exotic things.  It's not actually even limited to HTTP, but could be used for any application protocol that uses a TCP connection and a simple, one-shot, request/response exchange.  (But GuerrillaBrowser doesn't support SSL, so still no https:!)

At a more practical level, it gives you a way to overcome GuerrillaBrowser's inability to customize Cookie: headers in Smart Mode, and return a server's cookies in situations where you think it might treat you better if you did.  (That could also be done in Smart Mode, but less conveniently because $ReqHdrs.txt isn't automatically reloaded.)

For web servers that use the User-Agent: header to identify your browser, either Smart or Verbatim Mode is suitable for doing the "15-second browser upgrade" (i.e. changing that header's value in the $ReqHdrs.txt file).  The User-Agent: header is also handy for retrieving HTML compatible with your conventional browser from servers that customize what they send.

It should be obvious that to expect anything useful back from an HTTP server, you'll need to send a properly-formed HTTP request.  It's an easy protocol to learn, though—there are many resources on the web that can instruct you and you already have a great tool (GuerrillaBrowser) for investigating its behind-the-scenes operation.  The more you know about the web's underlying machinery, the better your chances of getting what you want from your surfing.

The Use Proxy Option

This option causes GuerrillaBrowser to make all requests via a proxy server.  Use the two edit boxes to the right of the checkbox to enter the IP address (e.g. 192.76.71.99, in the left-hand box) and port number (e.g. 80, in the right-hand box) of the proxy server you want.  If you omit the port number, HTTP's default port 80 will be used.

Proxy servers can be kind of transient.  If GuerrillaBrowser seems to have lost all ability to get anything from the Internet, try unchecking this option or choosing the address and port of a different server.

When using a proxy server in Verbatim Mode, you'll still need to enter a valid-looking URL in the URL edit box on the toolbar to keep the GB Cache happy, but GuerrillaBrowser will connect to the proxy server you specified in the GuerrillaBrowser Options dialog box.

Novice Mode

With Novice Mode enabled, GuerrillaBrowser issues a lot more warnings for situations where files could be deleted or replaced.  With it disabled, it pretty much assumes you know what you're doing.  You may want to try Novice Mode while you're familiarizing yourself with GuerrillaBrowser's sometimes unconventional operation.  If you're the kind of person who seldom misclicks with the mouse (or whatever) and find the reminders annoying, you can always disable Novice Mode.

The Collapse URLs Option

As already mentioned, GuerrillaBrowser deals only in absolute URLs.  During Scrub HTML, it resolves whatever relative URLs it finds in the $2.htm file and saves them in $3U.txt (the URL master list) in their absolute (but otherwise unchanged) form.

URLs sometimes contain embedded URLs, like:

    http://xyz.com/Q/index.php?id=8&url=http://www.GuerrillaBrowser.com/

In fact, URLs can contain embedded URLs that contain still other URLs.  The Collapse URLs Option tells the Scrubber to try to extract the ultimate URL where possible (sometimes these things get pretty gnarly!), and use it in the $4.map, $5T.txt, and other lists further down the chain.  The original form still appears in $3U.txt if you need to refer to it.
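
As a rough illustration of what "collapsing" means, the Python sketch below pulls out the innermost embedded URL.  It is a deliberately naive guess at the technique, not the Scrubber's actual logic: the collapse_url() helper is hypothetical, it only understands plain and percent-encoded "http://" markers, and any trailing parameters that really belong to the outer URL will stay attached to the result.

```python
import re
from urllib.parse import unquote

# Matches "http://" either as-is or percent-encoded ("http%3A%2F%2F").
EMBEDDED = re.compile(r"http(?::|%3A)(?:/|%2F){2}", re.IGNORECASE)

def collapse_url(url):
    """Return the innermost embedded URL, or the URL itself if none is found."""
    starts = [m.start() for m in EMBEDDED.finditer(url)]
    if not starts or starts[-1] == 0:
        return url                           # nothing embedded
    tail = url[starts[-1]:]
    if tail[4:5] == "%":                     # embedded URL was percent-encoded
        tail = unquote(tail)
    return tail

print(collapse_url("http://xyz.com/Q/index.php?id=8&url=http://www.GuerrillaBrowser.com/"))
```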

You may find the $3U.txt file useful for another reason, namely that HTML sometimes associates two URLs with the same nominal link.  The Scrubber picks one to go on to the $4.map and list files, but records both in $3U.txt.  Also, the Scrubber removes duplicate (identical) URLs from the later files, but retains them all in the $3U.txt file.

(The duplicate removal process is memory-intensive, and can break down for HTML files larger than several hundred KB.  Splitting very large files into more than one piece, using separate Cache Indexes with the same base URL, is one way to get past this limitation.  The Scrubber couldn't care less whether its $2.htm input file begins with an <HTML> tag or somewhere in the middle.  However, the multiple $4.map and $8O.txt output files could still duplicate each other's URLs.)

The Delete Intermediate Workfiles Option

Checking this option in the Options... dialog causes the files between $0.raw and $4.map to be automatically deleted on successful completion of a Scrub.  The $1.gz file (if present) is not ordinarily useful except when GuerrillaBrowser can't recognize its format, in which case the Scrub stops there.  In such cases, you may be able to identify the problem using the DeChunk Utility, find a suitable decoder for the file, rename its output to $2.htm, and continue the Scrub.

You might want to use the $2.htm file as a starting point if you plan to modify the HTML returned by the server, as it has the nuisance "chunks" (and possibly "deflation") removed but is otherwise unchanged.  The potential value of the $3U.txt file is discussed immediately above.

The Create "Detoxed" HTML Option

Things like "active content" and plugins can be used to do all sorts of gee-whiz stuff on your computer, or to completely destroy it.  It all depends on what the server sends you.  Unfortunately, no (inexpensive) computer program can reliably guess the intent of arbitrary code—that's a job that can prove tedious even for expert programmers.  Your conventional browser may have options to selectively enable/disable some of those things, placing you somewhat less at the mercy of persons unknown.

GuerrillaBrowser's approach is to abandon such mechanisms altogether.  It does, however, have a way to give you a head-start on removing the bad stuff and keeping the good stuff, should you wish to try your conventional browser on some content that's otherwise unsafe.  Choosing the Create "Detoxed" HTML Option causes the Scrubber to remove whatever looks problematic from the $2.htm file to produce an .\nn\nn.htm file (where nn is the 2-digit Index).  It also adjusts thumbs links to refer to the files in the .\nn\nn_files\ folder that you can download using $5T.txt and Grab Thumbs.

The resulting "detoxed" HTML may give you a safer way to see a more conventional rendering with your other browser, but you should not regard it as certifiably safe and should take additional precautions.  Temporarily disconnecting from the Internet (either unplugging the cable or using your firewall if it supports that) can absolutely prevent automatic linking and the resulting cascade of unwanted downloads.  Disabling scripting in your browser's option settings may prove sufficient to protect you from the "active content" threat (or may not, but it's worth a try).

In practice, many websites are fine and can be accessed with little risk using your conventional browser.  For the rest, there's GuerrillaBrowser!


4. The GuerrillaViewer Program


GuerrillaBrowser comes with a FREE (beta) copy of its companion slideshow program, GuerrillaViewer.  Even if you already have an image viewer, you may still be interested in this one because GuerrillaViewer and GuerrillaBrowser can talk to each other!

GuerrillaViewer knows how to use the image-related GB work files ($4.map, $5T.txt, $6P.txt) to load and display the corresponding images from your GB Cache (assuming they've been downloaded).  In addition, whereas GuerrillaBrowser displays its thumbs in a standardized size, loading the same $4.map file into GuerrillaViewer allows you to see them at a size of your choosing, and to make selections in the GuerrillaBrowser Document View remotely.

As in GuerrillaBrowser, GuerrillaViewer's main menu is accessed by clicking with the right mouse button on the main window.  The Open... command allows you to specify a single image by name.  The Open List... command allows you to specify a GB list file, or a file you create containing a list of complete paths (like C:\My Documents\pic1.jpg or whatever).  Entering the keyword "auto" in this dialog causes GuerrillaViewer to build a list from whatever images it can find in the currently displayed folder.  The Close All command unloads both the individual image and the image list.

The Fit Option determines whether images are displayed at their native size or resized to best fit GuerrillaViewer's window.  If you've Opened a List, it can be traversed using the following keystrokes:

    Alt+Home   -  First image
    Alt+Left   -  Prior image (also Shift+Spacebar)
    Alt+Right  -  Next image (also Spacebar)
    Alt+End    -  Last image

In addition, pressing Ctrl+Fn while an image is displayed will store its path for quick recall whenever the same Function key (F1 through F12) is pressed without using the Ctrl key.  If an image list is loaded, its traversal can be resumed after opening single images or Function key images—just press the spacebar (or whatever).

If GV and GB have the same $4.map file loaded (and GB isn't minimized or switched to Results View), GB will track your progress through the thumbs list in its own window as you traverse the list in GuerrillaViewer.  Double-clicking an image in GV with the left mouse button will cause its associated URL to be selected (if GB is in Replace Mode) or have its selection state toggled (if GB is in Add Mode), just like pressing the spacebar when you're in GuerrillaBrowser.

If you switch back to GuerrillaBrowser, left-clicking over in the thumbs margin will cause GV's list to advance to the corresponding image.  Changing Map documents in GB and left-clicking a thumb will cause GV to automatically load the new $4.map file, and GB will even automatically start its companion Viewer (the reverse, however, is not true).  When running multiple instances of the Browser and/or Viewer, they will pair up so that each talks to only one "buddy" program.


5. The DeChunk Utility


GuerrillaBrowser also comes with a utility program called DeChunk, an HTTP response "stripper" that can display a few hundred bytes from the beginning of any file.  As already mentioned, GuerrillaBrowser only "dechunks" a web server's response during a Scrub, not during a Grab.  The idea is that Grab HTML will normally be followed by Scrub HTML, and that Grab anything else won't ordinarily receive any response that needs dechunking.

If GuerrillaBrowser encounters anything that doesn't look right, however, it will store the server's response exactly as received so that you can use the HTTP headers to help you resolve the problem.  HTML is always saved this way, as $0.raw during a Grab HTML or as something like index.ht! during a Grab anything else.  Another situation would be a mismatch in media type.

It's not too uncommon to ask for 1.jpg and get back HTML instead (the "switcheroo").  GuerrillaBrowser would flag this as 1.jp!, but it isn't a JPEG at all—which you can use DeChunk to determine.  Occasionally, you may see something like an alleged ".jpg" that's actually a GIF (or vice versa, or whatever).  Some image programs have problems with those kinds of things, so GuerrillaBrowser leaves it up to you whether to rename the file to match its actual media type or to take some other measure.

In theory, the Content-Type: HTTP header is supposed to be the definitive source of information on an object's media type, but in practice it's often no more accurate than the filename extension is.  A file's type is really determined by its format, which controls how text, color, sound, or whatever gets converted into numbers (the only thing computers "understand") and back again.  Various encoding schemes (like US-ASCII for text) control this conversion, and using one to encode (like MPEG) and a different one to decode (like PNG) is guaranteed to produce a complete mess!

A convention has arisen of placing special values at or near the beginning of many types of files (called "magic numbers") to help application programs to identify their format.  GuerrillaBrowser (like your conventional browser) uses a small table of this kind of information to auto-identify media types, but should it encounter problems you can use DeChunk to do a follow-up.  (Unfortunately, there's no central authority governing the use of these flag values, so the situation is sort of a mish-mash and making an accurate guess is as much art as it is science—do an Internet search on "magic numbers" to learn more.)
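
To give you a feel for how this kind of auto-identification works, here is a small Python sketch.  The table below is NOT GuerrillaBrowser's own table (which isn't documented here); it just lists a handful of widely known signatures, and the sniff() helper is a hypothetical name.

```python
# A few well-known "magic numbers" found at the very start of a file.
MAGIC = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\x1f\x8b", "gzip"),
]

def sniff(data):
    """Guess a media type from the first bytes of a file, or return None."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    if data.startswith(b"HTTP/"):
        return "HTTP response (headers still attached)"
    stripped = data.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "HTML"
    return None                              # no match: time for a closer look

print(sniff(b"GIF89a\x01\x00\x01\x00"))      # an alleged ".jpg" that's really a GIF
print(sniff(b"<html><body>Sorry!</body></html>"))
```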

DeChunk's user interface is very similar to GuerrillaBrowser's, with a Document View and a Results View.  Its document list is populated using the Open... command and traversed the same way as in GB and GV.  Each group of 80 bytes from the beginning of a loaded file is displayed over 3 lines: 1 line showing an ASCII conversion (probably gibberish for non-text files), and 2 lines showing an over/under hex dump, meaning that each byte (hexadecimal value 00..FF) occupies a single column, with its high-order hex digit on the upper line and its low-order hex digit directly beneath it, e.g.

    mailto:webmaster@123.com
    666676376666776743332666
    D19C4FA752D134520123E3FD

Printable ASCII values (hex 20..7E) are shown in black; the rest (hex 00..1F and 7F..FF) are shown in color, with the lower range converted to their control-character "letter" equivalents (e.g. ASCII linefeed, value 0A, shows up as letter 'J' for Ctrl+J).  However, if DeChunk detects what appear to be HTTP response headers at the start of the file, it displays those line-by-line without the hex dump for easier readability.  Also, they're implicitly "selected" and can be copied to the clipboard for pasting into your text editor.
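
The over/under layout is easy to reproduce if you want to experiment with it.  The Python sketch below is a guess at the technique rather than DeChunk's own code; the over_under() helper is hypothetical, colors are ignored, and showing bytes in the 7F..FF range as "." is purely an assumption.

```python
def over_under(data):
    """Return (text, high_digits, low_digits) for up to 80 bytes of data."""
    chunk = data[:80]
    text = []
    for b in chunk:
        if 0x20 <= b <= 0x7E:
            text.append(chr(b))              # printable ASCII shown as-is
        elif b < 0x20:
            text.append(chr(b + 0x40))       # control char as its Ctrl+letter
        else:
            text.append(".")                 # assumption: placeholder for 7F..FF
    high = "".join("%X" % (b >> 4) for b in chunk)
    low = "".join("%X" % (b & 0x0F) for b in chunk)
    return "".join(text), high, low

for line in over_under(b"mailto:webmaster@123.com"):
    print(line)
```

Reading down any column gives you the byte's hex value: the first column (m over 6 over D) is hex 6D.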

Choosing the Save command will cause the loaded file to be dechunked to a new output file, just like the GuerrillaBrowser Scrub would do.  The Run/Quit command does the same thing for every file currently in the document list, allowing you to load and strip an entire batch of HTTP responses at one go.
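
For the curious, "dechunking" means undoing HTTP's chunked transfer coding, in which the message body arrives as a series of pieces, each preceded by its length in hexadecimal.  The Python sketch below shows the general idea; it is not DeChunk's actual code, the dechunk() and strip_response() helpers are hypothetical, and it makes no attempt to handle compressed ("deflated") bodies or malformed input.

```python
def dechunk(body):
    """Decode a chunked HTTP message body into the original bytes."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # The chunk size is in hex; anything after ";" is a chunk extension.
        size_field = body[pos:eol].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:                        # a zero-length chunk ends the body
            break
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2               # skip the data and its trailing CRLF
    return bytes(out)

def strip_response(raw):
    """Strip the headers from a stored response, dechunking the body if needed."""
    head, _, body = raw.partition(b"\r\n\r\n")
    chunked = False
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"transfer-encoding" and b"chunked" in value.lower():
            chunked = True
    return dechunk(body) if chunked else body

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Transfer-Encoding: chunked\r\n\r\n"
       b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
print(strip_response(raw))
```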