Notice: You should use GuerrillaBrowser and any materials you retrieve from the web in accordance with all applicable (copyright, etc.) law. Misbehavior could mean banishment from some websites, loss of Internet service, or even legal action. The Internet is a communal resource, so play nice!
GuerrillaBrowser isn't an image editor, text editor, or file manager. Odds are you already have programs that do those jobs; use them.
Likewise, GuerrillaBrowser isn't a conventional browser, nor does it dream of becoming one when it grows up. The two main reasons why GuerrillaBrowser doesn't do many of the things you may take for granted with your conventional browser are: (1) some of them are risky on a non-secure network like the Internet, and (2) trying to be all things to all people tends to produce an unwieldy, overbloated mess.
GuerrillaBrowser intentionally contains NO script engine. It doesn't even render HTML in the normal manner. As a result, web pages designed with conventional browsers in mind may not display everything you'll want to see in GuerrillaBrowser's streamlined user interface. A text editor should allow you to safely view HTML and even script if you can learn to look past the (markup) clutter. (For an alternative, see the "detoxed" HTML feature.)
GuerrillaBrowser also contains no support for secure connections, meaning it
won't retrieve URLs which begin with the https: protocol scheme. Face it: if
you can't safely use your conventional browser to do online banking, you
probably need a better bank even worse than you need a better browser!
GuerrillaBrowser is simply intended to give you an inexpensive way to protect yourself on the dangerous Internet, by giving you direct control over your interaction with web servers. It can't force web servers to do anything (any more than your conventional browser can), but it provides few if any of the mechanisms commonly used by servers to hijack your PC. If you've ever had your PC trashed by hackers, you'll know what that's worth.
GuerrillaBrowser comes packaged in a ZIP file. To install it, simply save that file in whatever folder you like, and then unzip it (e.g. by telling Windows Explorer to "Extract All..."). You should see the following components:
GB.exe - GuerrillaBrowser program
GB.ini - GuerrillaBrowser configuration file
GBReadMe.htm - GuerrillaBrowser User Guide (this file)
$ReqHdrs.txt - sample HTTP request template (see Smart/Verbatim Mode)
GV.exe - beta version of GuerrillaViewer companion program
DeChunk.exe - utility program for stripping HTTP responses
GuerrillaBrowser doesn't modify the Windows Registry during installation and can be "uninstalled" at any time by just deleting the above files.
GuerrillaBrowser is designed only to download files from Internet servers. Unlike your conventional browser, it is not designed to provide any kind of platform to web servers to execute any tasks, good or bad, on your PC.
URLs (web addresses) are the key to the Internet. GuerrillaBrowser tries to download ("grab") whatever URLs you give it (individually or in a list). It can also parse ("scrub") downloaded HTML for more URLs, sort them into lists, and produce a document called a "map" that associates these URLs with their descriptive text and thumbnail images in a regularized, streamlined way. The GuerrillaBrowser user interface displays these Map files much like your conventional browser displays HTML and other documents.
GuerrillaBrowser requests only the URLs you give it, meaning it doesn't automatically download "frames" or other embedded objects or follow URL redirects from servers. Sometimes these other files are something you actually want, sometimes not, but with GuerrillaBrowser the choice to retrieve these is left up to you.
GuerrillaBrowser has a Batch Mode that processes a list of URLs, or even a list of lists, meaning it can download literally hundreds of files at one go. It can't increase your Internet bandwidth, but it can reduce the amount of flailing around otherwise needed to retrieve these files, which can minimize the amount of time you have to spend connected to the dangerous Internet.
GuerrillaBrowser originally stores all downloaded files in its cache, which is simply a folder that you specify. Thereafter, you can move them about as you please. This is to protect you from having important files elsewhere on your computer maliciously overwritten (a tactic popular with hackers).
The GB Cache is not a "cache" in the HTTP sense (i.e. a place to retrieve content without having to go to the original server), just a place to store downloaded files and work files. GuerrillaBrowser always goes to the server when you tell it to Grab files, and never goes to the server when you tell it to open a Map file. It makes no effort to track whether files are "fresh" or locally available or other HTTP caching issues.
Each instance of GuerrillaBrowser you run can access only one GB Cache at a time, but you can create as many GB Caches as you like. Each GB Cache is subdivided into 100 folders, numbered 00 through 99. You enter this "cache index" into the Cache Index spinner on the toolbar to tell GuerrillaBrowser which subfolder you want to operate on.
GuerrillaBrowser tracks usage of each GB Cache in a file called $Cache.txt,
located in the main folder for that cache. You "open" a GB Cache (and
automatically close the prior one) by entering the path to one of these
$Cache.txt files in the Open Cache... dialog. If the $Cache.txt file doesn't
exist there, GuerrillaBrowser creates an empty GB Cache at that path.
GuerrillaBrowser reads the $Cache.txt file into memory, and uses that
in-memory copy to track any changes you make to that cache (by means of the
Grab HTML command). To update the disk copy, use the Write Cache command, or
open a different GB Cache, or quit GuerrillaBrowser. If GuerrillaBrowser
hasn't recorded any changes to the in-memory copy, it won't bother to rewrite
the disk copy.
You should consider the GB Cache to be a work area. Some operations delete files in the current GB Cache, including downloaded files, so you'll want to back up any important files soon after you get them. (See also Novice Mode.)
GuerrillaBrowser creates a number of work files to keep track of what's going
on in the GB Cache and the various Cache Index subfolders. Their filenames
begin with "$" to help separate them from downloaded files, and their
filename extensions generally reflect their type. Most of them are plain
US-ASCII text files, so that you can exercise some aftermarket control over
GuerrillaBrowser's operation (of the kind not usually available with
bloatware).
A typical GB Cache might contain files like the following (where the ".\"
below represents the current GB Cache path):
.\$Cache.txt - GB Cache Index/URL record file
.\$ErrLog.txt - error log file
.\$Batch.txt - list of Cache Indexes (see Batch Mode)
.\$URLs.txt - list of URLs (see Autonumber command)
.\scratch\ - any subfolder you create
.\05\$0.raw - raw HTTP response from web server
.\05\$2.htm - "dechunked" HTML created from $0.raw
.\05\$3U.txt - preliminary list of URLs from $2.htm
.\05\$4.map - GB document file for Cache Index 05
.\05\$5T.txt - thumbs download/alias list
.\05\$6P.txt - pictures download list
.\05\$7M.txt - movies download list
.\05\$8O.txt - other media types URL list
.\05\$9L.txt - arbitrary download list (see Save List)
.\05\tmp\ - any subfolder you create
.\05\05.htm - "detoxed" HTML created from $2.htm
.\05\05_files\0001.css - downloaded stylesheet
.\05\05_files\0002.gif - downloaded image
.\05\05_files\0003.jpg - downloaded thumbnail image (like sample1.jpg)
.\05\sample1.mpg - downloaded movie file
GuerrillaBrowser divides web servers into two classes: link servers and content servers. (The division is purely conceptual; many servers will perform both roles.) HTML from content servers is expected to contain links to downloadable content (images, multimedia, etc.). HTML from link servers is expected to contain links to content servers and other link servers.
If the $0.raw file in Cache Index 05 above came from a content server, most
of the links it contains would likely show up in the $6P.txt and/or $7M.txt
lists, assuming GuerrillaBrowser recognizes their media types. If $0.raw
came from a link server, most or all of its links would probably go to the
$8O.txt list. They could be assigned to other Cache Indexes and used to
retrieve additional $0.raw files (analogous to clicking on a link to another
web page in your conventional browser).
When you tell GuerrillaBrowser to Grab HTML for Cache Index 05 (or whatever),
it first deletes all files in the .\05\ and .\05\05_files\ subfolders, and
updates the in-memory copy of the $Cache.txt record with the URL you gave
for Index 05. Files in folders that you created yourself (like .\scratch\ and
.\05\tmp\ in the example above) are not disturbed. GuerrillaBrowser then
saves the response exactly as received from the server in the $0.raw file.
The $0.raw file usually contains HTTP headers followed by a possibly-encoded
message body. When you tell GuerrillaBrowser to Scrub HTML for Index 05 (or
see Auto Scrub), it removes the headers, "chunked" Transfer-Encoding, and
the "deflate" (RFC 1951) compression commonly used in the "gzip"
Content-Encoding, producing the cleaned-up $2.htm file.
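For the curious, the Scrub steps just described can be approximated in a few lines of Python. This is only a sketch of the general technique (split off the headers, undo the chunking, then gunzip), not GuerrillaBrowser's actual code. The function names are made up for illustration, it assumes a well-formed response, and it handles only the "gzip" Content-Encoding.

```python
import gzip


def split_headers(raw):
    """Split a raw HTTP response into (lowercased header dict, body bytes)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        key = name.strip().lower().decode("latin-1")
        headers[key] = value.strip().decode("latin-1")
    return headers, body


def dechunk(body):
    """Undo "chunked" Transfer-Encoding: hex size line, data, CRLF, repeated."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol].split(b";")[0], 16)  # ignore chunk extensions
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends each chunk
    return bytes(out)


def scrub_body(raw):
    """Return the decoded message body of a raw response such as $0.raw."""
    headers, body = split_headers(raw)
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = dechunk(body)
    if "gzip" in headers.get("content-encoding", "").lower():
        body = gzip.decompress(body)
    return body
```

Reading a $0.raw file in binary mode and passing its bytes to scrub_body() should yield roughly what ends up in $2.htm, assuming the server's response was intact.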
Grab Thumbs uses the $5T.txt list to download and rename thumbnail images and
stylesheets (automatically given aliased names because conflicting filenames
like http://xyz.com/a/1.jpg and http://xyz.com/b/1.jpg are not uncommon).
Open Map uses $4.map and $5T.txt to display the URLs found in $2.htm,
along with associated information. You can use the Save List command to store
selected URLs from the Map in the $9L.txt list.
Grab Pics, Grab Movies, and Grab List use the $6P.txt, $7M.txt, and
$9L.txt lists respectively to get the URLs they contain from the relevant servers.
If GuerrillaBrowser detects filename conflicts while saving these items (not
very common), it generates random names (see $ErrLog.txt). Otherwise it
saves those downloaded resources using the same filenames found in the URLs
themselves (like http://xyz.com/a/sample1.mpg in the example above).
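The naming rule above (use the filename from the URL unless it collides with something already saved) can be sketched like so. The function names and the 8-character random name are assumptions made for illustration; how GuerrillaBrowser actually generates its replacement names isn't documented here.

```python
import posixpath
import secrets
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of the URL, e.g. 'sample1.mpg'."""
    return posixpath.basename(unquote(urlsplit(url).path))


def pick_save_name(url, taken):
    """Use the URL's own filename unless it is already in the 'taken' set."""
    name = filename_from_url(url)
    if name and name not in taken:
        return name
    ext = posixpath.splitext(name)[1]
    while True:
        # Hypothetical scheme: 8 random hex characters plus the old extension.
        candidate = secrets.token_hex(4) + ext
        if candidate not in taken:
            return candidate
```

With nothing saved yet, http://xyz.com/a/1.jpg is stored as 1.jpg; a later http://xyz.com/b/1.jpg would get a random name that keeps the .jpg extension.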
GuerrillaBrowser only creates folders as needed, and you can use, reuse, or
delete Cache Indexes in any order you like. It does not automatically remove
any folders except for the .\nn\nn_files\ folders as described for the Grab
HTML command above. GuerrillaBrowser has an option to automatically delete
intermediate workfiles ($1.gz through $3U.txt) created during the Scrub HTML
operation.
Because vertical space is often at a premium on typical computer monitors,
GuerrillaBrowser tries to conserve it by dispensing with the customary status
bar and menu bar (it would like to lose the horizontal scrollbar if it
could!), and by using an abbreviated toolbar. Status information shows up in
the window title bar, the Results View in the main window, the Document View
in the main window when no Map file is currently open, and possibly also in
the $ErrLog.txt file and in popup messages where appropriate. The main menu
is accessed by right-clicking the mouse anywhere on the main window, and is
subdivided into File, Action, Edit, and Options sections.
The GuerrillaBrowser user interface consists of a title bar, toolbar, and main window. The toolbar contains 3 buttons pertaining to the document list, a spinner for entering the current Cache Index, an edit box for displaying or entering (Single-Index Mode only) its associated URL, a button for enabling/disabling Batch Mode (disables/enables the URL edit box), a Stop button, and 5 buttons for downloading files.
The toolbar buttons all have corresponding hot keys ("accelerators"). Apart from that, the (single) keyboard is shared between the spinner, edit box, and main window in the usual manner (i.e. by cycling among them using Tab or Shift+Tab). The blinking keyboard cursor ("caret") indicates the currently active "child" window, but whether it shows up in the main window depends on whether a document is open and the selection point is scrolled into view.
Additionally, otherwise-wasted keystrokes cause automatic cycling of the
"input focus" caret. PageUp, PageDown, Ctrl+Home and Ctrl+End navigation
keys cause the focus to switch from the toolbar to the main window. Up and
Down arrows entered in the URL edit box switch the focus to the Cache Index
spinner, and number keys typed while the focus is in the main window do the
same. However, a disabled edit box (Batch Mode) or spinner (no $Cache.txt
file open!) is ineligible to receive the input focus.
The main window can be toggled between the Results View and Document View by
means of the Shift+F6 key. The Document View can be used to display a series
of Map files whose Cache Indexes are specified using the spinner prior to
selecting the Open Map command. The Open Batch command (Shift+F2) gets its
list of Indexes from the current $Batch.txt file. GuerrillaBrowser remembers which documents
you opened in its session history navigator, operated by the Back and Forward
commands (much like your conventional browser). However, you can Close an
individual document, causing it to be dropped from the session history (and
leaving a temporary gap; Back/Forward causes normal session history
traversal to resume). The Results View shows a blank window when "closed",
while the Document View contains a short status display.
The Cache Index spinner is a read-only control; it cannot paste text from the Windows clipboard (but can copy text to it). It's operated by the number or Up/Down arrow keys on the keyboard, or by left-clicking the spinner's arrow buttons with the mouse. The spinner is used to specify: (a) which document to open, (b) which Index to perform a Grab/Scrub on (Single-Index Mode only), or (c) where to start Autonumbering. The URL edit box is a normal single-line Windows edit control (i.e. multi-line text pasted from the clipboard will only retain the first line).
The main window supports disjoint multiple selection (item-based, i.e. whole line), which is performed pretty conventionally. Selections can be extended by using the Shift key or just dragging the mouse. The Shift+F8 key toggles between Add Mode and Replace Mode (indicated by a change in the blinking caret's color). It also toggles the default button between "Find All" and "Find Next" in the Find dialog. Add Mode allows disjoint selection, while Replace Mode does not. The Scroll Lock key causes the keyboard navigation keys to scroll the document without affecting any current selections.
The real power of making URL selections in the current document comes with
the Save List command, which writes them to the $9L.txt list for that Cache
Index, allowing you to pick and choose URLs for later downloading with the
Grab List command. Selections can also be copied to the clipboard (but not
pasted from it; GuerrillaBrowser isn't a text editor!).
The various commands recognized by GuerrillaBrowser, and whether they can be accessed via the toolbar, popup (main) menu, or accelerator keystroke, are shown in the following chart ("-" means not available that way):
toolbar     menu  hotkey     command
----------  ----  ---------  ---------------------------------------
-           -     Alt+Home   [Go to beginning of document list]
'<' button  -     Alt+Left   "Back" [to prior document]
'>' button  -     Alt+Right  "Forward" [to next document]
-           -     Alt+End    [Go to end of document list]
'O' button  Y     Ctrl+O     "Open Map"
-           Y     Ctrl+S     "Save List"
-           Y     F4         "Close" [Map or Results View]
-           -     Shift+F2   [Open every Map in current $Batch.txt file]
-           -     Shift+F4   [Close All (empty the document list)]
-           -     Shift+F6   [Toggle Document/Results View]
-           -     Shift+F8   [Toggle Add/Replace Mode]
spinner     -     -          [Choose a Cache Index]
edit box    -     -          [Assign an URL to current Cache Index]
'B' button  -     Ctrl+B     "Batch Mode" [Single-Index Mode toggle]
'X' button  -     Esc        "Stop" [Grab/Scrub command]
-           Y     -          "Autonumber"
'H' button  Y     Ctrl+H     "Grab HTML"
-           Y     Ctrl+R     "Scrub HTML"
-           Y     -          "Thumb Scout"
'T' button  Y     Ctrl+T     "Grab Thumbs"
'P' button  Y     Ctrl+P     "Grab Pics"
'M' button  Y     Ctrl+M     "Grab Movies"
'L' button  Y     Ctrl+L     "Grab List"
-           Y     Ctrl+C     "Copy"
-           Y     -          "Find..."
-           -     F3         [Find next matching line]
-           Y     -          "Open Cache..."
-           Y     Ctrl+W     "Write Cache" [to disk]
-           Y     -          "Auto Scrub"
-           Y     -          "Options..."
As already described, there is no dialog box to Open a Map document. The
Open Map command simply loads the .\nn\$4.map file (assuming it's been
created), where nn is the 2-digit Cache Index number currently displayed in
the spinner, and adds it to the session history document list at its current
position, marking the new end of the list. The session history list operates
pretty conventionally (as explained above in the "Layout" section), but there
is an issue related to GuerrillaBrowser's memory management strategy.
Unlike a lot of bloatware, GuerrillaBrowser has been created with performance in mind, particularly in its use of your PC's memory. (Windows theoretically implements "virtual memory" by paging RAM to/from disk, but if you've ever tried using the Paint program to load a really big bitmap on a machine with limited RAM, you may have left the room crying!)
HTML documents can be hundreds of KB in size themselves, and also "embed" hundreds of thumbnail images. To conserve memory, GuerrillaBrowser only retains one document in memory at a time (the one currently viewed), but keeps track of the lines (URLs) you have selected in all the documents currently in the session history list. Also, because resizing hundreds of thumbnails can take several seconds, GuerrillaBrowser tries to improve list navigation performance by caching those.
Meanwhile, there's nothing to stop you from deciding to replace the URL in
cache index 27 (or whatever) with a different one (using the Grab HTML
command), or just modifying the .\27\$4.map file with a text editor.
GuerrillaBrowser tries to track changes it makes to documents in its
session history list, but you should Close document 27 if
you plan to modify its $4.map or $5T.txt file from outside the program and
don't want the Document View's selections and/or thumbs to be out-of-date.
Reopening the Map thereafter will show you the current data.
The Results View (or Document View) can be manually selected by using the Shift+F6 toggle key. Also, where the Document View is automatically selected by the Open Map command, the Results View is automatically selected for any of the commands (Grab/Scrub/Autonumber) on the Action submenu, and remains selected until the command completes or is aborted (by the Stop command).
Each URL (only 1 in Single-Index Mode, up to 100 in Batch Mode) is displayed in gray when the operation begins, and changes to black when it completes successfully or red when it is unsuccessful. The same mouse or keyboard selection and copy-to-clipboard rules (Add Mode, etc.) apply to the Results View as to the Document View.
You can toggle between Batch Mode and Single-Index Mode by using the 'B'
button on the toolbar. When the button is down (Batch Mode active), the URL
edit box is disabled and its text grayed (assuming the current Cache Index is
used). This is because in Batch Mode, if an URL is needed (as it is only for
the Grab HTML command), it will be taken from the $Batch.txt file.
In Single-Index Mode ('B' button is up), the URL edit box can be used to
input an URL prior to giving the Grab HTML command. This is the only command
that can modify the in-memory copy of $Cache.txt, by associating a new URL
with a particular Cache Index. The other Grab commands and Scrub command use
the Cache Index only, along with the URL already assigned during Grab HTML.
GuerrillaBrowser accepts only "absolute" URLs (i.e. begin with the http:
scheme) like http://xyz.com/Z/index.html, not "relative" URLs like
Z/index.html. It has no "search" function. Entering nothing (not even
spaces) in the URL edit box and choosing the Grab HTML command frees up that
Cache Index, and deletes all the files in the .\nn\ and
.\nn\nn_files\ subfolders (unless you rename those folders and thereby "hide"
them). The files do not go to the "recycle bin" or anything like that; they're gone!
The $Batch.txt file is just a means of entering several Cache Indexes (and
possibly URLs) at one go, where the toolbar can only specify one at a time.
The format of the $Batch.txt file is identical to the $Cache.txt file, where
each line contains 2 decimal digits, 1 space, and an optional absolute URL
beginning at column 4, like so:
85 http://xyz.com/Z/index.html
03 http://www.GuerrillaBrowser.com/
Cache Indexes can occur in any order you like, but duplicate Indexes in the
same $Batch.txt file will cause the earlier results to be replaced by the
later ones. (Duplicate Indexes in the $Cache.txt file are treated as an
error.) The URLs are only required for the Grab HTML command, and are
ignored (if present) for the others. However, a missing URL in a Grab HTML
command has the same effect in Batch Mode as in Single-Index Mode, i.e. it
requests Cache Index deletion, meaning you can delete a lot of files in a
short time! You should always use the Grab HTML command with care,
particularly in Batch Mode (see also Novice Mode).
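If you generate or check these files with a script, the line format above is easy to parse. The following Python sketch is an illustration only: the function name, the skipping of blank lines, and the error messages are assumptions, not GuerrillaBrowser's behavior. The allow_duplicates flag mirrors the difference described above between $Batch.txt (later lines win) and $Cache.txt (duplicates are an error).

```python
import re

# Two decimal digits, then optionally one space and the rest of the line.
LINE_RE = re.compile(r"^(\d{2})(?: (.*))?$")


def parse_index_lines(text, allow_duplicates):
    """Map each 2-digit Cache Index to its URL ('' when the URL is absent)."""
    entries = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are simply skipped
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'nn URL', got {line!r}")
        index = match.group(1)
        url = (match.group(2) or "").strip()
        if url and not url.lower().startswith("http://"):
            raise ValueError(f"line {lineno}: not an absolute http: URL")
        if index in entries and not allow_duplicates:
            raise ValueError(f"line {lineno}: duplicate Cache Index {index}")
        entries[index] = url  # a later line replaces an earlier one
    return entries
```

A script like this is a cheap way to spot lines with a missing URL before running Grab HTML in Batch Mode, since those lines request deletion of the Cache Index.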
The commands on the Action submenu all apply to both Single-Index Mode and Batch Mode, with 2
exceptions: the Thumb Scout command (see next section), and the Autonumber command. Autonumber's purpose
is to automate production of a usable $Batch.txt file, by assigning unused Cache
Indexes to a list of URLs you supply. First set the Cache Index spinner to
the Index where you want the assignments to begin. Then copy your list of
URLs to a file named $URLs.txt in the main folder for the current GB Cache
(the one containing $Cache.txt, $Batch.txt, etc.). Choosing the Autonumber
command will create a new $Batch.txt file (overwriting the previous one).
GuerrillaBrowser considers a Cache Index "unused" if its associated URL is
missing in the in-memory copy of $Cache.txt (which you can check by moving
the spinner to the desired Index and seeing if an URL appears in the edit
box). It does not consider whether you already have files stored in the
.\nn\ subfolder for that Index. Autonumbering skips past any Indexes already
assigned an URL, and stops when the Index reaches 99 (i.e. it does not "wrap
around" and continue at 00).
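The Autonumber rules above (start at the spinner's Index, skip Indexes that already have an URL, stop after 99) amount to something like the following Python sketch. The function and parameter names are invented for illustration, and used_indexes stands in for the Indexes that have an URL in the in-memory copy of $Cache.txt.

```python
def autonumber(urls, used_indexes, start):
    """Pair each URL with the next unused Cache Index, stopping after 99."""
    lines = []
    index = start
    for url in urls:
        while index <= 99 and index in used_indexes:
            index += 1  # skip Indexes that already have an URL assigned
        if index > 99:
            break  # no "wrap around" to 00; leftover URLs stay unassigned
        lines.append(f"{index:02d} {url}")
        index += 1
    return lines
```

For example, starting at Index 97 with Index 98 already in use, three URLs produce only two lines (for 97 and 99); the third URL is left over because numbering does not continue at 00.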
Whether the other commands operate in Single-Index Mode (using the spinner
and edit box) or Batch Mode (using the $Batch.txt file) depends on the state of
the 'B' button on the toolbar. As a reminder, when in Batch Mode a
message box appears at the beginning to give you a chance to abort the
command. You can also quit at any time by hitting the Stop command ('X'
button). You're given the opportunity to abort altogether, or just skip the current
Cache Index and continue (handy if you're in the middle of a long
Batch and some web server starts giving you a hard time).
You can't Scrub HTML without first having used the Grab HTML command, because
Scrubbing starts with the $0.raw file that Grab HTML produces. In turn,
Scrub HTML creates the $5T.txt, $6P.txt and $7M.txt lists used by the
Grab Thumbs, Grab Pics, and Grab Movies commands respectively. Grab List uses the
$9L.txt list produced by the Save List command. You could also create any of
these files with a text editor: $6P.txt, $7M.txt, $8O.txt,
$9L.txt and $URLs.txt all share the same format.
The $5T.txt list prepends a 4-digit "alias" to the front of each URL it
contains, which acts as a translation table for the Document View. A line
like 0003http://xyz.com/11/Q/sample1.jpg tells GuerrillaBrowser (and the
GuerrillaViewer companion program) that the thumb associated with this URL
can be found at .\nn\nn_files\0003.jpg. You'll need to keep both the
$4.map and $5T.txt files to see the Document View for a given Cache Index that has
any thumbs.
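As a sketch of how the alias acts as a translation table, this Python function turns one $5T.txt line into the relative path of its downloaded thumb. It assumes the saved file keeps the extension found in the URL, as the examples in this guide suggest; the function name is hypothetical.

```python
import ntpath
import posixpath
from urllib.parse import urlsplit


def thumb_path(line, cache_index):
    """Map a $5T.txt line (4-digit alias + URL) to its file under .\\nn\\nn_files\\."""
    alias, url = line[:4], line[4:].strip()
    if len(alias) != 4 or not alias.isdigit() or not url:
        raise ValueError(f"not a 4-digit alias followed by an URL: {line!r}")
    # Assumption: the saved thumb keeps the extension used in the URL.
    ext = posixpath.splitext(urlsplit(url).path)[1]
    folder = f"{cache_index:02d}"
    return ntpath.join(".", folder, folder + "_files", alias + ext)
```

Given the line 0003http://xyz.com/11/Q/sample1.jpg and Cache Index 5, it returns .\05\05_files\0003.jpg.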
The Thumb Scout feature (available in Single-Index Mode only) is provided as a way to "synthesize" some thumbnail images for many link servers that lack them, by going straight to the content servers identified in those links. There are a number of issues surrounding this command that you'll want to be aware of.
The biggest issue is that the Thumb Scout command violates GuerrillaBrowser's
cardinal rule that it only requests the specific URLs you tell it to. For
example, when using the Grab Thumbs command, GuerrillaBrowser will request
the URLs it finds in the $5T.txt list for that Cache Index, and only those
URLs. You can see what they are prior to giving the command by viewing the
$5T.txt file.
The Thumb Scout command, however, is really just a macro for the Grab HTML,
Scrub HTML, and Grab Thumbs commands. That is, it works its way down the
existing $4.map file for the given Cache Index, and executes those 3 commands
for each eligible-looking link that doesn't already have an associated thumb.
It grows the existing $4.map and $5T.txt files, but it also downloads those
thumbs as it goes because it typically needs the "Referer" link in order to
get them.
When done, the old $4.map and $5T.txt files will be renamed to $4.bak and
$5T.bak and the new files will contain the added thumbs. GuerrillaBrowser
will also try to grab thumbs listed in the $5T.bak file if they haven't
already been downloaded. Thereafter, both Grab Thumbs and Thumb Scout are
disabled for that Cache Index (by the presence of the $4.bak file), since to
reuse them would be to risk losing what you already got.
GuerrillaBrowser has to borrow the given Cache Index's folder as a work area,
so it creates the subfolder .\nn\TS\ to store any files that might get
stepped on. It will normally clean up after itself on completion, but if it
encounters a problem or you decide to cancel out of the Thumb Scout operation,
you'll have to look in the temporary subfolder to get the files you started
with.
Link servers that host thumbnail images usually have them set up for fast
retrieval, and having to hunt around on a bunch of content servers for thumbs
is nowhere near as efficient, so expect Thumb Scout to take longer than Grab
Thumbs would. For instance, if it takes Grab Thumbs half a minute to download
100 thumbs, it may take Thumb Scout several minutes to do the same. If the
$8O.txt list is 1000 lines long (contains that many URLs), then Thumb Scout
may potentially find nearly that many additional thumbs, so that's enough time
for a good, long coffee break.
Some link servers will be poor candidates for the Thumb Scout feature, and GuerrillaBrowser has no way to know which ones those are, but you may. Some website authors baffle the heck out of their content, so that you have to click through several pages of fluff to get to the meat. GuerrillaBrowser doesn't do automatic redirection, both for security reasons and because it's so often used just to jerk surfers around, so Thumb Scout will not add anything in situations like that.
If you're using GuerrillaBrowser to load $4.map lists over in the
GuerrillaViewer companion program (by clicking on one of the images in
GuerrillaBrowser's thumbs margin), GuerrillaViewer will pick up the new $4.map
produced by the Thumb Scout command the next time you do that. Otherwise,
GuerrillaViewer has no way to know when its list files have been changed out
from under it, so you'll need to use its Open List... command to manually reload
a modified image list.
Computer program errors can be divided into 2 groups: those that are not rare (like file doesn't exist, server is busy, etc.), and those that are rare (like Windows can't provide a Device Context to refresh the display). GuerrillaBrowser was given copious error-handling during its development, and it has been retained because that's better than having the program crash and leaving you, the user, wondering where it went.
Most of the error conditions you're likely to actually encounter have been provided with messages in what passes for English in the computerese world of Internet applications. The remainder, should you ever happen to see any, may appear rather cryptic. If you've had much experience using Windows, you've probably already developed strategies for coping with errant software; enough said.
Popup messages, however, are not all that suitable for something like Batch
Mode because they tend to interrupt the flow, so GuerrillaBrowser also
provides the $ErrLog.txt file (in each GB Cache's main folder) so that you
can try to reconstruct how things went with each web server you accessed. It
contains the date & time, followed by a program location code & error code,
followed by the Cache Index and URL that had a problem. Examples of some
code values you might see (and their meanings) would be:
01=00000000 - GB replaced URL you gave with one returned by server
02=00000200 - "200 OK" response from server (but something's funny)
02=00000301 - "301 Moved Permanently" response (redirect)
02=00000302 - "302 Found" response (redirect)
02=00000307 - "307 Temporary Redirect" response
02=00000400 - "400 Bad Request" response
02=00000403 - "403 Forbidden" response
02=00000404 - "404 Not Found" response
02=00000504 - "504 Gateway Timeout" response
02=0000Cx60 - response headers appear to be absent or mangled
04=00000000 - Scrubber ran out of memory while removing dup URLs
05=00000000 - invalid URL encountered during GB Cache open
06=00000000 - invalid URL encountered during GB Cache update
07=00000000 - server didn't return any data
08=xxxxxxxx - download filename changed to "xxxxxxxx" to avoid dup
09=0000C000 - can't get an IP address for this server
09=0000F2F7 - server dropped the connection (10054:WSAECONNRESET)
09=0000F9F0 - server refused the connection (10061:WSAECONNREFUSED)
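If you want to sift a long $ErrLog.txt with a script, the "location=code" token is the easy part to pick out. The Python sketch below makes two assumptions that go beyond this guide: that the token always has the 2-digit, "=", 8-character shape shown in the examples, and that for location 02 an all-digit value reads as the HTTP status code. The exact layout of the rest of the log line isn't specified here, so the sample line in the usage note is invented.

```python
import re

# Assumed token shape, based on the examples above: nn=xxxxxxxx
CODE_RE = re.compile(r"\b(\d{2})=([0-9A-Za-z]{8})\b")


def extract_code(log_line):
    """Return (location, value) for the first code token, or None if absent."""
    match = CODE_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def http_status(location, value):
    """Read an all-digit location-02 value as an HTTP status, else None."""
    if location == "02" and value.isdigit():
        return int(value)
    return None
```

For a made-up line such as "2005-01-01 12:00:00 02=00000404 05 http://xyz.com/a.html", extract_code() returns ("02", "00000404") and http_status() turns that into 404.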
GuerrillaBrowser's operation is controlled by a combination of: (a) user interface elements (keystrokes, toolbar buttons, etc.), (b) work file contents, and (c) options settings. The user interface has already been discussed, as have some of the details regarding how GuerrillaBrowser interacts with its work files; more follow. Most of the options are set using the Options... dialog, with the exception of Cache Path specification (Open Cache... on the popup menu) and the Auto Scrub option (also on the popup menu).
GuerrillaBrowser does not "dechunk" a server response on-the-fly, so to
speak. The vast majority of HTTP messages containing multimedia types are
already compressed in their native formats, and are of known length, and so
require no special encoding for HTTP transmission. If the HTTP headers look
okay, GuerrillaBrowser simply strips them off and stores the message body to
disk in a form suitable for your programs that use that media type. If not,
it stores the response exactly as received (replacing the 3rd character of
the filename extension with "!" as a sign), and leaves it up to you to decide
what you want to do next (see The DeChunk Utility).
HTML will normally be retrieved via the Grab HTML command, and is stored
exactly as received under the name $0.raw in the given Cache Index's
subfolder. Whether you'll want to Scrub it or do anything at all depends on
having actually gotten what you asked for. As you may know from using your
conventional browser, web servers don't always deliver what they seem to have
promised (the infamous "bait-and-switch").
The URL stored for each Cache Index in $Cache.txt has to do triple-duty: (1)
as the Request-URI in an HTTP GET when you Grab HTML, (2) as the Referer:
when you Grab anything else (see Smart Mode), and (3) as the "base URL" used
to resolve any relative URLs (like ../../home.html) when you Scrub HTML. As
mentioned, GuerrillaBrowser won't automatically download redirected URLs, but
it will place the substitute URL in its in-memory copy of $Cache.txt in case
you decide to Scrub that $0.raw file.
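The "base URL" role is why a redirect matters when Scrubbing: relative links only make sense against the URL the page really came from. Here is a short Python illustration using the standard library's urljoin; the helper name and URLs are examples only, and GuerrillaBrowser's own resolver may differ in edge cases.

```python
from urllib.parse import urljoin


def resolve_links(base_url, links):
    """Resolve each link found in a page against that page's own URL."""
    return [urljoin(base_url, link) for link in links]
```

With a base of http://xyz.com/a/b/index.html, the relative link ../../home.html resolves to http://xyz.com/home.html and pic/1.jpg resolves to http://xyz.com/a/b/pic/1.jpg, while an absolute URL passes through unchanged.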
Setting the Auto Scrub option on the menu causes each Grab HTML to be
immediately followed by a Scrub HTML. Whether that's handy depends mainly on
the likelihood that you got what you wanted. With your conventional browser,
the substituted page may just appear in its window (Hey...what's this?!).
With GuerrillaBrowser, you'll have to inspect the $0.raw file. Generally, a
small file size (less than 1KB) indicates a redirect: maybe you got
stiffed, or maybe that resource genuinely moved to another location.
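If you'd rather not eyeball every small $0.raw file, the check can be scripted. The Python sketch below looks for a 3xx status code and a Location: header in a raw response; the function name redirect_target and the sample bytes are invented for illustration, and this is not how GuerrillaBrowser itself does it.

```python
def redirect_target(raw):
    """Return the Location value if raw looks like a 3xx HTTP response, else None."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split(None, 2)          # e.g. ["HTTP/1.1", "302", "Found"]
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        return None                          # doesn't look like an HTTP response
    if not parts[1].startswith("3"):
        return None                          # not a redirect status
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None

# Hypothetical contents of a $0.raw file after a redirect.
sample = (b"HTTP/1.1 302 Found\r\n"
          b"Location: http://xyz.com/new.html\r\n"
          b"Content-Length: 0\r\n\r\n")
print(redirect_target(sample))
```

To try it on a real file, read the bytes with open(path, "rb").read() and pass them in.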
In Single-Index Mode, if the redirected URL (which replaces the URL you gave
in the URL edit box) looks okay, you only have to Grab HTML again to retrieve
it. In Batch Mode, the $Batch.txt file will still contain the Indexes that
came back okay mixed in with the ones that didn't, and so will need to be
edited in order to do a follow-up. You may find Single-Index Mode better
suited to troublesome servers and Batch Mode to the more reliable ones.
GuerrillaBrowser can use a template file called $ReqHdrs.txt (stored in the
same folder as GB.exe and GB.ini) to govern what it sends to an HTTP server
during a request. For any Grab operation with Smart Mode enabled,
GuerrillaBrowser uses what it read from $ReqHdrs.txt as a pattern and adds or
changes a few key values to customize the request: the Request-URI in the
Request-Line (1st line), the Host: header (allows servers to share an IP
address), and the Referer: header. Other headers (User-Agent:,
Cookie:, etc.) appearing in $ReqHdrs.txt will be sent unchanged.
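The following Python sketch shows the general idea of using a header template as a pattern. Everything in it is an assumption made for illustration: the template text, the function name customize, and the choice to drop the Referer: line when there is no referring URL. The real $ReqHdrs.txt and GuerrillaBrowser's own logic may well differ.

```python
from urllib.parse import urlsplit

TEMPLATE = (                       # hypothetical $ReqHdrs.txt contents
    "GET / HTTP/1.1\r\n"
    "Host: example.invalid\r\n"
    "User-Agent: ExampleAgent/1.0\r\n"
    "Referer: http://example.invalid/\r\n"
    "Connection: close\r\n"
    "\r\n"
)

def customize(template, url, referer):
    """Fill in the Request-URI, Host: and Referer:; leave other headers alone."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    out = []
    for i, line in enumerate(template.split("\r\n")):
        name = line.partition(":")[0].strip().lower()
        if i == 0:                               # the Request-Line
            method, _, version = line.split(" ", 2)
            line = f"{method} {path} {version}"
        elif name == "host":
            line = "Host: " + parts.netloc
        elif name == "referer":
            if referer is None:                  # assumption: omit the header
                continue
            line = "Referer: " + referer
        out.append(line)
    return "\r\n".join(out)

print(customize(TEMPLATE, "http://xyz.com/pics/1.jpg?x=1",
                "http://xyz.com/index.html"))
```

Note that the User-Agent: and Connection: lines come out exactly as they went in, which is the whole point of the template.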
The Referer: header has sometimes been described as a risk to privacy, but
the argument is a little hard to follow. As a practical matter, it is often
used as a sort of password to prevent "hotlinking" (i.e. stealing bandwidth
by using files stored on somebody else's server), so to request any resources
without it in those situations is to come up empty-handed.
(As an example of GuerrillaBrowser's aftermarket workarounds, the use of
Referer: headers can be suppressed in at least two ways. One would be to
store a file like $6P.txt in an unused Cache Index's subfolder and use Grab
Pics, so that the Index's associated URL is missing and therefore so is the
Referer. Another would be using Grab HTML (which doesn't use its own URL as
a Referer!) to retrieve an URL like http://xyz.com/1.jpg and converting the
resulting $0.raw to a JPEG using the DeChunk Utility. You get the idea.)
Similar to the situation with the $Cache.txt file, GuerrillaBrowser uses its
in-memory copy of $ReqHdrs.txt unless you tell it to reload the file. You do
that by either quitting and restarting the Browser, or by using the
Options... dialog to choose Smart Mode (or verify that it's already chosen)
and clicking OK rather than Cancel.
Smart Mode has a few limitations: it will only do an HTTP GET (not
ordinarily a problem), line lengths (and therefore URLs) are limited to
around 500 characters (ditto), etc. To disable Smart Mode in the Options...
dialog is to enable Verbatim Mode. In Verbatim Mode, the contents of
$ReqHdrs.txt are sent byte-for-byte to the server, and its byte-for-byte
response is stored in the current Cache Index's subfolder as $0.raw.
Also, unlike in Smart Mode, the $ReqHdrs.txt file is re-read for every Grab.
What this means is that Verbatim Mode only supports Single-Index Mode
(there's no practical way to edit $ReqHdrs.txt in-between Indexes in Batch
Mode!) and the Grab HTML command (same reason; don't worry,
GuerrillaBrowser has no way of knowing whether the data you're sending to
the server asks for HTML or whatever).
In Verbatim Mode, you will still need to enter an URL in the edit box, but
GuerrillaBrowser only makes use of the "authority" portion (host name and
port number, like "xyz.com:80" in the URL http://xyz.com:80/index.html) and
ignores the rest (but you still have to enter the "http://" at the
beginning!). GuerrillaBrowser makes a TCP connection to that server at that
port (defaults to HTTP's port 80 if absent), sends the contents of
$ReqHdrs.txt and saves the response to disk.
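In Python terms, the Verbatim Mode exchange described above might look something like this. It's a sketch under stated assumptions, not GuerrillaBrowser's code: the function names authority and send_verbatim are made up, and the read loop assumes the server closes the connection when it's done (so the request should ask for that, e.g. with a "Connection: close" header).

```python
import socket
from urllib.parse import urlsplit

def authority(url):
    """Return (host, port) from an http:// URL; the port defaults to 80."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError("expected a URL beginning with http://")
    return parts.hostname, parts.port or 80

def send_verbatim(url, request_bytes, timeout=30.0):
    """Send request_bytes unchanged and return everything the server sends back."""
    host, port = authority(url)      # the path and query of url are ignored
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request_bytes)
        while True:
            data = sock.recv(65536)
            if not data:             # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

print(authority("http://xyz.com:80/index.html"))
print(authority("http://xyz.com/anything/else"))
```

Only authority() is exercised here; send_verbatim() needs a live server, and you'd write its return value to disk yourself.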
Verbatim Mode gives you a way to send URLs of arbitrary length, perform HTTP
HEAD or POST operations, use a proxy server, and do other exotic things.
It's not actually even limited to HTTP, but could be used for any application
protocol that uses a TCP connection and a simple, one-shot, request/response
exchange. (But GuerrillaBrowser doesn't support SSL, so still no https:!)
At a more practical level, it gives you a way to overcome GuerrillaBrowser's
inability to customize Cookie: headers in Smart Mode, and return a server's
cookies in situations where you think it might treat you better if you did.
(That could also be done in Smart Mode, but less conveniently because
$ReqHdrs.txt isn't automatically reloaded.)
For web servers that use the User-Agent: header to identify your browser,
either Smart or Verbatim Mode is suitable for doing the "15-second browser
upgrade" (i.e. changing that header's value in the $ReqHdrs.txt file). The
User-Agent: header is also handy for retrieving HTML compatible with your
conventional browser from servers that customize what they send.
It should be obvious that to expect anything useful back from an HTTP server, you'll need to send a properly-formed HTTP request. It's an easy protocol to learn, though; there are many resources on the web that can instruct you, and you already have a great tool (GuerrillaBrowser) for investigating its behind-the-scenes operation. The more you know about the web's underlying machinery, the better your chances of getting what you want from your surfing.
This option causes GuerrillaBrowser to make all requests via a proxy server. Use the two edit boxes to the right of the checkbox to enter the IP address (e.g. 192.76.71.99, in the left-hand box) and port number (e.g. 80, in the right-hand box) of the proxy server you want. If you omit the port number, HTTP's default port 80 will be used.
Proxy servers can be kind of transient. If GuerrillaBrowser seems to have lost all ability to get anything from the Internet, try unchecking this option or choosing the address and port of a different server.
When using a proxy server in Verbatim Mode, you'll still need to enter a valid-looking URL in the URL edit box on the toolbar to keep the GB Cache happy, but GuerrillaBrowser will connect to the proxy server you specified in the GuerrillaBrowser Options dialog box.
With Novice Mode enabled, GuerrillaBrowser issues a lot more warnings for situations where files could be deleted or replaced. With it disabled, it pretty much assumes you know what you're doing. You may want to try Novice Mode while you're familiarizing yourself with GuerrillaBrowser's sometimes unconventional operation. If you're the kind of person who seldom misclicks with the mouse (or whatever) and find the reminders annoying, you can always disable Novice Mode.
As already mentioned, GuerrillaBrowser deals only in absolute URLs. During
Scrub HTML, it resolves whatever relative URLs it finds in the $2.htm file and
saves them in $3U.txt (the URL master list) in their absolute (but
otherwise unchanged) form.
URLs sometimes contain embedded URLs, like:
http://xyz.com/Q/index.php?id=8&url=http://www.GuerrillaBrowser.com/
In fact, URLs can contain embedded URLs that contain still other URLs. The
Collapse URLs Option tells the Scrubber to try to extract the ultimate URL
where possible (sometimes these things get pretty gnarly!), and use it in the
$4.map, $5T.txt, and other lists further down the chain. The original form
still appears in $3U.txt if you need to refer to it.
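One crude way to picture "collapsing" is to take whatever follows the last "http://" in the URL. The Python sketch below does just that. It is a guess at the general idea, not the Scrubber's actual heuristics; the function name collapse is made up, and it will happily drag along any trailing parameters that belong to the outer URL.

```python
from urllib.parse import unquote

def collapse(url):
    """Return the innermost embedded http:// URL, or url itself if there is none."""
    decoded = unquote(url)               # embedded URLs are often percent-encoded
    start = decoded.rfind("http://")
    return decoded[start:] if start > 0 else url

print(collapse("http://xyz.com/Q/index.php?id=8&url=http://www.GuerrillaBrowser.com/"))
print(collapse("http://a.com/r?u=http%3A%2F%2Fb.com%2Fpage"))
```

A URL with nothing embedded in it comes back unchanged.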
You may find the $3U.txt file useful for another reason, namely that HTML
sometimes associates two URLs with the same nominal link. The Scrubber picks
one to go on to the $4.map and list files, but records both in $3U.txt.
Also, the Scrubber removes duplicate (identical) URLs from the later files,
but retains them all in the $3U.txt file.
(The duplicate removal process is memory-intensive, and can break down for
HTML files larger than several hundred KB. Splitting very large files into
more than one piece (separate Cache Indexes, same base URL) is one way to
get past this limitation. The Scrubber couldn't care less whether its $2.htm
input file begins with an <HTML> tag or somewhere in the middle. However,
the multiple $4.map and $8O.txt output files could still duplicate each
other's URLs.)
Checking this option in the Options... dialog causes the files between $0.raw and
$4.map to be automatically deleted on successful completion of a Scrub.
The $1.gz file (if present) is not ordinarily useful except when
GuerrillaBrowser can't recognize its format, in which case the Scrub stops
there. In such cases, you may be able to identify the problem using the
DeChunk Utility, find a suitable decoder for the file, rename its output to
$2.htm, and continue the Scrub.
You might want to use the $2.htm file as a starting point if you plan to
modify the HTML returned by the server, as it has the nuisance "chunks" (and
possibly "deflation") removed but is otherwise unchanged. The potential
value of the $3U.txt file is discussed immediately above.
Things like "active content" and plugins can be used to do all sorts of gee-whiz stuff on your computer, or to completely destroy it. It all depends on what the server sends you. Unfortunately, no (inexpensive) computer program can reliably guess the intent of arbitrary code; that's a job that can prove tedious even for expert programmers. Your conventional browser may have options to selectively enable/disable some of those things, placing you somewhat less at the mercy of persons unknown.
GuerrillaBrowser's approach is to abandon such mechanisms altogether. It
does, however, have a way to give you a head-start on removing the bad stuff
and keeping the good stuff, should you wish to try your conventional browser
on some content that's otherwise unsafe. Choosing the Create "Detoxed" HTML
Option causes the Scrubber to remove whatever looks problematic from the
$2.htm file to produce a .\nn\nn.htm file (where nn is the 2-digit Index).
It also adjusts thumbs links to refer to the files in the .\nn\nn_files\
folder that you can download using $5T.txt and Grab Thumbs.
The resulting "detoxed" HTML may give you a safer way to see a more conventional rendering with your other browser, but you should not regard it as certifiably safe and should take additional precautions. Temporarily disconnecting from the Internet (either unplugging the cable or using your firewall if it supports that) can absolutely prevent automatic linking and the resulting cascade of unwanted downloads. Disabling scripting in your browser's option settings may prove sufficient to protect you from the "active content" threat (or may not, but it's worth a try).
In practice, many websites are fine and can be accessed with little risk using your conventional browser. For the rest, there's GuerrillaBrowser!
GuerrillaBrowser comes with a FREE (beta) copy of its companion slideshow program, GuerrillaViewer. Even if you already have an image viewer, you may still be interested in this one because GuerrillaViewer and GuerrillaBrowser can talk to each other!
GuerrillaViewer knows how to use the image-related GB work files ($4.map,
$5T.txt, $6P.txt) to load and display the corresponding images from your GB
Cache (assuming they've been downloaded). In addition, where
GuerrillaBrowser displays its thumbs in a standardized size, loading the same
$4.map file into GuerrillaViewer allows you to see them at a size of your
choosing, and to make selections in the GuerrillaBrowser Document View
remotely.
Like GuerrillaBrowser, GuerrillaViewer's main menu is accessed by clicking
with the right mouse button on the main window. The Open... command allows
you to specify a single image by name. The Open List... command allows you
to specify a GB list file, or a file you create containing a list of complete
paths (like C:\My Documents\pic1.jpg or whatever). Entering the keyword
"auto" in this dialog causes GuerrillaViewer to build a list from whatever
images it can find in the currently displayed folder. The Close All command
unloads both the individual image and the image list.
The Fit Option determines whether images are displayed at their native size or resized to best fit GuerrillaViewer's window. If you've Opened a List, it can be traversed using the following keystrokes:
Alt+Home - First image
Alt+Left - Prior image (also Shift+Spacebar)
Alt+Right - Next image (also Spacebar)
Alt+End - Last image
In addition, pressing Ctrl+Fn while an image is displayed will store its path for quick recall whenever the same Function key (F1 through F12) is pressed without using the Ctrl key. If an image list is loaded, its traversal can be resumed after opening single images or Function key images; just press the spacebar (or whatever).
If GV and GB have the same $4.map file loaded (and GB isn't minimized or
switched to Results View), GB will track your progress through the thumbs
list in its own window as you traverse the list in GuerrillaViewer.
Double-clicking an image in GV with the left mouse button will cause its
associated URL to be selected (if GB is in Replace Mode) or have its
selection state toggled (if GB is in Add Mode), just like pressing the
spacebar when you're in GuerrillaBrowser.
If you switch back to GuerrillaBrowser, left-clicking over in the thumbs
margin will cause GV's list to advance to the corresponding image. Changing
Map documents in GB and left-clicking a thumb will cause GV to automatically
load the new $4.map file, and GB will even automatically start its companion
Viewer (the reverse, however, is not true). When running multiple instances
of the Browser and/or Viewer, they will pair up so that each talks to only
one "buddy" program.
GuerrillaBrowser also comes with a utility program called DeChunk, an HTTP response "stripper" that can display a few hundred bytes from the beginning of any file. As already mentioned, GuerrillaBrowser only "dechunks" a web server's response during a Scrub, not during a Grab. The idea is that Grab HTML will normally be followed by Scrub HTML, and that Grab anything else won't ordinarily receive any response that needs dechunking.
If GuerrillaBrowser encounters anything that doesn't look right, however, it
will store the server's response exactly as received so that you can use the
HTTP headers to help you resolve the problem. HTML is always saved this
way, as $0.raw during a Grab HTML or as something like index.ht! during a
Grab anything else. Another situation would be a mismatch in media type.
It's not too uncommon to ask for 1.jpg and get back HTML instead (the
"switcheroo"). GuerrillaBrowser would flag this as 1.jp!, but it isn't a
JPEG at all, which you can use DeChunk to determine. Occasionally, you may
see something like an alleged ".jpg" that's actually a GIF (or vice versa, or
whatever). Some image programs have problems with those kinds of things, so
GuerrillaBrowser leaves it up to you whether to rename the file to match its
actual media type or to take some other measure.
In theory, the Content-type: HTTP header is supposed to be the definitive
source of information on an object's media type, but in practice it's often
no more accurate than the filename extension is. A file's type is really
determined by its format, which controls how text, color, sound, or whatever
gets converted into numbers (the only thing computers "understand") and back
again. Various encoding schemes (like US-ASCII for text) control this
conversion, and using one to encode (like MPEG) and a different one to decode
(like PNG) is guaranteed to produce a complete mess!
A convention has arisen of placing special values at or near the beginning of many types of files (called "magic numbers") to help application programs to identify their format. GuerrillaBrowser (like your conventional browser) uses a small table of this kind of information to auto-identify media types, but should it encounter problems you can use DeChunk to do a follow-up. (Unfortunately, there's no central authority governing the use of these flag values, so the situation is sort of a mish-mash and making an accurate guess is as much art as it is science; do an Internet search on "magic numbers" to learn more.)
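A magic-number table can be very small and still catch the common cases. The Python sketch below uses a handful of well-known signatures; it is not GuerrillaBrowser's actual table (which isn't documented here), and the function name sniff and the HTML guess at the end are just for illustration.

```python
MAGIC = [
    (b"\xff\xd8\xff", "image/jpeg"),        # JPEG start-of-image marker
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),    # 8-byte PNG signature
    (b"\x1f\x8b", "application/gzip"),
]

def sniff(data):
    """Guess a media type from the first bytes of data, or return None."""
    for magic, media_type in MAGIC:
        if data.startswith(magic):
            return media_type
    text = data[:256].lstrip().lower()      # crude check for HTML text
    if text.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return None

print(sniff(b"GIF89a\x01\x00\x01\x00"))     # an alleged ".jpg" that's really a GIF
print(sniff(b"<html><body>Sorry!</body></html>"))
```

Comparing the guess with the filename extension is one way to spot the "switcheroo" without opening the file in an image program.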
DeChunk's user interface is very similar to GuerrillaBrowser's, with a Document View and a Results View. Its document list is populated using the Open... command and traversed the same way as in GB and GV. Each group of 80 bytes from the beginning of a loaded file is displayed over 3 lines: 1 line showing an ASCII conversion (probably gibberish for non-text files), and 2 lines showing an over/under hex dump, meaning that each byte (hexadecimal value 00..FF) is displayed using one column, e.g.
mailto:webmaster@123.com
666676376666776743332666
D19C4FA752D134520123E3FD
Printable ASCII values (hex 20..7E) are shown in black; the rest (hex 00..1F and 7F..FF) are shown in color, with the lower range converted to their control-character "letter" equivalents (e.g. ASCII linefeed, value 0A, shows up as letter 'J' for Ctrl+J). However, if DeChunk detects what appear to be HTTP response headers at the start of the file, it displays those line-by-line without the hex dump for easier readability. Also, they're implicitly "selected" and can be copied to the clipboard for pasting into your text editor.
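The over/under layout is simple to reproduce if you want to experiment with it. This Python sketch is not DeChunk's code; the function name over_under is made up, color is left out, and showing bytes 7F..FF as "." is an assumption.

```python
def over_under(data, width=80):
    """Return (text, high-nibble, low-nibble) line triples for data."""
    rows = []
    for i in range(0, len(data), width):
        group = data[i:i + width]
        text = ""
        for b in group:
            if 0x20 <= b <= 0x7E:
                text += chr(b)               # printable ASCII as-is
            elif b < 0x20:
                text += chr(b + 0x40)        # control code -> its Ctrl letter
            else:
                text += "."                  # assumption: placeholder for 7F..FF
        high = "".join("%X" % (b >> 4) for b in group)     # "over" line
        low = "".join("%X" % (b & 0x0F) for b in group)    # "under" line
        rows.append((text, high, low))
    return rows

for row in over_under(b"mailto:webmaster@123.com"):
    print("\n".join(row))
```

Run on the mailto: example above, it reproduces the same three lines; reading down any column gives one byte, e.g. "m" is 6 over D, or hex 6D.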
Choosing the Save command will cause the loaded file to be dechunked to a new output file, just like the GuerrillaBrowser Scrub would do. The Run/Quit command does the same thing for every file currently in the document list, allowing you to load and strip an entire batch of HTTP responses at one go.
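For the curious, here is roughly what "dechunking" involves, as a Python sketch. It handles only the HTTP/1.1 chunked transfer coding on a well-formed response and ignores compression entirely; the function names dechunk and strip_response are made up, and DeChunk itself may do more (or do it differently).

```python
def dechunk(body):
    """Decode an HTTP/1.1 'chunked' message body into the original bytes."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)                    # end of chunk-size line
        size = int(body[pos:eol].split(b";", 1)[0], 16)   # hex size; skip extensions
        pos = eol + 2
        if size == 0:                                     # last chunk
            break
        out += body[pos:pos + size]
        pos += size + 2                                   # skip data and its CRLF
    return bytes(out)

def strip_response(raw):
    """Strip the HTTP headers from raw and dechunk the body if necessary."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return raw                                        # no header block found
    chunked = False
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"transfer-encoding" and b"chunked" in value.lower():
            chunked = True
    return dechunk(body) if chunked else body

raw = (b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
       b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n")
print(strip_response(raw))
```

The chunk sizes are the stray hex numbers you'll see sprinkled through a $0.raw file when you open it in a text editor.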