GuerrillaBrowser presents:
The Web Technology Crash Course



(Note: If a picture is worth a thousand words, this series of illustrations may be all you need to get the general idea.  For commentary on them, see comment.htm.)


The Web is a "hypertext" world.  Most of its information is stored using HTML (Hypertext Markup Language) and exchanged using HTTP (Hypertext Transfer Protocol).



HTTP protocol stack — protocol, message unit, layer, addressing


HTTP        request/response    application    http://xyz.com/index.html
TCP         segment             transport      192.168.0.100:1346
IP          datagram            internet       192.168.0.100
Ethernet    frame               link           44-45-53-54-00-00



Sample DNS lookup — domain name and IP address


www.yahoo.com       69.147.114.210
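
The lookup above can be sketched with Python's standard socket module. The function name resolve is only an illustration, and the address returned for a real host name such as www.yahoo.com depends on the resolver and will likely differ today from the sample shown.

```python
import socket

def resolve(hostname):
    """Return one IPv4 address for a host name, as a dotted-quad string."""
    # gethostbyname asks the system resolver (hosts file, then DNS).
    # An IPv4 literal is returned unchanged, with no lookup performed.
    return socket.gethostbyname(hostname)

# A literal address needs no network access, so this line always works.
print(resolve("127.0.0.1"))

# With network access, a real name could be resolved the same way:
# print(resolve("www.yahoo.com"))
```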


URL format and example


http://host:port/path?query

http://www.google.com/search?hl=en&q=dog+grooming&btnG=Search
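
As a rough sketch, the URL format above can be taken apart with Python's urllib.parse. The helper name split_url and the fallback to port 80 (the HTTP default when no port is written) are assumptions made for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

def split_url(url):
    """Break a URL into the host, port, path, and query parts of the format above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # urlsplit reports None when no port is written; assume the HTTP default.
        "port": parts.port or 80,
        "path": parts.path,
        # parse_qs decodes "+" as a space and returns a list of values per name.
        "query": parse_qs(parts.query),
    }

info = split_url("http://www.google.com/search?hl=en&q=dog+grooming&btnG=Search")
print(info["host"], info["port"], info["path"])
print(info["query"])
```

Note that the query value "dog+grooming" comes back as "dog grooming": the plus sign is the form encoding of a space.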


Sample HTTP response and request


HTTP/1.1 200 OK
Set-Cookie: PHPSESSID=6cc7853a0610a9c8cce6d4a0aff9b789; path=/
Set-Cookie: tt2=1; expires=Mon, 24-Sep-07 22:21:21 GMT
Connection: close
Content-Type: text/html

<script language="JavaScript">
<!--
document.cookie='g2ref=noref; expires=Monday, 24-Sep-07 17:21:21 GMT;';
//-->
</script>


<HTML>
<HEAD>
<TITLE>Test Page</TITLE>
</HEAD>
<BODY>
<A HREF="/a/test.html">test resource</A>
</BODY>
</HTML>


GET /a/test.html HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*
Referer: http://192.168.0.100/index.html
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
Host: 192.168.0.100
Connection: Keep-Alive
Cookie: PHPSESSID=6cc7853a0610a9c8cce6d4a0aff9b789; tt2=1; g2ref=noref
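
To make the request format concrete, here is a minimal sketch that splits a request like the one above into its request line, headers, and cookies. The names parse_request and parse_cookies are hypothetical, only a few of the sample headers are included, and a real server would handle many more cases (repeated headers, bodies, malformed input).

```python
from http.cookies import SimpleCookie

RAW_REQUEST = (
    "GET /a/test.html HTTP/1.1\r\n"
    "Referer: http://192.168.0.100/index.html\r\n"
    "Host: 192.168.0.100\r\n"
    "Connection: Keep-Alive\r\n"
    "Cookie: PHPSESSID=6cc7853a0610a9c8cce6d4a0aff9b789; tt2=1; g2ref=noref\r\n"
    "\r\n"
)

def parse_request(raw):
    """Return (method, target, version, headers) for a raw HTTP request."""
    # A blank line (CRLF CRLF) separates the header section from any body.
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers

def parse_cookies(header_value):
    """Turn a Cookie: header value into a plain name -> value dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}

method, target, version, headers = parse_request(RAW_REQUEST)
cookies = parse_cookies(headers["cookie"])
print(method, target, version)
print(cookies)
```

The three cookies sent back are the two set by the Set-Cookie: response headers plus the one set by the script through document.cookie.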



Some relevant HTTP status codes


1xx Informational
2xx Successful
    200 OK
3xx Redirection
    301 Moved Permanently
    302 Found
    307 Temporary Redirect
4xx Client Error
    400 Bad Request
    403 Forbidden
    404 Not Found
5xx Server Error
    500 Internal Server Error
    504 Gateway Time-out
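
The first digit of a status code gives its class, which a client can use even for codes it does not recognize. A small sketch, assuming a hypothetical helper named describe; the reason phrases come from Python's http.HTTPStatus and may be worded slightly differently from the list above.

```python
from http import HTTPStatus

# Status classes as listed above, keyed by the first digit of the code.
CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe(code):
    """Return (class name, reason phrase) for a numeric status code."""
    status_class = CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one Python knows; the class alone still applies.
        phrase = "(unregistered)"
    return status_class, phrase

for code in (200, 302, 404, 299):
    print(code, *describe(code))
```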


Some relevant HTTP headers — requests and responses


Host:               Transfer-Encoding: chunked
Referer:            Content-Encoding: gzip
User-Agent:         Content-Length:
Cookie:             Content-Type:
                    Content-Location:
                    Location:
                    Connection:
                    Set-Cookie:


Sample "User-Agent:" header values


Mozilla/5.0 (Windows; U; Win98; en-US; rv:0.9.2) Gecko/20010726 Netscape6/6.1
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7
Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Opera/7.x (Windows NT 5.1; U) [en]
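
Because nearly every value starts with "Mozilla/", the browser has to be recognized from tokens later in the string. The following is a rough sketch that covers only the sample values above; the pattern list and the name identify are assumptions for illustration, not a complete detection scheme.

```python
import re

# Checked in order. Opera comes first because some Opera releases could
# also send "MSIE" in the header when set to identify as another browser.
PATTERNS = [
    ("Opera", re.compile(r"Opera[/ ](\S+)")),
    ("IE", re.compile(r"MSIE (\d+\.\d+)")),
    ("Firefox", re.compile(r"Firefox/(\S+)")),
    ("Netscape", re.compile(r"Netscape6?/(\S+)")),
]

def identify(user_agent):
    """Return (browser, version) guessed from a User-Agent value."""
    for name, pattern in PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return name, match.group(1)
    return "unknown", ""

print(identify("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"))
print(identify("Opera/7.x (Windows NT 5.1; U) [en]"))
```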


Common file formats — typical header (bytes 1-12), filename extension, media type


FFD8FFxx    .jpg    image/jpeg
47494638    .gif    image/gif
89504E47    .png    image/png
494433xx    .mp3    audio/mpeg
2E524D46    .rm     audio/x-pn-realaudio
4D546864    .mid    audio/midi
52494646    .wav    audio/wav
52494646    .avi    video/avi
000001Bx    .mpg    video/mpeg
3026B275    .wmv    video/x-ms-wmv
6D6F6F76    .mov    video/quicktime
464C56xx    .flv    video/x-flv
465753xx    .swf    application/x-shockwave-flash
25504446    .pdf    application/pdf
1F8B08xx    .gz     application/gzip
504B0304    .zip    application/zip
xxxxxxxx    .js     application/x-javascript
xxxxxxxx    .css    text/css
xxxxxxxx    .htm    text/html
xxxxxxxx    .txt    text/plain
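
A file's leading bytes can be checked against signatures like these to guess its media type when the extension or Content-Type: is missing or wrong. A minimal sketch with a subset of the table; the name sniff is hypothetical. The .wav and .avi entries share the same first four bytes ("RIFF"), so the sketch looks at bytes 9-12 to tell them apart.

```python
# (signature bytes, media type) pairs taken from the table above.
SIGNATURES = [
    (bytes.fromhex("FFD8FF"), "image/jpeg"),
    (bytes.fromhex("47494638"), "image/gif"),
    (bytes.fromhex("89504E47"), "image/png"),
    (bytes.fromhex("25504446"), "application/pdf"),
    (bytes.fromhex("1F8B08"), "application/gzip"),
    (bytes.fromhex("504B0304"), "application/zip"),
]

def sniff(data):
    """Guess a media type from the first bytes of a file, or return None."""
    for prefix, media_type in SIGNATURES:
        if data.startswith(prefix):
            return media_type
    # RIFF containers name their actual format in bytes 9-12.
    if data[:4] == b"RIFF":
        kind = data[8:12]
        if kind == b"WAVE":
            return "audio/wav"
        if kind == b"AVI ":
            return "video/avi"
    # Text formats (.js, .css, .htm, .txt) have no fixed signature.
    return None

print(sniff(b"GIF89a\x01\x00\x01\x00"))
print(sniff(b"RIFF\x24\x00\x00\x00WAVEfmt "))
```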



The two basic HTML document types — standard and frameset


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body>
document content
</body>
</html>


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN">
<html>
<head>
</head>
<frameset>
<frame src="document-content.html">
</frameset>
</html>


Some popular browsers — browser, layout engine, script engine


IE         Trident    JScript
Mozilla    Gecko      SpiderMonkey
Safari     WebCore    JavaScriptCore
Opera      Presto     linear_b



HTML markup handling — type of markup, example, result


comments            <!-- whatever -->    not rendered
content             text, <img>, etc.    rendered
SGML entities       like &amp;           evaluated by parser (substituted)
most HTML markup    like <base>          evaluated by parser (acted upon)
formatting info     <style>, etc.        passed to style system
script              <script>, etc.       passed to script engine
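
The first rows of this table can be observed with Python's html.parser, which reports each kind of markup through a separate callback. This is only a sketch of the parsing step, not of how any browser is built; the class name MarkupLogger is an assumption.

```python
from html.parser import HTMLParser

class MarkupLogger(HTMLParser):
    """Record what the parser reports for each kind of markup."""

    def __init__(self):
        # convert_charrefs=True makes the parser substitute entities
        # such as &amp; before the text is handed to handle_data.
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_comment(self, data):
        # Comments are seen by the parser but would never be rendered.
        self.events.append(("comment", data.strip()))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

logger = MarkupLogger()
logger.feed("<!-- whatever --><p>fish &amp; chips</p>")
logger.close()
print(logger.events)
```

The text arrives as "fish & chips": the entity has already been substituted by the time the content is reported.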



HTML element format and example


<element-name attribute-name="value">content</element-name>

<A HREF="doctors.jpg"><img src="doctors.small.jpg" width="124" height="100" alt="Image"></A>
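
As an illustration of the element format, the example above can be fed to Python's html.parser to list each element with its attributes. The class name ElementLister is hypothetical. Element and attribute names come back in lower case, reflecting that HTML treats them as case-insensitive.

```python
from html.parser import HTMLParser

class ElementLister(HTMLParser):
    """Collect (element name, attribute dict) for every start tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to read.
        self.elements.append((tag, dict(attrs)))

lister = ElementLister()
lister.feed('<A HREF="doctors.jpg"><img src="doctors.small.jpg" '
            'width="124" height="100" alt="Image"></A>')
lister.close()
for name, attributes in lister.elements:
    print(name, attributes)
```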


HTML elements grouped by function


document info         HEAD, TITLE, META, BASE, ISINDEX, LINK
document structure    HTML, BODY, SPAN, DIV, INS, DEL, BDO
frames                FRAMESET, FRAME, NOFRAMES, IFRAME
formatting            H1, H2, H3, H4, H5, H6, P, PRE, ADDRESS, BLOCKQUOTE, Q, BR, HR, CENTER
phrase markup         EM, STRONG, DFN, CITE, CODE, KBD, SAMP, VAR, ABBR, ACRONYM, SUB, SUP
font-related          STRIKE, U, B, I, TT, S, BIG, SMALL, FONT, BASEFONT
lists                 UL, OL, LI, DIR, MENU, DL, DT, DD
tables                TABLE, CAPTION, COLGROUP, COL, THEAD, TBODY, TFOOT, TR, TH, TD
forms                 FORM, INPUT, SELECT, OPTGROUP, OPTION, TEXTAREA, LABEL, FIELDSET, LEGEND, BUTTON
hyperlinks            A, MAP, AREA
embedding             IMG, APPLET, OBJECT, PARAM
script                SCRIPT, NOSCRIPT
stylesheets           STYLE