Math 481/581 Lecture 7: Introduction to the Internet
In this lecture we will briefly cover how the Internet works and describe basic Internet usage.
A network protocol is a specification of how data is transferred across a physical network link. For example, IP breaks data to be transferred into packets. You can think of a packet as a miniature email message: the packet contains a block of data, a source address (so the receiver knows who the message came from), and a destination address (so that the intervening network hardware knows where to send the packet).
The contents of the data block in an IP packet are unspecified -- it is up to additional network protocols to define the meaning of the data. The main role of IP is to provide an addressing scheme for packets and a mechanism to deliver packets.
Every machine on an IP network has a unique IP address which
is a set of four numbers. For example, the machine
ame2.math.arizona.edu has an IP address of
220.127.116.11. You'll sometimes see such addresses
referred to as dotted quads. Each of the numbers is between
0 and 255.
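A dotted quad is just a readable way of writing a single 32-bit number, one byte per field. Here is a minimal Python sketch of the conversion; the address 192.0.2.17 is an illustrative value chosen for this example, not one taken from the lecture.

```python
import ipaddress

def quad_to_int(quad: str) -> int:
    """Pack the four 0-255 fields of a dotted quad into one 32-bit integer."""
    a, b, c, d = (int(part) for part in quad.split("."))
    for field in (a, b, c, d):
        if not 0 <= field <= 255:
            raise ValueError(f"field out of range: {field}")
    return (a << 24) | (b << 16) | (c << 8) | d

addr = "192.0.2.17"  # illustrative address, assumed for this example
as_int = quad_to_int(addr)
print(as_int)
# The standard library's ipaddress module agrees with the hand-rolled version.
print(int(ipaddress.IPv4Address(addr)) == as_int)
# And it can turn the integer back into a dotted quad.
print(str(ipaddress.IPv4Address(as_int)))
```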
Packets are propagated across IP networks by IP routers. Routers are directly connected to one or more IP networks. When a packet arrives on one of the router's interfaces, it uses a set of rules to decide which interface to resend the packet on. This set of rules is (nowadays) dynamically maintained between routers using special protocols (RIP, EGP, BGP). In this fashion, the packet roams the network until it (hopefully) arrives on the destination's network interface.
IP isn't too useful on its own. There are 3 main reasons for this:
One of the most widely used protocols is called the Transmission Control Protocol, or TCP. TCP provides a reliable byte stream over an IP network. In other words, data arrives intact and in order (or else an error is passed to the TCP application). A TCP connection is a virtual circuit, kind of like a phone call. When sending data across a TCP connection, you are unaware that packet-based transport is occurring; instead, you feel like you are connected to your peer via a dedicated wire.
TCP accomplishes this amazing feat in two main ways. First, it puts a sequence number into each TCP packet sent. The sequence number advances with each packet sent on a given TCP connection (strictly speaking, TCP counts bytes rather than packets, but the effect is the same). This way, the TCP layer knows when packets have been duplicated (and so the duplicates can be ignored) and when they arrive out of order (in which case packets are queued until the stragglers arrive). Second, TCP provides a timeout/retransmit mechanism. This mechanism allows the TCP layer to detect dropped packets and have the sender retransmit them.
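The sequence-number idea can be sketched in a few lines of Python. This is a toy model, not real TCP: it numbers whole segments starting at 0 (real TCP numbers individual bytes), and the function name and sample data are made up for illustration.

```python
def reassemble(segments):
    """Rebuild a byte stream from (sequence_number, data) pairs.

    Duplicates are dropped; segments that arrive early wait in a queue
    until the gap in front of them has been filled.
    Returns the delivered bytes and the sequence numbers still waiting.
    """
    expected = 0          # next sequence number we can deliver
    waiting = {}          # out-of-order segments, keyed by sequence number
    stream = bytearray()
    for seq, data in segments:
        if seq < expected or seq in waiting:
            continue      # duplicate: already delivered or already queued
        waiting[seq] = data
        while expected in waiting:   # deliver any run that is now in order
            stream += waiting.pop(expected)
            expected += 1
    return bytes(stream), sorted(waiting)

# Segment 2 arrives before segment 1, and segment 1 arrives twice.
arrivals = [(0, b"He"), (2, b"o, "), (1, b"ll"), (1, b"ll"), (3, b"world")]
data, still_waiting = reassemble(arrivals)
print(data, still_waiting)
```

If a segment never arrives, everything behind it stays in the queue. That is the situation the timeout/retransmit mechanism exists to resolve.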
Beyond providing a reliable byte stream, TCP says nothing about the contents of the data contained in the stream. It is up to application level protocols to define the meaning of the data passed over a TCP connection.
There are other protocols based on IP, but TCP is the one we'll be interested in because the most commonly available Internet-related software uses TCP.
Before closing the discussion on TCP/IP, let's briefly look at the concept of an IP port. If IP packets only contained source and destination addresses, each machine on an IP network could only engage in one conversation at a time -- because it would be impossible to tell which application caused the packet to be sent, and which application should receive the data on the remote end. Clearly, this is unacceptable.
To remedy this problem, each packet also carries source and destination port numbers (strictly speaking, they live in the TCP header carried inside the IP packet rather than in the IP header itself). A port number is a 16-bit unsigned integer and can therefore take on values from 0 to 65535. Each program on a machine that wants to send or receive IP traffic must open a socket first. A socket basically looks like a file, as far as the program is concerned. When the socket is created, the application chooses or is assigned a unique port number for that particular socket. In other words, a unique (IP address, port number) pair can be traced back to a particular process. When data is written to the socket, the destination IP address and port number are specified -- and so the remote operating system knows which program is to receive the data. In this fashion, each machine on an IP network can maintain multiple simultaneous network connections.
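To see the (IP address, port number) pairing for yourself, here is a small Python sketch that opens two sockets and asks the operating system to assign each one a port. Binding to port 0 on the loopback address 127.0.0.1 is a choice made for this example so that it runs on any machine without a network.

```python
import socket

# Open two TCP sockets and let the operating system pick a free port for
# each one by binding to port 0 on the loopback address.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
second.bind(("127.0.0.1", 0))

first_addr = first.getsockname()    # an (IP address, port number) pair
second_addr = second.getsockname()
print(first_addr, second_addr)      # same IP address, different ports

first.close()
second.close()
```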
Enter the ubiquitous client-server communications model. This model resolves the issue of establishing network connections by requiring that one endpoint of the connection start executing and wait indefinitely for the other endpoint to contact it. Since one side is always running and simply responds to requests from remote programs, the issue of "how do we get connected in the first place?" is solved, philosophically.
In the client-server model, programs are classified as clients or servers depending upon who initiates the communication. The program that initiates communication is called the client. For example, Netscape, telnet, and FTP are clients. When a client program is executed by the user, it contacts the server. This has the side effect that a network connection is established, and the two programs are then free to communicate.
A server is a program that awaits connections from clients -- the server is the endpoint that is always executing (waiting for requests). You will often hear particular machines referred to as servers; for example, "ame2 is our web server". This statement is merely shorthand for "ame2 is the machine that runs our web server program".
Once the client and server are connected, they perform an application protocol. For example, when you telnet from one machine to another, the TELNET protocol is executed. The TELNET protocol provides handshaking during the login sequence, allows the client to interact with a remote shell, and to receive output from programs executed under the shell.
The time for which the two sides are connected forms a session or transaction. Some transactions are long-lived, like telnet sessions. Other transactions are short-lived, like a time-of-day query.
IP ports are the basis for getting clients and servers in touch with one another. Each service (protocol provided by a server process) listens on a particular port on its machine. When a client connects to this port, a network connection is formed and the two programs begin communicating.
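The whole client-server dance fits in a short Python sketch. Both ends run on the local machine here, with the server in a background thread; the "you said:" reply is a made-up application protocol, used only to show who waits and who initiates.

```python
import socket
import threading

def serve_once(listener):
    """Server side: wait for one client, answer it, hang up."""
    conn, _client_addr = listener.accept()   # blocks until a client connects
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"you said: " + request)

# The server starts first and listens on a port (0 = let the OS choose).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
server = threading.Thread(target=serve_once, args=(listener,))
server.start()

# The client initiates contact by connecting to the server's port.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    chunks = []
    while True:
        chunk = client.recv(1024)
        if not chunk:            # the server closed the connection
            break
        chunks.append(chunk)
reply = b"".join(chunks)

server.join()
listener.close()
print(reply)
```

Note that the client has to be told the port number. In this sketch it is passed along in a variable; real services solve the same problem with well-known port numbers.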
Of course, it is essential that the client know which port the server is listening on. For most services that we will encounter during this course, the mapping from services to port numbers is handled by a (mostly static) database, shared by all machines. For example, the TELNET service lives on port 23, and the FTP service can be contacted on port 21. These "well known numbers" are usually burned into the client program. There are special methods for clients whose servers do not live at well-known addresses to find the remote port number, but such things are not of primary interest to us.
If you want to see a protocol in action, try the following: pick your favorite web server and type "telnet web_server 80". The telnet program will issue a statement that you are connected. Then, type "GET /index.html" and you should see the HTML code for the server's main page (as long as your telnet program allows you to specify the remote port, and the index.html file exists on the web server). At this point, the connection will drop and you will be returned to the shell prompt.
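The same experiment can be done from a program instead of from telnet. The Python sketch below starts a throwaway web server on the local machine (using the standard http.server module and a temporary index.html, both assumptions of this example) and then talks to it over a raw socket. The request line carries an explicit "HTTP/1.0" because most current servers expect a version; the bare "GET /index.html" above is the older form.

```python
import functools
import http.server
import pathlib
import socket
import tempfile
import threading

class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the output clean

def fetch(host, port, path):
    """Do by hand what the telnet session does: connect, send GET, read."""
    with socket.create_connection((host, port)) as conn:
        request = f"GET {path} HTTP/1.0\r\n\r\n"
        conn.sendall(request.encode("ascii"))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:        # the server dropped the connection
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", "replace")

with tempfile.TemporaryDirectory() as docroot:
    pathlib.Path(docroot, "index.html").write_text("<h1>It works</h1>\n")
    handler = functools.partial(QuietHandler, directory=docroot)
    server = http.server.HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    response = fetch("127.0.0.1", server.server_address[1], "/index.html")
    server.shutdown()
    server.server_close()

print(response)   # status line and headers, a blank line, then the HTML
```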
The Internet grew out of a collection of internets (networks connected via TCP/IP) that existed primarily at universities and research labs. For quite a while, the NSF maintained the Internet backbone that connected the sites to one another. TCP/IP itself was developed by a DARPA-funded research project.
When the World Wide Web (which started at universities and research facilities, too) exploded a few years ago, the maintenance of the network infrastructure was farmed out to private companies. The companies were enticed by the rapidly falling price of the home computer and the rise of the Internet Service Providers (ISPs) that connect people's home machines to the rest of the Internet. In other words, there was a growing market for Internet access, and the prospects for making money were good. When you add in the commercial uses of the 'net, it looked real good. Finally, most of the major players are phone companies. These companies are equally happy to transport voice or data, as long as they're making a profit. Given that the phone companies already had a lot of the necessary infrastructure, it isn't too surprising that they manage the main Internet network hardware now.
telnet host [port]
As usual, the square brackets denote that the
argument is optional. The port defaults to 23, the well-known
location of the TELNET service.
The host argument deserves a little explanation,
though. It can be any of: an IP address, a hostname, or a
hostname alias.
If you know the IP address of the remote site, you can use it.
Most people find it hard to remember IP addresses and prefer to use host names or hostname aliases. The Domain Name System (DNS) is a hierarchical database that maps Internet hostnames to IP addresses -- this lets people remember the name while giving the machine a mechanism to determine the IP number that it needs.
For example, the hostname ame2.math.arizona.edu maps
to the IP address
18.104.22.168. As an example of a hostname alias,
ame2 also goes by the name
www.math.arizona.edu. Hostname aliases are useful for
administrators and essential for ISPs. For example, if the Math
Department ever needs to move the web server to a different
physical machine, the new machine will have a different IP address,
but the name can stay the same. ISPs use DNS for a number of
purposes. One of the main ones is to allow lots of commercial web
sites to exist on a single physical machine.
The example above also illustrates the hierarchical nature of the
naming scheme. Educational institutions use the .edu
suffix. Each particular educational institution will prepend a
unique identifier for itself; for example arizona.edu
is the UofA,
asu.edu is ASU, etc. At the UofA,
departments that run their own systems will typically prepend
a token that identifies themselves, such as math.arizona.edu.
In the Math Department, each machine is given a final name to
get it to a unique IP address; for example,
ame2.math.arizona.edu identifies one particular machine
and its IP address.
DNS is implemented as a network service of its own (queries normally
travel over UDP, with TCP as a fallback). On UNIX systems, the client
side of DNS is available as a standalone lookup command.
It is also normally embedded into the C runtime library so that the
magic of DNS is transparent to the programmer and user.
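Here is what "transparent to the programmer" looks like in practice, as a minimal Python sketch. The program just asks for a name and gets an address back; the resolver hidden in the runtime library does the rest. The name "localhost" is used so the lookup succeeds even on a machine with no network connection.

```python
import socket

# Ask the system resolver for the address behind a name.
address = socket.gethostbyname("localhost")
print(address)

# A dotted quad passes straight through; no lookup is needed.
print(socket.gethostbyname("127.0.0.1"))
```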
Some days, you will find that you cannot get anywhere on the Internet because DNS is down. This is really annoying when it happens, but all you can do is wait. Generally the problem exists somewhere "out there". To understand why this is the case, you need to know that DNS lookups operate "backwards" from the way you might expect. For example, if I am sitting at home running PPP through StarNet and type:
telnet ame2.math.arizona.edu
my machine connects to StarNet's DNS server (that's how my machine is configured) requesting resolution for
ame2.math.arizona.edu. Their nameserver contacts the
edu root nameserver (it knows this address) and requests resolution again. The lookup then proceeds from
edu to arizona.edu to math.arizona.edu. Once it gets to Math, an authoritative answer of
22.214.171.124 is returned to me at home. This is a nice system but it is also prone to failure: if any of the intervening machines croak, nobody can look up names in that part of the namespace. There is redundancy built into the system, but it is not foolproof. I have seen all
.com sites disappear for several hours, for example.
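The chain of referrals, and the way one dead nameserver takes out a whole branch of the namespace, can be modeled with a toy Python sketch. Everything here is made up for illustration: the server names, the address 192.0.2.17, and the table layout. Real DNS uses network queries, caching, and redundant servers.

```python
# Each toy nameserver either refers us to a server closer to the answer
# or gives the final address. All names and the address are hypothetical.
NAMESERVERS = {
    "edu-root":   {"refer": {"arizona.edu": "arizona-ns"}},
    "arizona-ns": {"refer": {"math.arizona.edu": "math-ns"}},
    "math-ns":    {"answer": {"ame2.math.arizona.edu": "192.0.2.17"}},
}

def resolve(name, servers, start="edu-root"):
    """Follow referrals from the root down; return (address, servers asked)."""
    server, asked = start, []
    while True:
        asked.append(server)
        zone = servers.get(server)
        if zone is None:
            raise LookupError(f"{server} is down; cannot resolve {name}")
        if name in zone.get("answer", {}):
            return zone["answer"][name], asked
        for suffix, next_server in zone.get("refer", {}).items():
            if name.endswith("." + suffix):
                server = next_server     # move one level down the hierarchy
                break
        else:
            raise LookupError(f"{server} knows nothing about {name}")

print(resolve("ame2.math.arizona.edu", NAMESERVERS))

# Knock out the middle server: the name can no longer be resolved.
broken = {k: v for k, v in NAMESERVERS.items() if k != "arizona-ns"}
try:
    resolve("ame2.math.arizona.edu", broken)
except LookupError as err:
    print(err)
```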
DNS failure can also affect other services. For example, you may know that DNS is down and telnet to your favorite spot by IP address. After a few seconds, your telnet client may terminate without ever presenting you with a login prompt.
This is a security feature. Some systems look at the IP address of the incoming connection and use the DNS system to find the hostname associated with this address. Some go even farther and look up the IP address that corresponds to the name, just to make sure things match up. The goal is to minimize ill effects from some chucklehead hacker that plugs a laptop into a network somewhere, picks an otherwise unused IP address, and proceeds with their nefarious activities.
Before getting into how to use FTP, there is an evil but important thing you should know: some operating systems make a distinction between text files and binary files. I can think of no rationale to defend this dichotomy; nevertheless, the distinction exists.
On all systems, transferring a file in binary mode causes the contents of the file to be sent verbatim. For machine-executable files, tar and zip archives, and other such data, you'll want to do your transfers in binary mode. If you transfer them in ASCII mode (text mode), you'll probably end up with garbage. It is annoying to spend hours downloading an enormous binary file over your modem, only to find that you transferred it in ASCII mode.
It is hard to define what is meant by a text file, except to say that a text file generally contains human readable ASCII text. A file that you create in emacs on a UNIX system is a text file, as is a file you create using Windows Notepad. However, a Microsoft Word document is not a text file because it contains raw binary formatting information. At any rate, you should transfer text files between systems in ASCII mode to avoid data corruption.
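One concrete difference that text mode papers over is the line-ending convention: UNIX ends a line with a line feed, while DOS and Windows use a carriage return followed by a line feed. The Python sketch below mimics that translation to show why it is harmless for text and destructive for binary data. The function and the sample bytes are invented for this example; a real FTP client does the translation for you.

```python
def ascii_mode_transfer(data: bytes) -> bytes:
    """Mimic a text-mode transfer from UNIX to a CR-LF system:
    every line feed is rewritten as carriage return + line feed."""
    return data.replace(b"\n", b"\r\n")

text = b"first line\nsecond line\n"
binary = bytes([0x89, 0x50, 0x0A, 0x00, 0x0A, 0xFF])  # arbitrary sample bytes

# Text survives: same words, line endings adapted to the receiving system.
print(ascii_mode_transfer(text))

# Binary data does not: any byte that happens to equal a line feed gets an
# extra byte inserted in front of it, so the file changes length.
print(len(binary), len(ascii_mode_transfer(binary)))
```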
That aside, the general usage of the FTP command is:
ftp [-i] host
The "-i" argument is optional and will be discussed below. Host can take any of the forms described in the telnet section above (this will be true of all the network clients we'll cover, actually).
You will be prompted for your username and password on the remote system. Once you have provided this information, you will find yourself at the FTP prompt:
ftp>
At the FTP prompt, you can type in various commands to send and receive files, change directories, switch between ASCII and binary mode, list the contents of remote directories, etc.
In what follows, local host will refer to the host on which you executed the FTP command and remote host will refer to the host you specified on the FTP command line.
To transfer files in binary mode, type "binary" at the FTP prompt. To transfer files in text mode, type "ascii" at the prompt.
To transfer a file from the local host to the remote host, type "put filename", where filename is the name of the file on the local host. To transfer a file from the remote host to the local host, type "get filename", where filename is the file's name on the remote system. Both of these commands assume that the files mentioned exist in the current working directory on the local/remote system.
To change your CWD on the local host, type "lcd path". To change directories on the remote host, use "cd path". The "pwd" command will give you the absolute pathname of your CWD on the remote host, while "lcd ." will usually give the same info on the local host.
To list the contents of the CWD on the remote end, use the "dir"
or "ls" commands. Normally, you can use any valid options for the
UNIX ls command with the FTP "ls" command. If you are
FTP'ing from a UNIX system, you can examine the contents of the
local CWD by typing something like "!ls -l".
Finally, you can put or get multiple files (but not subdirectories, usually) by using the "mput" and "mget" commands. These commands take a shell globbing pattern as an argument and transfer all of the indicated files. If you invoke FTP without using "-i", you will be asked to confirm each transfer; otherwise, FTP will transfer all of the files without asking. You can also use the "prompt" command at the FTP prompt to toggle the sense of the "-i" switch.
Finally, some sites have a public area that anyone can download files from (sometimes upload, too). To log in to such a site, you generally use "anonymous" for your username, and your email address as a password.
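The same session can be scripted. Below is a sketch using Python's standard ftplib module; the function name and its arguments are hypothetical, and you would need to supply a real anonymous FTP site to try it.

```python
from ftplib import FTP

def fetch_binary(host, remote_dir, filename):
    """Anonymous login, cd, binary get: the same steps as an interactive session."""
    with FTP(host) as ftp:
        ftp.login()                      # no arguments: user "anonymous"
        ftp.cwd(remote_dir)              # like "cd path"
        with open(filename, "wb") as local_file:
            # retrbinary transfers the bytes verbatim (binary mode)
            ftp.retrbinary("RETR " + filename, local_file.write)
        return ftp.nlst()                # like "ls": names in the remote CWD
```

A call would look something like fetch_binary("ftp.example.org", "/pub", "README"), where the host and path are placeholders. The downloaded file lands in the local CWD.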
Hypertext is usually viewed with a browser. In class, Netscape is the browser I use, but there are many other browsers available. The browser allows you to read hypertext documents and follow hyperlinks by clicking on them.
The most common hypertext system in use today is called the World Wide Web (WWW). Documents on the WWW are written in HyperText Markup Language (HTML). These documents are accessed via TCP/IP servers that support the Hypertext Transfer Protocol (HTTP). The client side of HTTP is the web browser. HTML has been around for a long time. It was invented by particle physicists at CERN for their internal information system.
We'll talk about writing your own HTML documents next lecture. HTML documents can contain text (in various fonts and sizes), lists, tables, images, movies, audio clips, and all sorts of other data. Browsers can handle some of this data on their own; for example, text, tables, and certain types of images. Data that cannot be handled directly by the browser is handed off to an "external viewer" -- we'll see how this works in a moment.
HTML documents can be hyperlinked to one another in arbitrary ways -- the author of a document determines what hyperlinks are to be used and where the links go. If you picture a bunch of documents connected by a bunch of links, the resulting picture looks like a spider web, hence the name World Wide Web (actually, it looks more like "chaos" to me -- spider webs are more organized than the WWW).
All browsers have a "bookmark" facility that allows you to keep a personal list of favorite web pages. Bookmarks allow you to quickly return to these locations.
Every web-accessible document has a unique address, called a Uniform Resource Locator or URL. To open a remote document, you simply type the document's URL into your web browser. URLs consist of 2 or 3 parts:
telnet://ame2.math.arizona.edu
Opening this URL will, on a UNIX X-windows system, usually pop up an xterm containing a telnet process. You can log in and do stuff the same as if you telnet'ed by hand.
Another example: a linux binary for the Berkeley MPEG player is available
via anonymous FTP from
math.arizona.edu. The player is in the file
mpeg_play.tar.gz in the
/pub/slirp/linux directory. The URL for this file is:
ftp://math.arizona.edu/pub/slirp/linux/mpeg_play.tar.gz
When you open this URL, your browser will download this file and save it into your account.
Finally, the HTML home page for this course is available at the URL:
As you can see, URLs are generally of the form
method://host/document
There are a bunch of other protocols that can be used in the method field. Many of them are now obsolete, and most of them require browser support. Some browsers, such as Netscape, have integrated email clients and support email-URLs:
mailto:firstname.lastname@example.org
In general, the http and ftp methods are the most common.
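To see the method://host/document structure pulled apart, here is a short Python sketch using the standard urllib.parse module on the two example URLs above.

```python
from urllib.parse import urlsplit

urls = (
    "telnet://ame2.math.arizona.edu",
    "ftp://math.arizona.edu/pub/slirp/linux/mpeg_play.tar.gz",
)
for url in urls:
    parts = urlsplit(url)
    # scheme = method, hostname = host, path = document (may be empty)
    print(parts.scheme, parts.hostname, parts.path or "(no document)")
```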
Here is a simple form (only accessible from Math Dept machines):
When you submit form data for processing, it is fed to a program (specified in the HTML code for the form) spawned by the HTTP server. This program is called a CGI (Common Gateway Interface) program. CGI is a standard that allows the web server to interface with external programs. The CGI program processes the data you supplied, performs some action, and sends the results back to your browser. Normally, these results consist of HTML generated by the CGI program.
Here's an example of a really simple CGI program that returns the current date in your browser:
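A minimal sketch of what such a date program could look like, written in Python as an assumption (it is not the course's own program). A CGI program writes a Content-type header, a blank line, and then the document to its standard output, and the web server relays that to the browser.

```python
#!/usr/bin/env python3
import datetime

def render(now):
    """Build the full CGI response: header line, blank line, HTML body."""
    body = (
        "<html><body>"
        f"<p>The current date is {now:%Y-%m-%d}.</p>"
        "</body></html>"
    )
    return "Content-type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # The web server runs this program and sends its output to the browser.
    print(render(datetime.datetime.now()))
```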
We won't be talking about writing CGI programs or forms in this course. However, there are tons of documentation on this complex subject available on the net. A good place to start is the perl home page:
and references therein.
If you need to reload a document from the net, you can hit the "reload" button on your browser to force it to go back to the net for it. It is necessary to do this when you are editing an HTML file and want to see your changes, for example.
If you are using Netscape, you can configure the RAM and disk cache sizes via a dialog under the Options menu. If your machine is short on RAM, you can reduce the memory cache size.
The disk cache normally causes the most trouble for people. The default disk cache size in Netscape is 5 MB. If your disk quota is small, the disk cache can gobble up a significant portion of your available space. The dialog that lets you set the size of your disk cache should also have a button to clear the cache -- and allow you to recover some space. On UNIX systems, you can usually clear the cache by hand with:
> cd $HOME/.netscape
> /bin/rm -rf cache
Most search engines operate in a similar fashion: the search engine roams the web every so often and downloads and catalogs the pages it finds. To use the search engine, you type in some sort of search specification, usually a list of "key words", click a button, and the search engine responds with a list of pages it thinks are relevant to your query. If you are lucky, the list will be short and the pages listed will actually be relevant.
Unfortunately, it is often the case that the list is very long and filled with items that don't interest you. When this happens, you can either try a different search engine or restrict your search criteria on the current search engine.
Here is a short list of available search engines:
On AltaVista you can use "+" and "-" to help restrict your search. The search term "+word" means to include documents containing "word" and "-word" means to exclude documents containing "word". For example, I was looking for the guitar tablature for Steve Howe's "Mood for a Day" recently. I went to AltaVista and searched for "+steve +howe" and got over 20000 matches, so I changed it to "+steve +howe +mood +day" and got over 800 matches. Finally, I tried "+steve +howe +mood +day +tablature" and got it down to 35 matches. The second item in the list took me to a site that led me to Belgium, where I finally found what I was looking for: black market tablature.
While refining your search, you may see lots of irrelevant (to you) pages with similar titles. This is a good place to use "-word". You can also use "-host:hostname" to exclude entire sites from the search.
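The "+" and "-" rules are easy to state precisely in code. This Python sketch is only a model of the filtering logic, run against three made-up page descriptions; it says nothing about how AltaVista actually implements its index.

```python
def matches(query, document):
    """Apply "+word" (must appear) and "-word" (must not appear) terms."""
    words = set(document.lower().split())
    for term in query.lower().split():
        if term.startswith("+") and term[1:] not in words:
            return False
        if term.startswith("-") and term[1:] in words:
            return False
    return True

pages = [   # made-up page contents for illustration
    "steve howe mood for a day guitar tablature",
    "steve howe tour dates",
    "mood ring of the day",
]
hits = [page for page in pages if matches("+steve +howe +mood +day -tour", page)]
print(hits)
```

Each extra "+word" can only shrink the list, which is why piling on terms brought the match count down in the search described above.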
Finally, it is sad but true that using "-" in conjunction with sexually explicit words will often reduce the list a lot more than you'd think.
Since plugins are by definition optional add-ons, you should avoid using them in your publicly available content. For one thing, a given plugin may not be available for all platforms. For another, you will annoy people who try to read your page and find that they have to download a 30 MB plugin to do so.
If you want to use plugins for your personal use, check out your browser's home page. There are some pretty cool plugins available. Many of them are commercial, but demo versions are available for a lot of them.
Several years ago, Sun Microsystems embraced Java as "the programming language of the Internet". Since Java can be ported to any platform, the idea was that it could be embedded in all browsers. This, in turn, would allow content authors to create highly customized web applications that would execute on the client machine. Anyway, much hype resulted, dozens of books were written, and people made money.
I have seen some cool applications written in Java. However, Java has a number of downsides that prevent it from being embraced by programmers and users.
First, it is almost universally agreed that Java is not an easy language to learn. Java is object oriented and resembles C++ in many ways. There are hundreds of books on object oriented design, which should clue you in to the fact that it's a complicated, multifaceted issue. I've been trying to figure OOP out for about a decade. I haven't succeeded yet.
Java was touted as a "secure language" by Sun when they first adopted it. This statement is as stupid as the day is long. The person who said this probably lives in a van down by the river these days. Protocols can be secure or insecure. Programs can be secure or insecure. Languages are just languages. Languages can provide unsafe features, but it is ultimately up to the programmer to choose to make use of such features. Even if unsafe programming constructs are not used, the final product may still contain "unforeseen features" (read "bugs") that can be exploited by criminals.
At any rate, the moment Sun made this statement, the programming community went to work to produce counterexamples. In a short time it was determined that it was fairly easy to create Java programs that did evil things. And I mean really evil.
This has serious ramifications for the user community. When you download a Java program, the code executes on your machine. This means that it can do mean, nasty, ugly things to your machine. And you cannot be sure what the code is going to do unless you can obtain and take the time to study the source code. In other words, unless you are a Java programmer and Jedi master, you can't be sure what the thing is going to do.
Executing a Java program is no different from downloading a Windows binary and running it. The program may contain (or simply consist of) evil instructions. There are no Java virus checkers, as far as I know. In any case, you don't want to run any program you get off the net unless you trust the source from which you obtained it.
This problem is, in my opinion, unresolvable. On the web, anyone can say most anything they want. There is no guarantee that any particular piece of web content is true, helpful, or even nondamaging. The same can be said of Java code. The nice thing about Java is that it doesn't require any CPU time on the web server -- the execution happens on the client machine that requested the program.
The bad thing about Java is that the onus is on the client to ensure that the program is not harmful, and this is generally not possible. The good thing about CGI is that you, the client, don't have to worry (well, not much) about your system getting nuked. The onus is on the web server administrator to make sure that the CGI program cannot be used for evil purposes.
Still, if you have the time to sit down and learn Java, you can do some nifty things with it.
If the document is HTML source, plain ASCII text, a GIF image, an X bitmap image, or a JPEG image, Netscape can display it internally, for example. Every other type of data must be farmed out to an external application (or a plugin, if one exists).
How does the browser know what type of data it is receiving? There are two options here. If the data is a file, the browser can look at the extension on the filename to determine the type of data in the file. Second, every time you download data from a web server, the server adds a couple of header lines that describe the type of data in the file, as well as the encoding used for the data.
If the data type is not something that can be handled internally, your browser looks in a table for an application on your system that can handle that type of data. If such an application exists, it is invoked; otherwise, the browser will normally offer to save the data to disk (and take no further action).
The mechanism by which this occurs is called MIME (Multipurpose Internet Mail Extensions). MIME was invented to allow people to send multimedia content via regular email. Since email can only handle ASCII data, you need an encoding scheme to send binary data like GIF images. We'll talk about using MIME for email in a couple of lectures; for now, we'll focus on how MIME relates to web browsers.
Every piece of MIME data has an associated content-type. For example, vanilla ASCII text has a content type of "text/plain", while HTML source has a content type of "text/html". GIF images have a content type of "image/gif" and MPEG movies are "video/mpeg".
When the browser downloads something from the server, the headers contain a "Content-type: whatever/whatever" line if the content type is known. If the content type is not known, the server sends the raw data and relies on the browser to interpret the data.
The browser looks at the filename suffix (characters following the
last "." in the filename) and attempts to map that to a
content type using some internally defined rules, as well as
(on UNIX systems) the contents of the $HOME/.mime.types file.
The .mime.types file consists of lines like:
# this is a comment.
application/pdf pdf
application/postscript eps ps
video/mpeg mpeg mpg
Lines beginning with a "#" are ignored. Other lines consist of a content type, whitespace, and a list of filename suffixes that correspond to that content type, separated by whitespace.
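The suffix lookup is simple enough to sketch in Python. The function names below are invented for this example, and the parser handles only the line format just described.

```python
MIME_TYPES = """\
# this is a comment.
application/pdf          pdf
application/postscript   eps ps
video/mpeg               mpeg mpg
"""

def parse_mime_types(text):
    """Return a dict mapping each filename suffix to its content type."""
    suffix_to_type = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        content_type, *suffixes = line.split()
        for suffix in suffixes:
            suffix_to_type[suffix] = content_type
    return suffix_to_type

def guess_type(filename, table):
    """Map the characters after the last "." to a content type, if known."""
    _, dot, suffix = filename.rpartition(".")
    return table.get(suffix) if dot else None

table = parse_mime_types(MIME_TYPES)
print(guess_type("f1040.pdf", table))
print(guess_type("movie.mpg", table))
print(guess_type("README", table))
```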
So, if I download a tax form from the IRS in PDF format using Netscape, the browser will know that the content type is "application/pdf". What can Netscape do with a PDF file? Nothing, yet.
The $HOME/.mailcap file maps content types to
applications that can display that content type. For example:
# this is a comment.
application/pdf;acroread %s
application/postscript;ghostview %s
video/mpeg;mpeg_play %s
Lines in the
.mailcap file consist of a content type, a semicolon, and a command to invoke an appropriate application. A "%s" in the command will be replaced by a filename (Netscape will download the data to a temporary file on the local machine).
Now when you click on the PDF file at the IRS, Netscape will download the file and run "acroread" on it. When you quit acroread, Netscape will delete the temporary file.
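The step from content type to viewer command can be sketched the same way. This Python example only builds the command line; it does not run anything. The function names and the temporary filename are made up, and the parser understands only the two-field lines shown above (real mailcap files can carry extra fields).

```python
import shlex

MAILCAP = """\
# this is a comment.
application/pdf;acroread %s
application/postscript;ghostview %s
video/mpeg;mpeg_play %s
"""

def parse_mailcap(text):
    """Return a dict mapping each content type to its command template."""
    viewers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        content_type, _, command = line.partition(";")
        viewers[content_type.strip()] = command.strip()
    return viewers

def viewer_command(content_type, filename, viewers):
    """Substitute the filename for %s, or return None if no viewer exists."""
    template = viewers.get(content_type)
    if template is None:
        return None      # no viewer: the browser would offer to save to disk
    return template.replace("%s", shlex.quote(filename))

viewers = parse_mailcap(MAILCAP)
print(viewer_command("application/pdf", "/tmp/form.pdf", viewers))
print(viewer_command("application/x-stuffit", "/tmp/archive.sit", viewers))
```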
Most browsers allow you to set all of this up via a configuration dialog, so that you don't have to edit the files directly; however, you should make sure that your MIME mail program (if you use one) is able to read the files that your browser writes.
There is one serious problem with MIME: it is not platform independent. For example, the Macintosh uses a program called "StuffIt" to pack multiple files into a single file. StuffIt files have the suffix ".sit" and have content type "application/x-stuffit". Mac users have no problem dealing with StuffIt files. But everyone else does because odds are good that there is no application on platform X to deal with StuffIt files. The same can be said for Microsoft Video files and UNIX tar files.
The only way to deal with this problem is to go find a machine that can handle files of type Y and do your browsing or MIME email decoding there.
The other problem is that multiple versions of an application may exist, and there may be add-on components to these applications that are not locally available. For example, there are many versions of Microsoft Word in use. If someone sends you an MSWord version 6 file and you have MSWord 5, you are out of luck. Worse yet, it is often the case that foreign versions of Word are shipped with different fonts than US versions (for various accents, etc). Even if there isn't a version mismatch, you still won't be able to look at the thing because the fonts that the document refers to aren't available in the US.
If you find yourself in such a situation, I can recommend three courses of action: either
The upshot is: MIME = OUCH.