Thursday, July 07, 2005

Project: PyFileServer

More definition on the PyFileServer, as it gets developed.

The PyFileServer application maintains a number of mappings between local file directories and "virtual" roots. For example, if there was such a mapping:
C:\SoC\Dev\WSGI -> /wsgi
and the server application is at http://localhost:8080/pyfileserver
=> a request to GET http://localhost:8080/pyfileserver/wsgi/WSGIUtils-0.5/License/LICENSE.txt would return the file C:\SoC\Dev\WSGI\WSGIUtils-0.5\License\LICENSE.txt

This means the shared file repository does not interface directly with the web server but only through the application, which is what we want.
* Relative paths are supported and resolved to a normal canonical path, but you cannot use them to climb out of the base path C:\SoC\Dev\WSGI
* Case sensitivity depends on the server OS.
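The mapping and the "cannot climb out of the base path" rule could be sketched roughly as below. This is only an illustration, not the PyFileServer code: the MAPPINGS table and the resolve() function are hypothetical names, and ntpath is used so the Windows example behaves the same on any OS. The containment check is case-sensitive, which is a simplification.

```python
import ntpath

# Hypothetical mapping table: virtual root -> local base directory.
MAPPINGS = {"/wsgi": r"C:\SoC\Dev\WSGI"}


def resolve(url_path, mappings=MAPPINGS):
    """Map a URL path (relative to the application) to a local file path.

    Returns None when no mapping matches or the path escapes its base.
    """
    for root, base in mappings.items():
        if url_path == root or url_path.startswith(root + "/"):
            relative = url_path[len(root):].lstrip("/")
            base = ntpath.normpath(base)
            # normpath collapses "." and ".." so the result is canonical.
            candidate = ntpath.normpath(ntpath.join(base, *relative.split("/")))
            # Reject anything that no longer sits under the base directory.
            if candidate == base or candidate.startswith(base + "\\"):
                return candidate
            return None
    return None
```

With this sketch, resolve("/wsgi/WSGIUtils-0.5/License/LICENSE.txt") gives the local path from the example above, while resolve("/wsgi/../secret.txt") gives None.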

Browsing Directories
When the request GETs or POSTs to a directory url, like http://localhost:8080/pyfileserver/wsgi/WSGIUtils-0.5/ , the response pretty-prints an index of the items in the directory (name, type, size, date last modified).

If POST is used and the querystring contains "listitems", the information is returned in a tab-delimited format instead. If the querystring contains "listtree", all items in that directory tree structure (not just the directory itself) are listed in a tab-delimited format. This is to support the PyFileSyncClient - normal web-browser usage will GET the html prettyprint.
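A rough sketch of what the tab-delimited output might look like follows. The column order (name, type, size, date last modified) is taken from the directory index described above; the function name, the "dir"/"file" labels and the date format are assumptions, since the exact format is not defined yet.

```python
import os
import time


def list_items(directory, tree=False, _prefix=""):
    """Yield one tab-delimited line per item: name, type, size, modified.

    tree=False mimics "listitems" (one directory only);
    tree=True mimics "listtree" (the whole directory tree).
    """
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        is_dir = entry.is_dir()
        info = entry.stat()
        name = _prefix + entry.name
        modified = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(info.st_mtime))
        kind = "dir" if is_dir else "file"
        yield "\t".join([name, kind, str(info.st_size), modified])
        if tree and is_dir:
            # Recurse, reporting names relative to the starting directory.
            yield from list_items(entry.path, tree=True, _prefix=name + "/")
```

A client such as the PyFileSyncClient could then split each line on tabs instead of parsing HTML.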


Downloading Files
When the request GETs or POSTs to a file url, like http://localhost:8080/pyfileserver/wsgi/WSGIUtils-0.5/License/LICENSE.txt, the response returns the file. The following features are planned and mostly come from the HTTP 1.1 specification:


FILE REQUEST --> [Conditional Processing] --> [Multipart] --> [Compression] --> [Encryption] --> [MD5] --> Channel Transfer


Conditional Processing
Here it refers to the conditional headers If-Modified-Since, If-Unmodified-Since, etc. They will be supported eventually.
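For the two headers named above, the check might eventually look something like this sketch. The function names are hypothetical; the status codes are the ones HTTP 1.1 assigns (304 Not Modified, 412 Precondition Failed). It assumes last_modified is a timezone-aware datetime and simply ignores dates it cannot parse.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _parse_http_date(value):
    """Parse an HTTP date header; return None if absent or unparseable."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def check_conditions(headers, last_modified):
    """Return 304, 412, or None (None means: send the normal response)."""
    # HTTP dates only have one-second resolution.
    last_modified = last_modified.replace(microsecond=0)
    since = _parse_http_date(headers.get("If-Modified-Since"))
    if since is not None and last_modified <= since:
        return 304
    unmodified = _parse_http_date(headers.get("If-Unmodified-Since"))
    if unmodified is not None and last_modified > unmodified:
        return 412
    return None
```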


Multiple Partial Content
This is to support multiple partial download - the client requests different partial content ranges in separate requests. The feature is activated by the presence of the Range request header, or by a yet-to-be-defined equivalent in the POST querystring. The server accepts only one Range (this may change if I figure out that multipart thingy); multiple or invalid ranges will yield a response of 416 Requested Range Not Satisfiable with a Content-Range header of '*'. Otherwise the data is returned as 206 Partial Content with Content-Range set accordingly.

Sending partial content always forces the Content-Type to application/octet-stream. Assembly of the file is up to the client. Note that the byte range specified may not match the Content-Length, since it is the byte range requested BEFORE encryption, compression and other messy stream modifiers.
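The single-range rule could be sketched as follows. The function names are hypothetical, and the '*' in Content-Range is written here as "bytes */<size>", which is how HTTP 1.1 spells it. That reading of the plan is an assumption.

```python
import re

# Exactly one "first-last" spec; a comma-separated list will not match.
_RANGE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_single_range(header, size):
    """Return inclusive (first, last) byte offsets, or None if unusable."""
    match = _RANGE.match(header.strip())
    if not match:
        return None  # malformed, or more than one range
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form "bytes=-N": the final N bytes of the file.
        length = int(last)
        if length == 0 or size == 0:
            return None
        return max(size - length, 0), size - 1
    start = int(first)
    end = int(last) if last else size - 1
    if start >= size or end < start:
        return None
    return start, min(end, size - 1)


def range_response(header, size):
    """Return (status, headers) for a request carrying a Range header."""
    parsed = parse_single_range(header, size)
    if parsed is None:
        return 416, {"Content-Range": "bytes */%d" % size}
    first, last = parsed
    return 206, {
        "Content-Range": "bytes %d-%d/%d" % (first, last, size),
        "Content-Type": "application/octet-stream",
    }
```

The offsets refer to the file before any compression or encryption is applied, in line with the note above.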

*Design Note* It might be better to do partial content after compression and encryption, but it means I would have to compress and encrypt the whole file to potentially just send one partial piece of it, and that the client would need to receive all pieces to successfully uncompress and decrypt -> not desirable.


Compression
The HTTP protocol specifies identity (none), gzip, deflate (the zlib format) or compress. I plan to implement only "gzip" for the initial version, but will make it pluggable so that someone can easily implement some other compression (lossy at your own risk). The feature is activated by the corresponding compression identifier in "Accept-Encoding" or an equivalent activator in the POST querystring.
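One way the pluggable part might look is a simple registry keyed by encoding name, as in this sketch. COMPRESSORS, choose_encoding() and encode_body() are hypothetical names, q-values in Accept-Encoding are ignored for brevity, and returning None stands in for the "server does not understand it" case.

```python
import gzip

# Hypothetical registry: encoding name -> function that compresses bytes.
# Another compression could be plugged in by adding an entry here.
COMPRESSORS = {
    "identity": lambda data: data,
    "gzip": gzip.compress,
}


def choose_encoding(accept_encoding, available=COMPRESSORS):
    """Return the first requested encoding we support, else None."""
    if not accept_encoding.strip():
        return "identity"  # nothing requested: send the body as-is
    for item in accept_encoding.split(","):
        # Drop any ";q=..." parameter (q-values are ignored in this sketch).
        name = item.split(";", 1)[0].strip().lower()
        if name in available:
            return name
    return None


def encode_body(body, accept_encoding):
    """Return (encoded_body, headers), or None if nothing is supported."""
    name = choose_encoding(accept_encoding)
    if name is None:
        return None
    headers = {} if name == "identity" else {"Content-Encoding": name}
    return COMPRESSORS[name](body), headers
```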

Encryption (File encryption, not channel encryption like SSL)
I am looking at a couple of schemes, but for proof of concept most likely a simple shared key will be implemented. I am unsure whether to make it part of Accept-Encoding and Content-Encoding, or to have it separate as an entity header. Accept-Encoding and Content-Encoding seem plausible, e.g. "gzip/sharedkey", since knowing how to uncompress it is no help if it's encrypted and you don't know how to decrypt it. This will allow clients that support only "gzip" to throw the appropriate errors.

Since it is not part of the HTTP spec, obviously only the PyFileSyncClient will use this functionality.

415 Unsupported Media Type is returned if the server does not understand the Compression or Encryption requested. We do this rather than send the file unencrypted.
To support Compression and Encryption, the no-transform cache control directive would be included in the response.

MD5
The server includes it by default: it computes a 16 byte MD5 digest of the body sent and puts it in the Content-MD5 header. It is up to the client to check it.
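As a small sketch of both sides: Content-MD5 carries the 16 byte digest base64-encoded, so the helpers below encode and compare in that form. The function names are hypothetical.

```python
import base64
import hashlib


def content_md5(body):
    """Return the Content-MD5 value: base64 of the 16 byte MD5 digest."""
    digest = hashlib.md5(body).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_md5(body, header_value):
    """Client-side check: does the received body match the header?"""
    return content_md5(body) == header_value.strip()
```

The digest covers the body as sent, so a client would check it before undoing any compression or encryption.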


Pending Definition
File upload via PUT
Authentication (maintaining session)

1 comment:

Ian Bicking said...

I think authentication can be handled at a different level (e.g., WSGI middleware), and it will confuse things to handle it in the file server. Though the client has to be authentication-aware.