Saturday, July 30, 2005

Week 4 was kidnapped by the Martians

Hi,

Just dropping a message to the blog here (ping? pong!). I said I was going to spend less time on the project this week; it turns out I spent none :| Exams didn't go too well either.

In the coming week 5 I will be working on LOCKS. I hope I am still making good time on this project.

Saturday, July 23, 2005

Class 1 WebDAV Server

Failed to finish what I wanted to finish yesterday, thanks to some amazingly careless errors that neon-litmus deftly caught in COPY, MOVE, PROPFIND and PROPPATCH. Most of these were mistakes and places where I missed the spec.

Had some trouble with Unicode. I forgot that concatenating a high-unicode string encoded to UTF-8 with an ascii-encodable "normal" unicode string *changes* the whole thing back to unicode, and was surprised it was still throwing errors after I had encoded it. But it's working now.
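A minimal sketch of the pitfall and the pattern that fixed it (the variable names here are illustrative, not from the actual code; the silent coercion described is Python 2 behaviour):

```python
# -*- coding: utf-8 -*-
# In Python 2, concatenating a UTF-8-encoded byte string with an
# ascii-encodable unicode string silently coerced the result back to
# unicode (decoding the bytes as ASCII, which blew up on high-unicode
# data). The safe pattern: keep everything as text and encode exactly
# once, at the boundary.
name = u'pr\u00fcfung'            # high-unicode text
body = u'file: ' + name           # concatenate while still text
payload = body.encode('utf-8')    # encode once, at the very end
```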

Test Status
I will spend less time on the project for a week, thanks to summer course exams, but right now I have a Class 1 WebDAV server, lacking only locking (admittedly that will take a bit of work and affect all the other methods).

The code has been updated on SVN and CVS. The naming convention is in place, but the script to change to 4-space indentation is still on the TODO list.

-> running `basic':
0. init.................. pass
1. begin................. pass
2. options............... pass
3. put_get............... pass
4. put_get_utf8_segment.. pass
5. mkcol_over_plain...... pass
6. delete................ pass
7. delete_null........... pass
8. delete_fragment....... WARNING: DELETE removed collection resource with Request-URI including fragment; unsafe ...................... pass (with 1 warning)
9. mkcol................. pass
10. mkcol_again........... pass
11. delete_coll........... pass
12. mkcol_no_parent....... pass
13. mkcol_with_body....... pass
14. finish................ pass
<- summary for `basic': of 15 tests run: 15 passed, 0 failed. 100.0%
-> 1 warning was issued.
-> running `copymove':
0. init.................. pass
1. begin................. pass
2. copy_init............. pass
3. copy_simple........... pass
4. copy_overwrite........ pass
5. copy_nodestcoll....... pass
6. copy_cleanup.......... pass
7. copy_coll............. pass
8. move.................. pass
9. move_coll............. pass
10. move_cleanup.......... pass
11. finish................ pass
<- summary for `copymove': of 12 tests run: 12 passed, 0 failed. 100.0%
-> running `props':
0. init.................. pass
1. begin................. pass
2. propfind_invalid...... pass
3. propfind_invalid2..... pass
4. propfind_d0........... pass
5. propinit.............. pass
6. propset............... pass
7. propget............... pass
8. propextended.......... pass
9. propmove.............. pass
10. propget............... pass
11. propdeletes........... pass
12. propget............... pass
13. propreplace........... pass
14. propget............... pass
15. propnullns............ pass
16. propget............... pass
17. prophighunicode....... pass
18. propget............... pass
19. propvalnspace......... pass
20. propwformed........... pass
21. propinit.............. pass
22. propmanyns............ pass
23. propget............... pass
24. propcleanup........... pass
25. finish................ pass
<- summary for `props': of 26 tests run: 26 passed, 0 failed. 100.0%

There are still the lock and http test sets to go. I cannot skip lock and go straight to http, so I guess those will have to wait.

Friday, July 22, 2005

Galore of Errors

Yesterday I finished work on the MOVE and COPY commands and was ready to start testing the application as an integrated whole. Only the LOCK mechanisms are lacking, and I will implement those later, preferring to do a midway test first. The litmus test suite (http://www.webdav.org/neon/litmus/) was perfect for this; I installed and compiled it under cygwin.

Extending the server:

After several failures on the first test (MKCOL) and some telnetting of my own, I realised that paster was using wsgiServer, based off SimpleHTTPHandler, which filters out all requests except HEAD, GET, PUT. There was not much choice but to duplicate a paste server plug-in that allowed these methods. I copied code off wsgiServer and wsgiutils_server, and borrowed a pattern from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/307618 , which uses __getattr__ to let me specify a list of supported methods rather than define a separate method for each that does the same thing - pass the request to the application. It works now.
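The shape of the pattern (class and method names below are my own sketch, not the cookbook recipe's; in the real handler the generic method builds environ and calls the WSGI application):

```python
class WSGIHandlerSketch:
    # Methods the handler will accept; anything not listed falls through
    # to AttributeError and the server replies as unsupported, as before.
    SUPPORTED_METHODS = ('OPTIONS', 'HEAD', 'GET', 'PUT', 'DELETE',
                         'PROPFIND', 'PROPPATCH', 'MKCOL', 'COPY', 'MOVE')

    def _pass_to_application(self):
        # Stand-in for the one generic handler that forwards the
        # request to the WSGI application.
        return 'passed to application'

    def __getattr__(self, name):
        # BaseHTTPRequestHandler dispatches an incoming METHOD to
        # do_METHOD; map every supported do_* name to the generic handler
        # instead of defining ten identical methods.
        if name.startswith('do_') and name[3:] in self.SUPPORTED_METHODS:
            return self._pass_to_application
        raise AttributeError(name)
```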

Reading the request body

The next thing I found was that reading off environ['wsgi.input'] would end up blocking, since it is a stream that is still open. I will have to check the Content-Length header to get the number of bytes to read, but is there another option besides a blocking read, in case Content-Length is not given? My unfixed code is:

requestbody = ''
if 'wsgi.input' in environ:
    inputstream = environ['wsgi.input']
    readbuffer = inputstream.read(BUF_SIZE)
    while len(readbuffer) != 0:
        requestbody = requestbody + readbuffer
        readbuffer = inputstream.read(BUF_SIZE)


Catching Iterator Code Exceptions
Some of the exceptions thrown were not being caught by my error-processing middleware. It turned out that some of my sub-apps return an iterator object, which is then iterated outside the scope of my error processor. For example:


def A():
    for v in iter(B()):
        print v

def B():
    try:
        return C()
    except:
        return ['exception']

def C():
    raise Exception
    yield '1'
    yield '2'


From my old functional perspective, B() should be able to catch the exception in C(), since the call to C is within its scope. In reality, when B is executed, it returns C() immediately as a generator/iterator object without ever running C's body, then exits its scope. When A tries to consume the returned iterator, C() runs and throws the exception in A, which thought it was protected. So I went to look at how paste.errormiddleware catches iterator exceptions and implemented mine accordingly.
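The shape of the fix (my own sketch of the idea; paste.errormiddleware's actual implementation differs): wrap the returned iterable so the for-loop itself runs inside the try.

```python
def catching_iter(app_iter, error_chunk='exception'):
    # The try now surrounds the *iteration*, not just the call that
    # built the iterator, so exceptions raised lazily are caught too.
    try:
        for chunk in app_iter:
            yield chunk
    except Exception:
        yield error_chunk

def C():
    raise Exception('raised lazily, on first iteration')
    yield '1'

# Consuming the wrapped iterator catches what B() above could not:
result = list(catching_iter(C()))
```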


Current Test Status
-> running `basic':
0. init.................. pass
1. begin................. pass
2. options............... pass
3. put_get............... pass
4. put_get_utf8_segment.. pass
5. mkcol_over_plain...... pass
6. delete................ pass
7. delete_null........... pass
8. delete_fragment....... WARNING: DELETE removed collection resource with Request-URI including fragment; unsafe...................... pass (with 1 warning)
9. mkcol................. pass
10. mkcol_again........... pass
11. delete_coll........... pass
12. mkcol_no_parent....... pass
13. mkcol_with_body....... pass
14. finish................ pass
<- summary for `basic': of 15 tests run: 15 passed, 0 failed. 100.0%
-> 1 warning was issued.

This is the basic test set, following which are sets for copymove, props, locks and http. As of this moment I am still sorting out the copymove errors identified.

Wednesday, July 20, 2005

Update

Refactoring was the order of the day, including getting rid of some out-of-the-way customizations like POSTing to get directory listings. Essentially, I moved things around and cleaned up.

PROPPATCH is also working, even if the property value is specified in XML substructures.

Added support for the If header in webDAV. Improved support for the HTTP Range header to consolidate multiple overlapping ranges into a single range - left over from the TODO list.
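The consolidation step is simple enough to sketch (my own version, not the code in the repository; (start, end) pairs with inclusive ends):

```python
def consolidate_ranges(ranges):
    # Merge overlapping or adjacent byte ranges, so e.g. 0-99 and
    # 50-150 become the single range 0-150.
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```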

Question: there is general confusion over whether the webDAV If header should apply to PROPFIND, whether the HTTP If-Match/If-Modified-Since conditionals should apply to PROPFIND, and how the If header and the HTTP conditionals would apply to the webDAV methods when multiple files are involved.
I have posted to the W3C Dist Authoring working group to ask for clarifications.

This week code will flow slightly slower, since the summer course is holding exams (and I am cramming for them)

Tuesday, July 19, 2005

Third Week Starts

Property Library
Spent some time yesterday looking at shelve and related modules for a persistent store of dead properties and locks. I put together a property library using shelve, but had some problems placing the initialization and saving of the library to the persistent store, as the server tends to initialize the application multiple times and has no fixed deallocator (atexit does not work, since os._exit() is used for restarting).

The current implementation is a library that initializes on first request (using threading locks to protect against concurrent initialization) and syncs after every write. The file lock should be released automatically when the process terminates.


XML, PROPFIND
Installed PyXML and started work on PROPFIND, which is working. Will link it to the main application once PROPPATCH is also working. Depth support is present.

CVS / SVN
Both repositories are up to date. I modified the top-level directory name of the CVS repository to refresh it. "thirdparty" should probably be removed; it contains code that was referenced but not actually included.

Saturday, July 16, 2005

Weekend again!

Week 2 is over and coding is progressing. Hope I am going along at a reasonable speed at least.

Thanks to my mentor who pointed out that webDAV servers do support Basic. The spec says webDAV servers should support Digest, but I missed the fact that they can also support Basic (especially over SSL). I have done both basic and digest; right now it is a standalone module (httpauthentication.py), but digest would be a great addition to paste.login once I figure out how to fit it in with the cookies portion.

Application configuration is currently done in a Python-syntax file, PyFileServer.conf - similar to server.conf - and I use paste.pyconfig to read it.

GZip is still unsupported at the content-encoding level.

There were some changes to the case of names to match PEP-8. Indentation has to be changed too, but I will write a script to do that later.

Starting next week I will look at some sort of server session storage, for both locking support and dead properties.



Code repositories

There were some problems with the code repositories after I tried to change the case of some file names. CVS doesn't handle this very well, and still does not accept the fact that I've renamed RequestServer.py to requestserver.py (even after delete-commit-update-add-commit-update). I've requested a complete wipe of CVS so that I can check everything in again from scratch.
Update: I just got an idea. If the CVS problem doesn't get dealt with by BerliOS soon, I'll just rename one of the top-level directories to something else, which should count as a total re-add of my code files. Either way, I lose all that versioning information.

SVN - well, anonymous checkout isn't working for some reason:
$ svn checkout svn://svn.berlios.de/pyfilesync
svn: Can't open file '/svnroot/repos/pyfilesync/format': Permission denied

Thursday, July 14, 2005

Digest Authentication

Progress

Worked on digest authentication, and now have it working with both Firefox and IE (which never returns the opaque). I wanted to use paste.login, but webDAV specifically indicates that digest, not basic, authentication is to be used.

It was only belatedly that I saw the digest authentication implementation in the Python Cookbook. Well, I guess it helps to have my code tested alongside a known working copy. :|
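For reference, the core of any digest implementation is the response hash. A minimal sketch of the no-qop (RFC 2069 style) form, which is the part that all clients share:

```python
import hashlib

def digest_response(username, realm, password, method, uri, nonce):
    # response = MD5( MD5(user:realm:pass) : nonce : MD5(method:uri) )
    def md5hex(text):
        return hashlib.md5(text.encode('utf-8')).hexdigest()
    ha1 = md5hex('%s:%s:%s' % (username, realm, password))
    ha2 = md5hex('%s:%s' % (method, uri))
    return md5hex('%s:%s:%s' % (ha1, nonce, ha2))
```

The qop extension adds the nonce count and client nonce into the final hash, which is where browser differences (like IE's missing opaque) start to bite.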

Musings
If I separated my application into several layers of middleware, could I use the environ dictionary to pass variables or information between them? Like adding a dictionary under the key "pyfileserver.config" and putting custom configuration information in there. Apparently I can, but is it good practice/the-way-to-do-it?
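What I have in mind, roughly (the key is the one mentioned above; everything else here is a sketch):

```python
def config_middleware(application, config):
    # Stash shared configuration in environ under a dotted key, the same
    # way the wsgi.* variables travel; inner layers just read it out.
    def wrapped(environ, start_response):
        environ['pyfileserver.config'] = config
        return application(environ, start_response)
    return wrapped

def inner_app(environ, start_response):
    conf = environ['pyfileserver.config']
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [conf['greeting']]

app = config_middleware(inner_app, {'greeting': 'hello'})
```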

Wednesday, July 13, 2005

Almost done with HTTP

PyFileServer

Completed:
+ HTTP functions: OPTIONS, HEAD, GET, POST(modded GET), PUT, DELETE, TRACE (not supported)
+ Partial Ranges for GET (single range only - which is most servers and browsers for download resume)
+ Conditional GETs, PUTs and DELETEs

Working on:
+ GZip support: I realized that the GzipMiddleware did not work well with partial ranges, since Content-Range applies to the compressed representation. Back to the drawing board with regard to GZip support. I also need to know the compressed file size in advance for Range headers, while trying to avoid any sort of buffering of the whole file (compressed or not).

+ Authentication - starting on this with WebDAV, since WebDAV requires authentication - looking it up.

On the shelf:
+ MD5, encryption. Partial and Conditional support for PUT. Content-Encodings for PUT

CVS and SVN repositories updated

Tuesday, July 12, 2005

Project: HTTP Status

PyFileServer
Completed:
+ HTTP functions: OPTIONS,HEAD,GET,POST(modified GET),PUT

Working On:
+ DELETE, TRACE
+ Partial Ranges for GET
+ Conditionals: If-Modified-Since, If-Unmodified-Since, If-Range, If-Match, etc.

On the shelf:
+ MD5, encryption
+ Partial Ranges for PUT (currently returns 501 Not Implemented)
+ Conditionals for PUT

To Clarify:
+ whether TRACE is implemented by the web server rather than the application
+ whether Content-* headers are applicable to the entity UPLOADED by PUT; that is, whether a PUT entity could be under Content-Encoding: gzip, and how the application should react.

Project: Week 2 Commences

Project Organization
The BerliOS account finally got through! I tried to set up my account all Monday, but apparently their server went down. They just set up an emergency one and it's running fine.

CVS was a piece of cake: install TortoiseCVS, put in the SSH magic bits, and the code was up in a few seconds: http://sourceforge.net/docman/display_doc.php?group_id=1&docid=25888#top
SVN took more trouble. http://tortoisesvn.sourceforge.net/?q=node/5 is a guide, but I had to play around with bits and pieces before everything fell into place.

I will probably stick with SVN and delete the CVS repository (when I get around to it) or update both. I work off a development local copy anyway.
For anonymous access to the project: svn checkout svn://svn.berlios.de/pyfilesync

The web page for the project hasn't been started. It would just be a description and a link to this site.


Code
A bit slow. There were a couple of false starts, like time wasted figuring out the mechanics of Python (returning a function callable vs the result of the call, defining the application as a function vs as an object), but it's all sorted out now. Finishing off the HTTP spec and starting on the WebDAV spec this week.

Some things, like the MD5 digest and encryption, have been postponed in favor of the WebDAV implementation. MD5 is a dilemma: because the header goes before the content, calculating and sending the MD5 means either compressing and reading the file twice (once just to get the MD5, once to send it), caching/buffering the entire contents in memory, or some sort of temporary file buffering.

Friday, July 08, 2005

Project: WebDAV more

I was looking for an example on gzip support for web applications, and came across this page:
http://www.saddi.com/software/py-lib/

Middleware

Maybe I should have browsed through wsgi sample code right at the start, but reading gzipMiddleWare made something click in how WSGI and middleware actually work. I guess in Java I had been dealing with pre-filters and post-processors to HTTP requests as separate entities; it did not occur to me that middleware really does both.

It also made me take another look at the Python tutorial on generators and iterators as they are used (I am new to them). A WSGI application returns either a generator (yields) or an iterable object or sequence of strings. If the application returns a single string (like I did initially), it gets treated inefficiently as an iterable (sequence) of characters.
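The difference is easy to see at the interpreter (the helper below just materialises whatever the server would iterate over):

```python
def chunks(result):
    """Materialise the chunks a WSGI server would write out."""
    return list(iter(result))

single_string = chunks('hello')     # iterated character by character
wrapped_list = chunks(['hello'])    # one chunk, one write
```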


GZipping: Transfer or Content Encoding

The middleware for gzipping got me confused slightly, since the PEP333 gives:
"(Note: applications and middleware must not apply any kind of Transfer-Encoding to their output, such as chunking or gzipping; as "hop-by-hop" operations, these encodings are the province of the actual web server/gateway. See Other HTTP Features below, for more details.)".

Transfer-Encoding alludes only to "chunking". "Gzipping" is under Content-Encoding and could be supported by the application? Also, the Content-MD5 digest is produced from the gzipped content rather than the original content if gzip content-encoding is in effect, and it ignores Transfer-Encoding. Hope I got it right.


In which case, if gzip is in effect, buffering of the sent content may be required to compute the MD5 and Content-Length of the compressed data before sending. This may adversely affect performance, given the size of the files that might be sent.



Project

On the project, I read more of the WebDAV specification, and it builds nicely on the existing PyFileServer implementation planned. Locking and supporting arbitrary dead properties may require a database.

There are a few other python DAV servers mentioned on http://www.webdav.org/projects/, but they are older (since 2000). In any case a WSGI WebDAV server would be great :)

Thursday, July 07, 2005

Project: WebDAV

After initial discussion with my mentor:
* While so far I have been working on the HTTP protocol, perhaps WebDAV is a better system to base the project on. It's a superset of HTTP but contains more extensions specific to the management of files on web servers. Taking some time to read the WebDAV specification.
* Looking at the scope introduced, concentration on the server component is probably best. There would be a skeletal client for testing, in conjunction with any WebDAV test suite used.
* Full-time effort is definitely demanded. I tend to work in burst mode myself - getting a lot of production code done (from my perspective) in a few days, then falling back to code experimentation for a few days - but hopefully that averages out. Starting might be slow, since the project runs concurrently with a summer course until the end of July.
* Writing tests even while coding.

WebDAV test suite:
* litmus: http://www.webdav.org/neon/litmus

Other WebDAV resources:
* akaDAV (in development) : http://akadav.sourceforge.net/
* http://cvs.sourceforge.net/viewcvs.py/webware-sandbox/Sandbox/ianbicking/DAVKit

Project: PyFileServer

More definition on the PyFileServer, as it gets developed.

The PyFileServer application maintains a number of mappings between local file directories and "virtual" roots. For example, if there was such a mapping:
C:\SoC\Dev\WSGI -> /wsgi
and the server application is at http://localhost:8080/pyfileserver
=> a request to GET http://localhost:8080/pyfileserver/wsgi/WSGIUtils-0.5/License/LICENSE.txt would return the file C:\SoC\Dev\WSGI\WSGIUtils-0.5\License\LICENSE.txt

This means the shared file repository does not interface directly with the web server but through the application, which is what we want.
* Relative paths are supported and resolved to a normal canonical path, but you cannot use relative segments to escape the base path C:\SoC\Dev\WSGI
* Case sensitivity depends on the server OS.
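A sketch of the translation with the escape check (hypothetical helper; posix-style paths used so the example is portable):

```python
import posixpath

def translate_path(share_root, url_path):
    # Resolve '.' and '..' segments, then refuse anything that would
    # still climb out of the share root after normalization.
    rel = posixpath.normpath(url_path.lstrip('/'))
    if rel in ('.', ''):
        return share_root
    if rel == '..' or rel.startswith('../'):
        raise ValueError('path escapes share root')
    return posixpath.join(share_root, rel)
```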

Browsing Directories
When the request GETs or POSTs to a directory url, like http://localhost:8080/pyfileserver/wsgi/WSGIUtils-0.5/ , the response pretty-prints an index of the items in the directory (name, type, size, date last modified).

If POST is used and the querystring contains "listitems", the information is returned in a tab-delimited format instead. If the querystring contains "listtree", all items in that directory tree (not just the directory itself) are listed in tab-delimited format. This is to support the PyFileSyncClient - normal web-browser usage will GET the html prettyprint.


Downloading Files
When the request GETs or POSTs to a file url, like http://localhost:8080/pyfileserver/wsgi/WSGIUtils-0.5/License/LICENSE.txt, the response returns the file. The following features are planned and mostly come from the HTTP 1.1 specification:


FILE REQUEST --> [Conditional Processing] --> [Multipart] --> [Compression] --> [Encryption] --> [MD5] --> Channel Transfer


Conditional Processing
This refers to the conditional headers If-Modified-Since, If-Unmodified-Since, etc. They will be supported eventually.


Multiple Partial Content
This is to support multiple partial downloads - the client requests different partial content ranges over multiple requests. The feature is activated by the presence of the Range request header, or by a yet-to-be-defined equivalent in the POST querystring. The server accepts only one Range (this may change if I figure out that multipart thingy); multiple or invalid ranges will yield a response of 416 Range Not Satisfiable with a Content-Range header of '*'. Otherwise the data is returned as 206 Partial Content with Content-Range set accordingly.

Sending partial content always forces the Content-Type to application/octet-stream. Assembly of the file is up to the client. Note that the byte range specified may not match Content-Length, since it is the byte range requested BEFORE encryption, compression and other messy stream modifiers.

*Design Note* It might be better to do partial content after compression and encryption, but it means I would have to compress and encrypt the whole file to potentially just send one partial piece of it, and that the client would need to receive all pieces to successfully uncompress and decrypt -> not desirable.
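The single-range handling described above can be sketched like this (hypothetical helper; the suffix-range form is included for completeness):

```python
def parse_range(range_header, filesize):
    """Return (start, end) for a single satisfiable 'bytes=' range,
    or None, in which case the caller answers 416 with Content-Range '*'."""
    if not range_header.startswith('bytes='):
        return None
    spec = range_header[len('bytes='):]
    if ',' in spec:                  # multiple ranges: not accepted here
        return None
    start_s, sep, end_s = spec.partition('-')
    if not sep:
        return None
    try:
        if start_s == '':            # suffix form: last N bytes
            length = int(end_s)
            if length <= 0:
                return None
            return (max(filesize - length, 0), filesize - 1)
        start = int(start_s)
        end = int(end_s) if end_s else filesize - 1
    except ValueError:
        return None
    end = min(end, filesize - 1)
    if start > end:
        return None
    return (start, end)
```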


Compression
The HTTP protocol specifies identity (none), gzip, deflate (zlib) or compress. I plan to implement only "gzip" for the initial version, but will make it pluggable so that someone can easily implement some other compression (lossy at your own risk). The feature is activated by the corresponding compression identifier in "Accept-Encoding" or an equivalent activator in the POST querystring.

Encryption (File encryption, not channel encryption like SSL)
Looking at a couple of schemes, but for proof of concept a simple shared key will most likely be implemented. I am unsure whether to make it part of Accept-Encoding and Content-Encoding, or have it as a separate entity header. Accept-Encoding and Content-Encoding seem plausible, e.g. "gzip/sharedkey" - knowing how to decompress it is no help if it's encrypted and you don't know how to decrypt it. This will allow clients that support only "gzip" to throw the appropriate errors.

Since it is not part of the HTTP spec, obviously only the PyFileSyncClient will use this functionality.

415 Unsupported Media Type is returned if the server does not understand the compression or encryption requested. We do this rather than send the file unencrypted.
To support compression and encryption, the no-transform cache-control directive would be included in the response.

MD5
The server includes it by default: it computes the 16-byte MD5 digest of the body sent and puts it in the Content-MD5 header. It is up to the client to check it.
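The computation itself is small (note that RFC 1864 specifies the header carries the base64 encoding of the raw 16-byte digest, not the hex form):

```python
import base64
import hashlib

def content_md5(body):
    # RFC 1864: Content-MD5 = base64 of the raw 16-byte MD5 digest
    # of the entity body being sent.
    return base64.b64encode(hashlib.md5(body).digest()).decode('ascii')
```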


Pending Definition
File upload via PUT
Authentication (maintaining session)

Project: Architecture

Been thinking for some time to lay out the architecture of the project.

I have taken to calling the project "PyFileSync" unless a better name comes along or my mentor disapproves of it. It would consist of two components (for the moment):

PyFileServer:
This is the server component that would perform the file-serving/uploading operations

The server acts as a generic file server with the additional features of authentication, transfer verification, compression, file encryption, multipart sending.

PyFileServer is being developed as a WSGI application running over "paster", the default development webserver that comes with Paste.


PyFileSyncClient:
This is the client component that will communicate with the server to perform file-uploading/downloading operations.

The client is responsible for the higher level management that results in the sync'ing of a file repository between the server and the client.

Technical: Initial Contact with WSGI and Paste

Checked out Paste and downloaded WSGIUtils today. Installed svn to check out 3rd party packages. Ran through some of the articles on http://pythonpaste.org.

There was quite a bit of confusion as to what was going on, and the TODO tutorial did not help much. Admittedly, after my experience with Java web frameworks, I had hoped to breeze through this initial learning curve. Getting the tutorial (WebKit and ZPT) running was quick and easy, but there was no clear flow of control to work from.

With Struts, you knew how things tied together: how the different components of the server were called to process an incoming request, where you redirect control to Actions/Servlets, the order in which things were done. I think a large part of this plumbing is hidden in __init__.py, SitePage.py and paster, so the user does not see the connections to index.py/templates.

What I did figure out was to override application(environ, start_response) for the URLParser to hijack control of the request and process it. This ties in neatly with PyFileServer's need to bypass urlparser, since the URL path requested does not go directly to a webserver-shared script or resource but is translated to a directory file path in the file repository outside of the web application.

Still, I would need to learn that flow of control for future purposes. Any recommendations?

First Post

This is my first post for this blog, which is currently set up for my Google Summer of Code 2005 project.

The project is mentored by Ian Bicking and is under the Python Software Foundation. It aims to set up a Python-based Data Serving/Collection Framework.

More information on the project will appear first on this blog as I go along. It would be consolidated at http://cwho.pbwiki.com/ -> once in a while *grinz*

A project website and repository is being set up on BerliOS: http://developer.berlios.de/

I struggled with Blogger's fixed width templates for a while and settled for a standard 1024px width. Apologies in advance to folks with non-standard screen sizes, like my widescreen laptop.