buzhug was implemented with one main objective: to make all operations, and especially selections, as fast as possible. When a choice had to be made between different options, I took the one that optimized speed, even if this meant more files to manage and more space used on disk.

This is why a base is not kept in a single file, but in a directory with many files. There are two sorts of files: the field files and the internal files.

The field files

In the base definition, a field is defined by its name and its type. A field file is created for each of the fields, and two other field files are added by buzhug: __id__ and __version__.

The field files store information in a different way depending on the type of the field. Field types can be split into two categories: fixed-length types and variable-length types.

For fixed-length types, the values are stored as strings using a conversion function. Because of the way the select() method works, it is essential that this conversion function preserves the ordering of values. That is, given two values v1 and v2 and a conversion function conv(), you must have:

if v1 > v2 then conv(v1) > conv(v2)
and more generally,
cmp(v1,v2) == cmp(conv(v1),conv(v2))

The conversion functions can be found in the methods to_block() and from_block() of the classes in the module buzhug_classes
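
As an illustration of an order-preserving conversion, here is a minimal sketch for signed 32-bit integers. It is not the code of to_block() and from_block(): the names conv, unconv and OFFSET are hypothetical, and the 4-byte big-endian layout is an assumption. It is written for Python 3, where the cmp() builtin no longer exists, so the ordering is checked with sorted() instead.

```python
import struct

# Assumption: values fit in a signed 32-bit integer.
OFFSET = 2 ** 31  # shifts the signed range onto the unsigned range


def conv(value):
    """Encode an integer as 4 big-endian bytes that sort like the integer."""
    return struct.pack(">I", value + OFFSET)


def unconv(block):
    """Inverse of conv()."""
    return struct.unpack(">I", block)[0] - OFFSET


values = [-5, 0, 3, 1000, -70000]

# Sorting the encoded blocks gives the same order as sorting the values
encoded_sorted = sorted(conv(v) for v in values)
decoded = [unconv(block) for block in encoded_sorted]
```

Because the blocks are big-endian and all the same length, comparing them byte by byte gives the same result as comparing the integers, which is the property required above.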

For variable-length types, the value is converted to a string without any line break, so that it can be stored on one line in the file. Python's loop for line in _file: is extremely fast; I found that browsing a file of date or datetime values with this loop is much faster than reading blocks of fixed-length data with read(), which is why these types are stored as variable-length types.
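
The line-based approach can be sketched as follows. This is only an illustration under stated assumptions: the ISO date format, the helper names to_line and from_line, and the in-memory file are all hypothetical, and the flag that buzhug puts at the start of each block is left out for brevity.

```python
import io
from datetime import date


def to_line(d):
    """Convert a date to a single line; ISO strings sort like the dates."""
    return d.isoformat().encode("ascii") + b"\n"


def from_line(line):
    """Inverse of to_line()."""
    return date.fromisoformat(line.rstrip(b"\n").decode("ascii"))


dates = [date(2005, 3, 1), date(1999, 12, 31), date(2010, 7, 14)]

# An in-memory stand-in for a field file, one value per line
field_file = io.BytesIO(b"".join(to_line(d) for d in dates))

# Browse the file line by line and keep the dates after 2000-01-01.
# The comparison is done on the stored lines, without decoding them first.
limit = to_line(date(2000, 1, 1))
selected = [from_line(line) for line in field_file if line > limit]
```

Only the lines that pass the test are converted back to date objects, which is one reason an order-preserving string format is convenient for selections.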

For all types, the "blocks" that store a value begin with a flag which can take 3 different values :

Internal files

The files defined for internal use are :

Deleting a record

When a record is deleted, its entries in the field files and in the position file are marked as deleted by setting the flag to #. For instance, if the record for 'jean' is deleted, here is what the field files and the position file will look like:

Field file 'name'

pos   value
0     -pierre\n
8     -claire\n
16    -simon\n
23    -camille\n
32    #jean\n
38    -florence\n
48    -marie-anne\n

Field file '__id__'

pos   value (hex)
0     2D 40 00 00 00
5     2D 40 00 00 01
10    2D 40 00 00 02
15    2D 40 00 00 03
20    23 40 00 00 04
25    2D 40 00 00 05
30    2D 40 00 00 06

Field file '__version__'

pos   value (hex)
0     2D 40 00 00 00
5     2D 40 00 00 00
10    2D 40 00 00 00
15    2D 40 00 00 00
20    23 40 00 00 00
25    2D 40 00 00 00
30    2D 40 00 00 00

Position file

pos   flag   pos __id__    pos __version__   pos name
0     2D     00 00 00 00   00 00 00 00       00 00 00 00
13    2D     00 00 00 05   00 00 00 05       00 00 00 08
26    2D     00 00 00 0A   00 00 00 0A       00 00 00 10
39    2D     00 00 00 0F   00 00 00 0F       00 00 00 17
52    23     00 00 00 14   00 00 00 14       00 00 00 20
65    2D     00 00 00 19   00 00 00 19       00 00 00 28
78    2D     00 00 00 1E   00 00 00 1E       00 00 00 30
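
To make the layout of the position file more concrete, here is a minimal sketch that decodes the row of the deleted record shown above. It assumes that each row is one flag byte followed by three 4-byte big-endian offsets, which is what the hex values suggest; the name ROW and the use of the struct module are illustrative, not necessarily what buzhug does internally.

```python
import struct

# Assumed row layout: 1 flag byte + three big-endian 4-byte offsets = 13 bytes
ROW = struct.Struct(">c3I")

# Row at position 52 in the position file above (record for 'jean')
row = bytes.fromhex("23 00 00 00 14 00 00 00 14 00 00 00 20")

flag, id_pos, version_pos, name_pos = ROW.unpack(row)

is_deleted = (flag == b"#")
# id_pos and version_pos are offsets into the __id__ and __version__ files,
# name_pos is the offset of the value in the 'name' field file
```

With this layout the rows are 13 bytes long, which matches the positions 0, 13, 26, and so on in the first column.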

The change is minimal (and therefore very fast): the flag '-' (hex 2D) is replaced by '#' (hex 23) in 4 files.
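
The one-byte change can be sketched like this on the 'name' field file. The function name mark_deleted and the in-memory file are hypothetical; the point is only that a single byte is overwritten in place and nothing else moves.

```python
import io

# In-memory stand-in for the 'name' field file before the deletion
name_file = io.BytesIO(
    b"-pierre\n-claire\n-simon\n-camille\n-jean\n-florence\n-marie-anne\n"
)
size_before = len(name_file.getvalue())


def mark_deleted(fileobj, pos):
    """Overwrite the flag byte at the given position with '#'."""
    fileobj.seek(pos)
    fileobj.write(b"#")


# 32 is the position of the record for 'jean' in the field file
mark_deleted(name_file, 32)

content = name_file.getvalue()
```

The same operation would be repeated at the matching positions in the other field files and in the position file.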

The problem is that the now useless information about 'jean' remains physically on disk; when many records have been deleted, this can become a problem for both storage space and the speed of selections.

Cleanup

So from time to time the method cleanup() should be applied. It removes the useless records from the field files and updates the position file to the new positions in the field files

After cleanup, the files will be :

Field file 'name'

pos   value
0     -pierre\n
8     -claire\n
16    -simon\n
23    -camille\n
32    -florence\n
42    -marie-anne\n

Field file '__id__'

pos   value (hex)
0     2D 40 00 00 00
5     2D 40 00 00 01
10    2D 40 00 00 02
15    2D 40 00 00 03
20    2D 40 00 00 05
25    2D 40 00 00 06

Field file '__version__'

pos   value (hex)
0     2D 40 00 00 00
5     2D 40 00 00 00
10    2D 40 00 00 00
15    2D 40 00 00 00
20    2D 40 00 00 00
25    2D 40 00 00 00

Position file

pos   flag   pos __id__    pos __version__   pos name
0     2D     00 00 00 00   00 00 00 00       00 00 00 00
13    2D     00 00 00 05   00 00 00 05       00 00 00 08
26    2D     00 00 00 0A   00 00 00 0A       00 00 00 10
39    2D     00 00 00 0F   00 00 00 0F       00 00 00 17
52    23     00 00 00 14   00 00 00 14       00 00 00 20
65    2D     00 00 00 14   00 00 00 14       00 00 00 20
78    2D     00 00 00 19   00 00 00 19       00 00 00 2A

Note that the row for the deleted record is still in the position file; it takes up space needlessly, but this is not a real problem because the row will be reused if another record is inserted. Also note that the file _id_pos is not modified by cleanup(), because the row for a given record stays at the same place in the position file.
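
The effect of cleanup on a line-based field file can be sketched as follows. This is not the code of cleanup(): the function name compact and the returned mapping are hypothetical, and the sketch only shows how the new positions written to the position file could be derived.

```python
def compact(data):
    """Drop the lines flagged '#' and map old positions to new positions."""
    kept = []
    moved = {}
    old_pos = new_pos = 0
    for line in data.splitlines(keepends=True):
        if not line.startswith(b"#"):
            moved[old_pos] = new_pos  # where this record now starts
            kept.append(line)
            new_pos += len(line)
        old_pos += len(line)
    return b"".join(kept), moved


# The 'name' field file after the record for 'jean' was deleted
before = b"-pierre\n-claire\n-simon\n-camille\n#jean\n-florence\n-marie-anne\n"

after, moved = compact(before)
```

The records stored before the deleted one keep their positions, while 'florence' moves from 38 to 32 and 'marie-anne' from 48 to 42, as in the tables above.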

Updating

When a record is updated, one of these two cases may occur :

The first case is the easiest to manage: buzhug finds the position in the field files and modifies the value at the same place; there is no need to update the position file.

In the second case, the update takes place in 3 steps :

In all cases, the field __version__ is modified (incremented by 1). This field is used to check that, at the time when update() is called, the record stored in the base has not been modified since it was selected (which may happen if the access to the table is shared between many simultaneous users)
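
The version check can be illustrated with a minimal sketch. The table is modelled here as a plain dictionary, and the names update, ConflictError and selected_version are hypothetical; buzhug's own update() works on the files described above and may report the conflict differently.

```python
class ConflictError(Exception):
    """Raised when the stored record changed after it was selected."""


def update(table, rec_id, selected_version, **new_values):
    """Apply new_values only if the record was not modified since selection."""
    stored = table[rec_id]
    if stored["__version__"] != selected_version:
        raise ConflictError("record %s was modified by another user" % rec_id)
    stored.update(new_values)
    stored["__version__"] += 1  # incremented by 1 on every update


# Hypothetical in-memory table with one record
table = {4: {"name": "jean", "__version__": 0}}

# First user selected version 0 and updates: accepted
update(table, 4, 0, name="jean-paul")

# Second user also selected version 0, which is now stale: rejected
try:
    update(table, 4, 0, name="jeanne")
    conflict = False
except ConflictError:
    conflict = True
```

The second update is refused because the stored version is already 1, so the first user's change is not silently overwritten.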