One of the main concepts of PyCVF is the dataset. A dataset is simply a collection of elements of the same type that have addresses and to which we may associate metadata.
To visualize a dataset, simply use the -d option:
pycvf -d "image.caltech256()"
pycvf -d "video.keyframes_of(video.demo())"
pycvf -d "vectors.clustered_points()"
Metadata are called "labeling" in PyCVF and are very important for further processing. For instance, metadata are used in supervised learning, but also to associate already computed features with data, or to attach tags to an image.
To implement a dataset, the recommended way to proceed is as follows:
In the “datasets” directory of your package, create a new file named after your dataset, and start with the following template:
from pycvf.core import dataset
from pycvf.datatypes import generic

class Dataset(dataset.Dataset):
    def datatype(self):
        return generic.Datatype

    def keys(self):
        return [1, 2, 3, 4]  ## specify a list of keys

    def __getitem__(self, key):
        return open("%d.txt" % (key,)).read()  ## fetch the data from your dataset

    #...
    #...
    #...

    def labeling_groundtruth(self):
        class Label:
            def datatype(lself):
                return generic.Datatype

            def __getitem__(lself, key):
                return open("%d.txt" % (key,)).read()  ## fetch the label associated with this key

        return Label()
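To see how the pieces of the template relate to each other without installing PyCVF, the following minimal sketch mimics the same interface with a plain Python class. `InMemoryDataset` and its contents are hypothetical, and the class deliberately does not inherit from `pycvf.core.dataset.Dataset`. It only illustrates that `keys` enumerates addresses, `__getitem__` returns the element at an address, and `labeling_groundtruth` returns an object indexed by the same keys.

```python
class InMemoryDataset(object):
    """Hypothetical stand-in for a dataset; not part of PyCVF."""

    def __init__(self, elements, labels):
        self._elements = dict(elements)  # key -> element
        self._labels = dict(labels)      # key -> metadata for that element

    def keys(self):
        # The addresses of the elements, in a stable order
        return sorted(self._elements)

    def __getitem__(self, key):
        # Fetch the element stored at this address
        return self._elements[key]

    def labeling_groundtruth(self):
        labels = self._labels

        class Label(object):
            def __getitem__(lself, key):
                # Labels are addressed by the same keys as the elements
                return labels[key]

        return Label()


ds = InMemoryDataset({1: "first text", 2: "second text"}, {1: "cat", 2: "dog"})
groundtruth = ds.labeling_groundtruth()
pairs = [(key, ds[key], groundtruth[key]) for key in ds.keys()]
```

Each entry of `pairs` holds a key, the element at that key, and the label attached to it, which is the shape of data a supervised learning step would consume.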
When working with very large datasets:
First of all, to save memory, it is recommended to use a generator for enumerating the keys:
def keys(self):
    # On Python 2, use xrange so that the ranges themselves are not built as lists
    for i in range(10**9):
        for j in range(10**9):
            yield i, j
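The benefit of a generator is that a consumer can take only as many keys as it needs. The sketch below uses a free function `keys` (rather than a dataset method, for brevity) and the standard `itertools.islice` to read the first three keys. It assumes Python 3, where `range` is lazy.

```python
import itertools


def keys():
    # Nothing is stored: each (i, j) pair is produced on demand
    for i in range(10**9):
        for j in range(10**9):
            yield i, j


# Take the first three keys without enumerating the remaining ones
first_keys = list(itertools.islice(keys(), 3))
```

Calling `list(keys())` directly would try to materialize all 10**18 pairs, so any code that consumes such a dataset has to iterate rather than collect.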
If the keys cannot fit in memory, then you should forget about using convenient keywords such as “randomized”, “traindb”, “chunked”. This is a limitation of the current implementation. We will address these issues as the need arises.
If an exact count of the elements in the dataset is not available, or is too expensive to compute, you are invited to implement an upper_bound method that provides a reasonable guess of an upper bound. It may be used in some cases to allocate structures for storing results.
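As an illustration of the idea, here is a minimal sketch of a dataset whose exact size requires a full pass over the keys, while a safe over-estimate is cheap. The class `GridDataset` is hypothetical, and the signature of `upper_bound` (no arguments, returns an integer) is an assumption, since the text above does not specify it.

```python
class GridDataset(object):
    """Hypothetical dataset keeping only the (i, j) cells where i + j is even."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def keys(self):
        for i in range(self.width):
            for j in range(self.height):
                if (i + j) % 2 == 0:
                    yield i, j

    def upper_bound(self):
        # Assumed signature. Counts every cell of the grid without running
        # the filter, so it can only over-estimate the number of keys.
        return self.width * self.height


ds = GridDataset(4, 3)
exact_count = sum(1 for _ in ds.keys())  # requires a full pass over the keys
bound = ds.upper_bound()                 # constant-time estimate
```

A structure allocated with `bound` slots is guaranteed to be large enough here, at the cost of some unused space.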
All datasets must inherit from pycvf.core.dataset.Dataset.
This class implements a minimal set of functions for a database, so that database implementations conform to the standard.
For full details about the database interface, please have a look at pycvf.database.SPECS.
Basically, the intuition behind a database is that it is a typed collection of objects, to which some metadata may be attached by means of the “labeling” methods.
Tries to compute a meaningful hash for this dataset instance.
The hash values are based on the class name and the module name.
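One way to derive a hash from the class name and the module name is sketched below. This is not PyCVF's actual implementation, which is not shown here. The function `class_based_hash` is hypothetical, and the use of MD5 from the standard `hashlib` module is an assumption chosen only so that the value is stable across runs.

```python
import hashlib


def class_based_hash(instance):
    """Hypothetical helper: hash an object from its module and class names."""
    cls = type(instance)
    identity = "%s.%s" % (cls.__module__, cls.__name__)
    return hashlib.md5(identity.encode("utf-8")).hexdigest()


class DatasetA(object):
    pass


class DatasetB(object):
    pass


hash_a1 = class_based_hash(DatasetA())
hash_a2 = class_based_hash(DatasetA())
hash_b = class_based_hash(DatasetB())
```

With this scheme, two instances of the same class always receive the same hash, even if they were constructed with different parameters, while instances of different classes are distinguished.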
This is a reference database object specification.
Work in progress...