Datasets

Overview

One of the main concepts of PyCVF is the dataset. A dataset is simply a collection of elements of the same type that have addresses and to which we may associate metadata.

To visualize a dataset, simply use the -d option:

pycvf -d "image.caltech256()"
pycvf -d "video.keyframes_of(video.demo())"
pycvf -d "vectors.clustered_points()"

In PyCVF, metadata are called labelings, and they are very important for further processing. For instance, metadata are used in supervised learning, but also to associate already-computed features with data, or to attach tags to an image.

To implement a dataset, the recommended way to proceed is as follows:

In the “datasets” directory of your package, create a new file named after your dataset, and start with the following template:

from pycvf.core import dataset
from pycvf.datatypes import generic

class Dataset(dataset.Dataset):
    def datatype(self):
        return generic.Datatype
    def keys(self):
        return [1, 2, 3, 4]   ## specify a list of keys
    def __getitem__(self, key):
        return file("%d.txt" % (key,)).read()   ## fetch the data from your dataset
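
A minimal usage sketch for this template, assuming it is saved in your package and that files 1.txt to 4.txt exist:

ds = Dataset()
for key in ds.keys():
    print key, ds[key]   ## each element is fetched through its address (key)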

Labeling

To attach a labeling to your dataset, define a labeling_* method on the Dataset class that returns an object exposing the same datatype/__getitem__ interface:

class Dataset(dataset.Dataset):
    # ...
    def labeling_groundtruth(self):
        class Label:
            def datatype(lself):
                return generic.Datatype
            def __getitem__(lself, key):
                return file("%d.txt" % (key,)).read()   ## fetch the label for this key
        return Label()
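
A minimal usage sketch, assuming the Dataset defined above (the key 1 is illustrative):

ds = Dataset()
groundtruth = ds.labeling_groundtruth()
print groundtruth[1]   ## the label associated with the element at key 1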

Very Large Datasets

When working with very large datasets:

First of all, to save memory, it is recommended to use a generator for enumerating the keys:

def keys(self):
    for i in xrange(10**9):        ## xrange avoids materializing huge lists in memory
        for j in xrange(10**9):
            yield i, j

If the keys cannot fit in memory, then you should forget about using convenient keywords such as “randomized”, “traindb”, or “chunked”. This is a limitation of the current implementation; we will address these issues as the need for them appears.

If an exact count of the elements in the dataset is not available, or is too expensive to compute, you are invited to implement an upper_bound method that provides a reasonable guess of an upper bound. It may be used in some cases to allocate structures for storing results.
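
For the generator-based keys above, such a method could be a simple sketch like the following (the method name upper_bound comes from the text; the returned figure is an assumption):

class Dataset(dataset.Dataset):
    # ...
    def upper_bound(self):
        ## a cheap over-estimate of the number of keys;
        ## exact counting would be too expensive here
        return 10**18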

Implementation

All datasets must inherit from pycvf.core.dataset.Dataset.

class pycvf.core.dataset.Dataset

This class implements a minimal set of functions for a database, so that database implementations conform to the standard.

For full details about the database interface, please have a look at pycvf.database.SPECS

Basically, the intuition behind a database is that it is a typed collection of objects to which metadata may be attached by means of the “labeling” methods.

add_labelexp(name, value)
This saves a label.
add_precomputed_labels()
This looks in a directory specific to this database expression for potential labels matching the expression and, if no other label exists with that name, adds them to the dataset object.
dataset_hash()

Tries to compute a meaningful hash for this dataset instance.

The hash value is based on the class name and the module name.

import_labeling_from(altdb, label_op=<lambda>, label_prefix='', label_suffix='', overload=False)
This method allows you to pass labels from one database to another when you create decorating databases such as “keyframes_of”, “faces_of”, and so on...
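
A hedged usage sketch of these methods, based only on the signatures above (both dataset instances are assumed to come from the template earlier):

ds_orig = Dataset()                             ## source dataset carrying labels
ds_new = Dataset()                              ## e.g. a decorating dataset
ds_new.import_labeling_from(ds_orig, label_prefix='orig_')   ## copy labels, prefixed
ds_new.add_labelexp("tag", "demo")              ## save a label by hand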

Abstract Dataset Type

class pycvf.datasets.SPECS.ref.DB(arg1, arg2)

This is a reference database object specification.

__init__(arg1, arg2)
The constructor is mandatory.
__iter__()
The iterator is mandatory. Return type: a couple formed by one data element and the address of this element.
__len__()
len is optional, but may be useful when parallelizing computation, in order to evaluate the necessary storage for a distributed reduce or some operation of this kind.
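
A minimal sketch of an object conforming to this specification (the stored elements are hypothetical):

class DB(object):
    def __init__(self, arg1, arg2):          ## constructor is mandatory
        self._elements = [arg1, arg2]
    def __iter__(self):                      ## mandatory: yields (element, address) couples
        for address, element in enumerate(self._elements):
            yield element, address
    def __len__(self):                       ## optional: helps size distributed reduces
        return len(self._elements)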

Existing Datasets

Work in progress...

Metadataset

class pycvf.datasets.from_trackfile.DB(trackfilename, st=0, datatype=None, filter_nulls=True, disable_meta=False, addressrepr=None, oidxmode='pickle', addr_mode=True)
When your features are expensive to compute, it is smart to precompute them in a trackfile. You may then access the computed features through this module.
class pycvf.datasets.transformed.DB(db='imgkanji()', model='histogram()', modelpath=-1, datatype=None, subdir='transformed_db', traindb=None)
This allows you to apply a model to the database before exploring it.
class pycvf.datasets.exploded.DB(db, structure=None, quick_len=False, cache_id=None, subsample=<lambda>, orig_label_prefix='orig_', addrmap_label_prefix='addrmap_', exploded_label_prefix='expl_', raise_iter_errors=False, memcache=5)
This creates a database from another database by exploding all the elements of the initial database according to the specified structure.
class pycvf.datasets.aggregated_dataset.DB(dbs)
This database allows you to create a new database by aggregating several other databases.
class pycvf.datasets.interactive.DB
class pycvf.datasets.limit.DB(db='image.kanji()', limit=-1, limit_per_class=None, limit_classes=None)
This is a filter database that allows you to limit the number of elements extracted from a database. This avoids every database developer having to reimplement this useful feature (see the sketch after this list).
class pycvf.datasets.filtered.DB(vdb, model, subdir='filtered', modelpath=-1, with_progressmeter=True)
Returns a part of a database according to some filtering expression.
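
As referenced in the limit entry above, a hedged construction sketch based only on the signature shown (the wrapped database expression is illustrative):

from pycvf.datasets import limit

## keep at most 100 elements of the kanji image database
small_db = limit.DB(db='image.kanji()', limit=100)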

Vector Datasets

class pycvf.datasets.vectors.clustered_points.DB(npoints_per_clusters=100, clusters=10, ndim=4, sigma=0.12, space_size=20, seed=None, shuffled=True)