Dataset¶
-
tinder.dataset.BalancedDataLoader(dataset, classes, **kwargs)[source]¶
If your dataset is unbalanced, this wrapper samples classes uniformly, so items from rare classes are drawn more often.
Example:
# -3 is sampled twice as often as 2 or 3: each class ('R', 'G', 'B')
# is drawn with equal probability, and class 'B' has two members.
loader = BalancedDataLoader([-3, 5, 2, 3], ['R', 'G', 'B', 'B'], batch_size=1)
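A slightly larger sketch of the same idea, assuming an imbalanced label list (the data and labels below are illustrative, not part of the library):

import tinder

data = list(range(100))
labels = [0] * 90 + [1] * 10  # class 1 is rare
loader = tinder.dataset.BalancedDataLoader(data, labels, batch_size=8)
for batch in loader:
    break  # over many batches, classes 0 and 1 appear about equally often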
- Parameters
dataset (iterable) – A torch Dataset, list, or any sequence with known length.
classes (iterable) – A list of hashable values. Its length must equal that of the dataset.
**kwargs – Extra keyword arguments forwarded to torch.utils.data.DataLoader.
-
tinder.dataset.DataLoaderIterator(loader, num=None, last_step=0)[source]¶
A convenient DataLoader wrapper for iterating a given number of steps, continuing across epoch boundaries when needed.
It is recommended to set drop_last=True in your DataLoader so that every step sees a full-size batch.
Example:
from torch.utils.data import DataLoader
from tinder.dataset import DataLoaderIterator

loader = DataLoader(dataset, num_workers=8)

# Iterate with a global step counter.
for step, batch in DataLoaderIterator(loader):
    pass

# Resume from step 2 and stop after step 6.
for step, _ in DataLoaderIterator(loader, num=6, last_step=2):
    print(step)  # 3, 4, 5, 6
-
class tinder.dataset.StreamingDataloader(q, batch_size: int, num_workers: int, transform)[source]¶
A dataloader for streaming data.
If you have a stream of data (e.g. from RabbitMQ or Kafka), you cannot use a traditional PyTorch Dataset, which requires __len__ to be defined. In this case, push your streaming data into a multiprocessing.Manager().Queue() in the background and pass the queue to StreamingDataloader.
StreamingDataloader is an iterator.
__next__ is blocking and returns at least one element.
It never raises StopIteration.
Example:
import tinder

def preprocess(msg: str):
    return '(' + msg + ')' + str(len(msg))

c = tinder.queue.KafkaConsumer(topic='filenames', consumer_id='anonymous_123')
q = c.start_drain(batch_size=3, capacity=20)
loader = tinder.dataset.StreamingDataloader(q, batch_size=5, num_workers=2, transform=preprocess)

for batch in loader:
    print('batch: ', batch)
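If your stream does not come through tinder.queue, the same pattern works with a plain Manager queue, as the description above suggests. A minimal sketch, assuming you fill the queue from a background thread (the producer below is illustrative):

import multiprocessing
import threading
import tinder

q = multiprocessing.Manager().Queue(maxsize=20)

def produce():
    # Illustrative producer; in practice this would read from your stream.
    for i in range(100):
        q.put('message-%d' % i)

threading.Thread(target=produce, daemon=True).start()

loader = tinder.dataset.StreamingDataloader(q, batch_size=5, num_workers=2, transform=str.upper)
for batch in loader:
    # __next__ blocks and never raises StopIteration, so this loop
    # runs until you break or interrupt it.
    print(batch)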
-
tinder.dataset.hash100(s: str)[source]¶
Hash a string to an integer in the range 1~100. Useful for splitting a dataset into deterministic subsets.
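A sketch of the split-by-hash idiom, assuming hash100 is stable across runs (the filenames and the 80/20 cutoff below are illustrative):

import tinder

train, val = [], []
for name in ['a.png', 'b.png', 'c.png']:
    # A given name always hashes to the same bucket, so the split is reproducible.
    if tinder.dataset.hash100(name) <= 80:
        train.append(name)
    else:
        val.append(name)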
-
tinder.dataset.random_split(dataset, ratio, seed=None)[source]¶
Split a given dataset into several subsets, e.g. train/val/test.
The randomness comes from torch, so the split can be made reproducible with torch.manual_seed.
- Parameters
dataset – A PyTorch dataset object.
ratio (list) – The first n-1 portions of the split, e.g. [0.7, 0.2] for a 70% / 20% / 10% split.
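A short usage sketch; the three-way unpacking assumes random_split returns one subset per portion, as the ratio description implies:

import torch
import torch.utils.data
import tinder

dataset = torch.utils.data.TensorDataset(torch.arange(100))
torch.manual_seed(0)  # fix torch's RNG so the split is reproducible
# [0.7, 0.2] requests 70% and 20%; the remaining 10% forms the last subset.
train, val, test = tinder.dataset.random_split(dataset, [0.7, 0.2])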