Know your dataset¶

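There are two types of dataset objects: a regular Dataset and an ✨ IterableDataset ✨. A Dataset provides fast random access to each row and uses memory-mapping, so loading even a large dataset takes only a relatively small amount of device memory. For really, really big datasets that won't fit on disk or in memory, an IterableDataset lets you access and use the dataset without waiting for it to download completely!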

This tutorial shows you how to load and access a Dataset and an IterableDataset.

Dataset¶

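When you load a dataset split, you get a Dataset object. You can do many things with a Dataset object, which is why it is important to learn how to manipulate and interact with the data stored inside it.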

This tutorial uses the rotten_tomatoes dataset, but feel free to load any dataset you like and follow along!

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")

Indexing¶

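A Dataset contains columns of data, and each column can hold a different type of data. The index, or axis label, is used to access examples from the dataset. For example, indexing by row returns a dictionary holding one example from the dataset: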

# Get the first row in the dataset
dataset[0]

Result:

{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

Use the - operator to start from the end of the dataset:

# Get the last row in the dataset
dataset[-1]

Result:

{'label': 0,
 'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .'}

Indexing by column name returns a list of all the values in that column:

dataset["text"]

Result:

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic',
 ...,
 'things really get weird , though not particularly scary : the movie is all portent and no content .']

You can combine row and column name indexing to return a specific value at a position:

dataset[0]["text"]

Result:

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

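Keep in mind that indexing order matters, especially when working with large audio and image datasets. Indexing by column name first returns all the values in that column and only then picks out the value at the given position. For large datasets, indexing by column name first can therefore be slower than indexing by row first.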
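The timing comparison that follows relies on a Timer context manager that this tutorial never defines. Here is a minimal sketch of one possible helper. The class body and its elapsed attribute are assumptions made for illustration and are not part of the datasets library:

```python
import time


class Timer:
    """Hypothetical helper: measures how long the body of a `with` block takes."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        print(f"Elapsed time: {self.elapsed:.4f} seconds")
        return False  # do not suppress exceptions raised inside the block


# Stand-in workload so the sketch runs without downloading a dataset
with Timer() as timer:
    total = sum(range(100_000))
```

With a helper like this in scope, each timed access can be wrapped in a with Timer(): block. The exact timings depend on your machine and on caching.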

with Timer():
    dataset[0]["text"]

Result:

Elapsed time: 0.0031 seconds

with Timer():
    dataset["text"][0]

Result:

Elapsed time: 0.0094 seconds

Slicing¶

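Slicing returns a slice, or subset, of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the : operator to specify a range of positions.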

# Get the first three rows
dataset[:3]

Result:

{'label': [1, 1, 1],
 'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic']}

# Get rows between three and six
dataset[3:6]

Result:

{'label': [1, 1, 1],
 'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']}
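As the outputs above show, a slice comes back column-oriented: a dictionary that maps each column name to a list of values. If you would rather work with one dictionary per row, the slice can be transposed in plain Python. This is only a sketch. The batch below is a hand-written stand-in with shortened placeholder texts, not real rows from the dataset:

```python
# A slice such as dataset[:3] is a dict mapping column names to lists.
# Shortened placeholder values stand in for the real reviews here.
batch = {
    "label": [1, 1, 1],
    "text": ["review a", "review b", "review c"],
}

# Transpose the column-oriented batch into one dict per row
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

print(rows[0])
```

The same expression should work on an actual slice, since it only assumes that every column list has the same length.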

IterableDataset¶

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():

from datasets import load_dataset

iterable_dataset = load_dataset("food101", split="train", streaming=True)

for example in iterable_dataset:
    print(example)
    break

Result:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}

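An IterableDataset steps through a dataset one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is very useful for large datasets you want to start using right away!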

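However, this means an IterableDataset behaves differently from a regular Dataset. You cannot randomly access examples in an IterableDataset. Instead, you iterate over its elements, for example by calling next(iter()) or by using a for loop, to get the next example from the IterableDataset: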

next(iter(iterable_dataset))

Result:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
 'label': 6}

for example in iterable_dataset:
    print(example)
    break

Result:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}

You can use IterableDataset.take() to return a subset of the dataset containing a specific number of examples:

# Get first three examples
list(iterable_dataset.take(3))

Result:

[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
  'label': 6}]

But unlike slicing, IterableDataset.take() creates a new IterableDataset.
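To build some intuition for why taking from a stream yields another iterable rather than materialized rows, here is a plain-Python analogy. It does not use the datasets library and is not how the library is implemented. The stream_examples generator is a hypothetical stand-in for a streamed dataset, and itertools.islice plays a role loosely similar to take():

```python
from itertools import islice


def stream_examples():
    """Hypothetical stand-in for a streamed dataset: yields one example at a time."""
    for index in range(1_000_000):
        yield {"id": index, "label": index % 2}


# islice is lazy: nothing is produced until the result is iterated
first_three = islice(stream_examples(), 3)
examples = list(first_three)
print(examples)

# A generator has no random access, so stream_examples()[0] would raise TypeError.
# Instead, pull the next example from an iterator.
first = next(iter(stream_examples()))
```

Only the three requested examples are ever generated here, which mirrors the reason streaming is attractive for datasets too large to download up front.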

Next steps¶

Interested in learning more about the differences between these two types of dataset? Read the Differences between Dataset and IterableDataset conceptual guide.

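To get more hands-on practice with these dataset types, see the Process guide to learn how to preprocess a Dataset, or the Stream guide to learn how to preprocess an IterableDataset.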