Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with
dataset_imdb(), each wire is encoded as a sequence of word indexes (same
conventions).
Usage
dataset_reuters(
  path = "reuters.npz",
  num_words = NULL,
  skip_top = 0L,
  maxlen = NULL,
  test_split = 0.2,
  seed = 113L,
  start_char = 1L,
  oov_char = 2L,
  index_from = 3L
)
dataset_reuters_word_index(path = "reuters_word_index.pkl")
Arguments
- path
Where to cache the data (relative to ~/.keras/dataset).
- num_words
Max number of words to include. Words are ranked by how often they occur (in the training set) and only the most frequent words are kept.
- skip_top
Skip the top N most frequently occurring words (which may not be informative).
- maxlen
Truncate sequences after this length.
- test_split
Fraction of the dataset to be used as test data.
- seed
Random seed for sample shuffling.
- start_char
The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
- oov_char
Words that were cut out because of the num_words or skip_top limit will be replaced with this character.
- index_from
Index actual words with this index and higher.
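The interaction between num_words, skip_top, start_char, oov_char and index_from can be illustrated with a small base-R sketch. The helper encode_wire() below is hypothetical and not part of keras. It assumes the loader shifts each raw frequency rank by index_from, prepends start_char, and replaces any resulting index outside the range [skip_top, num_words) with oov_char. It also assumes that num_words = NULL means no upper limit. The real loader may differ in details.

```r
# Hypothetical helper (not a keras function): mimics the index conventions
# described in the arguments above, under the assumptions stated in the text.
encode_wire <- function(ranks, num_words = NULL, skip_top = 0L,
                        start_char = 1L, oov_char = 2L, index_from = 3L) {
  # Shift raw frequency ranks (1 = most frequent word) so that the low
  # indices stay reserved, then mark the start of the sequence.
  ids <- c(start_char, ranks + index_from)

  # Assumption: with num_words = NULL no upper limit is applied.
  upper <- if (is.null(num_words)) max(ids) + 1L else num_words

  # Anything outside [skip_top, upper) becomes the out-of-vocabulary marker.
  keep <- ids >= skip_top & ids < upper
  ids[!keep] <- oov_char
  ids
}

# Raw rank 40 maps to index 43, which exceeds num_words = 10,
# so it is replaced by oov_char (2).
encode_wire(c(1L, 5L, 40L, 2L), num_words = 10L)
# 1 4 8 2 5
```

In this sketch the defaults mirror the Usage block, so index 1 marks the start of a sequence, index 2 stands for a dropped word, and real words begin at index_from + 1.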
Value
Lists of training and test data: train$x, train$y, test$x, test$y,
with the same format as dataset_imdb(). The dataset_reuters_word_index()
function returns a list where the names are words and the values are
integers, e.g. word_index[["giraffe"]] might return 1234.
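A word index of this shape can be inverted to turn an encoded wire back into text. The sketch below uses a toy word_index as a stand-in for the list returned by dataset_reuters_word_index(), so it runs without downloading anything. The helper decode_wire() is hypothetical. It assumes the data were loaded with the default index_from = 3L, so that subtracting 3 from an encoded index recovers the word's rank, and it prints "?" for reserved indices (padding, start, out-of-vocabulary).

```r
# Toy stand-in for dataset_reuters_word_index(); the real list is much larger.
word_index <- list(the = 1L, of = 2L, oil = 3L, prices = 4L)

# Hypothetical helper (not a keras function): maps encoded indices to words.
decode_wire <- function(ids, word_index, index_from = 3L) {
  # Invert the lookup so that position = rank and value = word.
  ranks <- unlist(word_index)
  words <- character(max(ranks))
  words[ranks] <- names(ranks)

  # Undo the index_from shift; reserved or unknown indices have no word.
  raw <- ids - index_from
  known <- raw >= 1L & raw <= length(words)

  out <- rep("?", length(ids))
  out[known] <- words[raw[known]]
  paste(out, collapse = " ")
}

decode_wire(c(1L, 6L, 7L, 2L, 4L), word_index)
# "? oil prices ? the"
```

With the real data, the same call would take one element of train$x and the full word index. If a non-default index_from was used when loading, the same value would need to be passed to the helper.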
See also
Other datasets: dataset_boston_housing(), dataset_cifar10(),
dataset_cifar100(), dataset_fashion_mnist(), dataset_imdb(),
dataset_mnist()