Important Datasets – Data shall set you free…

The Dataset 4: Amazon Product Reviews

An early product dataset, released to support research on recommendations

Product reviews with product metadata and a buying graph
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 – July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

Why they collected it: Amazon made this dataset open as “a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.”

License details:  By accessing the Amazon Customer Reviews Library (“Reviews Library”), you agree that the Reviews Library is an Amazon Service subject to the Conditions of Use and you agree to be bound by them, with the following additional conditions: In addition to the license rights granted under the Conditions of Use, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable, revocable license to access and use the Reviews Library for purposes of academic research. You may not resell, republish, or make any commercial use of the Reviews Library or its contents, including use of the Reviews Library for commercial research, such as research related to a funding or consultancy contract, internship, or other relationship in which the results are provided for a fee or delivered to a for-profit organization.  See more here:

How they got it:  Amazon’s dataset contains the customer review text with accompanying metadata, consisting of three major components, plus a fourth from UCSD:

  • A collection of reviews written in the marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale. (130M+ customer reviews)
  • A collection of reviews about products in multiple languages from different Amazon marketplaces, intended to facilitate analysis of customers’ perception of the same products and wider consumer preferences across languages and countries. (200K+ customer reviews in 5 countries)
  • A collection of reviews that have been identified as non-compliant with respect to Amazon policies. This is intended to provide a reference dataset for research on detecting promotional or biased reviews. (several thousand customer reviews). This part of the dataset is distributed separately and is available upon request – please contact the email address below if you are interested in obtaining this dataset.
  • UCSD extracted visual features from each product image using a deep CNN (see citation below). Image features are stored in a binary format, which consists of 10 characters (the product ID), followed by 4096 floats (repeated for every product). 
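
The binary layout described in the last bullet (a 10-character product ID followed by 4096 floats, repeated per product) can be read with a few lines of Python. This is only a minimal sketch: it assumes 4-byte little-endian floats and ASCII product IDs, neither of which is stated above, and the function name `read_image_features` is hypothetical.

```python
import struct
from typing import BinaryIO, Iterator, List, Tuple

ID_LEN = 10                      # product ID: 10 characters
N_FLOATS = 4096                  # length of the CNN feature vector
RECORD_FMT = f"<{N_FLOATS}f"     # assumption: little-endian 32-bit floats
RECORD_BYTES = struct.calcsize(RECORD_FMT)


def read_image_features(f: BinaryIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (product_id, features) pairs until the stream is exhausted."""
    while True:
        raw_id = f.read(ID_LEN)
        if not raw_id:
            break  # clean end of file
        payload = f.read(RECORD_BYTES)
        if len(raw_id) < ID_LEN or len(payload) < RECORD_BYTES:
            raise ValueError("truncated record")
        # assumption: product IDs are plain ASCII
        yield raw_id.decode("ascii"), list(struct.unpack(RECORD_FMT, payload))
```

Because the function is a generator, the full feature file never has to fit in memory; open the file in binary mode (`"rb"`) and iterate over the records one product at a time.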

Why this dataset matters: It’s one of the great product datasets to come out, and Amazon has created a lot of customer value (and made lots of money) by building good recommendation algorithms based on data like this. It is also broken into smaller categories (Books, Electronics, Movies and TV, etc.) and offers numerical features (price, sales rank, etc.), text for NLP (the actual reviews), images (of the products reviewed), and a graph (other items viewed or bought by customers).

Default task: Regression (assuming the goal is to predict which features lead to a higher sales rank)

Attribute Type: Mixed

Data Type: Multivariate  Text  Image  Graph

Area: Business 

# Attributes: 17

# Instances: 142.8MM

Format Type: Non-matrix

Play with the dataset
Relevant paper(s):  Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering, R. He, J. McAuley, WWW, 2016

Image-based recommendations on styles and substitutes.  J. McAuley, C. Targett, J. Shi, A. van den Hengel.  SIGIR, 2015

Relevant algorithms


The Dataset 3: ImageNet as C. elegans

The second, important dataset for visual object recognition

…ML algorithms + ImageNet database and competition = computer vision research boom…
The ImageNet project is a large visual database designed for visual object recognition ML research. More than 14 million images were hand-annotated by the project to connect objects with a text label (specifically a synset). ImageNet contains more than 20,000 categories, with a typical category, such as “balloon” or “strawberry”, consisting of several hundred images. Note that ImageNet is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). ImageNet aims to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated.

Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify and detect objects and scenes. In the classification task, a model counts as correct under the top-5 measure if the true label is among its five highest-ranked guesses, or, under the much harder top-1 measure, only if its single best guess is right. The challenge uses a “trimmed” list of one thousand non-overlapping classes.
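
The top-5 and top-1 scoring can be made concrete with a small sketch. This is not the official ILSVRC evaluation code, just a plain-Python illustration; the function name `top_k_accuracy` and the toy scores are made up for the example.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 3 images, 3 classes (ILSVRC would have 1,000 classes).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

A top-k error rate is simply one minus this value, and the top-1 score can never exceed the top-5 score on the same predictions.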

Why they collected it: The ImageNet project came from the need for more data. Ever since the birth of the digital era and the availability of web-scale data exchanges, researchers have been working hard to design more and more sophisticated algorithms to index, retrieve, organize and annotate text, images, audio, and videos. To tackle these problems at scale, it would be tremendously helpful to researchers if there existed a massive image database. Stanford AI researcher Fei-Fei Li began working on the idea for ImageNet in 2006. At a time when most AI research focused on models and algorithms, Li wanted to expand and improve the data available to train AI algorithms. In 2007, Li met with Princeton professor Christiane Fellbaum, one of the creators of WordNet, to discuss the project. From this meeting, Li went on to build ImageNet starting from the word-database of WordNet and using many of its features.

How they got it: Li assembled a team of researchers to work on the ImageNet project. They used Amazon Mechanical Turk to help with the classification of images. They presented their database for the first time as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida.  ImageNet crowdsourced its annotation process. Image-level annotations showed the presence or absence of an object class in an image, such as “there are lions in this image” or “there are no lions in this image”. ImageNet uses a variant of the broad WordNet schema to categorize objects, augmented with 120 categories of dog breeds to showcase fine-grained classification.  In 2012 ImageNet was the world’s largest academic user of Mechanical Turk. The average worker identified 50 images per minute.

Why this dataset matters: The 2010s saw dramatic progress in image processing. Around 2011, a good ILSVRC classification top 5 error rate was 25%. In 2012, a deep convolutional neural net called AlexNet achieved 16%; in the next couple of years, error rates fell to a few percent.  While the 2012 breakthrough “combined pieces that were all there before”, the dramatic quantitative improvement marked the start of an industry-wide artificial intelligence boom as companies like Google, Facebook, Amazon, and Microsoft jumped in to incorporate this technology into their products.  By 2015, researchers at Microsoft reported that their CNNs exceeded human ability at the narrow ILSVRC tasks.  However, the programs only have to identify images as belonging to one of a thousand categories; humans can recognize a larger number of categories, and also (unlike the programs) can judge the context of an image.  As of early 2020, an ImageNet leaderboard shows the top 1 accuracy of 88.5% and top 5 of 98.7%.

Default task: Classification 

Attribute Type: Categorical 

Data Type: Image 

Area: CS/Engineering 

# Attributes:  Image, Synset

# Instances:

Total number of images: 14,197,122

Number of images with bounding box annotations: 1,034,908

Total number of non-empty synsets: 21,841

Number of synsets with SIFT features: 1000

Number of images with SIFT features: 1.2 million

Format Type: Non-matrix

Play with the dataset

Relevant paper(s):  Hugo Touvron, Andrea Vedaldi, Matthijs Douze, Hervé Jégou “Fixing the train-test resolution discrepancy: FixEfficientNet.”

Relevant algorithms


The Dataset 2: Twitter Sentiment Analysis

An early, useful dataset for Natural Language Processing (NLP)

Checking sentiment is the new “Hello World” of NLP classification
The Twitter/Kaggle dataset has 1.6MM tweets extracted using the Twitter API and classified on a scale from 0 = negative, through 2 = neutral, to 4 = positive.

Why they collected it: It was collected for a Stanford CS224N NLP class project in 2009 to play with some ML tools. It ended up being used by later deep learning NLP libraries as a quick test dataset.

How they got it: This dataset was put together in a sloppy way and is not very reliable. The students describe their data collection and annotation process here: “Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search.”
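
The emoticon heuristic in that quote is easy to sketch. The snippet below is an illustration, not the students' actual code: the emoticon lists are assumed, the name `emoticon_label` is hypothetical, and skipping tweets with no emoticon or with mixed emoticons is a choice made here for simplicity. The labels 4 and 0 follow the scale described above.

```python
from typing import Optional

# Assumption: short illustrative emoticon lists, not the ones used in the project.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def emoticon_label(tweet: str) -> Optional[int]:
    """Return 4 (positive), 0 (negative), or None if the tweet gives no clear signal."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or both kinds present: skip the tweet
    return 4 if has_pos else 0


for text in ["love this :)", "missed the bus :(", "currently at work.."]:
    print(repr(text), "->", emoticon_label(text))
```

Note that the label depends only on the emoticon, never on the words, so any tweet whose emoticon does not match its wording gets the wrong label.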

Because it was put together with a crude heuristic, here are poorly labelled examples of “negative sentiment” (zero values):

@julieebaby awe i love you too!!!! 1 am here i miss you

currently at work..

i have to take my sidekick back.

@lauredhel What happened?

Why this dataset matters: This dataset has been used by many later ML researchers as a quick way to test a new NLP classifier. It’s almost become the MNIST for NLP, except that it’s a pretty shoddy dataset and should be avoided.

Default task: Classification 

Attribute Type: Categorical 

Data Type: Text/tweet 

Area: CS/Engineering/NLP

# Attributes:

  1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  2. ids: the id of the tweet (2087)
  3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  4. flag: the query (lyx). If there is no query, then this value is NO_QUERY.
  5. user: the user that tweeted (robotickilldozr)
  6. text: the text of the tweet (Lyx is cool)
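
A minimal sketch of reading rows with these six fields using Python's standard `csv` module. It assumes a header-less, comma-separated file with the fields in the order listed above; the `Tweet` class and `parse_tweets` function are hypothetical names, and the sample row reuses the example values from the list with an assumed target of 4.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator

FIELDS = ("target", "ids", "date", "flag", "user", "text")
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}


@dataclass
class Tweet:
    target: int
    ids: int
    date: str
    flag: str
    user: str
    text: str

    @property
    def polarity(self) -> str:
        return POLARITY[self.target]


def parse_tweets(lines: Iterable[str]) -> Iterator[Tweet]:
    """Parse CSV lines into Tweet records, rejecting rows with the wrong field count."""
    for row in csv.reader(lines):
        if len(row) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
        target, ids, date, flag, user, text = row
        yield Tweet(int(target), int(ids), date, flag, user, text)


# Assumption: target 4 for this sample row; the list above does not give it.
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
tweet = next(parse_tweets(io.StringIO(sample)))
print(tweet.user, tweet.polarity)
```

Using `csv.reader` rather than splitting on commas matters here, because tweet text routinely contains commas and quotes.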

# Instances: 1.6MM tweets

Format Type: Matrix

Play with the dataset

Relevant paper(s): Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

Relevant algorithms


The Dataset 1: MNIST as the Drosophila

The first, important dataset for machine learning

…ML algorithms + MNIST data = early computer vision problem solved…
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits were size-normalized and centered in a fixed-size image. It was the first widely used, good-quality database on which people could try ML techniques and pattern recognition methods on real-world data with minimal preprocessing and formatting effort.

Why they collected it: Yann LeCun and his colleagues collected and cleaned a subset of NIST data to do some computer vision (CV) research. The early real-world use cases were CV systems that could help post offices read mail addresses quickly (for faster sorting) and help banks read checks quickly (for faster processing and verification).

How they got it: The MNIST database was constructed by LeCun et al. from NIST’s Special Database 3 and Special Database 1 which contain binary images of handwritten digits. NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1 because SD-3 was collected among Census Bureau employees (adults), while SD-1 was collected among high-school students (teens). Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore LeCun built a new database by mixing NIST’s datasets. The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. The test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from around 250 writers (to capture variation in how humans write digits), and the sets of writers of the training set and test set are disjoint.
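
The mixing step can be illustrated with a toy sketch. This is not LeCun's actual procedure, only one hypothetical way to draw equal numbers of samples from two sources while keeping training and test writers disjoint. The function names and the synthetic data are assumptions, and the sizes are scaled down (60 + 60 and 10 + 10 stand in for 30,000 + 30,000 and 5,000 + 5,000).

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (writer_id, sample_index): stand-in for one digit image


def split_source(samples: List[Sample], n_train: int, n_test: int,
                 rng: random.Random) -> Tuple[List[Sample], List[Sample]]:
    """Split one source into train/test so that no writer appears in both."""
    by_writer: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        by_writer[s[0]].append(s)
    writers = sorted(by_writer)
    rng.shuffle(writers)

    train: List[Sample] = []
    test: List[Sample] = []
    for w in writers:
        # Fill the training quota first; every later writer goes to the test set.
        target = train if len(train) < n_train else test
        target.extend(by_writer[w])
    if len(train) < n_train or len(test) < n_test:
        raise ValueError("not enough samples for the requested split")
    # Truncating only drops samples, so writers stay disjoint.
    return train[:n_train], test[:n_test]


def mixed_split(sd3: List[Sample], sd1: List[Sample], n_train_each: int,
                n_test_each: int, seed: int = 0):
    """Take equal train/test shares from each source and concatenate them."""
    rng = random.Random(seed)
    train3, test3 = split_source(sd3, n_train_each, n_test_each, rng)
    train1, test1 = split_source(sd1, n_train_each, n_test_each, rng)
    return train3 + train1, test3 + test1


# Synthetic stand-ins: 20 writers per source, 10 samples per writer.
sd3 = [(f"census-{w}", i) for w in range(20) for i in range(10)]
sd1 = [(f"student-{w}", i) for w in range(20) for i in range(10)]
train, test = mixed_split(sd3, sd1, n_train_each=60, n_test_each=10)
print(len(train), len(test))
```

Splitting by writer rather than by individual sample is the point of the exercise: a model tested on digits from writers it has already seen would look better than it really is.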

Why this dataset matters: Basically, it’s the Drosophila of computer vision research: the dataset on which all the early machine learning, and later deep learning, algorithms were tested. It’s still one of the quickest datasets a CV researcher can use to test a new algorithm or architecture (before proceeding to ImageNet, the C. elegans of CV research). In January 2020, Jonas Matuzas published easily replicable code on GitHub that gives a (0.17±0.01)% error rate. An extended dataset similar to MNIST, called EMNIST, was published in 2017; it contains 240,000 training images and 40,000 testing images of handwritten digits and characters.

Default task: Classification 

Attribute Type: Categorical 

Data Type: Image 

Area: CS/Engineering 

# Attributes: 28×28 pixel box

# Instances: 60K training set, 10K test set

Format Type: Non-matrix

Play with the dataset  

Relevant paper(s):  Romanuke, Vadim. “The single convolutional neural network best performance in 18 epochs on the expanded training data at Parallel Computing Center, Khmelnitskiy, Ukraine”. Retrieved 16 November 2016.

Relevant algorithms

