New York Public Library "What's on the Menu?" Dataset
The dataset is created by the New York Public Library. It contains historical data on the menus of hotels, restaurants and cafes with the dishes along with their prices.
Source: http://menus.nypl.org/data The data is in public domain.
The data is from library's archive and it may be incomplete and difficult for statistical analysis. Nevertheless it is also very yummy. The size is just 1.3 million records about dishes in the menus — it's a very small data volume for ClickHouse, but it's still a good example.
Download the Dataset
Run the command:
Replace the link to the up to date link from http://menus.nypl.org/data if needed. Download size is about 35 MB.
Unpack the Dataset
Uncompressed size is about 150 MB.
The data is normalized consisted of four tables:
- Menu— Information about menus: the name of the restaurant, the date when menu was seen, etc.
- Dish— Information about dishes: the name of the dish along with some characteristic.
- MenuPage— Information about the pages in the menus, because every page belongs to some menu.
- MenuItem— An item of the menu. A dish along with its price on some menu page: links to dish and menu page.
Create the Tables
We use Decimal data type to store prices.
Import the Data
Upload data into ClickHouse, run:
We use CSVWithNames format as the data is represented by CSV with header.
We disable format_csv_allow_single_quotes as only double quotes are used for data fields and single quotes can be inside the values and should not confuse the CSV parser.
We disable input_format_null_as_default as our data does not have NULL. Otherwise ClickHouse will try to parse \N sequences and can be confused with \ in data.
The setting date_time_input_format best_effort allows to parse DateTime fields in wide variety of formats. For example, ISO-8601 without seconds like '2000-01-01 01:02' will be recognized. Without this setting only fixed DateTime format is allowed.
Denormalize the Data
Data is presented in multiple tables in normalized form. It means you have to perform JOIN if you want to query, e.g. dish names from menu items.
For typical analytical tasks it is way more efficient to deal with pre-JOINed data to avoid doing JOIN every time. It is called "denormalized" data.
We will create a table menu_item_denorm where will contain all the data JOINed together:
Validate the Data
Query:
Result:
Run Some Queries
Averaged historical prices of dishes
Query:
Result:
Take it with a grain of salt.
Burger Prices
Query:
Result:
Vodka
Query:
Result:
To get vodka we have to write ILIKE '%vodka%' and this definitely makes a statement.
Caviar
Let's print caviar prices. Also let's print a name of any dish with caviar.
Query:
Result:
At least they have caviar with vodka. Very nice.
Online Playground
The data is uploaded to ClickHouse Playground, example.
