Formatting bulk exports for viaLibri

This document is intended for those running large aggregation websites that list books from multiple dealers. If you’re running a site for a single bookseller then please see the harvest documentation.

Our standard method of searching large sites is to download one file containing all book records once every 24 hours. This is then loaded into our database and used as part of our search system. We are not able to accept paged data. All your records must be in a single file.

We accept four different file formats: XML, CSV, tab delimited, and a JSON object per line (NDJSON). All of these are described later in this document. For all formats we expect the files to be UTF-8 encoded.

All text fields (including description) should be provided as plain text, not HTML. We do make an effort to clean HTML elements from the text in the files, but including them in your file may result in strange characters being displayed in your search listings on viaLibri.

Making the file available

The file should be downloadable via HTTP or FTP. As these files can get very large, you may choose to compress the file to save on bandwidth. Our system accepts files compressed with bzip2, gzip or zip.
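
As a rough illustration of the compression step, the sketch below encodes an export as UTF-8 and gzips it using Python's standard library. The function name gzip_export is hypothetical, and building the whole export in memory is an assumption that only suits smaller files; a very large export would more likely be streamed to disk.

```python
import gzip


def gzip_export(text: str) -> bytes:
    """Encode the export text as UTF-8 and return it gzip-compressed.

    Hypothetical helper: it holds the whole export in memory, which is
    only reasonable for files that comfortably fit in RAM.
    """
    return gzip.compress(text.encode("utf-8"))
```

The returned bytes could then be written to a file such as books.xml.gz and served in place of the uncompressed export.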

If you wish to secure the file then there are a few different options:

  • Username and password. Either an FTP account, or using HTTP Basic authentication.
  • Require requests to the URL to have a particular access code in the query string.
  • Require requests to the URL to have a particular access code in the Authorization header.
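
One possible server-side check for the two access code options is sketched below. The query parameter name access_code, the Bearer scheme in the Authorization header, and the EXPECTED_CODE value are all assumptions made for illustration; the actual names and code would be whatever you agree with viaLibri.

```python
import hmac
from urllib.parse import parse_qs, urlsplit

# Hypothetical secret; in practice this would come from configuration.
EXPECTED_CODE = "example-secret"


def request_is_authorized(url: str, headers: dict[str, str]) -> bool:
    """Return True if the request carries the expected access code.

    The code is looked for in an "access_code" query parameter and in an
    "Authorization: Bearer <code>" header (both names are assumptions).
    """
    query = parse_qs(urlsplit(url).query)
    candidates = list(query.get("access_code", []))

    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        candidates.append(auth[len("Bearer "):])

    expected = EXPECTED_CODE.encode("utf-8")
    # compare_digest avoids leaking how much of the code matched via timing.
    return any(
        hmac.compare_digest(code.encode("utf-8"), expected)
        for code in candidates
    )
```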

Do not limit downloads by IP address as the IP address used will sometimes change.

When we download the file using HTTP we will use the User-Agent string “viaLibri Bulk Download”. You should check that no security settings on your server would block requests with this User-Agent.
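
To check this yourself, you could send a request with the same User-Agent and confirm that your server returns the file. The sketch below only builds such a request with Python's standard library; the function name build_probe_request and the example URL are hypothetical.

```python
import urllib.request

USER_AGENT = "viaLibri Bulk Download"


def build_probe_request(url: str) -> urllib.request.Request:
    """Build a GET request that sends the bulk download User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to urllib.request.urlopen, for example urlopen(build_probe_request("https://www.example.com/export.xml.gz")), should succeed if nothing on the server blocks the User-Agent.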

Updating the file

The file should ideally be updated once every 24 hours at the same time of day. Please let us know when you will update the file, and we will schedule our download to happen shortly afterwards.

What to include

The file should include all relevant books that are currently available for sale through your site. You should not include:

  • books that have been sold.
  • new books.
  • ebooks.
  • books that are only for sale in a single country (this is frustrating for our users who come from all over the world).

Include as much data as you have on each book, at least as much as is displayed on your site. Many of the fields are marked as optional, but they should always be included if you have the information.

Data format 1: XML

The data file should look something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<Books>
  <Book>
    <dealer_name>Al's Books</dealer_name>
    <dealer_id_on_site>DEF345</dealer_id_on_site>
    <dealer_location>Cambridge, UK</dealer_location>
    <author>George Orwell</author>
    <title>Nineteen Eighty-Four</title>
    <description>Nineteen Eighty-Four, often published as 1984, is
    a dystopian novel by English author George Orwell published in 1949. The
    novel is set in Airstrip One (formerly known as Great Britain), a
    province of the superstate Oceania in a world of ...</description>
    <book_id_on_site>12345</book_id_on_site>
    <dealers_book_id>ABC123</dealers_book_id>
    <year>1949</year>
    <edition>First edition</edition>
    <publisher>Secker &amp; Warburg</publisher>
    <price>1234.56</price>
    <currency>GBP</currency>
    <keywords>dystopian, sci-fi</keywords>
    <isbn>9780547249643</isbn>
    <first_edition>yes</first_edition>
    <signed>no</signed>
    <dust_jacket>yes</dust_jacket>
    <url>https://www.example.com/1984/</url>
    <image_url>https://www.example.com/1984.jpg</image_url>
  </Book>
  <Book>
  ...
</Books>

The Books element contains a Book element for each book currently for sale on the site. Some of the fields in Book are optional, and there are other fields that are not shown here. A full list of the available fields is included below.

We prefer well-formed XML and recommend using an existing library for generating XML rather than hand-rolling your own solution. However, we try to be quite forgiving in the way we process the files, so they don’t necessarily need to pass a strict XML validation test.
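
As an example of generating the file with an existing library, the sketch below uses Python's xml.etree.ElementTree, which escapes characters such as & and < in field values automatically. The function name build_books_xml is hypothetical, and it assumes each book is a dictionary mapping field names to strings. It builds the whole tree in memory, so a very large export may need a streaming approach instead.

```python
import xml.etree.ElementTree as ET


def build_books_xml(books: list[dict[str, str]]) -> bytes:
    """Serialise book records as a UTF-8 <Books> document.

    Each dictionary key becomes a child element of <Book>; the library
    escapes special characters in the values.
    """
    root = ET.Element("Books")
    for book in books:
        node = ET.SubElement(root, "Book")
        for field, value in book.items():
            ET.SubElement(node, field).text = value
    # xml_declaration requires Python 3.8 or later.
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

A publisher value of "Secker & Warburg" is written as Secker &amp;amp; Warburg in the output, which is what an XML parser expects.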

Data format 2: CSV

The CSV file should be RFC 4180 compliant. The important parts of this are:

  • Each record must contain the same number of comma-separated fields.
  • Any field may be quoted (with double-quotes).
  • Fields containing a line break, double-quote or comma should be quoted. If they are not, the file will likely be impossible to process correctly.
  • A double-quote character in a field must be represented by two double-quote characters.

You can either include the field names in the first line of the file or let us know the ordering via email to support@vialibri.net.
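
Most CSV libraries apply these quoting rules for you. The sketch below uses Python's csv module to write a header line followed by one row per book. The function name build_csv and the short FIELDS list are hypothetical; a real export would list every field it has data for.

```python
import csv
import io

# Hypothetical subset of columns, used only to keep the example short.
FIELDS = ["book_id_on_site", "author", "title", "description", "price", "url"]


def build_csv(books: list[dict[str, str]]) -> str:
    """Return CSV text with a header row and one row per book.

    The csv module quotes fields containing commas, double-quotes or line
    breaks, and doubles any embedded double-quote characters.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\r\n")
    writer.writeheader()
    writer.writerows(books)
    return buffer.getvalue()
```

When saving the result, encode it as UTF-8, for example with open(path, "w", encoding="utf-8", newline="").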

Data format 3: Tab delimited

Tab delimited files are often easier to generate than XML, but there are still a couple of rules to follow:

  1. Fields must not contain tab characters. Any lines that contain the wrong number of tabs will be ignored.
  2. The columns used and their ordering must always be the same.

You can either include the field names in the first line of the file or let us know the ordering via email to support@vialibri.net.
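
A minimal sketch of producing a tab delimited line is shown below. It replaces tab characters in values with a space so that every line keeps the same number of tabs. Replacing line breaks as well is an assumption on our part, made because each record occupies one line. The function names to_tsv_line and build_tsv are hypothetical.

```python
import re

# Matches runs of tabs and line breaks inside a value.
_SEPARATORS = re.compile(r"[\t\r\n]+")


def to_tsv_line(book: dict[str, str], fields: list[str]) -> str:
    """Return one tab delimited line with the columns in a fixed order."""
    cells = []
    for name in fields:
        value = str(book.get(name, ""))  # missing fields become empty cells
        cells.append(_SEPARATORS.sub(" ", value))
    return "\t".join(cells)


def build_tsv(books: list[dict[str, str]], fields: list[str]) -> str:
    """Return the full file: a header line, then one line per book."""
    lines = ["\t".join(fields)]
    lines.extend(to_tsv_line(book, fields) for book in books)
    return "\n".join(lines) + "\n"
```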

Data format 4: JSON object per line (NDJSON)

The file should contain JSON objects that are separated by new line characters. There must be no new lines within the objects themselves. As a result, each line will be a valid JSON object, but the file as a whole will not be valid JSON.

It should look like what’s shown below, but with more fields.

{"book_id_on_site": "123ABC", "author": "George Orwell", "title": "Animal Farm"}
{"book_id_on_site": "456DEF", "author": "Jane Austen", "title": "Emma"}
{"book_id_on_site": "789GHI", "author": "J.R.R. Tolkien", "title": "The Hobbit"}
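
A file in this format can be produced by serialising each record separately and ending each one with a new line. The sketch below uses Python's json module, which escapes line breaks inside string values, so each object stays on one line. The function name to_ndjson is hypothetical.

```python
import json


def to_ndjson(books: list[dict]) -> str:
    """Return one compact JSON object per line.

    json.dumps without an indent argument never emits a raw new line;
    line breaks inside values are written as the two characters \\n.
    ensure_ascii=False keeps accented characters readable in UTF-8.
    """
    return "".join(json.dumps(book, ensure_ascii=False) + "\n" for book in books)
```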

Fields

The fields below are used for all four file formats.

Fields with stars (*) are required, but you should include all fields that you have data for.

  • dealer_name * – The name of the dealer that has listed this book for sale.
  • dealer_id_on_site * – Your database’s internal ID for the dealer.
  • dealer_location – The dealer’s location as text. This should contain at least the dealer’s country but can also include city and/or state as well.
  • dealer_country_code – The dealer’s location as a two-letter ISO 3166-1 country code. The value “XX” should be used if the dealer’s country is unknown.
  • author – The author of the book.
  • title * – The title of the book.
  • description – A description of the book. This field should merge, in the preferred order, all data that is needed as part of the book description, such as publisher, condition, place of publication, edition, format, comments, publication date, etc. Data which is included in the edition and publisher fields will not be displayed unless it has also been added to this field. However, it is possible for our system to automatically add the publisher and year fields to the start of all your descriptions. Let us know if you’d like us to turn that on for your data.
  • listing_type – This must be either “used” or “auction”. The value “used” indicates an old or second-hand book that is on sale for a fixed price. “auction” should be used for old or second-hand books that are being auctioned online. If all your books have the same listing_type value then let us know and this field can be left out.
  • end_date – The date that the sale of this item will finish. This is required for items with a listing_type of “auction” and optional for other listing_type values. It should be given in the UTC timezone and formatted as “yyyy-mm-dd hh:mm:ss”.
  • book_id_on_site * – Your database’s internal ID for the book. This must be a unique value across the whole file. If you don’t have a unique value for each book then you could create a compound ID from the dealer ID and the book’s SKU, e.g. “72/ABC123”.
  • dealers_book_id – Dealer’s inventory code for the book.
  • year – Publication year. This can be given as just four digits. You may include some text as well, but any extra text will be ignored.
  • edition – A description of the edition of the book.
  • publisher – The book’s publisher.
  • price * – This should not include a currency symbol or any extra formatting. Just the price. For items with a listing_type of “auction” this should be the value of the current bid.
  • estimate_min – The lower end of the estimated price range for an auctioned item. Only used for items with a listing_type of “auction”. This should not include a currency symbol or any extra formatting. Just the estimate.
  • estimate_max – The upper end of the estimated price range for an auctioned item. Only used for items with a listing_type of “auction”. This should not include a currency symbol or any extra formatting. Just the estimate.
  • currency – Three letter currency code for the currency that the price is given in. If all your books are listed in the same currency then let us know and this field can be left out.
  • keywords – A set of topics or areas that are relevant for this book.
  • isbn – The book’s ISBN (if it has one).
  • first_edition – A yes/no representing whether this book is a first edition or not.
  • signed – A yes/no representing whether this book is signed or not.
  • dust_jacket – A yes/no representing whether this book has a dust jacket or not.
  • url * – The full URL for the book on your website. This should be a page that displays the full details & description for a single book only. It should not automatically add the book to the user’s basket.
  • image_url – The full URL for an image of the book. This should be the largest version of the image available. If you don’t have an image for this particular book then leave this field blank. Do not give us the URL for a placeholder image.
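
Before publishing a file, it may help to check the records against the rules above. The sketch below tests for the starred fields, unique book_id_on_site values, plain numeric prices, and an end_date on auction items. It also shows one way to build a compound ID and to format end_date in UTC. All function names are hypothetical, and the price pattern (digits with an optional decimal point) is an assumption about what counts as a plain price.

```python
import re
from datetime import datetime, timezone

REQUIRED = ("dealer_name", "dealer_id_on_site", "title",
            "book_id_on_site", "price", "url")
# Assumed shape of a plain price: digits, optionally a decimal part.
PRICE_RE = re.compile(r"^\d+(\.\d+)?$")


def compound_id(dealer_id: str, sku: str) -> str:
    """Combine the dealer ID and SKU into one ID, e.g. "72/ABC123"."""
    return f"{dealer_id}/{sku}"


def format_end_date(moment: datetime) -> str:
    """Render a timezone-aware datetime in UTC as yyyy-mm-dd hh:mm:ss."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def find_problems(books: list[dict]) -> list[str]:
    """Return a human-readable message for each rule a record breaks."""
    problems = []
    seen_ids = set()
    for index, book in enumerate(books):
        for name in REQUIRED:
            if not str(book.get(name, "")).strip():
                problems.append(f"record {index}: missing {name}")
        book_id = book.get("book_id_on_site")
        if book_id is not None:
            if book_id in seen_ids:
                problems.append(
                    f"record {index}: duplicate book_id_on_site {book_id!r}")
            seen_ids.add(book_id)
        price = str(book.get("price", "")).strip()
        if price and not PRICE_RE.match(price):
            problems.append(
                f"record {index}: price {price!r} is not a plain number")
        if book.get("listing_type") == "auction" and not book.get("end_date"):
            problems.append(f"record {index}: auction without end_date")
    return problems
```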

Formatting boolean values

Boolean yes/no columns such as first_edition can be formatted in a number of ways.

  • Positive values can be “yes”, “y”, “true” or “1”.
  • Negative values can be “no”, “n”, “false” or “0”.

When using the JSON object per line format, a JSON boolean value (true or false) can also be used.
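
If your database stores these flags in mixed forms, you could normalise them to a single spelling before export. The sketch below maps the accepted values to “yes” or “no”. The function name to_yes_no is hypothetical, and lower-casing and trimming the input first is a convenience of this sketch, not something the rules above promise for the file itself.

```python
TRUE_VALUES = {"yes", "y", "true", "1"}
FALSE_VALUES = {"no", "n", "false", "0"}


def to_yes_no(value) -> str:
    """Normalise a flag from your own data to "yes" or "no".

    Raises ValueError for anything outside the accepted values so that
    unexpected data is noticed rather than silently exported.
    """
    if isinstance(value, bool):
        return "yes" if value else "no"
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return "yes"
    if text in FALSE_VALUES:
        return "no"
    raise ValueError(f"not a recognised yes/no value: {value!r}")
```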