Using XML in Oracle DB for Google Books

Summary

I demonstrate how to upload and extract information from an XML document that was exported from a Google Books bookshelf. Some data qaulity issues were detected with this bookshelf export. The final data model is a minor change from the model provided by Google Books: individual contributors are split out into a separate table; and the ISBN is used to identify books (where available).

About Google Books

Google Books is a free product from Google that, according to their about page, allows you to:

Search: Book Search works just like web search. Try a search on Google Books or on Google.com. When we find a book with content that contains a match for your search terms, we'll link to it in your search results.

Browse books online: If the book is out of copyright, or the publisher has given us permission, you'll be able to see a preview of the book, and in some cases the entire text. If it's in the public domain, you're free to download a PDF copy.

Buy books or borrow from the library: If you find a book you like, click on the "Buy this book" and "Borrow this book" links to see where you can buy or borrow the print book. You can now also buy the ebook from the Google Play Store.

Learn more fast: We've created reference pages for every book so you can quickly find all kinds of relevant information: book reviews, web references, maps and more.

However, there is another feature that is not mentioned in the about page: a private library. Once you signs in with a Google account, you can access your library.

Within your private library, you can create multiple bookshelves. A bookshelf is created through the New Shelf button on the main page to Your Library (see below).

Screenshot showing "New Shelf" button within the main menu for "Your Library" in "Google Books"

For example, I have created several bookshelves, one of which is called Database. Into that bookshelf, I had added references to various books about databases. References to books can be added either in bulk through a list of ISBNs, or through searching within Google Books. See screenshot below for location of tool to Add by ISBN or ISSN within the Database bookshelf:

screenshot of bookshelf tools menu

Download Bookshelf

I exported my Database bookshelf from Google Books (using the Export as XML tool from the same menu to Add by ISBN or ISSN as shown above) into Database.xml, which I renamed to Database_20240309.xml. The first few lines of this file are (some details are blurred):

<?xml version="1.0" encoding="UTF-8" ?>
<library>
  <list_title>Database</list_title>
  <blurb></blurb>
  <url>https://books.google.com.au/books?uid=*******************&amp;as_coll=****</url>
  <books>
    <book>
      <id>G9AJA91PL54C</id>
      <url>https://books.google.com.au/books?id=G9AJA91PL54C</url>
      <title>Oracle Core: Essential Internals for DBAs and Developers</title>
      <contributor>Jonathan Lewis</contributor>
      <identifier>
        <type>ISBN</type>
        <value>1430239549</value>
      </identifier>
      <labels />
    </book>
    <book>
      <id>45fR_OeonfMC</id>
      <url>https://books.google.com.au/books?id=45fR_OeonfMC</url>
      <title>Effective Oracle by Design</title>
      <contributor>Thomas Kyte</contributor>
      <identifier>
        <type>ISBN</type>
        <value>0072230657</value>
      </identifier>
      <labels>
        <label>Oracle</label>
        <label>Database</label>
      </labels>
    </book>
    ...
    </book>
  </books>
</library>

Note that there is no schema published for this downloaded file.

Note I append the ISO date to the file name to keep versions of the exported Database bookshelf separate.

Analysis of XML Document

I examine the XML document to determine its overall size and structure. I devise a bare bones logical ER diagram.

Size of XML Document

The size of this file is determined using the following command (an alternative is to use the ls -l command):

wc --bytes Database_20240309.xml

The result is, with the size of the file in bytes as the first argument:

11895 Database_20240309.xml

Distribution of XML Tags

To see what tags are used in the XML document, I ran the following code to create a Markdown table skeleton:

sed --quiet \
    --regexp-extended \
    --expression='s!\s*<([^</> ]+).*!\1!p' Database_20240309.xml | \
    sort | \
    uniq --count | \
    awk '{ print $2, "|", $1, "|" }'

The results are presented as follows:

Tag	Num Occurrences	Comment
blurb	1	Mandatory attribute of the root element, `<library>`.
book	29	Element that describes a book.
books	1	Mandatory attribute of the root element, `<library>`. It is a sequence of one or more `<book>` elements.
contributor	29	This is a mandatory attribute of `<book>`, and contains a comma-separated list of authors.
id	29	This is the primary key for the `<book>` element.
identifier	25	This is an optional structured type consisting of a pair of `<type>` and `<value>` tags.
label	38	Classification of the book. A book can have zero or more such labels.
labels	29	Mandatory element of the `<book>` element, and contains a sequence of zero or more `<label>` elements.
library	1	Root element of entire XML document.
list_title	1	Mandatory attribute of root element, `<library>`. It is the name of the bookshelf that is displayed by Google Books.
title	29	Mandatory attribute of `<book>` element, and is the title of the book.
type	25	Part of the `<identifier>` element. Seems to always to be 'ISBN'. Not all books have an ISBN for some reason.
url	30	Tag is used in both the `<library>` and `<book>` elements. This tag is mandatory for both elements.
value	25	Part of the `<identifier>` element. It is the ISBN of the book. Not all books have an ISBN for some reason.

There are 29 books currently in this bookshelf. I used this frequency and the frequency of each tags to classify the above tags, after inspection of the XML document, as follows:

A frequency of one (1) indicates that the tag belongs to the root element, <library>. These are identified as:
- blurb
- books
- list_title
A frequency of 29 indicates that the tag is a mandatory attribute of the <book> element. These are identified as:
- contributor
- id
- labels
- title
A frequency of 30 indicates that the tag is a mandatory attribute of book the <book> and <library> elements. This is identified as:
- url
A frequency less than 29 indicates a possible optional tag for the <book> element. These are identified as:
- identifier
- type
- value
Other tags which have a frequency gather than 30. This is identified as:
- label

Book Identifier

I used the following commands to confirm that all book identifiers are ISBNs. The use of the [hxselect](https://www.w3.org/Tools/HTML-XML-utils/man1/hxselect.html) command allows me to query the path book//identifier//type and return each occurrence on a separate line through the -s '\n' parameter:

hxselect -s '\n' book identifier type < /opt/oracle/xml/Database_20240309.xml | \
  sort | \
  uniq --count

The result is:

     25 <type>ISBN</type>

This means that are four (4) books that do not have an ISBN. This highlights another data quality issue as every book should have an ISBN.

Logical Database Model of Google Books Bookshelf

From this distribution of extracted tags, I infer that the logical model for this Google Books bookshelf is:

Logical database model for Google Books bookshelf

In other words, a library consists of three (3) mandatory attributes (blurb, list title, and URL). Within a library, there are one or more books. Each book has four (4) mandatory attributes (contributpr, ID, title, and URL), and two (2) optional attributes (a structured type called 'identifier' and a list of labels). The structured type, 'identifier', descibes how the book is identified. Usually, this is an ISBN. Each label is a single word.

Load Bookshelf Into XE Database

I created a user and database directory for demonstrating the load and access to an XML document.

Create User

I created a user using the following SQL using a user with DBA access:

-- USER SQL
CREATE USER "DB_BOOKS" IDENTIFIED BY "&pw."  
DEFAULT TABLESPACE "USER"
TEMPORARY TABLESPACE "TEMP";

-- QUOTAS
ALTER USER "DB_BOOKS" QUOTA UNLIMITED ON "DB_BOOKS";

-- SYSTEM PRIVILEGES
GRANT CREATE MATERIALIZED VIEW TO "DB_BOOKS" ;
GRANT CREATE VIEW TO "DB_BOOKS" ;
GRANT CREATE SESSION TO "DB_BOOKS" ;
GRANT CREATE TABLE TO "DB_BOOKS" ;

Create Directory

I created a directory for use by SQL*Loader to hold the source file and SQL*Loader logs.

mkdir --parents /opt/oracle/xml

I used the following SQL to register this directory to the XE database, and to give the user, BOOKS, the necessary privileges to use this directory with SQL*Loader.

CREATE OR REPLACE DIRECTORY xml_dir AS '/opt/oracle/xml';
GRANT READ, WRITE ON DIRECTORY xml_dir TO "DB_BOOKS";

The WRITE privilege on the directory is needed because the log files for SQL*Loader (aka ORACLE_LOADER) are written to the same directory, by default.

Create External Table

The downloaded bookself was copied to the /opt/oracle/xml directory.

I used the following SQL as the DB_BOOKS database user to create an external table, called DATABASE_BOOKSHELF_EXT, that loads the XML from the file /opt/oracle/xml/Database_20240309.xml starting at the <books> tag:

CREATE TABLE "DB_BOOKS"."DATABASE_BOOKSHELF_EXT" 
   ("OBJECT_VALUE" CLOB
   ) 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_LOADER
      DEFAULT DIRECTORY "XML_DIR"
      ACCESS PARAMETERS
      ( records
        xmltag ("books")
        FIELDS NOTRIM MISSING FIELD VALUES ARE NULL (
          object_value char(1000000)
        )
      )
      LOCATION
       ( 'Database_20240309.xml' )
    )
   REJECT LIMIT UNLIMITED ;

The directory, XML_DIR, points to /opt/oracle/xml, as described above in Create Directory.

I am extracting the XML document starting at the <books> tag instead of the real root element, <library>, as I am only instead in the books, rather than the whole bookshelf. This is done through the use of the [XMLTAG](https://docs.oracle.com/en/database/oracle/oracle-database/21/sutil/oracle_loader-access-driver.html#GUID-04134666-4940-484E-BD7C-0E5BAF057DB9) parameter.

The specified size of 1,000,000 characters for the OBJECT_VALUE column is more than enough to load the XML document. (Size of XML Documents says that the size is under 13,000 characters.)

The LOCATION parameter specifies the name of the XML document to be loaded from the /opt/oracle/xml directory.

Verify Creation of External Table

I used the following SQL to confirm that the XML is accessible and that the loaded XML document has the expected root element.

SELECT
    xmltype(e.object_value).getrootelement()
FROM
    DATABASE_BOOKSHELF_EXT e;

Note: I had to cast the CLOB column, OBJECT_VALUE, to XMLTYPE in order to use the XML function, GETROOTELEMENT.

The result is:

XMLTYPE(E.OBJECT_VALUE).GETROOTELEMENT()
----------------------------------------
books

Examine Bookshelf Data

To see what data had been loaded, I ran the following SQL:

SELECT
    b.*
FROM
    database_bookshelf_ext e,
    XMLTABLE ( '/books/book'
            PASSING xmltype(e.object_value)
        COLUMNS
            "ID" VARCHAR2(16) PATH 'id',
            url VARCHAR2(512) PATH 'url',
            title VARCHAR2(1024) PATH 'title',
            contributer VARCHAR2(1024) PATH 'contributor',
            isbn VARCHAR2(16) PATH 'identifier/value'
    )                      b;

The first five (5) rows are:

ID	URL	TITLE	CONTRIBUTER	ISBN
G9AJA91PL54C	`https://books.google.com.au/books?id=G9AJA91PL54C`	Oracle Core: Essential Internals for DBAs and Developers	Jonathan Lewis	1430239549
45fR_OeonfMC	`https://books.google.com.au/books?id=45fR_OeonfMC`	Effective Oracle by Design	Thomas Kyte	0072230657
1Aqw6mrxz5AC	`https://books.google.com.au/books?id=1Aqw6mrxz5AC`	Expert Oracle Exadata	Kerry Osborne, Randy Johnson, Tanel Põder	1430233923
lA3QygAACAAJ	`https://books.google.com.au/books?id=lA3QygAACAAJ`	Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement	Eric Redmond, Jim R. Wilson	1934356921
-0QPlh0stXAC	`https://books.google.com.au/books?id=-0QPlh0stXAC`	Data Architecture: From Zen to Reality	Charles Tupper	0123851270

Extract Labels

When I try:

SELECT
    l.*
FROM
    database_bookshelf_ext e,
    XMLTABLE ( '/books/book'
            PASSING xmltype(e.object_value)
        COLUMNS
            "ID" VARCHAR2(16) PATH 'id',
            label VARCHAR2(16) PATH 'labels/label'
    )                      l;

I get the following error message:

  2    3    4    5    6    7    8    9   10  ERROR:
ORA-19279: XPTY0004 - XQuery dynamic type mismatch: expected singleton sequence
- got multi-item sequence



no rows selected

There is a hit from Oracle Support matching this error message: ORA-19279 XQuery dynamic type mismatch expected singleton sequence - got multi-item sequence (Doc ID 1505158.1). Based on the example provided in that support note:

SELECT
    l.*
FROM
  database_bookshelf_ext e,
  XMLTABLE (
    'for $i in /books/book/labels/label
      return <row>
        {
          $i/../../id,
          $i
        }
      </row>'
    PASSING xmltype(e.object_value)
    COLUMNS
      "ID" VARCHAR2(16) PATH 'id',
      label VARCHAR2(16) PATH 'label'
    )                      l;

This SQL takes each row from the external table (DATABASE_BOOKSHELF_EXT) and converts the CLOB in the OBJECT_VALUE into XML using the XMLTABLE function. For each occurrence on the LABEL tag that is found by the XPATH /books/book/labels/label, I get the contents of the LABEL tag (which is pointed to by the $i loop variable), then I go up two (2) levels (to XPATH /books/book) and get the contents of the ID tag. These two (2) values are combined into a row for the XMLTABLE function.

The first few rows returned are:

ID               LABEL
---------------- --------------------------------
45fR_OeonfMC     Oracle
45fR_OeonfMC     Database
suhSAAAAMAAJ     Database
suhSAAAAMAAJ     Security
OhNBJ8_wMLwC     Oracle
OhNBJ8_wMLwC     Database
OhNBJ8_wMLwC     Security
2egNzTk871wC     Database
2egNzTk871wC     Relational Theory
2egNzTk871wC     Logic
2ImPFP6Yk64C     Database

Extract Contributors

Following on from my success in extracting the labels from the uploaded XML document, I tried the following SQL to extract individual contributors:

SELECT
    c.*
  FROM
    database_bookshelf_ext e,
    XMLTABLE (
      'for $b in /books/book/contributor[text()]
        let $list := ora:tokenize($b, ","),
            $cnt  := count($list)
        for $c at $r in $list where $r < $cnt
          return $c'
    PASSING xmltype(e.object_value)
    COLUMNS
      contributor VARCHAR2(512) PATH '.'
    ) c;

This query failed with:

SELECT
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-19176: FORX0003: regular expression matches zero-length string

The problem is caused by books with a single contributor. That is, there is no comma for contributor elements as shown by the following command:

grep --extended-regexp '<contributor>[^,]+</contributor>' /opt/oracle/xml/Database_20240309.xml

The contributor elements with only one (1) person are:

      <contributor>Jonathan Lewis</contributor>
      <contributor>Thomas Kyte</contributor>
      <contributor>Charles Tupper</contributor>
      <contributor>Fabian Pascal</contributor>
      <contributor>S.M. Deen</contributor>
      <contributor>John G. Hughes</contributor>
      <contributor>Clare Churcher</contributor>
      <contributor>C.J. Date</contributor>
      <contributor>David Litchfield</contributor>
      <contributor>C. J. Date</contributor>
      <contributor>Edward Whalen</contributor>
      <contributor>Ron Ben-Natan</contributor>
      <contributor>James Luetkehoelter</contributor>
      <contributor>Christopher Date</contributor>
      <contributor>C.J. Date</contributor>
      <contributor>Dan Tow</contributor>
      <contributor>Sajal Dam</contributor>
      <contributor>Sam R. Alapati</contributor>
      <contributor>Thomas Kyte</contributor>

Note: There are two (2) C.J. Dates (the other is 'C. J. Date'). There are some data quality issues with Google Books.

My XQUERY knowledge is too limited to get around this issue. All of the examples that I have encountered so far getting around this issue by appending a comma to the contributor element text, such as shown by Splitting Strings using XMLTable and FLWOR expressions. I consider this to be a kludge. I will have to do a two-step process:

Extract the text of the contributor element into a SQL view
Use SQL to extract the individual contributors.

According to Chris Saxon in his post, How to split comma separated value strings into rows in Oracle Database, the secret is to use regexp_substr looping through a CONNECT BY. However, this solution only works for a single row table.

Based on the example from 13.5.3 Pipelined Table Functions as Transformation Functions

First, I create the package definition. A package helps to hide the three (3) types used as these are used only within the package:

CONTRIBUTOR_LIST_TYPE for the input row.
CONTRIBUTOR_TYPE for the output row.
CONTRIBUTOR_TAB for the return value of the function. That is, it has to be a table of the output row type.

I use a weak cursor, SYS_REFCURSOR. (See 14.18 Cursor Variable Declaration for more details.) With a weak cursor, there is no checking that of columns by the RDBMS engine. Since I have complete control in this schema, I am willing to accept the risk of suppressing type checking for cursors.

CREATE OR REPLACE PACKAGE contributor_pkg AUTHID current_user IS
    TYPE contributor_list_type IS RECORD (
        "ID"             VARCHAR2(32),
        contributor_list VARCHAR2(1024)
    );
    TYPE contributor_type IS RECORD (
        "ID"        VARCHAR2(32),
        contributor VARCHAR2(64)
    );
    TYPE contributor_tab IS
        TABLE OF contributor_type;
    FUNCTION get_contributor (
        p_csr sys_refcursor
    ) RETURN contributor_tab
        PIPELINED;

END contributor_pkg;
/

The actual implementation that uses Chris Saxon's idea but within a PL/SQL FOR loop:

CREATE OR REPLACE PACKAGE BODY contributor_pkg IS

    FUNCTION get_contributor (
        p_csr SYS_REFCURSOR
    ) RETURN contributor_tab
        PIPELINED
    IS
        out_rec           contributor_type;
        in_rec            contributor_list_type;
        l_num_contributor NUMBER := 0;
        l_idx_contributor NUMBER := 0;
    BEGIN
        LOOP
            FETCH p_csr INTO in_rec;  -- input row
            EXIT WHEN p_csr%notfound;
            out_rec.id := in_rec.id;
            l_num_contributor := regexp_count(in_rec.contributor_list, '[^,]+');
            FOR l_idx_contributor IN 1..l_num_contributor LOOP
                out_rec.contributor := ltrim(regexp_substr(in_rec.contributor_list, '[^,]+', 1, l_idx_contributor));
                PIPE ROW ( out_rec );
            END LOOP;

        END LOOP;

        CLOSE p_csr;
        RETURN;
    END get_contributor;

END contributor_pkg;
/

NOTE: The LTRIM function is needed to remove spaces after commas.

For each CONTRIBUTOR tag encountered, I use the REGEXP_SUBSTR to extract the nth occurrence of the required regular expression (which is a string of at least one (1) character that is not a comma, i.e. '[^,]+'). The REGEXP_COUNT gives the number of such substrings, and is used to drive the FOR loop.

For each substring that I find, I create a row and pass that to the calling procedure or SQL statement using the PIPE ROW statement.

This worked as was verified by the following SQL. This SQL has three (3) steps:

Create a data source using the CURSOR expression on the innermost SQL statement that extracts ID and CONTRIBUTOR elements from each BOOK element of the uploaded XML document.
Transforms this data source using the contributor_pkg.get_contributor function by splitting individual contributors from the comma-separated list in the CONTRIBUTOR element.
Converts the output from the transformation function back into an SQL table using the TABLE collection expression.

SELECT
    *
FROM
    TABLE ( contributor_pkg.get_contributor(CURSOR(
        SELECT
            c.*
        FROM
            database_bookshelf_ext e, XMLTABLE('/books/book'
                    PASSING xmltype(e.object_value)
                COLUMNS
                    "ID" VARCHAR2(32) PATH 'id',
                    contributors VARCHAR2(1024) PATH 'contributor'
            )                      c
    )) );

The results are:

ID            CONTRIBUTOR
G9AJA91PL54C  Jonathan Lewis
45fR_OeonfMC  Thomas Kyte
1Aqw6mrxz5AC  Kerry Osborne
1Aqw6mrxz5AC  Randy Johnson
1Aqw6mrxz5AC  Tanel Põder
lA3QygAACAAJ  Eric Redmond
lA3QygAACAAJ  Jim R. Wilson
-0QPlh0stXAC  Charles Tupper
t9ZQAAAAMAAJ  Fabian Pascal
2dsUS-64JmIC  S.M. Deen
g5lLCV2mCnwC  John G. Hughes
r1OpkDVDDmwC  Clare Churcher
grTubz0fjSEC  C.J. Date
grTubz0fjSEC  Hugh Darwen
grTubz0fjSEC  Nikos A. Lorentzos
grTubz0fjSEC  Nikos Lorentzos
FU7uuHc3oNcC  C.J. Date
X85QAAAAMAAJ  C. J. Date
X85QAAAAMAAJ  Hugh Darwen
rdBQAAAAMAAJ  C. J. Date
rdBQAAAAMAAJ  Hugh Darwen
suhSAAAAMAAJ  David Litchfield
suhSAAAAMAAJ  Chris Anley
suhSAAAAMAAJ  John Heasman
suhSAAAAMAAJ  Bill Grindlay
OhNBJ8_wMLwC  David Litchfield
2egNzTk871wC  C. J. Date
2ImPFP6Yk64C  Edward Whalen
-uYZcnEt4uQC  Lex deHaan
-uYZcnEt4uQC  Toon Koppelaars
TD5eibhYyd8C  Ron Ben-Natan
v43VNrxWo70C  James Luetkehoelter
y_eVBB5qdwMC  Christopher Date
406_pJtiJ6sC  C.J. Date
dOYPtfVqS_MC  Dan Tow
yK9wRfYX9nkC  Sajal Dam
RRsHc7iXcyAC  Sam R. Alapati
GjYTbJTIr54C  Steve Shaw
GjYTbJTIr54C  Julian Dyke
EmvWP1vgTsYC  Rick Greenwald
EmvWP1vgTsYC  Robert Stackowiak
EmvWP1vgTsYC  Jonathan Stern
3aEIIqRHkY8C  Matthew Hart
3aEIIqRHkY8C  Robert G. Freeman
TmPoYfpeJAUC  Thomas Kyte

Create Materialised Views

To complete the transformation of the uploaded extract from the Google Books bookshelf, I want to create three (3) materialised views:

BOOKS
BOOK_LABELS
BOOK_CONTRIBUTORS

Another advantage of using materialised views is that SQL*Loader is invoked every time the external table is accessed. This makes sense as the Oracle RDBMS engine has no control over external files, and has to assume that the contents of the file can be changed since the last access, thereby invalidating any possible caching. With materialised views, I can have a consistent snapshot of the loaded data.

Create BOOKS Materialised View

I used the following SQL to create a materialised view, called BOOKS, with four (4) columns:

ID for that assigned by the Google Books to uniquely identify an entry
TITLE for the book title
URL for the Google Books page about the book
ISBN of the book

CREATE MATERIALIZED VIEW books AS
  SELECT
      b.ID,
      b.title,
      b.url,
      b.isbn
  FROM
    database_bookshelf_ext e,
    XMLTABLE ( '/books/book'
            PASSING xmltype(e.object_value)
        COLUMNS
            "ID" VARCHAR2(16) PATH 'id',
            url VARCHAR2(512) PATH 'url',
            title VARCHAR2(1024) PATH 'title',
            contributer VARCHAR2(1024) PATH 'contributor',
            isbn VARCHAR2(16) PATH 'identifier/value'
    )                      b;

I ran the following SQL to find books without an ISBN:

SELECT title FROM books WHERE isbn IS NULL;

There are five (5) books that do not have an ISBN:

TITLE
--------------------------------------------------------------------------------
Practical Issues in Database Management: A Reference for the Thinking Practitioner
Databases, Types and the Relational Model: The Third Manifesto
Foundation for Object/relational Databases: The Third Manifesto : a Detailed Study of the Impact of Objects and Type Theory on the Relational Model of Data Including a Comprehensive Proposal for Type Inheritance
The Database Hacker's Handbook: Defending Database Servers

This is strange as the book, Practical Issues in Database Management: A Reference for the Thinking Practitioner, has the following ISBNs:

9780201485554
0201485559

Just another data quality issue from Google Books!

Create BOOK_LABELS Materialised View

I used the following SQL to create the materialised view, BOOK_LABELS, with the columns:

ID for that assigned by the Google Books to uniquely identify an entry
LABEL describing a book

CREATE MATERIALIZED VIEW book_labels AS
  SELECT
      l.ID,
      l.label
  FROM
    database_bookshelf_ext e,
  XMLTABLE (
    'for $i in /books/book/labels/label
      return <row>
        {
          $i/../../id,
          $i
        }
      </row>'
    PASSING xmltype(e.object_value)
    COLUMNS
      "ID" VARCHAR2(16) PATH 'id',
      label VARCHAR2(32) PATH 'label'
    ) l;

I used the following SQL to validate the materialised view, BOOK_LABELS, by getting the frequency distribution for each label:

SELECT
    label,
    count(*) AS freq
  FROM
    book_labels
  GROUP BY
    label
  ORDER BY
    label;

There are nine (9) different labels with the following frequencies (FREQ):

LABEL                            FREQ
-------------------------------- ----------
Database                                 15
Exam Guide                                1
Linux                                     2
Logic                                     1
Oracle                                    9
Relational Theory                         3
SQL                                       2
SQL Server                                2
Security                                  3

9 rows selected.

Create BOOK_CONTRIBUTORS Materialised View

I used the following SQL to create the materialised view, BOOK_LABELS, with the columns:

ID for that assigned by the Google Books to uniquely identify an entry
CONTRIBUTOR for each author of a book

CREATE MATERIALIZED VIEW book_contributors AS
  SELECT
      "ID",
      contributor
  FROM
    TABLE (
      contributor_pkg.get_contributor(
        CURSOR(
          SELECT
              c.*
            FROM
              database_bookshelf_ext e,
              XMLTABLE(
                '/books/book'
                PASSING xmltype(e.object_value)
                COLUMNS
                  "ID" VARCHAR2(32) PATH 'id',
                  contributors VARCHAR2(1024) PATH 'contributor'
              ) c
        )
      )
    );

I verified the materialised view by running the following SQL:

COLUMN contributor FOR A30
SELECT "ID", contributor FROM book_contributors

The first few rows returned are:

ID                 CONTRIBUTOR
------------------ --------------
G9AJA91PL54C       Jonathan Lewis
45fR_OeonfMC       Thomas Kyte
1Aqw6mrxz5AC       Kerry Osborne
1Aqw6mrxz5AC       Randy Johnson
1Aqw6mrxz5AC       Tanel P?der
lA3QygAACAAJ       Eric Redmond
lA3QygAACAAJ       Jim R. Wilson
-0QPlh0stXAC       Charles Tupper
t9ZQAAAAMAAJ       Fabian Pascal
2dsUS-64JmIC       S.M. Deen
g5lLCV2mCnwC       John G. Hughes

Logical Database Model of DB_BOOKS Schema

With the completion of materialised views, I composed a new logical database model for the DB_BOOKS schema:

Logical database model for the DB_BOOKS schema

There are three (3) entities:

BOOKS with attributes:
- ID is primary key
- TITLE is mandatory
- URL is mandatory
BOOKS LABELS with attributes:
- ID is mandatory and refers to BOOKS.ID
- LABEL is mandatory
- ID and LABEL form the primary key
BOOK CONTRIBUTORS
- ID is mandatory and refers to BOOKS.ID
- CONTRIBUTOR is mandatory
- ID and CONTRIBUTOR form the primary key

References

Oracle Manuals

Oracle Support

ORA-19279 XQuery dynamic type mismatch expected singleton sequence - got multi-item sequence (Doc ID 1505158.1)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Using XML in Oracle DB for Google Books

Summary

Table of Contents

About Google Books

Download Bookshelf

Analysis of XML Document

Size of XML Document

Distribution of XML Tags

Book Identifier

Logical Database Model of Google Books Bookshelf

Load Bookshelf Into XE Database

Create User

Create Directory

Create External Table

Verify Creation of External Table

Examine Bookshelf Data

Extract Labels

Extract Contributors

Create Materialised Views

Create BOOKS Materialised View

Create BOOK_LABELS Materialised View

Create BOOK_CONTRIBUTORS Materialised View

Logical Database Model of DB_BOOKS Schema

References

Oracle Manuals

Oracle Support

Linux Commands

Wikipedia

Other Web Pages

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally