Learn and use an ETL tool
You'll save yourself a lot of time and pain if you use ETL tools like Pentaho Kettle or Talend Studio to automate the data transfer and merging. They'll happily connect to the Pg instance and the Oracle instance and do the required copying and merging using quite flexible strategies. They can consume CSV dumps too.
Use CSV dumps
If you need to use dumps, try to dump individual tables using COPY ... FORMAT CSV rather than attempting to wrangle pg_dump into doing Oracle-compatible partial dumps. This will save you a great deal of time when importing them into other DBMSes, since pretty much everything understands CSV. Keep the schemas consistent for each instance so you don't need to handle different schemas for different instances.
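For example, a per-table dump might look something like this (the table and file names are placeholders; use psql's \copy instead if you can't write to the server's filesystem):

-- Dump one table to CSV with a header row; path and table name are examples only
COPY my_table TO '/tmp/my_table.csv' WITH (FORMAT csv, HEADER true);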
Use disjoint set or composite keys
Where unique keys are used, make sure you assign disjoint key ranges to different instances (e.g. the key range is instance_id * 10^7 to ((instance_id + 1) * 10^7) - 1) or use composite keys of (instance_id, generated_key). If you use disjoint ranges, add a CHECK constraint to enforce it; similarly, if using composite keys, add a CHECK constraint to force the instance_id part to a constant.
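As a rough sketch (table and column names are made up; here instance 3 owns the range 3 * 10^7 to 4 * 10^7 - 1):

-- Disjoint-range variant: this instance may only generate keys in its own range
CREATE TABLE widget (
    widget_id bigint PRIMARY KEY,
    CONSTRAINT widget_id_in_instance_range
        CHECK (widget_id >= 3 * 10000000 AND widget_id < 4 * 10000000)
);

-- Composite-key variant: pin instance_id to a constant on this instance
CREATE TABLE widget_composite (
    instance_id   integer NOT NULL,
    generated_key bigint  NOT NULL,
    PRIMARY KEY (instance_id, generated_key),
    CONSTRAINT instance_id_is_local CHECK (instance_id = 3)
);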
Lowest common denominator data types
Avoid using PostgreSQL arrays, record types, and hstore. They'll all dump as string literals in CSV dumps and you'll have to parse them in order to do something useful with them. For example:
regress=> SELECT ROW( ARRAY[1,2,3], ARRAY['a','b','withquotehere:''inline', 'doublequotehere:"'], ARRAY[ ROW(1,'a'), ROW(2,'b') ] );
row
--------------------------------------------------------------------------------------------
("{1,2,3}","{a,b,withquotehere:'inline,""doublequotehere:\\""""}","{""(1,a)"",""(2,b)""}")
(1 row)
You really, really don't want to have to parse that. It's a pretty extreme case: an anonymous record value containing three different arrays, one an array of integers, one an array of strings including quotes, and one an array of anonymous two-field row tuples. But seriously, use plain relational modelling and CSV export; the convenience of composite types, hstore, and arrays isn't worth the compatibility pain if you know you'll be doing lots of cross-DB work.
It's actually possible to use them in your schema; you'll just need to COPY from a set of queries that expand them into appropriate relational structures. Since you're going to have to do that, you might as well use regular relational modelling in the first place and save yourself some pain.
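For instance, an array column can be expanded to one row per element before export; something along these lines should work (parent and tags are hypothetical names):

-- Expand the array into ordinary rows and dump the result instead of the raw table
COPY (
    SELECT p.id, unnest(p.tags) AS tag
    FROM   parent p
) TO '/tmp/parent_tags.csv' WITH (FORMAT csv, HEADER true);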
Enums are OK; you can just treat them as strings when loading them. They're just a convenient and performant shortcut for a side table and a FK constraint anyway, allowing you to avoid joins and sorts.
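For example (type and table names are made up), an enum column dumps as its plain text label in CSV, so the receiving side can simply load it into a varchar:

-- Enum values appear as plain text ('new', 'open', 'closed') in CSV output
CREATE TYPE ticket_status AS ENUM ('new', 'open', 'closed');

CREATE TABLE ticket (
    id     bigint PRIMARY KEY,
    status ticket_status NOT NULL DEFAULT 'new'
);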
In modern PostgreSQL, bytea is output as hex strings; in older versions it's output as octal escape sequences. Either way, you can load it into a varchar column in Oracle and then convert it if you can't get Oracle to natively understand the representation. I haven't dealt with Oracle binary types, so I can't help much more there.
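You can check which representation your server produces; on 9.0 and later the default looks like this:

-- 'hex' on modern versions, 'escape' on old ones
SHOW bytea_output;

-- Emitted as \xdeadbeef in CSV output when bytea_output is 'hex'
SELECT '\xdeadbeef'::bytea;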
It looks like what you actually want is a summary of the total bytes used in each folder, right? I believe triggers in Oracle are not the best choice of technology to provide that functionality. This could be better modelled using a view or, even better, a materialized view.
The problem with triggers is that they serialize the DML against the table and don't scale well. Imagine a situation where you are updating and inserting lots of rows: for each one of them you'll get an extra update against the folder table.
If you model your problem with a view instead you would not incur the overhead of the extra DML on the folder table. A simple query like this should do the trick:
SELECT fd.name       AS folder_name,
       SUM(f.bytes)  AS folder_size
FROM   folders fd, files f
WHERE  fd.id = f.folderId
GROUP  BY fd.name;
You may then make this query persistent by creating a view (the sum will run every time you query the view) or a materialized view with fast refresh on commit (in this case the materialized view stores the calculation as a pre-computed table, useful if the table is huge and the sum is costly).
The materialized view "fast refresh on commit" feature delegates to Oracle the task of recalculating the aggregation every time the underlying table(s) change.
The Data Warehousing Guide has an in-depth explanation of this feature:
https://docs.oracle.com/database/121/DWHSG/refresh.htm#DWHSG015
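As a rough sketch of what that might look like (grouping by folderId only to keep it to a single base table; a fast-refreshable aggregate materialized view needs a materialized view log on the base table plus the extra COUNT columns, and the exact requirements vary by Oracle version):

-- Materialized view log required for fast refresh of an aggregate MV
CREATE MATERIALIZED VIEW LOG ON files
    WITH ROWID (folderId, bytes)
    INCLUDING NEW VALUES;

-- Oracle keeps this aggregate up to date on every commit
CREATE MATERIALIZED VIEW folder_size_mv
    REFRESH FAST ON COMMIT
AS
SELECT f.folderId,
       SUM(f.bytes)   AS folder_size,
       COUNT(f.bytes) AS cnt_bytes,
       COUNT(*)       AS cnt
FROM   files f
GROUP  BY f.folderId;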
The following steps will need some tweaking, as you are working with multiple machines - your client machine, a file server (Windows?), a database server, etc. Thus, the paths and the file locations are just examples (which have been tested, though). Steps A-B-C are performed only once.
Prerequisites / setup:
[A] Operating system: create a folder for storing the dir listings (in the form of text files, see footnote 1).
[B] Oracle: you need the permissions CREATE TABLE and CREATE ANY DIRECTORY (i.e. read/write).
[C] In Oracle, create a directory and an external table.
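Something along these lines should work for step [C]; the directory name, the path, and the table layout are just examples, and the whole listing line is kept in a single column here:

-- The path must exist on the database server and point at the folder from step [A]
CREATE OR REPLACE DIRECTORY dir_listing_dir AS 'C:\dir_listings';

-- One row per line of dir_contents.txt; the '|' terminator is arbitrary, it just
-- needs to be a character that never appears in the listing
CREATE TABLE dir_contents_ext (
    line VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY dir_listing_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY '|'
        MISSING FIELD VALUES ARE NULL
        (
            line CHAR(4000)
        )
    )
    LOCATION ('dir_contents.txt')
)
REJECT LIMIT UNLIMITED;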
Get the file list, and use the external table:
[1] Open a command line window. Navigate to the network drive (X:). Execute the following command:
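The exact command isn't reproduced here; on Windows, a recursive listing redirected into a text file looks something like this (the flags and the output path are only an illustration - adjust them to the folder you created in step [A]):

rem example only: /s recurses into subfolders, /b prints bare full paths
dir /s /b > C:\dir_listings\dir_contents.txt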
Load the dir_contents.txt file into a text editor, make sure that it contains what you need. If - as you wrote - Oracle runs on a different machine, you may have to upload it (via sftp).
[2] Use the external table (see footnote2).
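Querying it is then just plain SQL; the file is re-read every time the external table is queried. For example, assuming the table from step [C]:

-- Pick out the lines you are interested in from the listing
SELECT line
FROM   dir_contents_ext
WHERE  UPPER(line) LIKE '%.TXT';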
footnote1: This must be accessible when you are working with Oracle. If you collect the data file(s) on your client machine first, you must "upload" the files (e.g. via sftp) into the directory that you specified when executing CREATE DIRECTORY ...
footnote2: The sample file contains the output of the dir listing command from step [1] (notice the paths and the command redirection). If you are working with Linux, use "find / -type f > dir_contents.txt" instead.