I have a project that I'm working on where I extract data from PDFs and map/visualize the relationships between the extracted pieces.
Here's an example of my problem:
file: 11425646.pdf
author: bob
company: abc co
date: 1/1/2011
mentioned_users: [alice,sue,mike,sally]
images: [1958.jpg,535.jpg,35735.jpg]
file: 15421484.pdf
author: betty
company: ionga
date: 2/15/2011
mentioned_users: [john,alex,george]
images: [819.jpg,9841.jpg,78.jpg]
file: 11975748.pdf
author: micah
company: zoobi
date: 9/26/2011
mentioned_users: [alice,chris,joe]
images: [526.jpg,5835.jpg,355.jpg]
How can I model this in a graph database like Neo4j?
Given one piece of data (such as a person's name), I would like to find everything related to it (images, co-mentions, authors, etc.) up to a depth of 10. Here's the structure I have in mind, but I'm not sure it's a good approach (this isn't any kind of actual syntax):
[file: 11425646.pdf date:1/1/2011] -written_by-> bob
[file: 11425646.pdf date:1/1/2011] -from_company-> abc co
[file: 11425646.pdf date:1/1/2011] -mentions-> alice
[file: 11425646.pdf date:1/1/2011] -mentions-> sue
[file: 11425646.pdf date:1/1/2011] -mentions-> mike
[file: 11425646.pdf date:1/1/2011] -mentions-> sally
[file: 11425646.pdf date:1/1/2011] -has_image-> 1958.jpg
[file: 11425646.pdf date:1/1/2011] -has_image-> 535.jpg
[file: 11425646.pdf date:1/1/2011] -has_image-> 35735.jpg
Is this the right way to structure this data in a graph database?
Best Answer
Modeling Neo4j graphs from relational data is quite simple.
Note: a mapping from a relational model to a graph may take only selected entities from the relational model, and a single table row can explode into multiple vertices and multiple edges.
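To make the "one row explodes into many vertices and edges" point concrete, here is a minimal Python sketch using the first record from the question. It is not Neo4j code. The function name explode_record, the (label, value) node keys, and the lowercase relationship names are illustrative assumptions borrowed from the question's own notation.

```python
def explode_record(record):
    """Turn one flat record into graph nodes and edges.

    Nodes are keyed by (label, value); the file node carries the date
    as a property, and every other field becomes its own node.
    """
    file_node = ("file", record["file"])
    nodes = {file_node: {"date": record["date"]}}
    edges = []

    def link(relationship, label, value):
        # Create the target node if needed and connect the file to it.
        target = (label, value)
        nodes.setdefault(target, {})
        edges.append((file_node, relationship, target))

    link("written_by", "author", record["author"])
    link("from_company", "company", record["company"])
    for user in record["mentioned_users"]:
        link("mentions", "user", user)
    for image in record["images"]:
        link("has_image", "image", image)
    return nodes, edges


record = {
    "file": "11425646.pdf",
    "author": "bob",
    "company": "abc co",
    "date": "1/1/2011",
    "mentioned_users": ["alice", "sue", "mike", "sally"],
    "images": ["1958.jpg", "535.jpg", "35735.jpg"],
}

nodes, edges = explode_record(record)
print(f"{len(nodes)} nodes, {len(edges)} edges")  # prints "10 nodes, 9 edges"
```

The nine edges correspond one-to-one to the nine lines sketched in the question, and the date stays on the file node rather than becoming a node of its own.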
Yes, it looks OK. Assuming that file, author, company, user, and image are nodes, and date is only an attribute, your proposed structure should convert to a graph as is.
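With that node/attribute split, the "find everything related to alice up to depth 10" requirement becomes a depth-limited traversal. The sketch below illustrates the idea in plain Python over the three records from the question. It is only an in-memory approximation, not Neo4j's API: the helper names build_adjacency and related are hypothetical, and relationships are treated as undirected so the search can walk from a mentioned user back to the file. In Neo4j itself this kind of question would normally be expressed as a variable-length relationship pattern in Cypher.

```python
from collections import defaultdict, deque

records = [
    {"file": "11425646.pdf", "author": "bob", "company": "abc co",
     "mentioned_users": ["alice", "sue", "mike", "sally"],
     "images": ["1958.jpg", "535.jpg", "35735.jpg"]},
    {"file": "15421484.pdf", "author": "betty", "company": "ionga",
     "mentioned_users": ["john", "alex", "george"],
     "images": ["819.jpg", "9841.jpg", "78.jpg"]},
    {"file": "11975748.pdf", "author": "micah", "company": "zoobi",
     "mentioned_users": ["alice", "chris", "joe"],
     "images": ["526.jpg", "5835.jpg", "355.jpg"]},
]


def build_adjacency(records):
    """Build an undirected adjacency map keyed by (label, value) nodes."""
    adjacency = defaultdict(set)
    for r in records:
        file_node = ("file", r["file"])
        targets = [("author", r["author"]), ("company", r["company"])]
        targets += [("user", u) for u in r["mentioned_users"]]
        targets += [("image", i) for i in r["images"]]
        for target in targets:
            adjacency[file_node].add(target)
            adjacency[target].add(file_node)
    return adjacency


def related(adjacency, start, max_depth=10):
    """Breadth-first search returning {node: depth} within max_depth hops."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth_of:
                depth_of[neighbour] = depth_of[node] + 1
                queue.append(neighbour)
    del depth_of[start]  # the start node is not "related to itself"
    return depth_of


adjacency = build_adjacency(records)
found = related(adjacency, ("user", "alice"))
authors = sorted(value for (label, value) in found if label == "author")
print(authors)  # prints ['bob', 'micah']
```

Because alice is mentioned in two of the three files, the search reaches both of those files at depth 1 and their authors, companies, co-mentioned users, and images at depth 2. Nothing from betty's file is returned, since it shares no node with the other two.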
Useful links: data modeling guide and Cypher reference