diff --git a/modules/ROOT/content-nav.adoc b/modules/ROOT/content-nav.adoc index cfe641a..667acc8 100644 --- a/modules/ROOT/content-nav.adoc +++ b/modules/ROOT/content-nav.adoc @@ -9,6 +9,7 @@ ** xref:finserv/retail-banking/index.adoc[] *** xref:finserv/retail-banking/automated-facial-recognition.adoc[] +*** xref:finserv/retail-banking/entity-resolution.adoc[] *** xref:finserv/retail-banking/synthetic-identity-fraud.adoc[] *** xref:finserv/retail-banking/transaction-ring/transaction-ring-introduction.adoc[] diff --git a/modules/ROOT/images/agnostic/agnostic-entity-resolution-model.svg b/modules/ROOT/images/agnostic/agnostic-entity-resolution-model.svg new file mode 100644 index 0000000..17277a4 --- /dev/null +++ b/modules/ROOT/images/agnostic/agnostic-entity-resolution-model.svg @@ -0,0 +1 @@ +AddressRegAddressAddressLine1:37 ALBYN PLACERegAddressAddressLine2:ALBYN PLACERegAddressPostTown:ABERDEENRegAddressPostCode:AB101JBFullAddress:37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JBLatitude:57.14266Longitude:-2.12415AddressRegAddressAddressLine1:COMPANY NAMERegAddressAddressLine2:37 ALBYN PLACERegAddressPostTown:ABERDEENRegAddressPostCode:AB101JBFullAddress:COMPANY NAME 37 ALBYN PLACE ABERDEEN AB101JBLatitude:57.14266Longitude:-2.12415 \ No newline at end of file diff --git a/modules/ROOT/images/agnostic/agnostic-entity-resolution-schema.svg b/modules/ROOT/images/agnostic/agnostic-entity-resolution-schema.svg new file mode 100644 index 0000000..60620f2 --- /dev/null +++ b/modules/ROOT/images/agnostic/agnostic-entity-resolution-schema.svg @@ -0,0 +1 @@ +Neo4j Graph VisualizationCreated using Neo4j (http://www.neo4j.com/) Address \ No newline at end of file diff --git a/modules/ROOT/pages/agnostic/entity-resolution.adoc b/modules/ROOT/pages/agnostic/entity-resolution.adoc new file mode 100644 index 0000000..9192208 --- /dev/null +++ b/modules/ROOT/pages/agnostic/entity-resolution.adoc @@ -0,0 +1,226 @@ += Entity Resolution - Technical Walkthrough + +== 1. 
Industry Introductions

* xref:finserv/retail-banking/entity-resolution.adoc[Retail Banking - Entity Resolution]

== 2. Introduction

As previously discussed, entity resolution is a crucial aspect of any data project, regardless of the type of data being analysed. This includes:

* Customers
* Trades
* Products
* Orders
* Addresses
* Policies
* Product applications
* and much more

Whenever a human is required to enter information into a free text box, there is potential for data inconsistencies. This guide aims to demonstrate how a knowledge graph is uniquely positioned to assist with this issue. In this example, we will focus on de-duplicating addresses, but the same principles can be applied to any aspect of your organisation.

== 3. Modelling

This section shows examples of Cypher queries on an example graph. The intention is to illustrate what the queries look like and provide a guide on how to structure your data in a real setting. We will do this on a small graph of several nodes. The example graph is based on the data model below:

=== 3.1. Data Model

image::agnostic/agnostic-entity-resolution-model.svg[]

==== 3.1.1. Required Fields

Below are the fields required to get started:

`Address` Node:

* `RegAddressAddressLine1`: First line of the address
* `RegAddressAddressLine2`: Second line of the address
* `RegAddressPostTown`: Town
* `RegAddressPostCode`: Postcode
* `Latitude`: Latitude based on postcode
* `Longitude`: Longitude based on postcode

=== 3.2. 
Demo Data

The following Cypher statement will create the example graph in the Neo4j database:

[source, cypher, role=noheader]
----
// Create all Address Nodes
CREATE (:Address {RegAddressAddressLine1: "37 ALBYN PLACE", RegAddressAddressLine2: "ALBYN PLACE", RegAddressPostTown: "ABERDEEN", RegAddressPostCode: "AB101JB", FullAddress: "37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JB"})
CREATE (:Address {RegAddressAddressLine1: "COMPANY NAME", RegAddressAddressLine2: "37 ALBYN PLACE", RegAddressPostTown: "ABERDEEN", RegAddressPostCode: "AB101JB", FullAddress: "COMPANY NAME 37 ALBYN PLACE ABERDEEN AB101JB"});

// Update each Address Node with longitude and latitude
MATCH (a:Address)
CALL apoc.spatial.geocode(a.RegAddressPostCode) YIELD location
SET a.Latitude = location.latitude,
    a.Longitude = location.longitude;
----

=== 3.3. Neo4j Schema

If you call:

[source, cypher, role=noheader]
----
// Show the Neo4j schema
CALL db.schema.visualization()
----

You will see the following response:

image::agnostic/agnostic-entity-resolution-schema.svg[]

== 4. Cypher Queries

=== 4.1. Calculate the distance in meters between addresses

This Cypher query is designed to calculate the distance between different `Address` nodes based on their geographical coordinates (latitude and longitude). A notable aspect of this query is its use of the `point.distance` function to compute the distance directly within the query, as well as the use of `ID(a1) > ID(a2)` to avoid duplicate comparisons.

[source, cypher, role=noheader]
----
// Calculate the distance between Address Nodes
MATCH (a1:Address), (a2:Address)
WHERE ID(a1) > ID(a2)
RETURN a1.FullAddress AS FullAddress1, a2.FullAddress AS FullAddress2,
  point.distance(point({ latitude: a1.Latitude, longitude: a1.Longitude }),
  point({ latitude: a2.Latitude, longitude: a2.Longitude })) AS DistanceInMeters
----

==== 4.1.1. What is the query doing?

1. 
`MATCH (a1:Address), (a2:Address)`: This part of the query matches all nodes with the label `Address`. Two separate variables `a1` and `a2` are used to represent these `Address` nodes.

2. `WHERE ID(a1) > ID(a2)`: This condition ensures that the query does not compare an address with itself and avoids duplicate comparisons by ensuring that `a1` and `a2` are distinct, based on their internal Neo4j IDs.

3. `RETURN a1.FullAddress AS FullAddress1, a2.FullAddress AS FullAddress2`: This part of the query returns the full addresses of the two nodes being compared, renaming them as `FullAddress1` and `FullAddress2` for easier interpretation.

4. `point.distance(point({ latitude: a1.Latitude, longitude: a1.Longitude }), point({ latitude: a2.Latitude, longitude: a2.Longitude })) AS DistanceInMeters`: This is the core part of the query, which calculates the geographical distance between the two address nodes.

a. `point({ latitude: a1.Latitude, longitude: a1.Longitude })` constructs a point from the latitude and longitude of `a1`.
b. `point({ latitude: a2.Latitude, longitude: a2.Longitude })` does the same for `a2`.
c. `point.distance()` is then used to compute the distance between these two points in meters.

=== 4.2. Similarity Scoring of Address Nodes

This complex Cypher query aims to calculate similarity scores between different `Address` nodes based on multiple attributes, such as address lines and postcodes. The query uses the APOC (Awesome Procedures On Cypher) library's `apoc.cypher.mapParallel2` procedure to execute the similarity scoring in parallel, enhancing performance. The Levenshtein algorithm measures text similarity, allowing for a nuanced comparison of address fields. The query also incorporates several layers of selection logic to ensure high-quality similarity matching. 
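
Before diving into the full parallel query, it can help to see the raw scores that `apoc.text.levenshteinSimilarity` produces on its own. The following standalone query (using string literals taken from the demo data, purely for illustration) shows the kind of values the selection logic below operates on:

[source, cypher, role=noheader]
----
// Illustrative only: raw Levenshtein similarity scores for the demo addresses
RETURN
  apoc.text.levenshteinSimilarity("37 ALBYN PLACE", "COMPANY NAME") AS line_1_sim,
  apoc.text.levenshteinSimilarity("AB101JB", "AB101JB") AS post_sim,
  apoc.text.levenshteinSimilarity(
    "37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JB",
    "COMPANY NAME 37 ALBYN PLACE ABERDEEN AB101JB") AS full_address_sim
----

A score of `1.0` indicates identical strings, and values fall towards `0.0` as the edit distance between the strings grows. Identical postcodes therefore score `1.0`, while the two `FullAddress` values score somewhere below `1.0`.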

[source, cypher, role=noheader]
----
// Parallel Similarity Scoring Version
MATCH (a:Address)
WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes
CALL apoc.cypher.mapParallel2("
  MATCH (a:Address), (b:Address)
  WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _
  // Pass Variables
  WITH a, b,
  // Build similarity scores
  apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
  apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
  apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
  apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
  apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
  apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
  WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation

  // Selection logic

  // Require a minimum similarity of the full address
  WHERE full_address_sim > 0.6

  // Postcodes cannot be too far apart
  AND post_sim > 0.7
  // Looks at addresses that have prefixes, e.g. 
37 ALBYN PLACE vs COMPANY NAME 37 ALBYN PLACE
  // This addition pushes the address into Line 2
  AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
  AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)

  RETURN id(a) as a_id, a.FullAddress as a_FullAddress, id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;
  ",
  {parallel:true, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value
RETURN value.a_id AS a_id, value.a_FullAddress AS a_full_address, value.b_id AS b_id, value.b_FullAddress AS b_full_address, value.full_address_sim AS full_address_similarity;
----


==== 4.2.1. What is the query doing?

1. `MATCH (a:Address)`: Initiates the query by matching all nodes labelled `Address`.

2. `WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes`: Collects the distinct first three characters of each postcode into a list called `postcodes`.

3. `CALL apoc.cypher.mapParallel2("...", {parallel:true, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value`: Executes the nested Cypher query in parallel, with a batch size of 1000 and a concurrency level of 6.

==== 4.2.2. Nested Query Details

1. `MATCH (a:Address), (b:Address)`: Matches all pairs of `Address` nodes for comparison.

2. `WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _`: Ensures that each pair is compared only once and that both postcodes start with the current three-character prefix (`_`) taken from the `postcodes` list.

3. *Levenshtein Similarity Calculations:* Utilises `apoc.text.levenshteinSimilarity` to calculate the similarity between corresponding attributes of `a` and `b`.

4. *Selection Logic:* Applies various conditions to filter the results. For instance, it demands a minimum similarity in full addresses (`full_address_sim > 0.6`) and postcodes (`post_sim > 0.7`).

5. 
`RETURN id(a) as a_id, a.FullAddress as a_FullAddress, id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;`: Returns the IDs and full addresses of `a` and `b`, along with the full address similarity score.

This query is exceptionally well-suited for capturing nuanced relationships between addresses by incorporating advanced text similarity algorithms and detailed selection logic.

=== 4.3. Create Similarity Relationship between Address Nodes

This Cypher query is intended to create a relationship of type `SIMILAR_ADDRESS` between `Address` nodes based on several similarity scores calculated via the Levenshtein algorithm. Notably, the query performs these calculations using the APOC (Awesome Procedures On Cypher) library's `apoc.text.levenshteinSimilarity` function. It also employs intricate selection logic to filter out relationships that don't meet specific similarity criteria. This query is particularly aimed at cases where addresses share common prefixes or where there are slight discrepancies in address lines. 
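
Because the statement below compares every `Address` node against every other, the number of comparisons grows quadratically with the size of the graph. As an optional sanity check before running it (not part of the original walkthrough), you can estimate the size of the Cartesian product:

[source, cypher, role=noheader]
----
// Estimate how many pairwise comparisons the similarity query will perform
MATCH (a:Address)
WITH count(a) AS n
RETURN n AS address_count,
       n * n AS cartesian_pairs,        // every ordered pair, including self-pairs
       n * (n - 1) / 2 AS unique_pairs  // unordered pairs, excluding self-pairs
----

On large datasets, the parallel, postcode-blocked approach from section 4.2 is the more scalable option.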

[source, cypher, role=noheader]
----
// Create Similarity Relationship
MATCH (a:Address), (b:Address)

// Avoid self-comparisons and mirrored duplicate pairs
WHERE id(a) > id(b)

// Pass Variables
WITH a, b,

// Build similarity scores
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim

WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation

// Selection logic

// Require a minimum similarity of the full address
WHERE full_address_sim > 0.6

// Postcodes cannot be too far apart
AND post_sim > 0.7

// Looks at addresses that have prefixes, e.g. 37 ALBYN PLACE vs COMPANY NAME 37 ALBYN PLACE
// This addition pushes the address into Line 2
AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)

MERGE (a)-[:SIMILAR_ADDRESS {
    full_address_similarity: full_address_sim,
    postcode_similarity: post_sim,
    line_2_similarity: line_2_sim,
    line_1_similarity: line_1_sim,
    line_1_2_similarity: a_b_line_1,
    line_2_1_similarity: b_a_line_1
  }]->(b);
----

==== 4.3.1. What is the query doing?

* `MATCH (a:Address), (b:Address)`: The query starts by matching all nodes with the label `Address`, represented by variables `a` and `b`. The `WHERE id(a) > id(b)` predicate ensures each pair is compared only once, mirroring the earlier queries.

* `WITH a, b, …`: This clause passes the matched `a` and `b` nodes and several calculated similarity scores to the subsequent query parts. 

* *Levenshtein Similarity Calculations:* It employs `apoc.text.levenshteinSimilarity` to calculate similarity scores between corresponding attributes of `a` and `b`, such as address lines and postcodes.

* `WITH a, b, line_1_sim, …`: The query retains the original nodes and the calculated similarity scores for the next part of the query.

* *Selection Logic:* This section of the query imposes multiple filtering conditions to refine the similarity matching. These conditions consider the full address similarity, postcode similarity, and even address prefixes to create the most meaningful relationships.

* `MERGE (a)-[:SIMILAR_ADDRESS {...}]->(b);`: Finally, it creates a `SIMILAR_ADDRESS` relationship between `a` and `b` if they satisfy the conditions. It also stores the calculated similarity scores as properties of this relationship for future use.

This query is exceptionally well-suited for capturing nuanced relationships between addresses by incorporating advanced text similarity algorithms and detailed selection logic. 
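
Once the `SIMILAR_ADDRESS` relationships have been created, the candidate duplicates can be reviewed directly. As a possible follow-up step (illustrative, not part of the original walkthrough), the query below lists each related pair ordered by its stored similarity score:

[source, cypher, role=noheader]
----
// Review candidate duplicate addresses, strongest matches first
MATCH (a:Address)-[r:SIMILAR_ADDRESS]->(b:Address)
RETURN a.FullAddress AS Address1,
       b.FullAddress AS Address2,
       r.full_address_similarity AS FullAddressSimilarity
ORDER BY FullAddressSimilarity DESC;
----

From here, pairs above an agreed threshold could be consolidated into a single golden record, for example with APOC's `apoc.refactor.mergeNodes` procedure.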
diff --git a/modules/ROOT/pages/finserv/index.adoc b/modules/ROOT/pages/finserv/index.adoc index 53d87aa..6c9ba3f 100644 --- a/modules/ROOT/pages/finserv/index.adoc +++ b/modules/ROOT/pages/finserv/index.adoc @@ -12,5 +12,6 @@ Furthermore, Neo4j's flexibility and scalability make it an effective tool for m == Retail Banking * xref:finserv/retail-banking/automated-facial-recognition.adoc[] +* xref:finserv/retail-banking/entity-resolution.adoc[] * xref:finserv/retail-banking/synthetic-identity-fraud.adoc[] -* xref:finserv/retail-banking/transaction-ring/transaction-ring-introduction.adoc[] \ No newline at end of file +* xref:finserv/retail-banking/transaction-ring/transaction-ring-introduction.adoc[] \ No newline at end of file diff --git a/modules/ROOT/pages/finserv/retail-banking/entity-resolution.adoc b/modules/ROOT/pages/finserv/retail-banking/entity-resolution.adoc new file mode 100644 index 0000000..5ebe508 --- /dev/null +++ b/modules/ROOT/pages/finserv/retail-banking/entity-resolution.adoc @@ -0,0 +1,55 @@ += Entity Resolution + +// .A walkthrough of the Entity Resolution use case +// video::id[youtube] + +== 1. Introduction + +In the dynamic landscape of retail banking, the stakes for accurate and efficient entity resolution have never been higher. Traditional systems often operate in silos, leading to fragmented customer data and making it challenging to obtain a unified view of a single entity. This fragmentation compromises compliance with stringent regulatory requirements such as *AML (Anti-Money Laundering)* and *KYC (Know Your Customer)* and hampers effective risk management and customer engagement strategies. + +The current challenges posed by outdated entity resolution systems can manifest as operational inefficiencies, increased risk of financial crime, and missed opportunities for cross-selling and upselling. 
Moreover, as customer expectations for seamless and personalised services continue to rise, an inability to resolve entities accurately can result in lost business and damaged reputation. + +Investing in a modernised entity resolution system is not merely an operational upgrade; it's a strategic imperative. The cost and complexity of change are considerable, but the long-term business benefits far outweigh these challenges. Improved accuracy in *entity resolution enhances regulatory compliance, reduces the risk of fraud, and enables targeted customer engagement*. In a competitive market where customer trust and operational efficiency are paramount, lagging in this crucial area is not an option. Failing to adapt could result in regulatory penalties, reputational damage, and, ultimately, a loss of market share. Therefore, modernising entity resolution systems should be a top priority for any forward-thinking retail banking organisation. + +== 2. Scenario + +* *Regulatory Compliance:* Traditional systems struggle to meet the stringent demands of modern regulations such as AML (Anti-Money Laundering) and KYC (Know Your Customer). Non-compliance could lead to hefty fines and reputational damage. + +* *Fraud Detection:* Inadequate entity resolution hampers the bank's ability to identify suspicious activities across multiple accounts, increasing the risk of financial crimes like fraud and money laundering. + +* *Operational Inefficiencies:* Outdated entity resolution systems are often slow and require significant manual intervention, leading to higher operational costs and slower customer service. + +* *Customer Experience:* In the age of personalised banking, failure to accurately resolve entities results in missed opportunities for targeted marketing, cross-selling, and upselling, thereby affecting customer satisfaction and loyalty. 
* *Data Fragmentation:* Traditional systems usually operate in silos, making it difficult to consolidate customer data for a unified view, affecting risk management and decision-making processes.

The retail banking industry is at a pivotal juncture where modernising entity resolution is not just an upgrade but a necessity. The cost and complexity of implementing a new system are significant, yet the potential downsides of not adapting are far more severe. These range from regulatory penalties and heightened risk of fraud to loss of customer trust and potential market share. By addressing these challenges, banks not only stand to improve operational efficiency but can also significantly enhance customer relationships and compliance postures. Therefore, the strategic importance of upgrading entity resolution systems in the current competitive and regulatory environment cannot be overstated.


== 3. Solution

To overcome the challenges in entity resolution, retail banks should consider implementing advanced technologies that offer real-time, comprehensive insights. Graph databases provide a robust solution, revolutionising how data is connected and queried. Addressing issues from regulatory compliance to customer experience, the technology offers a multi-faceted approach to solving complex business problems. Failing to modernise in a sector where data-driven decisions are vital could be costly. The investment in change, although significant, positions the bank for greater efficiency, compliance, and customer satisfaction in the long term.

=== 3.1. How Graph Databases Can Help

* *Regulatory Compliance (AML/KYC):* Graph databases can dynamically link disparate data points, helping to identify complex relationships and hidden patterns that could signify money laundering or fraud. This ensures a more robust, real-time compliance mechanism. 
+ +* *Fraud Detection:* The real-time analysis of relationships and connections allows graph databases to spot inconsistencies or suspicious behaviours across multiple accounts and transactions, thereby significantly improving fraud detection capabilities. + +* *Operational Efficiency:* Traditional relational databases require complex queries for entity resolution, which can be time-consuming and resource-intensive. Graph databases simplify this by treating relationships as first-class citizens, reducing the time and computational power needed. + +* *Enhanced Customer Experience:* By consolidating fragmented data, graph databases enable a 360-degree view of the customer. This facilitates targeted marketing strategies, personalised services, and effective cross-selling and upselling. + +* *Risk Management:* Graph databases can provide more nuanced risk assessments by examining the intricate web of relationships between different entities, be they individual customers or corporate accounts. + +==== Technical Insight + +Graph databases are uniquely positioned to solve these challenges because they are designed to handle interconnected data naturally. Unlike traditional databases that store data in tables, graph databases focus on the relationships between data points. This is especially useful in retail banking, where understanding the connections between accounts, transactions, and customers is vital for compliance, fraud detection, and customer engagement. By using graph algorithms, organisations can perform deep relational analytics, thereby uncovering insights that would be impossible, or at least computationally expensive, to obtain with traditional systems. + +In a highly competitive and regulated industry, adopting graph database technology is not just a matter of keeping up with the times but a strategic necessity for risk mitigation, regulatory compliance, and maintaining a competitive edge. + +== 4. 
Technical Walkthrough + +A technical deep dive into this use case can be found here: + +* xref:agnostic/entity-resolution.adoc[] \ No newline at end of file diff --git a/modules/ROOT/pages/finserv/retail-banking/index.adoc b/modules/ROOT/pages/finserv/retail-banking/index.adoc index 76e1bc3..48ff47b 100644 --- a/modules/ROOT/pages/finserv/retail-banking/index.adoc +++ b/modules/ROOT/pages/finserv/retail-banking/index.adoc @@ -13,5 +13,6 @@ In conclusion, the combination of retail banking's data-intensive nature and Neo == Use Cases * xref:finserv/retail-banking/automated-facial-recognition.adoc[] +* xref:finserv/retail-banking/entity-resolution.adoc[] * xref:finserv/retail-banking/synthetic-identity-fraud.adoc[] * xref:finserv/retail-banking/transaction-ring/transaction-ring-introduction.adoc[] \ No newline at end of file diff --git a/modules/ROOT/pages/insurance/quote-fraud.adoc b/modules/ROOT/pages/insurance/quote-fraud.adoc index 18ef503..f3e9a70 100644 --- a/modules/ROOT/pages/insurance/quote-fraud.adoc +++ b/modules/ROOT/pages/insurance/quote-fraud.adoc @@ -289,7 +289,7 @@ CASE END AS Fraud_Level ---- -=== 5.5. Real-time fraud scoring +=== 5.6. Real-time fraud scoring For our last cypher query, we'll add a new quote to Neo4j and run a fraud score calculation to obtain a real-time response showing the similarity score. This code could be used behind an API or directly in Cypher, which would provide an in-flight indication of fraud.