matching names and assigning unique identifier - master client index



  • create table dbo.data1
    (
    ClinetKey int, 
    SourceID int, 
    SourceName varchar(5),
    ClientID int,
    ClientFirstName varchar(50),
    ClientLastName varchar(50),
    UID int
    );
    
    insert into dbo.data1 (ClinetKey, SourceID , SourceName ,ClientID ,ClientFirstName,ClientLastName,UID)
    values (1,100101',db1',1001,'Tomas','Jones');
    insert into dbo.data1 (ClinetKey, SourceID , SourceName ,ClientID ,ClientFirstName,ClientLastName,UID)
    values (1,100308',db2',1003,'Toma','Jones');
    insert into dbo.data1 (ClinetKey, SourceID , SourceName ,ClientID ,ClientFirstName,ClientLastName,UID)
    values (1,100909',db3',1009,'Thhoma','Jones');
    insert into dbo.data1 (ClinetKey, SourceID , SourceName ,ClientID ,ClientFirstName,ClientLastName,UID)
    values (1,111001',db4',11155,'Tomass','Jones');
    insert into dbo.data1 (ClinetKey, SourceID , SourceName ,ClientID ,ClientFirstName,ClientLastName,UID)
    values (1,122033',db5',1233,'Toma','Jones');
    insert into dbo.data1 (ClinetKey, SourceID , SourceName ,ClientID ,ClientFirstName,ClientLastName,UID)
    values (1,133009',db1',1009,'Tomas','Jones');
    

    Based on the matching name logic I need to assign a unique identifier or unique value. Like in the data above all the names sound similar and very close if I need to assign a unique value to the it will be the same for all the 6 records.

    Initially the UID is null or donot have a value when loaded into the table



  • There's definitely no definitive way to solve this because of the complexity in how many variations there can be in words both physically and phonetically, and the differences in what one may find acceptable as a match but someone else may not, e.g. Thomas Jones vs Tom Jones vs Jon Jones vs Tohas Jones.

    SOUNDEX and DIFFERENCE might get you close enough, but it also can give false results because two words can be spelled differently but phonetically sound the same or close such as Amelia and Emelia or Rose and her brother Roe. Shawn vs Sean is another good example, which you may or may not want to consider the same.

    You can also look into using a https://stackoverflow.com/a/27734606/5059085 which is basically a string comparison algorithm, but that too may give you false results like Bobby Jones being very similar to Robby Jones and Shawn vs Sean being somewhat dissimilar in spelling.

    Likely you'll find the most fitting solution for your use case is a combination of techniques which may include either or both of the aforementioned options and / or some manual logic as well. This will be dependent on your data and what you care to consider a match. It will require a lot of a testing across all of your data.


Log in to reply
 


Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2