MapReduce for identifying unique users



  • There is a large input file (ppm) in the following format: FirstName|LastName|DateOfBirth

    Example:

    Yana|Petrova|21.01.1990
    Kseniya|Ivanova|22.02.1990
    Kseniya|Ivvanova|22.02.1990
    Jana|Petrova|21.01.1091
    

    ...

    Users can enter data with errors in any field and with different spellings of the first and last name. A user may also swap the first and last name in the file (the date of birth cannot be confused with either). The file may contain several records for the same user. It is also assumed that labeled data is available for verifying how well the algorithm performs.

    A MapReduce job needs to be implemented that finds unique users.

    Example of the output file after the algorithm has run:

    1|Yana|Petrova|21.01.1991
    2|Kseniya|Ivanova|22.02.1990
    2|Kseniya|Ivvanova|22.02.1990
    1|Jana|Petrova|21.01.1091
    

    Here the first field is the ID of a unique user. The order of lines in the output file does not matter. All that matters is that the unique-user IDs are assigned correctly, so that the chosen metric is minimized.

    Please advise how best to implement the algorithm, especially the reduce stage. What is the best way to compare records and recognize the same user?



  • In such cases, fuzzy string comparison is done with the Levenshtein distance https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D1%81%D1%81%D1%82%D0%BE%D1%8F%D0%BD%D0%B8%D0%B5_%D0%9B%D0%B5%D0%B2%D0%B5%D0%BD%D1%88%D1%82%D0%B5%D0%B9%D0%BD%D0%B0 - the number of edit operations needed to turn one string into the other.

    For identical strings, the distance is zero. For one or two typos, the distance will be small. You choose a threshold beyond which the strings are no longer considered equal.

    Since you assume that the first and last name may be swapped, you will have to make two comparisons (first name plus last name in direct order and in reversed order) and then take the best (smaller) value.
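    As an illustration of the distance described above, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The class name `Levenshtein` is hypothetical, and this is only one possible way to write it, not necessarily what the answer's author had in mind.

```java
final class Levenshtein {
    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int dist(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(dist("Ivanova", "Ivvanova")); // one extra letter
        System.out.println(dist("Yana", "Jana"));         // one replaced letter
    }
}
```

    With the sample records from the question, both pairs differ by a single edit, so a threshold of 1 or 2 would already treat them as equal.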

    public boolean fuzzyEqual(String firstname1, String lastname1, String firstname2, String lastname2, int threshold) {
        // Compare in direct and in swapped order and keep the smaller distance.
        return threshold >= Math.min(
                   dist(firstname1 + " " + lastname1, firstname2 + " " + lastname2),
                   dist(firstname1 + " " + lastname1, lastname2 + " " + firstname2)
        );
    }
    

    public int dist(String a, String b) {
        // Your Levenshtein distance implementation goes here.
        throw new UnsupportedOperationException("dist is not implemented yet");
    }

    P.S. Before applying the algorithm, the strings themselves must be converted to a single case (upper or lower).
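    To show how the pieces could fit together, below is a rough sketch that lowercases the names, compares them in direct and swapped order, and hands out IDs to a small batch of records. The class `UserMatcher`, the helpers `norm` and `assignIds`, the greedy "compare against the first record of each group" strategy, and the threshold of 2 are all assumptions made for illustration. The date of birth is ignored here, and a real reduce stage would need its own grouping key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class UserMatcher {
    // Levenshtein distance (two-row dynamic programming).
    static int dist(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Bring a field to a single case before comparing, as the P.S. advises.
    static String norm(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Same idea as fuzzyEqual above, with normalization applied first.
    static boolean fuzzyEqual(String first1, String last1, String first2, String last2, int threshold) {
        String left = norm(first1) + " " + norm(last1);
        return threshold >= Math.min(
                dist(left, norm(first2) + " " + norm(last2)),
                dist(left, norm(last2) + " " + norm(first2)));
    }

    // Greedy assignment (an assumption): a record joins the first group whose
    // representative it fuzzily equals, otherwise it starts a new group.
    // Each record is {firstName, lastName}; returned IDs start at 1.
    static int[] assignIds(List<String[]> records, int threshold) {
        int[] ids = new int[records.size()];
        List<String[]> representatives = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            String[] r = records.get(i);
            int id = 0;
            for (int k = 0; k < representatives.size() && id == 0; k++) {
                String[] rep = representatives.get(k);
                if (fuzzyEqual(rep[0], rep[1], r[0], r[1], threshold)) {
                    id = k + 1;
                }
            }
            if (id == 0) {
                representatives.add(r);
                id = representatives.size();
            }
            ids[i] = id;
        }
        return ids;
    }
}
```

    Greedy grouping depends on the order of records and on the threshold, so the labeled data mentioned in the question would be the natural way to tune both.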



