How to remove duplicate elements from a list in C?



  • I have a contact list and I need it to be removed duplicates, I was trying to solve using the code below:

    Elemento* remove_duplicados(Elemento* agenda)
    {
        Elemento* a = NULL; //POnteiro do Elemento anterior
        Elemento* p;
        Elemento* q;
    
    for(p = agenda; p != NULL; p = p -> prox){//Analisa do primeiro elem até o ultimo
        for(q = p -> prox; q != NULL; q = q -> prox){//Analisa proximo elem de p até o fim
            if(0 != strcmp(p -> info.nome, q -> info.nome)){
                a = q; 
            }
        }
    
        if(a == NULL){ //Remove elemento do início e aponta para o 2º
            agenda = p -> prox;
        } 
        else
            a -> prox = p -> prox; //Ponteiro anterior levará para o prox
    
        free(p); //Libera o espaço
    }
    return agenda;
    

    }

    The problem is that when running the program, when arriving at this stage it ends:/



  • Problems

    It has some flaws in logic that is implemented:

    • It is not removing the repeated several successively, because it only uses a at the end of the second for
    • Can't do it free(p) and then use for to advance to the next with p = p -> prox
    • The advance agenda = p -> prox it is not necessary to remove the duplicates from the first instead of the first in itself. This causes your instruction to be agenda->prox = p->prox or p->prox =q->prox once it stops a being NULL p must be exactly in the next element

    Fixing problems

    Using the logic and variables that I suggest you do so:

    Elemento* remove_duplicados(Elemento* agenda){
        Elemento* a;
        Elemento* p;
        Elemento* q;
    
    for(p = agenda; p != NULL; p = p -> prox ){
        a = NULL; //a cada varrimento a começa a NULL
    
        for(q = p -> prox; q != NULL; ){ //sem incremento
            if(0 == strcmp(p -> info.nome, q -> info.nome)){ //teste de igual
                if(a == NULL){ //remove do inicio
                    p -> prox = q -> prox;
                }
                else{ //ou do meio/fim
                    a -> prox = q -> prox;
                }
    
                Elemento* remover = q;
                q = q -> prox;
                free(remover);
            }
            else { //so atualiza o anterior quando não é igual
                a = q;
                q = q->prox;
            }
        }
    }
    
    return agenda;
    

    }

    I removed the comments I had to be able to highlight those I put, and with the explanations associated with the exchanges I made.

    https://ideone.com/QSnGFY

    Refactoring for better readability

    The variable names you have are very little suggestive and make reading difficult. You should always try to give representative names of the content that the variables have. Moreover, the first element is never changed since the duplicates are removed to the front, and so the return type is also not necessary, and should be void.

    Taking this into consideration we can make the function quite clearer:

    void remove_duplicados(Elemento* agenda){
    Elemento *anterior, *corrente, *seguinte;

    for(corrente = agenda; corrente != NULL; corrente = corrente -> prox ){
        anterior = NULL;
    
        for(seguinte = corrente -> prox; seguinte != NULL; ){
            if(0 == strcmp(corrente -> info.nome, seguinte -> info.nome)){
                if(anterior == NULL)
                    corrente -> prox = seguinte -> prox;
                else
                    anterior -> prox = seguinte -> prox;
    
                Elemento* remover = seguinte;
                seguinte = seguinte -> prox;
                free(remover);
            }
            else {
                anterior = seguinte;
                seguinte = seguinte->prox;
            }
        }
    }
    

    }

    Efficiency with Hash table

    If there is concern about efficiency, the solution presented by you, which was what I took, can bring problems if the amount of data is wide, since it is a quadratic solution, with time complexity in the order of O(n2).

    There are several ways to improve the solution but with a hash table and ensuring few collisions we have achieved a solution in the order of O(n).

    We start with the structure of each hash table entry and the respective array that represents it:

    typedef struct entrada {
    struct info *dados;
    struct entrada *prox;
    } entrada;

    #define TAMANHO_HASH 100
    entrada *hash_nomes[TAMANHO_HASH];

    To find the input for a given value, you need the function of hash, that in this case we can use as strings:

    unsigned long hash(char *str){
    unsigned long hash = 5381;
    int c;
    while (c = *str++)
    hash = ((hash << 5) + hash) + c;

    return hash;
    

    }

    Create a good function hash it is only by itself a science, and will reflect on the performance of the algorithm as it will create many or few collisions depending on how it is built. For simplicity I used http://www.cse.yorku.ca/~oz/hash.html .

    The function remove_duplicados is now more simplified as it is based entirely on the table hash to know if the current value already exists:

    void remove_duplicados(Elemento* agenda){
    Elemento *corrente = agenda, *anterior = NULL;

    while (corrente != NULL){
        if (adiciona_se_inexistente(&amp;corrente-&gt;info) == 0){ //0 não adicionou por já existir
            anterior-&gt;prox = corrente-&gt;prox;
            Elemento* remover = corrente;
            corrente = corrente -&gt; prox;
            free(remover);
        }
        else {
            anterior = corrente;
            corrente = corrente -&gt; prox;
        }
    }
    

    }

    Here the function was used adiciona_se_inexistente to add in table hash if there is no:

    int adiciona_se_inexistente(struct info* dados){
    unsigned long key = hash(dados->nome) % TAMANHO_HASH; //achar a entrada

    entrada *corrente = hash_nomes[key];
    while (corrente != NULL){ //verificar se já existe
        if (strcmp(corrente-&gt;dados-&gt;nome, dados-&gt;nome) == 0){
            return 0;
        }
        corrente = corrente-&gt;prox;
    }
    
    //não existe adiciona a nova entrada
    entrada *nova = malloc(sizeof(entrada));
    nova-&gt;prox = hash_nomes[key];
    nova-&gt;dados = dados;
    hash_nomes[key] = nova;
    return 1;
    

    }

    The table hash has to be initialized with all inputs NULL, what I did using the function http://www.cplusplus.com/reference/cstring/memset/ .

    Observations:

    • I implemented the table hash with a list of internal collisions rather than open addressing, which ends up using more space, but is potentially simpler to implement.
    • This solution is much more efficient at performance level if we ensure that we have few collisions, something that depends on the function of hash and the size of the table. However you will use more memory than your original solution. Note that the size 100 that I chose for the table is good for the amount of names I used (5), but if you have more names you can already generate many collisions, then it is something that has to adapt to your needs.
    • It is also worth remembering that if you just need this function at a point of the program and are sure that further ahead will not be necessary, you must release the space of all completed entries of the table hash, with free.

    https://ideone.com/B04m8z




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2