Randomly extract several values from a txt file



  • There's a txt file:

    rama
    mama
    papa
    deda
    koza
    dama
    repa
    etc.
    

    We need to pull three words out of it, such that the first word is the first word of the txt file (in this example, rama), and the other two words can be any words, as long as they don't repeat the words already chosen.

    Please tell me how to do this in Python 3.



  • To read the first line and select two more random lines from a small file:

    #!/usr/bin/env python3
    import random

    with open('input.txt') as file:
        lines = [next(file)] + random.sample(list(file), 2)
    print(*map(str.strip, lines))

    next(file) reads the first line from the file (files are iterators over their lines in Python). https://docs.python.org/library/random.html#random.sample selects several elements from the list without replacement. If the words in the input file are not repeated, the result always contains unique words.
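    As a standalone illustration (with a hypothetical word list), random.sample() always picks distinct positions from its input:

```python
import random

# hypothetical word list for illustration
words = ['mama', 'papa', 'deda', 'koza', 'dama', 'repa']
pair = random.sample(words, 2)  # two items, no replacement
print(pair)
```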


    If words can be repeated in the file, you can use a set
    (https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) so that only unique words remain:

    #!/usr/bin/env python3
    import random

    with open('input_with_dups.txt') as file:
        first_word = next(file).strip()
        words = set(map(str.strip, file)) - {first_word}  # unique words
    print(first_word, *random.sample(list(words), 2))
    # NOTE: random.sample() avoids relying on PYTHONHASHSEED behavior;
    # list() is needed because sampling directly from a set is no longer
    # supported since Python 3.11

    In this case, the probability that a word is chosen does not depend on
    how often it occurs in the file: all words (except the first) have the
    same weight.

    https://docs.python.org/3/library/stdtypes.html#str.strip is used to remove whitespace from the input lines so that
    each line is reduced to the word itself; otherwise 'word', 'word\n'
    and 'word ' would be treated as different words.
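    A tiny sketch of why strip() matters here:

```python
# 'rama\n', 'rama' and ' rama ' all collapse to one word after strip()
raw = ['rama\n', 'rama', ' rama ']
stripped = {s.strip() for s in raw}
print(stripped)  # {'rama'}
```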


    If the file is large but contains only distinct words, you can use reservoir_sample(), which implements reservoir sampling (https://ru.wikipedia.org/wiki/Reservoir_sampling):

    #!/usr/bin/env python3
    with open('input.txt') as file:
        lines = [next(file)] + reservoir_sample(file, 2)
    print(*map(str.strip, lines))

    This solution doesn't read the entire file into memory at once, so it works even for large files. Here reservoir_sample() is:

    import itertools
    import random

    def reservoir_sample(iterable, k,
                         randrange=random.randrange, shuffle=random.shuffle):
        """Select k random elements from iterable.

        Use O(n) Algorithm R https://en.wikipedia.org/wiki/Reservoir_sampling
        """
        it = iter(iterable)
        sample = list(itertools.islice(it, k))  # fill the reservoir
        if len(sample) < k:
            raise ValueError("Sample larger than population")
        shuffle(sample)
        for i, item in enumerate(it, start=k+1):
            j = randrange(i)  # random [0..i)
            if j < k:
                sample[j] = item  # replace item with gradually decreasing probability
        return sample
    

    The probability of choosing any given line of the file is constant and
    equal to k / n, where n is the number of lines in the file.
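    This k / n property can be checked empirically; the sketch below (repeating the function so the snippet runs on its own) counts how often each of n = 5 lines is chosen with k = 2 — each should be chosen about 40% of the time:

```python
import itertools
import random
from collections import Counter

def reservoir_sample(iterable, k,
                     randrange=random.randrange, shuffle=random.shuffle):
    """Select k random elements from iterable (Algorithm R)."""
    it = iter(iterable)
    sample = list(itertools.islice(it, k))  # fill the reservoir
    if len(sample) < k:
        raise ValueError("Sample larger than population")
    shuffle(sample)
    for i, item in enumerate(it, start=k+1):
        j = randrange(i)  # random [0..i)
        if j < k:
            sample[j] = item
    return sample

lines, k, trials = list('abcde'), 2, 20000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(lines, k))
expected = trials * k / len(lines)  # each line: ~ trials * k / n picks
print({line: round(counts[line] / expected, 2) for line in lines})
```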


    In the general case (if words can be repeated in the input file and the
    file may be large), the reservoir_sample() algorithm has to be modified so that already-chosen elements are taken into account:

    #!/usr/bin/env python3
    import itertools
    import random

    def choose_uniq(iterable, k, chosen, randrange=random.randrange):
        j0 = len(chosen)
        it = (x for x in iterable if x not in chosen)
        for x in itertools.islice(it, k):  # NOTE: add one by one
            chosen.append(x)
        if len(chosen) < (j0 + k):
            raise ValueError("Sample larger than population")
        for i, item in enumerate(it, start=k + 1):
            j = randrange(i)  # random [0..i)
            if j < k:  # replace item with gradually decreasing probability
                chosen[j0 + j] = item

    with open('input_with_dups.txt') as file:
        chosen_words = [next(file).strip()]  # first word
        choose_uniq(map(str.strip, file), 2, chosen_words)
    print(*chosen_words)

    (x for x in iterable if x not in chosen) filters out already-chosen
    elements. This works because the elements are generated lazily, one at
    a time, so each append to chosen is visible on the next step. For
    k == 2, x not in chosen is a fast operation even for a list. For large
    k you can use a set to get O(1) membership tests.
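    The point about lazy evaluation can be seen in a small standalone demo: items appended to the list while iterating are immediately visible to the filtering generator, so duplicates are skipped:

```python
chosen = []
source = iter(['rama', 'mama', 'rama', 'papa'])
it = (x for x in source if x not in chosen)
for x in it:
    chosen.append(x)  # the filter above sees this append on the next step
print(chosen)  # duplicate 'rama' is skipped: ['rama', 'mama', 'papa']
```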

    choose_uniq() does not behave exactly like random.sample(), which is
    why there is no shuffle() here. The resulting distribution is not
    entirely uniform: depending on the order of lines in the input file, a
    frequently repeated line may be chosen more often than if only unique
    words were considered (the result differs from the
    set(map(str.strip, file)) - {first_word} solution).

    If a uniform distribution is required (all unique words chosen with the
    same probability) for large files that do not fit in memory, external
    sorting can be used: it allows duplicates to be removed with O(1)
    additional memory, e.g. using
    https://docs.python.org/library/itertools.html#itertools.groupby, which in turn makes it possible to use reservoir_sample() again without changes.
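    A sketch of the deduplication step on already-sorted lines using itertools.groupby; the sorting itself is assumed to have been done beforehand by an external tool (such as the `sort` utility), so duplicates are adjacent:

```python
import itertools

# lines as they would come out of an external sort (duplicates adjacent)
sorted_lines = ['dama', 'dama', 'koza', 'rama', 'rama', 'rama']
unique = [key for key, _group in itertools.groupby(sorted_lines)]
print(unique)  # ['dama', 'koza', 'rama']
```

    The generator of unique keys can then be fed to reservoir_sample() without materializing the whole list.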


    If a strictly uniform distribution is not required, then (for speed) the whole potentially large file need not be read: words can be picked from random positions in the file. For convenience you can use https://docs.python.org/library/mmap which allows the file to be treated as a string
    (byte sequence), even if the file is larger than available memory:

    #!/usr/bin/env python3
    import locale
    import mmap
    import random
    import re

    with open('input_with_dups.txt', 'rb') as file, \
         mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
        first_nonspace_pos = re.search(br'\S', s).start()  # skip leading space
        chosen = set([get_word(s, first_nonspace_pos), b''])  # get 1st word
        while len(chosen) != 4:  # add two more random non-empty words
            chosen.add(get_word(s, random.randrange(len(s))))
    encoding = locale.getpreferredencoding(False)
    print(*[w.decode(encoding) for w in chosen if w])

    where get_word() returns the word from the line closest to the given position in the file:

    def get_word(s, position, newline=b'\n'):
        """Return a word from a line in s at position."""
        i = s.rfind(newline, 0, position)  # find newline on the left
        start = (i + 1) if i != -1 else 0
        i = s.find(newline, position)  # find newline on the right
        end = i if i != -1 else len(s)
        return s[start:end].strip()  # a space is not part of a word, strip it
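    A self-contained check of get_word() on an in-memory byte string (the function repeated from above so the snippet runs on its own):

```python
def get_word(s, position, newline=b'\n'):
    """Return a word from a line in s at position."""
    i = s.rfind(newline, 0, position)  # find newline on the left
    start = (i + 1) if i != -1 else 0
    i = s.find(newline, position)  # find newline on the right
    end = i if i != -1 else len(s)
    return s[start:end].strip()

data = b'rama\nmama\npapa\n'
print(get_word(data, 0))   # b'rama'
print(get_word(data, 6))   # b'mama' (position 6 is inside the second line)
print(get_word(data, 12))  # b'papa'
```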

    In case the file starts with whitespace (or a line contains only
    whitespace), the first_nonspace_pos code and the b'' entry protect
    against choosing an empty word. The code assumes that the input file
    contains more than two distinct words; otherwise an infinite loop is
    possible. Unicode whitespace characters (such as U+00A0) are not
    handled.

    The probability of choosing a word in this case may depend on the
    length of the word, how often it is repeated in the file, and even on
    the encoding used (i.e. the distribution is not uniform).



