Randomly selecting several values from a txt file
-
There is a txt file:
rama mama papa deda koza dama repa etc.
We need to pick three words from it at random, such that the first word is the first one in the txt file (rama in this example) and the other two are arbitrary but do not repeat words that were already chosen.
Please tell me how to implement this in Python 3.
-
To read the first line and select two more random lines from a small file:
#!/usr/bin/env python3
import random

with open('input.txt') as file:
    lines = [next(file)] + random.sample(list(file), 2)
print(*map(str.strip, lines))
next(file) reads the first line from the file (files are iterators over lines in Python). random.sample() (https://docs.python.org/library/random.html#random.sample) selects several elements from a list without replacement. If the words in the input file are not repeated, the result always contains unique words.
If words can be repeated in the file, you can use a set (https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) so that only the unique words remain:
#!/usr/bin/env python3
import random

with open('input_with_dups.txt') as file:
    first_word = next(file).strip()
    words = set(map(str.strip, file)) - {first_word}  # unique words
# NOTE: use random.sample() to avoid relying on PYTHONHASHSEED behavior;
# since Python 3.11 it requires a sequence, so the set is sorted first
print(first_word, *random.sample(sorted(words), 2))
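The effect of the set can be seen on a small in-memory input. The sketch below is only an illustration: io.StringIO stands in for the file, and the sample text is made up. A word that is repeated many times still appears only once among the candidates, and stray whitespace does not create extra candidates once the lines are stripped.

```python
import io

# Hypothetical file contents: 'mama' is repeated and some lines carry stray spaces.
text = "rama\nmama\nmama \nmama\npapa\n papa\ndeda\n"

with io.StringIO(text) as file:
    first_word = next(file).strip()
    raw_lines = list(file)

# Without stripping, 'mama\n' and 'mama \n' count as different entries.
unstripped = set(raw_lines) - {first_word}
# With stripping, each word appears exactly once.
candidates = set(map(str.strip, raw_lines)) - {first_word}

print(sorted(unstripped))
print(sorted(candidates))
```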
In this case, the probability that a word is chosen does not depend on how often it occurs in the file: all words (except the first) have the same weight.
str.strip (https://docs.python.org/3/library/stdtypes.html#str.strip) is used to remove whitespace from the input lines so that only the word itself remains on each line; otherwise 'word', 'word\n', and 'word ' would be treated as different words.
If the file is large but contains only distinct words, you can use reservoir_sample(), which performs reservoir sampling (https://ru.wikipedia.org/wiki/Reservoir_sampling):
#!/usr/bin/env python3
with open('input.txt') as file:
    lines = [next(file)] + reservoir_sample(file, 2)
print(*map(str.strip, lines))
This solution does not read the entire file into memory at once, so it works even for large files. Here reservoir_sample() is defined as:
import itertools
import random

def reservoir_sample(iterable, k,
                     randrange=random.randrange, shuffle=random.shuffle):
    """Select k random elements from iterable.

    Use O(n) Algorithm R https://en.wikipedia.org/wiki/Reservoir_sampling
    """
    it = iter(iterable)
    sample = list(itertools.islice(it, k))  # fill the reservoir
    if len(sample) < k:
        raise ValueError("Sample larger than population")
    shuffle(sample)
    for i, item in enumerate(it, start=k+1):
        j = randrange(i)  # random [0..i)
        if j < k:
            sample[j] = item  # replace item with gradually decreasing probability
    return sample
The probability of choosing an arbitrary line of the file is constant and equal to k / n, where n is the number of lines in the file.
In the general case (words can be repeated in the input file and the file may be large), the reservoir_sample() algorithm needs to be modified so that it takes the already chosen elements into account:
#!/usr/bin/env python3
import itertools
import random

def choose_uniq(iterable, k, chosen, randrange=random.randrange):
    j0 = len(chosen)
    it = (x for x in iterable if x not in chosen)
    for x in itertools.islice(it, k):  # NOTE: add one by one
        chosen.append(x)
    if len(chosen) < (j0 + k):
        raise ValueError("Sample larger than population")
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)  # random [0..i)
        if j < k:  # replace item with gradually decreasing probability
            chosen[j0 + j] = item

with open('input_with_dups.txt') as file:
    chosen_words = [next(file).strip()]  # first word
    choose_uniq(map(str.strip, file), 2, chosen_words)
print(*chosen_words)
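To try choose_uniq() without creating a file, something like the following could be used. It is a sketch under assumptions: io.StringIO replaces the file, the input text is invented, and the function is repeated so the block runs on its own. The loop checks that the first word is kept and that the chosen words never repeat.

```python
import io
import itertools
import random


def choose_uniq(iterable, k, chosen, randrange=random.randrange):
    """Same function as above, repeated so the sketch runs on its own."""
    j0 = len(chosen)
    it = (x for x in iterable if x not in chosen)
    for x in itertools.islice(it, k):
        chosen.append(x)
    if len(chosen) < (j0 + k):
        raise ValueError("Sample larger than population")
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)
        if j < k:
            chosen[j0 + j] = item


# Hypothetical input with repeated words, used in place of a real file.
text = "rama\nmama\nrama\nmama\npapa\ndeda\nmama\nkoza\npapa\n"

for _ in range(1000):
    with io.StringIO(text) as file:
        chosen_words = [next(file).strip()]  # first word
        choose_uniq(map(str.strip, file), 2, chosen_words)
    # the first word is kept and the three words are all different
    assert chosen_words[0] == "rama"
    assert len(set(chosen_words)) == 3

print(*chosen_words)
```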
(x for x in iterable if x not in chosen) filters out the elements that are already chosen. It works because the elements are generated lazily, one at a time. Since k == 2 in this case, x not in chosen is a fast operation even for a list. For large k, you can use the set type to get O(1) behavior.
choose_uniq() does not behave like random.sample(), which is why shuffle() is removed. The overall distribution is not quite uniform: depending on the order of the lines in the input file, a frequently repeated line may be chosen more often than if only unique words were considered (the result differs from the set(map(str.strip, file)) - {first_word} solution).
If a uniform distribution is required (all unique words are chosen with the same probability) for large files that do not fit in memory, you can use external sorting, which then allows removing duplicates with only O(1) additional memory, e.g. using itertools.groupby (https://docs.python.org/library/itertools.html#itertools.groupby), which in turn allows reservoir_sample() to be used again without changes.
If a strictly uniform distribution is not required, it is possible to avoid reading the entire, potentially large file (for speed) by choosing words from random positions in the file. For convenience, you can use mmap (https://docs.python.org/library/mmap), which allows the file to be treated as a string (byte sequence), even if the file is larger than the available memory:
#!/usr/bin/env python3
import locale
import mmap
import random
import re

with open('input_with_dups.txt', 'rb') as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    first_nonspace_pos = re.search(br'\S', s).start()  # skip leading space
    chosen = set([get_word(s, first_nonspace_pos), b''])  # get 1st word
    while len(chosen) != 4:  # add two more random non-empty words
        chosen.add(get_word(s, random.randrange(len(s))))
encoding = locale.getpreferredencoding(False)
print(*[w.decode(encoding) for w in chosen if w])
where get_word() returns the word from the line at the given position in the file:
def get_word(s, position, newline=b'\n'):
    """Return a word from a line in s at position."""
    i = s.rfind(newline, 0, position)  # find newline on the left
    start = (i + 1) if i != -1 else 0
    i = s.find(newline, position)  # find newline on the right
    end = i if i != -1 else len(s)
    return s[start:end].strip()  # a space is not part of a word, strip it
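get_word() can be tried on an ordinary bytes object, which supports the same find(), rfind(), and slicing calls that the function uses. The sample contents below are invented for illustration. Counting how many byte positions map to each word also shows how a uniformly random position weights the words.

```python
import collections


def get_word(s, position, newline=b'\n'):
    """Same helper as above, repeated so the sketch runs on its own."""
    i = s.rfind(newline, 0, position)  # find newline on the left
    start = (i + 1) if i != -1 else 0
    i = s.find(newline, position)  # find newline on the right
    end = i if i != -1 else len(s)
    return s[start:end].strip()


# Hypothetical file contents: a trailing space, an empty line, no final newline.
s = b'rama\nmama \n\npapa'

print(get_word(s, 0))   # b'rama'
print(get_word(s, 7))   # b'mama' (the trailing space is stripped)
print(get_word(s, 11))  # b'' (the position falls on the empty line)
print(get_word(s, 15))  # b'papa'

# Number of positions that lead to each word: longer lines get more weight.
weights = collections.Counter(get_word(s, p) for p in range(len(s)))
print(weights)  # rama: 5, mama: 6, papa: 4, empty word: 1
```

The empty word in the counts is the reason b'' is placed in the chosen set in advance.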
The file may contain blank lines (containing only whitespace); the code with first_nonspace_pos and b'' avoids choosing empty words. The code assumes that there are more than two distinct words in the input file; otherwise an infinite loop is possible. Unicode whitespace (such as U+00A0) is not taken into account.
The probability of choosing a word in this case may depend on the length of the words, on how often they are repeated in the file, and even on the encoding used (i.e. the distribution is non-uniform).
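For comparison, the uniform variant described earlier (sort, drop duplicates with itertools.groupby, then apply reservoir_sample() unchanged) might look roughly like this. It is a minimal in-memory sketch: sorted() merely stands in for an external sort, the word list is invented, and unique_sorted() is a hypothetical helper name.

```python
import itertools
import random


def reservoir_sample(iterable, k,
                     randrange=random.randrange, shuffle=random.shuffle):
    """Algorithm R, repeated here so the sketch runs on its own."""
    it = iter(iterable)
    sample = list(itertools.islice(it, k))
    if len(sample) < k:
        raise ValueError("Sample larger than population")
    shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = randrange(i)
        if j < k:
            sample[j] = item
    return sample


def unique_sorted(words):
    """Yield each word once from an already sorted iterable (O(1) extra memory)."""
    for word, _group in itertools.groupby(words):
        yield word


# Hypothetical input; sorted() is an assumption standing in for an external sort.
words = ["rama", "mama", "mama", "papa", "deda", "mama", "papa", "koza"]
first_word = words[0]
candidates = (w for w in unique_sorted(sorted(words[1:])) if w != first_word)
picked = reservoir_sample(candidates, 2)
print(first_word, *picked)
```

Because duplicates are removed before sampling, each distinct word other than the first has the same chance of being picked, regardless of how often it occurs.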