How to remove unnecessary words and leave only numbers
-
I'm new to programming. I have a CSV file with 377,185 rows. All values in the baths column are strings. How can I remove the word "Baths" from almost every row of this column? Doing it by hand for 377 thousand rows isn't realistic. I wanted to try regular expressions first, but to use them for the search, do the numbers have to be in int or float format? Do I understand that correctly?
id   baths
0    3.5
1    3 Baths
2    2 Baths
5    8
200  2
-
Use the pandas module (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). It's well suited to processing, analysing and visualizing tabular data:
import pandas as pd  # https://pandas.pydata.org/docs/getting_started/install.html

# parse the CSV file into a Pandas DataFrame
df = pd.read_csv("file.csv", sep=",")
# in the "baths" column, remove all characters except digits and the dot
df["baths"] = df["baths"].str.replace(r"[^\d.]", "", regex=True)
# write the DataFrame back to a CSV file
df.to_csv("result.csv", index=False)
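On the question of types: regular expressions work on strings, so the column does not need to be int or float before the search. If numeric values are wanted afterwards, the conversion can happen after the cleanup. A minimal sketch, assuming a small in-memory sample modelled on the rows in the question (the sample text and the use of pd.to_numeric are illustrative additions, not part of the original answer):

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's data; a real file would be read by path.
csv_text = "id,baths\n0,3.5\n1,3 Baths\n2,2 Baths\n5,8\n200,2\n"

# dtype=str keeps every value in the column as text, as described in the question.
df = pd.read_csv(io.StringIO(csv_text), dtype={"baths": str})

# The regex runs on strings, so no numeric conversion is needed beforehand.
df["baths"] = df["baths"].str.replace(r"[^\d.]", "", regex=True)

# Convert to numbers afterwards; errors="coerce" turns unparsable values
# (for example an empty string) into NaN instead of raising.
df["baths"] = pd.to_numeric(df["baths"], errors="coerce")

print(df["baths"].tolist())
```

With errors="coerce", rows that contained no digits at all end up as NaN, which makes them easy to find later with df["baths"].isna().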
The regular expression r"[^\d.]" means:
- Everything in square brackets matches any one character from the set.
- If the first character inside the square brackets is ^, the set is negated, i.e. it matches any character other than those listed after the ^.
- \d denotes any digit from 0 to 9.
- . denotes a literal dot here. Outside square brackets, an unescaped . in a RegEx matches any single character.
Taken together, this means: find every character other than digits and the dot and replace it with an empty string, i.e. remove it.
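The same pattern can be tried on a few plain strings with the standard re module, without pandas. This is only a sketch: the helper name strip_non_numeric and the sample values are made up for illustration.

```python
import re

# Same pattern as in the answer: any character that is neither a digit nor a dot.
NON_NUMERIC = re.compile(r"[^\d.]")


def strip_non_numeric(text: str) -> str:
    """Remove every character except digits and dots."""
    return NON_NUMERIC.sub("", text)


samples = ["3.5", "3 Baths", "2 Baths", "8"]
cleaned = [strip_non_numeric(s) for s in samples]
print(cleaned)  # ['3.5', '3', '2', '8']
```

Note that the pattern keeps every dot, so a value such as "1.5 Baths." would become "1.5." and would still need extra handling before conversion to a number.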
PS There are services that let you test a RegEx and explain how it works, for example: https://regex101.com/r/3MJlZ8/1
Documentation for Python's re module: https://docs.python.org/3/library/re.html