The problem breaks down into two parts.

1. Getting the links from the Google spreadsheet

The simplest way, and surely the one Google prefers, is to use Google's own API, for example through gspread (https://pypi.org/project/gspread/). That said, since the question focuses on scraping and Selenium, I'll try to give a solution based on them, although with a bit of a cheat.

I say "cheat" because, instead of scraping the Google Docs page itself to get the links, I do the following:

- Use Selenium to open the Google spreadsheet.
- Click File > Download > Comma-separated values (.csv).

Et voilà! We have a CSV with the contents of the sheet, which is very easy to parse with Python to obtain the links. We obviously lose things like styles, but those are irrelevant here.

What remains is to extract the links from the CSV and process each one.

Important: while this part of the scraping runs, we must not interact with the driver's window (unless it is hidden). Hover operations are involved, and merely moving the pointer over the window can break everything.

2. Getting the data from each form

As for how to get the questions and answers, I already covered that in an earlier answer: https://es.stackoverflow.com/q/359844/15089. We just need to iterate over the form links and proceed as explained there.

However, the form scraping changes in a few ways with respect to that earlier solution:

- Some of these forms have mandatory fields: you must select an option or write something in a textarea. The previous answer did not account for that, so the script now clicks a random radio button and types text into the textarea where needed.
- I changed how the end of a form is detected. Before, the form was simply submitted, but within the already ethically questionable practice of scraping (although Google lives off it XD), submitting a form filled in at random by a machine seems even less ethical.
- Instead, I detect the presence of a button with the text "Enviar" (the Spanish label of the submit button) to know when we are on the last page of the form. The catch is that if the browser language is not Spanish the button text will be different, so that part of the code has to be changed accordingly. The line to look for, near the end of the full code, is:

    if btn.text == "Enviar":
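One way to avoid hard-coding a single language is to compare the button text against a small set of known labels. The sketch below is only an illustration: `SUBMIT_LABELS` and `is_last_page` are hypothetical names, and the labels other than "Enviar" are my assumptions about what the form shows in other languages, so check them against your own browser.

```python
# Hypothetical helper: every label except "Enviar" is an assumption and
# should be verified against the actual form UI in each language.
SUBMIT_LABELS = {"Enviar", "Submit", "Envoyer"}


def is_last_page(button_texts, submit_labels=SUBMIT_LABELS):
    """Return True if any button text matches a known submit label."""
    normalized = {label.strip().casefold() for label in submit_labels}
    return any(text.strip().casefold() in normalized for text in button_texts)


# Plain strings stand in for the `btn.text` values read by Selenium.
print(is_last_page(["Atrás", "Enviar"]))     # True
print(is_last_page(["Atrás", "Siguiente"]))  # False
```

In the script, the check could then be written as `is_last_page(btn.text for btn in btns)`, clicking `btns[-1]` to advance whenever it returns False.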
The complete code:

    # Written against the Selenium 3 API (find_element_by_* methods and a
    # FirefoxProfile passed to webdriver.Firefox). Selenium 4 replaces these
    # with find_element(By.<...>, ...) and Options objects.
    import csv
    import glob
    import pathlib
    import random
    import shutil
    import sys
    import time

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common import keys


    location = "https://docs.google.com/spreadsheets/d/1iLqEFRaHPYxpJKU05VXt3HUCQ2OQUAg8FfWlyFbvaXc/edit?usp=sharing"

    # Create a "temporary" directory next to the script
    file_dir = pathlib.Path(__file__).absolute().parent / "temp_files"
    shutil.rmtree(file_dir.as_posix(), ignore_errors=True)
    file_dir.mkdir(parents=True, exist_ok=True)

    # Let the browser download csv files without asking
    profile = webdriver.FirefoxProfile()
    profile.set_preference(
        "browser.helperApps.neverAsk.saveToDisk",
        "text/csv"
    )
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.helperApps.alwaysAsk.force", False)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.download.dir", file_dir.as_posix())

    driver = webdriver.Firefox(firefox_profile=profile)
    driver.get(location)

    # Open the "File" menu and hover over the "Download" submenu
    menu_archivo = driver.find_element_by_id("docs-file-menu")
    menu_archivo.click()
    # The colon in the id has to be escaped for the CSS selector to be valid
    submenu_descargar = driver.find_element_by_css_selector(
        r"#\:2z > div:nth-child(1) > span:nth-child(1)"
    )
    action = ActionChains(driver)
    action.move_to_element(submenu_descargar)
    action.perform()
    action.send_keys(keys.Keys.DOWN)
    action.perform()

    # Find the ".csv" entry among the menu items and activate it
    submenus = driver.find_elements_by_class_name('goog-menuitem')
    for submenu in submenus:
        try:
            label = submenu.find_element_by_class_name("goog-menuitem-label")
            aria = label.get_attribute("aria-label")
            if ".csv" in aria:
                action = ActionChains(driver)
                action.move_to_element(submenu)
                action.perform()
                action.send_keys(keys.Keys.ENTER)
                action.perform()
                break
        except Exception:
            pass
    else:
        print("CSV download not available")
        shutil.rmtree(file_dir.as_posix(), ignore_errors=True)
        sys.exit(1)

    # Wait up to 60 seconds for the csv file to show up
    for _ in range(60):
        files = glob.glob((file_dir / "*.csv").as_posix())
        if files:
            csv_path = files[0]
            break
        time.sleep(1)
    else:
        # Without this branch csv_path would be undefined below
        print("CSV download timed out")
        shutil.rmtree(file_dir.as_posix(), ignore_errors=True)
        sys.exit(3)

    # Keep every cell that looks like a link to a Google form
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        reader = csv.reader(csv_file)
        form_links = [
            col for row in reader for col in row
            if col.startswith("https://docs.google.com/forms")
        ]

    if not form_links:
        print("No forms found")
        shutil.rmtree(file_dir.as_posix(), ignore_errors=True)
        sys.exit(2)

    data = {}
    for form_link in form_links:
        driver.get(form_link)
        title = driver.find_element_by_class_name(
            "freebirdFormviewerViewHeaderTitleRow"
        ).text
        data[title] = {}
        fin = False
        while not fin:
            containers = driver.find_elements_by_class_name(
                "freebirdFormviewerViewNumberedItemContainer"
            )
            btns = driver.find_elements_by_css_selector(".appsMaterialWizButtonEl")
            for container in containers:
                try:
                    question = container.find_element_by_class_name(
                        "freebirdFormviewerViewItemsItemItemTitle"
                    )
                except NoSuchElementException:
                    continue

                # find_elements_* returns an empty list instead of raising,
                # so free-text questions simply get an empty answer list
                responses = container.find_elements_by_class_name(
                    "docssharedWizToggleLabeledLabelText"
                )
                data[title][question.text] = [
                    response.text for response in responses
                ]
                rdbtns = container.find_elements_by_class_name(
                    "appsMaterialWizToggleRadiogroupRadioButtonContainer"
                )
                if rdbtns:
                    # Pick any option so that mandatory questions pass
                    random.choice(rdbtns).click()
                    continue

                try:
                    # ".//" restricts the search to this container; a bare
                    # "//textarea" would match the first one on the page
                    text = container.find_element_by_xpath(".//textarea")
                except NoSuchElementException:
                    pass
                else:
                    text.send_keys("I couldn't say")

            # Stop at the page that has the submit button (Spanish UI);
            # otherwise click the last button to move to the next page
            for btn in btns:
                if btn.text == "Enviar":
                    fin = True
                    break
            else:
                btns[-1].click()

    driver.quit()
    shutil.rmtree(file_dir.as_posix(), ignore_errors=True)
    print(data)
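Note that `print(data)` writes the dictionary's Python repr on a single line. One way to get an indented listing like the one that follows is to serialize the dictionary with the standard `json` module. A minimal sketch, where `format_results` is a hypothetical helper name and `sample` is a small excerpt of the results standing in for the full `data` dictionary:

```python
import json


def format_results(data):
    """Serialize the scraped dictionary as indented JSON text."""
    # ensure_ascii=False keeps non-ASCII characters readable instead of
    # turning them into \uXXXX escapes.
    return json.dumps(data, indent=4, ensure_ascii=False)


sample = {
    "Untitled form": {
        "Have you traveled with Airline XYZ pre-pandemic?": ["Yes", "No"],
    }
}
print(format_results(sample))
```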
Output:

    {
        "Air Travel": {
            "How often did you fly before the Covid-19 epidemic?": [
                "Once per year",
                "Between two and five times per year",
                "More than five times per year"
            ],
            "What did you dislike the most about Pre-Covid 19 air travel?": [
                "Long waits at security",
                "Layovers",
                "Entertainment on flight",
                "Cramped seating on the flight",
                "Otro:"
            ],
            "Did you feel safe flying before the Covid-19 pandemic?": [
                "Yes",
                "No"
            ],
            "How long will you wait before flying again, after Covid-19?": [
                "30 Days",
                "30 - 90 Days",
                "90 - 180 Days",
                "More than 180 Days"
            ],
            "Will you fly for pleasure or only when absolutely necessary?": [
                "Pleasure",
                "Only When Absolutely Necessary"
            ],
            "Would a flight with a guaranteed empty seat between you and the person sitting next to you make you more comfortable?": [
                "Yes",
                "No"
            ],
            "What can we do to make you feel as safe as possible while flying with us?": [],
            "Will you be more interested in the cheapest flight or the airline that offers the highest level of protection from disease?": [
                "Cheapest Flight",
                "Level of Protection From Disease",
                "Mix of Both"
            ],
            "Would you take advantage of a new class of seating that offered improved social distancing during the flight as well as complimentary personal protection equipment?": [
                "Yes",
                "No"
            ],
            "Which of these options would you like to see the most of your next flight?": [
                "The latest movies",
                "Free hand sanitizer and wipes",
                "Free meal for longer flight or free drinks on shorter ones",
                "Scheduled check-in time so you can avoid waiting in a large crowd",
                "Otro:"
            ]
        },
        "Travel willingness": {
            "Would you be comfortable travelling within country by airplane? *": [
                "Very uncomfortable",
                "Moderate uncomfortable",
                "Slightly uncomfortable",
                "Neutral",
                "Slightly comfortable",
                "Moderate comfortable",
                "Very comfortable"
            ],
            "Would you be comfortable travelling internationally by airplane? *": [
                "Very uncomfortable",
                "Moderate uncomfortable",
                "Slightly uncomfortable",
                "Neutral",
                "Slightly comfortable",
                "Moderate comfortable",
                "Very comfortable"
            ],
            "Would you be comfortable travelling due to business by airplane? *": [
                "Very uncomfortable",
                "Moderate uncomfortable",
                "Slightly uncomfortable",
                "Neutral",
                "Slightly comfortable",
                "Moderate comfortable",
                "Very comfortable"
            ],
            "Would you be comfortable travelling for leisure by airplane? *": [
                "Very uncomfortable",
                "Moderate uncomfortable",
                "Slightly uncomfortable",
                "Neutral",
                "Slightly comfortable",
                "Moderate comfortable",
                "Very comfortable"
            ],
            "How do you expect the flight ticket price to be changed compared to pre Covid-19 times? *": [
                "Much expensive than before",
                "More expensive than before",
                "Slightly more expensive than before",
                "Same as before",
                "Slightly cheaper than before",
                "Moderate cheaper than before",
                "Much cheaper than before"
            ],
            "If wearing mask is required on airplane, would it make you want to travel by airplane less? *": [
                "Yes",
                "No",
                "Maybe"
            ],
            "Would you expect your body temperature to be taken at the airport? *": [
                "Yes",
                "No",
                "Maybe"
            ],
            "Are you ok with body temperature being taken a couple of times during the flight? *": [
                "Yes",
                "No",
                "Maybe"
            ],
            "Would you expect the flight attendants to hand out hand sanitizers on the flight? *": [
                "Yes",
                "No",
                "Maybe"
            ],
            "What are somethings that airline companies can do to make you feel safer and more comfortable to travel? *": []
        },
        "Untitled form": {
            "Have you traveled with Airline XYZ pre-pandemic?": [
                "Yes",
                "No"
            ],
            "If you have traveled with Airline XYZ pre-pandemic, did you choose low-cost or exclusive flight options?": [
                "low-cost",
                "exclusive",
                "neither"
            ],
            "If you chose low-cost or exclusive flight options, what were your reasons for doing so?": [],
            "Have you traveled with Airline XYZ post-pandemic?": [
                "Yes",
                "No"
            ],
            "If you have traveled with Airline XYZ post-pandemic, are you more or less likely to choose a low-cost option?": [
                "much less likely",
                "less likely",
                "no change",
                "more likely",
                "much more likely"
            ],
            "If you have traveled with Airline XYZ post-pandemic, are you more or less likely to choose an exclusive option?": [
                "much less likely",
                "less likely",
                "no change",
                "more likely",
                "much more likely"
            ],
            "\"I have lost a primary source of income due to COVID-19.\" How true is this for you?": [
                "Not true at all",
                "Mostly untrue",
                "Somewhat true",
                "Mostly true",
                "True"
            ],
            "\"I will be reducing travel to protect myself from COVID-19.\" How true is this for you?": [
                "Not true at all",
                "Mostly untrue",
                "Somewhat true",
                "Mostly true",
                "True"
            ],
            "What role do you believe private business should play regarding the COVID-19 pandemic?": [
                "No role at all",
                "Some role",
                "Large role",
                "Unsure"
            ],
            "To what extent do you believe that Airline XYZ is catering to it's customers during this crisis?": [
                "No extent at all",
                "Mostly no extent",
                "Some extent",
                "A great extent",
                "Unsure"
            ],
            "What offerings, if any, would you be interested in during this crisis?": [
                "more low-cost flights",
                "more exclusive flights",
                "discount on exclusive flights",
                "discount on low-cost flights",
                "Otro:"
            ],
            "What channels would you be interested in hearing future updates about Airline XYZ from?": [
                "Facebook",
                "Twitter",
                "Email",
                "Text",
                "Instagram",
                "Snapchat",
                "Tumblr",
                "Youtube",
                "TikTok"
            ]
        }
    }

The code can be improved at several points: one is how the csv submenu entry is located, another is the way the end of a form is detected, as mentioned above. Also, it has only been tested with the sheet from this question and its forms, so it would surely need more testing to catch problems I have not taken into account.
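On the testing side, one step that can be checked without opening a browser is the link extraction. The sketch below pulls it into a function that accepts any iterable of rows, whether they come from `csv.reader` or, as an alternative to the Selenium download, from gspread. The names `extract_form_links` and `fetch_rows_with_gspread` are hypothetical, the sample URLs are placeholders, and the gspread part is untested here: it assumes gspread is installed and that a service-account key file has access to the sheet.

```python
import csv
import io

FORM_PREFIX = "https://docs.google.com/forms"


def extract_form_links(rows):
    """Return every cell that looks like a Google Forms URL, in order."""
    return [
        cell.strip()
        for row in rows
        for cell in row
        if cell.strip().startswith(FORM_PREFIX)
    ]


def fetch_rows_with_gspread(spreadsheet_url, credentials_path):
    """Alternative to the CSV download (assumption: a service-account
    JSON key at credentials_path can read the spreadsheet)."""
    import gspread  # imported lazily so the rest works without it

    client = gspread.service_account(filename=credentials_path)
    worksheet = client.open_by_url(spreadsheet_url).sheet1
    return worksheet.get_all_values()


# Quick check with an in-memory CSV; the URLs are placeholders.
sample_csv = io.StringIO(
    "name,link\n"
    "Form A,https://docs.google.com/forms/d/e/EXAMPLE_A/viewform\n"
    "Other,https://example.com/not-a-form\n"
)
links = extract_form_links(csv.reader(sample_csv))
print(links)
```

Keeping the extraction separate from the download means the same function can be fed rows from either source.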