How to organize Web Scraping data?

jcferreira · June 16

Greetings, everyone.

I'm a Python beginner working on a project to capture data from a sports betting site. I managed to capture the data I need, but I'm having trouble organizing it. The data comes in blocks, each containing the time, competition, teams, odds, etc. I'd like to organize it the way the site does (each block on a single line). Is there a way to display the data in the Python terminal in a layout similar to the site's: betwatch.fr/money?

Kleverson Cuzzuol Lopes · June 16

Hi @jcferreira, how are you?

How are you storing this data: in txt or csv format? Or is it only on-screen output with print()? Give us more information.

jcferreira · June 16

Good afternoon, Kleverson. I'm fine, and you?

It's only on-screen output with print().

I located the elements by XPath and wrote a for loop to print each one's .text to the screen.

games = chrome.find_elements(By.XPATH, '/html/body/div/div/div[@class="container"]/div[@id="matchs"]/div')

for game in games:
    chrome.find_element(By.XPATH, '/html/body/div/div/div[@class="container"]/div[@id="matchs"]/div/div[@class="match-issues slider"]/section/div/div[@class="issue-header"]')
    print(game.text)

Edited June 16 by jcferreira
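
Since `game.text` comes back as one multi-line string per block, a minimal way to get one line per match is to join its lines with a separator. The sketch below uses a hard-coded sample string in place of `game.text`; the helper name `block_to_line` and the ` | ` separator are illustrative assumptions, not something from the original post.

```python
def block_to_line(block_text: str, sep: str = " | ") -> str:
    """Collapse a multi-line block into a single line, skipping blank lines."""
    parts = [line.strip() for line in block_text.splitlines() if line.strip()]
    return sep.join(parts)


# Sample text standing in for game.text (illustrative values only)
sample = """16:00
International : UEFA Euro 2024
Serbia - England
1
138 902€
8%
7.6"""

print(block_to_line(sample))
```

Inside the scraping loop, the same helper could be called as `print(block_to_line(game.text))`.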

Ryan Zimerman Leite · June 17

You can use regex to organize this into blocks. Here is an example:

import re

# Sample text
text = """
16:00
International : UEFA Euro 2024
Serbia - England
1
138 902€
8%
7.6
X
158 625€
9%
4.6
2
1 444 253€
83%
1.53
"""

# Regex to capture the data
regex = re.compile(r"""
    (?P<time>\d{2}:\d{2})\s*
    (?P<event>[^\n]+)\s*
    (?P<teams>[^\n]+)\s*
    (?P<section1>[^\n]+)\s*
    (?P<one_value>[\d\s]+€)\s*
    (?P<one_percentage>\d+%)\s*
    (?P<one_change>[\d.]+)\s*
    (?P<sectionX>[^\n]+)\s*
    (?P<x_value>[\d\s]+€)\s*
    (?P<x_percentage>\d+%)\s*
    (?P<x_change>[\d.]+)\s*
    (?P<section2>[^\n]+)\s*
    (?P<two_value>[\d\s]+€)\s*
    (?P<two_percentage>\d+%)\s*
    (?P<two_change>[\d.]+)
    """, re.VERBOSE)

# Apply the regex to the text
match = regex.search(text)

if match:
    data = match.groupdict()

    # Organize the data into a structured format
    result = f"""
    {data['time']} | {data['event']} | {data['teams']}
    {data['section1']}: {data['one_value']} ({data['one_percentage']}) {data['one_change']}
    {data['sectionX']}: {data['x_value']} ({data['x_percentage']}) {data['x_change']}
    {data['section2']}: {data['two_value']} ({data['two_percentage']}) {data['two_change']}
    """

    print(result)
else:
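    print("No data found")

Result:

    16:00 | International : UEFA Euro 2024 | Serbia - England
    1: 138 902€ (8%) 7.6
    X: 158 625€ (9%) 4.6
    2: 1 444 253€ (83%) 1.53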
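
The example above prints each match over four lines, while the original question asked for one line per block. As a rough sketch, the dictionary returned by `match.groupdict()` could be rendered as a single row with fixed-width columns. The function name `format_row` and the column widths are assumptions chosen for illustration; the sample dictionary simply repeats the values from the example.

```python
def format_row(data: dict) -> str:
    """Render one match as a single line with padded columns."""
    cells = [data["time"], f"{data['event']:<32}", f"{data['teams']:<20}"]
    # Each outcome uses a label key plus a common prefix for its three values
    for label_key, prefix in (("section1", "one"), ("sectionX", "x"), ("section2", "two")):
        label = data[label_key]
        value = data[prefix + "_value"]
        pct = data[prefix + "_percentage"]
        odds = data[prefix + "_change"]
        cells.append(f"{label}: {value:>11} {pct:>4} {odds:>5}")
    return " | ".join(cells)


# Same keys as the named groups in the regex above
sample = {
    "time": "16:00",
    "event": "International : UEFA Euro 2024",
    "teams": "Serbia - England",
    "section1": "1", "one_value": "138 902€", "one_percentage": "8%", "one_change": "7.6",
    "sectionX": "X", "x_value": "158 625€", "x_percentage": "9%", "x_change": "4.6",
    "section2": "2", "two_value": "1 444 253€", "two_percentage": "83%", "two_change": "1.53",
}

print(format_row(sample))
```

Padding widths may need adjusting for longer competition or team names.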

jcferreira · June 17

Good afternoon, Ryan, and thanks for the reply. So I would use the regex after pulling the data from the site with the for loop?

Ryan Zimerman Leite · June 18

Good afternoon @jcferreira. Yes, the regex works inside your for loop: for each block of text you pull in the loop, it searches for the pattern and, if it finds it, structures the data for you.
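
The idea of running the regex inside the for loop can be sketched as follows. The pattern is the one from the earlier example, compiled once outside the loop; the function name `parse_blocks` is an assumption, and plain strings stand in for the `game.text` values so the sketch runs without a browser.

```python
import re

# Same pattern as in the earlier example, compiled once outside the loop
BLOCK_RE = re.compile(r"""
    (?P<time>\d{2}:\d{2})\s*
    (?P<event>[^\n]+)\s*
    (?P<teams>[^\n]+)\s*
    (?P<section1>[^\n]+)\s*
    (?P<one_value>[\d\s]+€)\s*
    (?P<one_percentage>\d+%)\s*
    (?P<one_change>[\d.]+)\s*
    (?P<sectionX>[^\n]+)\s*
    (?P<x_value>[\d\s]+€)\s*
    (?P<x_percentage>\d+%)\s*
    (?P<x_change>[\d.]+)\s*
    (?P<section2>[^\n]+)\s*
    (?P<two_value>[\d\s]+€)\s*
    (?P<two_percentage>\d+%)\s*
    (?P<two_change>[\d.]+)
    """, re.VERBOSE)


def parse_blocks(texts):
    """Return one dict per text that matches; non-matching texts are skipped."""
    rows = []
    for text in texts:
        match = BLOCK_RE.search(text)
        if match:
            rows.append(match.groupdict())
    return rows


# Plain strings standing in for the game.text values from the loop
texts = [
    "16:00\nInternational : UEFA Euro 2024\nSerbia - England\n"
    "1\n138 902€\n8%\n7.6\nX\n158 625€\n9%\n4.6\n2\n1 444 253€\n83%\n1.53",
    "text that does not follow the expected layout",
]

for row in parse_blocks(texts):
    print(row["time"], row["teams"], row["one_change"], row["x_change"], row["two_change"])
```

With Selenium, the list could be built as `[game.text for game in games]` before calling `parse_blocks`.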

Ryan Zimerman Leite · June 18

Post your code here so I can help you better.

jcferreira · June 18

Good evening, Ryan. Here is the code. My difficulty is using the game variable to carry on with organizing the data.

import os
import re
import requests
from bs4 import BeautifulSoup
import time
import warnings

from selenium import webdriver
from selenium.webdriver import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

from webdriver_manager.chrome import ChromeDriverManager

warnings.filterwarnings('ignore')
options = webdriver.ChromeOptions()
#
# options.add_argument('--headless')
# options.add_argument('--no-sandbox')
# options.add_argument('--disable-dev-shm-usage')

service = Service(ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)

chrome.implicitly_wait(3)
print('\n')
print('List of Games')
print('\n')

# Avoid being detected as automation
url = 'https://betwatch.fr/money'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}

chrome.get(url)

time.sleep(5)

chrome.find_element(By.XPATH, '//*[@id="refresh-dropdown"]').click()

time.sleep(2)

chrome.find_element(By.XPATH, '//*[@id="refresh-dropdown"]/option[2]').click()

time.sleep(3)

# = chrome.find_element(By.XPATH, '//*[@id="matchs"]')

last_height = chrome.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to the bottom
    chrome.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for page to load
    time.sleep(3)

    # Calculate new scroll height and compare with last scroll height
    new_height = chrome.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

games = chrome.find_elements(By.XPATH, '/html/body/div/div/div[@class="container"]/div[@id="matchs"]/div')

for game in games:
    chrome.find_element(By.XPATH, '/html/body/div/div/div[@class="container"]/div[@id="matchs"]/div/div[@class="match-issues slider"]/section/div/div[@class="issue-header"]')
    print(game.text)
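
In the loop above, `find_element` is called on the driver (`chrome`) with an absolute XPath, so it searches the whole page on every iteration and returns the first matching element each time; its result is also discarded. Calling `find_element` on `game` with a path that starts with `.` restricts the search to that block. The sketch below illustrates the same distinction with the standard library's `xml.etree.ElementTree` on a made-up snippet, so it runs without a browser; the markup, class names and team names are assumptions, not the site's real structure.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the page; not the real site structure
page = ET.fromstring("""<div id="matchs">
  <div class="match"><span class="teams">Serbia - England</span></div>
  <div class="match"><span class="teams">Team A - Team B</span></div>
</div>""")

games = page.findall("./div")

# Searching from the root inside the loop returns the first match every time
from_root = [page.find(".//span[@class='teams']").text for _ in games]

# Searching relative to each block returns that block's own element
from_each = [game.find("./span[@class='teams']").text for game in games]

print(from_root)
print(from_each)
```

In Selenium, the relative form would look like `game.find_element(By.XPATH, './/div[@class="issue-header"]')`, with the result assigned to a variable before reading its `.text`.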

Ryan Zimerman Leite · June 20

import os
import re
import requests
from bs4 import BeautifulSoup
import time
import warnings

from selenium import webdriver
from selenium.webdriver import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from webdriver_manager.chrome import ChromeDriverManager

warnings.filterwarnings('ignore')
options = webdriver.ChromeOptions()
#
# options.add_argument('--headless')
# options.add_argument('--no-sandbox')
# options.add_argument('--disable-dev-shm-usage')

service = Service(ChromeDriverManager().install())
chrome = webdriver.Chrome(service=service)

chrome.implicitly_wait(3)
print('\n')
print('List of Games')
print('\n')

# Avoid being detected as automation
url = 'https://betwatch.fr/money'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}

chrome.get(url)

# Wait for the page to load
WebDriverWait(chrome, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="refresh-dropdown"]')))

# Click on refresh dropdown
chrome.find_element(By.XPATH, '//*[@id="refresh-dropdown"]').click()

# Wait for refresh options to appear
WebDriverWait(chrome, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="refresh-dropdown"]/option[2]')))

# Click on "Live" option
chrome.find_element(By.XPATH, '//*[@id="refresh-dropdown"]/option[2]').click()

# Wait for the page to update
WebDriverWait(chrome, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="matchs"]')))

# Scroll to the bottom of the page
last_height = chrome.execute_script("return document.body.scrollHeight")
while True:
    chrome.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # Adjust the sleep duration as needed
    new_height = chrome.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

games = chrome.find_elements(By.XPATH, '/html/body/div/div/div[@class="container"]/div[@id="matchs"]/div')

# Capture data using regex
for game in games:
    # Explicitly wait for the "issue-header" element to be present
    WebDriverWait(chrome, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@class="issue-header"]')))

    # Find the specific element that contains the relevant data
    match_element = game.find_element(By.XPATH, './div[@class="match-issues slider"]/section/div/div[@class="issue-header"]')
    text = match_element.text

    # Regex to capture data from the text
    regex = re.compile(r"""
        (?P<time>\d{2}:\d{2})\s*
        (?P<event>[^\n]+)\s*
        (?P<teams>[^\n]+)\s*
        (?P<section1>[^\n]+)\s*
        (?P<one_value>[\d\s]+€)\s*
        (?P<one_percentage>\d+%)\s*
        (?P<one_change>[\d.]+)\s*
        (?P<sectionX>[^\n]+)\s*
        (?P<x_value>[\d\s]+€)\s*
        (?P<x_percentage>\d+%)\s*
        (?P<x_change>[\d.]+)\s*
        (?P<section2>[^\n]+)\s*
        (?P<two_value>[\d\s]+€)\s*
        (?P<two_percentage>\d+%)\s*
        (?P<two_change>[\d.]+)
        """, re.VERBOSE)

    # Applying the regex to the text
    match = regex.search(text)

    if match:
        data = match.groupdict()

        # Organizing data in a structured format
        result = f"""
        {data['time']} | {data['event']} | {data['teams']}
        {data['section1']}: {data['one_value']} ({data['one_percentage']}) {data['one_change']}
        {data['sectionX']}: {data['x_value']} ({data['x_percentage']}) {data['x_change']}
        {data['section2']}: {data['two_value']} ({data['two_percentage']}) {data['two_change']}
        """
        print(result)
    else:
        print("No data found")

chrome.quit()

jcferreira · Friday at 00:50

Good evening, Ryan. I ran your code exactly as it is. It printed the "no data found" message.

I made a few small changes, and it then returned data for only one match, repeated.
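
One possible explanation for the "no data found" result, not confirmed anywhere in the thread: the pattern only matches text that begins with the time, event and teams lines, while the code applies it to the text of the inner `issue-header` element, which may not contain them. A quick way to check is to print `repr(text)` whenever the match fails. The sketch below uses a shortened header-only pattern and two hypothetical texts; the names `HEADER_RE` and `parse_header` are assumptions.

```python
import re

# Shortened version of the pattern from the thread: it only matches
# when the text contains the time, event and teams lines
HEADER_RE = re.compile(r"(?P<time>\d{2}:\d{2})\s*(?P<event>[^\n]+)\s*(?P<teams>[^\n]+)")


def parse_header(text):
    """Return the header fields, or None after showing what was received."""
    match = HEADER_RE.search(text)
    if match is None:
        # repr() makes newlines and stray whitespace visible while debugging
        print("no match for:", repr(text))
        return None
    return match.groupdict()


# Hypothetical texts: a whole block versus only its outcome part
whole_block = "16:00\nInternational : UEFA Euro 2024\nSerbia - England\n1\n138 902€\n8%\n7.6"
outcome_only = "1\n138 902€\n8%\n7.6"

print(parse_header(whole_block))
print(parse_header(outcome_only))
```

If the printed text turns out to lack the header lines, applying the regex to `game.text` rather than to the inner element's text may be worth trying; that is an untested suggestion.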

Edited Friday at 01:52 by jcferreira
