Environment Prepration

create python environment that has same version with the python runtime located on the server using conda (recommended as it can maange multiple python environment unline python virtual env).

conda create --name ayo-makaryo python=3.12.12

in this case, python 3.12.12 will be choosen.

activate the recently created conda project

conda activate ayo-makaryo

Ensure the python and pip command is pointed to right conda environment directory using following command

which pip
which python

if “pip” and “python” still pointed to the system wide python, try to restart the terminal and re-active the conda environment and also clear the terminal cache.

Both pip and python command already pointed to correct conda environment folder

one last checking: if visual studio code(VSC) is used as the development IDE, ensure the python runtime for the VSC is pointed to the recently created conda environment otherwise the IDE will throw linting error and the autocomplete won’t working.

image vsc has been repointed to the correct conda environment

Installing Beautiful Soup Library for HTML Parsing

there are various python html-parser available on the market. The most used and widely adopted is called beautiful-soup due to the light weight performance. Beautiful soup is a wrapper for several html parser such as : html.parser, lxml and html5lib. It allow the developer to all three packages depending on the scenario.

Beautiful-soup has straight forward syntax. So its easy to use.

To achive best performance beautiful soup + lxml is recommended because lxml is written using C libraries. hence its has the fastest performance among the three. However, the size of to-be parsed html file will also determine whether lxml is required. If the file is just less than 2 mb, built in html.parser suppose to be enough.

use following command to install beautiful soup on the conda project using pip (dont forget to activate the conda environment before executing the command)

pip install beautifulsoup4

pip is the prefereable way to install beautifulsoup4 unless lxml is going to be used which require C libraries.

validate the beautifulsoup installation by executing following command

pip list | grep soup

image beautiful soup is successfully installed as it show on the active pip libaries list

Parse the HTML File

install lxml (assuming the size of HTML is greater than 2 mb)

pip install lxml

create a blank python file called extract.py

touch extract

load the html file

from bs4 import BeautifulSoup

with open("./s_output/target_file.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "lxml")

locate target html elements

items = soup.select('.target-class')

in the the dom has multiple target-class, it can be iterated

for item in items:
	urn_value = item['some-parameter']

can also create a new selector without re initializing new beautifulsoup inside the target-class

    job_title_element = item.select_one('.child-of-target-class')

get only text value by removing all html syntax (useful for scrapping)

location_element.text.strip()

note: its recommended to only execute .text command if only if the element exist. Hence, its better to separate the selector and getText command

child_element = item.select_one('.child-of-target-class')
child_element_text_value = child_element.text.strip() if child_element else "child element of target class not found"

below is the sample of parsing in action using beautiful soup with LXML.

HTML parsing is success

Parse Html File From Landing Zone Using Python Into Bronze Create The Staging File

Environment Prepration

Installing Beautiful Soup Library for HTML Parsing

Parse the HTML File