
NnuReader's Posts


Programming / Re: Scraping Jiji Ideas by nnuReader: 11:39am On Feb 28, 2023
Here is a Python script that scrapes all the data, including phone numbers, in less than an hour.

The script works by fetching data directly from the Jiji API endpoint and paginating through the results:

https:///api_web/v1/listing?slug=X&webp=true&page=Y
where X is the category (vehicles, real-estate, ...) you want to scrape and Y is the page number (23 products are returned per page).
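As a rough sketch, the URL for any category and page could be built like this. The host is missing from the URL as posted, so `base_url` here is a placeholder parameter, and `listing_url` is a made-up helper name, not something from the script.

```python
from urllib.parse import urlencode


def listing_url(base_url, slug, page):
    """Build the listing URL for one page of one category.

    base_url is a placeholder: the host is not given in the post.
    """
    query = urlencode({"slug": slug, "webp": "true", "page": page})
    return "%s/api_web/v1/listing?%s" % (base_url.rstrip("/"), query)


# "https://example.invalid" is a stand-in host, not the real one.
url = listing_url("https://example.invalid", "real-estate", 2)
print(url)
```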

You just change the slug when you're done scraping a category,
and for every category you keep increasing the page number while you save the info and check for duplicates (a vendor can appear multiple times because of multiple product uploads).
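The duplicate check described above can be sketched on its own, without any network calls. This version assumes each advert is a dict with a `user_id` key (as in the script) and uses a set instead of a list for the seen IDs; the function name and the sample data are made up for illustration.

```python
def collect_new_vendors(pages, seen_ids):
    """Return adverts from vendors not seen before; seen_ids is updated in place.

    pages: iterable of lists of advert dicts, one list per API page.
    seen_ids: set of user_id values that are already saved.
    """
    new_adverts = []
    for adverts in pages:
        for advert in adverts:
            user_id = advert["user_id"]
            if user_id not in seen_ids:
                seen_ids.add(user_id)
                new_adverts.append(advert)
    return new_adverts


# Made-up sample data: vendor 1 uploaded two products.
pages = [
    [{"user_id": 1, "title": "Car A"}, {"user_id": 2, "title": "Car B"}],
    [{"user_id": 1, "title": "Car C"}, {"user_id": 3, "title": "Car D"}],
]
seen = set()
fresh = collect_new_vendors(pages, seen)
print([a["title"] for a in fresh])  # ['Car A', 'Car B', 'Car D']
```

A set makes each membership check constant time, which starts to matter once thousands of vendors have been saved.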

This approach is much faster than driving a browser with Puppeteer or Selenium, or downloading HTML pages and parsing them with Beautiful Soup, because you only fetch the JSON data and never load irrelevant files like CSS, JS, images, or HTML.

You can run the script from the command line like this:

python3 scrape.py vehicles Vehicles

The above scrapes the vehicles category.

python3 scrape.py real-estate Properties

The above scrapes the real-estate category.
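For anyone who prefers it, the same two arguments could also be read with `argparse` instead of `sys.argv`. This is only a sketch of an alternative, not what the script does; `parse_cli` is a made-up name, and the category list is shortened to a few of the slugs the script accepts.

```python
import argparse

# A few of the category slugs accepted by the script (shortened for the example).
ACCEPTED_TYPES = ["vehicles", "real-estate", "mobile-phones-tablets", "electronics"]


def parse_cli(argv):
    """Parse the category slug and the optional display name."""
    parser = argparse.ArgumentParser(prog="scrape.py")
    parser.add_argument("category", choices=ACCEPTED_TYPES,
                        help="category slug to scrape")
    parser.add_argument("name", nargs="?", default=None,
                        help="optional display name, defaults to the slug")
    args = parser.parse_args(argv)
    if args.name is None:
        args.name = args.category
    return args


args = parse_cli(["vehicles", "Vehicles"])
print(args.category, args.name)
```

With `choices`, an unknown category makes argparse print the accepted values and exit, which replaces the manual check.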

If you need more info, mail me at hello@feyitech.com


The Script:

import sys
import time

import requests

# Helper modules from the author's project; they are not included in this post.
from common import get_profile_id_list_and_profiles, update_profiles, dict_to_profile_row
from coded_addesses import address_for_fresh, address_for_new, address_for_slider

S = requests.Session()

SCRAPE_TYPES = {
    "fresh": "fresh",
    "new": "new",
    "slider": "slider"
}

ACCEPTED_TYPES = [
    'vehicles', 'real-estate', 'mobile-phones-tablets',
    'electronics', 'home-garden', 'health-and-beauty',
    'fashion-and-beauty', 'hobbies-art-sport', 'seeking-work-cvs',
    'services', 'babies-and-kids', 'animals-and-pets',
    'agriculture-and-foodstuff', 'office-and-commercial-equipment-tools',
    'repair-and-construction'
]

MAX_FAILURES = 3  # give up after this many failed requests in a row

if len(sys.argv) < 2 or sys.argv[1] not in ACCEPTED_TYPES:
    print('No valid category specified.\n\nExample: "python3 scrape.py vehicles"\n\nAccepted categories are: %s' % ", ".join(ACCEPTED_TYPES))
else:
    category = sys.argv[1]
    # Optional second argument: a display name for the category.
    name = sys.argv[2] if len(sys.argv) > 2 else category
    profile_id_list_and_profiles = get_profile_id_list_and_profiles()
    #print(profile_id_list_and_profiles[1])
    if profile_id_list_and_profiles is not None:
        def get_address(page):
            # The host is missing from the URL as posted; fill it in before running.
            return "https:///api_web/v1/listing?slug=%s&webp=true&page=%d" % (category, page)

        # IDs of vendors that are already saved, used to skip duplicates.
        profile_id_list = profile_id_list_and_profiles[0]
        total_pages = 0
        page = 1
        total_new_profiles = 0
        failures = 0
        keep_running = True
        while keep_running:
            res = S.get(get_address(page))
            # Parse the JSON body once, and only for a successful response.
            body = res.json() if res.status_code == 200 else None

            if body is not None and body.get("status") == "ok":
                failures = 0
                data = body["adverts_list"]
                adverts = data["adverts"]
                total_pages = data["total_pages"]
                print("Count: %d | Size: %d | Page: %d" % (data["count"], len(adverts), page))
                new_profiles = []
                for p in adverts:
                    if p["user_id"] not in profile_id_list:
                        new_profiles.append(p)
                        profile_id_list.append(p["user_id"])
                update_profiles(new_profiles)
                total_new_profiles += len(new_profiles)
                page += 1
            else:
                failures += 1
                print("Error: %d\n" % res.status_code)

            # Stop after the last page, or when the same page keeps failing
            # (a failure on the very first request also stops, since total_pages is still 0).
            if page > total_pages or failures >= MAX_FAILURES:
                keep_running = False
            else:
                time.sleep(1.5)
        print("TotalNewEntry: %d" % total_new_profiles)
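The script imports `get_profile_id_list_and_profiles` and `update_profiles` from a `common` module that is not shown in the post. If you want to try the script without it, a stand-in could look something like the sketch below. Everything here is an assumption: the JSON Lines storage, the `profiles.jsonl` file name, and the `path` parameter are invented for illustration and are not the author's implementation.

```python
import json
import os
import tempfile

PROFILES_PATH = "profiles.jsonl"  # assumed file name, one JSON object per line


def get_profile_id_list_and_profiles(path=PROFILES_PATH):
    """Return (list of saved user_ids, list of saved profiles).

    Returns two empty lists when nothing has been saved yet.
    """
    profiles = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            profiles = [json.loads(line) for line in f if line.strip()]
    ids = [p["user_id"] for p in profiles]
    return ids, profiles


def update_profiles(new_profiles, path=PROFILES_PATH):
    """Append newly found profiles to the storage file."""
    with open(path, "a", encoding="utf-8") as f:
        for profile in new_profiles:
            f.write(json.dumps(profile) + "\n")


# Quick round trip in a temporary directory with made-up data.
demo_path = os.path.join(tempfile.mkdtemp(), "profiles.jsonl")
update_profiles([{"user_id": 1, "title": "Car A"}], demo_path)
ids, profiles = get_profile_id_list_and_profiles(demo_path)
print(ids)
```

JSON keeps `user_id` in whatever type the API returned, so the duplicate check in the script compares like with like after a restart.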
