Programming / Re: Scraping Jiji Ideas by nnuReader: 11:39am On Feb 28, 2023 |
Here is a Python script to scrape all the data, including phone numbers, in less than an hour. The script works by fetching data directly from the Jiji API endpoint and paginating through it:

https:///api_web/v1/listing?slug=X&webp=true&page=Y

where X is the category (vehicles, real-estate...) you want to scrape and Y is the page number (23 products are returned per page). For every category, you keep increasing the page while you save the info and check for duplicates (a vendor can appear multiple times because of multiple product uploads). When you're done scraping a category, you change the slug to the next one.

This approach is miles faster than using tools like Puppeteer, Selenium or Beautiful Soup because you're not loading irrelevant files like CSS, JS, images, HTML...

You can run the script in CMD like the following:

    python3 scrape.py vehicles Vehicles

The above scrapes the vehicles category.

    python3 scrape.py real-estate Properties

The above scrapes real estate.

If you need more info, mail me at hello@feyitech.com

The Script:

    import sys
    import time

    import requests

    # common and coded_addesses are local helper modules (not shown in this post).
    from common import get_profile_id_list_and_profiles, update_profiles, dict_to_profile_row
    from coded_addesses import address_for_fresh, address_for_new, address_for_slider

    S = requests.Session()

    SCRAPE_TYPES = {
        "fresh": "fresh",
        "new": "new",
        "slider": "slider"
    }

    ACCEPTED_TYPES = [
        'vehicles', 'real-estate', 'mobile-phones-tablets', 'electronics',
        'home-garden', 'health-and-beauty', 'fashion-and-beauty',
        'hobbies-art-sport', 'seeking-work-cvs', 'services', 'babies-and-kids',
        'animals-and-pets', 'agriculture-and-foodstuff',
        'office-and-commercial-equipment-tools', 'repair-and-construction'
    ]

    # Stop retrying a page after this many consecutive failed requests.
    MAX_FAILURES = 5

    if len(sys.argv) < 2 or sys.argv[1] not in ACCEPTED_TYPES:
        print('No valid category specified.\n\nExample: "python3 scrape.py vehicles"\n\n'
              'Accepted categories are: %s' % ", ".join(ACCEPTED_TYPES))
    else:
        category = sys.argv[1]  # named "category" so the built-in type() is not shadowed
        name = category         # optional second argument; not used further in this script
        if len(sys.argv) > 2:
            name = sys.argv[2]

        profile_id_list_and_profiles = get_profile_id_list_and_profiles()
        #print(profile_id_list_and_profiles[1])

        if profile_id_list_and_profiles is not None:

            def get_address(page):
                # The host is missing from the URL as posted.
                return "https:///api_web/v1/listing?slug=%s&webp=true&page=%d" % (category, page)

            profile_id_list = profile_id_list_and_profiles[0]
            keep_running = True
            total_pages = 0
            page = 1
            total_new_profiles = 0
            failures = 0

            while keep_running:
                res = S.get(get_address(page))
                # Parse the body once, and only when the request succeeded.
                body = res.json() if res.status_code == 200 else None
                if body is not None and body.get("status") == "ok":
                    failures = 0
                    new_profiles = []
                    data = body["adverts_list"]
                    adverts = data["adverts"]
                    total = len(adverts)
                    counts = data["count"]
                    total_pages = data["total_pages"]
                    #print(adverts)
                    print("Count: %d | Size: %d | Page: %d" % (counts, total, page))
                    for p in adverts:
                        # Keep only vendors that have not been saved before.
                        if p["user_id"] not in profile_id_list:
                            new_profiles.append(p)
                            profile_id_list.append(p["user_id"])
                            #print("phone:", p["id"])
                    update_profiles(new_profiles)
                    total_new_profiles = total_new_profiles + len(new_profiles)
                    page = page + 1
                else:
                    # The page number is left unchanged, so the same page is retried.
                    print("Error: %d\n" % res.status_code)
                    failures = failures + 1

                # "page > total_pages" (not ">=") so the last page is fetched too.
                if page > total_pages or failures >= MAX_FAILURES:
                    keep_running = False
                else:
                    time.sleep(1.5)

            print("TotalNewEntry: %d" % total_new_profiles)
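The paginate-and-check-for-duplicates idea can also be tried offline before pointing it at the real endpoint. Below is a minimal sketch, not the script above: `collect_new_adverts`, `fetch_page` and `FAKE_PAGES` are hypothetical names made up for illustration, and the fake pages only mimic the `adverts`, `user_id` and `total_pages` fields that the script reads. It uses a set instead of a list for the seen vendors, which is an assumption about what is convenient, not something the original helpers require.

```python
def collect_new_adverts(fetch_page, known_user_ids):
    """Walk every page and keep the first advert of each unseen vendor.

    fetch_page(page) is a hypothetical callable that returns a dict with
    "adverts" (a list of dicts carrying "user_id") and "total_pages".
    """
    seen = set(known_user_ids)  # set membership checks are O(1)
    new_adverts = []
    page = 1
    total_pages = 1  # corrected after the first response
    while page <= total_pages:  # "<=" so the last page is included
        data = fetch_page(page)
        total_pages = data["total_pages"]
        for advert in data["adverts"]:
            if advert["user_id"] not in seen:
                seen.add(advert["user_id"])
                new_adverts.append(advert)
        page += 1
    return new_adverts


# Offline stand-in for the API: two pages, vendor 7 appears on both.
FAKE_PAGES = {
    1: {"total_pages": 2, "adverts": [{"user_id": 7}, {"user_id": 8}]},
    2: {"total_pages": 2, "adverts": [{"user_id": 7}, {"user_id": 9}]},
}

# Vendor 8 is already known, so only vendors 7 and 9 are new.
result = collect_new_adverts(FAKE_PAGES.__getitem__, known_user_ids=[8])
print([a["user_id"] for a in result])  # prints [7, 9]
```

To use it against a live source, you would pass a function that does the HTTP request and returns the parsed listing for that page, and keep the sleep between requests as in the script.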