Building a Medicine Search Engine from Scratch with Elasticsearch and Python
The rise of Google, along with other freely available search engines, has made our day-to-day online searching easier. According to research, Google receives more than 1 billion health questions every day, i.e. about 7 percent of Google’s daily searches are health-related.
Traditionally, medicine relied solely on a doctor’s discretion: a doctor would suggest suitable medicine based on a patient’s symptoms, and patients had little means to verify or explore that advice on their own. Online medicine information can increase patients’ knowledge of, competence with, and engagement in health decision-making. Independent online medicine inquiries can complement and work in synergy with doctor-patient interactions in the clinic.
This article provides an end-to-end walkthrough of building a medicine search engine from scratch in Python using Elasticsearch.
I will be covering the following topics in this post:
- How I collected the data for the project (web scraping)
- Store the scraped data into Redis and MongoDB
- Data Cleaning and Preprocessing
- Upload the data to Elastic Search
- Build a UI using Flask
- Build a REST API that calls the endpoint for each medicine search
- Make a cURL request for each search and get the details
- Deploy the project on Heroku
1. How I collected the data for the project (web scraping)
The Internet certainly provides a number of resources for finding medical evidence, but we need a website with trustworthy information about drugs that we can scrape. My choice was 1mg, India’s leading online chemist, with over 2 lakh (200,000) medicines available.
Every online store consists of two views of a product:
- List Views
- Product Views
- A list view is a way to display the store’s content when there are lots of products. It is particularly suitable for listing many products per page, so that customers can quickly find what they want without clicking through lots of pages.
- A product view displays a particular product in more detail. Every product on the list view has its own product-view page, which contains more information about that product.
Find The API
With the rise of modern web app frameworks like React and Vue.js, more and more sites use a REST API to send and receive data and then render the final layout on the client side. I was hoping 1mg also exposed a REST API, which would make my work a little easier.
After spending some time on the website, I found the REST API from which the client-side page receives its data.
But how did I know this website was using an API call to display data?
- Many websites are built as single-page applications that load data dynamically. I opened the webpage, inspected it, went to the XHR tab under Network in the browser’s developer tools, and reloaded the page. This listed all the XHR (data-specific) requests, which I then analyzed.
- When I hit the ‘next page’ button, a new API request was made.
- Once I had the API, I tried modifying its parameters, such as the page number and the letter the medicines are ordered by. Getting valid responses back confirmed that I had caught the right API (see the sketch below).
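A minimal sketch of that probe using requests; the endpoint path and parameter names here are placeholders, since the real ones come from whatever request you capture in the Network tab:

```python
import requests

# Hypothetical endpoint and parameter names; copy the real ones
# from the request captured in the browser's Network tab.
API_URL = "https://www.1mg.com/api/v1/list"
params = {"prefix_term": "a", "page": 2}

resp = requests.get(API_URL, params=params,
                    headers={"User-Agent": "Mozilla/5.0"})
print(resp.status_code)    # 200 means the API accepted the modified request
print(resp.json().keys())  # inspect the top-level structure of the response
```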
First Victory!✌
Now it was time to start extracting the data from the source API. I used Scrapy, an open-source scraping framework written in Python and one of the most popular choices for this purpose.
All the medicines are stored in order, and each list-view page includes the number of its last page.
For example, there are 726 list-view pages for the word “A”.
Store the scraped data into Redis and MongoDB
Here comes the main topic: writing the logic to extract each product URL from each list view and store it in a Redis list (see the Architecture — web scraping). Let’s build it!
Redis lists are simply lists of strings, sorted by insertion order. You can push new elements onto the head (the left) or the tail (the right) of a list; pushed on one end and popped from the other, a Redis list behaves as a first-in, first-out (FIFO) queue.
As mentioned above, we will be using the site’s own REST API.
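A sketch of the list-view spider under those assumptions; the endpoint pattern and JSON field names are illustrative placeholders, not 1mg’s actual schema, and a local Redis instance is assumed:

```python
import json

import redis
import scrapy


class ListViewSpider(scrapy.Spider):
    name = "list_view"
    # Placeholder URL pattern; substitute the API captured from the Network tab.
    # 726 list-view pages for the letter "a", as noted above.
    start_urls = [f"https://www.1mg.com/api/v1/list?prefix_term=a&page={p}"
                  for p in range(1, 727)]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redis = redis.Redis(host="localhost", port=6379, db=0)

    def parse(self, response):
        data = json.loads(response.text)
        # "products" and "url" are assumed field names for illustration.
        for product in data.get("products", []):
            # Push each product-view URL onto the head of the Redis list.
            self.redis.lpush("product_urls", product["url"])
```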
Now I had the product-view URL for each medicine in my Redis list.
Now comes the tricky part: writing a spider that takes the URLs from the Redis list, scrapes the necessary fields from the API, and stores them in MongoDB.
Required fields to scrape from the API:
- name — name of the medicine.
- price — price of the medicine.
- pack_size — how many tablets a strip contains.
- uses — what the medicine is used for.
- intro_0 — first detail paragraph about the medicine.
- intro_1 — second detail paragraph about the medicine.
- how_to_use — directions for taking the medicine.
- how_it_works — how the medicine works.
- expert_advice — what experts say about the medicine.
- safety_advice — the risks associated with taking the medicine.
- faqs — frequently asked questions about the medicine.
- lv_url — the list-view URL from which the product-view URL of the drug was scraped.
- url — the product-view URL of the medicine.
We just extracted all the needed information; to store it in MongoDB, we can establish a connection by initializing MongoDBPipeline().
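A minimal sketch of such a Scrapy item pipeline with pymongo, assuming a local MongoDB instance and illustrative database and collection names:

```python
import pymongo


class MongoDBPipeline:
    def open_spider(self, spider):
        # Connection details are assumptions; use your own MongoDB URI.
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["medsearch"]

    def process_item(self, item, spider):
        # Insert each scraped medicine as one document.
        self.db["medicines"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

Remember to enable the pipeline in the project’s settings.py via the ITEM_PIPELINES setting.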
To analyze the data, I used MongoDB Compass, a MongoDB GUI client for managing collections and documents. After analyzing the data in MongoDB, I exported it to CSV format.
Second Victory! ✌
2. Data Cleaning and Preprocessing
Data preprocessing transforms the raw dataset into an understandable format. Although we have extracted the data from the API, it still needs some cleaning before it is ready to index.
Hence, in the preprocessing phase, we do the following, in order:
3. Upload bulk JSON data to Elasticsearch
1. What is Elasticsearch?
Elasticsearch is a distributed, open-source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open-source tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana), the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch.
A long story, right?
No worries, you can think of Elasticsearch as a server that processes JSON requests and gives you back JSON data.
I used a hosted Elasticsearch service, Bonsai. Bonsai provides high-performance Elasticsearch clusters on demand, and its platform makes it easy to set up and manage the open-source tools around Elasticsearch.
2. Connecting to Bonsai Elasticsearch
Bonsai requires basic authentication for all read/write requests. You can get your Bonsai URL and credentials after logging in to its dashboard.
The following sketch (assuming the official elasticsearch Python client) starts integrating Bonsai Elasticsearch into our app:
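```python
from elasticsearch import Elasticsearch

# The full access URL (with username:password) comes from the Bonsai dashboard;
# this value is a placeholder.
BONSAI_URL = "https://user:password@my-cluster-123.eu-west-1.bonsaisearch.net:443"

es = Elasticsearch(BONSAI_URL)
print(es.ping())  # True means the cluster is reachable and authenticated
```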
If you get the response “True”, congrats! You are now connected to Bonsai Elasticsearch.
3. Convert the CSV into Python dict:
As mentioned earlier, Elasticsearch processes JSON requests and returns JSON data, so we need to convert our CSV data into Python dictionaries.
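A sketch of that conversion using pandas; the CSV filename is a placeholder for the file exported from Compass:

```python
import pandas as pd

df = pd.read_csv("medicines.csv")        # the CSV exported from MongoDB Compass
df = df.fillna("")                       # JSON has no NaN, so blank out missing values
records = df.to_dict(orient="records")   # one dict per medicine
print(records[0])
```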
Now we should set the mapping for our index in Elasticsearch. In recent versions of Elasticsearch, this can be handled automatically by dynamic mapping.
Here, though, we’ll specify the mapping for each field in our data explicitly.
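A sketch of that mapping for a few representative fields (the real index maps every field listed earlier); the type choices here are assumptions:

```python
mapping = {
    "mappings": {
        "properties": {
            "name":          {"type": "text"},
            "price":         {"type": "float"},
            "pack_size":     {"type": "text"},
            "uses":          {"type": "text"},
            "expert_advice": {"type": "text"},
            "url":           {"type": "keyword"},  # exact-match field, not analyzed
        }
    }
}

es.indices.create(index="my_med", body=mapping)
```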
We set the index name to “my_med”. If you get the following output, you have successfully created your index on Bonsai Elasticsearch:
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'my_med'}
4. Convert each document in the dictionary into the Elasticsearch bulk format.
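A sketch of that step: wrap each record in the action metadata that the bulk helper expects (using the row position as the document id is an assumption):

```python
def generate_actions(records, index_name="my_med"):
    # Wrap each medicine dict in the metadata that helpers.bulk expects.
    for i, record in enumerate(records):
        yield {
            "_index": index_name,
            "_id": i,           # assumption: the row position works as a document id
            "_source": record,
        }
```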
5. Use Python helpers to bulk load data into an Elasticsearch index
One of the most efficient ways to streamline indexing is the helpers.bulk method. Indexing large datasets without loading them entirely into memory is the key to expediting the process and saving system resources.
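A sketch of the bulk load, feeding the generator from the previous step to helpers.bulk:

```python
from elasticsearch import helpers

# Streams the documents to the cluster without materializing them all in memory.
success, errors = helpers.bulk(es, generate_actions(records))
print(f"Indexed {success} documents")
```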
One limitation of Bonsai’s free plan is that it allows 125 MB of data and 10k documents per cluster. We can increase the memory and document limits by upgrading to one of their paid plans.
As a result, we don’t have all the data in the Bonsai Elasticsearch cluster.
Third Victory!✌
4. Build a Flask app
Let’s build a Flask app that takes the user’s input and searches as the user types the name of a medicine.
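A minimal skeleton of the app; the template name is an assumption:

```python
from flask import Flask, render_template

app = Flask(__name__)


@app.route("/")
def home():
    # Renders the search page with the search bar (template name assumed).
    return render_template("index.html")


if __name__ == "__main__":
    app.run(debug=True)
```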
Build a REST API that calls the endpoint for each medicine search
Since we have a search bar, whenever a user searches for any medicine, the web app should suggest related keywords as they type. I made an API that runs a match_phrase or wildcard regex query against the Elasticsearch database (here, I used the match_phrase_prefix query).
For each keystroke in the search bar, it runs the match_phrase_prefix query against the database and shows the results as autocomplete suggestions in a drop-down.
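A sketch of the autocomplete endpoint, reusing the app and es objects from earlier; the route, parameter, and field names are assumptions:

```python
from flask import jsonify, request


@app.route("/suggest")
def suggest():
    term = request.args.get("q", "")
    body = {
        "query": {"match_phrase_prefix": {"name": term}},
        "_source": ["name"],  # only the name is needed for suggestions
        "size": 10,
    }
    res = es.search(index="my_med", body=body)
    suggestions = [hit["_source"]["name"] for hit in res["hits"]["hits"]]
    return jsonify(suggestions)
```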
Make a cURL request for each search to get the details:
Whenever the user searches for a medicine, our Flask app makes a cURL-style request to the Bonsai Elasticsearch database and shows the details of that drug in a clean format.
The details include:
- Name of the medicine
- What the medicine is used for
- More detail about its components
- Pack type and size
- Price of the drug in rupees
- Expert advice
- Side effects of the drug
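A sketch of the detail lookup behind this page, written as the programmatic equivalent of that cURL request; BONSAI_URL comes from the connection step, and the helper name is an assumption:

```python
import requests


def get_medicine_details(name):
    query = {"query": {"match_phrase": {"name": name}}}
    # Programmatic equivalent of:
    # curl -XGET "$BONSAI_URL/my_med/_search" -H 'Content-Type: application/json' -d '{...}'
    resp = requests.get(f"{BONSAI_URL}/my_med/_search", json=query)
    hits = resp.json()["hits"]["hits"]
    return hits[0]["_source"] if hits else None
```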
Deploy the project on Heroku
Lastly, let’s deploy all our code to Heroku.
Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud.
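To run a Flask app, Heroku needs a Procfile that tells it how to start the web process. A typical one, assuming the Flask object is named app inside app.py and gunicorn is listed in requirements.txt:

```
web: gunicorn app:app
```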
After deploying our app to the cloud, it is publicly accessible.
Project URL: https://osso-shritam.herokuapp.com/
GitHub Repo for the project!
I hope you find this project helpful! For any feedback, or if you just want to see my work and projects, reach out to me at www.shritam.com.