Creating Real-Time API with Beautiful Soup and Django REST Framework

A few weeks ago, I became interested in trading and found that most companies offer only paid services for analyzing forex data. My objective was to apply some ML algorithms to predict the market, so I decided to create a real-time API that I could consume from React and use to test my own automated strategies.

By the end of this tutorial, you'll be able to turn any website into an API without using any online service. We will mainly use Beautiful Soup and Django REST Framework to build a real-time API by crawling forex data.

You’ll need a basic understanding of Django and Ubuntu to run some important commands. If you’re using other operating systems, you can download Anaconda to make your work easier.

Installation and Configuration

To get started, create and activate a virtual environment with the following commands:
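For example, with Python 3's built-in venv module (the environment name env is just an example):

```bash
python3 -m venv env
source env/bin/activate
```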

Once the environment is activated, install Django and Django REST Framework:
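A minimal install via pip:

```bash
pip install django djangorestframework
```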

Now create a new project named trading and, inside it, an app named forexAPI:
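Assuming django-admin is available in the activated environment:

```bash
django-admin startproject trading
cd trading
python manage.py startapp forexAPI
```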

Then open settings.py and update the INSTALLED_APPS configuration:

settings.py
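A sketch of the relevant part; the default Django apps stay as generated, and we append rest_framework and our app:

```python
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'forexAPI',
]
```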

In order to create a real-time API, we'll need to crawl and update data continuously. Once our application is overloaded with traffic, the web server can handle only a certain number of requests, leaving users waiting far too long. Celery is the best choice here for background task processing: passing the crawlers to a queue to be executed in the background keeps the server ready to respond to new requests.

Additionally, Celery requires a message broker to send and receive messages, so we'll use RabbitMQ for that. You can install RabbitMQ from Ubuntu's repositories with the following command:
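```bash
sudo apt-get install rabbitmq-server
```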

Then enable and start the RabbitMQ service:
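With systemd:

```bash
sudo systemctl enable rabbitmq-server
sudo systemctl start rabbitmq-server
```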

If you are using other operating systems, you can follow the download instructions from the official documentation of RabbitMQ.

After the installation completes, add the CELERY_BROKER_URL configuration at the end of the settings.py file:

settings.py
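RabbitMQ listens on the default AMQP port on localhost, so the broker URL is simply:

```python
CELERY_BROKER_URL = 'amqp://localhost'
```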

Now we have to set the default Django settings module for the celery program. Create a new file named celery.py inside the project package (the inner trading directory that holds settings.py):

celery.py
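A standard Celery setup for a Django project named trading looks like this:

```python
import os

from celery import Celery

# Tell Celery where to find the Django settings.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'trading.settings')

app = Celery('trading')

# Read all CELERY_-prefixed options from settings.py.
app.config_from_object('django.conf:settings', namespace='CELERY')

# Discover tasks.py modules in every registered Django app.
app.autodiscover_tasks()
```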

We are setting the default Django settings module for the ‘celery’ program and loading task modules from all registered Django app configs.

Open __init__.py in the same directory and import the Celery app to ensure it is loaded when Django starts:
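The standard pattern from the Celery documentation:

```python
# trading/__init__.py
from .celery import app as celery_app

__all__ = ('celery_app',)
```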

Crawling Data with Beautiful Soup

We are going to crawl one of the popular real-time market screeners, investing.com, using Beautiful Soup, an easy-to-use parser that doesn't require any knowledge of actual parsing theory and techniques. Thanks to its excellent documentation, it's easy to learn from the many code examples. Install Beautiful Soup with the following command:
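```bash
pip install beautifulsoup4
```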

The next step is to create a model to save the crawled data in the database. If you open the website, you'll see a forex table whose column names will become our model fields.

models.py
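A plausible sketch; the field names mirror the columns of the forex table, and CharField keeps things simple, since values such as "+0.12%" contain formatting:

```python
from django.db import models


class Currency(models.Model):
    # Field names follow the columns of the forex table on investing.com.
    name = models.CharField(max_length=50)
    bid = models.CharField(max_length=50)
    ask = models.CharField(max_length=50)
    high = models.CharField(max_length=50)
    low = models.CharField(max_length=50)
    change = models.CharField(max_length=50)
    change_percentage = models.CharField(max_length=50)

    def __str__(self):
        return self.name
```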

Then migrate your database with the following commands:
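```bash
python manage.py makemigrations
python manage.py migrate
```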

After the migrations, create a new file named tasks.py inside the app directory (forexAPI); it will contain all our Celery tasks. The Celery app that we built in the root of the project collects the tasks of all the Django apps listed in INSTALLED_APPS. Before the implementation, open your browser's developer tools to inspect the table elements that we're going to crawl.

Inspect-Element-Forex

Initially, we use urllib's Request class to open the website, because Beautiful Soup can't make a request to a web server on its own. Then we get all the table rows (<tr>) and iterate through them to reach the individual cells (<td>). Looking at the cells inside the rows, you'll notice that their class names include an incrementing value that identifies the specific row, so we need to keep count of the iterations to extract the right information for each row. Python's built-in enumerate() function is made for exactly this kind of iteration: enumerate the rows and use the index inside the class names.

tasks.py
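A sketch of the task; the URL and the CSS class names (pid-{n}-bid and friends) are illustrative and must be checked against the live markup in your developer tools:

```python
import time
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
from celery import shared_task

from .models import Currency

URL = 'https://www.investing.com/currencies/streaming-forex-rates-majors'


@shared_task
def crawl():
    # Beautiful Soup can't fetch pages itself, so open the URL with urllib;
    # a browser-like User-Agent helps avoid being rejected as a bot.
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req).read(), 'html.parser')

    # Skip the header row, then number the rows from 1 so the index
    # matches the counter embedded in the cell class names.
    for i, row in enumerate(soup.find('table').find_all('tr')[1:], start=1):
        cell = lambda suffix: row.find(
            'td', {'class': 'pid-{}-{}'.format(i, suffix)}).text
        dct = {
            'name': row.find('a').text,
            'bid': cell('bid'),
            'ask': cell('ask'),
            'high': cell('high'),
            'low': cell('low'),
            'change': cell('pc'),
            'change_percentage': cell('pcp'),
        }
        print(dct)
        Currency.objects.create(**dct)
        time.sleep(2)  # short pause between rows
```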

@shared_task creates an independent instance of the task for each app, making the task reusable, so it's important to apply this decorator to time-consuming tasks. The function creates a new object for each crawled row and sleeps a few seconds to avoid overloading the database.

Save the file and run a Celery worker in your console to see the result:
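With recent Celery versions the command is (older releases use celery worker -A trading instead):

```bash
celery -A trading worker -l info
```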

Once you run the worker, the results will appear in the console; if you want to see the created objects, navigate to the Django admin and check inside your app. Create a superuser to access the admin page:
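```bash
python manage.py createsuperuser
```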

Then register your model in admin.py:
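A one-liner is enough:

```python
from django.contrib import admin

from .models import Currency

admin.site.register(Currency)
```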

To make the data real-time, we'll need to update these objects continuously. We can achieve that with small changes to the previous function.

tasks.py
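The essential change, assuming the same dct dictionary as before, is to swap create() for a lookup plus update():

```python
# Find the existing object by name and update all fields in one query.
Currency.objects.filter(name=dct['name']).update(**dct)
```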

To update an existing object, we use the filter() method to find it and pass a dictionary to the update() method; this is one of the best ways to handle multiple fields at once. Here is the full code for the real-time updates:
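A sketch of the updated task; the endless loop and the 15-second pause are one simple way to keep the values fresh (a periodic schedule with Celery beat would be a cleaner alternative):

```python
import time
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup
from celery import shared_task

from .models import Currency

URL = 'https://www.investing.com/currencies/streaming-forex-rates-majors'


@shared_task
def crawl():
    # Loop forever so the database keeps reflecting the live page.
    while True:
        req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(urlopen(req).read(), 'html.parser')
        for i, row in enumerate(soup.find('table').find_all('tr')[1:], start=1):
            cell = lambda suffix: row.find(
                'td', {'class': 'pid-{}-{}'.format(i, suffix)}).text
            dct = {
                'name': row.find('a').text,
                'bid': cell('bid'),
                'ask': cell('ask'),
                'high': cell('high'),
                'low': cell('low'),
                'change': cell('pc'),
                'change_percentage': cell('pcp'),
            }
            # Update the existing object instead of creating a new one.
            Currency.objects.filter(name=dct['name']).update(**dct)
        time.sleep(15)  # pause between full passes over the table
```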

Real-time crawlers can put a heavy load on servers, which may end with you being blocked from a webpage, so when scraping continuously it's important to stay undetected and to work around any restrictions. You can reduce the chance of detection by setting a proxy on the Request instance.
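For example (the address is a placeholder; set_proxy() is part of urllib's Request API):

```python
req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
# Route the request through an HTTP proxy at a placeholder address.
req.set_proxy('203.0.113.10:8080', 'http')
```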

Creating the API with Django REST Framework

The final step is to create serializers to build a REST API from the crawled data. Serializers convert our model instances to native Python datatypes that can easily be rendered into JSON. The ModelSerializer class provides a shortcut that automatically creates a Serializer class with fields corresponding to the model fields. For more information, check the official documentation of Django REST Framework.

Create serializers.py inside your app:

serializers.py
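With ModelSerializer the whole file stays short; '__all__' simply exposes every model field:

```python
from rest_framework import serializers

from .models import Currency


class CurrencySerializer(serializers.ModelSerializer):
    class Meta:
        model = Currency
        fields = '__all__'
```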

Now open views.py and create a ListAPIView, which represents a collection of model instances; it's used for read-only endpoints and provides a get method handler:
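A minimal sketch (the view name is my own choice):

```python
from rest_framework.generics import ListAPIView

from .models import Currency
from .serializers import CurrencySerializer


class CurrencyListAPIView(ListAPIView):
    queryset = Currency.objects.all()
    serializer_class = CurrencySerializer
```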

For more information about generic views, see the Generic Views documentation. Finally, configure urls.py to render the views:
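A sketch; the api/ route is just an example:

```python
from django.contrib import admin
from django.urls import path

from forexAPI.views import CurrencyListAPIView

urlpatterns = [
    path('admin/', admin.site.urls),
    path('api/', CurrencyListAPIView.as_view()),
]
```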

In class-based views, as_view() must be called to return a callable view that takes a request and returns a response; it's the main entry point for generic views in the request-response cycle.

You're almost done! In order to run the project properly, you have to run the Celery worker and the Django development server separately.
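For example, in two terminals:

```bash
# terminal 1: the worker that keeps crawling
celery -A trading worker -l info

# terminal 2: the development server
python manage.py runserver
```

The final result should look like this: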

Final result

Try refreshing the page after 15 seconds and you'll see the values changing.

Source Code

You can download the project from the GitHub repository.

Conclusion

Web scraping plays a major role in the data industry and is used by corporations to stay competitive. The real-time mode becomes useful when you want information on demand. Keep in mind, though, that you're going to put a lot of load on the site you're scraping, so check whether it has an API or some other way to get the data. Companies put a lot of effort into providing their services, so it's best to respect their business and request permission before using their data in production.


4 comments

Nikita Bragin February 26, 2020 at 1:08 pm

1. Why would you put a sleep inside a for-loop inside a task? You should properly set up a rate limit.
2. This could be improved by creating smaller tasks that take care of individual parts.
3. print? Use the logging module.
4. What if your request doesn't return 2xx?
5. Since you already built that dict to print it, you can use it as Currency.objects.create(**dct).

Rashid Maharamli February 26, 2020 at 1:42 pm

Thanks for the corrections, these small changes will make the code clean and professional.

Rikesh Kayastha October 2, 2020 at 12:43 pm

WebSockets using Django Channels could be used to display the changing values without refreshing.
