Full-stack Data Science: Building & deploying an ML app tutorial — Part 1

Data Scientists NEED to learn to package and deploy their own models.

I’m not gatekeeping here; I’m giving you facts. I interview, hire, and lead data professionals, including Data Scientists, Data Analysts, Machine Learning Engineers, and Data Engineers. Packaging and deploying models are consistently the gaps for people without a software engineering background.

That’s why in this article and video I’ll show you a rapid deployment of a Natural Language Processing (NLP) app, from start to finish. I’m not worrying about developing the model because modeling isn’t the gap I see in the market.

Alright, let’s get started.

Setting up the project environment

To get started, we’re going to open our terminal and run the following command to create the application directory:

mkdir ner-service

Now ‘cd’ into the directory:

cd ner-service

Create the poetry project by running:

poetry init

Note: Alternatively, you could run poetry new ner-service which would start a structured poetry project for you.

After you run the ‘poetry init’ command, you’ll go through a project setup that looks something like this:

Now that we have the poetry project set up, let’s launch the poetry shell:

poetry shell

If you’ve done everything properly to this point, you should see something like this in your terminal:

Setting up the project file structure

The first thing we’ll want to do is create our ‘src’ directory. To do that we’ll run:

mkdir src; cd src

Then we’ll want to create the __init__.py and main.py files.

touch __init__.py main.py

At this point, your project structure should look like this:

ner-service/
├── pyproject.toml
└── src/
    ├── __init__.py
    └── main.py

spaCy’s Named Entity Recognition (NER)

Background Information: spaCy & NER

SpaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython

Wikipedia

spaCy is a fantastic library that simplifies building and developing NLP solutions. In this project, for the sake of simplicity, we’re using spaCy’s built-in named-entity recognition (NER) feature.

If you’d like to learn more about NER, see Wikipedia’s definition of named-entity recognition.

Implementing spaCy’s NER system

#~/ner-service/src/main.py
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Print each entity's text, character offsets, and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
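The start_char and end_char values printed by that loop are character offsets into the original string, with the end offset exclusive, so slicing the text with them gives back the entity text. Here is a minimal sketch of that relationship in plain Python. The spans are hand-picked for illustration, which is an assumption on my part and not output from the model.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# Hand-picked example spans (an assumption for illustration, not model output).
example_spans = ["Apple", "U.K.", "$1 billion"]

offsets = []
for span in example_spans:
    start_char = text.find(span)       # index of the span's first character
    end_char = start_char + len(span)  # one past the span's last character
    offsets.append((span, start_char, end_char))
    # Slicing with the two offsets recovers the span itself.
    print(span, start_char, end_char, text[start_char:end_char] == span)
```

These offsets are handy later if you want to highlight entities in the original text.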

Now, to use spaCy, we’ll need to add it to our environment. You can do that by running the following command:

poetry add spacy

Your terminal will look something like this afterward:

Then you’ll want to add the language model.

poetry add https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz

Then just to make sure everything has properly installed, run:

poetry update

At this point, your pyproject.toml file should look something like this:

Now, let’s quickly test that the code works in our environment. Run the following command:

python main.py

If everything is running as expected in your environment you should get an output like this:

Set up the API with FastAPI

Before we jump in, I’m going to introduce FastAPI. Feel free to skip to the Implementing FastAPI section.

What is FastAPI?

According to the FastAPI website:

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints.

https://fastapi.tiangolo.com/

I don’t want to be too lazy about this, but that quote pretty much sums it up. FastAPI is fast, easy, clean, and extensible.

Why use FastAPI?

  • FastAPI comes with Swagger docs built-in. This is awesome for rapid prototyping and testing your API.
  • Clear, concise documentation and examples.
  • Extensibility.
  • Speed, speed, speed to production.

Implementing FastAPI

poetry add fastapi uvicorn

That command will install FastAPI and Uvicorn. According to the documentation, “Uvicorn is a lightning-fast ASGI server implementation, using uvloop and httptools.” For our use case, Uvicorn helps us serve our app to the world.

Now that FastAPI and Uvicorn are installed, let’s go back to main.py and implement FastAPI.

#~/ner-service/src/main.py
from fastapi import FastAPI
from typing import List
import spacy
from .models import Payload, Entities

app = FastAPI()
nlp = spacy.load("en_core_web_sm")


@app.post('/ner-service')
async def get_ner(payload: Payload):
    # Run each piece of content through the spaCy pipeline
    tokenize_content: List[spacy.tokens.doc.Doc] = [
        nlp(content.content) for content in payload.data
    ]
    # Collect the entities found in each document
    document_entities = []
    for doc in tokenize_content:
        document_entities.append([{'text': ent.text, 'entity_type': ent.label_} for ent in doc.ents])
    # Pair each post with its entities
    return [
        Entities(post_url=post.post_url, entities=ents)
        for post, ents in zip(payload.data, document_entities)
    ]

That’s a lot of new code added to the file, so let’s go through it piece by piece.

app = FastAPI()

This part simply instantiates the FastAPI application.

@app.post('/ner-service')

This sets our route or path. For example, www.mktr.ai/ner-service would have the above path if this were a service we ran from the MKTR.AI website.

@app.post('/ner-service')
async def get_ner(payload: Payload):
    ...

Declaring path operation functions with async def is super helpful, and I suggest you take a look at FastAPI’s documentation to learn more.

Notice that the payload: Payload piece tells the application the type of data to expect. We’ll get to that in a minute when we make a models.py file. For now, think of it as a way to format the data we’ll accept in a request to our API.

tokenize_content: List[spacy.tokens.doc.Doc] = [
    nlp(content.content) for content in payload.data
]

Here we’re using a list comprehension to run each piece of text passed to our API through the spaCy pipeline, which tokenizes it and returns a Doc. The List[spacy.tokens.doc.Doc] portion declares the type of the data we’re assigning to the tokenize_content variable. This may be a little redundant, but it becomes more important as you attempt to account for edge cases and potential issues in production.

document_entities = []
for doc in tokenize_content:
    document_entities.append([{'text': ent.text, 'entity_type': ent.label_} for ent in doc.ents])

Here we’re creating a list, document_entities, and using a list comprehension to build, for each piece of text passed to the API, a list of dictionaries holding the text and entity type of every entity found. That makes document_entities a list of lists of dictionaries, with one inner list per document.
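To make the shape of document_entities concrete, here is a minimal sketch that runs without spaCy. The entities_to_dicts helper and the SimpleNamespace stand-ins are hypothetical and only mimic the three attributes the loop reads (ents, text, and label_). They are not real spaCy objects.

```python
from types import SimpleNamespace


def entities_to_dicts(doc):
    """Return one dictionary per entity found in a single document."""
    return [{"text": ent.text, "entity_type": ent.label_} for ent in doc.ents]


# Stand-in documents (hypothetical): only .ents, .text, and .label_ are mimicked.
fake_docs = [
    SimpleNamespace(ents=[
        SimpleNamespace(text="Apple", label_="ORG"),
        SimpleNamespace(text="U.K.", label_="GPE"),
    ]),
    SimpleNamespace(ents=[]),  # a document with no entities
]

# One inner list per document, in the same order as the input.
document_entities = [entities_to_dicts(doc) for doc in fake_docs]
print(document_entities)
```

Note that a document with no entities still contributes an empty list, which keeps the results aligned with the input order.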

return [
    Entities(post_url=post.post_url, entities=ents)
    for post, ents in zip(payload.data, document_entities)
]

Finally, we format our response object. Based on the previous chunks, you can probably tell what’s going on here. Basically, the Entities() piece hydrates one Entities object per post, pairing each post_url with the entities found in that post’s content.
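The pairing relies on zip matching items by position, so the first post goes with the first list of entities, and so on. Here is a minimal sketch of that step using only the standard library. The Post dataclass, the example.com URLs, and the plain dictionaries are hypothetical stand-ins for the Content and Entities models.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Post:  # hypothetical stand-in for one item in payload.data
    post_url: str
    content: str


posts = [
    Post(post_url="https://example.com/a", content="first text"),
    Post(post_url="https://example.com/b", content="second text"),
]

# Stand-in for the per-document entity lists, in the same order as posts.
entity_lists: List[List[Dict[str, str]]] = [
    [{"text": "Apple", "entity_type": "ORG"}],
    [],
]

# zip pairs the items by position: posts[0] with entity_lists[0], and so on.
response = [
    {"post_url": post.post_url, "entities": ents}
    for post, ents in zip(posts, entity_lists)
]
print(response)
```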

OK, now that the main.py file is good to go, we’re going to create another Python file named models.py like this:

touch models.py

Then let’s add the following code to models.py:

#~/ner-service/src/models.py
from typing import List
from pydantic import BaseModel


class Content(BaseModel):
    post_url: str
    content: str


class Payload(BaseModel):
    data: List[Content]  # this makes a list of Content objects


class SingleEntity(BaseModel):
    text: str
    entity_type: str


class Entities(BaseModel):
    post_url: str
    entities: List[SingleEntity]  # this makes a list of SingleEntity objects

Basically, Content() is a single object with a post_url string and a content string. The Payload() object has a single data field that holds a list of Content objects.

The same goes for Entities and SingleEntity.
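To see what these models buy us, here is a minimal sketch of pydantic validating a request body on its own, outside of FastAPI. It repeats the Content and Payload classes so it runs standalone and assumes pydantic is installed (it comes along with FastAPI). The sample URLs and text are made up for illustration.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Content(BaseModel):
    post_url: str
    content: str


class Payload(BaseModel):
    data: List[Content]


# A well-formed body: each dictionary is converted into a Content object.
good = Payload(data=[{"post_url": "https://example.com/a", "content": "Apple is hiring."}])
print(good.data[0].content)

# A malformed body: the "content" field is missing, so validation fails.
try:
    Payload(data=[{"post_url": "https://example.com/b"}])
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

FastAPI runs this same validation on incoming requests and responds with an error when the body doesn’t match the model.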

Test your FastAPI locally

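Because main.py uses a relative import (from .models import ...), it needs to be loaded as part of the src package. Move back up to the project root and start Uvicorn from there:

cd ..
uvicorn src.main:app --reload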

After you run the above command, your terminal should look something like this:

If everything is going as planned, you should be able to visit http://127.0.0.1:8000/docs and check out your FastAPI app and its Swagger docs.
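If you’d rather hit the endpoint from code than from the docs page, here is a minimal sketch of a client that uses only the standard library. The build_request and fetch_entities helpers and the sample post are hypothetical. The URL assumes the server is running locally on the default port shown above.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/ner-service"  # the route defined in main.py


def build_request(posts):
    """Build a POST request whose JSON body matches the Payload model."""
    body = json.dumps({"data": posts}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_entities(posts):
    """Send the request and decode the JSON response (needs the server running)."""
    with urllib.request.urlopen(build_request(posts)) as response:
        return json.loads(response.read().decode("utf-8"))


sample_posts = [  # hypothetical sample data
    {
        "post_url": "https://example.com/a",
        "content": "Apple is looking at buying U.K. startup for $1 billion",
    },
]

# Building the request does not touch the network, so this runs anywhere.
request = build_request(sample_posts)
print(request.get_method(), request.full_url)
```

With the server running, calling fetch_entities(sample_posts) should return a list with one object per post, each holding the post_url and its entities.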

In closing…

In the next post, we’ll containerize our app using Docker, push the Docker image to Docker Hub, set up a GCP Virtual Machine, and run our app so the world can use it.

If you’re impatient, you can always cut to the chase: watch the original YouTube video for this project and check out the GitHub repo.

I hope this was helpful! If you run into issues or have any questions, drop a comment, shoot me an email, or connect with me on LinkedIn and I’ll get you squared away.

This article was originally posted on MKTR.AI.
