How to Use Open Source Models with Hugging Face?

https://medium.com/@cbarkinozer/huggingface-ile-a%C3%A7%C4%B1k-kaynak-modeller-nas%C4%B1l-kullan%C4%B1l%C4%B1r-e6c284d52d98

An overview of DeepLearning.AI's "Open Source Models with Hugging Face" course.

Selecting Models

The Hugging Face Hub hosts numerous open-source models; you can filter them by task, language, and license.

When you click on a model, it opens the model card. The model card displays key information about the model, such as the parameter count, the data on which it was trained, and the various checkpoints.

The "Use in Transformers" button shows how to load the model with the Transformers library. The pipeline object performs the pre- and post-processing steps for you.
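
For example, a sentiment-analysis pipeline handles tokenization and post-processing automatically (a minimal sketch; the model name is just one example from the Hub):

from transformers import pipeline

# The pipeline tokenizes the text, runs the model, and post-processes the logits.
classifier = pipeline(task="sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Open source models are great to work with!"))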

Natural Language Processing

Natural language processing (NLP) is a machine learning technology that enables computers to interpret, manipulate, and comprehend human language.

NLP applications
You can chat with open-source models at hf.co/chat, Hugging Face's hosted chat interface for open models.

Let's create a chatbot using the pipeline object.
from transformers import pipeline
chatbot = pipeline(task="conversational",
model="./models/facebook/blenderbot-400M-distill")

user_message = """
What are some fun activities I can do in the winter?
"""

from transformers import Conversation
conversation = Conversation(user_message)
print(conversation)

conversation = chatbot(conversation)
print(conversation)
print(chatbot(Conversation("What else do you recommend?")))

However, because the chatbot has no memory of the previous turns, it may respond out of context. To include the earlier conversation in the LLM's context, add a message containing the previous chat history.

conversation.add_message(
    {"role": "user",
     "content": """
What else do you recommend?
"""})
print(conversation)

conversation = chatbot(conversation)
print(conversation)

Human evaluation is the most valuable type of evaluation. There is an LLM evaluation arena (Chatbot Arena) where people vote on which of two answers is better.

https://chat.lmsys.org/

Based on these votes, a leaderboard (benchmark) is established.

Translation and Summarization

Create a translation pipeline using the Transformers Library:

!pip install transformers 
!pip install torch

from transformers import pipeline
import torch

translator = pipeline(task="translation",
model="./models/facebook/nllb-200-distilled-600M",
torch_dtype=torch.bfloat16)

text = """\
My puppy is adorable, \
Your kitten is cute.
Her panda is friendly.
His llama is thoughtful. \
We all have nice pets!"""


text_translated = translator(text,
src_lang="eng_Latn",
tgt_lang="fra_Latn")

print(text_translated)

To translate into other languages, you can look up the language codes on the FLORES-200 languages page; a short example follows the list below.

For example:

  • Afrikaans: afr_Latn
  • Chinese: zho_Hans
  • Egyptian Arabic: arz_Arab
  • French: fra_Latn
  • German: deu_Latn
  • Greek: ell_Grek
  • Hindi: hin_Deva
  • Indonesian: ind_Latn
  • Italian: ita_Latn
  • Japanese: jpn_Jpan
  • Korean: kor_Hang
  • Persian: pes_Arab
  • Portuguese: por_Latn
  • Russian: rus_Cyrl
  • Spanish: spa_Latn
  • Swahili: swh_Latn
  • Thai: tha_Thai
  • Turkish: tur_Latn
  • Vietnamese: vie_Latn
  • Zulu: zul_Latn
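
For instance, to translate the same text into German instead of French (a minimal sketch reusing the translator pipeline and the text defined above):

# Reuse the translator pipeline, but target German this time.
text_translated_de = translator(text,
                                src_lang="eng_Latn",
                                tgt_lang="deu_Latn")
print(text_translated_de)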

To free up memory so that the rest of the code can run, delete the pipeline and run the garbage collector:

import gc
del translator
gc.collect()

Create the summarization pipeline with the Transformers library:

summarizer = pipeline(task="summarization",
model="./models/facebook/bart-large-cnn",
torch_dtype=torch.bfloat16)

text = """Paris is the capital and most populous city of France, with
an estimated population of 2,175,601 residents as of 2018,
in an area of more than 105 square kilometres (41 square
miles). The City of Paris is the centre and seat of
government of the region and province of Île-de-France, or
Paris Region, which has an estimated population of
12,174,880, or about 18 percent of the population of France
as of 2017."""


summary = summarizer(text,
min_length=10,
max_length=100)

print(summary)

Sentence Embeddings

In natural language processing, sentence embedding is a numeric representation of a sentence in the form of a real number vector that encodes relevant semantic information.

!pip install sentence-transformers

Create a sentence-embedding model using the Sentence Transformers library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences1 = ['The cat sits outside',
'A man is playing guitar',
'The movies are awesome']
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
print(embeddings1)


sentences2 = ['The dog plays in the garden',
'A woman watches TV',
'The new movie is so great']
embeddings2 = model.encode(sentences2,
convert_to_tensor=True)
print(embeddings2)

Calculate the cosine similarity between two sentences to determine how similar they are.

from sentence_transformers import util
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print(cosine_scores)
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))
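
Beyond pairwise comparison, the same embeddings support semantic search, i.e. retrieving the corpus sentence closest to a query (a minimal sketch using sentence-transformers' util.semantic_search; the query sentence is illustrative):

# A minimal sketch: retrieve the sentence from sentences1 most similar to a query.
query_embedding = model.encode("A feline is relaxing outdoors",
                               convert_to_tensor=True)
hits = util.semantic_search(query_embedding, embeddings1, top_k=1)
best = hits[0][0]  # best match for the first (and only) query
print(sentences1[best["corpus_id"]], best["score"])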

Zero-shot Audio Classification

Sound classification is one of the most popular applications of audio deep learning. It entails learning to classify sounds and determine which category they belong to.

!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa

The librosa library may require the installation of ffmpeg. This librosa page contains installation instructions for FFmpeg.

Preparing the dataset of audio recordings

from datasets import load_dataset, load_from_disk

# This dataset is a collection of 5-second sound recordings
# dataset = load_dataset("ashraq/esc50",
#                        split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

audio_sample = dataset[0]
print(audio_sample)

from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
rate=audio_sample["audio"]["sampling_rate"])

Build the zero-shot audio classification pipeline with the 🤗 Transformers library.

from transformers import pipeline

zero_shot_classifier = pipeline(
task="zero-shot-audio-classification",
model="./models/laion/clap-htsat-unfused")

More info on laion/clap-htsat-unfused.

Sampling Rate

Sampling is the process of measuring the value of a continuous signal at fixed time steps.

The sampling rate (Hz) is the number of samples taken in a single second.

Typical sampling rates are 8 kHz for telephone and radio, 16 kHz for human speech recording, and 192 kHz for high-resolution audio.

The sampling rate for transformer models

In a 5-second sound:

  • At 8000 Hz => 5*8000 = 40,000 signal values.
  • At 16000 Hz => 5*16000 = 80,000 signal values.
  • At 192000 Hz => 5*192000 = 960,000 signal values.

For a transformer trained on 16 kHz audio, an array of 960k values corresponds to a 60-second recording at 16 kHz.

How long does one second of high-resolution audio (192,000 Hz) sound to the Whisper model, which was trained to expect audio sampled at 16,000 Hz?

(1 * 192000) / 16000 = 12.0 . The model perceives one second of high-resolution audio as twelve seconds of audio.

How about five seconds of audio?

5 * 192000 / 16000 = 60. The model perceives 5 seconds of high-resolution audio as 60 seconds of audio.
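
The arithmetic can be captured in a small helper (a hypothetical function, for illustration only):

# Hypothetical helper illustrating the arithmetic above.
def perceived_duration(duration_s, actual_sr, model_sr=16_000):
    """How long a clip 'sounds' to a model that assumes model_sr."""
    return duration_s * actual_sr / model_sr

print(perceived_duration(1, 192_000))  # 12.0 seconds
print(perceived_duration(5, 192_000))  # 60.0 seconds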

zero_shot_classifier.feature_extractor.sampling_rate  # sampling rate expected by the model
audio_sample["audio"]["sampling_rate"]  # sampling rate of the dataset audio

from datasets import Audio

# Cast the audio column to the model's sampling rate (48 kHz)
dataset = dataset.cast_column(
    "audio",
    Audio(sampling_rate=48_000))
audio_sample = dataset[0]


candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

Automatic Speech Recognition

Automatic Speech Recognition (ASR) is the application of machine learning or artificial intelligence (AI) technology to convert human speech into readable text.

!pip install transformers
!pip install -U datasets
!pip install soundfile
!pip install librosa
!pip install gradio

Data preparation

from datasets import load_dataset

dataset = load_dataset("librispeech_asr",
split="train.clean.100",
streaming=True,
trust_remote_code=True)

example = next(iter(dataset))

dataset_head = dataset.take(5)
list(dataset_head)

list(dataset_head)[2]
print(example)

from IPython.display import Audio as IPythonAudio

IPythonAudio(example["audio"]["array"],
rate=example["audio"]["sampling_rate"])

Build the pipeline

from transformers import pipeline

asr = pipeline(task="automatic-speech-recognition",
               model="./models/distil-whisper/distil-small.en")

asr.feature_extractor.sampling_rate  # sampling rate expected by the model
example['audio']['sampling_rate']    # sampling rate of the example
asr(example["audio"]["array"])       # transcribe the audio
example["text"]                      # reference transcription

Info about distil-whisper/distil-small.en

Build a shareable app with Gradio

import os
import gradio as gr

demo = gr.Blocks()

def transcribe_speech(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(filepath)
    return output["text"]

mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")

file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)

with demo:
    gr.TabbedInterface(
        [mic_transcribe,
         file_transcribe],
        ["Transcribe Microphone",
         "Transcribe Audio File"],
    )

demo.launch(share=True,
server_port=int(os.environ['PORT1']))

demo.close()

Testing with a longer audio file

import soundfile as sf
import io

audio, sampling_rate = sf.read('narration_example.wav')

sampling_rate

asr.feature_extractor.sampling_rate

asr(audio) # ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline

Convert the audio from stereo to mono (using librosa).

audio.shape

import numpy as np

audio_transposed = np.transpose(audio)
audio_transposed.shape
import librosa
audio_mono = librosa.to_mono(audio_transposed)
IPythonAudio(audio_mono,
rate=sampling_rate)
asr(audio_mono)  # Warning: this may give a poor result because the sampling rate of the audio is not the same as the sampling rate the model expects.

Let’s fix this.

sampling_rate
asr.feature_extractor.sampling_rate
audio_16KHz = librosa.resample(audio_mono,
orig_sr=sampling_rate,
target_sr=16000)
asr(
audio_16KHz,
chunk_length_s=30, # 30 seconds
batch_size=4,
return_timestamps=True,
)["chunks"]

How are long recordings transcribed? With chunk_length_s, the audio is split into 30-second chunks that are transcribed in batches and stitched back together.

Build Gradio UI

import gradio as gr
demo = gr.Blocks()

def transcribe_long_form(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(
        filepath,
        max_new_tokens=256,
        chunk_length_s=30,
        batch_size=8,
    )
    return output["text"]

mic_transcribe = gr.Interface(
    fn=transcribe_long_form,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")

file_transcribe = gr.Interface(
    fn=transcribe_long_form,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)

with demo:
    gr.TabbedInterface(
        [mic_transcribe,
         file_transcribe],
        ["Transcribe Microphone",
         "Transcribe Audio File"],
    )
demo.launch(share=True,
server_port=int(os.environ['PORT1']))

demo.close()

Test your own audio files.

import soundfile as sf
import io

audio, sampling_rate = sf.read('narration_example.wav')

sampling_rate

asr.feature_extractor.sampling_rate

Text to Speech

!pip install transformers
!pip install gradio
!pip install timm
!pip install inflect
!pip install phonemizer

Note: py-espeak-ng is only available on Linux operating systems.

To run locally on a Linux system, use these commands:

sudo apt-get update
sudo apt-get install espeak-ng
pip install py-espeak-ng

Build the text-to-speech pipeline using the 🤗 Transformers Library

from transformers import pipeline

narrator = pipeline("text-to-speech",
model="./models/kakao-enterprise/vits-ljs")

Info about kakao-enterprise/vits-ljs

text = """
Researchers at the Allen Institute for AI, \
HuggingFace, Microsoft, the University of Washington, \
Carnegie Mellon University, and the Hebrew University of \
Jerusalem developed a tool that measures atmospheric \
carbon emitted by cloud servers while training machine \
learning models. After a model’s size, the biggest variables \
were the server’s location and time of day it was active.
"""


narrated_text = narrator(text)

from IPython.display import Audio as IPythonAudio

IPythonAudio(narrated_text["audio"][0],
rate=narrated_text["sampling_rate"])

Object Detection

Object detection is a computer vision and image processing technique that detects instances of objects of a specific class (such as people, buildings, or cars) in digital images and videos.

!pip install transformers
!pip install gradio
!pip install timm
!pip install inflect
!pip install phonemizer

Note: py-espeak-ng is only available on Linux operating systems.

To run locally on a Linux system, use these commands:

sudo apt-get update
sudo apt-get install espeak-ng
pip install py-espeak-ng

Object detection means detecting items inside an image; it allows you to classify and localize objects at the same time.

Build the object-detection pipeline using 🤗 Transformers Library

This model was released in the paper "End-to-End Object Detection with Transformers" by Carion et al. (2020).

from helper import load_image_from_url, render_results_in_image
from transformers import pipeline

od_pipe = pipeline("object-detection", "./models/facebook/detr-resnet-50")

Info about facebook/detr-resnet-50

Explore the Hugging Face Hub for more object detection models.

Use the Pipeline

from PIL import Image
raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((569, 491))
pipeline_output = od_pipe(raw_image)
# Return the results from the pipeline using the helper function render_results_in_image.
processed_image = render_results_in_image(
raw_image,
pipeline_output)
processed_image

Demo with Gradio

import os
import gradio as gr

def get_pipeline_prediction(pil_image):
    pipeline_output = od_pipe(pil_image)
    processed_image = render_results_in_image(pil_image,
                                              pipeline_output)
    return processed_image

demo = gr.Interface(
fn=get_pipeline_prediction,
inputs=gr.Image(label="Input image",
type="pil"),
outputs=gr.Image(label="Output image with predicted instances",
type="pil")
)
# share=True will provide an online link to access to the demo

demo.launch(share=True, server_port=int(os.environ['PORT1']))
demo.close()

Make an AI Powered Audio Assistant

Using state-of-the-art audio and object detection models to create an AI assistant.

Combine the object detector with a text-to-speech model to help determine what is in the image. Examine the results of the object detection pipeline.

pipeline_output
od_pipe
raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((284, 245))
from helper import summarize_predictions_natural_language
text = summarize_predictions_natural_language(pipeline_output)
text

Generate Audio Narration of an Image

tts_pipe = pipeline("text-to-speech",
model="./models/kakao-enterprise/vits-ljs")

More info about kakao-enterprise/vits-ljs.

narrated_text = tts_pipe(text)

Play the Generated Audio

from IPython.display import Audio as IPythonAudio
IPythonAudio(narrated_text["audio"][0],
rate=narrated_text["sampling_rate"])

Image Segmentation

Mask generation accepts optional input 2D points and/or bounding boxes, but it does not output a class for the mask.

!pip install transformers
!pip install gradio
!pip install timm
!pip install torchvision

Mask Generation with SAM

The SAM architecture, together with the output of the automatic mask generation process.

The Segment Anything Model (SAM) was released by Meta AI.

from transformers import pipeline
sam_pipe = pipeline("mask-generation",
"./models/Zigeng/SlimSAM-uniform-77")

Info about Zigeng/SlimSAM-uniform-77

SlimSAM is a compressed (pruned) version of SAM.

from PIL import Image
raw_image = Image.open('meta_llamas.jpg')
raw_image.resize((720, 375))

Running this will take some time. The larger the value of points_per_batch, the more efficient pipeline inference will be (at the cost of higher memory usage).

output = sam_pipe(raw_image, points_per_batch=32)
from helper import show_pipe_masks_on_image
show_pipe_masks_on_image(raw_image, output)

Faster Inference: Infer an Image and a Single Point

from transformers import SamModel, SamProcessor
model = SamModel.from_pretrained(
"./models/Zigeng/SlimSAM-uniform-77")

processor = SamProcessor.from_pretrained(
"./models/Zigeng/SlimSAM-uniform-77")
raw_image.resize((720, 375))

Segment the blue shirt Andrew is wearing: provide a single 2D point that lies inside that region (the blue shirt).

input_points = [[[1600, 700]]]

Create the model inputs from the image and the single point; return_tensors="pt" returns PyTorch tensors.

inputs = processor(raw_image,
input_points=input_points,
return_tensors="pt")

Run the model on these inputs (no gradient computation is needed for inference).

import torch

with torch.no_grad():
    outputs = model(**inputs)

predicted_masks = processor.image_processor.post_process_masks(
    outputs.pred_masks,
    inputs["original_sizes"],
    inputs["reshaped_input_sizes"]
)

The length of predicted_masks corresponds to the number of images used in the input.

len(predicted_masks)
predicted_mask = predicted_masks[0]
predicted_mask.shape
outputs.iou_scores
from helper import show_mask_on_image
for i in range(3):
    show_mask_on_image(raw_image, predicted_mask[:, i])
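
The model also returns an IoU estimate for each of the three masks, so you can display only the highest-scoring one (a minimal sketch, assuming iou_scores has shape (1, 1, 3) as in this example):

# A minimal sketch: show only the mask with the highest predicted IoU.
best_mask_idx = outputs.iou_scores[0, 0].argmax().item()
show_mask_on_image(raw_image, predicted_mask[:, best_mask_idx])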

Depth Estimation with DPT

This model was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in isl-org/DPT.

depth_estimator = pipeline(task="depth-estimation",
model="./models/Intel/dpt-hybrid-midas")

Info about ‘Intel/dpt-hybrid-midas’

raw_image = Image.open('gradio_tamagochi_vienna.png')
raw_image.resize((806, 621))
output = depth_estimator(raw_image)
output
Raw Image

Post-process the model output so the predicted depth map has the same size as the original image.

output["predicted_depth"].shape
output["predicted_depth"].unsqueeze(1).shape
prediction = torch.nn.functional.interpolate(
output["predicted_depth"].unsqueeze(1),
size=raw_image.size[::-1],
mode="bicubic",
align_corners=False,
)
prediction.shape
raw_image.size[::-1],
prediction

Normalize the predicted tensor (to the range 0 to 255) so it can be displayed as an image.

import numpy as np 
output = prediction.squeeze().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
Image Depth

Gradio Example:

import os
import gradio as gr
from transformers import pipeline

def launch(input_image):
    out = depth_estimator(input_image)

    # resize the prediction
    prediction = torch.nn.functional.interpolate(
        out["predicted_depth"].unsqueeze(1),
        size=input_image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    # normalize the prediction
    output = prediction.squeeze().numpy()
    formatted = (output * 255 / np.max(output)).astype("uint8")
    depth = Image.fromarray(formatted)
    return depth

iface = gr.Interface(launch,
inputs=gr.Image(type='pil'),
outputs=gr.Image(type='pil'))

iface.launch(share=True, server_port=int(os.environ['PORT1']))

iface.close()

Image Retrieval

An image retrieval system is a computer system that allows users to browse, search, and retrieve images from a large collection of digital images.

!pip install transformers
!pip install torch

Load the model and the processor.

from transformers import BlipForImageTextRetrieval
model = BlipForImageTextRetrieval.from_pretrained(
"./models/Salesforce/blip-itm-base-coco")

More info about Salesforce/blip-itm-base-coco.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
"./models/Salesforce/blip-itm-base-coco")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'

from PIL import Image
import requests
raw_image = Image.open(
requests.get(img_url, stream=True).raw).convert('RGB')
raw_image

Test whether the image matches the text:

text = "an image of a woman and a dog on the beach"
inputs = processor(images=raw_image,
text=text,
return_tensors="pt")
inputs
itm_scores = model(**inputs)[0]
itm_scores
import torch

Use a softmax layer to convert the scores into probabilities:

itm_score = torch.nn.functional.softmax(
itm_scores,dim=1)
itm_score
print(f"""\
The image and text are matched \
with a probability of {itm_score[0][1]:.4f}"
"")

Image Captioning

Image captioning is the task of extracting relevant information from an image and generating an appropriate caption that describes its content.

BLIP (Bootstrapping Language-Image Pre-training) architecture

!pip install transformers

Load the model and processor:

from transformers import BlipForConditionalGeneration
model = BlipForConditionalGeneration.from_pretrained(
"./models/Salesforce/blip-image-captioning-base")

Info about Salesforce/blip-image-captioning-base

from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/Salesforce/blip-image-captioning-base")

from PIL import Image
image = Image.open("./beach.jpeg")
image
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True)) # A photograph of a woman and her dog on the beach.

Unconditional Image Captioning

inputs = processor(image,return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Multimodal Visual Question Answering

Visual Question Answering (VQA) is a cross-disciplinary challenge in computer vision and natural language processing that requires a computer to generate a natural-language answer to a question posed about an image.

!pip install transformers

Load model and processor:

from transformers import BlipForQuestionAnswering
model = BlipForQuestionAnswering.from_pretrained(
"./models/Salesforce/blip-vqa-base")

Info about Salesforce/blip-vqa-base

from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/Salesforce/blip-vqa-base")

from PIL import Image
image = Image.open("./beach.jpeg")

question = "how many dogs are in the picture?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Zero-shot Image Classification

!pip install transformers
from transformers import CLIPModel
model = CLIPModel.from_pretrained(
"./models/openai/clip-vit-large-patch14")
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/openai/clip-vit-large-patch14")

More info about openai/clip-vit-large-patch14.

CLIP (Contrastive Language-Image Pre-training) is a multimodal vision-and-language model that can perform zero-shot image classification.

from PIL import Image
image = Image.open("./kittens.jpeg")
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels,
images=image,
return_tensors="pt",
padding=True)
outputs = model(**inputs)
outputs.logits_per_image
probs = outputs.logits_per_image.softmax(dim=1)[0]
probs = list(probs)
for i in range(len(labels)):
    print(f"label: {labels[i]} - probability of {probs[i].item():.4f}")
# label: a photo of a cat - probability of 0.9992

Deploying ML Models on 🤗 Hub using Gradio

!pip install transformers
!pip install gradio
!pip install gradio_client

If you encounter problems while performing an API request to your own space, you can try upgrading your version of gradio_client.

pip install -U gradio_client

You can create an account on Hugging Face at huggingface.co and follow the instructions provided.

import os
import gradio as gr
from transformers import pipeline
pipe = pipeline("image-to-text",
model="./models/Salesforce/blip-image-captioning-base")
def launch(input):
    out = pipe(input)
    return out[0]['generated_text']
iface = gr.Interface(launch,
inputs=gr.Image(type='pil'),
outputs="text")
iface.launch(share=True,
server_port=int(os.environ['PORT1']))
iface.close()
  • Visit https://huggingface.co/spaces and click the "Create New Space" button.
  • Give the space a name, such as "blip-image-captioning".
  • Select a license, such as Apache 2.0.
  • Click "Gradio" in the "Select the Space SDK" box.
  • For hardware, select the default free choice, "CPU Basic".
  • Leave the visibility as "public" and click "Create Space".
  • You'll see a new page with instructions for cloning and updating a Git repository.
  • If you want to get a small app up and running quickly, you can add the necessary files directly in the web browser: click the "Files" tab at the top, then "+ Add File" and "Create New File".
  • Create a file named requirements.txt and paste the following:

transformers
torch
gradio

  • Keep "Commit directly to the main branch" selected and select "Commit new file to main".
  • Create another file: in the "Name your file" textbox, enter "app.py".
  • In the textbox for your code, paste the code that you ran above, or copy the block below:
import gradio as gr
from transformers import pipeline

pipe = pipeline("image-to-text",
model="Salesforce/blip-image-captioning-base")


def launch(input):
    out = pipe(input)
    return out[0]['generated_text']

iface = gr.Interface(launch,
inputs=gr.Image(type='pil'),
outputs="text")

iface.launch()
Keep "Commit Directly to the Main Branch" selected.
Select "Commit new file to main".
The app will remain "Building" for a few minutes.
To view the console while the space is being developed, click on the "App" menu to the left of the "Files" menu.
When the build is finished, you'll see your app!
At the bottom, select "Use via API" to view sample code for using your model with an API call.
If you haven't done so before, run the pip install command.
Gradio_client should be installed in the classroom by default.
Copy the sample code, which will seem like this: 
from gradio_client import Client

client = Client("eddyS/blip-image-captioning-2")
result = client.predict(
"https://raw.githubusercontent.com/gradio-app/gradio/main/test/test_files/bus.png", # filepath in 'input' Image component
api_name="/predict"
)
print(result)
You can replace the URL string passed to client.predict() with a path to a local file. In the classroom, you can use the two image files "kittens.jpg" and "huggingface_friends.jpg", or upload your own to the file directory. Your code might then look like this:
from gradio_client import Client

client = Client("eddyS/blip-image-captioning-2")
result = client.predict(
"kittens.jpg",
api_name="/predict"
)
print(result)
  • Inspect the information in the API.
client.view_api()

  • The output may look like this:
Client.predict() Usage Info
---------------------------

Named API endpoints: 1
 - predict(input, api_name="/predict") -> output
Parameters:
- [Image] input: filepath
Returns:
- [Textbox] output: str

Access to your private space as an API

  • You can make your space private, so that it can be accessed only with an access token.
  • To make the space private, select the "Settings" menu at the top, scroll down to "Change space visibility", and click the "Make private" button.
  • To obtain an access token, navigate to your profile (click the profile icon).
  • On your profile page, click the "Settings" button on the left.
  • In your profile settings, in the left-side menu, select "Access Tokens".
  • Select "New token". In the pop-up, describe what the token is for.
  • You can leave the role as "read" (the alternative is "write").
  • Select "Create new token".
  • Copy the access token.
  • Modify the API call to include your access token:
from gradio_client import Client

client = Client("eddyS/blip-image-captioning-2",
hf_token=hf_access_token
)
result = client.predict(
"kittens.jpg",
api_name="/predict"
)
print(result)
# client = Client("abidlabs/whisper-large-v2",
)

Saving your access token securely

  • It is suggested that you do not hardcode the access token.
HF_TOKEN="abc1234" # not recommended
  • You can save your access token to a file called ".env".
HF_ACCESS_TOKEN="abc123"

Then access that environment variable with the dotenv library

# !pip install python-dotenv # install library
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())
hf_access_token = os.getenv("HF_ACCESS_TOKEN")

ZeroGPU Spaces

ZeroGPU Explorers [https://huggingface.co/zero-gpu-explorers]: a program that provides free on-demand GPUs for your Spaces. You can click "Request to join this organization"; it may take several days or weeks for the request to be approved.

Resources

[1] DeepLearning.AI (February 2024), Open Source Models with Hugging Face:

[https://www.deeplearning.ai/short-courses/open-source-models-hugging-face/]
